Facebook: BigGraph Chinese Documentation - Evaluation (PyTorch)

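Graph embedding is a method for producing unsupervised node features from a graph; the resulting features can be used in a wide range of machine learning tasks. Modern graphs, especially in industrial applications, often contain billions of nodes and trillions of edges, which is beyond the capacity of existing embedding systems. Facebook has open-sourced an embedding system, PyTorch-BigGraph (PBG), which makes several modifications to traditional multi-relation embedding systems so that it can scale to graphs with billions of nodes and trillions of edges.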

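This series is a translation of the official PyTorch-BigGraph documentation, intended to help readers get started quickly with GNNs and their use. The series has fifteen parts; if you spot any errors, please get in touch.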

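(1) Facebook Open-Sources a Graph Neural Network: PyTorch-BigGraph

(2) Facebook: BigGraph Chinese Documentation - Data Model (PyTorch)

(3) Facebook: BigGraph Chinese Documentation - From Entity Embeddings to Edge Scores (PyTorch)

(4) Facebook: BigGraph Chinese Documentation - I/O Format (PyTorch)

(5) Facebook: BigGraph Chinese Documentation - Batch Preparation

(6) Facebook: BigGraph Chinese Documentation - Distributed Mode (PyTorch)

(7) Facebook: BigGraph Chinese Documentation - Loss Calculation (PyTorch)

(8) Facebook: BigGraph Chinese Documentation - Evaluation (PyTorch)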

Evaluation

During training, the average loss is reported for each edge bucket at each pass. Evaluation metrics can be computed on held-out data during or after training to measure the quality of the trained embeddings.

Offline evaluation

The torchbiggraph_eval command performs an offline evaluation of trained PBG embeddings on a validation dataset. This dataset should contain held-out data that is not included in the training dataset. The command is invoked in the same way as the training command and takes the same arguments.

It is generally advisable to have two versions of the config file, one for training and one for evaluation, with the same parameters except for the edge paths, so that evaluation runs on a separate (and often smaller) set of edges. (It is also possible to use a single config file and have it produce different output based on environment variables or other context.) Training-specific config parameters (e.g., the learning rate or the loss function) are ignored during evaluation.
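The single-config-file approach could look roughly like the sketch below. PBG config files are Python modules exposing a get_torchbiggraph_config() function; the environment variable name PBG_MODE, all paths, and the entity and relation names are hypothetical placeholders, and the field names should be checked against the installed PBG version.

```python
import os


def get_torchbiggraph_config():
    # PBG_MODE is a hypothetical environment variable used only in this
    # sketch; it is not something PBG itself defines or reads.
    mode = os.environ.get("PBG_MODE", "train")

    # Placeholder paths: only the edge paths differ between the two modes.
    edge_paths_by_mode = {
        "train": ["data/example/train_partitioned"],
        "eval": ["data/example/test_partitioned"],
    }

    return dict(
        entity_path="data/example",
        edge_paths=edge_paths_by_mode[mode],
        checkpoint_path="model/example",
        entities={"all": {"num_partitions": 1}},
        relations=[
            {"name": "all_edges", "lhs": "all", "rhs": "all", "operator": "none"}
        ],
        dimension=100,
        # Training-specific values; per the text they are ignored at eval time.
        num_epochs=10,
        lr=0.1,
    )
```

With such a file, evaluation might be launched as `PBG_MODE=eval torchbiggraph_eval config.py`, where config.py stands for wherever the file is saved.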

The metrics are first reported for each bucket, and a global average is computed at the end. (If multiple edge paths are in use, metrics are computed separately for each of them but are still ultimately averaged.)
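As a rough illustration of combining per-bucket results into one global figure, the sketch below weights each bucket by its number of evaluated edges. The weighting scheme is an assumption of this sketch, not something the text specifies, and the numbers are made up for illustration.

```python
def global_average(bucket_stats):
    """Combine per-bucket metrics into one edge-count-weighted average.

    bucket_stats: list of (num_edges, metrics_dict) pairs, one per bucket.
    Weighting by edge count is an assumption made for this sketch.
    """
    total_edges = sum(num_edges for num_edges, _ in bucket_stats)
    metric_names = bucket_stats[0][1].keys()
    return {
        name: sum(n * metrics[name] for n, metrics in bucket_stats) / total_edges
        for name in metric_names
    }


# Illustrative (made-up) per-bucket values.
example = [(100, {"mrr": 0.5}), (300, {"mrr": 0.25})]
print(global_average(example))
```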

Many metrics are statistics based on the "ranks" of the edges of the validation set. The rank of a positive edge is determined by where its score falls among the scores of a certain number of negative edges. A rank of 1 is the "best" outcome, as it means that the positive edge had a higher score than all the negatives. Higher values are "worse", as they indicate that the positive did not stand out.
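The rank definition above can be sketched in a few lines. How ties are handled is not stated in the text; this sketch counts a tie against the positive, which is an assumption.

```python
def rank_of_positive(pos_score, neg_scores):
    """Rank of a positive edge among its negatives (1 is best).

    A negative whose score equals the positive's is counted as beating it;
    that tie-breaking rule is an assumption of this sketch.
    """
    return 1 + sum(1 for score in neg_scores if score >= pos_score)


print(rank_of_positive(0.9, [0.1, 0.5, 0.3]))  # positive beats every negative
print(rank_of_positive(0.4, [0.1, 0.5, 0.3]))  # one negative scores higher
```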

Some of the negative samples used in the rank computation may in fact be other positive samples, which are expected to have a high score and may therefore distort the rank. This effect is especially visible on smaller graphs, in particular when all other entities are used to construct the negatives. To fix it, and to match what is typically done in the literature, a so-called "filtered" rank is used in the FB15k demo script (and there only), where positive samples are filtered out when computing the rank of an edge. This technique is hard to scale to large graphs, and thus it is not enabled globally. However, filtering is less important on large graphs, as it is less likely that a training edge appears among the sampled negatives.
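A minimal sketch of the filtering idea follows: candidate negatives that are themselves known true edges are skipped before counting. The tuple representation of edges, the function name, and the tie-breaking rule are all assumptions made for illustration; this is not the FB15k demo script's actual code.

```python
def filtered_rank(pos_score, candidates, known_positives):
    """Rank a positive edge while ignoring negatives that are true edges.

    candidates: list of (edge, score) pairs for the corrupted edges, where an
        edge is a hypothetical (lhs, relation, rhs) tuple.
    known_positives: set of edges known to be true (e.g. train/valid/test).
    """
    rank = 1
    for edge, score in candidates:
        if edge in known_positives:
            continue  # a true edge scoring high should not hurt the rank
        if score >= pos_score:  # ties count against the positive (assumption)
            rank += 1
    return rank


candidates = [
    (("a", "likes", "c"), 0.9),  # actually a true edge
    (("a", "likes", "d"), 0.2),
    (("a", "likes", "e"), 0.7),
]
known = {("a", "likes", "b"), ("a", "likes", "c")}

print(filtered_rank(0.6, candidates, set()))  # unfiltered ("raw") rank
print(filtered_rank(0.6, candidates, known))  # filtered rank
```

In this toy case the true edge ("a", "likes", "c") pushes the raw rank down by one position, and filtering removes that penalty.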

The metrics are:

Mean Rank: the average of the ranks of all positives (lower is better, best is 1).

Mean Reciprocal Rank (MRR): the average of the reciprocal of the ranks of all positives (higher is better, best is 1).

Hits@1: the fraction of positives that rank better than all their negatives, i.e., have a rank of 1 (higher is better, best is 1).

Hits@10: the fraction of positives that rank in the top 10 among their negatives (higher is better, best is 1).

Hits@50: the fraction of positives that rank in the top 50 among their negatives (higher is better, best is 1).

Area Under the Curve (AUC): an estimate of the probability that a randomly chosen positive scores higher than a randomly chosen negative (any negative, not only the negatives constructed by corrupting that positive).
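The metrics listed above can be sketched directly from their definitions. The function names, the Monte Carlo AUC estimate, and the sample values are assumptions of this sketch rather than PBG's implementation.

```python
import random


def rank_metrics(ranks, ks=(1, 10, 50)):
    """Mean Rank, MRR and Hits@K computed from a list of positive-edge ranks."""
    n = len(ranks)
    metrics = {
        "mean_rank": sum(ranks) / n,
        "mrr": sum(1.0 / r for r in ranks) / n,
    }
    for k in ks:
        metrics[f"hits@{k}"] = sum(1 for r in ranks if r <= k) / n
    return metrics


def estimate_auc(pos_scores, neg_scores, num_samples=10000, seed=0):
    """Estimate P(random positive scores higher than random negative).

    Pairs are drawn independently, so a positive may be compared with any
    negative, not only those built by corrupting that positive.
    """
    rng = random.Random(seed)
    wins = sum(
        1
        for _ in range(num_samples)
        if rng.choice(pos_scores) > rng.choice(neg_scores)
    )
    return wins / num_samples


# Illustrative (made-up) ranks for four positive edges.
print(rank_metrics([1, 2, 4, 20]))
print(estimate_auc([0.9, 0.8], [0.1, 0.2]))
```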

Evaluation during training

Offline evaluation is a slow process that is intended to be run after training is complete, to evaluate the final model on a held-out set of edges constructed by the user. However, it is useful to be able to monitor overfitting as training progresses. PBG offers this functionality by calculating the same metrics as the offline evaluation, before and after each pass, on a small set of training edges. These stats are printed to the logs.

The metrics are computed on a set of edges that is held out automatically from the training set. To be more explicit: using this feature means that training happens on fewer edges, as some are excluded and reserved for this evaluation. The holdout fraction is controlled by the eval_fraction config parameter (setting it to zero therefore disables this feature). The evaluations before and after each training iteration happen on the same set of edges and are thus comparable. Moreover, the evaluations for the same edge chunk, edge path and bucket at different epochs also use the same set of edges.
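To make the holdout arithmetic concrete, the sketch below reserves a reproducible fraction of edge indices for evaluation. It only illustrates the idea of a fixed held-out subset; the seeded shuffle and the function name are assumptions, not PBG's actual selection mechanism.

```python
import random


def split_holdout(num_edges, eval_fraction, seed=0):
    """Split edge indices into (train, eval) with a reproducible holdout.

    A fixed seed yields the same eval subset on every call, which is what
    makes "before" and "after" numbers comparable across passes and epochs.
    """
    rng = random.Random(seed)
    indices = list(range(num_edges))
    rng.shuffle(indices)
    num_eval = int(num_edges * eval_fraction)
    return indices[num_eval:], indices[:num_eval]


train_idx, eval_idx = split_holdout(1000, 0.05)
print(len(train_idx), len(eval_idx))  # training sees fewer edges than provided
```

Setting the fraction to zero in this sketch leaves the eval subset empty and keeps every edge for training, mirroring how eval_fraction = 0 disables the feature.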

Evaluation metrics are computed both before and after training each edge bucket because this provides insight into whether the partitioned training is working. If the partitioned training is converging, then the gap between the "before" and "after" statistics should go to zero over time. On the other hand, if the partitioned training is causing the model to overfit on each edge bucket (thus decreasing performance on other edge buckets), then there will be a persistent gap between the "before" and "after" statistics.
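One simple way to read these statistics is to track the "after" minus "before" gap for one bucket across epochs and check whether it shrinks. The helper names and the numbers below are made up purely for illustration; in practice the values would be read from the training logs.

```python
def before_after_gaps(stats_per_epoch):
    """Gap between "after" and "before" metric values, one entry per epoch.

    stats_per_epoch: list of (before_value, after_value) pairs for the same
    bucket at successive epochs.
    """
    return [after - before for before, after in stats_per_epoch]


def gap_is_shrinking(gaps):
    """True if the absolute gap strictly decreases from epoch to epoch."""
    return all(abs(later) < abs(earlier) for earlier, later in zip(gaps, gaps[1:]))


# Made-up MRR values for one bucket over three epochs.
converging = [(0.10, 0.30), (0.28, 0.36), (0.35, 0.37)]
# Made-up values where the gap persists, hinting at per-bucket overfitting.
persistent = [(0.10, 0.30), (0.12, 0.32), (0.11, 0.33)]

print(gap_is_shrinking(before_after_gaps(converging)))
print(gap_is_shrinking(before_after_gaps(persistent)))
```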

It is possible to use different batch sizes for same-batch and uniform negative sampling by tuning the eval_num_batch_negs and eval_num_uniform_negs config parameters.
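The evaluation-related parameters mentioned in this section would sit alongside the other config entries. The fragment below is a sketch: the three parameter names come from the text, while the values, the base config, and the helper function are illustrative assumptions.

```python
# Illustrative values only; suitable settings depend on the graph and budget.
eval_overrides = dict(
    eval_fraction=0.05,         # share of training edges held out (0 disables)
    eval_num_batch_negs=500,    # same-batch negatives used during evaluation
    eval_num_uniform_negs=500,  # uniformly sampled negatives during evaluation
)


def with_eval_settings(base_config, overrides):
    """Return a copy of a config dict with the evaluation settings applied."""
    merged = dict(base_config)
    merged.update(overrides)
    return merged


# Hypothetical minimal base config used only to demonstrate the merge.
base = dict(dimension=100, num_epochs=10, eval_fraction=0.0)
print(with_eval_settings(base, eval_overrides))
```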