Facebook BigGraph Chinese Documentation: Loss Calculation (PyTorch)

Contents

Graph embedding is a method for generating unsupervised node features from a graph; these features can then be used in a variety of machine learning tasks. Modern graphs, especially those in industrial applications, often contain billions of nodes and trillions of edges, which is beyond the capacity of existing embedding systems. Facebook has open-sourced an embedding system, PyTorch-BigGraph (PBG), which makes several modifications to traditional multi-relation embedding systems so that it can scale to graphs with billions of nodes and trillions of edges.

This series is a translation of the official PyTorch-BigGraph manual and is intended to help readers get started quickly with GNNs and their use. The full series has fifteen parts; if you find any errors, please feel free to contact me.

(1) Facebook Open-Sources a Graph Neural Network Framework: PyTorch-BigGraph

(2) Facebook BigGraph Chinese Documentation: Data Model (PyTorch)

(3) Facebook BigGraph Chinese Documentation: From Entity Embeddings to Edge Scores (PyTorch)

(4) Facebook BigGraph Chinese Documentation: I/O Format (PyTorch)

(5) Facebook BigGraph Chinese Documentation: Batch Preparation

(6) Facebook BigGraph Chinese Documentation: Distributed Mode (PyTorch)

(7) Facebook BigGraph Chinese Documentation: Loss Calculation (PyTorch)


Loss calculation

The training process aims at finding the embeddings for the entities so that the scores of the positive edges are higher than the scores of the negative edges. When unpacking what this means, three different aspects come into play:

One must first determine which edges are to be considered as positive and negative samples.

Then, once the scores of all the samples have been determined, one must decide how to aggregate them in a single loss value.

Finally, one must decide how to go about optimizing that loss.

This chapter will dive into each of these issues.


Negative sampling

The edges provided in the input data are known to be positives but, as PBG operates under the open-world assumption, the edges that are not in the input are not necessarily negatives. However, as PBG is designed to perform on large sparse graphs, it relies on the approximation that any random edge is a negative with very high probability.

The goal of negative sampling is to produce a set of negative edges for each positive edge of a batch. Usual downstream applications (ranking, classification, ...) are interested in comparing the score of an edge (x, r, y1) with the score of an edge (x, r, y2). Therefore, PBG produces negative samples for a given positive edge by corrupting the entity on one of its sides, keeping the other side and the relation type intact. This makes the sampling more suited to the task.

For performance reasons, the set of entities used to corrupt the positive edges in order to produce the negative samples may be shared across several positive edges. The way this usually happens is that positive edges are split into "chunks", a single set of entities is sampled for each chunk, and all edges in that chunk are corrupted using that set of entities.
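
To make the chunking idea concrete, here is a minimal, hypothetical Python/PyTorch sketch of how a batch of positive edges could be split into chunks, with one shared set of corrupted entities sampled per chunk; the names (rhs, chunk_size, num_negs) are illustrative and this is not PBG's internal API.

```python
import torch

def corrupt_rhs_in_chunks(rhs, num_entities, chunk_size, num_negs):
    """Illustrative sketch: split a batch of positive edges into chunks and
    reuse one uniformly sampled entity set per chunk to corrupt the
    right-hand side of every edge in that chunk."""
    neg_rhs = []
    for start in range(0, rhs.size(0), chunk_size):
        chunk_len = min(chunk_size, rhs.size(0) - start)
        # One shared set of candidate entities for the whole chunk.
        shared = torch.randint(num_entities, (num_negs,))
        neg_rhs.append(shared.expand(chunk_len, num_negs))
    return torch.cat(neg_rhs, dim=0)  # shape: (num_positives, num_negs)

# Example: 1000 positive edges over 50000 entities, chunks of 100 edges,
# 200 shared negatives per chunk.
negs = corrupt_rhs_in_chunks(torch.randint(50000, (1000,)), 50000, 100, 200)
```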

PBG supports several ways of sampling negatives:

All negatives

The most straightforward way is to use, for each positive edge, all its possible negatives. What this means is that for a positive (x, r, y) (where x and y are the left- and right-hand side entities respectively and r is the relation type), its negatives will be (x′, r, y) for all x′ of the same entity type as x, and (x, r, y′) for all y′ of the same entity type as y. (Due to technical reasons this is in fact restricted to only the x′ in the same partition as x, and similarly for y′, as negative sampling always operates within the current bucket.)

As one can imagine, this method generates a lot of negatives and thus doesn't scale to graphs of any significant size. It should not be used in practice, and is provided in PBG mainly for "academic" reasons. It is mainly useful to get more accurate results during evaluation on small graphs.

This method is activated on a per-relation basis, by turning on the all_negs config flag. When it's enabled, this mode takes precedence and overrides any other mode.
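
As a rough illustration, assuming the usual Python-dict PBG config format, the flag might be enabled for a single relation like this (the entity and relation names are made up):

```python
config = dict(
    # ... entity, path, and training settings omitted ...
    relations=[
        dict(
            name="follows",   # hypothetical relation name
            lhs="user",
            rhs="user",
            operator="none",
            all_negs=True,    # use all possible negatives for this relation
        ),
    ],
)
```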

Same-batch negatives

This negative sampling method produces negatives for a given positive edge of a batch by sampling from the other edges of the same batch. This is done by first splitting the batch into so-called chunks (beware that the name "chunks" is overloaded: these chunks are different from the edge chunks explained in Batch preparation). Then the set of negatives for a positive edge (x, r, y) contains the edges (x′, r, y) for all entities x′ that are on the left-hand side of another edge in the chunk, and the edges (x, r, y′) with y′ satisfying the same condition for the right-hand side.

For a single positive edge, this means that the entities used to construct its negatives are sampled from the current partition proportionally to their degree, i.e., according to the data distribution. This helps counteract the effects of a very skewed degree distribution, which might otherwise cause the embeddings to just capture that distribution.

The size of the chunks is controlled by the global num_batch_negs parameter. To disable this sampling mode, set that parameter to zero.

Uniformly-sampled negatives

This last method is perhaps the most natural approximation of the "all negatives" method that scales to arbitrarily large graphs. Instead of using all the entities on either side to produce negative edges (thus having the number of negatives scale linearly with the size of the graph), a fixed given number of these entities is sampled uniformly with replacement. Thus the set of negatives remains of constant size no matter how large the graph is. As with the "all negatives" method, the sampling here is restricted to the entities that have the same type and that belong to the same partition as the entity of the positive edge.

This method interacts with the same-batch method, as all the edges in a chunk receive the same set of uniformly sampled negatives. This caveat means that the uniform negatives of two different positives are independent and uncorrelated only if they belong to different chunks.

This method is controlled by the num_uniform_negs parameter, which controls how many negatives are sampled for each chunk. If num_batch_negs is zero, the batches will be split into chunks of size num_uniform_negs.
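
A minimal sketch of how these two global parameters might be set in a PBG config dict (the values are illustrative, not recommendations):

```python
config = dict(
    # ... entities, relations, paths, etc. omitted ...
    num_batch_negs=50,     # chunk size: same-batch negatives come from the
                           # other edges of each positive's chunk
    num_uniform_negs=50,   # uniformly sampled negatives shared by each chunk
    # With num_batch_negs=0, same-batch sampling is disabled and batches are
    # split into chunks of size num_uniform_negs.
)
```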


Loss functions

Once positive and negative samples have been determined and their scores have been computed by the model, the scores' suitability for a certain application must be assessed, which is done by aggregating them into a single real value, the loss. What loss function is most appropriate depends on what operations the embeddings will be used for.

In all cases, the loss function's input will be a series of scores for positive samples and, for each of them, a set of scores for corresponding negative samples. For simplicity, suppose all these sets are of the same size (if they are not, they can be padded with "negative infinity" values, as these are the "ideal" scores for negative edges and should thus induce no loss).
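
For example, ragged sets of negative scores could be padded like this before being fed to the loss; pad_neg_scores is a hypothetical helper, not part of PBG's API:

```python
import torch
from torch.nn.utils.rnn import pad_sequence

def pad_neg_scores(neg_score_lists):
    """Pad ragged per-positive lists of negative scores to a rectangular
    tensor using -inf, the 'ideal' negative score, so padding adds no loss."""
    rows = [torch.as_tensor(s, dtype=torch.float) for s in neg_score_lists]
    return pad_sequence(rows, batch_first=True, padding_value=float("-inf"))
```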

Ranking loss

The ranking loss compares each positive score with each of its corresponding negatives. For each such pair, no loss is introduced if the positive score is greater than the negative one by at least a given margin. Otherwise, the incurred loss is the amount by which that inequality is violated. This is the hinge loss on the difference between the positive and negative score. Formally, for a margin m, a positive score s_i and a negative score t_i,j, the loss is max(0, m − s_i + t_i,j). The total loss is the sum of the losses over all pairs of positive and negative scores, i.e., over all i and j.
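
A minimal PyTorch sketch of this formula, assuming pos_scores has shape (num_pos,) and neg_scores has shape (num_pos, num_negs); this illustrates the math, not PBG's actual implementation:

```python
import torch

def ranking_loss(pos_scores, neg_scores, margin=0.1):
    """Hinge loss max(0, m - s_i + t_ij), summed over all pairs (i, j)."""
    diffs = margin - pos_scores.unsqueeze(1) + neg_scores
    return torch.clamp(diffs, min=0).sum()
```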

This loss function is chosen by setting the loss_fn parameter to ranking, and the target margin is specified by the margin parameter.
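
In config form this might look as follows (the margin value is just an example):

```python
config = dict(
    # ... other settings omitted ...
    loss_fn="ranking",
    margin=0.1,   # target margin between positive and negative scores
)
```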

This loss function is suitable when the setting requires ranking some entities by how likely they are to be related to another given entity.

Logistic loss

The logistic loss instead interprets the scores as the probabilities that the edges exist. It does so by first passing each score (whose domain is the entire real line) through the logistic function x ↦ 1/(1 + e^(−x)), which maps it to a value between 0 and 1. This value is taken as the probability p, and the loss is its binary cross entropy with the "target" probability, i.e., 1 for positive edges and 0 for negative ones. In formulas, the loss for positives is −log p whereas for negatives it is −log(1 − p). The total loss due to the negatives is renormalized so that it is comparable with the loss of the positives.
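
A sketch of this in PyTorch, following the description above; dividing the negative term by the number of negatives per positive is my reading of "renormalized so it compares with the one of the positives" and may differ from PBG's exact implementation:

```python
import torch
import torch.nn.functional as F

def logistic_loss(pos_scores, neg_scores):
    """pos_scores: (num_pos,); neg_scores: (num_pos, num_negs).
    Binary cross entropy against target 1 for positives and 0 for negatives."""
    pos_loss = F.binary_cross_entropy_with_logits(
        pos_scores, torch.ones_like(pos_scores), reduction="sum")
    neg_loss = F.binary_cross_entropy_with_logits(
        neg_scores, torch.zeros_like(neg_scores), reduction="sum")
    # Rescale the negative term so it is comparable with the positive term.
    return pos_loss + neg_loss / neg_scores.size(1)
```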

One can see this as the cross entropy between two distributions on the values "edge exists" and "edge doesn't exist". One is given by the score (passed through the logistic function); the other has all the mass on "exists" for positives or all the mass on "doesn't exist" for negatives.

This loss function is parameterless and is enabled by setting loss_fn to logistic.

Softmax loss

The last loss function is designed for when one wants a distribution over the probabilities of some entities being related to a given entity (as opposed to just wanting a ranking, as with the ranking loss). For a certain positive i, its score s_i and the scores t_i,j of all the corresponding negatives j are first converted to probabilities by performing a softmax: p_i ∝ e^(s_i) and q_i,j ∝ e^(t_i,j), normalized so that they sum up to 1. Then the loss is the cross entropy between this distribution and the "target" one, i.e., the one that puts all the mass on the positive sample. So, in full, the loss for a single i is −log p_i, i.e., −s_i + log Σ_j e^(t_i,j).
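
A PyTorch sketch of this, treating each positive together with its own negatives as one softmax distribution and the positive as the target class; this is a plain cross-entropy illustration and the details may differ from PBG's internals:

```python
import torch
import torch.nn.functional as F

def softmax_loss(pos_scores, neg_scores):
    """pos_scores: (num_pos,); neg_scores: (num_pos, num_negs).
    Each row holds one positive score followed by its negatives' scores;
    the target class is always index 0 (the positive sample)."""
    scores = torch.cat([pos_scores.unsqueeze(1), neg_scores], dim=1)
    targets = torch.zeros(scores.size(0), dtype=torch.long)
    return F.cross_entropy(scores, targets, reduction="sum")
```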

This loss is activated by setting loss_fn to softmax.


Optimizers

The Adagrad optimization method is used to update all model parameters. Adagrad performs stochastic gradient descent with a per-parameter adaptive learning rate that is inversely proportional to the square root of the sum of squared magnitudes of all previous updates to that parameter. In practice, Adagrad updates lead to an order of magnitude faster convergence for typical PBG models.

The initial learning rate for Adagrad is specified by the lr config parameter. A separate learning rate can also be set for non-embeddings using the relation_lr parameter.
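
For instance (the values are illustrative only):

```python
config = dict(
    # ... other settings omitted ...
    lr=0.1,            # initial Adagrad learning rate for the embeddings
    relation_lr=0.01,  # separate learning rate for non-embedding parameters
)
```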

Standard Adagrad requires an amount of memory for optimizer state equal to the size of the model, which is prohibitive for the large models targeted by PBG. To reduce optimizer memory usage, a modified version of Adagrad is used that uses a common learning rate for each entity embedding. The learning rate is proportional to the inverse sum of the squared gradients from each element of the embedding, divided by the dimension. Non-embedding parameters (e.g. relation operator parameters) use standard Adagrad.
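
A rough sketch of such a row-wise Adagrad update for an embedding matrix; the exact form of the denominator (the square root and epsilon) is my assumption of a standard Adagrad-style update and may not match PBG's code exactly:

```python
import torch

def row_adagrad_step(embeddings, grad, state, lr=0.1, eps=1e-10):
    """embeddings, grad: (num_entities, dim); state: (num_entities,), a running
    sum of per-row mean squared gradients, i.e. one scalar per embedding row."""
    state += grad.pow(2).mean(dim=1)     # sum of squared gradients / dimension
    row_lr = lr / (state.sqrt() + eps)   # one shared learning rate per row
    embeddings -= row_lr.unsqueeze(1) * grad
    return embeddings, state
```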

Adagrad parameters are updated asynchronously across worker threads with no explicit synchronization. Asynchronous updates to the Adagrad state (the total squared gradient) appear stable, likely because each element of the state tensor only accumulates positive updates. Optimization is further stabilized by performing a short period of training with a single thread before beginning Hogwild! training, which is tuned by the hogwild_delay parameter.
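
If needed, this single-threaded warm-up period might be adjusted in the config like so (the value is illustrative):

```python
config = dict(
    # ... other settings omitted ...
    hogwild_delay=2.0,  # length of the initial single-threaded training period
)
```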

In distributed training, the Adagrad state for shared parameters (e.g. relation operator parameters) is shared via the parameter server using the same asynchronous gradient update as the parameters themselves. Similar to inter-thread synchronization, these asynchronous updates are stable after an initial burn-in period because the total squared gradient strictly accumulates positive values.
