SASRec: "Self-Attentive Sequential Recommendation"

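This post covers how the Transformer is applied to sequential recommendation. I did not fully pin down the key points, so I will simply walk through the model architecture alongside the code.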

1. Model Architecture

Note the arrows in the Self Attention Layer of the architecture diagram: at each time step, only the items from earlier time steps are considered.

2. Embedding Layer

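Because the self-attention module contains no recurrent or convolutional component, it is unaware of the positions of previous items, so a position embedding (P) is added to the item embedding: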
\widehat{E} = \begin{bmatrix} M_{s_1} + P_1\\ M_{s_2} + P_2\\ \cdots \\ M_{s_n} + P_n \end{bmatrix}
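Here M is the item embedding matrix, M \in R^{|I| \times d}, where |I| is the number of items and d is the embedding dimension; s = (s_1, s_2, \cdots , s_n) is the item sequence; P is the learnable position embedding matrix, P \in R^{n \times d}, where n is the sequence length. Adding the position embedding helps on dense datasets, whereas on sparse datasets the model does better without it.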

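# sequence embedding, item embedding table
self.seq, item_emb_table = embedding(self.input_seq,
                                     vocab_size=itemnum + 1,
                                     # ... remaining keyword arguments are omitted in this excerpt
                                     )

# Positional Encoding
t, pos_emb_table = embedding(
    # position indices tiled over the batch: [batch_size, maxlen]
    tf.tile(tf.expand_dims(tf.range(tf.shape(self.input_seq)[1]), 0), [tf.shape(self.input_seq)[0], 1]),
    # ... remaining keyword arguments are omitted in this excerpt
    )
# add the position embeddings to the item embeddings
self.seq += t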
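
To make the formula for \widehat{E} concrete, here is a minimal pure-Python sketch. It is not the author's code: the helper name `embed_sequence` and the toy tables are made up for illustration, and padding handling is left out.

```python
def embed_sequence(seq, item_emb, pos_emb):
    """Row t of the result is item_emb[seq[t]] + pos_emb[t]."""
    return [
        [m + p for m, p in zip(item_emb[item], pos_emb[t])]
        for t, item in enumerate(seq)
    ]


# Toy tables (hypothetical values): 4 item rows and 3 positions, d = 2.
item_emb = [[0.0, 0.0], [1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
pos_emb = [[0.5, 0.25], [0.25, 0.5], [0.125, 0.125]]

# Item 2 appears at positions 0 and 2 and receives a different vector each time.
e_hat = embed_sequence([2, 1, 2], item_emb, pos_emb)
print(e_hat)
```

Because the same item gets a different vector at each position, the attention layers that follow can tell the order of the items apart.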

3. Self-Attention Block

3.1 Self-Attention layer

\widehat{E} is the input. Three matrices linearly project it into the query, key, and value of self-attention, which then produce the output:
S = SA(\widehat{E}) = Attention(\widehat{E}W^Q, \widehat{E}W^K, \widehat{E}W^V) = Attention(Q, K, V) = softmax(\frac{QK^T}{\sqrt{d}})V
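where W^Q, W^K, W^V \in R^{d \times d}. The projections make the model more flexible; for example, it can learn asymmetric interactions (<query i, key j> and <query j, key i> may differ).
One caveat at this step: to avoid information leakage, every connection between Q_i and K_j with j > i must be masked out.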

# NOTE: the parameter list is reconstructed from the docstring and the body;
# the default values are assumptions.
def multihead_attention(queries,
                        keys,
                        num_units=None,
                        num_heads=8,
                        dropout_rate=0,
                        is_training=True,
                        causality=False,
                        scope="multihead_attention",
                        reuse=None,
                        with_qk=False):
    '''Applies multihead attention.

    Args:
      queries: A 3d tensor with shape of [N, T_q, C_q].
      keys: A 3d tensor with shape of [N, T_k, C_k].
      num_units: A scalar. Attention size.
      dropout_rate: A floating point number.
      is_training: Boolean. Controller of mechanism for dropout.
      causality: Boolean. If true, units that reference the future are masked.
      num_heads: An int. Number of heads.
      scope: Optional scope for `variable_scope`.
      reuse: Boolean, whether to reuse the weights of a previous layer
        by the same name.
      with_qk: Boolean. If true, return the projected Q and K instead of the output.

    Returns:
      A 3d tensor with shape of (N, T_q, C)
    '''
    with tf.variable_scope(scope, reuse=reuse):
        # Set the fall back option for num_units
        if num_units is None:
            num_units = queries.get_shape().as_list()[-1]
        # Linear projections
        # Q = tf.layers.dense(queries, num_units, activation=tf.nn.relu) # (N, T_q, C)
        # K = tf.layers.dense(keys, num_units, activation=tf.nn.relu) # (N, T_k, C)
        # V = tf.layers.dense(keys, num_units, activation=tf.nn.relu) # (N, T_k, C)
        Q = tf.layers.dense(queries, num_units, activation=None) # (N, T_q, C)
        K = tf.layers.dense(keys, num_units, activation=None) # (N, T_k, C)
        V = tf.layers.dense(keys, num_units, activation=None) # (N, T_k, C)
        # Split and concat
        Q_ = tf.concat(tf.split(Q, num_heads, axis=2), axis=0) # (h*N, T_q, C/h) 
        K_ = tf.concat(tf.split(K, num_heads, axis=2), axis=0) # (h*N, T_k, C/h) 
        V_ = tf.concat(tf.split(V, num_heads, axis=2), axis=0) # (h*N, T_k, C/h) 

        # Multiplication
        outputs = tf.matmul(Q_, tf.transpose(K_, [0, 2, 1])) # (h*N, T_q, T_k)
        # Scale
        outputs = outputs / (K_.get_shape().as_list()[-1] ** 0.5)
        # Key Masking
        key_masks = tf.sign(tf.abs(tf.reduce_sum(keys, axis=-1))) # (N, T_k)
        key_masks = tf.tile(key_masks, [num_heads, 1]) # (h*N, T_k)
        key_masks = tf.tile(tf.expand_dims(key_masks, 1), [1, tf.shape(queries)[1], 1]) # (h*N, T_q, T_k)
        paddings = tf.ones_like(outputs)*(-2**32+1)
        outputs = tf.where(tf.equal(key_masks, 0), paddings, outputs) # (h*N, T_q, T_k)
        # Causality = Future blinding
        if causality:
            diag_vals = tf.ones_like(outputs[0, :, :]) # (T_q, T_k)
            tril = tf.linalg.LinearOperatorLowerTriangular(diag_vals).to_dense() # (T_q, T_k)
            masks = tf.tile(tf.expand_dims(tril, 0), [tf.shape(outputs)[0], 1, 1]) # (h*N, T_q, T_k)
            paddings = tf.ones_like(masks)*(-2**32+1)
            outputs = tf.where(tf.equal(masks, 0), paddings, outputs) # (h*N, T_q, T_k)
        # Activation
        outputs = tf.nn.softmax(outputs) # (h*N, T_q, T_k)
        # Query Masking
        query_masks = tf.sign(tf.abs(tf.reduce_sum(queries, axis=-1))) # (N, T_q)
        query_masks = tf.tile(query_masks, [num_heads, 1]) # (h*N, T_q)
        query_masks = tf.tile(tf.expand_dims(query_masks, -1), [1, 1, tf.shape(keys)[1]]) # (h*N, T_q, T_k)
        outputs *= query_masks # element-wise. (h*N, T_q, T_k)
        # Dropouts
        outputs = tf.layers.dropout(outputs, rate=dropout_rate, training=tf.convert_to_tensor(is_training))
        # Weighted sum
        outputs = tf.matmul(outputs, V_) # ( h*N, T_q, C/h)
        # Restore shape
        outputs = tf.concat(tf.split(outputs, num_heads, axis=0), axis=2 ) # (N, T_q, C)
        # Residual connection
        outputs += queries
        # Normalize
        #outputs = normalize(outputs) # (N, T_q, C)
    if with_qk: return Q,K
    else: return outputs
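
The causality mask is easier to see in a stripped-down version. The sketch below is a single-head, pure-Python illustration under simplifying assumptions: no learned projections (Q, K, V are passed in directly), no dropout, no key or query masking. The function name `causal_self_attention` is made up for this example.

```python
import math


def causal_self_attention(Q, K, V):
    """Scaled dot-product attention where position i only attends to j <= i."""
    n, d = len(Q), len(Q[0])
    weights, outputs = [], []
    for i in range(n):
        # Only keys j <= i are scored; leaving out j > i has the same effect
        # as filling those scores with a very large negative number.
        scores = [
            sum(q * k for q, k in zip(Q[i], K[j])) / math.sqrt(d)
            for j in range(i + 1)
        ]
        top = max(scores)  # subtract the max for a numerically stable softmax
        exps = [math.exp(s - top) for s in scores]
        total = sum(exps)
        row = [e / total for e in exps] + [0.0] * (n - i - 1)
        weights.append(row)
        outputs.append(
            [sum(row[j] * V[j][c] for j in range(n)) for c in range(len(V[0]))]
        )
    return outputs, weights


# Toy input, used as Q, K and V at once (identity projections assumed).
X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
out, weights = causal_self_attention(X, X, X)
for row in weights:
    print(row)  # the upper triangle is all zeros
```

Changing a later item leaves the outputs of all earlier positions untouched, which is exactly what the mask is meant to guarantee.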

3.2 Point-Wise Feed-Forward Network

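Self-attention yields an adaptively weighted sum of the item embeddings in the sequence, but it is still essentially a linear model. To give the model nonlinear expressive power, SASRec adds a feed-forward network after self-attention: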
F_i = FFN(S_i) = Relu(S_iW^{(1)} + b^{(1)})W^{(2)} + b^{(2)}
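where W^{(1)}, W^{(2)} \in R^{d \times d} and b^{(1)}, b^{(2)} \in R^d.
Note again that there is no interaction between S_i and S_j (i \neq j): the same network is applied to each position independently.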

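# NOTE: the parameters after num_units are reconstructed from the docstring and the body;
# their default values are assumptions.
def feedforward(inputs,
                num_units=[2048, 512],
                scope="feedforward",
                dropout_rate=0.2,
                is_training=True,
                reuse=None):
    '''Point-wise feed forward net.

    Args:
      inputs: A 3d tensor with shape of [N, T, C].
      num_units: A list of two integers.
      scope: Optional scope for `variable_scope`.
      dropout_rate: A floating point number.
      is_training: Boolean. Controller of mechanism for dropout.
      reuse: Boolean, whether to reuse the weights of a previous layer
        by the same name.

    Returns:
      A 3d tensor with the same shape and dtype as inputs
    '''
    with tf.variable_scope(scope, reuse=reuse):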
        # Inner layer
        params = {"inputs": inputs, "filters": num_units[0], "kernel_size": 1,
                  "activation": tf.nn.relu, "use_bias": True}
        outputs = tf.layers.conv1d(**params)
        outputs = tf.layers.dropout(outputs, rate=dropout_rate, training=tf.convert_to_tensor(is_training))
        # Readout layer
        params = {"inputs": outputs, "filters": num_units[1], "kernel_size": 1,
                  "activation": None, "use_bias": True}
        outputs = tf.layers.conv1d(**params)
        outputs = tf.layers.dropout(outputs, rate=dropout_rate, training=tf.convert_to_tensor(is_training))
        # Residual connection
        outputs += inputs
        # Normalize
        #outputs = normalize(outputs)
    return outputs
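
The `conv1d` calls with `kernel_size` 1 apply the same dense layer at every position. A minimal pure-Python sketch of the formula F_i = ReLU(S_i W^{(1)} + b^{(1)}) W^{(2)} + b^{(2)} shows that positions do not interact. The weights below are made-up toy values, and dropout and the residual connection are left out.

```python
def ffn(s_i, W1, b1, W2, b2):
    """Two-layer network applied to the vector of a single position."""
    hidden = [
        max(0.0, sum(s_i[k] * W1[k][j] for k in range(len(s_i))) + b1[j])
        for j in range(len(b1))
    ]
    return [
        sum(hidden[k] * W2[k][j] for k in range(len(hidden))) + b2[j]
        for j in range(len(b2))
    ]


def pointwise_ffn(S, W1, b1, W2, b2):
    """Apply the same ffn to every position; the weights are shared."""
    return [ffn(s_i, W1, b1, W2, b2) for s_i in S]


# Toy parameters (hypothetical), d = 2.
W1 = [[1.0, -1.0], [0.0, 1.0]]
b1 = [0.0, 0.0]
W2 = [[1.0, 0.0], [0.0, 1.0]]
b2 = [0.5, 0.5]

S = [[1.0, 2.0], [2.0, -1.0]]
F = pointwise_ffn(S, W1, b1, W2, b2)
print(F)
```

Replacing S[1] with any other vector leaves F[0] unchanged, since each row is processed on its own.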

4. Stacking Self-Attention Blocks

S^{(b)} = SA(F^{(b-1)})
F_i^{(b)} = FFN(S_i^{(b)}), \forall i \in \{1, 2, \cdots , n\}
where b is the index of the block, and S^{(1)} = S, F^{(1)} = F.

# Build blocks
for i in range(args.num_blocks):
    with tf.variable_scope("num_blocks_%d" % i):
        # Self-attention
        self.seq = multihead_attention(queries=normalize(self.seq),
                                       # ... remaining keyword arguments are omitted in this excerpt
                                       )
        # Feed forward
        self.seq = feedforward(normalize(self.seq), num_units=[args.hidden_units, args.hidden_units],
                               dropout_rate=args.dropout_rate, is_training=self.is_training)
        self.seq *= mask

To mitigate vanishing gradients and overfitting, the model applies Layer Normalization and Dropout to the self-attention layer and the feed-forward network, together with residual connections:
g(x) = x + Dropout(g(LayerNorm(x)))
where g(x) is the self-attention layer or the feed-forward network.
LayerNorm(x) = \alpha \odot \frac{x-\mu }{\sqrt{\sigma ^2+\epsilon }} + \beta
Layer normalization normalizes across the features of each sample, independently of the other samples in the batch.

# NOTE: the parameters after epsilon are reconstructed from the docstring and the body;
# their default values are assumptions.
def normalize(inputs,
              epsilon=1e-8,
              scope="ln",
              reuse=None):
    '''Applies layer normalization.

    Args:
      inputs: A tensor with 2 or more dimensions, where the first dimension is
        the batch dimension.
      epsilon: A floating number. A very small number for preventing ZeroDivision Error.
      scope: Optional scope for `variable_scope`.
      reuse: Boolean, whether to reuse the weights of a previous layer
        by the same name.

    Returns:
      A tensor with the same shape and data dtype as `inputs`.
    '''
    with tf.variable_scope(scope, reuse=reuse):
        inputs_shape = inputs.get_shape()
        params_shape = inputs_shape[-1:]
        mean, variance = tf.nn.moments(inputs, [-1], keep_dims=True)
        beta= tf.Variable(tf.zeros(params_shape))
        gamma = tf.Variable(tf.ones(params_shape))
        normalized = (inputs - mean) / ( (variance + epsilon) ** (.5) )
        outputs = gamma * normalized + beta
    return outputs
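
The two formulas above can be checked on a single feature vector. This is a rough pure-Python sketch, not the author's implementation: `layer_norm` and `residual_block` are illustrative names, and dropout is left out (as at inference time).

```python
import math


def layer_norm(x, alpha, beta, eps=1e-8):
    """Normalize one feature vector to zero mean and unit variance, then scale and shift."""
    mu = sum(x) / len(x)
    var = sum((v - mu) ** 2 for v in x) / len(x)
    return [a * (v - mu) / math.sqrt(var + eps) + b for v, a, b in zip(x, alpha, beta)]


def residual_block(x, sublayer, alpha, beta):
    """x + sublayer(LayerNorm(x)); dropout is omitted in this sketch."""
    y = sublayer(layer_norm(x, alpha, beta))
    return [xi + yi for xi, yi in zip(x, y)]


x = [1.0, 2.0, 3.0, 4.0]
alpha = [1.0] * 4  # scale, initialised to ones as in the code above
beta = [0.0] * 4   # shift, initialised to zeros as in the code above

normed = layer_norm(x, alpha, beta)
print(normed)

# With a sublayer that outputs zeros, the residual path returns x unchanged.
print(residual_block(x, lambda h: [0.0] * len(h), alpha, beta))
```

The statistics are computed over the features of one vector only, so the result does not depend on what else is in the batch.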

5. Prediction Layer

r_{i, t} is the predicted score of item i being the next item, given the first t items (s_1, s_2, \cdots, s_t):
r_{i, t} = F_t^{(b)}N_i^T
where N \in R^{|I| \times d} is an item embedding matrix.

To reduce the model size and avoid overfitting, the authors tried sharing the item embedding matrix, which improved performance:

r_{i, t} = F_t^{(b)}M_i^T
where M \in R^{|I| \times d} is the item embedding matrix from the embedding layer.
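
With the shared embedding matrix, scoring reduces to a dot product between the last block's output at step t and every item embedding. A minimal sketch follows; the helper names `score_items` and `top_k` and the toy numbers are made up for illustration.

```python
def score_items(f_t, item_emb):
    """r_{i,t} = F_t . M_i for every item i; one score per row of item_emb."""
    return [sum(f * m for f, m in zip(f_t, row)) for row in item_emb]


def top_k(scores, k):
    """Indices of the k highest-scoring items, best first."""
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]


# Toy shared item embedding table (hypothetical): 3 items, d = 2.
item_emb = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
f_t = [2.0, 1.0]  # output of the last block at time step t

scores = score_items(f_t, item_emb)
print(scores)
print(top_k(scores, 2))
```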

6. Network Training

The label o_t at each time step is defined as:
o_t = \left\{\begin{matrix} <pad> & if\,s_t\,is\,a\,padding\,item\\ s_{t+1} & 1 \leq t < n\\ S^{u}_{|S^{u}|} & t = n \end{matrix}\right.
The model is trained with binary cross-entropy as the objective:
-\sum_{S^u\in S}\sum_{t\in [1,2,\cdots,n]} \left[ log(\sigma (r_{o_t, t})) + \sum_{j\notin S^u}log(1-\sigma (r_{j, t})) \right]
Terms with o_t = <pad> are ignored.
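
As a rough illustration of the label definition and the loss, here is a pure-Python sketch. Several things in it are assumptions made for the example and are not taken from the text above: the padding id is 0, sequences are padded on the left, and the negative scores for each step are passed in as a list rather than sampled.

```python
import math

PAD = 0  # assumed padding id


def make_labels(inputs, last_item):
    """o_t = PAD if s_t is padding, s_{t+1} for t < n, the user's last item for t = n."""
    n = len(inputs)
    labels = []
    for t in range(n):
        if inputs[t] == PAD:
            labels.append(PAD)
        elif t < n - 1:
            labels.append(inputs[t + 1])
        else:
            labels.append(last_item)
    return labels


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def bce_loss(pos_scores, neg_scores, labels):
    """Binary cross-entropy summed over the non-padding time steps."""
    total = 0.0
    for r_pos, r_negs, o_t in zip(pos_scores, neg_scores, labels):
        if o_t == PAD:
            continue  # padded steps contribute nothing
        total -= math.log(sigmoid(r_pos))
        total -= sum(math.log(1.0 - sigmoid(r)) for r in r_negs)
    return total


inputs = [0, 0, 5, 3, 7]  # left-padded item ids (hypothetical)
labels = make_labels(inputs, last_item=9)
print(labels)

# With every score at 0, each non-padding step contributes 2 * ln 2.
loss = bce_loss([0.0] * 5, [[0.0]] * 5, labels)
print(loss)
```

Raising the scores of the positive items and lowering those of the negatives reduces the loss, which is what training optimizes.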
