This post mainly covers the application of the Transformer to sequential recommendation (SASRec). I did not fully digest every detail of the paper, so I will simply walk through the model architecture together with the code.
1. Model Architecture
Note the arrows in the Self-Attention Layer of the architecture figure: at each time step, the model only attends to the items at earlier time steps.
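For intuition, here is a minimal sketch (my own illustration, not code from the paper or its repo) of this restriction as a lower-triangular causal mask, where row t marks the positions that step t may attend to:

import numpy as np

# Causal mask for a length-4 sequence: entry (t, j) is 1 iff step t may attend to step j.
T = 4
causal_mask = np.tril(np.ones((T, T)))
print(causal_mask)
# [[1. 0. 0. 0.]
#  [1. 1. 0. 0.]
#  [1. 1. 1. 0.]
#  [1. 1. 1. 1.]]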
2. Embedding Layer
Since the Self-Attention module contains no RNN or CNN components, it cannot perceive the positions of the previous items, so a learnable position embedding $P$ is added to the input embedding:

$$\hat{E} = \begin{bmatrix} M_{s_1} + P_1 \\ M_{s_2} + P_2 \\ \vdots \\ M_{s_n} + P_n \end{bmatrix}$$

Here $M \in \mathbb{R}^{|I| \times d}$ is the item embedding matrix, $|I|$ is the number of items, and $d$ is the embedding dimension; $s = (s_1, s_2, \dots, s_n)$ is the item sequence; $P \in \mathbb{R}^{n \times d}$ is the position embedding matrix, where $n$ is the maximum sequence length. Adding the position embedding helps on dense datasets, whereas on sparse datasets the model performs better without it.
# sequence embedding, item embedding table
self.seq, item_emb_table = embedding(self.input_seq,
                                     vocab_size=itemnum + 1,
                                     num_units=args.hidden_units,
                                     zero_pad=True,
                                     scale=True,
                                     l2_reg=args.l2_emb,
                                     scope="input_embeddings",
                                     with_t=True,
                                     reuse=reuse
                                     )

# Positional Encoding
t, pos_emb_table = embedding(
    # [batch_size, maxlen]
    tf.tile(tf.expand_dims(tf.range(tf.shape(self.input_seq)[1]), 0), [tf.shape(self.input_seq)[0], 1]),
    vocab_size=args.maxlen,
    num_units=args.hidden_units,
    zero_pad=False,
    scale=False,
    l2_reg=args.l2_emb,
    scope="dec_pos",
    reuse=reuse,
    with_t=True
)
self.seq += t
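The embedding() helper above comes from the repo's modules.py and is not shown in this post. Below is a simplified sketch of my reading of what it does (details such as the L2 regularizer and the with_t flag are omitted, so treat it as an approximation rather than the exact implementation):

def embedding_sketch(inputs, vocab_size, num_units, zero_pad=True, scale=True):
    # An embedding lookup table; with zero_pad the row for id 0 is forced to zeros,
    # so the padding item maps to an all-zero vector. With scale the output is
    # multiplied by sqrt(d), as in the Transformer.
    lookup_table = tf.get_variable('lookup_table', dtype=tf.float32,
                                   shape=[vocab_size, num_units])
    if zero_pad:
        lookup_table = tf.concat((tf.zeros(shape=[1, num_units]),
                                  lookup_table[1:, :]), 0)
    outputs = tf.nn.embedding_lookup(lookup_table, inputs)
    if scale:
        outputs = outputs * (num_units ** 0.5)
    return outputs, lookup_table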
3. Self-Attention Block
3.1 Self-Attention Layer
Taking the embedding $\hat{E}$ as input, the model projects it with three matrices to obtain the query, key, and value of Self-Attention, and produces the output:

$$S = \mathrm{SA}(\hat{E}) = \mathrm{Attention}(\hat{E}W^Q,\ \hat{E}W^K,\ \hat{E}W^V)$$

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^T}{\sqrt{d}}\right)V$$

where $W^Q, W^K, W^V \in \mathbb{R}^{d \times d}$ are the projection matrices. The projections make the model more flexible; for example, it can learn asymmetric interactions, i.e. $\langle \text{query } i, \text{key } j \rangle$ and $\langle \text{query } j, \text{key } i \rangle$ can receive different attention weights.

One important point here: to avoid information leakage from the future, all links between $Q_i$ and $K_j$ with $j > i$ must be masked.
def multihead_attention(queries,
                        keys,
                        num_units=None,
                        num_heads=8,
                        dropout_rate=0,
                        is_training=True,
                        causality=False,
                        scope="multihead_attention",
                        reuse=None,
                        with_qk=False):
    '''Applies multihead attention.
    Args:
      queries: A 3d tensor with shape of [N, T_q, C_q].
      keys: A 3d tensor with shape of [N, T_k, C_k].
      num_units: A scalar. Attention size.
      dropout_rate: A floating point number.
      is_training: Boolean. Controller of mechanism for dropout.
      causality: Boolean. If true, units that reference the future are masked.
      num_heads: An int. Number of heads.
      scope: Optional scope for `variable_scope`.
      reuse: Boolean, whether to reuse the weights of a previous layer
        by the same name.
    Returns:
      A 3d tensor with shape of (N, T_q, C)
    '''
    with tf.variable_scope(scope, reuse=reuse):
        # Set the fall back option for num_units
        if num_units is None:
            num_units = queries.get_shape().as_list()[-1]

        # Linear projections
        # Q = tf.layers.dense(queries, num_units, activation=tf.nn.relu) # (N, T_q, C)
        # K = tf.layers.dense(keys, num_units, activation=tf.nn.relu) # (N, T_k, C)
        # V = tf.layers.dense(keys, num_units, activation=tf.nn.relu) # (N, T_k, C)
        Q = tf.layers.dense(queries, num_units, activation=None)  # (N, T_q, C)
        K = tf.layers.dense(keys, num_units, activation=None)  # (N, T_k, C)
        V = tf.layers.dense(keys, num_units, activation=None)  # (N, T_k, C)

        # Split and concat: split the channel dim into heads and fold them into the batch dim
        Q_ = tf.concat(tf.split(Q, num_heads, axis=2), axis=0)  # (h*N, T_q, C/h)
        K_ = tf.concat(tf.split(K, num_heads, axis=2), axis=0)  # (h*N, T_k, C/h)
        V_ = tf.concat(tf.split(V, num_heads, axis=2), axis=0)  # (h*N, T_k, C/h)

        # Multiplication
        outputs = tf.matmul(Q_, tf.transpose(K_, [0, 2, 1]))  # (h*N, T_q, T_k)

        # Scale
        outputs = outputs / (K_.get_shape().as_list()[-1] ** 0.5)

        # Key Masking: give padded keys a large negative score so softmax ignores them
        key_masks = tf.sign(tf.abs(tf.reduce_sum(keys, axis=-1)))  # (N, T_k)
        key_masks = tf.tile(key_masks, [num_heads, 1])  # (h*N, T_k)
        key_masks = tf.tile(tf.expand_dims(key_masks, 1), [1, tf.shape(queries)[1], 1])  # (h*N, T_q, T_k)

        paddings = tf.ones_like(outputs) * (-2 ** 32 + 1)
        outputs = tf.where(tf.equal(key_masks, 0), paddings, outputs)  # (h*N, T_q, T_k)

        # Causality = Future blinding: mask out connections to future positions
        if causality:
            diag_vals = tf.ones_like(outputs[0, :, :])  # (T_q, T_k)
            tril = tf.linalg.LinearOperatorLowerTriangular(diag_vals).to_dense()  # (T_q, T_k)
            masks = tf.tile(tf.expand_dims(tril, 0), [tf.shape(outputs)[0], 1, 1])  # (h*N, T_q, T_k)

            paddings = tf.ones_like(masks) * (-2 ** 32 + 1)
            outputs = tf.where(tf.equal(masks, 0), paddings, outputs)  # (h*N, T_q, T_k)

        # Activation
        outputs = tf.nn.softmax(outputs)  # (h*N, T_q, T_k)

        # Query Masking: zero out rows that correspond to padded queries
        query_masks = tf.sign(tf.abs(tf.reduce_sum(queries, axis=-1)))  # (N, T_q)
        query_masks = tf.tile(query_masks, [num_heads, 1])  # (h*N, T_q)
        query_masks = tf.tile(tf.expand_dims(query_masks, -1), [1, 1, tf.shape(keys)[1]])  # (h*N, T_q, T_k)
        outputs *= query_masks  # broadcasting. (h*N, T_q, T_k)

        # Dropouts
        outputs = tf.layers.dropout(outputs, rate=dropout_rate, training=tf.convert_to_tensor(is_training))

        # Weighted sum
        outputs = tf.matmul(outputs, V_)  # (h*N, T_q, C/h)

        # Restore shape
        outputs = tf.concat(tf.split(outputs, num_heads, axis=0), axis=2)  # (N, T_q, C)

        # Residual connection
        outputs += queries

        # Normalize
        # outputs = normalize(outputs) # (N, T_q, C)

    if with_qk:
        return Q, K
    else:
        return outputs
3.2 Point-Wise Feed-Forward Network
Self-attention gives us an adaptively weighted sum of the item embeddings in the sequence, but it is still essentially a linear model. To endow the model with non-linearity, SASRec applies a point-wise feed-forward network (with parameters shared across positions) after self-attention:

$$F_i = \mathrm{FFN}(S_i) = \mathrm{ReLU}(S_i W^{(1)} + b^{(1)}) W^{(2)} + b^{(2)}$$

where $W^{(1)}, W^{(2)}$ are $d \times d$ matrices and $b^{(1)}, b^{(2)}$ are $d$-dimensional vectors. Note again that there is no interaction between $S_i$ and $S_j$ ($i \neq j$): the network is applied to each position independently.
def feedforward(inputs,
                num_units=[2048, 512],
                scope="feed_forward",
                dropout_rate=0.2,
                is_training=True,
                reuse=None):
    '''Point-wise feed forward net.
    Args:
      inputs: A 3d tensor with shape of [N, T, C].
      num_units: A list of two integers.
      scope: Optional scope for `variable_scope`.
      reuse: Boolean, whether to reuse the weights of a previous layer
        by the same name.
    Returns:
      A 3d tensor with the same shape and dtype as inputs
    '''
    with tf.variable_scope(scope, reuse=reuse):
        # Inner layer
        params = {"inputs": inputs, "filters": num_units[0], "kernel_size": 1,
                  "activation": tf.nn.relu, "use_bias": True}
        outputs = tf.layers.conv1d(**params)
        outputs = tf.layers.dropout(outputs, rate=dropout_rate, training=tf.convert_to_tensor(is_training))

        # Readout layer
        params = {"inputs": outputs, "filters": num_units[1], "kernel_size": 1,
                  "activation": None, "use_bias": True}
        outputs = tf.layers.conv1d(**params)
        outputs = tf.layers.dropout(outputs, rate=dropout_rate, training=tf.convert_to_tensor(is_training))

        # Residual connection
        outputs += inputs

        # Normalize
        # outputs = normalize(outputs)

    return outputs
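As a sanity check of the point-wise property, here is my own NumPy sketch (not code from the repo): a kernel-size-1 conv1d is just one shared dense transform applied independently at every time step, emulated below with a single matrix multiply, so perturbing position j never changes the output at other positions.

import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(2, 5, 4))        # (N, T, C)
W = rng.normal(size=(4, 4))           # weights shared across all positions
b = rng.normal(size=(4,))

out = np.maximum(x @ W + b, 0.0)      # ReLU(x_t W + b) for every t

x2 = x.copy()
x2[:, 3, :] += 1.0                    # perturb only position 3
out2 = np.maximum(x2 @ W + b, 0.0)

# Only position 3 of the output changes.
assert np.allclose(out[:, :3], out2[:, :3])
assert np.allclose(out[:, 4:], out2[:, 4:])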
4. Stacking Self-Attention Blocks
After the first block, SASRec stacks further self-attention blocks; the $b$-th block is defined as:

$$S^{(b)} = \mathrm{SA}(F^{(b-1)}), \qquad F_i^{(b)} = \mathrm{FFN}(S_i^{(b)}), \quad \forall i \in \{1, 2, \dots, n\}$$

where $b$ denotes the block index, and the first block is $S^{(1)} = S$, $F^{(1)} = F$.
# Build blocks
for i in range(args.num_blocks):
    with tf.variable_scope("num_blocks_%d" % i):
        # Self-attention
        self.seq = multihead_attention(queries=normalize(self.seq),
                                       keys=self.seq,
                                       num_units=args.hidden_units,
                                       num_heads=args.num_heads,
                                       dropout_rate=args.dropout_rate,
                                       is_training=self.is_training,
                                       causality=True,
                                       scope="self_attention")

        # Feed forward
        self.seq = feedforward(normalize(self.seq), num_units=[args.hidden_units, args.hidden_units],
                               dropout_rate=args.dropout_rate, is_training=self.is_training)
        self.seq *= mask
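The mask used in the last line is not shown in this snippet; in my reading of the repo it is built from the padding positions of the input sequence (item id 0), so that padded steps stay all-zero after every block. Roughly:

# Assumed to be defined earlier in the model (my reading of the repo, not shown above):
# 1.0 for real items, 0.0 for padding (item id 0).
mask = tf.expand_dims(tf.to_float(tf.not_equal(self.input_seq, 0)), -1)  # (N, maxlen, 1)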
To alleviate vanishing gradients and overfitting, the model applies Layer Normalization and Dropout to both the self-attention layer and the feed-forward network layer, together with residual connections; each sub-layer is applied as:

$$x + \mathrm{Dropout}\big(g(\mathrm{LayerNorm}(x))\big)$$

where $g(x)$ stands for the self-attention layer or the feed-forward network layer.
Layer normalization normalizes across the features of each sample (rather than across the batch):

$$\mathrm{LayerNorm}(x) = \alpha \odot \frac{x - \mu}{\sqrt{\sigma^2 + \epsilon}} + \beta$$

where $\mu$ and $\sigma^2$ are the mean and variance of $x$, $\alpha$ and $\beta$ are learned scale and bias parameters, and $\odot$ denotes element-wise multiplication.
def normalize(inputs,
              epsilon=1e-8,
              scope="ln",
              reuse=None):
    '''Applies layer normalization.
    Args:
      inputs: A tensor with 2 or more dimensions, where the first dimension has
        `batch_size`.
      epsilon: A floating number. A very small number for preventing ZeroDivision Error.
      scope: Optional scope for `variable_scope`.
      reuse: Boolean, whether to reuse the weights of a previous layer
        by the same name.
    Returns:
      A tensor with the same shape and data dtype as `inputs`.
    '''
    with tf.variable_scope(scope, reuse=reuse):
        inputs_shape = inputs.get_shape()
        params_shape = inputs_shape[-1:]

        mean, variance = tf.nn.moments(inputs, [-1], keep_dims=True)
        beta = tf.Variable(tf.zeros(params_shape))
        gamma = tf.Variable(tf.ones(params_shape))
        normalized = (inputs - mean) / ((variance + epsilon) ** (.5))
        outputs = gamma * normalized + beta

    return outputs
5. Prediction Layer
$r_{i,t}$ denotes the predicted score of item $i$ being the next item, given the first $t$ items $(s_1, s_2, \dots, s_t)$:

$$r_{i,t} = F_t^{(b)} N_i^T$$

where $N \in \mathbb{R}^{|I| \times d}$ is an item embedding matrix.
To reduce the model size and avoid overfitting, the authors also tried sharing the item embedding, which improved performance:

$$r_{i,t} = F_t^{(b)} M_i^T$$

where $M$ is the item embedding matrix used in the embedding layer.
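A minimal sketch of how the shared-embedding scores can be computed in the same TF1 style as the code above (names such as self.test_item and seq_emb are my assumptions for illustration, not necessarily the repo's exact code):

# Flatten the block outputs F_t^(b) to (N*maxlen, d)
seq_emb = tf.reshape(self.seq, [tf.shape(self.input_seq)[0] * args.maxlen, args.hidden_units])
# Reuse the input item embedding table M for the candidate items (shared embedding)
test_item_emb = tf.nn.embedding_lookup(item_emb_table, self.test_item)
# r_{i,t} = F_t M_i^T for every candidate item i and every position t
self.test_logits = tf.matmul(seq_emb, tf.transpose(test_item_emb))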
6. Network Training
The label (expected output) $o_t$ at step $t$ of a sequence $s = (s_1, s_2, \dots, s_n)$ is defined as:

$$o_t = \begin{cases} \langle \mathrm{pad} \rangle & \text{if } s_t \text{ is a padding item} \\ s_{t+1} & 1 \le t < n \\ S^u_{|S^u|} & t = n \end{cases}$$

The loss function is the binary cross-entropy loss (in practice one negative item $j$ is sampled per step):

$$-\sum_{S^u \in \mathcal{S}} \sum_{t \in [1, 2, \dots, n]} \left[ \log\big(\sigma(r_{o_t, t})\big) + \sum_{j \notin S^u} \log\big(1 - \sigma(r_{j, t})\big) \right]$$

The terms where $o_t = \langle \mathrm{pad} \rangle$ are ignored.
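A minimal sketch of this loss in the same TF1 style (names such as pos_emb, neg_emb, seq_emb and istarget are my assumptions about the repo's variables, used here only for illustration):

# pos/neg hold the positive labels o_t and the sampled negatives, flattened to (N*maxlen,)
pos_logits = tf.reduce_sum(pos_emb * seq_emb, -1)   # r_{o_t, t}
neg_logits = tf.reduce_sum(neg_emb * seq_emb, -1)   # r_{j, t} for the sampled negative j
# istarget masks out padded positions (o_t = <pad>)
istarget = tf.reshape(tf.to_float(tf.not_equal(pos, 0)),
                      [tf.shape(self.input_seq)[0] * args.maxlen])
self.loss = tf.reduce_sum(
    - tf.log(tf.sigmoid(pos_logits) + 1e-24) * istarget
    - tf.log(1 - tf.sigmoid(neg_logits) + 1e-24) * istarget
) / tf.reduce_sum(istarget)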