This post mainly covers the application of the Transformer to sequential recommendation (SASRec). I did not fully digest every detail of the paper, so I will simply walk through the model architecture together with the code.
1. Model Architecture
Note the arrows in the Self-Attention Layer of the architecture figure: at each time step, the model only attends to the items at earlier time steps.
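For intuition, here is a minimal sketch (my own illustration, not code from the paper or its repo) of this restriction as a lower-triangular causal mask, where row t marks the positions that step t may attend to:

import numpy as np

# Causal mask for a length-4 sequence: entry (t, j) is 1 iff step t may attend to step j.
T = 4
causal_mask = np.tril(np.ones((T, T)))
print(causal_mask)
# [[1. 0. 0. 0.]
#  [1. 1. 0. 0.]
#  [1. 1. 1. 0.]
#  [1. 1. 1. 1.]]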
2. Embedding Layer
Since the Self-Attention module contains no RNN or CNN components, it cannot perceive the positions of the previous items, so a learnable position embedding $P$ is added to the input embedding:

$$\hat{E} = \begin{bmatrix} M_{s_1} + P_1 \\ M_{s_2} + P_2 \\ \vdots \\ M_{s_n} + P_n \end{bmatrix}$$

Here $M \in \mathbb{R}^{|I| \times d}$ is the item embedding matrix, $|I|$ is the number of items, and $d$ is the embedding dimension; $s = (s_1, s_2, \dots, s_n)$ is the item sequence; $P \in \mathbb{R}^{n \times d}$ is the position embedding matrix, where $n$ is the maximum sequence length. Adding the position embedding helps on dense datasets, whereas on sparse datasets the model performs better without it.
# sequence embedding, item embedding table
self.seq, item_emb_table = embedding(self.input_seq,
                                     vocab_size=itemnum + 1,
                                     num_units=args.hidden_units,
                                     zero_pad=True,
                                     scale=True,
                                     l2_reg=args.l2_emb,
                                     scope="input_embeddings",
                                     with_t=True,
                                     reuse=reuse
                                     )

# Positional Encoding
t, pos_emb_table = embedding(
    # [batch_size, maxlen]
    tf.tile(tf.expand_dims(tf.range(tf.shape(self.input_seq)[1]), 0), [tf.shape(self.input_seq)[0], 1]),
    vocab_size=args.maxlen,
    num_units=args.hidden_units,
    zero_pad=False,
    scale=False,
    l2_reg=args.l2_emb,
    scope="dec_pos",
    reuse=reuse,
    with_t=True
)
self.seq += t
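The embedding() helper above comes from the repo's modules.py and is not shown in this post. Below is a simplified sketch of my reading of what it does (details such as the L2 regularizer and the with_t flag are omitted, so treat it as an approximation rather than the exact implementation):

def embedding_sketch(inputs, vocab_size, num_units, zero_pad=True, scale=True):
    # An embedding lookup table; with zero_pad the row for id 0 is forced to zeros,
    # so the padding item maps to an all-zero vector. With scale the output is
    # multiplied by sqrt(d), as in the Transformer.
    lookup_table = tf.get_variable('lookup_table', dtype=tf.float32,
                                   shape=[vocab_size, num_units])
    if zero_pad:
        lookup_table = tf.concat((tf.zeros(shape=[1, num_units]),
                                  lookup_table[1:, :]), 0)
    outputs = tf.nn.embedding_lookup(lookup_table, inputs)
    if scale:
        outputs = outputs * (num_units ** 0.5)
    return outputs, lookup_table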
3. Self-Attention Block
3.1 Self-Attention Layer
Taking the embedding $\hat{E}$ as input, the model projects it with three matrices to obtain the query, key, and value of Self-Attention, and produces the output:

$$S = \mathrm{SA}(\hat{E}) = \mathrm{Attention}(\hat{E}W^Q,\ \hat{E}W^K,\ \hat{E}W^V)$$

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^T}{\sqrt{d}}\right)V$$

where $W^Q, W^K, W^V \in \mathbb{R}^{d \times d}$ are the projection matrices. The projections make the model more flexible; for example, it can learn asymmetric interactions, i.e. $\langle \text{query } i, \text{key } j \rangle$ and $\langle \text{query } j, \text{key } i \rangle$ can receive different attention weights.

One important point here: to avoid information leakage from the future, all links between $Q_i$ and $K_j$ with $j > i$ must be masked.
def multihead_attention(queries,
                        keys,
                        num_units=None,
                        num_heads=8,
                        dropout_rate=0,
                        is_training=True,
                        causality=False,
                        scope="multihead_attention",
                        reuse=None,
                        with_qk=False):
    '''Applies multihead attention.
    Args:
      queries: A 3d tensor with shape of [N, T_q, C_q].
      keys: A 3d tensor with shape of [N, T_k, C_k].
      num_units: A scalar. Attention size.
      dropout_rate: A floating point number.
      is_training: Boolean. Controller of mechanism for dropout.
      causality: Boolean. If true, units that reference the future are masked.
      num_heads: An int. Number of heads.
      scope: Optional scope for `variable_scope`.
      reuse: Boolean, whether to reuse the weights of a previous layer
        by the same name.
    Returns:
      A 3d tensor with shape of (N, T_q, C)
    '''
    with tf.variable_scope(scope, reuse=reuse):
        # Set the fall back option for num_units
        if num_units is None:
            num_units = queries.get_shape().as_list()[-1]

        # Linear projections
        # Q = tf.layers.dense(queries, num_units, activation=tf.nn.relu) # (N, T_q, C)
        # K = tf.layers.dense(keys, num_units, activation=tf.nn.relu) # (N, T_k, C)
        # V = tf.layers.dense(keys, num_units, activation=tf.nn.relu) # (N, T_k, C)
        Q = tf.layers.dense(queries, num_units, activation=None)  # (N, T_q, C)
        K = tf.layers.dense(keys, num_units, activation=None)  # (N, T_k, C)
        V = tf.layers.dense(keys, num_units, activation=None)  # (N, T_k, C)

        # Split and concat: split the channel dim into heads and fold them into the batch dim
        Q_ = tf.concat(tf.split(Q, num_heads, axis=2), axis=0)  # (h*N, T_q, C/h)
        K_ = tf.concat(tf.split(K, num_heads, axis=2), axis=0)  # (h*N, T_k, C/h)
        V_ = tf.concat(tf.split(V, num_heads, axis=2), axis=0)  # (h*N, T_k, C/h)

        # Multiplication
        outputs = tf.matmul(Q_, tf.transpose(K_, [0, 2, 1]))  # (h*N, T_q, T_k)

        # Scale
        outputs = outputs / (K_.get_shape().as_list()[-1] ** 0.5)

        # Key Masking: give padded keys a large negative score so softmax ignores them
        key_masks = tf.sign(tf.abs(tf.reduce_sum(keys, axis=-1)))  # (N, T_k)
        key_masks = tf.tile(key_masks, [num_heads, 1])  # (h*N, T_k)
        key_masks = tf.tile(tf.expand_dims(key_masks, 1), [1, tf.shape(queries)[1], 1])  # (h*N, T_q, T_k)

        paddings = tf.ones_like(outputs) * (-2 ** 32 + 1)
        outputs = tf.where(tf.equal(key_masks, 0), paddings, outputs)  # (h*N, T_q, T_k)

        # Causality = Future blinding: mask out connections to future positions
        if causality:
            diag_vals = tf.ones_like(outputs[0, :, :])  # (T_q, T_k)
            tril = tf.linalg.LinearOperatorLowerTriangular(diag_vals).to_dense()  # (T_q, T_k)
            masks = tf.tile(tf.expand_dims(tril, 0), [tf.shape(outputs)[0], 1, 1])  # (h*N, T_q, T_k)

            paddings = tf.ones_like(masks) * (-2 ** 32 + 1)
            outputs = tf.where(tf.equal(masks, 0), paddings, outputs)  # (h*N, T_q, T_k)

        # Activation
        outputs = tf.nn.softmax(outputs)  # (h*N, T_q, T_k)

        # Query Masking: zero out rows that correspond to padded queries
        query_masks = tf.sign(tf.abs(tf.reduce_sum(queries, axis=-1)))  # (N, T_q)
        query_masks = tf.tile(query_masks, [num_heads, 1])  # (h*N, T_q)
        query_masks = tf.tile(tf.expand_dims(query_masks, -1), [1, 1, tf.shape(keys)[1]])  # (h*N, T_q, T_k)
        outputs *= query_masks  # broadcasting. (h*N, T_q, T_k)

        # Dropouts
        outputs = tf.layers.dropout(outputs, rate=dropout_rate, training=tf.convert_to_tensor(is_training))

        # Weighted sum
        outputs = tf.matmul(outputs, V_)  # (h*N, T_q, C/h)

        # Restore shape
        outputs = tf.concat(tf.split(outputs, num_heads, axis=0), axis=2)  # (N, T_q, C)

        # Residual connection
        outputs += queries

        # Normalize
        # outputs = normalize(outputs) # (N, T_q, C)

    if with_qk:
        return Q, K
    else:
        return outputs
3.2 Point-Wise Feed-Forward Network
Self-attention gives us an adaptively weighted sum of the item embeddings in the sequence, but it is still essentially a linear model. To endow the model with non-linearity, SASRec applies a point-wise feed-forward network (with parameters shared across positions) after self-attention:

$$F_i = \mathrm{FFN}(S_i) = \mathrm{ReLU}(S_i W^{(1)} + b^{(1)}) W^{(2)} + b^{(2)}$$

where $W^{(1)}, W^{(2)}$ are $d \times d$ matrices and $b^{(1)}, b^{(2)}$ are $d$-dimensional vectors. Note again that there is no interaction between $S_i$ and $S_j$ ($i \neq j$): the network is applied to each position independently.
def feedforward(inputs,
                num_units=[2048, 512],
                scope="feed_forward",
                dropout_rate=0.2,
                is_training=True,
                reuse=None):
    '''Point-wise feed forward net.
    Args:
      inputs: A 3d tensor with shape of [N, T, C].
      num_units: A list of two integers.
      scope: Optional scope for `variable_scope`.
      reuse: Boolean, whether to reuse the weights of a previous layer
        by the same name.
    Returns:
      A 3d tensor with the same shape and dtype as inputs
    '''
    with tf.variable_scope(scope, reuse=reuse):
        # Inner layer
        params = {"inputs": inputs, "filters": num_units[0], "kernel_size": 1,
                  "activation": tf.nn.relu, "use_bias": True}
        outputs = tf.layers.conv1d(**params)
        outputs = tf.layers.dropout(outputs, rate=dropout_rate, training=tf.convert_to_tensor(is_training))

        # Readout layer
        params = {"inputs": outputs, "filters": num_units[1], "kernel_size": 1,
                  "activation": None, "use_bias": True}
        outputs = tf.layers.conv1d(**params)
        outputs = tf.layers.dropout(outputs, rate=dropout_rate, training=tf.convert_to_tensor(is_training))

        # Residual connection
        outputs += inputs

        # Normalize
        # outputs = normalize(outputs)

    return outputs
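As a sanity check of the point-wise property, here is my own NumPy sketch (not code from the repo): a kernel-size-1 conv1d is just one shared dense transform applied independently at every time step, emulated below with a single matrix multiply, so perturbing position j never changes the output at other positions.

import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(2, 5, 4))        # (N, T, C)
W = rng.normal(size=(4, 4))           # weights shared across all positions
b = rng.normal(size=(4,))

out = np.maximum(x @ W + b, 0.0)      # ReLU(x_t W + b) for every t

x2 = x.copy()
x2[:, 3, :] += 1.0                    # perturb only position 3
out2 = np.maximum(x2 @ W + b, 0.0)

# Only position 3 of the output changes.
assert np.allclose(out[:, :3], out2[:, :3])
assert np.allclose(out[:, 4:], out2[:, 4:])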
4. Stacking Self-Attention Blocks
After the first block, SASRec stacks further self-attention blocks; the $b$-th block is defined as:

$$S^{(b)} = \mathrm{SA}(F^{(b-1)}), \qquad F_i^{(b)} = \mathrm{FFN}(S_i^{(b)}), \quad \forall i \in \{1, 2, \dots, n\}$$

where $b$ denotes the block index, and the first block is $S^{(1)} = S$, $F^{(1)} = F$.
# Build blocks
for i in range(args.num_blocks):
    with tf.variable_scope("num_blocks_%d" % i):
        # Self-attention
        self.seq = multihead_attention(queries=normalize(self.seq),
                                       keys=self.seq,
                                       num_units=args.hidden_units,
                                       num_heads=args.num_heads,
                                       dropout_rate=args.dropout_rate,
                                       is_training=self.is_training,
                                       causality=True,
                                       scope="self_attention")

        # Feed forward
        self.seq = feedforward(normalize(self.seq), num_units=[args.hidden_units, args.hidden_units],
                               dropout_rate=args.dropout_rate, is_training=self.is_training)
        self.seq *= mask
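The mask used in the last line is not shown in this snippet; in my reading of the repo it is built from the padding positions of the input sequence (item id 0), so that padded steps stay all-zero after every block. Roughly:

# Assumed to be defined earlier in the model (my reading of the repo, not shown above):
# 1.0 for real items, 0.0 for padding (item id 0).
mask = tf.expand_dims(tf.to_float(tf.not_equal(self.input_seq, 0)), -1)  # (N, maxlen, 1)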
To alleviate vanishing gradients and overfitting, the model applies Layer Normalization and Dropout to both the self-attention layer and the feed-forward network layer, together with residual connections; each sub-layer is applied as:

$$x + \mathrm{Dropout}\big(g(\mathrm{LayerNorm}(x))\big)$$

where $g(x)$ stands for the self-attention layer or the feed-forward network layer.
Layer normalization normalizes across the features of each sample (rather than across the batch):

$$\mathrm{LayerNorm}(x) = \alpha \odot \frac{x - \mu}{\sqrt{\sigma^2 + \epsilon}} + \beta$$

where $\mu$ and $\sigma^2$ are the mean and variance of $x$, $\alpha$ and $\beta$ are learned scale and bias parameters, and $\odot$ denotes element-wise multiplication.
def normalize(inputs,
              epsilon=1e-8,
              scope="ln",
              reuse=None):
    '''Applies layer normalization.
    Args:
      inputs: A tensor with 2 or more dimensions, where the first dimension has
        `batch_size`.
      epsilon: A floating number. A very small number for preventing ZeroDivision Error.
      scope: Optional scope for `variable_scope`.
      reuse: Boolean, whether to reuse the weights of a previous layer
        by the same name.
    Returns:
      A tensor with the same shape and data dtype as `inputs`.
    '''
    with tf.variable_scope(scope, reuse=reuse):
        inputs_shape = inputs.get_shape()
        params_shape = inputs_shape[-1:]

        mean, variance = tf.nn.moments(inputs, [-1], keep_dims=True)
        beta = tf.Variable(tf.zeros(params_shape))
        gamma = tf.Variable(tf.ones(params_shape))
        normalized = (inputs - mean) / ((variance + epsilon) ** (.5))
        outputs = gamma * normalized + beta

    return outputs
5. Prediction Layer
$r_{i,t}$ denotes the predicted score of item $i$ being the next item, given the first $t$ items $(s_1, s_2, \dots, s_t)$:

$$r_{i,t} = F_t^{(b)} N_i^T$$

where $N \in \mathbb{R}^{|I| \times d}$ is an item embedding matrix.
To reduce the model size and avoid overfitting, the authors also tried sharing the item embedding, which improved performance:

$$r_{i,t} = F_t^{(b)} M_i^T$$

where $M$ is the item embedding matrix used in the embedding layer.
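A minimal sketch of how the shared-embedding scores can be computed in the same TF1 style as the code above (names such as self.test_item and seq_emb are my assumptions for illustration, not necessarily the repo's exact code):

# Flatten the block outputs F_t^(b) to (N*maxlen, d)
seq_emb = tf.reshape(self.seq, [tf.shape(self.input_seq)[0] * args.maxlen, args.hidden_units])
# Reuse the input item embedding table M for the candidate items (shared embedding)
test_item_emb = tf.nn.embedding_lookup(item_emb_table, self.test_item)
# r_{i,t} = F_t M_i^T for every candidate item i and every position t
self.test_logits = tf.matmul(seq_emb, tf.transpose(test_item_emb))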
6. Network Training
The label (expected output) $o_t$ at step $t$ of a sequence $s = (s_1, s_2, \dots, s_n)$ is defined as:

$$o_t = \begin{cases} \langle \mathrm{pad} \rangle & \text{if } s_t \text{ is a padding item} \\ s_{t+1} & 1 \le t < n \\ S^u_{|S^u|} & t = n \end{cases}$$

The loss function is the binary cross-entropy loss (in practice one negative item $j$ is sampled per step):

$$-\sum_{S^u \in \mathcal{S}} \sum_{t \in [1, 2, \dots, n]} \left[ \log\big(\sigma(r_{o_t, t})\big) + \sum_{j \notin S^u} \log\big(1 - \sigma(r_{j, t})\big) \right]$$

The terms where $o_t = \langle \mathrm{pad} \rangle$ are ignored.
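A minimal sketch of this loss in the same TF1 style (names such as pos_emb, neg_emb, seq_emb and istarget are my assumptions about the repo's variables, used here only for illustration):

# pos/neg hold the positive labels o_t and the sampled negatives, flattened to (N*maxlen,)
pos_logits = tf.reduce_sum(pos_emb * seq_emb, -1)   # r_{o_t, t}
neg_logits = tf.reduce_sum(neg_emb * seq_emb, -1)   # r_{j, t} for the sampled negative j
# istarget masks out padded positions (o_t = <pad>)
istarget = tf.reshape(tf.to_float(tf.not_equal(pos, 0)),
                      [tf.shape(self.input_seq)[0] * args.maxlen])
self.loss = tf.reduce_sum(
    - tf.log(tf.sigmoid(pos_logits) + 1e-24) * istarget
    - tf.log(1 - tf.sigmoid(neg_logits) + 1e-24) * istarget
) / tf.reduce_sum(istarget)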