
bert_config.json: configuration of the model's parameters
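
"attention_probs_dropout_prob": 0.1, # dropout probability applied after the softmax in dot-product attention
"hidden_act": "gelu", # activation function
"hidden_dropout_prob": 0.1, # dropout probability for the hidden layers
"hidden_size": 768, # number of hidden units
"initializer_range": 0.02, # initialization range (stddev of the truncated normal initializer)
"intermediate_size": 3072, # dimension the feed-forward layer expands to
"max_position_embeddings": 512, # must be at least seq_length; used to build position_embedding
"num_attention_heads": 12, # number of attention heads in each hidden layer
"num_hidden_layers": 12, # number of hidden layers
"type_vocab_size": 2, # number of segment_ids classes [0,1]
"vocab_size": 30522 # number of tokens in the vocabulary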

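As a rough illustration of how these values relate to each other, the sketch below parses the same settings with Python's standard json module and derives two quantities: the width of each attention head and the expansion factor of the feed-forward layer. The variable names (CONFIG_TEXT, head_size, ffn_ratio) are illustrative and do not come from the BERT code base.

```python
import json

# The settings listed above as a valid JSON document
# (JSON has no comment syntax, so the annotations are dropped).
CONFIG_TEXT = """
{
  "attention_probs_dropout_prob": 0.1,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "max_position_embeddings": 512,
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "type_vocab_size": 2,
  "vocab_size": 30522
}
"""

config = json.loads(CONFIG_TEXT)

# The hidden vector is split evenly across the attention heads.
assert config["hidden_size"] % config["num_attention_heads"] == 0
head_size = config["hidden_size"] // config["num_attention_heads"]

# The feed-forward layer expands the hidden vector by this factor.
ffn_ratio = config["intermediate_size"] // config["hidden_size"]

print(head_size)  # 64
print(ffn_ratio)  # 4
```

With these values each of the 12 heads works on a 64-dimensional slice, and the feed-forward layer is 4 times wider than the hidden size.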

import copy
import json
import six
import tensorflow as tf

class BertConfig(object):
  def __init__(self,
               vocab_size,
               hidden_size=768,
               num_hidden_layers=12,
               num_attention_heads=12,
               intermediate_size=3072,
               hidden_act="gelu",
               hidden_dropout_prob=0.1,
               attention_probs_dropout_prob=0.1,
               max_position_embeddings=512,
               type_vocab_size=16,
               initializer_range=0.02):
    """Constructs BertConfig.

    Args:
      vocab_size: Vocabulary size of `inputs_ids` in `BertModel` (dictionary size).
      hidden_size: Size of the encoder layers and the pooler layer (number of hidden units).
      num_hidden_layers: Number of hidden layers in the Transformer encoder.
      num_attention_heads: Number of attention heads for each attention layer in
        the Transformer encoder (how many heads multi-head attention uses).
      intermediate_size: The size of the "intermediate" (i.e., feed-forward)
        layer in the Transformer encoder.
      hidden_act: The non-linear activation function (function or string) in the
        encoder and pooler.
      hidden_dropout_prob: The dropout probability for all fully connected
        layers in the embeddings, encoder, and pooler.
      attention_probs_dropout_prob: The dropout ratio for the attention
        probabilities.
      max_position_embeddings: The maximum sequence length that this model might
        ever be used with. Typically set this to something large just in case
        (e.g., 512 or 1024 or 2048).
      type_vocab_size: The vocabulary size of the `token_type_ids` passed into
        `BertModel`.
      initializer_range: The stdev of the truncated_normal_initializer for
        initializing all weight matrices.
    """
    self.vocab_size = vocab_size
    self.hidden_size = hidden_size
    self.num_hidden_layers = num_hidden_layers
    self.num_attention_heads = num_attention_heads
    self.hidden_act = hidden_act
    self.intermediate_size = intermediate_size
    self.hidden_dropout_prob = hidden_dropout_prob
    self.attention_probs_dropout_prob = attention_probs_dropout_prob
    self.max_position_embeddings = max_position_embeddings
    self.type_vocab_size = type_vocab_size
    self.initializer_range = initializer_range

  @classmethod
  def from_dict(cls, json_object):
    """Constructs a `BertConfig` from a Python dictionary of parameters."""
    config = BertConfig(vocab_size=None)
    for (key, value) in six.iteritems(json_object):
      config.__dict__[key] = value
    return config

  @classmethod
  def from_json_file(cls, json_file):
    """Constructs a `BertConfig` from a json file of parameters."""
    with tf.gfile.GFile(json_file, "r") as reader:
      text = reader.read()
    return cls.from_dict(json.loads(text))
  def to_dict(self):
    """Serializes this instance to a Python dictionary."""
    output = copy.deepcopy(self.__dict__)
    return output
  def to_json_string(self):
    """Serializes this instance to a JSON string."""
    return json.dumps(self.to_dict(), indent=2, sort_keys=True) + "\n"






tokens: [CLS] ancient sage [MASK] [MASK] the name kang un ##im [MASK] ##ant to a monk - - pumped water nightly that he might study by day , so i [MASK] the [MASK] of cloak ##s [MASK] para ##sol ##acies , at the sacred doors of her [MASK] - room [MASK] im ##bib ##e celestial knowledge . from my youth i felt in me a [SEP] fallen star , i am , bobbie ! ' continued he , [MASK] ##ively , stroking his lean [MASK] - - ' a fallen star ! - [MASK] fallen , if the dignity [MASK] philosophy will allow of the simi ##le , among the hog [MASK] of the lower world - [MASK] indeed , even into the hog - bucket itself . [SEP]
segment_ids: 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
is_random_next: False
masked_lm_positions: 3 4 6 7 10 29 31 35 38 46 49 71 77 83 92 98 110 116 124
masked_lm_labels: - - name is ##port , guardian and ##s lecture , sir pens stomach - of ##s - bucket
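
The three fields above fit together as follows: masked_lm_positions indexes into tokens, and masked_lm_labels holds the original token for each of those positions. Note that in the sample a selected position does not always show [MASK]: position 6 still reads "name" (its label is also "name"), and position 7 shows "kang" while its label is "is". A minimal sketch of restoring the original sequence, using a short hypothetical instance rather than the long one above (restore_tokens is an illustrative helper, not part of the BERT code):

```python
def restore_tokens(tokens, masked_lm_positions, masked_lm_labels):
    """Return a copy of `tokens` with every selected position set back to its label."""
    restored = list(tokens)
    for position, label in zip(masked_lm_positions, masked_lm_labels):
        restored[position] = label
    return restored

# Hypothetical short instance in the same format as the sample above.
tokens = ["[CLS]", "my", "[MASK]", "is", "cute", "[SEP]",
          "he", "[MASK]", "playing", "[SEP]"]
masked_lm_positions = [2, 7]
masked_lm_labels = ["dog", "likes"]

restored = restore_tokens(tokens, masked_lm_positions, masked_lm_labels)
print(" ".join(restored))
# [CLS] my dog is cute [SEP] he likes playing [SEP]
```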
def embedding_lookup(input_ids,
                     vocab_size,
                     embedding_size=128,
                     initializer_range=0.02,
                     word_embedding_name="word_embeddings",
                     use_one_hot_embeddings=False):
  """Looks up words embeddings for id tensor.

  Args:
    input_ids: int32 Tensor of shape [batch_size, seq_length] containing word
      ids.
    vocab_size: int. Size of the embedding vocabulary.
    embedding_size: int. Width of the word embeddings.
    initializer_range: float. Embedding initialization range.
    word_embedding_name: string. Name of the embedding table.
    use_one_hot_embeddings: bool. If True, use one-hot method for word
      embeddings. If False, use `tf.nn.embedding_lookup()`. One hot is better
      for TPUs.

  Returns:
    float Tensor of shape [batch_size, seq_length, embedding_size].
  """
  # This function assumes that the input is of shape [batch_size, seq_length,
  # num_inputs].
  # If the input is a 2D tensor of shape [batch_size, seq_length], we
  # reshape to [batch_size, seq_length, 1].
  if input_ids.shape.ndims == 2:
    input_ids = tf.expand_dims(input_ids, axis=[-1])
    #print(input_ids) #shape=(32, 128, 1)
  embedding_table = tf.get_variable(
      name=word_embedding_name,
      shape=[vocab_size, embedding_size],
      initializer=create_initializer(initializer_range))
  #print(embedding_table) #shape=(30522, 768)
  if use_one_hot_embeddings:
    flat_input_ids = tf.reshape(input_ids, [-1])
    one_hot_input_ids = tf.one_hot(flat_input_ids, depth=vocab_size)
    output = tf.matmul(one_hot_input_ids, embedding_table)
  else:
    output = tf.nn.embedding_lookup(embedding_table, input_ids)
  input_shape = get_shape_list(input_ids)
  output = tf.reshape(output,
                      input_shape[0:-1] + [input_shape[-1] * embedding_size])
  #print(output) #shape=(32, 128, 768)  batch_size=32, seq_length=128, embedding_size=768 (equal to hidden_size)
  #print(embedding_table) #shape=(30522, 768)
  return (output, embedding_table)
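
The two branches of embedding_lookup produce the same result: multiplying a one-hot row by the embedding table selects exactly one row of the table. The sketch below checks this with plain Python lists instead of TensorFlow tensors. The tiny table and the helper names one_hot and matmul are made up for illustration.

```python
def one_hot(index, depth):
    """Return a list of length `depth` that is 1.0 at `index` and 0.0 elsewhere."""
    return [1.0 if i == index else 0.0 for i in range(depth)]

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]

# Toy embedding table: vocab_size=4, embedding_size=2.
embedding_table = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6], [0.7, 0.8]]
input_ids = [2, 0, 3]

# Branch 1: direct row lookup (what tf.nn.embedding_lookup does).
via_lookup = [embedding_table[i] for i in input_ids]

# Branch 2: one-hot encode the ids, then multiply by the table.
one_hot_ids = [one_hot(i, len(embedding_table)) for i in input_ids]
via_one_hot = matmul(one_hot_ids, embedding_table)

print(via_lookup == via_one_hot)  # True
```

The one-hot route does more arithmetic, but as the docstring notes it tends to suit TPUs better.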

 v1_cons = tf.get_variable('v1_cons', shape=[1,4], initializer=tf.constant_initializer())
 v2_cons = tf.get_variable('v2_cons', shape=[1,4], initializer=tf.constant_initializer(9))
 constant initializer v1_cons: [[0. 0. 0. 0.]]
 constant initializer v2_cons: [[9. 9. 9. 9.]]
embedding_postprocessor adds token_type_embedding and position_embedding, i.e. the Segment Embeddings and Position Embeddings in the figure.

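Embedding structure diagram: taken from "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding".
However, the Position Embeddings in this code differ from those of the earlier Transformer: here they are learned during training, whereas the original Transformer (shown below) uses fixed values.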
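
For comparison, the fixed encoding of the original Transformer can be sketched in a few lines. It follows the sinusoidal formula from "Attention Is All You Need": PE(pos, 2i) = sin(pos / 10000^(2i/d)) and PE(pos, 2i+1) = cos(pos / 10000^(2i/d)). The function name sinusoidal_position_encoding is an illustrative choice, and nothing like it appears in the BERT code, which instead creates a trainable [max_position_embeddings, width] table.

```python
import math

def sinusoidal_position_encoding(seq_length, width):
    """Build a [seq_length, width] table of fixed sinusoidal position encodings."""
    table = []
    for pos in range(seq_length):
        row = []
        for i in range(width):
            # Dimensions 2k and 2k+1 share the same frequency.
            angle = pos / (10000 ** ((2 * (i // 2)) / width))
            row.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
        table.append(row)
    return table

fixed = sinusoidal_position_encoding(seq_length=4, width=6)
print(fixed[0])  # [0.0, 1.0, 0.0, 1.0, 0.0, 1.0]
```

Because the table is a pure function of position, it needs no training. BERT's learned table, by contrast, only covers positions up to max_position_embeddings.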


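As shown above, the input consists of two natural sentences: sentence A, "my dog is cute", and sentence B, "he likes playing". We first convert every word and special symbol into a word-embedding vector, because a neural network can only compute on numbers. The special symbol [SEP] separates the two sentences. The first sentence has segment embedding A added to it, and the second has segment embedding B.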
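
A minimal sketch of how such a sentence pair becomes tokens and segment ids, assuming simple whitespace tokenization in place of WordPiece (build_inputs is a hypothetical helper, not a function from the BERT code):

```python
def build_inputs(tokens_a, tokens_b):
    """Join two token lists with [CLS]/[SEP] and return (tokens, segment_ids)."""
    # Sentence A, including [CLS] and its closing [SEP], gets segment id 0.
    tokens = ["[CLS]"] + tokens_a + ["[SEP]"]
    segment_ids = [0] * len(tokens)
    # Sentence B and its closing [SEP] get segment id 1.
    tokens += tokens_b + ["[SEP]"]
    segment_ids += [1] * (len(tokens_b) + 1)
    return tokens, segment_ids

tokens, segment_ids = build_inputs("my dog is cute".split(),
                                   "he likes playing".split())
print(tokens)
# ['[CLS]', 'my', 'dog', 'is', 'cute', '[SEP]', 'he', 'likes', 'playing', '[SEP]']
print(segment_ids)
# [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
```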

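Because it needs to model the relationship between sentences, one of BERT's tasks is to predict whether sentence B is the sentence that follows sentence A. This classification task relies on the special symbol [CLS] placed at the very front of the A/B pair, which can be regarded as a representation that pools the entire input sequence.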

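The final piece, the position encoding, is dictated by the Transformer architecture itself. A method based entirely on attention cannot encode the positional relationship between words the way a CNN or RNN does, yet this same property is what lets it model the relationship between two words regardless of the distance between them. So, to make the Transformer aware of word positions, we use position encodings to add position information to each word.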

def embedding_postprocessor(input_tensor,
                            use_token_type=False,
                            token_type_ids=None,
                            token_type_vocab_size=16,
                            token_type_embedding_name="token_type_embeddings",
                            use_position_embeddings=True,
                            position_embedding_name="position_embeddings",
                            initializer_range=0.02,
                            max_position_embeddings=512,
                            dropout_prob=0.1):
  """Performs various post-processing on a word embedding tensor.

  Args:
    input_tensor: float Tensor of shape [batch_size, seq_length,
      embedding_size].
    use_token_type: bool. Whether to add embeddings for `token_type_ids`.
    token_type_ids: (optional) int32 Tensor of shape [batch_size, seq_length].
      Must be specified if `use_token_type` is True.
    token_type_vocab_size: int. The vocabulary size of `token_type_ids`.
    token_type_embedding_name: string. The name of the embedding table variable
      for token type ids.
    use_position_embeddings: bool. Whether to add position embeddings for the
      position of each token in the sequence.
    position_embedding_name: string. The name of the embedding table variable
      for positional embeddings.
    initializer_range: float. Range of the weight initialization.
    max_position_embeddings: int. Maximum sequence length that might ever be
      used with this model. This can be longer than the sequence length of
      input_tensor, but cannot be shorter.
    dropout_prob: float. Dropout probability applied to the final output tensor.

  Returns:
    float tensor with same shape as `input_tensor`.

  Raises:
    ValueError: One of the tensor shapes or input values is invalid.
  """
  #print(input_tensor) #shape=(32, 128, 768)
  input_shape = get_shape_list(input_tensor, expected_rank=3)
  batch_size = input_shape[0]   #32
  seq_length = input_shape[1]   #128
  width = input_shape[2]        #768
  output = input_tensor
  if use_token_type:
    if token_type_ids is None:
      raise ValueError("`token_type_ids` must be specified if "
                       "`use_token_type` is True.")
    token_type_table = tf.get_variable(
        name=token_type_embedding_name,
        shape=[token_type_vocab_size, width],
        initializer=create_initializer(initializer_range))
    # This vocab will be small so we always do one-hot here, since it is always
    # faster for a small vocabulary.
    flat_token_type_ids = tf.reshape(token_type_ids, [-1])
    one_hot_ids = tf.one_hot(flat_token_type_ids, depth=token_type_vocab_size)
    token_type_embeddings = tf.matmul(one_hot_ids, token_type_table)
    token_type_embeddings = tf.reshape(token_type_embeddings,
                                       [batch_size, seq_length, width])
    output += token_type_embeddings
  if use_position_embeddings:
    assert_op = tf.assert_less_equal(seq_length, max_position_embeddings)
    with tf.control_dependencies([assert_op]):
      full_position_embeddings = tf.get_variable(
          name=position_embedding_name,
          shape=[max_position_embeddings, width],
          initializer=create_initializer(initializer_range))
      # Since the position embedding table is a learned variable, we create it
      # using a (long) sequence length `max_position_embeddings`. The actual
      # sequence length might be shorter than this, for faster training of
      # tasks that do not have long sequences.
      # So `full_position_embeddings` is effectively an embedding table
      # for position [0, 1, 2, ..., max_position_embeddings-1], and the current
      # sequence has positions [0, 1, 2, ... seq_length-1], so we can just
      # perform a slice.
      position_embeddings = tf.slice(full_position_embeddings, [0, 0],
                                     [seq_length, -1])
      num_dims = len(output.shape.as_list())
      # Only the last two dimensions are relevant (`seq_length` and `width`), so
      # we broadcast among the first dimensions, which is typically just
      # the batch size.
      position_broadcast_shape = []
      for _ in range(num_dims - 2):
        position_broadcast_shape.append(1)
      position_broadcast_shape.extend([seq_length, width])
      position_embeddings = tf.reshape(position_embeddings,
                                       position_broadcast_shape)
      output += position_embeddings
  output = layer_norm_and_dropout(output, dropout_prob)
  #print(output) #shape=(32, 128, 768)
  return output
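
Stripped of TensorFlow details, the function above adds three vectors for every token: its word embedding, the embedding of its segment id, and the embedding of its position. The position table is sliced to the first seq_length rows and shared by every sequence in the batch. The sketch below reproduces that arithmetic with nested lists and small made-up integer tables. It omits the final layer normalization and dropout, and add_embeddings is an illustrative name.

```python
def add_embeddings(word_emb, token_type_ids, token_type_table, position_table):
    """Sum word, token-type and position embeddings.

    word_emb has shape [batch_size, seq_length, width]; the result has the
    same shape.
    """
    output = []
    for sequence, type_ids in zip(word_emb, token_type_ids):
        rows = []
        for pos, (vector, type_id) in enumerate(zip(sequence, type_ids)):
            type_vector = token_type_table[type_id]
            # Only rows 0..seq_length-1 of the position table are ever read.
            pos_vector = position_table[pos]
            rows.append([w + t + p
                         for w, t, p in zip(vector, type_vector, pos_vector)])
        output.append(rows)
    return output

# batch_size=1, seq_length=3, width=2
word_emb = [[[1, 1], [2, 2], [3, 3]]]
token_type_ids = [[0, 0, 1]]
token_type_table = [[10, 10], [20, 20]]               # token_type_vocab_size=2
position_table = [[100, 100], [200, 200],
                  [300, 300], [400, 400]]             # max_position_embeddings=4

summed = add_embeddings(word_emb, token_type_ids, token_type_table, position_table)
print(summed)  # [[[111, 111], [212, 212], [323, 323]]]
```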

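How is the model used? The BertModel class has two accessor functions. get_pooled_output returns the representation of the first token, [CLS], of each sequence in the batch. BERT treats this token as carrying the information of the whole sequence, so it suits sentence-level classification problems. get_sequence_output returns BERT's final output, with shape [batch_size, seq_length, hidden_size]. It can be understood intuitively as the final representation of every token in each sequence, and it suits token-level problems such as sequence labeling.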
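
The relationship between the two outputs can be sketched with nested lists: the sentence-level vector is taken from position 0 of the sequence output. In the BERT code the pooled output additionally passes this first-token vector through a dense layer with a tanh activation. That step is left out here, and first_token_vectors is an illustrative name.

```python
def first_token_vectors(sequence_output):
    """Pick the position-0 ([CLS]) vector of every sequence in the batch."""
    return [sequence[0] for sequence in sequence_output]

# Toy sequence output: batch_size=2, seq_length=3, hidden_size=2.
sequence_output = [
    [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]],
    [[0.7, 0.8], [0.9, 1.0], [1.1, 1.2]],
]

cls_vectors = first_token_vectors(sequence_output)
print(cls_vectors)  # [[0.1, 0.2], [0.7, 0.8]]
```

The result has shape [batch_size, hidden_size], which is what a sentence-level classifier consumes. A token-level classifier would use sequence_output directly.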

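BERT stands for Bidirectional Encoder Representations from Transformers. "Bidirectional" means that when the model processes a given word, it can use information from both the words before it and the words after it. This comes from the fact that, unlike a traditional language model, BERT does not predict the most likely current word given all the previous words. Instead it randomly masks some words and predicts them from all the words that were not masked. The figure below shows three pre-training models: BERT and ELMo both use bidirectional information, while OpenAI GPT uses unidirectional information.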
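
The random masking step can be sketched as follows. The proportions are assumptions taken from the BERT paper: about 15% of the tokens are selected, and of those 80% become [MASK], 10% are swapped for a random token, and 10% are left unchanged. The function mask_tokens, its parameters and the tiny vocabulary are hypothetical, and the real data-generation script differs in detail.

```python
import random

def mask_tokens(tokens, vocab, masked_lm_prob=0.15, seed=0):
    """Return (masked_tokens, positions, labels) for one token sequence."""
    rng = random.Random(seed)
    # [CLS] and [SEP] are never selected for prediction.
    candidates = [i for i, tok in enumerate(tokens)
                  if tok not in ("[CLS]", "[SEP]")]
    rng.shuffle(candidates)
    num_to_predict = max(1, round(len(tokens) * masked_lm_prob))
    positions = sorted(candidates[:num_to_predict])

    masked = list(tokens)
    labels = []
    for pos in positions:
        labels.append(tokens[pos])  # the model must recover the original token
        roll = rng.random()
        if roll < 0.8:
            masked[pos] = "[MASK]"           # 80%: replace with [MASK]
        elif roll < 0.9:
            masked[pos] = rng.choice(vocab)  # 10%: replace with a random token
        # remaining 10%: keep the original token

    return masked, positions, labels

tokens = "[CLS] my dog is cute [SEP] he likes playing [SEP]".split()
vocab = ["my", "dog", "is", "cute", "he", "likes", "playing"]
masked, positions, labels = mask_tokens(tokens, vocab)
print(masked)
print(positions)
print(labels)
```

Since every unselected token stays visible, a prediction at a selected position can draw on context from both sides.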

