After working through a few exercises that build RNNs with TensorFlow, I found that the purpose and origin of many statements in the example code were not obvious, so I went back to the official TensorFlow documentation and dug up a lot of important information; understanding it helps a great deal in understanding how TensorFlow works under the hood. Given how much TensorFlow can do, my grasp of many of its mechanisms is still fairly shallow, and I will keep revising this post. Most of the quoted material comes from the Low Level API section of the official TensorFlow documentation, which I strongly recommend to anyone who wants a deep understanding of TensorFlow.
Tensors
The central unit of data in TensorFlow is the tensor. A tensor consists of a set of primitive values shaped into an array of any number of dimensions. A tensor's rank is its number of dimensions, while its shape is a tuple of integers specifying the array's length along each dimension. TensorFlow uses numpy arrays to represent tensor values.
Precisely because tensors in TensorFlow are represented with NumPy arrays, a one-dimensional tensor is stored as a row vector, and the definitions and calculations of a tensor's shape, rank, number of dimensions, and axes all follow the NumPy conventions.
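As a quick illustration of these conventions (a minimal sketch; the values here are arbitrary):
import tensorflow as tf

# Shapes and ranks follow the NumPy conventions described above.
scalar = tf.constant(3.0)                       # rank 0, shape ()
vector = tf.constant([1.0, 2.0, 3.0])           # rank 1, shape (3,)  -- a row vector
matrix = tf.constant([[1.0, 2.0], [3.0, 4.0]])  # rank 2, shape (2, 2)

print(scalar.shape, vector.shape, matrix.shape)  # (), (3,), (2, 2)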
As with NumPy arrays, all elements of a single Tensor must share the same data type, but the data type of a Tensor can be changed with tf.cast():
# Cast a constant integer tensor into floating point.
float_tensor = tf.cast(tf.constant([1, 2, 3]), dtype=tf.float32)
The Core TensorFlow Workflow
When building a model with TensorFlow, the work splits into two core steps:
- Build a computational graph that describes how the variables are related through operations
- Open a Session and run the computational graph to obtain the results
You might think of TensorFlow Core programs as consisting of two discrete sections:
- Building the computational graph (a tf.Graph).
- Running the computational graph (using a tf.Session).
In computer-science terms this style is called dataflow programming. If, like me, functional programming was what you learned first, the paradigm may feel a bit novel. One important reason TensorFlow adopted the dataflow style is that it adapts well to parallel computation across multiple devices.
Dataflow has several advantages that TensorFlow leverages when executing your programs:
- Parallelism. By using explicit edges to represent dependencies between operations, it is easy for the system to identify operations that can execute in parallel.
- Distributed execution. By using explicit edges to represent the values that flow between operations, it is possible for TensorFlow to partition your program across multiple devices (CPUs, GPUs, and TPUs) attached to different machines. TensorFlow inserts the necessary communication and coordination between devices.
- Compilation. TensorFlow's XLA compiler can use the information in your dataflow graph to generate faster code, for example, by fusing together adjacent operations.
- Portability. The dataflow graph is a language-independent representation of the code in your model. You can build a dataflow graph in Python, store it in a SavedModel, and restore it in a C++ program for low-latency inference.
Computational Graphs and How They Are Built
A TensorFlow computational graph breaks down into two core kinds of components:
- Operations, which appear as the nodes of the graph
- Tensors, which appear as the edges of the graph
A computational graph is a series of TensorFlow operations arranged into a graph. The graph is composed of two types of objects.
- Operations (or "ops"): The nodes of the graph. Operations describe calculations that consume and produce tensors.
- Tensors: The edges in the graph. These represent the values that will flow through the graph. Most TensorFlow functions return tf.Tensors.
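As a small sketch of this node-and-edge structure (not from the original guide), you can list the operations that a few lines of graph construction add to the default graph:
import tensorflow as tf

a = tf.constant(3.0, name='a')       # adds a Const node; its output tensor is 'a:0'
b = tf.constant(4.0, name='b')
total = tf.add(a, b, name='total')   # adds an Add node fed by the two Const nodes

# Each op is a node of the default tf.Graph; its output tensors are the edges.
for op in tf.get_default_graph().get_operations():
    print(op.name, '->', [t.name for t in op.outputs])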
Sessions
Once the computational graph has been built, you need to place it inside a Session and run it there to actually obtain its results.
A session encapsulates the state of the TensorFlow runtime, and runs TensorFlow operations.
In [2]:
sess = tf.Session()
print(sess.run(total))
Out [2]:
7.0
In [3]:
print(sess.run({'ab':(a, b), 'total': total}))
Out [3]:
{'ab': (3.0, 4.0), 'total': 7.0}
Comparing the two runs above, the computational graph itself can be thought of as a function: handing different arguments to sess.run() (here, different fetches) yields different results.
Some TensorFlow functions return tf.Operations instead of tf.Tensors. The result of calling run on an Operation is None. You run an operation to cause a side-effect, not to retrieve a value. Examples of this include the initialization, and training ops demonstrated later.
# Initializing the variables
init = tf.global_variables_initializer()
sess.run(init)
Besides launching a session with the defaults, you can also configure how the session is launched through tf.ConfigProto:
# Launch the graph in a session that allows soft device placement and
# logs the placement decisions.
sess = tf.Session(config=tf.ConfigProto(allow_soft_placement=True,
                                        log_device_placement=True))
Creating Variables
The simplest way to create a variable is tf.get_variable(name, shape=None, dtype=None, initializer=None, ...), which requires at least a name. If a variable with that name already exists (and the enclosing scope allows reuse, see below), the existing variable is returned; otherwise a new variable is created according to the remaining arguments. Typical usages:
- my_variable = tf.get_variable('my_variable', [1, 2, 3]): uses the default dtype=tf.float32 and the default initializer tf.glorot_uniform_initializer, which fills a tensor of shape (1, 2, 3) with random values.
- my_int_variable = tf.get_variable('my_int_variable', [1, 2, 3], dtype=tf.int32, initializer=tf.zeros_initializer()): creates a variable of shape (1, 2, 3) initialized to all zeros.
- other_variable = tf.get_variable('other_variable', dtype=tf.int32, initializer=tf.constant([23, 42])): initializes the variable from a constant tensor, so the shape is inferred from that tensor and does not need to be specified.
Another common way to create a variable is to instantiate the Variable class directly, which usually only needs an initial value:
w = tf.Variable(<initial-value>, name=<optional-name>)
The main difference between the two is that tf.get_variable() works together with tf.variable_scope() and therefore handles variable reuse much more cleanly.
Initializing Variables
In TensorFlow, creating a variable and initializing it are two separate steps; the initial value supplied at creation time is only actually written into the variable when an explicit initialization op is run:
- initialize all variables at once: session.run(tf.global_variables_initializer())
- initialize a single variable: session.run(my_variable.initializer)
Note that tf.global_variables_initializer() is not foolproof: it does not guarantee any initialization order, so it can fail when variables depend on each other. In that case either initialize some variables individually first, or, for a variable whose initial value is computed from another variable, use variable.initialized_value() to make the initialization order explicit:
v = tf.get_variable('v', shape=(), initializer=tf.zeros_initializer())
w = tf.get_variable('w', initializer=v.initialized_value() + 1)
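To see which variables still need initializing, tf.report_uninitialized_variables() is handy (a small sketch reusing v and w from above; thanks to initialized_value() the global initializer now handles the dependency correctly):
sess = tf.Session()
print(sess.run(tf.report_uninitialized_variables()))  # both 'v' and 'w' are listed

sess.run(tf.global_variables_initializer())
print(sess.run(tf.report_uninitialized_variables()))  # now an empty list
print(sess.run(w))                                     # 1.0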
Variable Collections
Since variables can be created at countless points in a graph, TensorFlow lets you group them into named lists, called collections, so that they can be classified and accessed in bulk. tf.GraphKeys defines a number of standard collection names that can be used directly:
By default every tf.Variable gets placed in the following two collections:
- tf.GraphKeys.GLOBAL_VARIABLES --- variables that can be shared across multiple devices
- tf.GraphKeys.TRAINABLE_VARIABLES --- variables for which TensorFlow will calculate gradients.
Because every variable is placed into tf.GraphKeys.GLOBAL_VARIABLES by default, sess.run(tf.global_variables_initializer()) is able to initialize all of them in one call. If you do not want a variable to be changed by training, either add it to tf.GraphKeys.LOCAL_VARIABLES or pass trainable=False when constructing it:
my_local = tf.get_variable('my_local', shape=(), collections=[tf.GraphKeys.LOCAL_VARIABLES])
my_non_trainable = tf.get_variable('my_non_trainable', shape=(), trainable=False)
In the first case the variable is collected under tf.GraphKeys.LOCAL_VARIABLES instead of GLOBAL_VARIABLES; in both cases it is left out of tf.GraphKeys.TRAINABLE_VARIABLES, so it will not be updated during training.
The simplest way to add a variable to a collection of your own is tf.add_to_collection('my_collection_name', my_variable_name), and the corresponding way to retrieve every variable in a collection is tf.get_collection('my_collection_name').
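For example (a minimal sketch; the collection and variable names here are made up), you can group a couple of variables under a custom collection and read them back later:
import tensorflow as tf

lr = tf.get_variable('lr', shape=(), initializer=tf.constant_initializer(0.01))
wd = tf.get_variable('wd', shape=(), initializer=tf.constant_initializer(1e-4))

# Put both hyper-parameter variables into a custom collection.
tf.add_to_collection('hyper_params', lr)
tf.add_to_collection('hyper_params', wd)

print(tf.get_collection('hyper_params'))                    # [lr, wd]
print(tf.get_collection(tf.GraphKeys.TRAINABLE_VARIABLES))  # both appear here too, by default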
How Variables and Operations Are Named Internally
In TensorFlow, every operation on a variable is recorded internally under an operation name generated according to TensorFlow's own naming rules; this name is independent of the Python variable name used in the program.
Each operation in a graph is given a unique name. This name is independent of the names the objects are assigned to in Python. Tensors are named after the operation that produces them followed by an output index, as in "add:0"
Note how the operation and tensor names show up in the output of the following lines:
In [1]:
import numpy as np
import tensorflow as tf
a = tf.constant(3.0, dtype=tf.float32)
b = tf.constant(4.0)
total = a + b
print(a)
print(b)
print(total)
Out [1]:
Tensor("Const_8:0", shape=(), dtype=float32)
Tensor("Const_9:0", shape=(), dtype=float32)
Tensor("add_7:0", shape=(), dtype=float32)
Most TensorFlow operations return tensors (tf.Tensors). In the code above, the operation names embedded in the three printed tensors change every time the cell is re-run (the numeric suffix keeps incrementing), which is why it is common practice to give Tensors explicit names while building the graph. This matters whenever something later has to be referred to by name, for example when loading previously trained parameters in transfer learning; otherwise parameters may end up loaded into the wrong layers.
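One way to do that (a sketch with a hypothetical op name) is to pass name= when creating the op and to look the tensor up by name later, instead of relying on the auto-generated suffixes:
import tensorflow as tf

# Give the op an explicit name so we do not depend on names like 'add_7'.
logits = tf.add(tf.constant(1.0), tf.constant(2.0), name='logits')
print(logits.name)  # e.g. 'logits:0' -> op name plus output index

# Later, e.g. after re-importing a saved graph, the tensor can be fetched by name.
graph = tf.get_default_graph()
same_tensor = graph.get_tensor_by_name(logits.name)
assert same_tensor is logits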
Sharing Variables
Parameter sharing is common in multi-layer networks and especially in RNNs. Since variables are usually given explicit names that directly reflect their role and context, once several variables with similar roles are needed you have to tell TensorFlow unambiguously whether calling the same variable name again should create a brand-new variable with that name or reuse the existing one.
In the example below, the weights and biases must be created fresh for each layer; to make this explicit to TensorFlow, wrap each layer in its own scope with tf.variable_scope():
def conv_relu(input, kernel_shape, bias_shape):
    # Create variable named 'weights'.
    weights = tf.get_variable('weights', kernel_shape,
                              initializer=tf.random_normal_initializer())
    # Create variable named 'biases'.
    biases = tf.get_variable('biases', bias_shape,
                             initializer=tf.constant_initializer(0.0))
    conv = tf.nn.conv2d(input, weights,
                        strides=[1, 1, 1, 1], padding='SAME')
    return tf.nn.relu(conv + biases)

def my_image_filter(input_images):
    with tf.variable_scope('conv1'):
        # Variables created here will be named 'conv1/weights', 'conv1/biases'.
        relu1 = conv_relu(input_images, [5, 5, 32, 32], [32])
    with tf.variable_scope('conv2'):
        # Variables created here will be named 'conv2/weights', 'conv2/biases'.
        return conv_relu(relu1, [5, 5, 32, 32], [32])
When you do want the variables to be reused, open a variable_scope with the same name each time the variables are needed and pass reuse=True to make it explicit that the variables inside these scopes are shared:
with tf.variable_scope('model'):
    output1 = my_image_filter(input1)
with tf.variable_scope('model', reuse=True):
    output2 = my_image_filter(input2)
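Two equivalent variants of this pattern are also common (a sketch meant as an alternative to the snippet above, not to be run after it; input3 is just another hypothetical input): switching an open scope into reuse mode with scope.reuse_variables(), or passing reuse=tf.AUTO_REUSE so that get_variable creates the variables on the first call and reuses them afterwards.
# Variant 1: flip the scope into reuse mode part-way through.
with tf.variable_scope('model') as scope:
    output1 = my_image_filter(input1)
    scope.reuse_variables()
    output2 = my_image_filter(input2)

# Variant 2: create on first use, reuse on every later call.
with tf.variable_scope('model', reuse=tf.AUTO_REUSE):
    output3 = my_image_filter(input3)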
Defining Model Parameters
When actually building a model, a clear and explicit way to define the global model parameters (variables) is to declare a set of FLAGS at the top of the file; the recommended pattern is:
import tensorflow as tf
# Below `tf.app.flags` is a tensorflow wrapper for `absl.flags`
flags = tf.app.flags
# The defined `flag_name` can be used by `FLAGS.flag_name`
flags.DEFINE_*("flag_name", default_value, "short strings to describe this flag")
FLAGS = flags.FLAGS
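A concrete version of the template might look like this (the flag names, defaults, and the main function are made up for illustration):
import tensorflow as tf

flags = tf.app.flags
flags.DEFINE_integer('batch_size', 64, 'Number of examples per training batch.')
flags.DEFINE_float('learning_rate', 1e-3, 'Initial learning rate.')
flags.DEFINE_string('train_dir', '/tmp/train', 'Directory for checkpoints and logs.')
FLAGS = flags.FLAGS

def main(_):
    print('batch_size = %d, lr = %g' % (FLAGS.batch_size, FLAGS.learning_rate))

if __name__ == '__main__':
    tf.app.run()  # parses the command-line flags, then calls main()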
Placeholders
Much like the formal parameters of a function, placeholders let you build a computational graph around values that will only be supplied later, when the graph is run with concrete inputs.
A graph can be parameterized to accept external inputs, known as placeholders. A placeholder is a promise to provide a value later, like a function argument.
In [4]:
x = tf.placeholder(tf.float32)
y = tf.placeholder(tf.float32)
z = x + y
print(sess.run(z, feed_dict={x:3, y: 4.5}))
print(sess.run(z, feed_dict={x: [1, 3], y: [2, 4]}))
Out [4]:
7.5
[ 3. 7.]
Layers
A "layer" object in TensorFlow bundles the variables a layer needs (weights, biases) together with the operations that connect them to its inputs, e.g. tf.layers.Dense and tf.layers.Conv2D.
In [5]:
x = tf.placeholder(tf.float32, shape=[None, 3])
# Instantiate the tf.layers.Dense class with `output units`
# the input shape must be partially available for TF to infer the
# required weights shape
linear_model = tf.layers.Dense(units=1)
# call the layer as if it is a function
y = linear_model(x)
init = tf.global_variables_initializer()
sess.run(init)
print(sess.run(y, {x: [[1, 2, 3],[4, 5, 6]]}))
Out [5]:
[[-3.41378999]
[-9.14999008]]
The layer inspects its input to determine sizes for its internal variables. So here we must set the shape of the x placeholder so that the layer can build a weight matrix of the correct size.
For each layer class (like tf.layers.Dense) TensorFlow also supplies a shortcut function (like tf.layers.dense). The only difference is that the shortcut function versions create and run the layer in a single call. For example, the following code is equivalent to the earlier version:
In [6]:
x = tf.placeholder(tf.float32, shape=[None, 3])
y = tf.layers.dense(x, units=1)
init = tf.global_variables_initializer()
sess.run(init)
print(sess.run(y, {x: [[1, 2, 3], [4, 5, 6]]}))
Out [6]:
[[-3.41378999]
[-9.14999008]]
While convenient, this approach allows no access to the tf.layers.Layer object. This makes introspection and debugging more difficult, and layer reuse impossible.
Saving and Restoring Parameters
Saving and restoring variables is set up through saver = tf.train.Saver(). Note that both directions are configurable, i.e. you can choose to save, or to restore, only a subset of the variables.
# Model checkpoint with saver.save()
v1 = tf.get_variable("v1", shape=(3), initializer=tf.zeros_initializer)
v2 = tf.get_variable("v2", shape=(5), initializer=tf.zeros_initializer)
inc_v1 = v1.assign(v1 + 1)
dec_v2 = v2.assign(v2 - 1)
init = tf.global_variables_initializer()
saver = tf.train.Saver()
with tf.Session() as sess:
    sess.run(init)
    inc_v1.op.run()
    dec_v2.op.run()
    save_path = saver.save(sess, "/tmp/model.ckpt")
    print("Model saved in path: %s" % save_path)
# Partially restore the parameters from saved model checkpoint
tf.reset_default_graph()
v1 = tf.get_variable("v1", (3), initializer=tf.zeros_initializer)
v2 = tf.get_variable("v2", (5), initializer=tf.zeros_initializer)
saver = tf.train.Saver({"v2": v2})
with tf.Session() as sess:
    # `v1` needs to be initialized because it is not restored
    v1.initializer.run()
    saver.restore(sess, "/tmp/model.ckpt")
    print("v1: {}".format(v1.eval()))
    print("v2: {}".format(v2.eval()))
To find out which variables a checkpoint file contains, you can first inspect it with inspect_checkpoint:
from tensorflow.python.tools import inspect_checkpoint as ckpt
ckpt.print_tensors_in_checkpoint_file("/tmp/model.ckpt", tensor_name="", all_tensors=True)
tensor_name: v1
[1. 1. 1.]
tensor_name: v2
[-1. -1. -1. -1. -1.]
You can also inspect only some of the variables:
ckpt.print_tensors_in_checkpoint_file("/tmp/model.ckpt", tensor_name="v1", all_tensors=False)
Eager Execution
If you want to be able to see what the code is actually computing while you are still building the model, rather than waiting until the whole graph is finished, a single line turns on the magic:
tf.enable_eager_execution()
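A minimal sketch of what this changes: once eager execution is enabled (it has to be called once at program startup, before any graphs or sessions are created), operations return concrete values immediately instead of symbolic Tensors:
import tensorflow as tf

tf.enable_eager_execution()  # must run before any other TensorFlow call

x = tf.constant([[1.0, 2.0], [3.0, 4.0]])
y = tf.matmul(x, x)
print(y)          # prints the actual values right away, no Session needed
print(y.numpy())  # and the result converts directly to a NumPy array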
Parallel Computation
In everyday practice, large projects usually need multi-machine, multi-GPU training, which requires some dedicated setup in TensorFlow.
Multi-GPU Computation on a Single Machine
When several GPUs compute in parallel, the synchronized parameter update can only happen after every GPU has finished its share of the work, so it is best to use GPUs of the same model. Because writing data to and reading data from a GPU is slow, TensorFlow performs the parameter updates on the CPU and only refreshes the data held on the GPUs when a new batch is processed.
By default TensorFlow computes on a single GPU of a single machine, namely the GPU with ID 0. To pin a computation to a specific GPU, and to fall back automatically to another available device when that GPU is not available, you can write:
with tf.device('/device:GPU:2'):
    a = tf.constant([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], shape=[2, 3], name='a')
    b = tf.constant([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], shape=[3, 2], name='b')
    c = tf.matmul(a, b)
sess = tf.Session(config=tf.ConfigProto(allow_soft_placement=True, log_device_placement=True))
print(sess.run(c))
When you want TensorFlow to use several GPUs for the computation, the in-graph replication pattern can be used:
# Creates a graph.
c = []
for d in ['/device:GPU:2', '/device:GPU:3']:
    with tf.device(d):
        a = tf.constant([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], shape=[2, 3])
        b = tf.constant([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], shape=[3, 2])
        c.append(tf.matmul(a, b))
with tf.device('/cpu:0'):
    sum = tf.add_n(c)
# Creates a session with log_device_placement set to True.
sess = tf.Session(config=tf.ConfigProto(log_device_placement=True))
# Runs the op.
print(sess.run(sum))
Multi-Machine, Multi-GPU Computation
A few terms need to be understood before setting up multi-machine, multi-GPU training:
- cluster: the group of machines that carry out the distributed computation
- server: each machine that takes part in the training; a single machine may hold several GPUs
- client: the program that writes the TensorFlow code and issues the computation instructions
A TensorFlow "cluster" is a set of "tasks" that participate in the distributed execution of a TensorFlow graph. Each task is associated with a TensorFlow "server", which contains a "master" that can be used to create sessions, and a "worker" that executes operations in the graph. A cluster can also be divided into one or more "jobs", where each job contains one or more tasks.
- replicas: in multi-machine training the model is replicated once per machine, so num_of_replicas equals the number of machines; each replica trains on its own subset of every data batch
- clones: the number of GPUs contained in each machine
- tower: when several GPUs are used, a tower abstraction is introduced to balance the load across them. Each tower needs two properties:
  - a unique name, used to scope its operations, e.g. tower_0, tower_0/conv1/Conv2D
  - a preferred device on which the tower's operations should run, e.g. /device:GPU:0
In order to properly make use of multiple GPU's, one must introduce new abstractions, not present when using a single GPU, that facilitate the multi-GPU use case. In particular, one must introduce a means to isolate the inference and gradient calculations on the various GPU's. The abstraction we introduce for this purpose is called a 'tower'.
A tower is specified by two properties:
- Scope - A scope, as provided by tf.name_scope(), is a means to isolate the operations within a tower. For example, all operations within 'tower 0' could have their name prefixed with tower_0/.
- Device - A hardware device, as provided by tf.device(), on which all operations within the tower execute. For example, all operations of 'tower 0' could execute on the first GPU tf.device('/gpu:0').
- jobs & tasks: job 指一个模型的计算任务,其中可以包含多个子项的工作职责 tasks,如针对 CPU 和 GPU 的工作分配
A job comprises a list of "tasks", which typically serve a common purpose. For example, a job named ps (for "parameter server") typically hosts nodes that store and update variables; while a job named worker typically hosts stateless nodes that perform compute-intensive tasks.
A task corresponds to a specific TensorFlow server, and typically corresponds to a single process. A task belongs to a particular "job" and is identified by its index within that job's list of tasks.
- parameter server (ps): the CPU(s) that store the parameters and apply the parameter updates; a parameter server can be shared by several towers, i.e. by several machines
- workers: the GPUs that run data preprocessing, the loss computation, and the gradient computation
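To make these terms concrete, a cluster is usually described with a tf.train.ClusterSpec, and each task starts a tf.train.Server for its own job and index (a sketch; the host:port addresses below are placeholders):
import tensorflow as tf

# One parameter-server task and two worker tasks (addresses are hypothetical).
cluster = tf.train.ClusterSpec({
    'ps':     ['ps0.example.com:2222'],
    'worker': ['worker0.example.com:2222', 'worker1.example.com:2222'],
})

# Each process runs one task; this one would be worker 0.
server = tf.train.Server(cluster, job_name='worker', task_index=0)

# Variables go to the ps job, compute ops stay on this worker, via replica_device_setter.
with tf.device(tf.train.replica_device_setter(cluster=cluster,
                                              worker_device='/job:worker/task:0')):
    w = tf.get_variable('w', shape=(), initializer=tf.zeros_initializer())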
When training on multiple machines with multiple GPUs, be sure to create the Saver in sharded mode when setting up checkpointing, so that the different hosts can each write their own shard of the checkpoint:
# `server`, `is_chief` and `step` are assumed to come from the surrounding training setup.
saver = tf.train.Saver(sharded=True)
with tf.Session(server.target) as sess:
    while True:
        if is_chief and step % 1000 == 0:
            saver.save(sess, "/checkpoint/path")
TensorFlow Models
TensorFlow model files are built on the Protocol Buffer mechanism, which was designed specifically for cross-language and cross-platform communication, so a TensorFlow model built with any officially supported API can be used across platforms.
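For instance (a sketch; the export path and the layer are arbitrary), in TF 1.x a graph and its trained variables can be exported into the language-independent SavedModel format with tf.saved_model.simple_save, and then reloaded from another language or process:
import tensorflow as tf

x = tf.placeholder(tf.float32, shape=[None, 3], name='x')
y = tf.layers.dense(x, units=1)

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    # Writes the graph definition and the variable values under the given directory.
    tf.saved_model.simple_save(sess, '/tmp/simple_model',
                               inputs={'x': x}, outputs={'y': y})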
gRPC (Google Remote Procedure Calls) is Google's open-source remote procedure call protocol built on Protocol Buffers; it is the tool that lets a local client drive remote servers for distributed computation.