Stanford cs231n Assignment #2: Implementing a CNN -- Convolutional Neural Nets

Before the CNN part of assignment 2 there are exercises on fully connected neural nets, batch normalization and dropout. Dropout is fairly simple to implement; for batch norm I referred to someone else's solution in the previous post (I got it wrong at first and haven't had time to re-derive it myself yet, which I hope to make up for later). The fully connected net boils down to two passes. Forward: store a cache and compute each layer's output in turn. Backward: use the cache and dout to compute the gradient updates for the current layer's parameters (dout is passed down from the layer above; the cache belongs to the current layer and holds whatever the forward pass saved, namely that layer's input and anything else needed for the computation. For example, the cache of the first layer is the input x together with the first layer's weight and bias; special layers such as dropout additionally need to save the mask that decides which input features are zeroed out, e.g. np.random.rand(x.shape[0], x.shape[1])).
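
To make this forward-cache / backward pattern concrete, here is a minimal sketch of an affine layer written in that style (the names affine_forward / affine_backward follow the assignment's layers.py convention; this is just an illustration, not the assignment's reference solution):

import numpy as np

def affine_forward(x, w, b):
  # Flatten each sample to a row vector, then apply the linear map.
  out = x.reshape(x.shape[0], -1).dot(w) + b
  cache = (x, w, b)   # the backward pass needs this layer's own inputs
  return out, cache

def affine_backward(dout, cache):
  x, w, b = cache
  dx = dout.dot(w.T).reshape(x.shape)           # same shape as x
  dw = x.reshape(x.shape[0], -1).T.dot(dout)    # same shape as w
  db = np.sum(dout, axis=0)                     # same shape as b
  return dx, dw, db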

Now for the main part: implementing a CNN. After all, the fully connected nets were only a testbed. I will record my understanding and approach step by step.

1. Conv Layer Naive Forward

Given the padding pad and the stride, a conv layer's output spatial dimensions are computed as

H' = 1 + (H + 2 * pad - HH) / stride
W' = 1 + (W + 2 * pad - WW) / stride

where the input is H x W and the filter is HH x WW. For example, a 32x32 input with a 7x7 filter, pad = 3 and stride = 1 gives a 32x32 output.

Note that a filter convolves over all the channels of its input, so a single filter produces a single output channel, and n filters produce n output channels. The number of kernels inside one filter equals the number of input channels of the previous layer, one kernel per channel. Code:

def conv_forward_naive(x, w, b, conv_param):
  """
  A naive implementation of the forward pass for a convolutional layer.

  The input consists of N data points, each with C channels, height H and width
  W. We convolve each input with F different filters, where each filter spans
  all C channels and has height HH and width WW.

  Input:
  - x: Input data of shape (N, C, H, W)
  - w: Filter weights of shape (F, C, HH, WW)
  - b: Biases, of shape (F,)
  - conv_param: A dictionary with the following keys:
    - 'stride': The number of pixels between adjacent receptive fields in the
      horizontal and vertical directions.
    - 'pad': The number of pixels that will be used to zero-pad the input.

  Returns a tuple of:
  - out: Output data, of shape (N, F, H', W') where H' and W' are given by
    H' = 1 + (H + 2 * pad - HH) / stride
    W' = 1 + (W + 2 * pad - WW) / stride
  - cache: (x, w, b, conv_param)
  """
  out = None
  #############################################################################
  # TODO: Implement the convolutional forward pass.                           #
  # Hint: you can use the function np.pad for padding.                        #
  #############################################################################
  stride = conv_param['stride']
  pad = conv_param['pad']
  
  x_pad = np.zeros((x.shape[0], x.shape[1], x.shape[2]+pad*2, x.shape[3]+pad*2))
  
  # pad 0 to image
  for i in range(x.shape[0]):
    for j in range(x.shape[1]):
      x_pad[i,j,:,:]=np.pad(x[i,j,:,:], pad, 'constant', constant_values=0)

  H_out = (x.shape[2]+2*pad-w.shape[2])//stride+1
  W_out = (x.shape[3]+2*pad-w.shape[3])//stride+1
  out = np.zeros((x.shape[0], w.shape[0], H_out, W_out))

  # convolution
  for i in range(H_out):
    for j in range(W_out):
      for k in range(w.shape[0]):
        out[:, k, i, j] = np.sum(w[k,:,:,:]*x_pad[:, :, i*stride:i*stride+w.shape[2], j*stride:j*stride+w.shape[3]], axis=(1,2,3))+b[k]
  
  #############################################################################
  #                             END OF YOUR CODE                              #
  #############################################################################
  cache = (x, w, b, conv_param)
  return out, cache
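
A quick way to sanity-check the forward pass is to confirm that the output shape matches the formula above. A minimal sketch (the shapes and hyperparameters here are made-up examples):

import numpy as np

x = np.random.randn(2, 3, 8, 8)    # N=2 images, C=3 channels, 8x8
w = np.random.randn(4, 3, 3, 3)    # F=4 filters, each 3x3 over 3 channels
b = np.zeros(4)
conv_param = {'stride': 1, 'pad': 1}

out, _ = conv_forward_naive(x, w, b, conv_param)
# H' = 1 + (8 + 2*1 - 3)/1 = 8, and the same for W'
assert out.shape == (2, 4, 8, 8)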

2. Conv Layer Naive Backward

[Recommended reading]:
http://www.jefkine.com/general/2016/09/05/backpropagation-in-convolutional-neural-networks/

(Figure: ConvLayer connection)

My intuitive summary:

  • dx must have the same shape as x, dw the same shape as w, db the same shape as b, and dout the same shape as out. This follows from what they mean: dL/dx (dx in the code) measures how much a change in x changes L, and x can only change within its own shape.
  • Pad x before computing. Recall that each individual convolution step, which in the figure above pairs a 2×2 window with a 2×2 weight matrix, is just a multiply-and-accumulate: every element of out is obtained by multiplying an input window with the weight matrix and summing. Take dx as an example: each element of x may take part in anywhere from one to several such operations, each time being multiplied by a weight and contributing its product to some element of out. If x(0,0) participates only once, then dx(0,0) = weight(0,0) × dout(...); if x(0,1) participates twice, contributing to two different output positions, then dx(0,1) = weight(0,0) × dout(...) + weight(0,1) × dout(...), and so on.

Code:

def conv_backward_naive(dout, cache):
  """
  A naive implementation of the backward pass for a convolutional layer.

  Inputs:
  - dout: Upstream derivatives.
  - cache: A tuple of (x, w, b, conv_param) as in conv_forward_naive

  Returns a tuple of:
  - dx: Gradient with respect to x
  - dw: Gradient with respect to w
  - db: Gradient with respect to b
  """
  dx, dw, db = None, None, None
  #############################################################################
  # TODO: Implement the convolutional backward pass.                          #
  #############################################################################
  x = cache[0]
  w = cache[1]
  b = cache[2]
  stride = cache[3]['stride']
  pad = cache[3]['pad']

  x_pad=np.pad(x, [(0,0),(0,0),(pad,pad),(pad,pad)], 'constant', constant_values=0)
  dx_pad = np.zeros((x.shape[0], x.shape[1], x.shape[2]+2*pad, x.shape[3]+2*pad))
  dw = np.zeros(w.shape)
  db = np.zeros(b.shape)
  # k (first dim) indexes samples; f (second dim) indexes filters; xi and xj
  # run over the height and width of this layer's output map. Given the
  # analysis above: each output element scatters w*dout into dx_pad and
  # x_pad*dout into dw.
  for k in range(dout.shape[0]):
    for f in range(dout.shape[1]):
      for xi in range(dout.shape[2]):
        for xj in range(dout.shape[3]):
          dx_pad[k,:,xi*stride:xi*stride+w.shape[2],xj*stride:xj*stride+w.shape[3]] += w[f,:]*dout[k,f,xi,xj]
          dw[f] += x_pad[k,:,xi*stride:xi*stride+w.shape[2],xj*stride:xj*stride+w.shape[3]]*dout[k,f,xi,xj]

  dx = dx_pad[:,:,pad:pad+x.shape[2],pad:pad+x.shape[3]]
  db = np.sum(dout, axis=(0,2,3))
  #############################################################################
  #                             END OF YOUR CODE                              #
  #############################################################################
  return dx, dw, db
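
The backward pass is best verified numerically. A minimal sketch, assuming the assignment's eval_numerical_gradient_array helper from cs231n.gradient_check is available; the analytic and numerical gradients should agree to many decimal places:

import numpy as np
from cs231n.gradient_check import eval_numerical_gradient_array

x = np.random.randn(2, 3, 7, 7)
w = np.random.randn(4, 3, 3, 3)
b = np.random.randn(4)
dout = np.random.randn(2, 4, 7, 7)
conv_param = {'stride': 1, 'pad': 1}

out, cache = conv_forward_naive(x, w, b, conv_param)
dx, dw, db = conv_backward_naive(dout, cache)

# Numerical gradients to compare against the analytic ones.
dx_num = eval_numerical_gradient_array(lambda x: conv_forward_naive(x, w, b, conv_param)[0], x, dout)
dw_num = eval_numerical_gradient_array(lambda w: conv_forward_naive(x, w, b, conv_param)[0], w, dout)
print np.max(np.abs(dx - dx_num)), np.max(np.abs(dw - dw_num))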

3. Max Pooling Naive Forward

def max_pool_forward_naive(x, pool_param):
  """
  A naive implementation of the forward pass for a max pooling layer.

  Inputs:
  - x: Input data, of shape (N, C, H, W)
  - pool_param: dictionary with the following keys:
    - 'pool_height': The height of each pooling region
    - 'pool_width': The width of each pooling region
    - 'stride': The distance between adjacent pooling regions

  Returns a tuple of:
  - out: Output data
  - cache: (x, pool_param)
  """
  out = None
  #############################################################################
  # TODO: Implement the max pooling forward pass                              #
  #############################################################################
  pool_height = pool_param['pool_height']
  pool_width = pool_param['pool_width']
  stride = pool_param['stride']

  out = np.zeros(( x.shape[0], x.shape[1],
                   (x.shape[2]-pool_height)//stride+1,
                   (x.shape[3]-pool_width)//stride+1 ))

  for n in range(x.shape[0]):
    for c in range(x.shape[1]):
      for i in range(out.shape[2]):
        for j in range(out.shape[3]):
          out[n,c,i,j] = np.max( x[n,c,i*stride:i*stride+pool_height, j*stride:j*stride+pool_width] )

  #############################################################################
  #                             END OF YOUR CODE                              #
  #############################################################################
  cache = (x, pool_param)
  return out, cache
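
When the pooling regions tile the input exactly (pool size equals stride and divides H and W, as with the 2x2/stride-2 pooling used later), the same forward pass can be vectorized with a reshape instead of four loops. A minimal sketch under that assumption (not the assignment's official fast implementation):

import numpy as np

def max_pool_forward_reshape(x, pool_param):
  # Only valid when pool_height == pool_width == stride and both divide H, W.
  N, C, H, W = x.shape
  ph, pw = pool_param['pool_height'], pool_param['pool_width']
  stride = pool_param['stride']
  assert ph == pw == stride and H % ph == 0 and W % pw == 0
  # Split H into (H//ph, ph) and W into (W//pw, pw), then max over each window.
  x_reshaped = x.reshape(N, C, H // ph, ph, W // pw, pw)
  out = x_reshaped.max(axis=3).max(axis=4)
  return out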

4. Max Pooling Naive Backward

def max_pool_backward_naive(dout, cache):
  """
  A naive implementation of the backward pass for a max pooling layer.

  Inputs:
  - dout: Upstream derivatives
  - cache: A tuple of (x, pool_param) as in the forward pass.

  Returns:
  - dx: Gradient with respect to x
  """
  dx = None
  #############################################################################
  # TODO: Implement the max pooling backward pass                             #
  #############################################################################
  x = cache[0]
  dx = np.zeros(x.shape)
  pool_param = cache[1]
  stride = pool_param['stride']
  pool_height = pool_param['pool_height']
  pool_width = pool_param['pool_width']
  for n in range(x.shape[0]):
    for c in range(x.shape[1]):
      for i in range(dout.shape[2]):
        for j in range(dout.shape[3]):
          tmp = x[n,c,i*stride:i*stride+pool_height,j*stride:j*stride+pool_width]
          binary = (tmp == np.max(tmp))
          # accumulate with += so that overlapping pooling windows are handled correctly
          dx[n,c,i*stride:i*stride+pool_height,j*stride:j*stride+pool_width] += binary*dout[n,c,i,j]
  #############################################################################
  #                             END OF YOUR CODE                              #
  #############################################################################
  return dx

5. Training a Three-Layer Network: (Conv + ReLU + Pooling) + (Affine + ReLU) + (Affine) + Softmax

class ThreeLayerConvNet(object):
  """
  A three-layer convolutional network with the following architecture:
  
  conv - relu - 2x2 max pool - affine - relu - affine - softmax
  
  The network operates on minibatches of data that have shape (N, C, H, W)
  consisting of N images, each with height H and width W and with C input
  channels.
  """
  
  def __init__(self, input_dim=(3, 32, 32), num_filters=32, filter_size=7,
               hidden_dim=100, num_classes=10, weight_scale=1e-3, reg=0.0,
               dtype=np.float32):
    """
    Initialize a new network.
    
    Inputs:
    - input_dim: Tuple (C, H, W) giving size of input data
    - num_filters: Number of filters to use in the convolutional layer
    - filter_size: Size of filters to use in the convolutional layer
    - hidden_dim: Number of units to use in the fully-connected hidden layer
    - num_classes: Number of scores to produce from the final affine layer.
    - weight_scale: Scalar giving standard deviation for random initialization
      of weights.
    - reg: Scalar giving L2 regularization strength
    - dtype: numpy datatype to use for computation.
    """
    self.params = {}
    self.reg = reg
    self.dtype = dtype
    
    ############################################################################
    # TODO: Initialize weights and biases for the three-layer convolutional    #
    # network. Weights should be initialized from a Gaussian with standard     #
    # deviation equal to weight_scale; biases should be initialized to zero.   #
    # All weights and biases should be stored in the dictionary self.params.   #
    # Store weights and biases for the convolutional layer using the keys 'W1' #
    # and 'b1'; use keys 'W2' and 'b2' for the weights and biases of the       #
    # hidden affine layer, and keys 'W3' and 'b3' for the weights and biases   #
    # of the output affine layer.                                              #
    ############################################################################
    self.params['W1'] = np.random.normal(0, weight_scale, (num_filters, input_dim[0], filter_size, filter_size))
    self.params['b1'] = np.zeros(num_filters)
    # After the 2x2 max pool, the conv output is flattened to
    # num_filters * (H/2) * (W/2) features per image.
    W2_row_size = num_filters * (input_dim[1] // 2) * (input_dim[2] // 2)
    self.params['W2'] = np.random.normal(0, weight_scale, (W2_row_size, hidden_dim))
    self.params['b2'] = np.zeros(hidden_dim)
    self.params['W3'] = np.random.normal(0, weight_scale, (hidden_dim, num_classes))
    self.params['b3'] = np.zeros(num_classes)
    ############################################################################
    #                             END OF YOUR CODE                             #
    ############################################################################

    for k, v in self.params.iteritems():
      self.params[k] = v.astype(dtype)
     
 
  def loss(self, X, y=None):
    """
    Evaluate loss and gradient for the three-layer convolutional network.
    
    Input / output: Same API as TwoLayerNet in fc_net.py.
    """
    W1, b1 = self.params['W1'], self.params['b1']
    W2, b2 = self.params['W2'], self.params['b2']
    W3, b3 = self.params['W3'], self.params['b3']
    
    # pass conv_param to the forward pass for the convolutional layer
    filter_size = W1.shape[2]
    conv_param = {'stride': 1, 'pad': (filter_size - 1) / 2}

    # pass pool_param to the forward pass for the max-pooling layer
    pool_param = {'pool_height': 2, 'pool_width': 2, 'stride': 2}

    scores = None
    loss, grads = 0, {}
    ############################################################################
    # TODO: Implement the forward pass for the three-layer convolutional net,  #
    # computing the class scores for X and storing them in the scores          #
    # variable.                                                                #
    ############################################################################
    out1, cache1 = conv_relu_pool_forward(X, W1, b1, conv_param, pool_param)
    out2, cache2 = affine_relu_forward(out1, W2, b2)
    scores, cache3 = affine_forward(out2, W3, b3)
    ############################################################################
    #                             END OF YOUR CODE                             #
    ############################################################################
    
    if y is None:
      return scores
    
    ############################################################################
    # TODO: Implement the backward pass for the three-layer convolutional net, #
    # storing the loss and gradients in the loss and grads variables. Compute  #
    # data loss using softmax, and make sure that grads[k] holds the gradients #
    # for self.params[k]. Don't forget to add L2 regularization!               #
    ############################################################################
    loss, dscore = softmax_loss(scores, y)
    loss += 0.5 * self.reg * sum(np.sum(W**2) for W in [W1, W2, W3])

    dout3, grads['W3'], grads['b3'] = affine_backward(dscore, cache3)
    dout2, grads['W2'], grads['b2'] = affine_relu_backward(dout3, cache2)
    dout1, grads['W1'], grads['b1'] = conv_relu_pool_backward(dout2, cache1)

    grads['W3'] += self.reg*W3
    grads['W2'] += self.reg*W2
    grads['W1'] += self.reg*W1
    ############################################################################
    #                             END OF YOUR CODE                             #
    ############################################################################
    
    return loss, grads
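
To actually train the network, the assignment provides a Solver class. A rough sketch of the usual usage (the exact constructor arguments may differ between assignment versions, and data is assumed to be the usual dict of X_train / y_train / X_val / y_val arrays):

from cs231n.solver import Solver

model = ThreeLayerConvNet(weight_scale=1e-3, hidden_dim=500, reg=1e-3)

solver = Solver(model, data,
                num_epochs=1, batch_size=50,
                update_rule='adam',
                optim_config={'learning_rate': 1e-3},
                verbose=True, print_every=20)
solver.train()   # runs the training loop and records loss / accuracy history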

6. Spatial Batch Normalization

We already saw that batch normalization is a very useful technique for training deep fully-connected networks. Batch normalization can also be used for convolutional networks, but we need to tweak it a bit; the modification will be called "spatial batch normalization."

Normally batch-normalization accepts inputs of shape (N, D) and produces outputs of shape (N, D), where we normalize across the minibatch dimension N. For data coming from convolutional layers, batch normalization needs to accept inputs of shape (N, C, H, W) and produce outputs of shape (N, C, H, W) where the N dimension gives the minibatch size and the (H, W) dimensions give the spatial size of the feature map.

If the feature map was produced using convolutions, then we expect the statistics of each feature channel to be relatively consistent both between different images and different locations within the same image. Therefore spatial batch normalization computes a mean and variance for each of the C feature channels by computing statistics over both the minibatch dimension N and the spatial dimensions H and W.

The difference between batch norm for a CNN and for a fully connected net is that the spatial BN used for conv layers normalizes per channel (e.g. the RGB channels), so that within each channel the distribution is consistent and normalized both across different images and across different locations of the same image. I did not write this part myself; below is martinkersner's code from GitHub:

def spatial_batchnorm_forward(x, gamma, beta, bn_param):
  """
  Computes the forward pass for spatial batch normalization.

  Inputs:
  - x: Input data of shape (N, C, H, W)
  - gamma: Scale parameter, of shape (C,)
  - beta: Shift parameter, of shape (C,)
  - bn_param: Dictionary with the following keys:
    - mode: 'train' or 'test'; required
    - eps: Constant for numeric stability
    - momentum: Constant for running mean / variance. momentum=0 means that
      old information is discarded completely at every time step, while
      momentum=1 means that new information is never incorporated. The
      default of momentum=0.9 should work well in most situations.
    - running_mean: Array of shape (C,) giving running mean of features
    - running_var: Array of shape (C,) giving running variance of features

  Returns a tuple of:
  - out: Output data, of shape (N, C, H, W)
  - cache: Values needed for the backward pass
  """
  out, cache = None, None

  #############################################################################
  # Implement the forward pass for spatial batch normalization.               #
  # HINT: You can implement spatial batch normalization using the vanilla     #
  # version of batch normalization defined above. Your implementation should  #
  # be very short; ours is less than five lines.                              #
  #############################################################################
  N, C, H, W = x.shape
  # Move the channel axis last, flatten the (N, H, W) positions into rows,
  # run vanilla batch norm over the C columns, then restore the layout.
  x_reshaped = x.transpose(0, 2, 3, 1).reshape(N * H * W, C)
  out_tmp, cache = batchnorm_forward(x_reshaped, gamma, beta, bn_param)
  out = out_tmp.reshape(N, H, W, C).transpose(0, 3, 1, 2)

  return out, cache



def spatial_batchnorm_backward(dout, cache):
  """
  Computes the backward pass for spatial batch normalization.

  Inputs:
  - dout: Upstream derivatives, of shape (N, C, H, W)
  - cache: Values from the forward pass

  Returns a tuple of:
  - dx: Gradient with respect to inputs, of shape (N, C, H, W)
  - dgamma: Gradient with respect to scale parameter, of shape (C,)
  - dbeta: Gradient with respect to shift parameter, of shape (C,)
  """
  dx, dgamma, dbeta = None, None, None

  #############################################################################
  # Implement the backward pass for spatial batch normalization.              #
  # HINT: You can implement spatial batch normalization using the vanilla     #
  # version of batch normalization defined above. Your implementation should  #
  # be very short; ours is less than five lines.                              #
  #############################################################################
  N, C, H, W = dout.shape
  # Reshape dout the same way as x in the forward pass, backprop through the
  # vanilla batch norm, then restore the (N, C, H, W) layout.
  dout_reshaped = dout.transpose(0, 2, 3, 1).reshape(N * H * W, C)
  dx_tmp, dgamma, dbeta = batchnorm_backward(dout_reshaped, cache)
  dx = dx_tmp.reshape(N, H, W, C).transpose(0, 3, 1, 2)

  return dx, dgamma, dbeta
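
As a quick check, after spatial batch norm in train mode every channel of the output should have roughly zero mean and unit variance over the (N, H, W) positions. A minimal sketch, assuming the vanilla batchnorm_forward from earlier in the assignment is importable:

import numpy as np

N, C, H, W = 2, 3, 4, 5
x = 4 + 10 * np.random.randn(N, C, H, W)
gamma, beta = np.ones(C), np.zeros(C)

out, _ = spatial_batchnorm_forward(x, gamma, beta, {'mode': 'train'})
# Per-channel means should be close to 0 and stds close to 1.
print out.mean(axis=(0, 2, 3)), out.std(axis=(0, 2, 3))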

[One last tip] Tips for training:

For each network architecture that you try, you should tune the learning rate and regularization strength. When doing this there are a couple important things to keep in mind:

  • If the parameters are working well, you should see improvement within a few hundred iterations.
  • Remember the coarse-to-fine approach for hyperparameter tuning: start by testing a large range of hyperparameters for just a few training iterations to find the combinations of parameters that are working at all.
  • Once you have found some sets of parameters that seem to work, search more finely around these parameters. You may need to train for more epochs. (A sketch of the coarse random-search stage follows this list.)
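
A minimal sketch of that coarse random-search stage, sampling the learning rate and regularization strength log-uniformly and training each candidate for a single epoch (the Solver usage mirrors the training sketch above and is an assumption about the assignment's API):

import numpy as np
from cs231n.solver import Solver

best_val, best_params = -1, None
for _ in range(10):
  lr = 10 ** np.random.uniform(-4, -2)     # coarse log-uniform ranges
  reg = 10 ** np.random.uniform(-5, -2)
  model = ThreeLayerConvNet(weight_scale=1e-3, reg=reg)
  solver = Solver(model, data, num_epochs=1, batch_size=50,
                  update_rule='adam',
                  optim_config={'learning_rate': lr},
                  verbose=False)
  solver.train()
  if solver.best_val_acc > best_val:
    best_val, best_params = solver.best_val_acc, (lr, reg)
print best_val, best_params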

Also, for backpropagation through a CNN, there are more efficient implementations than the for loops used in the code above; see:
http://www.cnblogs.com/pinard/p/6494810.html
