Batch Normalization (BN), much like ordinary data normalization, is a way of bringing scattered data onto a common scale; it is also a technique for optimizing neural networks.

Introduction to BN

Consider a minimal network in which each layer has a single node and no bias. If the network has three layers, its output can be written as:

Z = x * w1 * w2 * w3

Suppose two such networks have learned two sets of weights, (w1: 1, w2: 1, w3: 1) and (w1: 0.01, w2: 10000, w3: 0.01). For the same input they produce exactly the same output Z.

  1. Backpropagation: suppose the gradient of the loss at the output, δy, is 1. The resulting weight gradients are (δw1: 1, δw2: 1, δw3: 1) for the first set and (δw1: 100, δw2: 0.0001, δw3: 100) for the second.
  2. Weight update: after the update, the two sets of weights become (w1: 2, w2: 2, w3: 2) and (w1: 100.01, w2: 10000.0001, w3: 100.01).
  3. Second forward pass: with an input sample of 1, the first network outputs Z = 1 x 2 x 2 x 2 = 8, while the second outputs Z = 1 x 100.01 x 10000.0001 x 100.01 ≈ 100,000,000 (see the short sketch after this list).
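For reference, the arithmetic of these three steps can be reproduced with a few lines of plain Python (a sketch using the example's own numbers; the "update" simply adds the raw gradient, as in the example, with no learning rate):

x, dLdZ = 1.0, 1.0                                # input sample and assumed output gradient
for w1, w2, w3 in [(1.0, 1.0, 1.0), (0.01, 10000.0, 0.01)]:
    # Z = x * w1 * w2 * w3, so the chain rule gives:
    dw1 = dLdZ * x * w2 * w3
    dw2 = dLdZ * x * w1 * w3
    dw3 = dLdZ * x * w1 * w2
    w1, w2, w3 = w1 + dw1, w2 + dw2, w3 + dw3     # step 2: update the weights
    print((dw1, dw2, dw3), x * w1 * w2 * w3)      # step 3: second forward pass
# -> (1.0, 1.0, 1.0)        8.0
# -> (100.0, 0.0001, 100.0) roughly 1e8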

As these numbers show, the outputs of the two networks now differ enormously. If training continues, the computed loss grows even larger until the network can no longer be trained; this phenomenon is called gradient explosion. Its root cause is internal covariate shift: during the forward pass, the parameters of each layer change the distribution of the data that the backward pass then uses as its reference.
This is exactly why batch normalization is introduced. Its job is to keep the output of each forward pass on the same distribution as much as possible, so that the data distribution seen during the backward pass matches the one seen during the forward pass. Only when the distributions are consistent do the weight adjustments remain meaningful.

Concretely, batch normalization rescales the data produced by each layer to a standard Gaussian distribution with mean 0 and variance 1, which preserves the distributional features of the samples while removing the distribution differences between layers.
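In code, the per-layer operation amounts to the following (a minimal NumPy sketch; gamma and beta stand for the learnable scale and shift that the TensorFlow functions below expose as scale and offset):

import numpy as np

def batch_norm_sketch(z, gamma=1.0, beta=0.0, eps=0.001):
    # z: one layer's output for a batch, shape [batch, features]
    mean = z.mean(axis=0)                      # per-feature mean over the batch
    var = z.var(axis=0)                        # per-feature variance over the batch
    z_hat = (z - mean) / np.sqrt(var + eps)    # zero mean, unit variance
    return gamma * z_hat + beta                # learnable scale and shift

z = np.random.normal(5.0, 3.0, size=(128, 30))                  # a badly scaled layer output
print(batch_norm_sketch(z).mean(), batch_norm_sketch(z).std())  # roughly 0 and 1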

In practice, batch normalization converges very quickly and generalizes well; in some cases it can completely replace regularization and Dropout.

Definition of BN

The BN implementation in TensorFlow:

tf.nn.batch_normalization(
    x,                 # the input
    mean,              # the mean of the samples
    variance,          # the variance of the samples
    offset,            # an additive shift; the activation applied afterwards handles this, so 0 is usually passed
    scale,             # a multiplicative scale; for the same reason, 1 is usually passed
    variance_epsilon,  # a tiny value added to the denominator to avoid division by zero; the default is fine
    name=None
)

To use this function, another one is needed alongside it: tf.nn.moments, which computes the mean and variance that are then fed into BN.

tf.nn.moments(x, axes, name=None, keep_dims=False)
# axes specifies the axes along which the mean and variance are computed

To compute the sample mean and variance you normally keep only the last dimension; for an input x you can simply use axes = list(range(len(x.get_shape()) - 1)). For example, for a shape of [128, 3, 3, 12], axes is [0, 1, 2], and the resulting mean and variance have shape [12].
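Putting tf.nn.moments and tf.nn.batch_normalization together, a typical pattern looks like the sketch below (Wx_plus_b stands in for a fully connected layer output of shape [batch, out_size]; scale and shift are the learnable gamma and beta):

import tensorflow as tf

out_size = 30
Wx_plus_b = tf.random_normal([128, out_size])        # stand-in for a layer's output
axes = list(range(len(Wx_plus_b.get_shape()) - 1))   # [0] here; [0, 1, 2] for NHWC images
fc_mean, fc_var = tf.nn.moments(Wx_plus_b, axes)     # per-feature mean and variance
scale = tf.Variable(tf.ones([out_size]))             # gamma
shift = tf.Variable(tf.zeros([out_size]))            # beta
bn_out = tf.nn.batch_normalization(Wx_plus_b, fc_mean, fc_var,
                                   shift, scale, variance_epsilon=0.001)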

We would also like to smooth the per-batch mean and variance with an exponential moving average, which is what tf.train.ExponentialMovingAverage is for: it lets the previous value influence the current one through a decay factor, so the sequence of values ends up relatively smooth. Its update rule is:

shadow_variable = decay * shadow_variable + (1 - decay) * variable

  • decay: the decay rate, specified when constructing ExponentialMovingAverage, e.g. 0.9
  • variable: the value computed from the current batch
  • shadow_variable on the right-hand side: the running value from the previous batches
  • shadow_variable on the left-hand side: the new running value computed for the current batch
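As a quick illustration, the update rule behaves like this in plain Python (with made-up batch values):

decay = 0.9
shadow_mean = 0.0                           # running value carried over from earlier batches
for batch_mean in [2.0, 2.4, 1.8, 2.2]:     # hypothetical per-batch means
    shadow_mean = decay * shadow_mean + (1 - decay) * batch_mean
    print(shadow_mean)                      # creeps toward the batch values instead of jumping

The demo below uses tf.train.ExponentialMovingAverage in exactly this way, via ema.apply() and tf.control_dependencies(), and compares training a small network with and without BN: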
import tensorflow as tf
import numpy as np
import matplotlib.pyplot as plt

N_LAYERS = 7         # 7 hidden layers in total
N_HIDDEN_UNITS = 30  # 30 neurons per hidden layer


def fix_seed(seed=1):
    np.random.seed(seed)
    tf.set_random_seed(seed)


def plot_his(inputs, inputs_norm):
    # plot histogram for the inputs of every layer
    for j, all_inputs in enumerate([inputs, inputs_norm]):
        for i, input in enumerate(all_inputs):
            plt.subplot(2, len(all_inputs), j * len(all_inputs) + (i + 1))
            plt.cla()
            if i == 0:
                the_range = (-7, 10)
            else:
                the_range = (-1, 1)
            plt.hist(input.ravel(), bins=15, range=the_range, color='#FF5733')
            plt.yticks(())
            if j == 1:
                plt.xticks(the_range)
            else:
                plt.xticks(())
            ax = plt.gca()
            ax.spines['right'].set_color('none')
            ax.spines['top'].set_color('none')
            plt.title("%s normalizing" % ("Without" if j == 0 else "With"))
    plt.draw()
    plt.pause(0.01)


def build_net(xs, ys, norm):
    def add_layer(inputs,
                  in_size,
                  out_size,
                  activation_function=None,
                  norm=False):
        # weights and biases (bad initialization for this case)
        Weights = tf.Variable(
            tf.random_normal([in_size, out_size], mean=0., stddev=1.))
        biases = tf.Variable(tf.zeros([1, out_size]) + 0.1)

        # fully connected product
        Wx_plus_b = tf.matmul(inputs, Weights) + biases

        # normalize fully connected product
        if norm:
            # Batch Normalize
            fc_mean, fc_var = tf.nn.moments(
                Wx_plus_b,
                axes=[0],  # the dimension you want to normalize, here [0] for batch
                # for images, use [0, 1, 2] for [batch, height, width], but not channel
            )
            scale = tf.Variable(tf.ones([out_size]))
            shift = tf.Variable(tf.zeros([out_size]))
            epsilon = 0.001

            # apply moving average for mean and var when train on batch
            ema = tf.train.ExponentialMovingAverage(decay=0.5)

            def mean_var_with_update():
                ema_apply_op = ema.apply([fc_mean, fc_var])
                with tf.control_dependencies([ema_apply_op]):
                    return tf.identity(fc_mean), tf.identity(fc_var)

            mean, var = mean_var_with_update()

            Wx_plus_b = tf.nn.batch_normalization(Wx_plus_b, mean, var, shift,
                                                  scale, epsilon)
            # similar to these two steps:
            # Wx_plus_b = (Wx_plus_b - fc_mean) / tf.sqrt(fc_var + 0.001)
            # Wx_plus_b = Wx_plus_b * scale + shift

        # activation
        if activation_function is None:
            outputs = Wx_plus_b
        else:
            outputs = activation_function(Wx_plus_b)

        return outputs

    fix_seed(1)

    if norm:
        # BN for the first input
        fc_mean, fc_var = tf.nn.moments(
            xs,
            axes=[0],
        )
        scale = tf.Variable(tf.ones([1]))
        shift = tf.Variable(tf.zeros([1]))
        epsilon = 0.001
        # apply moving average for mean and var when train on batch
        ema = tf.train.ExponentialMovingAverage(decay=0.5)

        def mean_var_with_update():
            ema_apply_op = ema.apply([fc_mean, fc_var])
            with tf.control_dependencies([ema_apply_op]):
                return tf.identity(fc_mean), tf.identity(fc_var)

        mean, var = mean_var_with_update()
        xs = tf.nn.batch_normalization(xs, mean, var, shift, scale, epsilon)

    # record inputs for every layer
    layers_inputs = [xs]

    # build hidden layers
    for l_n in range(N_LAYERS):
        layer_input = layers_inputs[l_n]
        in_size = layers_inputs[l_n].get_shape()[1].value

        output = add_layer(
            layer_input,     # input
            in_size,         # input size
            N_HIDDEN_UNITS,  # output size
            tf.nn.relu,      # activation function
            norm,            # normalize before activation
        )
        layers_inputs.append(output)  # add output for next run

    # build output layer
    prediction = add_layer(layers_inputs[-1], 30, 1, activation_function=None)

    cost = tf.reduce_mean(
        tf.reduce_sum(tf.square(ys - prediction), reduction_indices=[1]))
    train_op = tf.train.GradientDescentOptimizer(0.001).minimize(cost)
    return [train_op, cost, layers_inputs]


# create the data
fix_seed(1)
x_data = np.linspace(-7, 10, 2500)[:, np.newaxis]
np.random.shuffle(x_data)
noise = np.random.normal(0, 8, x_data.shape)
y_data = np.square(x_data) - 5 + noise
plt.scatter(x_data, y_data)
plt.show()

xs = tf.placeholder(tf.float32, [None, 1])  # [num_samples, num_features]
ys = tf.placeholder(tf.float32, [None, 1])

train_op, cost, layers_inputs = build_net(xs, ys, norm=False)  # without BN
train_op_norm, cost_norm, layers_inputs_norm = build_net(
    xs, ys, norm=True)  # with BN

with tf.Session() as sess:
    init = tf.global_variables_initializer()
    sess.run(init)

    cost_his = []
    cost_his_norm = []
    record_step = 5
    plt.ion()
    plt.figure(figsize=(7, 3))
    for i in range(250):
        if i % 50 == 0:
            # plot histogram
            all_inputs, all_inputs_norm = sess.run(
                [layers_inputs, layers_inputs_norm],
                feed_dict={xs: x_data, ys: y_data})
            plot_his(all_inputs, all_inputs_norm)

        # train on batch
        sess.run([train_op, train_op_norm],
                 feed_dict={
                     xs: x_data[i * 10:i * 10 + 10],
                     ys: y_data[i * 10:i * 10 + 10]
                 })

        if i % record_step == 0:
            cost_his.append(sess.run(cost, feed_dict={xs: x_data, ys: y_data}))
            cost_his_norm.append(
                sess.run(cost_norm, feed_dict={xs: x_data, ys: y_data}))

    plt.ioff()
    plt.figure()
    plt.plot(
        np.arange(len(cost_his)) * record_step, np.array(cost_his),
        label='no BN')  # no norm
    plt.plot(
        np.arange(len(cost_his)) * record_step,
        np.array(cost_his_norm),
        label='BN')  # norm
    plt.legend()
    plt.show()

A simpler way to use BN

Although the function above does not take many arguments, it has to be combined with several other functions. TensorFlow's layers module therefore provides another BN implementation that effectively bundles those functions together.

# required import
from tensorflow.contrib.layers.python.layers import batch_norm
# the function's definition (signature)
def batch_norm(
    inputs,
    decay=0.999,
    center=True,
    scale=False,
    epsilon=0.001,
    activation_fn=None,
    param_initializers=None,
    param_regularizers=None,
    updates_collections=ops.GraphKeys.UPDATE_OPS,
    is_training=True,
    reuse=None,
    variables_collections=None,
    outputs_collections=None,
    trainable=True,
    batch_weights=None,
    fused=False,
    data_format=DATA_FORMAT_NHWC,
    zero_debias_moving_mean=False,
    scope=None,
    renorm_clipping=None,
    renorm_decay=0.99
):

  • inputs: the input tensor
  • decay: the decay rate of the moving averages. The moving mean and variance are updated with an exponential moving average; 0.9 is a common setting. Too small a value makes the mean and variance update too quickly, while a value too close to 1 means almost no decay, which makes overfitting more likely; in that case, lower it a bit.
  • scale: if True, multiply by gamma; if False, gamma is not used. When the following layer is linear (for example nn.relu), this can be disabled, since the scaling can be done by the next layer.
  • epsilon: a tiny value added to the denominator to avoid division by zero. The default is usually fine.
  • is_training: True during training, in which case the moving mean and variance of the sample set are updated continuously; set it to False at test time so the statistics collected during training are used instead.
  • updates_collections: defaults to tf.GraphKeys.UPDATE_OPS, a built-in mechanism that updates the moving mean and variance through the graph's (i.e. one computation task's) tf.GraphKeys.UPDATE_OPS collection. However, those update ops only run after the current batch has finished training, so the current batch always sees the statistics of the previous one. It is therefore common to set this to None so the mean and variance are updated immediately. This is slightly slower than the default but usually helps the model train (see the sketch after this list).
  • reuse: enables variable sharing; used together with scope
  • scope: the variable_scope for the layer's variables
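A minimal usage sketch (the layer sizes here are hypothetical; the point is that a single is_training placeholder lets the same graph serve both training and testing):

import tensorflow as tf
from tensorflow.contrib.layers.python.layers import batch_norm

x = tf.placeholder(tf.float32, [None, 784])
is_training = tf.placeholder(tf.bool)              # fed True while training, False at test time

W = tf.Variable(tf.truncated_normal([784, 256], stddev=0.1))
fc = tf.matmul(x, W)                               # no bias needed: BN's beta plays that role
bn = batch_norm(fc, decay=0.9, updates_collections=None,
                is_training=is_training, scale=True)
out = tf.nn.relu(bn)

# training step:  sess.run(train_op, feed_dict={x: batch_x, is_training: True, ...})
# evaluation:     sess.run(accuracy, feed_dict={x: test_x, is_training: False, ...})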

Adding BN to the CIFAR image-classification model

Adding the BN function

Right after the pooling function, add a BN wrapper function:

def avg_pool_6x6(x):
    return tf.nn.avg_pool(x, ksize=[1, 6, 6, 1],
                          strides=[1, 6, 6, 1], padding='SAME')

def batch_norm_layer(value, train=None, name='batch_norm'):
    if train is not None:
        return batch_norm(value, decay=0.9, updates_collections=None, is_training=True)
    else:
        return batch_norm(value, decay=0.9, updates_collections=None, is_training=False)

Adding a placeholder for the BN function

Because BN needs to know whether it is currently in the training phase, we define a placeholder train so that the training state can be fed in:

x = tf.placeholder(tf.float32, [None, 24, 24, 3])  # CIFAR images have shape 24x24x3
y = tf.placeholder(tf.float32, [None, 10])         # 10 classes
train = tf.placeholder(tf.float32)

Modifying the network structure to add BN layers

Insert the BN layer after the convolution and before the activation output in the first layer h_conv1 and the second layer h_conv2:

h_conv1=tf.nn.relu(batch_norm_layer((conv2d(x_image,W_conv1)+b_conv1),train))
h_pool1=max_pool_2x2(h_conv1)
W_conv2=weight_variable([5,5,64,64])
b_conv2=bias_variable([64])
h_conv2=tf.nn.relu(batch_norm_layer((conv2d(h_pool1,W_conv2)+b_conv2),train))
h_pool2=max_pool_2x2(h_conv2)

Adding a decayed learning rate

Replace the original fixed learning rate with a decayed one: start from 0.04 and let it decay by a factor of 0.9 every 1,000 steps.

cross_entropy = -tf.reduce_sum(y*tf.log(y_conv))
global_step = tf.Variable(0, trainable=False)
decaylearning_rate = tf.train.exponential_decay(0.04, global_step,1000, 0.9)
train_step = tf.train.AdamOptimizer(decaylearning_rate).minimize(cross_entropy,global_step=global_step)
correct_prediction = tf.equal(tf.argmax(y_conv,1), tf.argmax(y,1))
accuracy = tf.reduce_mean(tf.cast(correct_prediction, "float"))
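For reference, with the default staircase=False, tf.train.exponential_decay computes the learning rate continuously as

decaylearning_rate = 0.04 * 0.9 ** (global_step / 1000)

so the rate shrinks smoothly as global_step grows.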

Running the session

In the session, locate the training loop and feed the value 1 for the placeholder train to indicate that this is the training phase. Nothing else needs to change: batch_norm_layer defined in the first step treats train=None as the default, which corresponds to the test phase.

for i in range(20000):
    image_batch, label_batch = sess.run([images_train, labels_train])
    label_b = np.eye(10, dtype=float)[label_batch]  # one hot
    train_step.run(feed_dict={x: image_batch, y: label_b, train: 1}, session=sess)

The complete code is as follows:

import cifar10_input
import tensorflow as tf
import numpy as np
from tensorflow.contrib.layers.python.layers import batch_norm

batch_size = 128
data_dir = '/tmp/cifar10_data/cifar-10-batches-bin'
print("begin")
images_train, labels_train = cifar10_input.inputs(eval_data=False, data_dir=data_dir, batch_size=batch_size)
images_test, labels_test = cifar10_input.inputs(eval_data=True, data_dir=data_dir, batch_size=batch_size)
print("begin data")

def weight_variable(shape):
    initial = tf.truncated_normal(shape, stddev=0.1)
    return tf.Variable(initial)

def bias_variable(shape):
    initial = tf.constant(0.1, shape=shape)
    return tf.Variable(initial)

def conv2d(x, W):
    return tf.nn.conv2d(x, W, strides=[1, 1, 1, 1], padding='SAME')

def max_pool_2x2(x):
    return tf.nn.max_pool(x, ksize=[1, 2, 2, 1],
                          strides=[1, 2, 2, 1], padding='SAME')

def avg_pool_6x6(x):
    return tf.nn.avg_pool(x, ksize=[1, 6, 6, 1],
                          strides=[1, 6, 6, 1], padding='SAME')

def batch_norm_layer(value, train=None, name='batch_norm'):
    if train is not None:
        return batch_norm(value, decay=0.9, updates_collections=None, is_training=True)
    else:
        return batch_norm(value, decay=0.9, updates_collections=None, is_training=False)

# tf Graph Input
x = tf.placeholder(tf.float32, [None, 24, 24, 3])  # CIFAR data images of shape 24*24*3
y = tf.placeholder(tf.float32, [None, 10])         # digits 0-9 => 10 classes
train = tf.placeholder(tf.float32)

W_conv1 = weight_variable([5, 5, 3, 64])
b_conv1 = bias_variable([64])

x_image = tf.reshape(x, [-1, 24, 24, 3])

h_conv1 = tf.nn.relu(batch_norm_layer((conv2d(x_image, W_conv1) + b_conv1), train))
h_pool1 = max_pool_2x2(h_conv1)

W_conv2 = weight_variable([5, 5, 64, 64])
b_conv2 = bias_variable([64])

h_conv2 = tf.nn.relu(batch_norm_layer((conv2d(h_pool1, W_conv2) + b_conv2), train))
h_pool2 = max_pool_2x2(h_conv2)


W_conv3 = weight_variable([5, 5, 64, 10])
b_conv3 = bias_variable([10])
h_conv3 = tf.nn.relu(conv2d(h_pool2, W_conv3) + b_conv3)

nt_hpool3 = avg_pool_6x6(h_conv3)  # 10
nt_hpool3_flat = tf.reshape(nt_hpool3, [-1, 10])
y_conv = tf.nn.softmax(nt_hpool3_flat)


cross_entropy = -tf.reduce_sum(y * tf.log(y_conv))
global_step = tf.Variable(0, trainable=False)
decaylearning_rate = tf.train.exponential_decay(0.04, global_step, 1000, 0.9)
train_step = tf.train.AdamOptimizer(decaylearning_rate).minimize(cross_entropy, global_step=global_step)
correct_prediction = tf.equal(tf.argmax(y_conv, 1), tf.argmax(y, 1))
accuracy = tf.reduce_mean(tf.cast(correct_prediction, "float"))

sess = tf.Session()
sess.run(tf.global_variables_initializer())
tf.train.start_queue_runners(sess=sess)
for i in range(20000):
    image_batch, label_batch = sess.run([images_train, labels_train])
    label_b = np.eye(10, dtype=float)[label_batch]  # one hot
    train_step.run(feed_dict={x: image_batch, y: label_b, train: 1}, session=sess)
    if i % 200 == 0:
        train_accuracy = accuracy.eval(feed_dict={
            x: image_batch, y: label_b}, session=sess)
        print("step %d, training accuracy %g" % (i, train_accuracy))

image_batch, label_batch = sess.run([images_test, labels_test])
label_b = np.eye(10, dtype=float)[label_batch]  # one hot
print("finished! test accuracy %g" % accuracy.eval(feed_dict={
    x: image_batch, y: label_b}, session=sess))