Batch Normalization
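Batch Normalization (BN) is similar to ordinary data standardization: it brings scattered data onto a common scale, and it is also a technique for making neural networks easier to optimize.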
Introduction to BN
Consider a minimal network in which every layer has a single node and no bias. With three layers, its output can be written as:
Z = x * w1 * w2 * w3
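Suppose two such networks have learned two sets of weights, (w1: 1, w2: 1, w3: 1) and (w1: 0.01, w2: 10000, w3: 0.01). For the same input they produce the same output Z.
- Backpropagation: assume the input x is 1 and the error δy propagated back from the output is 1. The corrections for the two sets of weights are then (δw1: 1, δw2: 1, δw3: 1) and (δw1: 100, δw2: 0.0001, δw3: 100).
- Weight update: adding these corrections gives the updated weights (w1: 2, w2: 2, w3: 2) and (w1: 100.01, w2: 10000.0001, w3: 100.01).
- Second forward pass: with an input of 1, the first network outputs Z = 1 × 2 × 2 × 2 = 8, while the second outputs Z = 1 × 100.01 × 10000.0001 × 100.01 ≈ 100000000.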
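The arithmetic in this walkthrough can be checked with a few lines of plain Python. This is only a sketch of the toy example: the helper names `forward` and `corrections` are made up for illustration, the input is assumed to be 1, and the additive update (w + δw) simply mirrors the walkthrough rather than a real optimizer step.

```python
# Toy 3-layer network with one weight per layer and no bias: z = x * w1 * w2 * w3.

def forward(x, w):
    """Forward pass: multiply the input by every weight in turn."""
    z = x
    for wi in w:
        z *= wi
    return z

def corrections(x, w, delta_y=1.0):
    """Partial derivative of z with respect to each weight, scaled by delta_y."""
    w1, w2, w3 = w
    return [delta_y * x * w2 * w3,
            delta_y * x * w1 * w3,
            delta_y * x * w1 * w2]

x = 1.0                          # assumed input sample
net_a = [1.0, 1.0, 1.0]          # first set of weights
net_b = [0.01, 10000.0, 0.01]    # second set of weights, same initial output

for name, w in (("A", net_a), ("B", net_b)):
    before = forward(x, w)
    dw = corrections(x, w)
    updated = [wi + di for wi, di in zip(w, dw)]   # additive update, as in the text
    after = forward(x, updated)
    print(name, "before:", before, "corrections:", dw, "after:", after)
```

Both networks start from the same output, yet a single update sends the second one to roughly 10^8, which is the divergence the text describes.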
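The outputs of the two networks now differ enormously. If training continues, the loss grows even larger until the network can no longer compute with it. This phenomenon is called gradient explosion. Its cause is the network's Internal Covariate Shift: during the forward pass, the parameters of the different layers change the data distribution that the backward pass relies on.
This is why batch normalization is introduced. It tries to keep the output of every forward pass on the same distribution, so that the distribution seen by the backward pass matches the one seen by the forward pass. Only when the distributions agree are the weight adjustments meaningful.
Batch normalization normalizes the output of each layer to zero mean and unit variance (a standard Gaussian in the ideal case). This preserves the shape of the sample distribution while removing the differences in distribution between layers.
In practice, networks with batch normalization converge very quickly and generalize well. In some cases it can completely replace regularization and Dropout.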
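Definition of BN
The BN implementation in TensorFlow:
tf.nn.batch_normalization(
    x,                 # the input
    mean,              # mean of the samples
    variance,          # variance of the samples
    offset,            # shift added after normalization; an activation function follows, so 0 is used here
    scale,             # factor multiplied after normalization; for the same reason, usually 1
    variance_epsilon,  # small constant added to the denominator to avoid division by zero, e.g. 0.001
    name=None
)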
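To make the role of each argument concrete, here is a plain-Python stand-in for the formula scale * (x - mean) / sqrt(variance + variance_epsilon) + offset, applied to a single feature column. It is a sketch for illustration, not TensorFlow's implementation, and the sample values are invented.

```python
import math

def batch_normalization(x, mean, variance, offset, scale, variance_epsilon):
    """Element-wise normalization of one feature column (illustrative stand-in)."""
    inv = scale / math.sqrt(variance + variance_epsilon)
    return [(v - mean) * inv + offset for v in x]

batch = [2.0, 4.0, 6.0, 8.0]                       # made-up activations of one feature
mean = sum(batch) / len(batch)
# Population (biased) variance, which is what tf.nn.moments returns.
variance = sum((v - mean) ** 2 for v in batch) / len(batch)

normed = batch_normalization(batch, mean, variance,
                             offset=0.0, scale=1.0, variance_epsilon=1e-3)
print(normed)
```

With offset 0 and scale 1 the result has zero mean and a variance just below 1; the small gap comes from variance_epsilon.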
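To use this function you also need tf.nn.moments, which computes the mean and variance that are then passed to BN.
tf.nn.moments(x, axes, name=None, keep_dims=False)
# axes specifies the axes over which the mean and variance are computed
To get per-feature statistics, the reduction normally keeps only the last dimension. For a tensor x this is axes = list(range(len(x.get_shape()) - 1)). For example, with shape [128, 3, 3, 12] the axes are [0, 1, 2], and the resulting mean and variance have shape [12].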
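The shape bookkeeping can be reproduced with NumPy, used here only as a stand-in for tf.nn.moments (NumPy's `mean` and `var` also compute the population statistics). The random input is an assumption for illustration.

```python
import numpy as np

# Fake activations shaped [batch, height, width, channels].
x = np.random.RandomState(0).randn(128, 3, 3, 12)

# Reduce over every axis except the last one, as in the text.
axes = tuple(range(x.ndim - 1))
mean = x.mean(axis=axes)
variance = x.var(axis=axes)

print(axes, mean.shape, variance.shape)
```

One mean and one variance remain per channel, which is what BN needs for convolutional feature maps.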
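We want to smooth the mean and variance across batches with an exponential moving average, which is what tf.train.ExponentialMovingAverage provides. It lets the previous value influence the current one after decay, so the sequence of values is smoother.
shadow_variable = decay * shadow_variable + (1 - decay) * variable
- decay: the decay rate, specified in ExponentialMovingAverage, e.g. 0.9
- variable: the value from the current batch
- shadow_variable on the right-hand side: the running value from the previous step
- shadow_variable on the left-hand side: the newly computed running value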
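The update rule is easy to trace by hand. The sketch below applies it to a made-up sequence of per-batch means; the helper name `ema_update` and the choice to start the running value at the first observation are assumptions for illustration, not the behaviour of tf.train.ExponentialMovingAverage.

```python
def ema_update(shadow, value, decay=0.9):
    """One step of shadow = decay * shadow + (1 - decay) * value."""
    return decay * shadow + (1.0 - decay) * value

batch_means = [1.0, 3.0, 2.0, 10.0, 2.5]   # invented per-batch means, one outlier
shadow = batch_means[0]                    # assumed starting point
history = []
for m in batch_means:
    shadow = ema_update(shadow, m)
    history.append(shadow)

print(history)
```

The outlier of 10.0 moves the running value only a little, which is the smoothing effect described above.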
# Written against the TensorFlow 1.x API (tf.placeholder, tf.Session)
import tensorflow as tf
import numpy as np
import matplotlib.pyplot as plt
N_LAYERS = 7  # 7 hidden layers in total
N_HIDDEN_UNITS = 30  # 30 neurons per hidden layer
def fix_seed(seed=1):
    np.random.seed(seed)
    tf.set_random_seed(seed)
def plot_his(inputs, inputs_norm):
    # plot histogram for the inputs of every layer
    for j, all_inputs in enumerate([inputs, inputs_norm]):
        for i, input in enumerate(all_inputs):
            plt.subplot(2, len(all_inputs), j * len(all_inputs) + (i + 1))
            plt.cla()
            if i == 0:
                the_range = (-7, 10)
            else:
                the_range = (-1, 1)
            plt.hist(input.ravel(), bins=15, range=the_range, color='#FF5733')
            plt.yticks(())
            if j == 1:
                plt.xticks(the_range)
            else:
                plt.xticks(())
            ax = plt.gca()
            ax.spines['right'].set_color('none')
            ax.spines['top'].set_color('none')
        plt.title("%s normalizing" % ("Without" if j == 0 else "With"))
    plt.draw()
    plt.pause(0.01)
def build_net(xs, ys, norm):
    def add_layer(inputs,
                  in_size,
                  out_size,
                  activation_function=None,
                  norm=False):
        # weights and biases (bad initialization for this case)
        Weights = tf.Variable(
            tf.random_normal([in_size, out_size], mean=0., stddev=1.))
        biases = tf.Variable(tf.zeros([1, out_size]) + 0.1)
        # fully connected product
        Wx_plus_b = tf.matmul(inputs, Weights) + biases
        # normalize fully connected product
        if norm:
            # Batch Normalize
            fc_mean, fc_var = tf.nn.moments(
                Wx_plus_b,
                # the dimension to normalize over, here [0] for the batch axis;
                # for images use [0, 1, 2] (batch, height, width) but not channel
                axes=[0],
            )
            scale = tf.Variable(tf.ones([out_size]))
            shift = tf.Variable(tf.zeros([out_size]))
            epsilon = 0.001
            # apply moving average for mean and var when training on a batch
            ema = tf.train.ExponentialMovingAverage(decay=0.5)
            def mean_var_with_update():
                ema_apply_op = ema.apply([fc_mean, fc_var])
                with tf.control_dependencies([ema_apply_op]):
                    return tf.identity(fc_mean), tf.identity(fc_var)
            mean, var = mean_var_with_update()
            Wx_plus_b = tf.nn.batch_normalization(Wx_plus_b, mean, var, shift,
                                                  scale, epsilon)
            # equivalent to these two steps:
            # Wx_plus_b = (Wx_plus_b - fc_mean) / tf.sqrt(fc_var + 0.001)
            # Wx_plus_b = Wx_plus_b * scale + shift
        # activation
        if activation_function is None:
            outputs = Wx_plus_b
        else:
            outputs = activation_function(Wx_plus_b)
        return outputs
    fix_seed(1)
    if norm:
        # BN for the first input
        fc_mean, fc_var = tf.nn.moments(
            xs,
            axes=[0],
        )
        scale = tf.Variable(tf.ones([1]))
        shift = tf.Variable(tf.zeros([1]))
        epsilon = 0.001
        # apply moving average for mean and var when training on a batch
        ema = tf.train.ExponentialMovingAverage(decay=0.5)
        def mean_var_with_update():
            ema_apply_op = ema.apply([fc_mean, fc_var])
            with tf.control_dependencies([ema_apply_op]):
                return tf.identity(fc_mean), tf.identity(fc_var)
        mean, var = mean_var_with_update()
        xs = tf.nn.batch_normalization(xs, mean, var, shift, scale, epsilon)
    # record inputs for every layer
    layers_inputs = [xs]
    # build hidden layers
    for l_n in range(N_LAYERS):
        layer_input = layers_inputs[l_n]
        in_size = layers_inputs[l_n].get_shape()[1].value
        output = add_layer(
            layer_input,  # input
            in_size,  # input size
            N_HIDDEN_UNITS,  # output size
            tf.nn.relu,  # activation function
            norm,  # normalize before activation
        )
        layers_inputs.append(output)  # add output for next run
    # build output layer
    prediction = add_layer(layers_inputs[-1], 30, 1, activation_function=None)
    cost = tf.reduce_mean(
        tf.reduce_sum(tf.square(ys - prediction), reduction_indices=[1]))
    train_op = tf.train.GradientDescentOptimizer(0.001).minimize(cost)
    return [train_op, cost, layers_inputs]
# create the data
fix_seed(1)
x_data = np.linspace(-7, 10, 2500)[:, np.newaxis]
np.random.shuffle(x_data)
noise = np.random.normal(0, 8, x_data.shape)
y_data = np.square(x_data) - 5 + noise
plt.scatter(x_data, y_data)
plt.show()
xs = tf.placeholder(tf.float32, [None, 1])  # [num_samples, num_features]
ys = tf.placeholder(tf.float32, [None, 1])
train_op, cost, layers_inputs = build_net(xs, ys, norm=False)  # without BN
train_op_norm, cost_norm, layers_inputs_norm = build_net(
    xs, ys, norm=True)  # with BN
with tf.Session() as sess:
    init = tf.global_variables_initializer()
    sess.run(init)
    cost_his = []
    cost_his_norm = []
    record_step = 5
    plt.ion()
    plt.figure(figsize=(7, 3))
    for i in range(250):
        if i % 50 == 0:
            # plot histogram
            all_inputs, all_inputs_norm = sess.run(
                [layers_inputs, layers_inputs_norm],
                feed_dict={
                    xs: x_data,
                    ys: y_data
                })
            plot_his(all_inputs, all_inputs_norm)
        # train on batch
        sess.run([train_op, train_op_norm],
                 feed_dict={
                     xs: x_data[i * 10:i * 10 + 10],
                     ys: y_data[i * 10:i * 10 + 10]
                 })
        if i % record_step == 0:
            # record the cost on the full data set
            cost_his.append(sess.run(cost, feed_dict={xs: x_data, ys: y_data}))
            cost_his_norm.append(
                sess.run(cost_norm, feed_dict={
                    xs: x_data,
                    ys: y_data
                }))
plt.ioff()
plt.figure()
plt.plot(
    np.arange(len(cost_his)) * record_step, np.array(cost_his),
    label='no BN')  # no norm
plt.plot(
    np.arange(len(cost_his)) * record_step,
    np.array(cost_his_norm),
    label='BN')  # norm
plt.legend()
plt.show()
Simple usage of BN
Although the functions above take few parameters, several of them have to be combined. TensorFlow's layers module therefore provides another BN implementation that bundles these functions together.
# The following import is required (tf.contrib exists only in TensorFlow 1.x)
from tensorflow.contrib.layers.python.layers import batch_norm
# Function signature
def batch_norm(
    inputs,
    decay=0.999,
    center=True,
    scale=False,
    epsilon=0.001,
    activation_fn=None,
    param_initializers=None,
    param_regularizers=None,
    updates_collections=ops.GraphKeys.UPDATE_OPS,
    is_training=True,
    reuse=None,
    variables_collections=None,
    outputs_collections=None,
    trainable=True,
    batch_weights=None,
    fused=False,
    data_format=DATA_FORMAT_NHWC,
    zero_debias_moving_mean=False,
    scope=None,
    renorm_clipping=None,
    renorm_decay=0.99
):
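- inputs: the input tensor.
- decay: the decay rate of the moving average. The mean and variance are updated with an exponential moving average; 0.9 is a common choice. If the value is too small, the mean and variance change too quickly. If it is too large, they barely change at all, which tends to cause overfitting, and the value should then be lowered.
- scale: if True, the output is multiplied by gamma; if False, gamma is not used. When the next layer is linear (this also applies to relu, for example), scaling can be disabled because the next layer can take care of it.
- epsilon: a small value added to the denominator to avoid division by zero. The default is usually fine.
- is_training: when True, the model is training and the running mean and variance of the samples are updated continuously. For testing, set it to False so that the mean and variance collected during training are used.
- updates_collections: defaults to tf.GraphKeys.UPDATE_OPS, a built-in mechanism in which the ops that update the mean and variance are collected in the graph under tf.GraphKeys.UPDATE_OPS and have to be run together with the training op. With this default the statistics are only updated after the current batch has been processed, so the current data always sees the statistics from the previous step. It is therefore usually set to None, which forces the mean and variance to be updated in place. This is slightly slower than the default but helps training considerably.
- reuse: enables variable sharing; used together with scope.
- scope: the variable_scope for the variables.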
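The interplay of decay and is_training can be illustrated without TensorFlow. The class below is a hypothetical single-feature stand-in (the name `TinyBatchNorm` and the data are invented, and beta and gamma are left out): in training mode it normalizes with the batch statistics and updates the moving averages in place, and in inference mode it reuses the stored averages.

```python
import math

class TinyBatchNorm:
    """Single-feature batch norm with moving statistics (illustrative only)."""

    def __init__(self, decay=0.9, epsilon=0.001):
        self.decay = decay
        self.epsilon = epsilon
        self.moving_mean = 0.0   # assumed initial values: zero mean, unit variance
        self.moving_var = 1.0

    def __call__(self, batch, is_training):
        if is_training:
            # Use the statistics of the current batch and update the averages.
            mean = sum(batch) / len(batch)
            var = sum((v - mean) ** 2 for v in batch) / len(batch)
            self.moving_mean = self.decay * self.moving_mean + (1 - self.decay) * mean
            self.moving_var = self.decay * self.moving_var + (1 - self.decay) * var
        else:
            # Reuse what was collected during training; nothing is updated.
            mean, var = self.moving_mean, self.moving_var
        return [(v - mean) / math.sqrt(var + self.epsilon) for v in batch]

bn = TinyBatchNorm(decay=0.9)
for _ in range(200):
    bn([4.0, 6.0, 8.0], is_training=True)      # same made-up batch every step

test_out = bn([6.0], is_training=False)
print(bn.moving_mean, bn.moving_var, test_out)
```

A single test sample can still be normalized sensibly in inference mode because the statistics come from training rather than from the sample itself.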
Adding BN to a CIFAR image classification model
Adding the BN function
Define the BN helper after the pooling helper:
def avg_pool_6x6(x):
    return tf.nn.avg_pool(x, ksize=[1, 6, 6, 1],
                          strides=[1, 6, 6, 1], padding='SAME')
def batch_norm_layer(value, train=None, name='batch_norm'):
    # `train is not None` is evaluated in Python when the graph is built,
    # so passing any placeholder here selects the training branch.
    if train is not None:
        return batch_norm(value, decay=0.9, updates_collections=None, is_training=True)
    else:
        return batch_norm(value, decay=0.9, updates_collections=None, is_training=False)
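Adding a placeholder argument for the BN function
BN needs to know whether the model is in training mode, so a placeholder named train is defined to pass the training state in.
x = tf.placeholder(tf.float32, [None, 24, 24, 3])  # CIFAR input images have shape 24x24x3
y = tf.placeholder(tf.float32, [None, 10])  # 10 classes
train = tf.placeholder(tf.float32)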
Modifying the network structure to add BN layers
In the first layer h_conv1 and the second layer h_conv2, insert a BN layer after the convolution and before the activation.
h_conv1 = tf.nn.relu(batch_norm_layer((conv2d(x_image, W_conv1) + b_conv1), train))
h_pool1 = max_pool_2x2(h_conv1)
W_conv2 = weight_variable([5, 5, 64, 64])
b_conv2 = bias_variable([64])
h_conv2 = tf.nn.relu(batch_norm_layer((conv2d(h_pool1, W_conv2) + b_conv2), train))
h_pool2 = max_pool_2x2(h_conv2)
Adding a decaying learning rate
Replace the original learning rate with a decaying one: start at 0.04 and multiply it by 0.9 every 1000 steps.
cross_entropy = -tf.reduce_sum(y * tf.log(y_conv))
global_step = tf.Variable(0, trainable=False)
decaylearning_rate = tf.train.exponential_decay(0.04, global_step, 1000, 0.9)
train_step = tf.train.AdamOptimizer(decaylearning_rate).minimize(cross_entropy, global_step=global_step)
correct_prediction = tf.equal(tf.argmax(y_conv, 1), tf.argmax(y, 1))
accuracy = tf.reduce_mean(tf.cast(correct_prediction, "float"))
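For a feel of how quickly this schedule shrinks the learning rate, the formula documented for tf.train.exponential_decay, learning_rate * decay_rate ^ (global_step / decay_steps), can be evaluated in plain Python. The function below is a re-implementation for illustration only; the printed steps are arbitrary.

```python
def exponential_decay(initial_rate, global_step, decay_steps, decay_rate,
                      staircase=False):
    """Plain-Python version of the exponential decay schedule."""
    if staircase:
        exponent = global_step // decay_steps   # decay in discrete jumps
    else:
        exponent = global_step / decay_steps    # decay continuously
    return initial_rate * decay_rate ** exponent

# Same settings as the snippet above: 0.04 initial rate, x0.9 every 1000 steps.
for step in (0, 500, 1000, 5000, 20000):
    print(step, exponential_decay(0.04, step, 1000, 0.9))
```

By the end of the 20000 training steps used later, the rate has fallen to roughly an eighth of its initial value.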
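In the running session
In the session, find the training loop and feed the value 1 to the placeholder train to indicate training mode. Nothing else changes, because the BN function from the first step defaults train to None, which means test mode. Note that the check train is not None runs in Python while the graph is built, so a layer constructed with the train placeholder stays in training mode regardless of the value fed.
for i in range(20000):
    image_batch, label_batch = sess.run([images_train, labels_train])
    label_b = np.eye(10, dtype=float)[label_batch]  # one hot
    train_step.run(feed_dict={x: image_batch, y: label_b, train: 1}, session=sess)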
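The one-hot conversion in the loop relies on indexing an identity matrix with the label array. A small standalone example, with invented labels, shows what it produces:

```python
import numpy as np

label_batch = np.array([3, 0, 9])                 # made-up class indices
label_b = np.eye(10, dtype=float)[label_batch]    # row k of the identity is the one-hot code of k

print(label_b.shape)
print(label_b.argmax(axis=1))
```

Each row contains a single 1 at the position of its class index, matching the [None, 10] shape of the placeholder y.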
The complete code is as follows:
import cifar10_input
import tensorflow as tf
import numpy as np
from tensorflow.contrib.layers.python.layers import batch_norm
batch_size = 128
data_dir = '/tmp/cifar10_data/cifar-10-batches-bin'
print("begin")
images_train, labels_train = cifar10_input.inputs(eval_data=False, data_dir=data_dir, batch_size=batch_size)
images_test, labels_test = cifar10_input.inputs(eval_data=True, data_dir=data_dir, batch_size=batch_size)
print("begin data")
def weight_variable(shape):
    initial = tf.truncated_normal(shape, stddev=0.1)
    return tf.Variable(initial)
def bias_variable(shape):
    initial = tf.constant(0.1, shape=shape)
    return tf.Variable(initial)
def conv2d(x, W):
    return tf.nn.conv2d(x, W, strides=[1, 1, 1, 1], padding='SAME')
def max_pool_2x2(x):
    return tf.nn.max_pool(x, ksize=[1, 2, 2, 1],
                          strides=[1, 2, 2, 1], padding='SAME')
def avg_pool_6x6(x):
    return tf.nn.avg_pool(x, ksize=[1, 6, 6, 1],
                          strides=[1, 6, 6, 1], padding='SAME')
def batch_norm_layer(value, train=None, name='batch_norm'):
    # `train is not None` is evaluated in Python when the graph is built,
    # so passing any placeholder here selects the training branch.
    if train is not None:
        return batch_norm(value, decay=0.9, updates_collections=None, is_training=True)
    else:
        return batch_norm(value, decay=0.9, updates_collections=None, is_training=False)
# tf Graph Input
x = tf.placeholder(tf.float32, [None, 24, 24, 3])  # cifar data image of shape 24*24*3
y = tf.placeholder(tf.float32, [None, 10])  # 10 classes
train = tf.placeholder(tf.float32)
W_conv1 = weight_variable([5, 5, 3, 64])
b_conv1 = bias_variable([64])
x_image = tf.reshape(x, [-1, 24, 24, 3])
h_conv1 = tf.nn.relu(batch_norm_layer((conv2d(x_image, W_conv1) + b_conv1), train))
h_pool1 = max_pool_2x2(h_conv1)
W_conv2 = weight_variable([5, 5, 64, 64])
b_conv2 = bias_variable([64])
h_conv2 = tf.nn.relu(batch_norm_layer((conv2d(h_pool1, W_conv2) + b_conv2), train))
h_pool2 = max_pool_2x2(h_conv2)
W_conv3 = weight_variable([5, 5, 64, 10])
b_conv3 = bias_variable([10])
h_conv3 = tf.nn.relu(conv2d(h_pool2, W_conv3) + b_conv3)
nt_hpool3 = avg_pool_6x6(h_conv3)  # 10 channels
nt_hpool3_flat = tf.reshape(nt_hpool3, [-1, 10])
y_conv = tf.nn.softmax(nt_hpool3_flat)
cross_entropy = -tf.reduce_sum(y * tf.log(y_conv))
global_step = tf.Variable(0, trainable=False)
decaylearning_rate = tf.train.exponential_decay(0.04, global_step, 1000, 0.9)
train_step = tf.train.AdamOptimizer(decaylearning_rate).minimize(cross_entropy, global_step=global_step)
correct_prediction = tf.equal(tf.argmax(y_conv, 1), tf.argmax(y, 1))
accuracy = tf.reduce_mean(tf.cast(correct_prediction, "float"))
sess = tf.Session()
sess.run(tf.global_variables_initializer())
tf.train.start_queue_runners(sess=sess)
for i in range(20000):
    image_batch, label_batch = sess.run([images_train, labels_train])
    label_b = np.eye(10, dtype=float)[label_batch]  # one hot
    train_step.run(feed_dict={x: image_batch, y: label_b, train: 1}, session=sess)
    if i % 200 == 0:
        train_accuracy = accuracy.eval(feed_dict={
            x: image_batch, y: label_b}, session=sess)
        print("step %d, training accuracy %g" % (i, train_accuracy))
image_batch, label_batch = sess.run([images_test, labels_test])
label_b = np.eye(10, dtype=float)[label_batch]  # one hot
print("finished! test accuracy %g" % accuracy.eval(feed_dict={
    x: image_batch, y: label_b}, session=sess))