The Mathematical Essence of Convolutional Networks
2020-02-27
Feed Forward and Backward Run in Deep Convolution Neural Network

Pushparaja Murugan
School of Mechanical and Aerospace Engineering, Nanyang Technological University, Singapore 639815
pushpara001@e.ntu.edu.sg

Abstract

Convolution Neural Networks (CNN), also known as ConvNets, are widely used in visual imagery applications such as object classification and speech recognition. After Krizhevsky's implementation and demonstration of a deep convolution neural network for ImageNet classification in 2012, the deep Convolution Neural Network architecture attracted many researchers. This has led to major developments in deep learning frameworks such as TensorFlow, Caffe, Keras, and Theano. Although implementing deep learning models is quite possible with these frameworks, the underlying mathematical theory and concepts are harder for new learners and practitioners to grasp. This article provides an overview of the ConvNet architecture and explains the mathematical theory behind it, including activation functions, loss functions, and feed-forward and backward propagation. A grey-scale image is taken as the input, the ReLU and Sigmoid activation functions are used to develop the architecture, and the cross-entropy loss function is used to compute the difference between the predicted and actual values. The architecture is developed so that it contains one convolution layer, one pooling layer, and multiple dense layers.

Keywords: Deep learning, ConvNets, Convolution Neural Network, Forward and backward propagation

Nomenclature

$\alpha$ : Learning rate
$\hat{y}_i^{L+1}$ : Predicted value
$L$ : Loss or cost function
$\sigma$ : Activation function
$\Sigma$ : Summation
$a$ : Non-linearly transformed net input
$b$ : Bias parameter
$b^{L+1}$ : Bias matrix of the final layer in the fully connected layers
$b_i^l$ : Bias value of the $i$th neuron at the $l$th layer
$C$ : Channels of the image
$c$ : Depth of the convolution kernel
$D_1$ : Depth of the convolution layer
$D_2$ : Depth of the pooling layer
$D_n$ : Number of pooling layer kernels
$\mathrm{Dim}_c$ : Dimension of the convolution layer
$\mathrm{Dim}_p$ : Dimension of the pooling layer
$e$ : Exponential
$f'(x)$ : First derivative
$f(x)$ : Function
$H$ : Height of the image
$H_1$ : Height of the convolution layer
$H_2$ : Height of the pooling layer
$i, j$ : Adjacent neurons in the fully connected layers
$k$ : Width and height of the pooling layer kernel
$k^{p,q}$ : Convolution kernel bank
$k_1$ : Width of the convolution kernel
$k_2$ : Height of the convolution kernel
$K_D$ : Number of kernels
$L$ : Final layer in the fully connected layers
$l$ : First layer in the fully connected layers
$L+1$ : Classification layer in the fully connected layers
$l-1$ : Vectorized pooling layer
$n$ : Last neuron in the fully connected layers
$p$ : Number of convolution kernels
$P^{p,q}$ : Pooling kernel bank
$q$ : Number of convolution layers
$t$ : Total number of training samples
$u, v$ : Pixels of the kernel
$W$ : Width of the image
$w$ : Weight parameter
$W^l$ : Weight matrix of the first layer in the fully connected layers
$W^{L+1}$ : Weight matrix of the final layer in the fully connected layers
$W_1$ : Width of the convolution layer
$W_2$ : Width of the pooling layer
$w_i^l$ : Weights of the $i$th node at the $l$th layer
$x$ : Input signal
$y$ : Matrix of actual labelled values of the training set
$y^{L+1}$ : Matrix of predicted values
$y_i$ : Actual value from the labelled training set
$z$ : Linearly transformed net inputs of the fully connected layers
$Z_P$ : Value of zero padding
$Z_S$ : Value of stride

1 Introduction

The study of neural networks, human behaviour, and perception started in the early 1950s. Over the following decades, different types of neural networks, such as the Elman, Hopfield, and Jordan networks, were developed in the late 1970s for approximating complex functions and recognizing patterns [1] [2] [3]. Recent developments in neural networks have shown remarkable results in object classification, pattern recognition, and natural language processing. With the advancement of computer vision, deep Convolution Neural Networks are widely used in many applications such as cancer cell classification, medical image processing, star cluster classification, self-driving cars, and number plate recognition.

ConvNets are bio-inspired artificial neural networks built on a mathematical representation to analyse visual imagery and to perform pattern and speech recognition. Unlike classical machine learning methods, ConvNets can be fed raw image pixel values rather than feature vectors as input [4]. The basic design principle of ConvNets is to develop an architecture and learning algorithm in such a way that the number of parameters is reduced without compromising the computational power of the learning algorithm [5]. As the name suggests, a ConvNet consists of the linear mathematical operation of convolution followed by non-linear activations, pooling layers, and a deep neural network classifier. The convolution process acts as an appropriate feature detector that can deal with a large
amount of low-level information. A complete convolution layer has different feature detectors so that multiple features can be extracted from the same image. A single feature detector, smaller in size than the input image, is slid over the image for the convolution operation. Hence, all of the units in that feature detector share the same weights and bias. This helps to detect the same feature at all positions in the image, which gives the properties of invariance to transformation and shift of the images [6]. Local connections between pixels are reused many times in the architecture. With local receptive fields, neurons can extract elementary features such as the orientation of edges, corners, and end points, so that a higher degree of complex features is detected when these are combined in the hidden layers. These properties of sparse connectivity between subsequent layers, parameter sharing of weights between adjacent pixels, and equivariant representation enable CNNs to be used efficiently in image recognition and image classification problems [7] [8].

2 Architecture

Figure 2.1: Architecture of a Convolution Neural Network

2.1 Convolution layers

Convolution layers are sets of parallel feature maps, formed by sliding different kernels (feature detectors) over an input image and projecting the element-wise dot products as the feature maps [9]. This sliding step is known as the stride $Z_S$. The kernel bank is smaller in size than the input image and is overlapped on the input image, which promotes sharing of the weight and bias parameters between adjacent pixels of the image as well as control over the dimensions of the feature maps. Using small kernels, however, often results in imperfect overlays and limits the power of the learning algorithm. Hence, zero padding $Z_P$ is usually implemented to control the size of the input image. Zero padding controls the feature map and kernel dimensions independently by adding zeros to the input symmetrically [10]. During training, a set of kernel filters, known as a filter bank, with dimensions $(k_1, k_2, c)$, slides over the fixed-size $(H, W, C)$ input image. The stride and zero padding are the critical measures that control the dimensions of the convolution layers. The resulting feature maps are stacked together to form the convolution layer. The dimension of the convolution layer can be computed by Eq. 2.1.
$$\mathrm{Dim}_c(H_1, W_1, D_1) = \left(\frac{H + 2Z_P - k_1}{Z_S} + 1,\; \frac{W + 2Z_P - k_2}{Z_S} + 1,\; K_D\right) \qquad \text{(Eq. 2.1)}$$

2.2 Activation functions

An activation function defines the output of a neuron for a given set of inputs. The weighted sum of the linear net input values is passed through the activation function for a non-linear transformation. A typical activation function is based on a conditional probability which returns one or zero as the output $o_p$, i.e. $P(o_p = 1 \mid i_p)$ or $P(o_p = 0 \mid i_p)$. When the net input information $i_p$ crosses the threshold value, the activation function returns one and passes the information to the next layers; if the net input $i_p$ is below the threshold, it returns zero and the information is not passed on. Based on this segregation of relevant and irrelevant information, the activation function decides whether the neuron should activate or not: the higher the net input value, the greater the activation. Different types of activation functions have been developed for different applications; some commonly used ones are given in Table 1.

2.3 Pooling layers

A pooling layer is a downsampling layer which combines the outputs of a cluster of neurons in one layer into a single neuron in the next layer. Pooling is carried out after the non-linear activation, and pooling layers help to reduce the number of data points and to avoid overfitting. They also act as a smoothing process that eliminates unwanted noise. Max pooling is most commonly used; in addition, average pooling and L2-norm pooling are also used in some cases. When $D_n$ kernel windows and a stride value of $Z_S$ are employed to develop the pooling layers, the dimension of the pooling layer can be computed by,

$$\mathrm{Dim}_p(H_2, W_2, D_2) = \left(\frac{H_1 - k}{Z_S} + 1,\; \frac{W_1 - k}{Z_S} + 1,\; D_n\right) \qquad \text{(Eq. 2.2)}$$
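As a minimal Python sketch of Eqs. 2.1 and 2.2 (added for illustration; the image size, kernel size, and kernel count below are arbitrary example values), the layer dimensions can be computed as follows:

```python
def conv_output_dim(H, W, k1, k2, K_D, Z_P=0, Z_S=1):
    """Dimensions (H1, W1, D1) of a convolution layer per Eq. 2.1."""
    H1 = (H + 2 * Z_P - k1) // Z_S + 1
    W1 = (W + 2 * Z_P - k2) // Z_S + 1
    return H1, W1, K_D

def pool_output_dim(H1, W1, k, D_n, Z_S):
    """Dimensions (H2, W2, D2) of a pooling layer per Eq. 2.2."""
    H2 = (H1 - k) // Z_S + 1
    W2 = (W1 - k) // Z_S + 1
    return H2, W2, D_n

# e.g. a 28x28 grey-scale image, six 5x5 kernels, no padding, stride 1,
# followed by 2x2 max pooling with stride 2:
print(conv_output_dim(28, 28, 5, 5, 6))    # (24, 24, 6)
print(pool_output_dim(24, 24, 2, 6, 2))    # (12, 12, 6)
```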
2.4 Fully connected dense layers

After the pooling layers, the pixels of the pooling layers are stretched into a single column vector. These vectorized and concatenated data points are fed into dense layers, known as fully connected layers, for classification. The function of the fully connected dense layers is similar to that of Deep Neural Networks. The architecture of a ConvNet is given in Figure 2.1. This type of constrained architecture proficiently surpasses classical machine learning algorithms in image classification problems [11] [12].

2.5 Loss or cost function

A loss function maps an event of one or more variables onto a real number associated with some cost. The loss function is used to measure the performance of the model and the inconsistency between the actual value $y_i$ and the predicted value $\hat{y}_i^{L+1}$. The performance of the model increases as the value of the loss function decreases.

| Name | Function | Derivative |
| --- | --- | --- |
| Sigmoid | $\sigma(x) = \dfrac{1}{1 + e^{-x}}$ | $f'(x) = f(x)\big(1 - f(x)\big)$ |
| tanh | $\sigma(x) = \dfrac{e^{x} - e^{-x}}{e^{x} + e^{-x}}$ | $f'(x) = 1 - f(x)^2$ |
| ReLU | $f(x) = \begin{cases} 0 & x < 0 \\ x & x \ge 0 \end{cases}$ | $f'(x) = \begin{cases} 0 & x < 0 \\ 1 & x \ge 0 \end{cases}$ |
| Leaky ReLU | $f(x) = \begin{cases} 0.01x & x < 0 \\ x & x \ge 0 \end{cases}$ | $f'(x) = \begin{cases} 0.01 & x < 0 \\ 1 & x \ge 0 \end{cases}$ |
| Softmax | $f(x_i) = \dfrac{e^{x_i}}{\sum_j e^{x_j}}$ | $f'(x_i) = \dfrac{e^{x_i}}{\sum_j e^{x_j}} - \dfrac{(e^{x_i})^2}{\big(\sum_j e^{x_j}\big)^2}$ |

Table 1: Non-linear activation functions
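The functions and derivatives of Table 1 can be written compactly in NumPy; this is a minimal sketch added for illustration and is not code from the original paper:

```python
import numpy as np

def sigmoid(x):          return 1.0 / (1.0 + np.exp(-x))
def sigmoid_prime(x):    s = sigmoid(x); return s * (1.0 - s)

def tanh(x):             return np.tanh(x)
def tanh_prime(x):       return 1.0 - np.tanh(x) ** 2

def relu(x):             return np.where(x < 0, 0.0, x)
def relu_prime(x):       return np.where(x < 0, 0.0, 1.0)

def leaky_relu(x):       return np.where(x < 0, 0.01 * x, x)
def leaky_relu_prime(x): return np.where(x < 0, 0.01, 1.0)

def softmax(x):
    # subtracting the max does not change the result but avoids overflow
    e = np.exp(x - np.max(x))
    return e / e.sum()
```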
If the set of possible outputs is $y_i \in \{0, 1\}$ and an event $x$ has the set of input variables $x = (x_1, x_2, \dots, x_t)$, then the mapping of $x$ to $y_i$ is given by,

$$L(\hat{y}_i^{L+1}, y_i) = \frac{1}{t}\sum_{i=1}^{t} \ell\big(y_i,\ \sigma(x;\, w, b)\big) \qquad \text{(Eq. 2.3)}$$

where $L(\hat{y}_i^{L+1}, y_i)$ is the loss function and $\ell$ denotes the per-sample loss. Many types of loss functions have been developed for various applications; some are given below.

2.5.1 Mean Squared Error

Mean Squared Error, also known as the quadratic loss function, is mostly used in linear regression models to measure performance. If $\hat{y}_i^{L+1}$ is the computed output value of $t$ training samples and $y_i$ is the corresponding labelled value, then the Mean Squared Error (MSE) is given by,

$$L(\hat{y}_i^{L+1}, y_i) = \frac{1}{t}\sum_{i=1}^{t}\big(y_i - \hat{y}_i^{L+1}\big)^2 \qquad \text{(Eq. 2.4)}$$

The downside of MSE is that it tends to suffer from slow learning speed (slow convergence) when it is incorporated with the Sigmoid activation function.

2.5.2 Mean Squared Logarithmic Error

Mean Squared Logarithmic Error (MSLE) is also used to measure the performance of the model:

$$L(\hat{y}_i^{L+1}, y_i) = \frac{1}{t}\sum_{i=1}^{t}\big(\log(y_i + 1) - \log(\hat{y}_i^{L+1} + 1)\big)^2 \qquad \text{(Eq. 2.5)}$$

2.5.3 L2 Loss function

The L2 loss function is the squared L2 norm of the difference between the actual labelled value and the value computed from the net input, and is given by,

$$L(\hat{y}_i^{L+1}, y_i) = \sum_{i=1}^{t}\big(y_i - \hat{y}_i^{L+1}\big)^2 \qquad \text{(Eq. 2.6)}$$

2.5.4 L1 Loss function

The L1 loss function is the sum of the absolute differences between the actual labelled values and the values computed from the net input, and is expressed as,

$$L(\hat{y}_i^{L+1}, y_i) = \sum_{i=1}^{t}\big|y_i - \hat{y}_i^{L+1}\big| \qquad \text{(Eq. 2.7)}$$

2.5.5 Mean Absolute Error

Mean Absolute Error measures the proximity of the predictions to the actual values and is expressed by,

$$L(\hat{y}_i^{L+1}, y_i) = \frac{1}{t}\sum_{i=1}^{t}\big|y_i - \hat{y}_i^{L+1}\big| \qquad \text{(Eq. 2.8)}$$

2.5.6 Mean Absolute Percentage Error

Mean Absolute Percentage Error is given by,

$$L(\hat{y}_i^{L+1}, y_i) = \frac{1}{t}\sum_{i=1}^{t}\left|\frac{y_i - \hat{y}_i^{L+1}}{y_i}\right| \times 100 \qquad \text{(Eq. 2.9)}$$

The major downside of MAPE is its inability to perform when there are zero values.
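As a minimal NumPy sketch of Eqs. 2.4 to 2.9 (added for illustration; `y` and `y_hat` are assumed to be arrays of the $t$ labelled and predicted values), the losses can be written as:

```python
import numpy as np

def mse(y, y_hat):      return np.mean((y - y_hat) ** 2)                           # Eq. 2.4
def msle(y, y_hat):     return np.mean((np.log(y + 1) - np.log(y_hat + 1)) ** 2)   # Eq. 2.5
def l2_loss(y, y_hat):  return np.sum((y - y_hat) ** 2)                            # Eq. 2.6
def l1_loss(y, y_hat):  return np.sum(np.abs(y - y_hat))                           # Eq. 2.7
def mae(y, y_hat):      return np.mean(np.abs(y - y_hat))                          # Eq. 2.8
def mape(y, y_hat):     return np.mean(np.abs((y - y_hat) / y)) * 100              # Eq. 2.9

def cross_entropy(y, y_hat):
    # binary cross-entropy for labels y in {0, 1}; see Eq. 2.14 in Section 2.5.7
    return -np.mean(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))
```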
2.5.7 Cross Entropy

The most commonly used loss function is the cross-entropy loss function, which is explained below. The probability that the output $y_i$ is in the training set label is $P(y_i \mid z^{l-1}) = \hat{y}_i^{L+1} = 1$, and the probability that the output $y_i$ is not in the training set label is $P(y_i \mid z^{l-1}) = \hat{y}_i^{L+1} = 0$ [13]. The expected label is $y$. Hence,

$$P(y_i \mid z^{l-1}) = \big(\hat{y}_i^{L+1}\big)^{y_i}\big(1 - \hat{y}_i^{L+1}\big)^{(1 - y_i)} \qquad \text{(Eq. 2.10)}$$

$$\log P(y_i \mid z^{l-1}) = \log\Big(\big(\hat{y}_i^{L+1}\big)^{y_i}\big(1 - \hat{y}_i^{L+1}\big)^{(1 - y_i)}\Big) \qquad \text{(Eq. 2.11)}$$

$$= y_i \log\big(\hat{y}_i^{L+1}\big) + (1 - y_i)\log\big(1 - \hat{y}_i^{L+1}\big) \qquad \text{(Eq. 2.12)}$$

To minimize the cost function, the negative log-probability is taken,

$$-\log P(y_i \mid z^{l-1}) = -\log\Big(\big(\hat{y}_i^{L+1}\big)^{y_i}\big(1 - \hat{y}_i^{L+1}\big)^{(1 - y_i)}\Big) \qquad \text{(Eq. 2.13)}$$

In the case of $t$ training samples, the cost function is,

$$L(\hat{y}_i^{L+1}, y_i) = -\frac{1}{t}\sum_{i=1}^{t}\Big(y_i \log\big(\hat{y}_i^{L+1}\big) + (1 - y_i)\log\big(1 - \hat{y}_i^{L+1}\big)\Big) \qquad \text{(Eq. 2.14)}$$

3 Learning of ConvNets

3.1 Feed-forward run

The feed-forward run, or propagation, can be described as follows: the input values are multiplied by randomly initiated weights, randomly initiated bias values are added to each connection of every neuron, the products are summed over all neurons, and the net input value is then passed through a non-linear activation function. In a discrete colour space, an image and a kernel can be represented as 3D tensors with dimensions $(H, W, C)$ and $(k_1, k_2, c)$, where $m$, $n$, $c$ denote the $m$th, $n$th pixel in the $c$th channel. The first two indices indicate the spatial coordinates and the last index indicates the colour channel. If a kernel is slid over the colour image, the multidimensional tensor convolution operation can be expressed as,

$$(I \ast K)_{ij} = \sum_{m=1}^{k_1}\sum_{n=1}^{k_2}\sum_{c=1}^{C} K_{m,n,c}\, I_{i+m,\, j+n,\, c} \qquad \text{(Eq. 3.1)}$$

The convolution process is indicated by the symbol $\ast$. For a grey-scale image, the convolution process can be expressed as,

$$(I \ast K)_{ij} = \sum_{m=1}^{k_1}\sum_{n=1}^{k_2} K_{m,n}\, I_{i+m,\, j+n} \qquad \text{(Eq. 3.2)}$$

A kernel bank $k^{p,q}_{u,v}$ is slid over the image $I_{m,n}$ with a stride value of 1 and a zero padding value of 0. The feature maps of the convolution layer $C^{p,q}_{m,n}$ can be computed by,

$$C^{p,q}_{m,n} = \sum_{u}\sum_{v} I_{(m-u,\, n-v)}\cdot k^{p,q}_{u,v} + b^{p,q} \qquad \text{(Eq. 3.3)}$$
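The grey-scale convolution of Eq. 3.3 (stride 1, zero padding 0) can be sketched in NumPy as below; this is an added illustration under the stated assumptions, not code from the paper:

```python
import numpy as np

def conv2d_valid(image, kernel, bias=0.0):
    """Slide a (k1, k2) kernel over a grey-scale (H, W) image with stride 1 and
    zero padding 0, producing an (H - k1 + 1, W - k2 + 1) feature map."""
    H, W = image.shape
    k1, k2 = kernel.shape
    H1, W1 = H - k1 + 1, W - k2 + 1            # Eq. 2.1 with Z_P = 0, Z_S = 1
    fmap = np.zeros((H1, W1))
    for i in range(H1):
        for j in range(W1):
            # element-wise product of the kernel with the image patch, then sum, plus bias
            fmap[i, j] = np.sum(image[i:i + k1, j:j + k2] * kernel) + bias
    return fmap

# the feature map is then passed through the non-linear activation, e.g. ReLU
feature_map = np.maximum(conv2d_valid(np.random.rand(28, 28), np.random.rand(5, 5)), 0.0)
```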
Figure 3.1: Convolution Neural Network

These feature maps are passed through a non-linear activation function $\sigma$,

$$C^{p,q}_{m,n} = \sigma\!\left(\sum_{u}\sum_{v} I_{(m-u,\, n-v)}\cdot k^{p,q}_{u,v} + b^{p,q}\right) \qquad \text{(Eq. 3.4)}$$

where $\sigma$ is the ReLU activation function. The pooling layer $P^{p,q}_{m,n}$ is developed by taking out the maximum-valued pixels $m, n$ in the convolution layer. The pooling layer can be calculated by,

$$P^{p,q}_{m,n} = \max\big(C^{p,q}_{m,n}\big) \qquad \text{(Eq. 3.5)}$$

The pooling layer $P^{p,q}$ is concatenated to form a long vector with length $p \times q$, which is fed into the fully connected dense layers for classification; the vectorized data points $a_i^{l-1}$ in layer $l-1$ are given by,

$$a_i^{l-1} = f\big(P^{p,q}\big) \qquad \text{(Eq. 3.6)}$$
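A small sketch of the max-pooling and vectorization steps of Eqs. 3.5 and 3.6 (added for illustration; the pooling window size and stride below are assumed example values):

```python
import numpy as np

def max_pool(fmap, k=2, stride=2):
    """Max pooling over an (H1, W1) feature map (Eq. 3.5)."""
    H1, W1 = fmap.shape
    H2, W2 = (H1 - k) // stride + 1, (W1 - k) // stride + 1   # Eq. 2.2
    out = np.zeros((H2, W2))
    for i in range(H2):
        for j in range(W2):
            out[i, j] = np.max(fmap[i * stride:i * stride + k,
                                    j * stride:j * stride + k])
    return out

def flatten(pooled_maps):
    """Stretch the stacked pooling maps into one long column vector (Eq. 3.6)."""
    return np.concatenate([p.ravel() for p in pooled_maps]).reshape(-1, 1)
```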
Figure 3.2: Forward run in the fully connected layers

This long vector is fed into the fully connected dense layers from layer $l$ to $L+1$. If the fully connected dense layers are developed with $L$ layers and $n$ neurons, where $l$ is the first layer, $L$ is the last layer, and $L+1$ is the classification layer, as shown in Figure 3.2, then the forward run between the layers is given by,

$$z_1^l = w_{1,1}^l a_1^{l-1} + w_{1,2}^l a_2^{l-1} + \dots + w_{1,j}^l a_j^{l-1} + \dots + b_1^l \qquad \text{(Eq. 3.7)}$$

$$z_2^l = w_{2,1}^l a_1^{l-1} + w_{2,2}^l a_2^{l-1} + \dots + w_{2,j}^l a_j^{l-1} + \dots + b_2^l \qquad \text{(Eq. 3.8)}$$

$$z_i^l = w_{i,1}^l a_1^{l-1} + w_{i,2}^l a_2^{l-1} + \dots + w_{i,j}^l a_j^{l-1} + \dots + b_i^l \qquad \text{(Eq. 3.9)}$$

$$\begin{bmatrix} z_1^l \\ \vdots \\ z_i^l \\ \vdots \end{bmatrix} = \begin{bmatrix} w_{1,1}^l & w_{1,2}^l & w_{1,3}^l & \cdots & w_{1,n}^l \\ \vdots & \vdots & \vdots & & \vdots \\ w_{i,1}^l & w_{i,2}^l & w_{i,3}^l & \cdots & w_{i,n}^l \\ \vdots & \vdots & \vdots & & \vdots \end{bmatrix} \begin{bmatrix} a_1^{l-1} \\ a_2^{l-1} \\ \vdots \end{bmatrix} + \begin{bmatrix} b_1^l \\ \vdots \\ b_i^l \\ \vdots \end{bmatrix} \qquad \text{(Eq. 3.10)}$$

Consider a single neuron $j$ in a fully connected layer at layer $l$, as shown in Fig. 3.3. The input values $a_i^{l-1}$ are multiplied by the weights $w_{i,j}^l$ and the bias value $b_j^l$ is added. The final net input value $z_j^l$ is then passed through a non-linear activation function $\sigma$, and the corresponding output value $a_j^l$ is computed by,

$$z_j^l = w_{1,j}^l a_1^{l-1} + w_{2,j}^l a_2^{l-1} + \dots + w_{i,j}^l a_i^{l-1} + \dots + b_j^l \qquad \text{(Eq. 3.11)}$$

where $z_j^l$ is the input to the activation function for neuron $j$ at layer $l$,

$$z_j^l = \sum_{i=1}^{n} w_{i,j}^l a_i^{l-1} + b_j^l \qquad \text{(Eq. 3.12)}$$

$$a_j^l = \sigma\!\left(\sum_{i=1}^{n} w_{i,j}^l a_i^{l-1} + b_j^l\right) \qquad \text{(Eq. 3.13)}$$

Hence, the output of the $l$th layer is,

$$a^l = \sigma\big((W^l)^T a^{l-1} + b^l\big) \qquad \text{(Eq. 3.15)}$$
Figure 3.3: Forward run in a neuron $j$ at the $l$th layer (inputs $a_i^{l-1}$, weights $w_{i,j}^l$, bias $b_j^l$, net input $z_j^l$, activation $\sigma(z_j^l)$, output $a_j^l$)

Equivalently,

$$a^l = \sigma(z^l) \qquad \text{(Eq. 3.16)}$$

where $a^l$ is,

$$a^l = \begin{bmatrix} a_1^l \\ \vdots \\ a_i^l \\ \vdots \end{bmatrix} = \begin{bmatrix} \sigma(z_1^l) \\ \vdots \\ \sigma(z_i^l) \\ \vdots \end{bmatrix} \qquad \text{(Eq. 3.17)}$$

and $W^l$ is,

$$W^l = \begin{bmatrix} w_{1,j}^l \\ \vdots \\ w_{i,j}^l \\ \vdots \end{bmatrix} \qquad \text{(Eq. 3.18)}$$

In the same manner, the output value of the last layer $L$ is given by,

$$a^L = \sigma\big((W^L)^T a^{L-1} + b^L\big) \qquad \text{(Eq. 3.19)}$$

where,

$$a^L = \sigma(z^L) \qquad \text{(Eq. 3.20)}$$

$$a^L = \begin{bmatrix} a_1^L \\ \vdots \\ a_i^L \\ \vdots \end{bmatrix} = \begin{bmatrix} \sigma(z_1^L) \\ \vdots \\ \sigma(z_i^L) \\ \vdots \end{bmatrix} \qquad \text{(Eq. 3.21)}$$

Extending this to the classification layer, the final predicted value $\hat{y}_i^{L+1}$ of a neuron unit $i$ at layer $L+1$ can be expressed as,

$$\hat{y}_i^{L+1} = \sigma\Big(W^{L}\cdots\sigma\big(W^{2}\,\sigma\big(W^{1}a^{1} + b^{1}\big) + b^{2}\big)\cdots + b^{L}\Big) \qquad \text{(Eq. 3.22)}$$
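As an added sketch of the fully connected forward run of Eqs. 3.15 to 3.22, assuming ReLU in the hidden layers and a sigmoid output as in the rest of the article (the layer sizes and random initialization below are arbitrary example choices); weight matrices are stored so that `W.T @ a` matches Eq. 3.15:

```python
import numpy as np

def dense_forward(a0, weights, biases):
    """Forward run a^l = sigma((W^l)^T a^{l-1} + b^l) through the dense layers.
    ReLU is applied in the hidden layers and a sigmoid in the classification layer."""
    a = a0
    zs, activations = [], [a0]
    for idx, (W, b) in enumerate(zip(weights, biases)):
        z = W.T @ a + b                                  # Eq. 3.12 in matrix form
        a = 1 / (1 + np.exp(-z)) if idx == len(weights) - 1 else np.maximum(z, 0)
        zs.append(z)
        activations.append(a)
    return zs, activations

# e.g. a flattened 12x12x6 pooling output (length 864) fed into an 864-64-10 stack
rng = np.random.default_rng(0)
weights = [rng.standard_normal((864, 64)) * 0.01, rng.standard_normal((64, 10)) * 0.01]
biases  = [np.zeros((64, 1)), np.zeros((10, 1))]
zs, acts = dense_forward(rng.random((864, 1)), weights, biases)
```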
If the predicted value is $\hat{y}_i^{L+1}$ and the actual labelled value is $y_i$, the performance of the model can be computed with the loss function. From Eq. 2.14, the cross-entropy loss function is,

$$L(\hat{y}_i^{L+1}, y_i) = -\frac{1}{t}\sum_{i=1}^{t}\Big(y_i \log\big(\hat{y}_i^{L+1}\big) + (1 - y_i)\log\big(1 - \hat{y}_i^{L+1}\big)\Big) \qquad \text{(Eq. 3.24)}$$

3.2 Backward run

The backward run, also known as backward propagation, refers to the backward propagation of errors, which uses gradient descent to compute the gradient of the loss function with respect to the parameters such as the weights and biases, as shown in Fig. 3.4. During backward propagation, the gradient of the loss function with respect to the parameters of the final layer is computed first, and the gradient of the first layer is computed last. The partial derivatives of one layer are reused in the computation of the partial derivatives of another layer by the chain rule, which leads to efficient computation of the gradient at each layer. This is used to minimize the loss function; the performance of the model increases as the loss function value decreases [14] [15] [16]. In backward propagation, the parameters $W^{L+1}$, $b^{L+1}$, $W^l$, $b^l$, $k^{p,q}$ and $b^{p,q}$ need to be updated in order to minimize the cost function.

Figure 3.4: Back propagation in the fully connected layers
The partial derivative of the loss function of the $i$th neuron at the classification layer $L+1$ with respect to the predicted value $\hat{y}_i^{L+1}$ is,

$$\frac{\partial L(\hat{y}_i^{L+1}, y_i)}{\partial \hat{y}_i^{L+1}} = \frac{1}{t}\sum_{i=1}^{t} \frac{\partial}{\partial \hat{y}_i^{L+1}}\Big(-\big(y_i \log(\hat{y}_i^{L+1}) + (1 - y_i)\log(1 - \hat{y}_i^{L+1})\big)\Big) \qquad \text{(Eq. 3.25)}$$

$$\frac{\partial L(\hat{y}_i^{L+1}, y_i)}{\partial \hat{y}_i^{L+1}} = \frac{1}{t}\sum_{i=1}^{t}\left(\frac{-y_i}{\hat{y}_i^{L+1}} + \frac{1 - y_i}{1 - \hat{y}_i^{L+1}}\right) \qquad \text{(Eq. 3.26)}$$

In the case of a multiclass categorical classification problem, the partial derivative of the loss function of the classification layer $L+1$ is,

$$\begin{bmatrix} \dfrac{\partial L(\hat{y}_1^{L+1}, y_1)}{\partial \hat{y}_1^{L+1}} \\ \dfrac{\partial L(\hat{y}_2^{L+1}, y_2)}{\partial \hat{y}_2^{L+1}} \\ \vdots \\ \dfrac{\partial L(\hat{y}_i^{L+1}, y_i)}{\partial \hat{y}_i^{L+1}} \\ \vdots \end{bmatrix} = \frac{1}{t}\sum_{i=1}^{t}\begin{bmatrix} \dfrac{-y_1}{\hat{y}_1^{L+1}} + \dfrac{1 - y_1}{1 - \hat{y}_1^{L+1}} \\ \dfrac{-y_2}{\hat{y}_2^{L+1}} + \dfrac{1 - y_2}{1 - \hat{y}_2^{L+1}} \\ \vdots \\ \dfrac{-y_i}{\hat{y}_i^{L+1}} + \dfrac{1 - y_i}{1 - \hat{y}_i^{L+1}} \\ \vdots \end{bmatrix} \qquad \text{(Eq. 3.27)}$$

Next, the partial derivative of the cost function with respect to the weight of the $i$th neuron in the final layer $L$ is computed by the chain rule. For convenience, the weight of the classification layer is denoted as $w_{i,i-1}^{L}$.

$$\frac{\partial L(\hat{y}_i^{L+1}, y_i)}{\partial w_{i,i-1}^{L}} = \frac{1}{t}\sum_{i=1}^{t} \frac{\partial L(\hat{y}_i^{L+1}, y_i)}{\partial \hat{y}_i^{L+1}}\, \frac{\partial \hat{y}_i^{L+1}}{\partial w_{i,i-1}^{L}} \qquad \text{(Eq. 3.28)}$$

$$= \frac{1}{t}\sum_{i=1}^{t}\left(\frac{-y_i}{\hat{y}_i^{L+1}} + \frac{1 - y_i}{1 - \hat{y}_i^{L+1}}\right)\frac{\partial \hat{y}_i^{L+1}}{\partial w_{i,i-1}^{L}} \qquad \text{(Eq. 3.29)}$$

$$= \frac{1}{t}\sum_{i=1}^{t}\left(\frac{-y_i}{\hat{y}_i^{L+1}} + \frac{1 - y_i}{1 - \hat{y}_i^{L+1}}\right)\frac{\partial a_i^{L+1}}{\partial w_{i,i-1}^{L}} \qquad \text{(Eq. 3.30)}$$

$$= \frac{1}{t}\sum_{i=1}^{t}\left(\frac{-y_i}{\hat{y}_i^{L+1}} + \frac{1 - y_i}{1 - \hat{y}_i^{L+1}}\right)\frac{\partial \sigma(z_i^{L+1})}{\partial w_{i,i-1}^{L}} \qquad \text{(Eq. 3.31)}$$

$$= \frac{1}{t}\sum_{i=1}^{t}\left(\frac{-y_i}{\hat{y}_i^{L+1}} + \frac{1 - y_i}{1 - \hat{y}_i^{L+1}}\right)\sigma'\big(z_i^{L+1}\big) \qquad \text{(Eq. 3.32)}$$

$$= \frac{1}{t}\sum_{i=1}^{t}\left(\frac{-y_i}{\hat{y}_i^{L+1}} + \frac{1 - y_i}{1 - \hat{y}_i^{L+1}}\right)\sigma'\!\left(\sum_{i} w_{i,i-1}\, a^{L-1} + b^{L}\right) \qquad \text{(Eq. 3.33)}$$
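Eq. 3.26 in NumPy form (a small added sketch; `y` and `y_hat` are assumed to be arrays of the labels and the sigmoid outputs for $t$ samples):

```python
import numpy as np

def dloss_dyhat(y, y_hat):
    """Gradient of the averaged cross-entropy loss w.r.t. the predictions (Eq. 3.26)."""
    t = y.shape[0]
    return (-y / y_hat + (1 - y) / (1 - y_hat)) / t
```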
In the final layer $L$, the sigmoid activation function is used for the non-linear transformation. From Table 1, the sigmoid activation function is written as,

$$\sigma(z_i^{L+1}) = \frac{1}{1 + e^{-z_i^{L+1}}} \qquad \text{(Eq. 3.34)}$$

The derivative of the sigmoid function is expressed as,

$$\frac{\partial \sigma(z_i^{L+1})}{\partial z_i^{L+1}} = \frac{\partial}{\partial z_i^{L+1}}\left(\frac{1}{1 + e^{-z_i^{L+1}}}\right) \qquad \text{(Eq. 3.35)}$$

$$= \sigma\big(z_i^{L+1}\big)\Big(1 - \sigma\big(z_i^{L+1}\big)\Big) \qquad \text{(Eq. 3.36)}$$

Substituting Eq. 3.36 into Eq. 3.33, with $z_i^{L+1} = \sum_{i} w_{i,i-1}\, a^{L-1} + b^{L}$ and $\hat{y}_i^{L+1} = a_i^{L+1} = \sigma(z_i^{L+1})$,

$$\frac{\partial L(\hat{y}_i^{L+1}, y_i)}{\partial w_{i,i-1}^{L}} = \frac{1}{t}\sum_{i=1}^{t}\left(\frac{-y_i}{\hat{y}_i^{L+1}} + \frac{1 - y_i}{1 - \hat{y}_i^{L+1}}\right)\sigma\big(z_i^{L+1}\big)\Big(1 - \sigma\big(z_i^{L+1}\big)\Big) \qquad \text{(Eq. 3.37)}$$

$$\frac{\partial L(\hat{y}_i^{L+1}, y_i)}{\partial w_{i,i-1}^{L}} = \frac{1}{t}\sum_{i=1}^{t}\left(\frac{-y_i}{\sigma(z_i^{L+1})} + \frac{1 - y_i}{1 - \sigma(z_i^{L+1})}\right)\sigma\big(z_i^{L+1}\big)\Big(1 - \sigma\big(z_i^{L+1}\big)\Big) \qquad \text{(Eq. 3.38)}$$

which simplifies to,

$$\frac{\partial L(\hat{y}_i^{L+1}, y_i)}{\partial w_{i,i-1}^{L}} = \frac{1}{t}\sum_{i=1}^{t}\Big(\sigma\big(z_i^{L+1}\big) - y_i\Big) \qquad \text{(Eq. 3.39)}$$

Hence, the partial derivative of the loss function with respect to the weights of every neuron in the $L$th layer is expressed as,

$$\frac{\partial L(\hat{y}^{L+1}, y)}{\partial W^{L}} = \begin{bmatrix} \dfrac{\partial L(\hat{y}_1^{L+1}, y_1)}{\partial w_{1,0}^{L}} \\ \dfrac{\partial L(\hat{y}_2^{L+1}, y_2)}{\partial w_{2,1}^{L}} \\ \vdots \\ \dfrac{\partial L(\hat{y}_i^{L+1}, y_i)}{\partial w_{i,i-1}^{L}} \\ \vdots \end{bmatrix} = \begin{bmatrix} \frac{1}{t}\sum_{i=1}^{t}\big(\sigma(z_1^{L+1}) - y_1\big) \\ \frac{1}{t}\sum_{i=1}^{t}\big(\sigma(z_2^{L+1}) - y_2\big) \\ \vdots \\ \frac{1}{t}\sum_{i=1}^{t}\big(\sigma(z_i^{L+1}) - y_i\big) \\ \vdots \end{bmatrix} \qquad \text{(Eq. 3.40)}$$

The partial derivative of the cost function with respect to the bias $b_i^{L}$ of the $i$th neuron at the $L$th layer is,

$$\frac{\partial L(\hat{y}_i^{L+1}, y_i)}{\partial b_i^{L}} = \frac{1}{t}\sum_{i=1}^{t} \frac{\partial L(\hat{y}_i^{L+1}, y_i)}{\partial \hat{y}_i^{L+1}}\, \frac{\partial \hat{y}_i^{L+1}}{\partial b_i^{L}} \qquad \text{(Eq. 3.41)}$$

$$= \frac{1}{t}\sum_{i=1}^{t}\left(\frac{-y_i}{\hat{y}_i^{L+1}} + \frac{1 - y_i}{1 - \hat{y}_i^{L+1}}\right)\frac{\partial \hat{y}_i^{L+1}}{\partial b_i^{L}} \qquad \text{(Eq. 3.42)}$$

$$\frac{\partial L(\hat{y}_i^{L+1}, y_i)}{\partial b_i^{L}} = \sigma\big(z_i^{L+1}\big) - y_i \qquad \text{(Eq. 3.43)}$$

The partial derivative of the cost function with respect to the bias of every neuron at the $L$th layer is written as,

$$\frac{\partial L(\hat{y}^{L+1}, y)}{\partial b^{L}} = \begin{bmatrix} \dfrac{\partial L(\hat{y}_1^{L+1}, y_1)}{\partial b_1^{L}} \\ \dfrac{\partial L(\hat{y}_2^{L+1}, y_2)}{\partial b_2^{L}} \\ \vdots \\ \dfrac{\partial L(\hat{y}_i^{L+1}, y_i)}{\partial b_i^{L}} \\ \vdots \end{bmatrix} = \begin{bmatrix} \sigma(z_1^{L+1}) - y_1 \\ \sigma(z_2^{L+1}) - y_2 \\ \vdots \\ \sigma(z_i^{L+1}) - y_i \\ \vdots \end{bmatrix} \qquad \text{(Eq. 3.44)}$$
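The simplification in Eqs. 3.39 to 3.44 (sigmoid output combined with cross-entropy gives the error term $\sigma(z) - y$) can be sketched as follows. This is an added illustration: including the input activations $a^{L-1}$ in the weight gradient, as the chain rule requires, is an assumption made here and is not written out explicitly in Eq. 3.40.

```python
import numpy as np

def output_layer_gradients(a_prev, y_hat, y):
    """Gradients of the cross-entropy loss at the sigmoid classification layer.
    a_prev : (n_prev, t) activations feeding the layer
    y_hat  : (n_out, t) sigmoid outputs,  y : (n_out, t) labels."""
    t = y.shape[1]
    delta = y_hat - y                          # sigma(z) - y, Eqs. 3.39 / 3.43
    dW = a_prev @ delta.T / t                  # weight gradient, averaged over samples
    db = delta.sum(axis=1, keepdims=True) / t  # bias gradient, Eq. 3.44 averaged
    return delta, dW, db
```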
In the same way, the partial derivatives of the loss function with respect to all hidden neurons and hidden layers can be calculated. The ReLU non-linear activation function is used in all of the hidden layers from $l-1$ to $L-1$. The partial derivative of the loss function with respect to the weight of the $i$th neuron at the first layer $l$ of the fully connected dense layers is,

$$\frac{\partial L(\hat{y}_i^{L+1}, y_i)}{\partial w_{i,i-1}^{l}} = \frac{\partial L(\hat{y}_i^{L+1}, y_i)}{\partial \hat{y}_i^{L+1}}\, \frac{\partial \hat{y}_i^{L+1}}{\partial w_{i,i-1}^{l}} \qquad \text{(Eq. 3.45)}$$

$$= \frac{1}{t}\sum_{i=1}^{t}\left(\frac{-y_i}{\hat{y}_i^{L+1}} + \frac{1 - y_i}{1 - \hat{y}_i^{L+1}}\right)\frac{\partial \hat{y}_i^{L+1}}{\partial w_{i,i-1}^{l}} \qquad \text{(Eq. 3.46)}$$

$$= \frac{1}{t}\sum_{i=1}^{t}\left(\frac{-y_i}{\hat{y}_i^{L+1}} + \frac{1 - y_i}{1 - \hat{y}_i^{L+1}}\right)\frac{\partial a_i^{L+1}}{\partial w_{i,i-1}^{l}} \qquad \text{(Eq. 3.47)}$$

$$= \frac{1}{t}\sum_{i=1}^{t}\left(\frac{-y_i}{\hat{y}_i^{L+1}} + \frac{1 - y_i}{1 - \hat{y}_i^{L+1}}\right)\frac{\partial \sigma(z_i^{L+1})}{\partial w_{i,i-1}^{l}} \qquad \text{(Eq. 3.48)}$$

$$= \frac{1}{t}\sum_{i=1}^{t}\left(\frac{-y_i}{\hat{y}_i^{L+1}} + \frac{1 - y_i}{1 - \hat{y}_i^{L+1}}\right)\sigma'\big(z_i^{l}\big) \qquad \text{(Eq. 3.49)}$$

$$\frac{\partial L(\hat{y}_i^{L+1}, y_i)}{\partial w_{i,i-1}^{l}} = \frac{1}{t}\sum_{i=1}^{t}\left(\frac{-y_i}{\hat{y}_i^{L+1}} + \frac{1 - y_i}{1 - \hat{y}_i^{L+1}}\right)\sigma'\!\left(\sum_{i} w_{i,i-1}\, a^{l-1} + b^{l}\right) \qquad \text{(Eq. 3.51)}$$
Since the ReLU activation function is used, its derivative, from Table 1, is,

$$\sigma'(z) = \begin{cases} 0 & \text{if } z < 0 \\ 1 & \text{if } z \ge 0 \end{cases} \qquad \text{(Eq. 3.52)}$$

If $z > 0$,

$$\frac{\partial L(\hat{y}_i^{L+1}, y_i)}{\partial w_{i,i-1}^{l}} = \frac{y_i - z_i^{l}}{z_i^{l}\big(1 - z_i^{l}\big)} \qquad \text{(Eq. 3.53)}$$

Hence, the partial derivative of the loss function with respect to the weights of all neurons at the $l$th layer is,

$$\frac{\partial L(\hat{y}^{L+1}, y)}{\partial W^{l}} = \begin{bmatrix} \dfrac{\partial L(\hat{y}_1^{L+1}, y_1)}{\partial w_{1,0}^{l}} \\ \dfrac{\partial L(\hat{y}_2^{L+1}, y_2)}{\partial w_{2,1}^{l}} \\ \vdots \\ \dfrac{\partial L(\hat{y}_i^{L+1}, y_i)}{\partial w_{i,i-1}^{l}} \\ \vdots \end{bmatrix} = \begin{bmatrix} \dfrac{y_1 - z_1^{l}}{z_1^{l}(1 - z_1^{l})} \\ \dfrac{y_2 - z_2^{l}}{z_2^{l}(1 - z_2^{l})} \\ \vdots \\ \dfrac{y_i - z_i^{l}}{z_i^{l}(1 - z_i^{l})} \\ \vdots \end{bmatrix} \qquad \text{(Eq. 3.54)}$$

The partial derivative of the loss function with respect to the bias of the $i$th neuron at the $l$th layer is,

$$\frac{\partial L(\hat{y}_i^{L+1}, y_i)}{\partial b_i^{l}} = \frac{\partial L(\hat{y}_i^{L+1}, y_i)}{\partial \hat{y}_i^{L+1}}\, \frac{\partial \hat{y}_i^{L+1}}{\partial b_i^{l}} \qquad \text{(Eq. 3.55)}$$

$$= \frac{1}{t}\sum_{i=1}^{t}\left(\frac{-y_i}{\hat{y}_i^{L+1}} + \frac{1 - y_i}{1 - \hat{y}_i^{L+1}}\right)\frac{\partial \hat{y}_i^{L+1}}{\partial b_i^{l}} \qquad \text{(Eq. 3.56)}$$

$$\frac{\partial L(\hat{y}_i^{L+1}, y_i)}{\partial b_i^{l}} = \sigma\big(z_i^{l-1}\big) - y_i \qquad \text{(Eq. 3.57)}$$

where $\sigma$ is the ReLU non-linear activation function; hence, if $z_i > 0$,

$$\frac{\partial L(\hat{y}_i^{L+1}, y_i)}{\partial b_i^{l}} = z_i^{l-1} - y_i \qquad \text{(Eq. 3.58)}$$

Hence, the partial derivatives of the loss function with respect to the bias at layer $l$ are,

$$\frac{\partial L(\hat{y}^{L+1}, y)}{\partial b^{l}} = \begin{bmatrix} \dfrac{\partial L(\hat{y}_1^{L+1}, y_1)}{\partial b_1^{l}} \\ \dfrac{\partial L(\hat{y}_2^{L+1}, y_2)}{\partial b_2^{l}} \\ \vdots \\ \dfrac{\partial L(\hat{y}_i^{L+1}, y_i)}{\partial b_i^{l}} \\ \vdots \end{bmatrix} = \begin{bmatrix} z_1^{l-1} - y_1 \\ z_2^{l-1} - y_2 \\ \vdots \\ z_i^{l-1} - y_i \\ \vdots \end{bmatrix} \qquad \text{(Eq. 3.59)}$$
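In code, the backward run through a ReLU dense layer is usually implemented by propagating the error term of the following layer through the weights and the ReLU derivative of Eq. 3.52. The sketch below uses that standard formulation for illustration rather than a literal transcription of Eqs. 3.53 to 3.59, and it assumes the same weight-storage convention as the earlier forward-run sketch:

```python
import numpy as np

def relu_layer_backward(delta_next, W_next, z, a_prev):
    """Backward run through a ReLU dense layer.
    delta_next : error term of the following layer, shape (n_next, t)
    W_next     : weights of the following layer, shape (n, n_next)
    z          : net inputs of this layer, shape (n, t)
    a_prev     : activations feeding this layer, shape (n_prev, t)."""
    t = z.shape[1]
    relu_prime = (z >= 0).astype(float)            # Eq. 3.52
    delta = (W_next @ delta_next) * relu_prime     # chain rule through this layer
    dW = a_prev @ delta.T / t
    db = delta.sum(axis=1, keepdims=True) / t
    return delta, dW, db
```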
In order to perform the learning of ConvNets, it is also necessary to update the kernel bank weights and bias values in the convolution layers as well as in the pooling layers. The partial derivative of the loss function with respect to the input value $a_i^{l-1}$ is,

$$\frac{\partial L(\hat{y}_i^{L+1}, y_i)}{\partial a_i^{l-1}} = \frac{\partial L(\hat{y}_i^{L+1}, y_i)}{\partial \hat{y}_i^{L+1}}\, \frac{\partial \hat{y}_i^{L+1}}{\partial a_i^{l-1}} \qquad \text{(Eq. 3.60)}$$

Using Eq. 3.26,

$$\frac{\partial L(\hat{y}_i^{L+1}, y_i)}{\partial a_i^{l-1}} = \frac{1}{t}\sum_{i=1}^{t}\left(\frac{-y_i}{\hat{y}_i^{L+1}} + \frac{1 - y_i}{1 - \hat{y}_i^{L+1}}\right)\frac{\partial \hat{y}_i^{L+1}}{\partial a_i^{l-1}} \qquad \text{(Eq. 3.61)}$$

$$= \frac{1}{t}\sum_{i=1}^{t}\left(\frac{-y_i}{\hat{y}_i^{L+1}} + \frac{1 - y_i}{1 - \hat{y}_i^{L+1}}\right)\frac{\partial\big(w_{i,i-1}^{l}\, a^{l-1} + b^{l}\big)}{\partial a_i^{l-1}} \qquad \text{(Eq. 3.62)}$$

$$= \frac{1}{t}\sum_{i=1}^{t}\left(\frac{-y_i}{\hat{y}_i^{L+1}} + \frac{1 - y_i}{1 - \hat{y}_i^{L+1}}\right) w_{i,i-1}^{l} \qquad \text{(Eq. 3.63)}$$

For all input values $a^{l-1}$ at the $(l-1)$th layer,

$$\frac{\partial L(\hat{y}^{L+1}, y)}{\partial a^{l-1}} = \frac{1}{t}\sum_{i=1}^{t}\left(\frac{-y_i}{\hat{y}_i^{L+1}} + \frac{1 - y_i}{1 - \hat{y}_i^{L+1}}\right) W^{l} \qquad \text{(Eq. 3.64)}$$

Reshaping the long vector $\frac{\partial L(\hat{y}^{L+1}, y)}{\partial a^{l-1}}$ back into the arrangement of the pooling layer,

$$P^{p,q} = f^{-1}\!\left(\frac{\partial L(\hat{y}^{L+1}, y)}{\partial a^{l-1}}\right) \qquad \text{(Eq. 3.65)}$$

The primary function of the pooling layer is to reduce the number of parameters and to control the overfitting of the model; hence, no learning takes place in the pooling layers. The pooling-layer error is computed by acquiring the single-value winning unit. Since there are no parameters to be updated in the pooling layer, upsampling can be done to obtain $\frac{\partial L(\hat{y}^{L+1}, y)}{\partial C^{p,q}_{m,n}}$:

$$\frac{\partial L(\hat{y}^{L+1}, y)}{\partial C^{p,q}_{m,n}} = P^{p,q} \qquad \text{(Eq. 3.66)}$$
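A sketch of the upsampling step in Eq. 3.66 for max pooling (added for illustration; the window size and stride are assumed to match the forward pass): the gradient arriving at each pooled value is routed back to the winning unit of its window, and all other positions receive zero.

```python
import numpy as np

def max_pool_backward(dP, fmap, k=2, stride=2):
    """Route the pooling-layer gradient dP back to the winning units of the
    convolution feature map `fmap` (Eq. 3.66); all other positions get zero."""
    dC = np.zeros_like(fmap)
    H2, W2 = dP.shape
    for i in range(H2):
        for j in range(W2):
            window = fmap[i * stride:i * stride + k, j * stride:j * stride + k]
            winner = np.unravel_index(np.argmax(window), window.shape)
            dC[i * stride + winner[0], j * stride + winner[1]] += dP[i, j]
    return dC
```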
The partial derivative of the loss function with respect to the convolution kernel $k^{p,q}_{u,v}$ is,

$$\frac{\partial L(\hat{y}_i^{L+1}, y_i)}{\partial k^{p,q}_{u,v}} = \sum_{m}\sum_{n} \frac{\partial L(\hat{y}_i^{L+1}, y_i)}{\partial C^{p,q}_{m,n}}\, \frac{\partial C^{p,q}_{m,n}}{\partial k^{p,q}_{u,v}} \qquad \text{(Eq. 3.67)}$$

$$= \sum_{m}\sum_{n} \frac{\partial L(\hat{y}_i^{L+1}, y_i)}{\partial C^{p,q}_{m,n}}\, \frac{\partial\, \sigma\!\big(\sum_{u}\sum_{v} I_{m-u,\,n-v}\, k^{p,q}_{u,v} + b^{p,q}\big)}{\partial k^{p,q}_{u,v}} \qquad \text{(Eq. 3.68)}$$

$$\frac{\partial L(\hat{y}_i^{L+1}, y_i)}{\partial k^{p,q}_{u,v}} = \sum_{m}\sum_{n} \frac{\partial L(\hat{y}_i^{L+1}, y_i)}{\partial C^{p,q}_{m,n}}\, I_{m-u,\,n-v} \qquad \text{(Eq. 3.69)}$$

The kernel gradient can equivalently be obtained by rotating the image by 180°,

$$\frac{\partial L(\hat{y}_i^{L+1}, y_i)}{\partial k^{p,q}_{u,v}} = \sum_{m}\sum_{n} \mathrm{rot}_{180^{\circ}}\big(I_{m-u,\,n-v}\big)\, \frac{\partial L(\hat{y}_i^{L+1}, y_i)}{\partial C^{p,q}_{m,n}} \qquad \text{(Eq. 3.70)}$$

$$\frac{\partial L(\hat{y}^{L+1}, y)}{\partial k^{p,q}} = \mathrm{rot}_{180^{\circ}}(I) \ast \frac{\partial L(\hat{y}^{L+1}, y)}{\partial C^{p,q}_{m,n}} \qquad \text{(Eq. 3.71)}$$

The partial derivative of the loss function with respect to the bias $b^{p,q}$ of the convolution kernel is,

$$\frac{\partial L(\hat{y}_i^{L+1}, y_i)}{\partial b^{p,q}} = \sum_{m}\sum_{n} \frac{\partial L(\hat{y}_i^{L+1}, y_i)}{\partial C^{p,q}_{m,n}}\, \frac{\partial C^{p,q}_{m,n}}{\partial b^{p,q}} \qquad \text{(Eq. 3.72)}$$

$$= \sum_{m}\sum_{n} \frac{\partial L(\hat{y}_i^{L+1}, y_i)}{\partial C^{p,q}_{m,n}}\, \frac{\partial\, \sigma\!\big(\sum_{u}\sum_{v} I_{m-u,\,n-v}\, k^{p,q}_{u,v} + b^{p,q}\big)}{\partial b^{p,q}} \qquad \text{(Eq. 3.73)}$$

$$\frac{\partial L(\hat{y}_i^{L+1}, y_i)}{\partial b^{p,q}} = \sum_{m}\sum_{n} \frac{\partial L(\hat{y}_i^{L+1}, y_i)}{\partial C^{p,q}_{m,n}} \qquad \text{(Eq. 3.74)}$$

3.3 Parameter updates

In order to minimize the loss function, it is necessary to update the learning parameters at every iteration on the basis of gradient descent. Although various optimization techniques have been developed to increase the learning speed, this article considers only gradient descent optimization. The weight and bias updates of the fully connected dense layer $L+1$ are given by,

$$W^{L+1} = W^{L+1} - \alpha\, \frac{\partial L(\hat{y}^{L+1}, y)}{\partial W^{L}} \qquad \text{(Eq. 3.75)}$$

$$b^{L+1} = b^{L+1} - \alpha\, \frac{\partial L(\hat{y}^{L+1}, y)}{\partial b^{L}} \qquad \text{(Eq. 3.76)}$$

The weight and bias updates of the fully connected dense layer $l$ are given by,

$$W^{l} = W^{l} - \alpha\, \frac{\partial L(\hat{y}^{L+1}, y)}{\partial W^{l}} \qquad \text{(Eq. 3.77)}$$

$$b^{l} = b^{l} - \alpha\, \frac{\partial L(\hat{y}^{L+1}, y)}{\partial b^{l}} \qquad \text{(Eq. 3.78)}$$

The weight and bias updates of the convolution kernel are given by,

$$k^{p,q} = k^{p,q} - \alpha\, \frac{\partial L(\hat{y}^{L+1}, y)}{\partial k^{p,q}_{u,v}} \qquad \text{(Eq. 3.79)}$$

$$b^{p,q} = b^{p,q} - \alpha\, \frac{\partial L(\hat{y}^{L+1}, y)}{\partial b^{p,q}} \qquad \text{(Eq. 3.80)}$$

where $\alpha$ is the learning rate.
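Eqs. 3.75 to 3.80 all apply the same one-line update rule; a minimal sketch (added here for illustration, assuming the parameters and their gradients are kept in parallel lists of NumPy arrays):

```python
def gradient_descent_step(params, grads, alpha=0.01):
    """Apply W <- W - alpha * dL/dW (Eqs. 3.75-3.80) to every parameter.
    Works in place when the entries are NumPy arrays."""
    for p, g in zip(params, grads):
        p -= alpha * g
    return params

# e.g. one iteration over the dense-layer parameters from the earlier sketches:
# gradient_descent_step(weights + biases, weight_grads + bias_grads, alpha=0.01)
```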
4 Conclusion

In this article, an overview of the Convolution Neural Network architecture is given, including various activation functions and loss functions. The step-by-step procedure of feed-forward and backward propagation is explained in detail. For mathematical simplicity, a grey-scale image is taken as the input, the kernel stride value is taken as 1, the zero padding value is taken as 0, and the non-linear transformations of the intermediate and final layers are carried out with the ReLU and sigmoid activation functions, respectively. The cross-entropy loss function is used as the performance measure of the model. Although there are numerous optimization and regularization procedures to minimize the loss function, to increase the learning rate, and to avoid overfitting of the model, this article considers only the formulation of a typical Convolution Neural Network architecture with gradient descent optimization.

References

[1] D. O. Hebb, The Organization of Behavior: A Neuropsychological Theory. Psychology Press, 2005.

[2] J. J. Hopfield, "Neural networks and physical systems with emergent collective computational abilities," Proceedings of the National Academy of Sciences, vol. 79, no. 8, pp. 2554–2558, 1982.

[3] H. D. Simon, "Partitioning of unstructured problems for parallel processing," Computing Systems in Engineering, vol. 2, no. 2–3, pp. 135–148, 1991.

[4] Y. LeCun, Y. Bengio, et al., "Convolutional networks for images, speech, and time series," The Handbook of Brain Theory and Neural Networks, vol. 3361, no. 10, 1995.

[5] Y. LeCun et al., "Generalization and network design strategies," Connectionism in Perspective, pp. 143–155, 1989.

[6] Y. LeCun, P. Haffner, L. Bottou, and Y. Bengio, "Object recognition with gradient-based learning," Shape, Contour and Grouping in Computer Vision, pp. 823–823, 1999.

[7] Y. LeCun, Y. Bengio, and G. Hinton, "Deep learning," Nature, vol. 521, no. 7553, pp. 436–444, 2015.

[8] C. M. Bishop, Neural Networks for Pattern Recognition. Oxford University Press, 1995.

[9] A. Krizhevsky, I. Sutskever, and G. E. Hinton, "ImageNet classification with deep convolutional neural networks," in Advances in Neural Information Processing Systems, pp. 1097–1105, 2012.

[10] I. Goodfellow, Y. Bengio, and A. Courville, Deep Learning. MIT Press, 2016.

[11] D. C. Ciresan, U. Meier, J. Masci, L. M. Gambardella, and J. Schmidhuber, "Flexible, high performance convolutional neural networks for image classification," in IJCAI Proceedings-International Joint Conference on Artificial Intelligence, vol. 22, p. 1237, Barcelona, Spain, 2011.

[12] J. Schmidhuber, "Deep learning in neural networks: An overview," Neural Networks, vol. 61, pp. 85–117, 2015.

[13] P.-T. De Boer, D. P. Kroese, S. Mannor, and R. Y. Rubinstein, "A tutorial on the cross-entropy method," Annals of Operations Research, vol. 134, no. 1, pp. 19–67, 2005.
[14] D. E. Rumelhart, G. E. Hinton, R. J. Williams, et al., "Learning representations by back-propagating errors," Cognitive Modeling, vol. 5, no. 3, p. 1, 1988.

[15] F. J. Pineda, "Generalization of back propagation to recurrent and higher order neural networks," in Neural Information Processing Systems, pp. 602–611, 1988.

[16] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, "Gradient-based learning applied to document recognition," Proceedings of the IEEE, vol. 86, no. 11, pp. 2278–2324, 1998.