Deep Quaternion Networks


Chase J. Gaudet
School of Computing & Informatics
University of Louisiana at Lafayette
Lafayette, USA
cjg7182@louisiana.edu

Anthony S. Maida
School of Computing & Informatics
University of Louisiana at Lafayette
Lafayette, USA
maida@louisiana.edu

arXiv:1712.04604v2 [cs.NE] 30 Jan 2018

Abstract—The field of deep learning has seen significant advancement in recent years. However, much of the existing work has focused on real-valued numbers. Recent work has shown that a deep learning system using complex numbers can be deeper for a fixed parameter budget than its real-valued counterpart. In this work, we explore the benefits of generalizing one step further into the hyper-complex numbers, quaternions specifically, and provide the architecture components needed to build deep quaternion networks. We go over quaternion convolutions, present a quaternion weight initialization scheme, and present algorithms for quaternion batch normalization. These pieces are tested in a classification model by end-to-end training on the CIFAR-10 and CIFAR-100 data sets and in a segmentation model by end-to-end training on the KITTI Road Segmentation data set. The quaternion networks show improved convergence compared to real-valued and complex-valued networks, especially on the segmentation task.

Index Terms—quaternion, complex, neural networks, deep learning

I. INTRODUCTION

There have been many advances in deep neural network architectures in the past few years. One such improvement is a normalization technique called batch normalization [1], which standardizes the activations of layers inside a network using minibatch statistics. It has been shown to regularize the network as well as provide faster and more stable training. Another improvement comes from architectures that add so-called shortcut paths to the network. These shortcut paths typically connect later layers to earlier layers, which allows stronger gradients to propagate back to the earlier layers. This method can be seen in Highway Networks [2] and Residual Networks [3]. Other work has been done to find new activation functions with more desirable properties. One example is the exponential linear unit (ELU) [4], which attempts to keep activations standardized. All of the above methods combat the vanishing gradient problem [5] that plagues deep architectures. With solutions to this problem appearing, it is only natural to move to a system that allows one to construct deeper architectures with as low a parameter cost as possible.

Other work in this area has explored the use of complex and hyper-complex numbers such as quaternions, the latter being a generalization of the complex numbers. Using complex numbers in recurrent neural networks (RNNs) has been shown to increase learning speed and provide a more noise-robust memory retrieval mechanism [6]–[8]. The first formulation of complex batch normalization and complex weight initialization is presented by [9], where they achieve some state-of-the-art results on the MusicNet data set. Hyper-complex numbers are less explored in neural networks, but have seen use in manual image and signal processing techniques [10]–[12]. Examples of using quaternion values in networks are mostly limited to architectures that take in quaternion inputs or predict quaternion outputs, but do not have quaternion weight values [13], [14]. There are some more recent examples of models that use quaternions represented as real values.
In [15] a quaternion multi-layer perceptron (QMLP) was used for document understanding, and [16] uses a similar approach for processing multi-dimensional signals. Building on [9], our contribution in this paper is to formulate and implement quaternion convolution, quaternion batch normalization, and quaternion weight initialization (source code is available at https://github.com/gaudetcj/DeepQuaternionNetworks). A difficulty arises beyond complex batch normalization that we had to overcome: there is no analytic form for our inverse square root matrix.

II. MOTIVATION AND RELATED WORK

The ability of quaternions to effectively represent spatial transformations and analyze multi-dimensional signals makes them promising for applications in artificial intelligence. One common use of quaternions is to represent rotations in a more compact form. PoseNet [14] used a quaternion as the target output of its model, where the goal was to recover the 6-DOF camera pose from a single RGB image. The ability to encode rotations may make a quaternion network more robust to rotational variance.

Quaternion representation has also been used in signal processing. Oppenheim and Lim [17] showed that the amount of information in the phase of an image is sufficient to recover the majority of the information encoded in its magnitude. The phase also encodes information such as shapes, edges, and orientations. Quaternions can be represented as a 2 x 2 matrix of complex numbers, which gives them a group of phases potentially holding more information than a single phase. Bulow and Sommer [12] used this higher-complexity representation of quaternions by extending Gabor's complex signal to a quaternion one, which was then used for texture segmentation. Another use of quaternion filters is shown in [11], where they introduce a new class of filter based on convolution with hyper-complex masks and present three color edge detecting filters.
These filters rely on a three-space rotation about the grey line of RGB space and, when applied to a color image, produce an almost greyscale image with color edges where the original image had a sharp change of color. More quaternion filter use is shown in [18], where they show that it is effective in the context of segmenting color images into regions of similar color texture. They state that the advantage of using quaternion arithmetic is that a color can be represented and analyzed as a single entity, which, as we will see in Section III-C, holds for quaternion convolution in a convolutional neural network architecture as well.

A quaternionic extension of a feed-forward neural network, for processing multi-dimensional signals, is shown in [16]. They expect that quaternion neurons operate on multi-dimensional signals as single entities, rather than real-valued neurons that deal with each element of the signal independently. A convolutional neural network (CNN) should be able to learn a powerful set of quaternion filters for more impressive tasks.

Another large motivation is discussed in [9], which is that complex numbers are more efficient and provide more robust memory mechanisms than the reals [10]–[12]. They continue that residual networks have a similar architecture to associative memories, since the residual shortcut paths compute their residual and then sum it into the memory provided by the identity connection. Again, given that quaternions can be represented as a complex group, they may provide an even more efficient and robust memory mechanism.

III. QUATERNION NETWORK COMPONENTS

This section covers the work done to obtain a working deep quaternion network. Some of the longer derivations are given in the Appendix.

A. Quaternion Representation

In 1833 Hamilton proposed that the complex numbers C be defined as the set R^2 of ordered pairs (a, b) of real numbers. He then began working to see if triplets (a, b, c) could extend multiplication of complex numbers. In 1843 he discovered a way to multiply in four dimensions instead of three, but the multiplication lost commutativity. This construction is now known as the quaternions. Quaternions are composed of four components: one real part and three imaginary parts.
They are typically denoted as

H = {a + bi + cj + dk : a, b, c, d ∈ R}    (1)

where a is the real part, (i, j, k) denotes the three imaginary axes, and (b, c, d) denotes the three imaginary components. Quaternions are governed by the following arithmetic:

i^2 = j^2 = k^2 = ijk = −1    (2)

which, by enforcing distributivity, leads to the noncommutative multiplication rules

ij = k,  jk = i,  ki = j,  ji = −k,  kj = −i,  ik = −j.    (3)

Since we will be performing quaternion arithmetic using reals, it is useful to embed H into a real-valued representation. There exists an injective homomorphism from H to the matrix ring M(4, R), where M(4, R) is the ring of 4 x 4 real matrices. The 4 x 4 matrix can be written as

\begin{bmatrix} a & -b & -c & -d \\ b & a & -d & c \\ c & d & a & -b \\ d & -c & b & a \end{bmatrix}
= a \begin{bmatrix} 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \\ 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 1 \end{bmatrix}
+ b \begin{bmatrix} 0 & -1 & 0 & 0 \\ 1 & 0 & 0 & 0 \\ 0 & 0 & 0 & -1 \\ 0 & 0 & 1 & 0 \end{bmatrix}
+ c \begin{bmatrix} 0 & 0 & -1 & 0 \\ 0 & 0 & 0 & 1 \\ 1 & 0 & 0 & 0 \\ 0 & -1 & 0 & 0 \end{bmatrix}
+ d \begin{bmatrix} 0 & 0 & 0 & -1 \\ 0 & 0 & -1 & 0 \\ 0 & 1 & 0 & 0 \\ 1 & 0 & 0 & 0 \end{bmatrix}.    (4)

This representation of quaternions is not unique, but we will stick to the above in this paper. It is also possible to represent H as M(2, C), where M(2, C) is the ring of 2 x 2 complex matrices.
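As a quick numerical illustration (ours, not part of the paper or its released code), the following NumPy sketch builds the M(4, R) embedding of Eq. (4) and checks that matrix multiplication of the embeddings reproduces the Hamilton product implied by Eqs. (2)-(3); the function names are illustrative.

```python
# Minimal sketch: verify that the M(4, R) embedding of Eq. (4) is a ring
# homomorphism, i.e. mat(p) @ mat(q) == mat(p * q) for the Hamilton product.
import numpy as np

def quat_to_mat(a, b, c, d):
    """Embed the quaternion a + bi + cj + dk as the 4 x 4 real matrix of Eq. (4)."""
    return np.array([[a, -b, -c, -d],
                     [b,  a, -d,  c],
                     [c,  d,  a, -b],
                     [d, -c,  b,  a]], dtype=float)

def hamilton_product(p, q):
    """Multiply quaternions p and q, each given as a tuple (a, b, c, d), using Eqs. (2)-(3)."""
    a1, b1, c1, d1 = p
    a2, b2, c2, d2 = q
    return (a1*a2 - b1*b2 - c1*c2 - d1*d2,
            a1*b2 + b1*a2 + c1*d2 - d1*c2,
            a1*c2 - b1*d2 + c1*a2 + d1*b2,
            a1*d2 + b1*c2 - c1*b2 + d1*a2)

p, q = (1.0, 2.0, 3.0, 4.0), (0.5, -1.0, 2.0, 0.0)
# Homomorphism check: the matrix of p times the matrix of q equals the matrix of p*q.
assert np.allclose(quat_to_mat(*p) @ quat_to_mat(*q),
                   quat_to_mat(*hamilton_product(p, q)))
```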
With our real-valued representation, a quaternion real-valued 2D convolution layer can be expressed as follows. Say that the layer has N feature maps such that N is divisible by 4. We let the first N/4 feature maps represent the real components, the second N/4 represent the i imaginary components, the third N/4 represent the j imaginary components, and the last N/4 represent the k imaginary components.

B. Quaternion Differentiability

In order for the network to perform backpropagation, the cost function and activation functions used must be differentiable with respect to the real, i, j, and k components of each quaternion parameter of the network. As the complex chain rule is shown in [9], we provide the quaternion chain rule, which is given in the Appendix, Section VII-A.

C. Quaternion Convolution

Convolution in the quaternion domain is done by convolving a quaternion filter matrix W = A + iB + jC + kD with a quaternion vector h = w + ix + jy + kz. Performing the convolution by using the distributive property and grouping terms, one gets

W ∗ h = (A ∗ w − B ∗ x − C ∗ y − D ∗ z)
      + i(A ∗ x + B ∗ w + C ∗ z − D ∗ y)
      + j(A ∗ y − B ∗ z + C ∗ w + D ∗ x)
      + k(A ∗ z + B ∗ y − C ∗ x + D ∗ w).    (5)

Using a matrix to represent the components of the convolution, we have

\begin{bmatrix} \mathcal{R}(W \ast h) \\ \mathcal{I}(W \ast h) \\ \mathcal{J}(W \ast h) \\ \mathcal{K}(W \ast h) \end{bmatrix}
= \begin{bmatrix} A & -B & -C & -D \\ B & A & -D & C \\ C & D & A & -B \\ D & -C & B & A \end{bmatrix}
\ast \begin{bmatrix} w \\ x \\ y \\ z \end{bmatrix}.    (6)

An example is shown in Fig. 1, where one can see how quaternion convolution forces a linear depthwise mixture of the channels. This is similar to a mixture of standard convolution and depthwise separable convolution from [19]. This reuse of filters on every layer and their combination may help extract texture information across channels, as seen in [18].
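To make the channel bookkeeping and Eqs. (5)-(6) concrete, here is a minimal PyTorch-style sketch (ours, not the authors' released implementation) of a quaternion convolution built from four real-valued filter banks A, B, C, D applied to an input whose channels are split into r, i, j, k groups; all names and shapes are illustrative.

```python
# Illustrative sketch: quaternion convolution per Eqs. (5)-(6), built entirely
# from standard real-valued convolutions. The input's channels are split into
# four equal groups holding the r, i, j, k components, and A, B, C, D are the
# real filter banks of the quaternion filter W = A + iB + jC + kD.
import torch
import torch.nn.functional as F

def quaternion_conv2d(x, A, B, C, D, **kw):
    # Split the N feature maps into the four quaternion components (N divisible by 4).
    w, xi, yj, zk = torch.chunk(x, 4, dim=1)
    r = F.conv2d(w, A, **kw) - F.conv2d(xi, B, **kw) - F.conv2d(yj, C, **kw) - F.conv2d(zk, D, **kw)
    i = F.conv2d(xi, A, **kw) + F.conv2d(w, B, **kw) + F.conv2d(zk, C, **kw) - F.conv2d(yj, D, **kw)
    j = F.conv2d(yj, A, **kw) - F.conv2d(zk, B, **kw) + F.conv2d(w, C, **kw) + F.conv2d(xi, D, **kw)
    k = F.conv2d(zk, A, **kw) + F.conv2d(yj, B, **kw) - F.conv2d(xi, C, **kw) + F.conv2d(w, D, **kw)
    return torch.cat([r, i, j, k], dim=1)

# Example: 8 input maps (2 per component) and 16 output maps (4 per component).
x = torch.randn(1, 8, 32, 32)
A, B, C, D = (torch.randn(4, 2, 3, 3) for _ in range(4))
out = quaternion_conv2d(x, A, B, C, D, padding=1)  # shape: (1, 16, 32, 32)
```

Note how each real filter bank is reused across all four output components, which is the linear depthwise channel mixing illustrated in Fig. 1.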