Abstract
Although the studies of biological vision and computer vision attempt to understand powerful visual information processing from different angles, they have a long history of informing each other. Recent advances in texture synthesis that were motivated by visual neuroscience have led to a substantial advance in image synthesis and manipulation in computer vision using convolutional neural networks. Here we review approaches to texture analysis, from the most traditional texture descriptors, such as gray-level co-occurrence matrices and local binary patterns, to more recent approaches such as convolutional neural networks and multi-scale patch-based recognition, and discuss how they can in turn inspire new research in visual perception and computational neuroscience.

Introduction
A fascinating property of human visual perception is that physically very different images are perceived to look very much the same. A prominent example of this property is texture perception: when more than a handful of similar objects are nearby, our visual system groups them together and we become insensitive to their precise spatial arrangement: ‘things’ (Fig. a) become ‘stuff’ (Dubuc and Zucker, 2001). Since texture perception is omnipresent in human vision, vision scientists have spent many years characterizing under what conditions things become stuff and what exactly constitutes a texture.

Mathematically, we can formalize a texture as a sample from an ensemble of images with (spatially) stationary statistics. That is, the local statistical dependencies between pixels are the same irrespective of absolute position and, consequently, individual elements can have a highly variable spatial arrangement. The study of visual textures as stationary images was pioneered by Julesz (1962), who hypothesized that all images with equal Nth-order joint pixel histograms are pre-attentively indistinguishable for human observers and are therefore samples from the same texture.
This specific hypothesis turned out to be wrong for computationally tractable values of N (Julesz et al., 1978). Yet the basic idea of describing a texture by a set of spatial summary statistics forms the basis of parametric texture modelling, and even the notion of images with equal Nth-order joint pixel histograms is still applied fruitfully today (Yu et al., 2015).

Image style transfer
Image style transfer is defined as follows: given two input images, synthesize a third image that has the semantic content of the first image and the texture/style of the second. For this to work, one needs to determine the content and the style of any image (content/style extractor) and then merge arbitrary content with another arbitrary style (merger). The central problem of style transfer is therefore to find a clear way of computing the “content” of an image as distinct from its “style”. Before deep learning arrived on the scene, researchers handcrafted methods to extract the content and texture of images, merged them and checked whether the results were interesting or unusable. Before deep learning, texture transfer methods had at least one of these two shortcomings:
- they did not preserve the semantic content of the content image well enough
- they did not generalize well to images outside a small subset of study cases

Classification of texture
Texture classification is an image processing and computer vision task with applications in numerous fields and industries, such as computer-aided medical diagnosis, classification of forest species, geo-processing, writer identification and verification, and oil and gas.

Texture descriptors
1. Gray-Level Co-occurrence Matrices (GLCM): A GLCM gives the probability of the joint occurrence of gray levels i and j, where i ≤ G and j ≤ G and G denotes the gray-level depth of the image, within a defined spatial relation in an image (Cavalin et al., 2013).
2. Local Binary Patterns (LBP): A histogram is computed from the distribution of the binary configurations of the pixels of the image (Fig. b). Each configuration is obtained by thresholding the window surrounding a pixel according to whether the intensity of each neighboring pixel is below or above the center value (Kylberg et al., 2013).
3. Gabor Filter Banks: The main idea is to represent an image by taking into account both frequency and orientation aspects, using a set of classes of Gabor functions.
4. Local Phase Quantization (LPQ): LPQ is based on quantized phase information of the Discrete Fourier Transform (DFT). It uses the local phase information extracted using the 2-D DFT or, more precisely, a Short-Term Fourier Transform (STFT) computed over a rectangular M × M neighborhood N at each pixel position x of the image i(x).
5. Patch and Multiscale-Based Approaches: Given that an image may contain repetitions of the same texture pattern, the main idea is to divide the original image into several smaller images, i.e. the patches (Fig. c), instead of extracting features from the entire image as a whole. Each patch can be considered a different “observation” of the input image, and both training and recognition are performed on these multiple “observations” (Bianconi and Fernández, 2007).
6. Deep Learning Approaches: With the prominent recent advances in deep learning, especially Convolutional Neural Networks (CNNs) for image recognition, the application of CNNs to texture recognition problems has drawn the attention of various researchers.
In some works, the focus has been mainly on evaluating standard CNN architectures that had previously been used for object recognition, or combinations of CNNs with other classifiers such as Support Vector Machines (SVM) trained with textural features (Krizhevsky et al., 2012).

Convolutional Neural Network (CNN)
The modelling of more complex computational building blocks of the visual system has benefitted greatly from a revolution in the field of computer vision and machine learning. In 2012, Krizhevsky and colleagues outperformed the state of the art in object recognition in the ImageNet Large Scale Visual Recognition Challenge (ILSVRC) by a large margin using a deep convolutional neural network (CNN).

Interestingly, the representations of CNNs trained to solve large-scale object recognition appear to be generally useful for visual information processing. They transfer to other datasets and tasks (Donahue et al., 2014). It has become common practice in the computer vision community to use the activations of CNNs pre-trained on object recognition as the ad hoc feature representation for solving visual information processing tasks. DiCarlo and coworkers showed that performance-optimized CNNs not only provide general and powerful image features for machine vision, but also excel in predicting neural activity in primate areas V4 and IT (Cadieu et al., 2014), two areas of the ventral visual stream, the part of the primate brain responsible for object recognition. Studies in humans confirmed the usefulness of CNN representations as predictors for biological vision processes. They set the state of the art in predicting fMRI responses in the human ventral stream (Güçlü and van Gerven, 2015) and in predicting where in images humans look (Kümmerer et al., 2015).
Thus, models from computer vision and machine learning that share only a very coarse computational architecture with the visual cortex are the best models for predicting brain responses and human visual behavior.

A texture model based on deep neural network features:
The predictive power of modern convolutional neural networks and the fruitfulness of parametric texture modelling for visual neuroscience (Freeman et al., 2013) motivated the development of a texture model (Gatys et al., 2015) based on the feature representations of a high-performing CNN trained on object recognition (Simonyan and Zisserman, 2014). This texture model then led to the development of the widely popularized neural style transfer algorithm (Gatys et al., 2016).

The central part of this CNN-based texture algorithm is a pre-image search. Pre-image search techniques have been used before for understanding and visualizing the feature representations of CNNs (Mahendran and Vedaldi, 2014) and for synthesizing textures (Portilla and Simoncelli, 2000). Building on these ideas, the pre-image search is constrained by the spatial correlations between feature maps in different convolutional layers of the network (Gatys et al., 2015). The resulting textures look very realistic and many of them are indistinguishable from the originals under realistic viewing conditions (Wallis et al., 2017). Furthermore, the texture parameters are factorized into layers of neurons that make high-level image information increasingly explicit. This property makes it possible to generate images with an increasing degree of naturalness (Gatys et al., 2015), which could also serve as useful stimuli for studying visual perception.

From texture modelling to style transfer:
The neural network texture model forms the basis of Neural Style Transfer (Gatys et al., 2016), an algorithm that repaints a photograph in the style of an arbitrary painting.
The algorithm maps the texture of the painting onto the photograph while preserving high-level features (the content) of the photograph. It can be thought of as synthesizing a texture in the style of the painting with the additional constraint of matching the activations of higher-layer deep network features. Since units in higher layers are quite invariant under low-level variation, these constraints allow for a lot of flexibility in terms of drawing style.

A useful analogy for the two types of constraints, content and style, is the phase spectrum and the power spectrum of an image. It has long been known that the phase spectrum is important for recognizing the content of an image, while the power spectrum captures spatial correlations in a shift-invariant manner and is more related to texture or style. Phase and power spectrum contain fully complementary information and can thus be arbitrarily recombined without compromises. The content and style representations obtained with convolutional neural networks are perceptually quite complementary, but not completely independent. To explain this, Gatys et al. divided the workflow into three components:

Content extractor: The semantic content of an image is separated out using a CNN, an architecture that is inspired by some of the mechanisms present in our visual system and is mainly used for computer vision problems. The content is read from the output of a hidden layer.

Style extractor: The style extractor uses the same idea as the content extractor (i.e. it uses the output of a hidden layer), but adds one more step: a correlation estimator based on the Gram matrix of the filters of a given hidden layer. In plainer terms, it discards the semantics of the image but preserves its basic components, which makes it a good texture extractor.

Merger: As a final step, the content of one image needs to be blended with the style of another. This is posed as an optimization problem: a cost function is defined and then minimized.
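The three components above can be illustrated with a minimal sketch. It assumes the hidden-layer activations have already been extracted from a CNN and flattened into plain Python lists (one list of spatial positions per channel). The function names, the normalization of the Gram matrix and the weights `alpha` and `beta` are illustrative assumptions, not the exact formulation of Gatys et al.; the network itself and the gradient-based optimization of the image are omitted.

```python
from typing import List

# Assumed layout: C channels, each flattened to N spatial positions.
FeatureMap = List[List[float]]


def gram_matrix(features: FeatureMap) -> List[List[float]]:
    """Channel-by-channel correlations, averaged over spatial positions."""
    n_positions = len(features[0])
    return [
        [sum(a * b for a, b in zip(ch_i, ch_j)) / n_positions for ch_j in features]
        for ch_i in features
    ]


def content_loss(generated: FeatureMap, content: FeatureMap) -> float:
    """Half the summed squared difference between activations at matching positions."""
    return 0.5 * sum(
        (g - c) ** 2
        for g_ch, c_ch in zip(generated, content)
        for g, c in zip(g_ch, c_ch)
    )


def style_loss(generated: FeatureMap, style: FeatureMap) -> float:
    """Mean squared difference between the Gram matrices of the two inputs."""
    gram_gen, gram_sty = gram_matrix(generated), gram_matrix(style)
    n_channels = len(gram_gen)
    return sum(
        (a - b) ** 2
        for row_a, row_b in zip(gram_gen, gram_sty)
        for a, b in zip(row_a, row_b)
    ) / n_channels ** 2


def total_loss(generated: FeatureMap, content: FeatureMap, style: FeatureMap,
               alpha: float = 1.0, beta: float = 1000.0) -> float:
    """Cost function of the merger: weighted sum of content and style terms."""
    return alpha * content_loss(generated, content) + beta * style_loss(generated, style)


# Toy activations: 2 channels, 4 spatial positions.
features = [[1.0, 2.0, 3.0, 4.0], [0.0, 1.0, 0.0, 1.0]]
# The same activations with the spatial positions rearranged.
shuffled = [[4.0, 1.0, 3.0, 2.0], [1.0, 0.0, 0.0, 1.0]]

print(style_loss(shuffled, features))    # 0.0: the Gram matrix ignores position
print(content_loss(shuffled, features))  # 8.0: the content term does not
```

Because the Gram matrix sums over spatial positions, rearranging the positions leaves the style loss at zero while the content loss grows. This is the sense in which the style representation discards spatial arrangement but keeps the basic components of the image.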
Neural network feature spaces for perceptual loss functions:
The key insight of Neural Style Transfer was that feature spaces of neural networks trained on object recognition allow the separation and recombination of perceptually important image features (‘content’ and ‘style’) in an unprecedented manner. This insight touches on a fundamental problem in computer vision and image processing: finding image representations that enable the analysis, synthesis and manipulation of images with respect to perceptual variables. Closely related is the search for measures of image quality and image distortion that correspond better to human perception than the classical Peak Signal to Noise Ratio (PSNR) or even the improved Structural Similarity Index (SSIM) (Wang et al., 2004). This class of problems can be summarized as generating images subject to certain perceptual constraints.

The introduction of better perceptual loss functions based on pre-trained neural network features has inspired a large body of new work in all of these areas. Single-image super-resolution has seen a considerable boost in performance from the use of feature spaces of pre-trained neural networks as a measure of perceptual image quality (Sajjadi et al., 2016). In short, feature spaces that correspond to human perception have enabled a wide range of improvements in image editing and synthesis, from low-level tasks such as image super-resolution to high-level tasks such as attribute-based image synthesis.

Conclusion:
Research in texture recognition has developed as a branch of image recognition.
Different methods have been proposed to exploit the specific characteristics of textures, starting with texture descriptors and then moving to novel classification schemes.

The major advances in image synthesis and editing became possible because current CNNs like the VGG network are good models of human perception beyond the specific task that they were originally trained for. The power of current CNNs to model natural perception renders them the most useful models for understanding the visual system in the brain. Importantly, this claim does not build on the superficial similarity between CNNs and the neural anatomy of the visual pathway. One meaningful way to define “neural similarity” between artificial and biological neural networks is system identification.

Establishing such a correspondence between CNNs and biological neural networks is a promising start, but understanding human perception will take much more. An obvious further direction is to combine system identification with perceptually meaningful image synthesis methods. Such an approach could generate stimuli to explore the response properties of biological neurons and their impact on perception in a directed manner. If we want to understand the brain, explaining behavioral differences between artificial neural networks is an extremely important exercise. It gives a better idea of what constitutes an explanation of neural network function and of what we could hope to accomplish if there were no experimental limitations.