Learning Transformational Invariants from Natural Movies

Cadieu & Olshausen, 2009, Learning Transformational Invariants from Natural Movies

My favorite thing about this paper is its cool spatiotemporal eigenfunction video. Although their mathematical methods don't relate directly to biology, their model provides a new way of thinking about separation of form and motion information in early visual processing.

Their model has two layers, the first of which resembles basis-function learning explored in previous papers, and the second of which learns a sparse representation of motion based on modulations of spatial basis functions. They aim to extract transformation invariants : the features common to moving objects that depend entirely on the movement, and very little on the form of the object.

The paper opens with a bit of background :
In previous work it has been shown that many of the observed response properties of neurons in V1 may be accounted for in terms of a sparse coding model of images [15, 16]:
I(x,t) = Σ_i u_i(t) A_i(x) + n(x,t)
where I(x,t) is the image intensity as a function of space (x ∈ R² ) and time, A_i (x) is a spatial basis function with coefficient u_i , and the term, n(x,t) corresponds to Gaussian noise with variance σ_n that is small compared to the image variance. The sparse coding model imposes a kurtotic, independent prior over the coefficients, and when adapted to natural image patches the A_i (x) converge to a set of localized, oriented, multiscale functions similar to a Gabor wavelet decomposition of images.

My translation (which may be no more accessible than the original) of this is :
It's possible to reconstruct an image I(x,t) by looking at neural responses. This reconstruction can be thought of as a sum of small pieces A_i(x), each piece encoded by a single neuron. The reconstruction won't be perfect, since the neurons are noisy and may not care to represent all image features, so that's why you have an n(x,t) noise term. The contribution u_i(t) attributed to each piece changes over time. Oftentimes, we think of neural encodings as "sparse". This means that most of the time only a few neurons are firing. In the case of image reconstruction, this means that you want most u_i(t) weights to be near zero at any given time. This is what the paper means by "kurtotic" : the distribution of each weight is sharply peaked at zero with heavy tails, so a weight is usually tiny and only occasionally large. By localized they mean that each piece fits within some small circle. By oriented they mean that the pieces generally aren't symmetric and if you rotate them they don't look the same. By multi-scale they mean that pieces of many different sizes are used.
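To make the "sum of pieces with mostly-zero weights" idea concrete, here is a minimal sketch of inferring sparse weights for a single frame. Several things here are my assumptions and not the paper's method: I use an L1 penalty as a stand-in for the kurtotic prior, a plain ISTA loop for inference, and a random dictionary in place of learned Gabor-like pieces. The names `infer_sparse_coefficients`, `sparsity`, and `n_steps` are made up for illustration.

```python
import numpy as np

def soft_threshold(v, thresh):
    """Shrink every entry toward zero; entries smaller than thresh become exactly 0."""
    return np.sign(v) * np.maximum(np.abs(v) - thresh, 0.0)

def infer_sparse_coefficients(image, basis, sparsity=0.1, n_steps=500):
    """ISTA for: minimize 0.5 * ||image - basis @ u||^2 + sparsity * ||u||_1.

    image: (n_pixels,) flattened frame I(x) at one time step
    basis: (n_pixels, n_basis) matrix whose columns are the pieces A_i(x)
    """
    # Step size 1/L, where L is the largest eigenvalue of basis.T @ basis.
    lipschitz = np.linalg.norm(basis, ord=2) ** 2
    u = np.zeros(basis.shape[1])
    for _ in range(n_steps):
        residual = image - basis @ u
        u = soft_threshold(u + basis.T @ residual / lipschitz, sparsity / lipschitz)
    return u

# Toy data (assumed): a random unit-norm dictionary and a frame built from 3 pieces.
rng = np.random.default_rng(0)
n_pixels, n_basis = 64, 128
basis = rng.standard_normal((n_pixels, n_basis))
basis /= np.linalg.norm(basis, axis=0)

true_u = np.zeros(n_basis)
active = rng.choice(n_basis, size=3, replace=False)
true_u[active] = [1.5, -2.0, 1.0]
image = basis @ true_u + 0.01 * rng.standard_normal(n_pixels)  # small n(x,t)

u_hat = infer_sparse_coefficients(image, basis)
reconstruction = basis @ u_hat
relative_error = np.linalg.norm(image - reconstruction) / np.linalg.norm(image)
print("nonzero weights:", np.count_nonzero(u_hat), "of", n_basis)
print("relative reconstruction error:", relative_error)
```

The soft threshold is what forces weights to be exactly zero, which is the code-level version of "most u_i(t) are near zero at any given time". A real model would also have to learn the pieces themselves, which this sketch skips.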

The authors note that pieces ( spatial basis functions A_i(x) ) tend to form pairs of very similar pieces that differ only by a small shift. A moving object will tend to activate one piece and then the other in short succession, and the local direction and speed of movement can be inferred by the order in which the pieces were activated.

So, the authors devise another representation, one that treats these pairs of pieces as if they were coupled. You can imagine forming a 2D space, with the activation of one piece as the X coordinate and the activation of the other as the Y coordinate. If you plot how a moving object activates these pieces, you get circular trajectories. The activation of pairs of components can be well represented in polar co-ordinates, where the way the angle θ changes tells you something about the velocity, and the way the radius ρ changes tells you something about the intensity of the stimulus, or how well that feature matches the stimulus. If you know about complex analysis then you know that this polar representation is cleanly expressed in terms of complex numbers z = ρ exp(iθ) (see the paper for details).
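Here is a small sketch of the circular-trajectory picture, under assumptions of my own: a 1D "image", a hand-built pair of Gabor-like pieces in quadrature (a cosine and a sine under the same Gaussian envelope), and a drifting grating as the stimulus. None of the numbers come from the paper.

```python
import numpy as np

# 1D space and a quadrature pair of pieces (assumed stand-ins for learned A_i(x)).
x = np.linspace(-3.0, 3.0, 256)
freq = 1.5                          # spatial frequency in cycles per unit length
envelope = np.exp(-x**2 / 2.0)
piece_even = envelope * np.cos(2 * np.pi * freq * x)
piece_odd = envelope * np.sin(2 * np.pi * freq * x)

# A grating drifting to the right at a constant speed (units per frame).
speed = 0.2
n_frames = 20
z = np.empty(n_frames, dtype=complex)
for t in range(n_frames):
    frame = np.cos(2 * np.pi * freq * (x - speed * t))
    # The two activations become the X and Y of a point in the plane.
    z[t] = frame @ piece_even + 1j * (frame @ piece_odd)

rho = np.abs(z)                     # radius: how strongly the pair is driven
theta = np.unwrap(np.angle(z))      # angle: where we are around the circle
phase_step = np.diff(theta)         # change in angle per frame
estimated_speed = phase_step / (2 * np.pi * freq)

print("radius spread:", rho.std() / rho.mean())
print("estimated speed per frame:", estimated_speed.mean())
```

The radius stays nearly constant while the angle advances by the same amount every frame, so the point traces a circle. Dividing the phase step by 2π times the spatial frequency converts it back to a displacement, which only works while the per-frame phase step stays below π.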

The paper goes on to replace each spatial basis function A_i(x) with a pair of real and imaginary basis functions, and to represent the temporal modulation as a complex number in polar form. This adds a constraint on the types of spatial basis functions you can use, and a nonlinear transformation of the spatial decomposition. Rather than having pairs of basis functions that tend to trade off activation as if they were the real and imaginary components, you have phase and amplitude information. They then train a second system to encode motion in terms of the derivative of these phase components.
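As a sketch of what that looks like in code, the snippet below builds a frame from amplitudes and phases and computes a wrapped phase difference between frames. The synthesis form Re{conj(z_i) A_i(x)} with z_i = a_i exp(iφ_i) is my reading of the model, so treat it as an assumption and check it against the paper. The function names and the random basis are hypothetical.

```python
import numpy as np

def synthesize_frame(amplitude, phase, basis_real, basis_imag):
    """Sum over i of Re{ conj(z_i) * A_i(x) }, with z_i = a_i * exp(1j * phi_i).

    This expands to a_i * (cos(phi_i) * A_i^R(x) + sin(phi_i) * A_i^I(x)).
    basis_real, basis_imag: (n_pixels, n_basis); amplitude, phase: (n_basis,)
    """
    z = amplitude * np.exp(1j * phase)
    complex_basis = basis_real + 1j * basis_imag
    return np.real(complex_basis @ np.conj(z))

def phase_velocity(phase_series):
    """Frame-to-frame phase change, wrapped into (-pi, pi].

    phase_series: (n_frames, n_basis) array of phases phi_i(t).
    """
    step = np.diff(phase_series, axis=0)
    return np.angle(np.exp(1j * step))

# Toy example (assumed): two complex pieces with random spatial profiles.
rng = np.random.default_rng(1)
basis_real = rng.standard_normal((16, 2))
basis_imag = rng.standard_normal((16, 2))
amplitude = np.array([1.0, 0.5])
phase = np.array([0.3, -1.2])
frame = synthesize_frame(amplitude, phase, basis_real, basis_imag)

# Phases that advance by 0.5 rad per frame but are stored wrapped to (-pi, pi].
wrapped_phases = np.angle(np.exp(1j * 0.5 * np.arange(30)))[:, None]
velocities = phase_velocity(wrapped_phases)
print("phase velocity per frame:", velocities.mean())
```

The wrapped difference matters because phase is circular: a jump from just below π to just above -π is a small forward step, not a large backward one. These per-piece phase velocities are the kind of signal the second layer would take as input.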

The authors note that "Whether or how phase is represented in V1 is not known."


  1. further notes :

    I've been thinking about the temporal aspect of sparse coding. Some sparse coding algorithms stop at sparsely encoding the spatial components of visual stimuli. This paper achieved sparseness by taking the derivative of the activation of spatial features ( which I suspect is biologically plausible, if implemented as a high pass filter ), and then feeding these velocities in as if they were spatial features to another layer.

    Sparse coding in time might be achieved by differentiation.

    Sparse coding *of* movement is achieved by spatially integrating local velocity vectors.

    Things I still need to understand :
    -- can one do the same thing with an extreme sparseness constraint, where the component weights can only be 0 or 1 and must usually be 0 over the time course of the data?
    -- can one do the same thing where space and time are treated as inseparable?