## 20110428

### Mutual Information Does Not Necessarily Predict Decoding Accuracy

( Ignore, personal notes )

Entropy and mutual information are increasingly used for analysis in neuroscience. Frequently, spike times and continuous variables are binned into discrete processes. This avoids certain conceptual problems with reasoning about information on continuous variables, but can add its own complications. Mutual information between a spike train and a stimulus does not necessarily predict measures of decoding accuracy like correlation or root mean squared error (RMSE). While we may know that there are N bits of mutual information between a count process (derived from spiking data) and a discrete variable (derived from a continuous stimulus), we do not know how "important" this information is.

The "importance" of information is fixed by the experimenter, and may represent a prejudiced expectation of what the neuron encodes, or a practical constraint of the experiment. For instance, in brain machine interface (BMI) research, we are interested in reconstructing how the arm moves (kinematics) from neural recordings. A good decoding is highly correlated with the measured kinematics, or minimizes the RMSE between the decoding and the measurement. We place more importance on information that pertains to large-amplitude kinematic features, but mutual information between discrete random variables does not necessarily capture this importance.

This is intuitive when considering binary representations of integers. Consider two discrete random variables A and B that generate N-bit integers. Say that K bits in A and B are always the same, and that the remaining N-K bits are independent. In this case, A and B have K bits of mutual information, $I(A;B)=K$. What do we know about $|A-B|$, the absolute difference between these two processes ?

If A and B share the highest-order bits, then errors are confined to the N-K lower-order bits and the error is bounded as $|A-B|\sim O(2^{N-K})$. However, if A and B share their low-order bits while the high-order bits are independent, then the magnitude of $|A-B|$ will scarcely differ from the case where all bits of A and B are independent. Mutual information does not tell you the magnitude of the impact of the shared information on the values of A and B.
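This is easy to check numerically ( a toy simulation; the parameters and names are my own, not from any dataset ):

```python
import numpy as np

rng = np.random.default_rng(0)
N, K, trials = 8, 4, 100_000  # N-bit integers sharing K bits

def sample_pair(share_high_bits):
    """Draw (A, B) as N-bit integers whose K shared bits are either
    the high-order or the low-order bits; the rest are independent."""
    shared = rng.integers(0, 2**K, trials)
    a_free = rng.integers(0, 2**(N - K), trials)
    b_free = rng.integers(0, 2**(N - K), trials)
    if share_high_bits:
        return (shared << (N - K)) | a_free, (shared << (N - K)) | b_free
    return (a_free << K) | shared, (b_free << K) | shared

# By construction I(A;B) = K = 4 bits in both cases, yet the error differs:
A, B = sample_pair(True)
err_high = np.abs(A - B).mean()   # bounded by 2^(N-K) = 16
A, B = sample_pair(False)
err_low = np.abs(A - B).mean()    # roughly as large as with no shared bits
print(err_high, err_low)
```

The two cases have identical mutual information, but the mean absolute error is an order of magnitude larger when only the low-order bits are shared.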

A population of cells in the brains of rats ( and probably primates ) has receptive fields that would seem to suffer from this decoding issue. Grid cells in the entorhinal cortex are known to encode a rat's position in its environment. These cells have spatially periodic receptive fields. If you were to listen to a grid cell and place a point on a map of the animal's location each time the cell fired, you would see a hexagonal pattern of "bumps". Different grid cells represent different spatial frequencies ( larger / smaller bumps ) or different phases ( slightly shifted hexagonal grids of bumps ). Decoding the position of an animal from grid cell activity is much like decoding the value of an integer from its binary representation.

Say I can record from some grid cells. If these cells span a range of spatial scales, I can figure out where the animal is: start with the cell that has the largest-period receptive field to exclude some areas of the room, then narrow down the position using cells with progressively finer spatial scales. However, if I only have cells with small, high-spatial-frequency maps, I may be able to restrict the animal's location to a grid-like collection of possibilities, but this information is not particularly useful if the gross location of the animal is missing.
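A minimal sketch of this aliasing problem, using a hypothetical 1-D "room" and made-up grid periods ( nothing here is meant to match real grid cell scales ):

```python
import numpy as np

room = np.arange(100)   # a 1-D room of 100 positions (toy example)
true_pos = 62           # hypothetical animal location

# A fine-scale grid cell alone pins down position only modulo its period:
fine_candidates = room[room % 7 == true_pos % 7]
# A coarse-scale cell alone restricts position to one broad region:
coarse_candidates = room[room // 25 == true_pos // 25]

# The fine cell leaves fewer candidate positions (more bits of information)...
print(len(fine_candidates), len(coarse_candidates))
# ...but its candidates span nearly the whole room (large worst-case error):
print(fine_candidates.max() - fine_candidates.min(),
      coarse_candidates.max() - coarse_candidates.min())
```

The fine-scale cell is "more informative" in bits, yet on its own it leaves a worst-case position error nearly as large as knowing nothing at all.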

It is unclear to me whether decoding kinematics could suffer from a similar problem. At the very least, we know that the lower bounds on decoding accuracy for a given mutual information are quite bad, with the grid cell encoding as a worst case scenario. In practice, decoding accuracy might correlate quite well with mutual information.

It seems like some of my confusion might be cleared up by the concept of distortion in information theory. Apparently there is a relationship between distortion and mutual information ( the rate-distortion function ) that holds for both discrete and continuous random variables.
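For reference, the standard rate-distortion function ( as I understand it ) is the minimum mutual information compatible with an expected distortion of at most D:

$$R(D) = \min_{p(\hat{x}\mid x)\,:\,\mathbb{E}[d(X,\hat{X})]\le D} I(X;\hat{X})$$

Choosing the distortion measure $d$ to be squared error would make D the MSE, which seems like exactly the kind of link between mutual information and decoding error that is missing above, and the definition makes sense for both discrete and continuous variables.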

Particularly confused / speculative / handwave part :

Consider histograms with N bins of equal probability mass ( as opposed to equal width ). These bins can be enumerated by N integers, each $\log_2(N)$ bits long. Each additional bit of information excludes half of the remaining bins. However, this half could be "all points larger than 10", or it might be "all points n s.t. n%2=0". When it comes to homing in on the position of a point in space, the former is more useful and will reduce error, while the latter barely reduces error at all.
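A quick numerical check of this claim ( toy numbers, assuming a uniform distribution over 16 bins; the decoder is just the mean of the positions consistent with the known bit ):

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.integers(0, 16, 100_000)  # point drawn uniformly from 16 equal bins

def mean_abs_error(bit):
    """Expected |error| when estimating x by the mean of its group,
    given knowledge of one binary feature of x."""
    err = np.empty(len(x))
    for b in (0, 1):
        mask = bit == b
        err[mask] = np.abs(x[mask] - x[mask].mean())
    return err.mean()

baseline = np.abs(x - x.mean()).mean()   # no bits known: ~4.0
err_half = mean_abs_error(x >= 8)        # "larger than median" bit: ~2.0
err_parity = mean_abs_error(x % 2)       # parity bit: still ~4.0
print(baseline, err_half, err_parity)
```

Both features carry exactly one bit of information about x, but the median-split bit roughly halves the expected error while the parity bit leaves it essentially unchanged.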

Perhaps considering the information present in a collection of histograms, ordered from coarsest to finest, could reveal the relative contribution of "bits" to different magnitudes of error ? But then, why not consider the Fourier decomposition in the stimulus-value domain, or all arbitrary partitions of the sample space ? When the sample space is continuous, one might be tempted to take ever finer partitions, which would cause the entropy of the distribution to diverge. At this point, differential entropy looks like it might be much better than discrete entropy for some applications.