Is Maximizing Mutual Information, or Information, Equivalent to the Maximum Coverage Problem ?

( Ignore, Personal notes )
I need some measure theory / analysis / math help to proceed :

update : This question has been answered in depth at MathOverflow

Say you have N cells. Say also that you would like to construct the best possible decoder using only K of these cells. How can we find these K best cells ?

This problem appears superficially similar to the maximum coverage problem.

(From Wiki) Formally, (unweighted) Maximum Coverage
Given a number $k$ and a collection of sets $S=\{S_1,S_2,\ldots,S_m\}$, where $S_i \subseteq \left\{e_1, e_2, \ldots, e_n \right\}$, find a subset $S^\prime \subseteq S$ of sets, such that $\left| S^\prime \right| \leq k$ and the number of covered elements $\left| \bigcup_{S_i \in S^\prime}{S_i} \right|$ is maximized.
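For reference, the greedy heuristic for this problem just keeps taking the set that covers the most not-yet-covered elements; it achieves a $(1-1/e)$ fraction of the optimal coverage, and (per Feige) no polynomial-time algorithm does better unless P = NP. A minimal sketch, with arbitrary example sets:

```python
# Greedy approximation for (unweighted) maximum coverage:
# repeatedly pick the set covering the most not-yet-covered elements.
def greedy_max_coverage(sets, k):
    """Pick up to k sets from `sets` (a list of Python sets) greedily."""
    covered = set()
    chosen = []
    for _ in range(k):
        best = max(sets, key=lambda s: len(s - covered))
        if not (best - covered):
            break  # nothing new left to cover
        chosen.append(best)
        covered |= best
    return chosen, covered

sets = [{1, 2, 3}, {3, 4}, {4, 5, 6, 7}, {1, 5}]
chosen, covered = greedy_max_coverage(sets, 2)
# greedy first takes {4,5,6,7}, then {1,2,3}, covering 7 elements
```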

A very hand-wavy argument :

We know that greedy selection approximates maximum coverage as well as can be hoped. Perhaps selecting the K best cells can be re-phrased as the maximum coverage problem? Forget about mutual information for the moment and just consider entropy, or information. Given a number $k$ and a collection $S=\{S_1,S_2,\ldots,S_m\}$ of random variables, I would like to find a subset $S^\prime \subseteq S$ of variables, such that $\left| S^\prime \right| \leq k$ and the joint entropy $H( S^\prime )$ is maximized.

Entropy of random variables seems to behave, algebraically, like the magnitude of a set. There is probably some formal way of stating this, like "they are both ((some abstract algebraic structure here))" or something, but this is what I can say for now: let $S(x)$ be a hypothetical set representation of random variable $x$ such that $|S(x)|=H(x)$ the magnitude of $S(x)$ equals the entropy of $x$. Maybe $|S(x)|$ isn't really a magnitude but some other notion of "volume" ( needs details )
Entropy is like the size of a set : $H(x)\sim|S(x)|$
Joint entropy is like union : $H(x,y)\sim|S(x)\cup S(y)|$
Mutual information is like intersection : $I(x;y)\sim|S(x)\cap S(y)|$
Conditional entropy is like set-minus : $H(x|y)\sim|S(x)\setminus S(y)|$
Repeatedly selecting the variable that would most increase joint entropy gives a greedy algorithm for finding $S^\prime$ with large joint entropy. Maybe a result analogous to "greedy selection is the best polynomial time approximation for maximum cover" holds for greedy maximization of joint entropy? ( It does: joint entropy is a monotone submodular set function, so the Nemhauser–Wolsey–Fisher guarantee applies and greedy selection achieves at least a $(1-1/e)$ fraction of the optimal joint entropy. )
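The greedy entropy-maximization step can be sketched directly; here joint entropies are empirical, estimated from samples, and the three toy variables (two copies of one coin flip plus an independent flip) are my own illustration:

```python
# Greedy joint-entropy maximization: at each step add the variable whose
# inclusion most increases the empirical joint entropy of the selection.
from collections import Counter
from math import log2

def joint_entropy(samples, idxs):
    """Empirical joint entropy (bits) of the variables in `idxs`,
    where `samples` is a list of tuples (one tuple per observation)."""
    counts = Counter(tuple(row[i] for i in idxs) for row in samples)
    n = len(samples)
    return -sum(c / n * log2(c / n) for c in counts.values())

def greedy_select(samples, k):
    chosen = []
    remaining = set(range(len(samples[0])))
    for _ in range(k):
        best = max(remaining, key=lambda j: joint_entropy(samples, chosen + [j]))
        chosen.append(best)
        remaining.remove(best)
    return chosen

# Three variables: x0 and x1 are copies of each other, x2 is independent.
samples = [(a, a, b) for a in (0, 1) for b in (0, 1)]
picked = greedy_select(samples, 2)
# greedy avoids the redundant copy: it picks x2 plus one of {x0, x1}
```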

Let's explore $H(x)\sim|S(x)|$ a bit more. Is it really possible to think of random variables as sets?

Say I have a collection $X=\{x_1,x_2,\ldots,x_n\}$ of independent random variables with unit entropy $H(x_i)=1$. If I create a collection $S=\{S_1,S_2,\ldots,S_m\}$ of subsets $S_i\subseteq X$, I know that the joint entropy of each set equals its size, $H(S_i)=|S_i|$, since each element $x\in S_i$ is an independent random variable with unit entropy. Under this construction, maximum cover for $S$ and maximizing the joint entropy of $\bigcup_{S_i\in S^\prime} S_i$ are equivalent.
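A quick numerical check of this construction, using fair coin flips as the unit-entropy variables; the number of variables, the subsets, and the sample count are arbitrary choices:

```python
# Independent fair coin flips have one bit of entropy each, so the joint
# entropy of any subset equals its size; the max-coverage instance and
# the joint-entropy instance coincide under this construction.
import random
from collections import Counter
from math import log2

random.seed(0)
n_samples = 50_000
# four independent unit-entropy variables x0..x3
samples = [tuple(random.randint(0, 1) for _ in range(4))
           for _ in range(n_samples)]

def joint_entropy(samples, idxs):
    counts = Counter(tuple(row[i] for i in idxs) for row in samples)
    n = len(samples)
    return -sum(c / n * log2(c / n) for c in counts.values())

S1, S2 = {0, 1}, {1, 2, 3}                 # subsets of variable indices
h1 = joint_entropy(samples, sorted(S1))          # ~ |S1| = 2 bits
h2 = joint_entropy(samples, sorted(S2))          # ~ |S2| = 3 bits
h_union = joint_entropy(samples, sorted(S1 | S2))  # ~ |S1 u S2| = 4 bits
```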

It isn't clear that I can go in the reverse direction and argue that this equivalence holds for any collection of random variables.

super handwave ( I need some measure theory / analysis / math majors to proceed ) :

Consider some sort of limiting case where the sets are actually subsets of the reals ( with some nice property, like .. measurable or compact? or something? ) $S_i\subseteq \mathbb{R}$, so that you can have real-valued entropies corresponding to the volumes (measures) of these sets, $H(S_i)=|S_i|$. Unions, intersections, and set-minus applied to the original collection generate a collection of disjoint "atoms"; treating the volume of each atom as a weight turns joint-entropy maximization into a weighted set-cover problem over these atoms. If I can get away with this, then I can also easily convert maximizing joint entropy to maximizing mutual information, by considering only the parts of $S_i$ that intersect with some target variable $A$, such that $I(A;S_i)=|S(A)\cap S(S_i)|$. If this is possible it probably doesn't involve taking actual subsets of the reals, just proving things based on the algebraic behavior of information. Something like this has almost certainly been investigated somewhere and I just need to track it down and understand it. ( It resembles Yeung's $I$-measure construction, which makes the set–information correspondence formal, though with a signed measure that can go negative for three or more variables. )


Mutual Information Does not Necessarily Predict Decoding Accuracy

( Ignore, personal notes )

Entropy and mutual information are increasingly used for analysis in neuroscience. Frequently, spike times and continuous variables are binned into discrete processes. This avoids certain conceptual problems with reasoning about information on continuous variables, but can add its own complications. Mutual information between a spike train and a stimulus does not necessarily predict measures of decoding accuracy like correlation or root mean squared error (RMSE). While we may know that there are N bits of mutual information between a count process (derived from spiking data) and a discrete variable (derived from a continuous stimulus), we do not know how "important" this information is.

The "importance" of information is fixed by the experimenter, and may represent a prejudiced expectation of what the neuron encodes, or a practical constraint of the experiment. For instance, in brain-machine interface (BMI) research, we are interested in reconstructing how the arm moves (kinematics) from neural recordings. A good decoding will be highly correlated with the measured kinematics, or will minimize the RMSE between the decoding and the measurement. We place more importance on information that pertains to large-amplitude features of the kinematics, but the mutual information between discrete random variables does not necessarily capture this importance.

This is intuitive when considering binary representations of integers. Consider two discrete random variables A and B that generate N bit integers. Say that K bits in A and B are always the same, and that the remaining N-K bits are independent. In this case, A and B will have K bits of mutual information $I(A;B)=K$. What do we know about $|A-B|$, the absolute difference between these two processes ?

If A and B share the highest order bits, then errors will be limited to the N-K lower order bits and the error is bounded as $|A-B|\sim O(2^{N-K})$. However, if A and B share their low order bits, while the high order bits are independent, then the magnitude of $|A-B|$ will scarcely differ from the case where all bits of A and B are independent. Mutual information does not tell you the magnitude of the impact of the shared information on the values of A and B.
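An exhaustive computation makes the gap concrete; the choice of N = 8 and K = 4 here is purely illustrative:

```python
# With N-bit integers A and B that share exactly K bits, the mutual
# information is K bits either way, but the expected decoding error
# E|A - B| depends on WHICH bits are shared.
from itertools import product

N, K = 8, 4

def expected_abs_diff(shared_high):
    """E|A-B| when A and B share K bits (high-order if shared_high,
    else low-order) and the other N-K bits are independent."""
    total, count = 0, 0
    free = N - K
    for shared in range(2 ** K):
        for a_free, b_free in product(range(2 ** free), repeat=2):
            if shared_high:
                a = (shared << free) | a_free
                b = (shared << free) | b_free
            else:
                a = (a_free << K) | shared
                b = (b_free << K) | shared
            total += abs(a - b)
            count += 1
    return total / count

err_high = expected_abs_diff(True)   # shared high bits: small errors
err_low = expected_abs_diff(False)   # shared low bits: large errors
# err_low is exactly 2**K times err_high in this construction
```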

A population of cells in the brains of rats ( and probably primates ) has receptive fields that would seem to suffer from this decoding issue. Grid cells in the entorhinal cortex are known to encode a rat's position in the environment. These cells have spatially periodic receptive fields. If you were to listen to a grid cell, and place a point on a map representing the animal's location each time the cell fired, you would see a hexagonal pattern of "bumps". Different grid cells represent different spatial frequencies ( larger / smaller bumps ) or different phases ( slightly shifted hexagonal grids of bumps ). Decoding the position of an animal from grid cell activity is much like decoding the value of an integer from its binary representation.

Say I can record from some grid cells. If these cells contain a range of spatial scales, I can figure out where the animal is: start with the cell that has the largest-period receptive field to exclude some areas of the room, then narrow down the position using cells with progressively finer spatial scales. However, if I only have cells with small, high-spatial-frequency maps, I may be able to restrict the animal's location to a grid-like collection of possibilities, but this information is not particularly useful if the gross location of the animal is missing.

It is unclear to me whether decoding kinematics could suffer from a similar problem. At the very least, we know that the lower bounds on decoding accuracy for a given mutual information are quite bad, with the grid cell encoding as a worst case scenario. In practice, decoding accuracy might correlate quite well with mutual information.

It seems like some of my confusion might be cleared up by the concept of distortion in information theory. Rate–distortion theory gives a relationship between distortion and mutual information that holds for both discrete and continuous random variables.

Particularly confused / speculative / handwave part :

Consider histograms with N equal-mass bins ( as opposed to equal width ). These bins can be enumerated by N integers, each $\log_2(N)$ bits long. Each additional bit of information excludes half of the remaining bins. However, this half could be "all points larger than 10", or it might be "all points $n$ s.t. $n\%2=0$". When it comes to homing in on the position of a point in space, the former is more useful and will reduce error, while the latter barely reduces error at all.
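A small check of that claim, assuming 16 equal-mass bins indexed 0..15 and measuring error as the expected absolute deviation from the posterior mean:

```python
# Two ways to "learn one bit" about a uniform value on 16 bins: the
# high-order bit (a half-space: value < 8) vs the parity bit (every
# other bin). Both cut the posterior to 8 bins, i.e. one bit of
# information, but only the half-space constraint reduces the expected
# absolute error of the posterior-mean estimate.
def mean_abs_error(support):
    """Expected |X - E[X]| for X uniform on `support`."""
    m = sum(support) / len(support)
    return sum(abs(x - m) for x in support) / len(support)

values = range(16)
prior_err = mean_abs_error(list(values))                 # no information
high_err = mean_abs_error([x for x in values if x < 8])  # high bit = 0
parity_err = mean_abs_error([x for x in values if x % 2 == 0])  # even

# prior_err = 4.0, high_err = 2.0, parity_err = 4.0: the parity bit
# excludes half the bins yet does not help locate the value at all.
```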

Perhaps considering the information present in a collection of histograms, ordered from most to least coarse, could reveal the relative contribution of "bits" to different magnitudes of error? But then, why not consider the Fourier decomposition in the stimulus-value domain, or all arbitrary partitions of the sample space? When the sample space is continuous, one might be tempted to take ever finer partitions, which causes the entropy of the discretized distribution to diverge. At this point, differential entropy looks like it might be much better than discrete entropy for some applications.


Some Functions

(ignore this, personal notes)

I keep forgetting this and having to re-derive it, so I'm putting it up here. Exponential decay is a very simple response function. To a first approximation, the response function of a synapse might be modeled as an exponential function. The exponential isn't perfect, however, since synapses don't react immediately. Sometimes we use a function of the form $t\exp(-t)$, which has a rising phase as well as a falling phase. This function is actually the result of convolving two exponential decay functions together. Generally, the family $t^n\exp(-t)$ captures the idea of having $n$ feed-forward variables coupled by exponential decay, $X_n'=X_{n-1}-X_n$. However, I keep forgetting the appropriate constants to convert between the system of differential equations and the family of response functions. I'm not going to derive that right now, but rather show a series of different parameterizations of the response functions so I can remember how the response changes as $n$ varies.
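The cascade-to-response-function correspondence can at least be checked numerically. Assuming the convention $X_1' = -X_1$ with $X_1(0)=1$ and $X_k' = X_{k-1} - X_k$ for later stages, the closed form is $X_k(t) = t^{k-1}e^{-t}/(k-1)!$ ( so the $n$ in $t^n e^{-t}$ is the stage index minus one ). A crude forward-Euler sketch of that claim:

```python
# Forward-Euler integration of the feed-forward exponential cascade,
# compared against the closed-form solution t**(k-1) * exp(-t) / (k-1)!.
from math import exp, factorial

def cascade(n_stages, t_end, dt=1e-4):
    """Integrate X_1' = -X_1, X_k' = X_{k-1} - X_k from X = (1, 0, ..., 0)."""
    x = [1.0] + [0.0] * (n_stages - 1)
    t = 0.0
    while t < t_end:
        dx = [-x[0]] + [x[k - 1] - x[k] for k in range(1, n_stages)]
        x = [xi + dt * dxi for xi, dxi in zip(x, dx)]
        t += dt
    return x

def closed_form(k, t):
    return t ** (k - 1) * exp(-t) / factorial(k - 1)

t_end = 3.0
numeric = cascade(4, t_end)
exact = [closed_form(k, t_end) for k in range(1, 5)]
# the two agree to within the Euler step error
```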

The integral of $t^n\exp(-t)$ actually gets larger for larger $n$: $\int_0^\infty t^n e^{-t}\,dt = n!$. To keep the integral equal to 1, divide by $n!$.

The peak response of $t^n\exp(-t)$ arrives at time $t=n$. Rescale time with $t:=nt$ to get a peak response at $t=1$.

Rescaling time changes the integral again ( it becomes $1/n$ ), but this is easily fixed by multiplying by $n$.
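Putting the three corrections together gives $g_n(t) = n\,(nt)^n e^{-nt}/n!$, with unit integral and peak at $t=1$. A numerical sanity check ( the integration grid and truncation point are arbitrary ):

```python
# Verify that g_n(t) = n * (n*t)**n * exp(-n*t) / n! integrates to 1
# and peaks at t = 1, for a few values of n.
from math import exp, factorial

def g(n, t):
    """Normalized response function: peak at t = 1, unit integral."""
    return n * (n * t) ** n * exp(-n * t) / factorial(n)

def trapz(f, a, b, steps=50_000):
    """Composite trapezoid rule for integrating f over [a, b]."""
    h = (b - a) / steps
    total = 0.5 * (f(a) + f(b))
    total += sum(f(a + i * h) for i in range(1, steps))
    return total * h

areas = {}
for n in (1, 3, 6):
    # integrate out to t = 40/n, where the tail is negligible
    areas[n] = trapz(lambda t: g(n, t), 0.0, 40.0 / n)
# each area is ~1, and g(n, 1) exceeds its neighbors for every n
```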

Notice how the curves are getting progressively more smooth for larger $n$. I wonder what this looks like on a log axis ?

So, this family of functions appears to tend toward a log-Gaussian in the limit. I don't know if that's particularly meaningful.