## 20110429

### Is Maximizing Mutual Information, or Information, Equivalent to the Maximum Coverage Problem?

( Ignore, Personal notes )
I need some measure theory / analysis / math help to proceed:

Update: this question has since been answered in depth at MathOverflow.

Say you have $N$ cells. Say also that you would like to construct the best possible decoder using only $K$ of these cells. How can we find the $K$ best cells?

This problem appears superficially similar to the maximum coverage problem.

(From Wikipedia) Formally, the (unweighted) maximum coverage problem:
Given a number $k$ and a collection of sets $S=\{S_1,S_2,\ldots,S_m\}$, where $S_i \subseteq \left\{e_1, e_2, \ldots, e_n \right\}$, find a subset $S^\prime \subseteq S$ of sets, such that $\left| S^\prime \right| \leq k$ and the number of covered elements $\left| \bigcup_{S_i \in S^\prime}{S_i} \right|$ is maximized.
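For concreteness, here is a minimal Python sketch of the greedy algorithm for (unweighted) maximum coverage, run on a made-up toy instance; at each step it picks the set that covers the most not-yet-covered elements:

```python
# Greedy approximation for unweighted maximum coverage.
# The instance below is a hypothetical toy example.
def greedy_max_coverage(sets, k):
    covered = set()
    chosen = []
    for _ in range(k):
        # Pick the set with the largest marginal gain in coverage.
        best = max(sets, key=lambda s: len(s - covered))
        if not (best - covered):
            break  # nothing new can be covered
        chosen.append(best)
        covered |= best
    return chosen, covered

sets = [{1, 2, 3}, {3, 4}, {4, 5, 6}, {1, 6}]
chosen, covered = greedy_max_coverage(sets, 2)
```

Here the greedy choice first takes `{1, 2, 3}` (gain 3), then `{4, 5, 6}` (gain 3), covering all six elements.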

A very hand-wavy argument:

We know that greedy selection approximates maximum coverage as well as can be hoped in polynomial time. Perhaps selecting the $K$ best cells can be re-phrased as a maximum coverage problem? Forget about mutual information for the moment and just consider entropy, or information. Given a number $k$ and a collection $S=\{S_1,S_2,\ldots,S_m\}$ of random variables, I would like to find a subset $S^\prime \subseteq S$ of variables, such that $\left| S^\prime \right| \leq k$ and the joint entropy $H( S^\prime )$ is maximized.

Entropy of random variables seems to behave, algebraically, like the measure of a set. There is probably some formal way of stating this, like "they are both ((some abstract algebraic structure here))" or something, but this is what I can say for now: let $S(x)$ be a hypothetical set representation of a random variable $x$ such that $|S(x)|=H(x)$, i.e. the measure of $S(x)$ equals the entropy of $x$. Maybe $|S(x)|$ isn't really a cardinality but some other notion of "volume" ( needs details ).
Entropy is like the size of a set: $H(x)\sim|S(x)|$
Joint entropy is like union: $H(x,y)\sim|S(x)\cup S(y)|$
Mutual information is like intersection: $I(x;y)\sim|S(x)\cap S(y)|$
Conditional entropy is like set-minus: $H(x|y)\sim|S(x)\setminus S(y)|$
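These correspondences can be checked numerically on any small discrete distribution. A quick Python sanity check (the joint distribution below is made up) verifies the inclusion–exclusion identities $I(x;y)=H(x)+H(y)-H(x,y)$ and $H(x|y)=H(x,y)-H(y)$, exactly as if $H$ were the measure of a set:

```python
from collections import Counter
from math import log2

# Hypothetical toy joint distribution over (x, y): two correlated bits.
samples = [(0, 0), (0, 0), (0, 1), (1, 1)]

def H(key):
    """Empirical entropy (in bits) of key(sample) over the samples."""
    counts = Counter(key(s) for s in samples)
    n = len(samples)
    return -sum(c / n * log2(c / n) for c in counts.values())

Hx = H(lambda s: s[0])
Hy = H(lambda s: s[1])
Hxy = H(lambda s: s)
I = Hx + Hy - Hxy        # intersection: |A ∩ B| = |A| + |B| - |A ∪ B|
Hx_given_y = Hxy - Hy    # set-minus:    |A \ B| = |A ∪ B| - |B|
```

For this distribution $H(y)=1$, $H(x,y)=1.5$, $H(x|y)=0.5$, and $I(x;y)>0$, matching the set-algebra picture.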
Repeatedly selecting the variable that would most increase the joint entropy gives a greedy algorithm for finding an $S^\prime$ with large joint entropy. Maybe a result analogous to "greedy selection is the best polynomial-time approximation for maximum coverage" holds for greedy maximization of joint entropy? ( In fact, joint entropy is a monotone submodular function of the chosen set of variables, so the standard $(1-1/e)$ greedy guarantee for monotone submodular maximization under a cardinality constraint applies. )
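A sketch of that greedy procedure in Python, with entropies estimated from samples. The "cells" here are fabricated, including a deliberately redundant copy of one cell; picking by joint-entropy gain automatically skips the redundant cell:

```python
import random
from collections import Counter
from math import log2

random.seed(0)
# Hypothetical cells: two independent random bits, plus an exact copy of
# the first (so "a_copy" carries no information beyond "a").
n_samples = 2000
a = [random.randint(0, 1) for _ in range(n_samples)]
b = [random.randint(0, 1) for _ in range(n_samples)]
cells = {"a": a, "a_copy": a, "b": b}

def joint_entropy(names):
    """Empirical joint entropy (bits) of the named cells."""
    rows = Counter(zip(*(cells[n] for n in names)))
    return -sum(c / n_samples * log2(c / n_samples) for c in rows.values())

def greedy_select(k):
    chosen, remaining = [], set(cells)
    for _ in range(k):
        # Pick the cell with the largest joint-entropy gain.
        best = max(remaining, key=lambda c: joint_entropy(chosen + [c]))
        chosen.append(best)
        remaining.remove(best)
    return chosen

picked = greedy_select(2)
```

With $k=2$ the greedy picks `b` plus one of `a` / `a_copy`, never the redundant pair, since adding the copy contributes (almost) no additional entropy.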

Let's explore $H(x)\sim|S(x)|$ a bit more. Is it really possible to think of random variables as sets?

Say I have a collection $X=\{x_1,x_2,\ldots,x_n\}$ of independent random variables with unit entropy $H(x_i)=1$. If I create a collection $S=\{S_1,S_2,\ldots,S_m\}\,\,\text{s.t.}\,\,S_i\subseteq X$ of subsets of $X$, I know that the joint entropy of each set equals its size, $H(S_i)=|S_i|$, since each element $x\in S_i$ is an independent random variable with unit entropy. Under this construction, maximum coverage for $S$ and maximizing the joint entropy of $\bigcup_{S_i\in S^\prime} S_i$ are equivalent.
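This construction can be verified directly in Python by enumerating the full joint distribution of $n$ independent fair bits, so that every entropy is exact rather than estimated: each $H(S_i)$ equals $|S_i|$, and the joint entropy of a union of subsets equals the size of the union. The particular subsets are arbitrary:

```python
import itertools
from collections import Counter
from math import log2

# n independent fair bits x_0 .. x_{n-1}; enumerating all 2^n equally
# likely outcomes makes every entropy below exact, not estimated.
n = 4
outcomes = list(itertools.product([0, 1], repeat=n))

def joint_entropy(indices):
    counts = Counter(tuple(o[i] for i in indices) for o in outcomes)
    m = len(outcomes)
    return -sum(c / m * log2(c / m) for c in counts.values())

S = [{0, 1}, {1, 2}, {2, 3}, {0, 3}]  # arbitrary subsets of the variables
for Si in S:
    assert abs(joint_entropy(sorted(Si)) - len(Si)) < 1e-9  # H(S_i) = |S_i|

# Joint entropy of a union of subsets equals the size of the union, so on
# this construction maximizing coverage and maximizing joint entropy agree.
union = set().union(S[0], S[1])
```

Here `union` is $\{0,1,2\}$ and its joint entropy is exactly 3 bits, i.e. the number of covered elements.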

It isn't clear that I can go in the reverse direction and argue that this equivalence holds for any collection of random variables.

Super handwave ( I need some measure theory / analysis / math majors to proceed ):

Consider some sort of limiting case where the sets are actually subsets of the reals ( with some nice property, like measurable or compact? or something? ), $S_i\subseteq \mathbb{R}$, so that you can have real-valued entropy corresponding to the measures of these sets, $H(S_i)=|S_i|$. From a collection of independent elements, the disjoint pieces formed by union, intersection, and set-minus would let you construct a weighted set-cover analogue of joint-entropy maximization.

If I can get away with this, then I can also easily convert maximizing joint entropy to maximizing mutual information, by considering only the parts of each $S_i$ that intersect with some target variable $A$, such that $I(A;S_i)=|S(A)\cap S(S_i)|$. If this is possible, it probably doesn't involve taking actual subsets of the reals, just proving things based on the algebraic behavior of information. Something like this has almost certainly been investigated somewhere, and I just need to track it down and understand it.
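The mutual-information version can at least be computed via the inclusion–exclusion form $I(A;S_i)=H(A)+H(S_i)-H(A,S_i)$. A hypothetical Python check, with a target $A = x_0 \oplus x_1$ over three independent fair bits, also hints at why the reverse direction above is delicate: $A$ shares no information with any single bit, yet one full bit with the pair, an "intersection" that only appears once both bits are in the chosen set:

```python
import itertools
from collections import Counter
from math import log2

# Three independent fair bits; enumerate all outcomes so entropies are exact.
outcomes = list(itertools.product([0, 1], repeat=3))

def entropy(f):
    counts = Counter(f(o) for o in outcomes)
    m = len(outcomes)
    return -sum(c / m * log2(c / m) for c in counts.values())

A = lambda o: o[0] ^ o[1]  # hypothetical target variable: XOR of two bits

def mi(subset):
    """I(A; x_subset) via inclusion-exclusion: H(A) + H(S) - H(A, S)."""
    idx = sorted(subset)
    return (entropy(A)
            + entropy(lambda o: tuple(o[i] for i in idx))
            - entropy(lambda o: (A(o),) + tuple(o[i] for i in idx)))
```

Here `mi({0})` and `mi({1})` are both 0 bits, while `mi({0, 1})` is 1 bit: no fixed "set for $A$" intersects each $S(x_i)$ individually in a way that adds up, which is exactly the kind of obstruction the set analogy has to survive.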