20120113

Equivalence of Statistics on a Pair of Gaussian Channels

For a pair of Gaussian channels (continuous random variables whose values follow a normal distribution), the mutual information, correlation, root mean squared error, and signal-to-noise ratio are all equivalent and can be computed from each other. Without loss of generality, we restrict this discussion to zero-mean, unit-variance channels. This discussion elaborates on the treatment of mutual information between Gaussian channels presented in the third chapter of Spikes.

Correlation & Mutual Information

Consider a single Gaussian channel $y = g x + n$, where $x$ is the input, $y$ is the output, $g$ is the gain, and $n$ is additive Gaussian noise. Without loss of generality, assume that $x$, $n$, and $y$ have been converted to z-scores, so that all random variables have zero mean and unit variance. Reconstructed z-scores can always be mapped back to the original Gaussian variables by multiplying by the original standard deviations and adding back the original means. With this normalization, we need separate gains for the signal and noise, say $a$ and $b$:
\[y = a x + b n\]
Since the signal and noise are independent, their variances add:
\[\sigma^2_{y} = \sigma^2_{a x} + \sigma^2_{b n}\] and the gain parameters can be factored out
\[\sigma^2_{y} = a^2 \sigma^2_{x} + b^2 \sigma^2_{n}.\]
Since $\sigma^2_{y}=\sigma^2_{x}=\sigma^2_{n}=1$,
\[a^2+b^2=1\]
This can be parameterized as
\[\sigma^2_{y} = \alpha \sigma^2_{x} + (1-\alpha) \sigma^2_{n},\,\,\alpha=a^2\in[0,1]\]
and
\[y = x\sqrt{\alpha} + n\sqrt{1-\alpha}\]
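As a sanity check, here is a minimal simulation of this unit-variance channel (a sketch, not from the original post; the value of $\alpha$ is arbitrary), confirming that $y$ comes out with unit variance:

```python
# Minimal sketch: simulate y = sqrt(alpha)*x + sqrt(1-alpha)*n with z-scored
# signal and noise, and confirm that y has (approximately) unit variance.
import numpy as np

rng = np.random.default_rng(0)
alpha = 0.75                 # illustrative signal-power fraction, alpha = a^2
N = 100_000

x = rng.standard_normal(N)   # z-scored signal
n = rng.standard_normal(N)   # z-scored noise, independent of x
y = np.sqrt(alpha) * x + np.sqrt(1 - alpha) * n

print(np.var(y))             # ~1.0, because alpha + (1 - alpha) = 1
```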

The relationship between mutual information $I$ and the signal-to-noise ratio $SNR$ comes from chapter 3 of Spikes.
\[I=\frac{1}{2}lg(1+\frac{\sigma^2_{a x}}{\sigma^2_{b n}})=\frac{1}{2}lg(1+SNR)\]
where $lg(\dots)$ is the base-2 logarithm.
The $SNR$ simplifies as:
\[SNR=\frac{\sigma^2_{a x}}{\sigma^2_{b n}}=\frac{\alpha \sigma^2_x}{(1-\alpha) \sigma^2_n}=\frac{\alpha}{1-\alpha}\]
Mutual information simplifies as:
\[I=\frac{1}{2}lg(1+SNR)=\frac{1}{2}lg{\frac{\sigma^2_y}{\sigma^2_{b n}}}=\frac{1}{2}lg{\frac{\sigma^2_y}{(1-\alpha)\sigma^2_n}}=\frac{1}{2}lg{\frac{1}{1-\alpha}}\]
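In code, these two formulas are one-liners (a sketch; the $\alpha$ below is the same illustrative value as in the snippet above):

```python
# SNR and mutual information (in bits) as functions of alpha, per the formulas above.
import numpy as np

def snr(alpha):
    return alpha / (1.0 - alpha)

def info_bits(alpha):
    return 0.5 * np.log2(1.0 + snr(alpha))   # equals -0.5 * log2(1 - alpha)

alpha = 0.75
print(snr(alpha))        # 3.0
print(info_bits(alpha))  # 1.0 bit, since 1 - alpha = 1/4
```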
The correlation $\rho$ is the standard Pearson product-moment correlation coefficient, which can be viewed as the cosine of the angle $\theta$ between the vectors defined by the samples of the random variables $x$ and $y$:
\[\rho=cos(\theta)=\frac{x \cdot y}{|x|\,|y|}\]

Since $x$ and $n$ are independent, the samples of $x$ and $n$ can be viewed as an orthogonal basis for the samples of $y$, where the weights of the components are just the previously defined $a$ and $b$, respectively. This relates our gain parameters to the correlation coefficient: the tangent of the angle between $y$ and $x$ is just the ratio of the noise gain to the signal gain:
\[tan(\theta)=\frac{b}{a}=\frac{\sqrt{1-\alpha}}{\sqrt{\alpha}}\]
Then $tan(\theta)$ can be expressed in terms of the correlation coefficient $\rho$:
\[tan(\theta)=\frac{sin(\theta)}{cos(\theta)}=\frac{\sqrt{1-cos(\theta)^2}}{cos(\theta)}=\frac{\sqrt{1-\rho^2}}{\rho}\]
This gives the relationship $\sqrt{1-\alpha}/\sqrt{\alpha}=\sqrt{1-\rho^2}/\rho$, which implies that $\alpha=\rho^2$, or $a=\rho$. (There is a slight problem here in that correlation can be negative, but it is the magnitude of the correlation that really matters; from here on, "correlation" means the absolute value of the correlation.) This can be used to relate $\rho$ to $SNR$ and mutual information:
\[SNR=\frac{\rho^2}{1-\rho^2}\]
\[I=\frac{1}{2}lg{\frac{1}{1-\rho^2}}=-\frac{1}{2}lg(1-\rho^2)\]
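A quick numerical check of $\rho=\sqrt{\alpha}$ and $I=-\frac{1}{2}lg(1-\rho^2)$, reusing the simulated channel from the first snippet:

```python
# Empirical correlation of the simulated channel vs. sqrt(alpha), and the
# information estimated from that correlation.
import numpy as np

rng = np.random.default_rng(0)
alpha, N = 0.75, 100_000
x = rng.standard_normal(N)
n = rng.standard_normal(N)
y = np.sqrt(alpha) * x + np.sqrt(1 - alpha) * n

rho = np.corrcoef(x, y)[0, 1]
print(rho, np.sqrt(alpha))            # both ~0.866
print(-0.5 * np.log2(1 - rho**2))     # ~1 bit, matching 0.5*lg(1 + SNR)
```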
As a corollary, if $\phi=\sqrt{1-\rho^2}$ is the correlation of $y$ and the noise $n$, then the information is simply $I=-lg(\phi)$. The mean squared error ($MSE$) between $y$ and $x$ is also related:
\[MSE=(1-\rho)^2+(1-\rho^2)=1-2\rho+\rho^2+1-\rho^2=2(1-\rho)\]
which implies that
\[\rho=1-\frac{1}{2}MSE\]
and gives a relationship between mutual information and mean squared error:
\[I=-\frac{1}{2}lg(1-\rho^2)=-\frac{1}{2}lg(1-(1-MSE/2)^2)\]
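The same simulated channel can be used to check the $MSE$ relations (again a sketch, not the post's code):

```python
# Empirical mean squared error between y and x, the correlation recovered from
# it via rho = 1 - MSE/2, and the information recovered from the MSE.
import numpy as np

rng = np.random.default_rng(0)
alpha, N = 0.75, 100_000
x = rng.standard_normal(N)
n = rng.standard_normal(N)
y = np.sqrt(alpha) * x + np.sqrt(1 - alpha) * n

mse = np.mean((y - x) ** 2)
print(mse, 2 * (1 - np.sqrt(alpha)))             # both ~0.268
print(1 - 0.5 * mse, np.sqrt(alpha))             # rho recovered from MSE
print(-0.5 * np.log2(1 - (1 - 0.5 * mse) ** 2))  # ~1 bit again
```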

The relationships between correlation $\rho$, root mean squared error $RMSE$, information $I$, and signal-to-noise ratio $SNR$ are all monotonic, implying that correlation, $RMSE$, $SNR$, and mutual information all give the same quality ranking for a collection of channels. If a previous post I wrote holds, this implies that greedy selection of a subset of possible zero-mean, unit-variance Gaussian channels to be used in reconstruction is in some sense close to optimal, whether you use correlation, RMSE, SNR, or mutual information to rank the reconstruction quality.
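A small check of the ranking claim (the channel $\alpha$ values here are made up for illustration): sorting channels by correlation, SNR, information, or negative MSE gives the same order.

```python
# Because all of the maps between rho, SNR, MSE, and I are monotonic,
# ranking channels by any one of them yields the same ordering.
import numpy as np

alphas = np.array([0.10, 0.45, 0.80, 0.30, 0.95])
rho  = np.sqrt(alphas)
snr  = alphas / (1 - alphas)
info = -0.5 * np.log2(1 - alphas)
mse  = 2 * (1 - rho)

print(np.argsort(-rho))   # best channel first: [4 2 1 3 0]
print(np.argsort(-snr))   # same order
print(np.argsort(-info))  # same order
print(np.argsort(mse))    # same order (smaller MSE is better)
```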

Further Speculation

This can be generalized (as in chapter 3 of Spikes) to vector-valued Gaussian variables by transforming into a space where $Y=AX+BN$ is diagonal, treating each component independently, and then transforming back into the original space.
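For jointly Gaussian vectors this works out to the usual determinant formula; the sketch below (with made-up $A$, $B$, and identity covariances, not values from the post) computes the information both from determinants and by summing the scalar formula over the eigen-channels that diagonalize the problem:

```python
# Vector-valued sketch: Y = A X + B N with Gaussian X and N.
# I(X;Y) = 0.5 * log2( det(Sigma_Y) / det(B Sigma_N B^T) ), which equals the sum
# of 0.5 * log2(1 + lambda_i) over the independent channels after diagonalization.
import numpy as np

Sigma_X = np.eye(2)
Sigma_N = np.eye(2)
A = np.array([[0.8, 0.1],
              [0.0, 0.6]])
B = np.array([[0.5, 0.0],
              [0.2, 0.7]])

Sigma_signal = A @ Sigma_X @ A.T
Sigma_noise  = B @ Sigma_N @ B.T
Sigma_Y      = Sigma_signal + Sigma_noise

I_det = 0.5 * np.log2(np.linalg.det(Sigma_Y) / np.linalg.det(Sigma_noise))

# Per-channel form: the eigenvalues of Sigma_noise^{-1} Sigma_signal are the SNRs
# of the decoupled channels.
lam = np.linalg.eigvals(np.linalg.solve(Sigma_noise, Sigma_signal))
I_eig = 0.5 * np.sum(np.log2(1 + lam.real))

print(I_det, I_eig)   # identical up to floating point
```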

Similarly to how chapter 3 of Spikes generalizes the mutual information of a Gaussian channel into a bound on the mutual information of possibly non-Gaussian, vector-valued channels, these relationships can be generalized to inequalities for non-Gaussian channels:

\[I\geq-lg(\Phi)=-\frac{1}{2}lg(1-\Sigma^2)=-\frac{1}{2}lg(1-(1-MSE/2)^2)\]
where, for vector-valued variables, $\phi$, $\rho$, and $MSE$ become matrices $\Phi$, $\Sigma$, and $MSE$, respectively.

This does not describe what happens when you combine multiple decoders, which may or may not be Gaussian, and which is what happens when we greedily search for a subset of cells to use for decoding a single kinematic variable. It only describes reconstructing one random variable, possibly vector-valued, with no treatment of how the reconstruction is done or what happens when we add or remove channels. This will require further investigation, but I would start with the idea that, when you use two decoders, the total mutual information is the sum of the individual mutual informations, minus the three-way mutual information between the two decoders and the decoded variable:
\[I(X;(Y_1,Y_2))=I(X;Y_1)+I(X;Y_2)-I(X;Y_1;Y_2)\]
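As a hedged illustration of this decomposition for the jointly Gaussian case, the snippet below builds two unit-variance decoders of the same $x$ ($y_i=a_i x + b_i n_i$ with independent noise and arbitrarily chosen $a_i$), computes $I(X;(Y_1,Y_2))$ from covariance determinants, and reads off the redundancy term as the gap between the sum of the individual informations and the joint information:

```python
# Two Gaussian decoders of the same x: the joint information comes from
# det(Sigma_Y) / det(Sigma_{Y|X}); the redundancy I(X;Y1;Y2) is then the amount
# by which the naive sum I(X;Y1) + I(X;Y2) overcounts the joint information.
import numpy as np

a1, a2 = 0.9, 0.8                                  # illustrative signal gains

Sigma_Y      = np.array([[1.0,     a1 * a2],
                         [a1 * a2, 1.0    ]])      # cov of (y1, y2)
Sigma_Y_cond = np.diag([1 - a1**2, 1 - a2**2])     # cov of (y1, y2) given x

I_joint = 0.5 * np.log2(np.linalg.det(Sigma_Y) / np.linalg.det(Sigma_Y_cond))
I_1 = -0.5 * np.log2(1 - a1**2)
I_2 = -0.5 * np.log2(1 - a2**2)

redundancy = I_1 + I_2 - I_joint                   # the I(X;Y1;Y2) term above
print(I_joint, I_1 + I_2, redundancy)              # redundancy > 0 here
```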

The redundancy between the multiple channels might be well approximated by the matrix of self-information, or auto-correlation, of the channels (cells). I expect that, for collections of Gaussian variables, PCA will find an informative reduced-dimension set of cells, and that there will be some relationship between PCA and ICA that resembles the relationship between correlation and information discussed in the first section.

It may also be possible to generalize to multiple time lags simply by adding time-shifted copies of the cell population as new variables, and reducing the dimension to remove the redundancy introduced.
