20100420

Notes on detecting repeated sequences in data

how to statistically extract sequences from a firing rate model ?

note that certain traveling plane waves ... are somewhat organized, but not necessarily repeating. it may be possible to get structured propagation that is not repeating for other network topologies ?

sequence autocorrelation can detect repeated firing patterns in a single unit.

translate "sequence" to "sequence of population rate vectors" and I think you're good. Besides, to define autocorrelation in time you need to take a window of T time-steps.

this could probably be done more quickly with bit-ops on a boolean spike vector


match vector size :
T timesteps
N units

number of matches :
K timesteps, squared

K*K*T*N is a bad number to be looking at.

pairwise metrics are many :
angle between vectors
Pearson's coefficient
RMS distance

we don't really want pearson's coefficient,maybe some sort of distance metric would be good.

we don't want pearson's coefficient ( propotional to dot product of Z scored ) if we want the mean/varience of the population to be important.

what about repeated subsequences ? like, if some fraction M of N units are participating in a repeating chain ? If the rest of the population in noisy it will be hard to detect this. Searching for subsets... might be bad

rememeber to check if your results could be generated by random chance

lets try RMS distance

you'll want to estimate probability of collisions ( again, easier on a bit vector )

grumble.. shall we try some point process modeling ?

motifs are low energy points in sequence space, orbits pass through them statistically more often ( gibb's sampleing ? energy of sequence can be inferred by its probability and vice versa )

can you look at the connectivity graph and assign energies to various sequences ? this would be good. It can probably be done for random walks. Can it be done for other systems ? Feed foreward networks with global uniform inhibition (normalized probability ) ... these look a lot like random walks.

if the inhibition is structured, or you need to use attenuation or other slow inhibitory process, I'm not really sure what to do.

ok, I like the idea of RMS distance (best i can think of). How do you simplify this computationally ? I can skip the "root" part, and the normalization ( un-normalized sum of squared deviations ), but this is the least computationally intensive part. Luckily, computing sum squared deviations can be parallelized fairly easily

trivially I can do it in either depth T*N
or depth K*K

if I estimate the time constants of the computation I can write an algorithm that switches strategy based on the shape of the problem. The T*N depth problem might use less memory, the K*K problem will rely on PyCuda built in reduction framework and that will result in annoying data copying (would have to write in-place reduction.. blech, no time for that )

Yeah, parallelize on K*K ( T*N depth ) .. this is the best.

Hmm... I think ... no, can I simplify this by doing pairwise time step differences first ?
hmm.. this generates a different K*K matrix, probably just need to do something simple...
like box convolve (2D) the output.

yes thats a simple plan.

seems like something is wrong though...t+1 doesn't line up like that, um...

yeah
something like this

def GPUPointAutoDistance(t,k,n,data):
'''
t=length of data in time
k=number of time bins to use
n=size of vector datapoints
data=t*n data matrix, n is inner dimension
'''
kern1=kernel('float *data, int n, int t, float *out',
'''
int outx = tid%t;
int outy = tid/t;
float *inx = &data[outx];
float *iny = &data[outy];
float sumsq = 0.0f;
for (int j=0; j
float a=inx[j]-iny[j];
sumsq+=a*a;
}
out[tid]=sumsq/n;
''')
kern2=kernel('float *in, int k, int w, int t, float *out',
'''
int index = (tid%w)+(tid/w)*t;
float sumsq = 0.0f;
for (int i=0; i
sumsq+=in[index];
index+=1+t;
}
out[tid]=__sqrtf(sumsq/k);
''')

yes.. this has complexity
K*K*(N+T), slightly better (much much better for large N)
if I have 2**16 units how much time can I store on the card ?
2**32 max
t*(4*2**16)+t*t*4
t*(2**18+t*4)==2**32
t*2**18+t*t*4-2**32==0
-2**32
13572.950011841582
thats not bad.


No comments:

Post a Comment