
I have a series of n=400 sequences of varying length containing the letters ACGTE. For example, the probability of having C after A is:

$P(C \mid A) = \Pr(X_{t+1} = C \mid X_t = A)$

which can be estimated from the set of empirical sequences as

$\hat{P}(C \mid A) = \dfrac{n_{AC}}{\sum_{j \in \{A,C,G,T,E\}} n_{Aj}}$

where $n_{AC}$ is the number of observed transitions from A to C.
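For instance, from a single hypothetical toy sequence such as ACACGE, the transitions out of A are A→C twice, so $n_{AC} = 2$ and $\hat{P}(C \mid A) = 2/2 = 1$.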

Assuming the sequences satisfy the first-order Markov property, i.e. $\Pr(X_{t+1} \mid X_t, X_{t-1}, \dots, X_1) = \Pr(X_{t+1} \mid X_t)$.

Then I get a transition matrix:

$\hat{P}$, a $5 \times 5$ matrix whose $(i,j)$ entry is the estimated probability of a transition from symbol $i$ to symbol $j$.

But I'm interested in calculating confidence intervals for $\hat{P}$; any thoughts on how I could go about it?

  • this is better suited to http://stats.stackexchange.com/ – Amro Jul 15 '13 at 22:05
  • I think the reason for it being here is to attract Matlab specialists, programmers and enthusiasts. The stats forum question (http://stats.stackexchange.com/questions/64309/what-is-the-relevance-of-bootstrapped-confidence-intervals-on-markov-chain-trans?noredirect=1#comment124036_64309) is not attracting any useful answers – HCAI Jul 16 '13 at 08:26
  • 1
  • ok good point. I just figured you'd get better explanations there; I am not a statistician myself :) – Amro Jul 16 '13 at 16:41

2 Answers


You could use bootstrapping to estimate confidence intervals. MATLAB provides the bootci function in the Statistics Toolbox. Here is an example:

%# generate a random cell array of 400 sequences of varying length
%# each containing indices from 1 to 5 corresponding to ACGTE
sequences = arrayfun(@(~) randi([1 5], [1 randi([500 1000])]), 1:400, ...
    'UniformOutput',false)';

%# compute transition matrix from all sequences
trans = countFcn(sequences);

%# number of bootstrap samples to draw
Nboot = 1000;

%# estimate 95% confidence interval using bootstrapping
ci = bootci(Nboot, {@countFcn, sequences}, 'alpha',0.05);
ci = permute(ci, [2 3 1]);

We get:

>> trans         %# 5x5 transition matrix: P_hat
trans =
      0.19747       0.2019      0.19849       0.2049      0.19724
      0.20068      0.19959      0.19811      0.20233      0.19928
      0.19841      0.19798       0.2021       0.2012      0.20031
      0.20077      0.19926      0.20084      0.19988      0.19926
      0.19895      0.19915      0.19963      0.20139      0.20088

and two other similar matrices containing the lower and upper bounds of confidence intervals:

>> ci(:,:,1)     %# CI lower bound
>> ci(:,:,2)     %# CI upper bound

I am using the following function to compute the transition matrix from a set of sequences:

function trans = countFcn(seqs)
    %# accumulate transition matrix from all sequences
    trans = zeros(5,5);
    for i=1:numel(seqs)
        trans = trans + sparse(seqs{i}(1:end-1), seqs{i}(2:end), 1, 5,5);
    end

    %# normalize into proper probabilities
    trans = bsxfun(@rdivide, trans, sum(trans,2));
end

As a bonus, we can use the bootstrp function to get the statistic computed from each bootstrap sample, which we can use to show a histogram for each entry of the transition matrix:

%# compute multiple transition matrices using bootstrapping
stat = bootstrp(Nboot, @countFcn, sequences);

%# display histogram for each entry in the transition matrix
sub = reshape(1:5*5,5,5)';   %# transpose: stat columns are in column-major order, subplot positions are row-major
figure
for i=1:size(stat,2)
    subplot(5,5,sub(i))
    hist(stat(:,i))
end

[Figure: bootstrap histograms, one subplot per entry of the transition matrix]

Amro
  • This is fantastic! Thank you so much! Exactly why it's not in the stats forum! Giving bounty as soon as available! However, I do have a concern that actually row entries to Phat are not univariate normal distributions because they cannot vary individually. In fact each row forms a multivariate normal distribution don't you think? e.g. as one entry increases, at least one other in the same row must necessarily decrease to maintain sum(P,2)=1 – HCAI Jul 16 '13 at 08:23
  • Wow, lots of effort given here! – Hugh Nolan Jul 16 '13 at 15:34
  • 1
  • Glad I could help. As for `P_hat` the rows should indeed sum to one (up to a certain precision). In the calculation function `countFcn`, I first accumulate the counts of co-occurrences from each sequence (using a less-known syntax of `sparse` function but we could have also used `accumarray`), then I divide by the row sums (the `bsxfun` call). One caveat: if a symbol doesn't occur in a sequence (say we only had `[1 2 4 5]`), then we could be dividing by zero, which would cause `NaN` values in the transition matrix. A common solution is to add +1 to the entire matrix before normalizing the rows – Amro Jul 16 '13 at 16:34
  • @HughNolan Indeed this is the most helpful answer I could have hoped for! And worth every bit of rep! :) – HCAI Jul 16 '13 at 17:07
  • @Amro Lovely! Is it possible to calculate the covariance of entries in the transition matrix P_hat? – HCAI Jul 16 '13 at 17:11
  • 1
  • @HCAI: you can compute that from the output of `bootstrp` above as: `C = cov(stat)`. Each row of `stat` is a flattened 5x5 transition matrix from one of the drawn samples. `cov` works on the columns of the input matrix; the result would be a 25x25 matrix corresponding to the covariances between each pair of entries of `P_hat` – Amro Jul 16 '13 at 18:33
  • 1
  • on another note, if you have the [PCT toolbox](http://www.mathworks.com/products/parallel-computing/), you can parallelize the computation of `bootci` and `bootstrp` (see the 'UseParallel' option in the docs) – Amro Jul 16 '13 at 18:37
  • I'll have a look at parallelising this too. I'm considering the fact that actually my P_hat is sparse and as such does not well approximate P even with a bootstrap method. Do you think I should open a new question about Laplace smoothing to help this? E.g. http://people.cs.umass.edu/~dasmith/inlp/lect5-cs585.pdf – HCAI Jul 17 '13 at 12:07
  • @HCAI: I think [Laplace smoothing](http://en.wikipedia.org/wiki/Additive_smoothing) is just that, adding a small *lambda* to smooth the probabilities (the `+1` I mentioned above in the comments; see the sketch after these comments). This is a sort of [regularization](http://en.wikipedia.org/wiki/Regularization_%28mathematics%29) if you think about it, which should also prevent [overfitting](http://en.wikipedia.org/wiki/Overfitting). – Amro Jul 17 '13 at 18:03
  • ... Now I don't know about the domain background, but perhaps you could also try a more restricted MC structure which is not fully connected, where not all transitions between states are permitted. For example a left-to-right model where state sequences are only allowed to move right (or stay the same) as time increases, but never go back... As I said, I don't know if this is applicable for DNA sequences, just something to consider :) – Amro Jul 17 '13 at 18:03
  • @Amro Hmmm, this chain is actually attempting to represent nurses touching surfaces when in a patient's room (ACGTE attracts attention better). So technically all transitions are physically possible. However the number of observed sequences is clearly not big enough to account for this. So the $\lambda$ idea is worth pursuing... But technically this Markov chain's transition matrix can be represented by a multivariate normal distribution with `mean(trans,3)` and `cov(stat)`? – HCAI Jul 18 '13 at 12:12
  • @HCAI: oops, of course (after all there is no "E" in [DNA](http://en.wikipedia.org/wiki/ACGT)) :) Anyway, since we have [discrete](http://en.wikipedia.org/wiki/Discrete_distribution#Discrete_probability_distribution) categorical variables here ("A,C,G,T,E" as opposed to continuous like "1.2, -3.4"), the transition matrix is actually representing [multinomial distributions](http://en.wikipedia.org/wiki/Multinomial_distribution), one for each state, rather than a MVN distribution... – Amro Jul 18 '13 at 16:48
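For reference, here is a minimal sketch of the `+1` (additive/Laplace) smoothing discussed in the comments above. The function name `countFcnSmoothed` and the `lambda` constant are illustrative choices, not part of the original answer; it can be dropped in place of `countFcn` in the `bootci`/`bootstrp` calls:

%# variant of countFcn with additive (Laplace) smoothing, which avoids
%# dividing by zero when a symbol never occurs in the drawn sample
function trans = countFcnSmoothed(seqs)
    lambda = 1;                    %# smoothing constant (the "+1" mentioned above)

    %# accumulate transition counts from all sequences
    trans = zeros(5,5);
    for i=1:numel(seqs)
        trans = trans + sparse(seqs{i}(1:end-1), seqs{i}(2:end), 1, 5,5);
    end

    %# add lambda to every count, then normalize rows into probabilities
    trans = trans + lambda;
    trans = bsxfun(@rdivide, trans, sum(trans,2));
end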

Not sure whether it is statistically sound, but an easy way to get indicative upper and lower bounds:

Cut your sample into n equal pieces (for example 1:40, 41:80, ..., 361:400) and calculate the transition probability matrix for each of these subsamples.

By looking at the distribution of probabilities amongst subsamples you should get a pretty good idea of what the variance is.

The disadvantage of this method is that it may not be possible to actually calculate an interval with a given coverage probability. The advantage is that it should give you a good feel for how the series behaves, and that it may capture information that other methods (for example bootstrapping) could lose because of the assumptions they are based on.
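A minimal sketch of this idea, assuming the 400 sequences are stored in the same `sequences` cell array used in the other answer and reusing its `countFcn`; the group size of 40 is only an example:

%# split the 400 sequences into 10 subsamples of 40 and estimate
%# one transition matrix per subsample
groupSize = 40;
nGroups = numel(sequences) / groupSize;
subTrans = zeros(5, 5, nGroups);
for g = 1:nGroups
    idx = (g-1)*groupSize + (1:groupSize);
    subTrans(:,:,g) = countFcn(sequences(idx));
end

%# the spread of each entry across subsamples indicates its variability
entrySD = std(subTrans, 0, 3);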

Dennis Jaheruddin
  • 1
  • If you actually have multiple series of length n, you are probably better off calculating one probability matrix per series (rather than chopping up each series into smaller pieces) – Dennis Jaheruddin Jul 22 '13 at 10:21