
I would appreciate your comments/help about a strategy I am applying in one of my analyses. In short, my case is:

1) My data are biological, collected over a period of 120 s from a subject receiving, each second, one of three possible stimuli (response labels 1 to 3) in random order; each one-second presentation is a trial. The sampling frequency is 256 Hz and there are 61 sensors (input variables). So my dataset has 120x256 rows and 62 columns (1 response label + 61 input variables);
2) My goal is to identify whether there is an underlying pattern for each stimulus. To test that hypothesis I would like to use deep learning neural networks, but not in the conventional way (predicting the stimulus from a single observation/row);
3) My approach is to shuffle the whole dataset by row (avoiding any time bias), divide it into training and validation sets (50/50), and then run the deep learning algorithm. The division does not segregate trial events (120), so the training and validation sets will contain rows from the same trials (but never the same row). If there is a dominant pattern per stimulus, the validation confusion-matrix error should be low. If there is a dominant pattern per trial, the validation confusion-matrix error should be high. So the validation confusion-matrix error is my indicator of the presence of a hidden pattern per stimulus;
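A minimal sketch of this split-and-evaluate step (my illustration, not part of the original question): the data are simulated with the stated shape, and a scikit-learn MLP stands in for whichever deep network is actually used.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import confusion_matrix, accuracy_score

# Stand-in data with the stated shape: 120 trials x 256 samples,
# 61 sensors, one stimulus label (1-3) per trial. Replace with the
# real recordings.
rng = np.random.default_rng(0)
n_trials, n_samples, n_sensors = 120, 256, 61
labels = rng.integers(1, 4, size=n_trials)      # one stimulus per trial
X = rng.standard_normal((n_trials * n_samples, n_sensors))
y = np.repeat(labels, n_samples)                # label repeated per row

# Row-wise shuffled 50/50 split: rows from the same trial can land in
# both halves (no segregation per trial), exactly as described above.
X_tr, X_va, y_tr, y_va = train_test_split(
    X, y, test_size=0.5, shuffle=True, stratify=y, random_state=0)

clf = MLPClassifier(hidden_layer_sizes=(128, 64), max_iter=300,
                    random_state=0).fit(X_tr, y_tr)

y_hat = clf.predict(X_va)
print(confusion_matrix(y_va, y_hat))
print("validation error:", 1 - accuracy_score(y_va, y_hat))
```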

I would appreciate any input you could provide regarding the validity of my logic. I would like to emphasize that I am not trying to predict the stimulus from single-row inputs.

Thanks.

RgrNormand

1 Answer


Yes, you can use classification performance in the cross-validation set that exceeds chance to argue that there is a pattern or relationship within the exemplars of each class. The argument will be stronger if similar performance is found on a separate, never-before-seen test set.

If a deep neural network, SVM, or any other classifier can classify better than chance, it implies:

  1. There is information (a pattern) among the training-set exemplars with regard to each predicted class
  2. AND that pattern is learnable by the classifier
  3. AND that information is not specific to the training set (no over-learning)

So, if classification performance exceeds chance, all three conditions above hold. If it does not, one or more of them could be false: the training variables might not contain any information that helps predict the class; or the variables are predictive, but the relationship between them and the class is too complicated for the classifier to learn; or the classifier over-learned, and CV-set performance is at chance level or worse.
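One way to make "better than chance" precise is a label-permutation test (my addition; the answer itself does not prescribe a specific test): refit the classifier many times on shuffled labels and check where the real CV score falls in that null distribution. A sketch with scikit-learn's permutation_test_score, using synthetic data and a logistic-regression stand-in:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import permutation_test_score

# Placeholder data: 600 rows, 61 sensor features, 3 stimulus classes,
# so chance accuracy is about 1/3. Replace with the real dataset.
rng = np.random.default_rng(0)
X = rng.standard_normal((600, 61))
y = rng.integers(1, 4, size=600)

# score: real CV accuracy; perm_scores: accuracies with shuffled labels;
# p_value: fraction of permutations scoring at least as well.
score, perm_scores, p_value = permutation_test_score(
    LogisticRegression(max_iter=1000), X, y,
    cv=5, n_permutations=100, random_state=0)

# A real score well above the permuted-label scores (small p_value)
# supports conditions 1-3 above jointly.
print(f"CV accuracy: {score:.3f}  chance ~ {perm_scores.mean():.3f}  "
      f"p = {p_value:.3f}")
```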

Here is an open-access paper that used similar logic to argue that fMRI activity contains information about the images a person is looking at:

Natural Scene Categories Revealed in Distributed Patterns of Activity in the Human Brain

NOTE: Depending on the classifier used (especially DNNs, less so decision trees), this will only tell you IF there is a pattern; it will not tell you WHAT that pattern is.

Justas
  • Thank you for your answer and the article. I totally agree with your points. The main criticism I am facing is related to **how** the DNN is correctly guessing the response label. Since there is no segregation per trial event (different rows/observations from the same trial/event can belong to the training and validation sets), it has been said that the DNN is "finding" the right class based on the trial the row belongs to (e.g. by similarity) and not the stimulus. I have tested both cases and different DNNs were built (variable importance in a different order). How can I disprove such a doubt? – RgrNormand Apr 19 '16 at 16:05
  • Ah, contamination. Is there a way you can ensure that the CV set does not contain rows from trials that were present in the training set? E.g. divide the data as if it came from two different experiments, such that you're training on data from exp1 and testing the DNN on data from exp2, so that everything is separated. – Justas Apr 19 '16 at 18:01
  • Total segregation leads to chance-level error. But it doesn't seem correct to use per-row inputs to predict whole events (equivalent to 256 rows). If treated as whole events, with segregation, the error is half of chance. My second approach was to check the cosine similarity between the rows of each trial. On average, the angle is below 20 degrees only for rows **n** and **n+1**. Then I reduced the ratio between training and validation datasets (fewer rows for training). As expected, the error grows, but even with 2% training x 98% validation, the error is half of chance. Any comments/suggestions? – RgrNormand Apr 20 '16 at 06:43
  • Have you tried combining subsequent rows from each event and then using those as input, like in a convolutional DNN? E.g. row1,row2,row3=A; row2,row3,row4=A; ... row257,row258,row259=B; row258,row259,row260=B; etc. In the example I'm using a window of 3 rows, but you could experiment with different window sizes, up to 256. – Justas Apr 20 '16 at 16:32
  • I didn't try your suggested approach. But how would it help me to prove/disprove my hypothesis, assuming I keep the same approach (no segregation per event)? Since there is more similarity between **n** and **n+1**, there is a chance this would improve the results. As you can see, my goal is to check whether the method I am using is correct for the goal. – RgrNormand Apr 20 '16 at 17:51
  • You should do total segregation no matter what -- that's the only way to address contamination concerns. If you're getting chance-level error when the events are segregated, it means your inputs are not informative or the NN cannot learn the pattern. You said these are time-series values from sensors; then it's possible the pattern is only apparent across several time steps. So, in that case, you should combine several consecutive time steps (as described above; a sketch combining both fixes follows these comments) and see if the stimulus can be predicted from several time steps combined. – Justas Apr 20 '16 at 18:24
  • You should still ensure your combined inputs only contain rows from one stimulus. – Justas Apr 20 '16 at 18:25
  • It appears to me that you're trying to do multi-channel time series classification, e.g.: http://staff.ustc.edu.cn/~cheneh/paper_pdf/2014/Yi-Zheng-WAIM2014.pdf – Justas Apr 20 '16 at 18:32
  • Thank you for the article and comments. If my goal were to predict the stimulus class from whole-event rows (all 256) with segregation, one approach I already have produces half the error of chance. Nevertheless, it does not prove that there is a relationship between the rows by stimulus class. Perhaps we could attack the problem from a different angle: if I could prove that the rows are interconnected by trial/round, I could assume the DNN is correctly identifying that pattern and not the stimulus class (disproving my hypothesis). Any suggestions on how to prove that? – RgrNormand Apr 20 '16 at 20:39
  • Let us [continue this discussion in chat](http://chat.stackoverflow.com/rooms/109734/discussion-between-rgrnormand-and-justas). – RgrNormand Apr 20 '16 at 21:12
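A sketch combining the two fixes discussed in the comments above: segregating trials so that no trial contributes rows to both sets (GroupShuffleSplit), and stacking consecutive rows into windows so temporal structure is visible to the classifier. The window size, classifier, and synthetic data are my assumptions, not from the thread.

```python
import numpy as np
from sklearn.model_selection import GroupShuffleSplit
from sklearn.neural_network import MLPClassifier

# Stand-in data: 120 trials x 256 samples x 61 sensors; window of 8
# consecutive rows (an assumed size; the comments suggest trying
# several, up to 256).
rng = np.random.default_rng(0)
n_trials, n_samples, n_sensors, win = 120, 256, 61, 8

X = rng.standard_normal((n_trials * n_samples, n_sensors))
labels = rng.integers(1, 4, size=n_trials)
y = np.repeat(labels, n_samples)
groups = np.repeat(np.arange(n_trials), n_samples)  # trial id per row

def windowed(X, y, groups, win):
    """Stack `win` consecutive rows within each trial into one flat
    input, so every window contains rows from a single stimulus."""
    Xw, yw, gw = [], [], []
    for t in np.unique(groups):
        idx = np.flatnonzero(groups == t)
        rows, label = X[idx], y[idx[0]]
        for i in range(len(rows) - win + 1):
            Xw.append(rows[i:i + win].ravel())
            yw.append(label)
            gw.append(t)
    return np.array(Xw), np.array(yw), np.array(gw)

Xw, yw, gw = windowed(X, y, groups, win)

# 50/50 split at the trial level: no trial appears in both halves,
# which addresses the contamination concern.
tr, va = next(GroupShuffleSplit(n_splits=1, test_size=0.5,
                                random_state=0).split(Xw, yw, gw))

clf = MLPClassifier(hidden_layer_sizes=(128,), max_iter=100,
                    random_state=0).fit(Xw[tr], yw[tr])
print("segregated validation accuracy:", clf.score(Xw[va], yw[va]))
```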