
The title may sound like nonsense, but let me explain. I need to filter a text file. The operations I need to perform are very basic, as I said. The file in question is this one: http://gdac.broadinstitute.org/runs/analyses__2014_10_17/reports/cancer/BRCA-TP/Mutation_Assessor/BRCA-TP.maf.annotated

I started with this task: find the Tumor_Sample_Barcode column in the data file. As you can see, every value in that column has a format like this: TCGA-02-0001-01C-01D-0182-01

The two characters before "C" are critical here. In the example, these characters are "01". I am looking for the rows that contain "01" in that position; rows that have any other pair of characters there should be eliminated.

If the file were not 56.2 MB, I could handle it easily in MATLAB. However, when I tried to split the file into columns in MATLAB with the following line, I got an error.

[numData,textData,rawData] = xlsread('BRCA-TP.maf.annotated.csv');

Although I maximized MATLAB's Java heap memory, I still get an insufficient-memory error when I run this in the editor.

I looked for an alternative method. JMP might help, but I have no experience with that software, so even a basic operation like the one described above may be painful for me.

Is there a way to do the operation I explained above in MATLAB? If not, can you help me figure out how to write a script in JMP to do it?

1201ProgramAlarm
Dorukhan Arslan

2 Answers


This can be done with a simple "awk" command:

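awk -F'\t' '$16 ~ /....-..-....-01C-...-....-../' BRCA-TP.maf.annotated > BRCA-TP.maf.annotated.filtered

The -F'\t' makes awk split each line on tabs, so empty fields or fields that contain spaces do not shift the column count. The 16 means look at the 16th column, and the term inside the // is a regular expression in which each dot matches any single character.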

"awk" is available on any Unix-like operating system such as Mac OS X and Ubuntu, but if you're running Windows you'd have to download and install Cygwin or a similar utility.
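
If counting to the 16th column by hand feels fragile, the column can be looked up by name instead. This is only a sketch under assumptions: the sample file, its three columns, and the file names `sample.maf` and `sample.filtered.maf` are made up for illustration, and the header row is assumed to contain the exact text Tumor_Sample_Barcode.

```shell
# Build a tiny tab-separated sample (hypothetical data, 3 columns only).
printf 'Hugo_Symbol\tTumor_Sample_Barcode\tVariant_Type\n' > sample.maf
printf 'TP53\tTCGA-02-0001-01C-01D-0182-01\tSNP\n' >> sample.maf
printf 'BRCA1\tTCGA-02-0002-10A-01D-0182-01\tSNP\n' >> sample.maf

# Until the header has been seen, search each line for the column name.
# Once found, print the header and remember the column index; after that,
# print only rows whose barcode has "01" right before the letter C.
awk -F'\t' '
    !col {
        for (i = 1; i <= NF; i++)
            if ($i == "Tumor_Sample_Barcode") col = i
        if (col) print    # keep the header row
        next
    }
    $col ~ /^....-..-....-01C-/
' sample.maf > sample.filtered.maf

cat sample.filtered.maf
```

Lines that come before the header (for example comment lines) are skipped by this sketch, which may or may not be what you want.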

Hamid
  • What if I need the rows that contain not only "01C" but also "02C" and "03C"? Should I run three different awk commands, or can it be done with a small modification to this line? – Dorukhan Arslan Mar 11 '15 at 18:43
  • @DorukhanArslan Yes, there is a simple way to specify multiple patterns such as 0C, etc. I suggest you read a regexp tutorial to see how to match the pattern that best represents your use case, such as this one: [link](http://regexone.com) – Hamid Jun 11 '15 at 20:30
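
For the follow-up in the comments, one awk pass with a character class should be enough. The sketch below runs on made-up barcodes in a hypothetical file `barcodes.txt` (one barcode per line, so the field is `$1` rather than `$16`).

```shell
# Hypothetical barcodes, one per line, just to exercise the pattern.
printf '%s\n' \
    'TCGA-02-0001-01C-01D-0182-01' \
    'TCGA-02-0002-02C-01D-0182-01' \
    'TCGA-02-0003-03C-01D-0182-01' \
    'TCGA-02-0004-11C-01D-0182-01' > barcodes.txt

# 0[1-3]C accepts 01C, 02C and 03C in a single pass; 11C is rejected.
awk '$1 ~ /^....-..-....-0[1-3]C-/' barcodes.txt > barcodes.kept.txt

cat barcodes.kept.txt
```

If the codes you need are not a contiguous range, an alternation such as `(01|06|11)C` can take the place of `0[1-3]C` in the same expression.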

If you want to do it in MATLAB for a specific reason, here is another solution. Basically, it goes through each line in the file and isolates the 16th tab-separated value (the barcode). This could potentially be shorter with a newer version of MATLAB (one that has strsplit), but regexp works for older versions.

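fid = fopen('tumor.csv');

%fopen returns -1 when the file cannot be opened, so stop early with a clear message
if fid == -1
    error('Could not open the file; check the file name and path.');
end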

%Tumor_Sample_Barcode is the 16th column
col_of_interest = 16;

sline = fgetl(fid);

while ischar(sline)
    %splits the line by tabs
    tokenized_line = regexp(sline,'\t','split');

    %makes sure the line contains the token (this should always be true for
    %your file, but just in case)
    if (col_of_interest <= numel(tokenized_line))
        tumor_barcode = tokenized_line{col_of_interest};

        if not(isempty(regexp(tumor_barcode,'....-..-....-01C-...-....-..','match')))
            %if so display the line, or do other processing
            disp(tumor_barcode)
        end
    end

    sline = fgetl(fid);
end

fclose(fid);

edit

I saw your comment on the other answer. If you want to search for 01C, 02C and 03C, you can do it all at once in the regular expression using a range: [1-3] matches any single digit from 1 to 3.

if not(isempty(regexp(tumor_barcode,'....-..-....-0[1-3]C-...-....-..','match')))
andrew
  • I tried this MATLAB script, but there must be a fault: sline is null. I am going to try the awk command as well. Thank you. I hope I have enough rep to give you two +1 votes. – Dorukhan Arslan Mar 11 '15 at 19:03
  • @Dorukhan Arslan That's probably because of the first line, `fid = fopen('tumor.csv');`. Instead of **tumor.csv**, put your actual file name. See if that helps – andrew Mar 11 '15 at 19:10
  • Of course, I changed that. Still no result. Does it work for you? – Dorukhan Arslan Mar 11 '15 at 19:21
  • That's because there are no '01C' in the file. You can verify it works though, try '01A' or '01B' – andrew Mar 11 '15 at 20:24
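
One way to check which codes actually occur before choosing a pattern is to tally the fourth dash-separated piece of every barcode. The sketch below uses a made-up two-column file, `tally.maf`, so the barcode sits in column 2; for the real file the answers above use column 16.

```shell
# Hypothetical sample: a header row plus three data rows.
printf 'Hugo_Symbol\tTumor_Sample_Barcode\n' > tally.maf
printf 'TP53\tTCGA-02-0001-01A-01D-0182-01\n' >> tally.maf
printf 'BRCA1\tTCGA-02-0002-01A-01D-0182-01\n' >> tally.maf
printf 'PTEN\tTCGA-02-0003-01B-01D-0182-01\n' >> tally.maf

# Skip the header, split each barcode on "-", and count the fourth
# piece (e.g. 01A). sort makes the order of the report stable.
awk -F'\t' '
    NR > 1 { split($2, part, "-"); count[part[4]]++ }
    END    { for (code in count) print code, count[code] }
' tally.maf | sort > tally.counts.txt

cat tally.counts.txt
```

Each output line is a code followed by the number of rows that carry it, which shows at a glance whether a code such as 01C is present at all.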