0

How I transform column to row in Terminal Linux? But more complex... The following is an example of my data:

SNP_Name        ID_Animal        Allele        Chr        Position
rs01            215               AB            1            100
rs02            215               AA            2            200
rs03            215               BA            3            300
rs04            215               AA            4            400
rs01            300               AB            1            100
rs02            300               BB            2            200
rs03            300               AA            3            300
rs04            300               AB            4            400
rs01            666               BB            1            100
rs02            666               AA            2            200
rs03            666               AB            3            300
rs04            666               AB            4            400

I want to transform this into the following:

SNP_Name     Chr     Position   215(ID_animal)  300(ID_Animal) 666(ID_Animal)
rs01         1       100        AB              AB            BB
rs02         2       200        AA              BB            AA
rs03         3       300        BA              AA            AB
rs04         4       400        AA              AB            AB

The row ID_animal change in column with respective allele. How I do this? But I will work with 55,000 repetition per ID_animal. So, I want to be only 55,000 row and (animal number+SNP_Name+Chr+Position) of column.

Thank you.

Community
  • 1
  • 1
  • So for each SNP (`rs`-ID), you'd like to list the animals having that SNP and what allele they have. Are all animals genotyped with every `rs`-ID? (i.e. will every `ID_Animal` occur exactly once for each `SNP_Name`?) – Kusalananda Jul 12 '16 at 10:48
  • Yes, all ID_animal occur exactly once for each SNP_Name. Do you know how I can to do? Thank you. – Amanda Botelho Alvarenga Jul 12 '16 at 17:18
  • But I must have ID_animal being in column, because I will use the GenABEL package in R, to calculate the linkage disequilibrium of the genome, so the input need this structure. – Amanda Botelho Alvarenga Jul 12 '16 at 17:28
  • @Kusalananda Do you can help me? Please... – Amanda Botelho Alvarenga Jul 12 '16 at 20:17
  • I've been thinking about a solution during the day. I might have an answer tomorrow, but I can make no promises. – Kusalananda Jul 12 '16 at 20:19
  • @Kusalananda , I've been thinking about a solution, and I have an idea: put in one file (A) only the ID_Animal and the Allele, and will use the programm used here (http://unix.stackexchange.com/questions/79642/transposing-rows-and-columns) ... In other file (B) I put only SNP_ID, chr and position, and try to keep the same structure. After merge the file (A) and file (B). What do you think about this? – Amanda Botelho Alvarenga Jul 12 '16 at 23:34

1 Answers1

1

The issue here is the amount of data, and I don't want to give a solution that reads everything into memory and then outputs it.

To do this, I'd like to parse and output the data for each SNP (rs number) in turn rather than for each animal in turn. But the data is given to us in the wrong order (it's sorted by animal).

So the first thing we need to do is to sort the data by SNP (the first column). I will also remove the header row at the same time as it is not needed for the data transformation.

I'm assuming that the data is stored in the file data.in:

$ sed '1d' data.in | sort -o data.tmp

We now have:

$ cat data.tmp
rs01            215               AB            1            100
rs01            300               AB            1            100
rs01            666               BB            1            100
rs02            215               AA            2            200
rs02            300               BB            2            200
rs02            666               AA            2            200
rs03            215               BA            3            300
rs03            300               AA            3            300
rs03            666               AB            3            300
rs04            215               AA            4            400
rs04            300               AB            4            400
rs04            666               AB            4            400

Then I run the following to produce the result:

$ awk -f script.awk data.tmp >data.new

The awk script is quite long, so it makes sense to have it in it's own script file rather than as a "one-liner":

FNR == 1    {
    # at first line of input

    rsid        = $1;
    chr         = $4;
    pos         = $5;

    c           = 0;
    aid[c]      = $2; # animal ID
    all[c++]    = $3; # allele

    do_header   = 1;  # output header when done parsing this SNP

    next;
}

rsid == $1 {
    # still reading animal ID/allele for this SNP
    aid[c]      = $2;
    all[c++]    = $3;
    next;
}

{
    if (do_header) {
        # output header

        printf("SNP_name\tChr\tPosition\t");
        for (c in aid) {
            printf("%d\t", aid[c]);
        }
        printf("\n");

        do_header = 0;
    }

    # output line with data from previous SNP    
    printf("%s\t%d\t%d\t", rsid, chr, pos);
    for (c in all) {
        printf("%s\t", all[c]);
    }
    printf("\n");

    # store data for this SNP
    rsid        = $1;
    chr         = $4;
    pos         = $5;

    c           = 0;
    aid[c]      = $2;
    all[c++]    = $3;
}

END {
    # output line for last SNP

    printf("%s\t%d\t%d\t", rsid, chr, pos);
    for (c in all) {
        printf("%s\t", all[c]);
    }
    printf("\n");
}

This produces the tab-delimited file data.new with the following contents, for the given input:

SNP_name    Chr Position    215 300 666
rs01    1   100 AB  AB  BB
rs02    2   200 AA  BB  AA
rs03    3   300 BA  AA  AB
rs04    4   400 AA  AB  AB

NOTE: This requires that all animals were genotyped for exactly the same SNPs. The same animal IDs needs to occur for every SNP. No exceptions.

Kusalananda
  • 14,885
  • 3
  • 41
  • 52