rearrange data on a file (NOT a direct transpose)

Question

I have a file (more than 2.5k lines) like this:

NAME YEAR A B C
JOHN Y1 10,00 19,00 65,00
JOHN Y2 11,00 23,00 64,00
JOHN Y3 12,00 33,00 34,00
JOHN Y4 13,00 34,00 32,00
PAUL Y1 14,00 43,00 23,00
PAUL Y2 15,00 90,00 34,00
PAUL Y3 16,00 32,00 56,00
PAUL Y4 20,00 45,00 65,00
RINGO Y1 25,00 60,00 87,00
RINGO Y2 24,00 30,00 23,00
RINGO Y3 31,00 20,00 54,00
RINGO Y4 75,00 10,00 12,00

As you can see, each name repeats 4 times (4 lines) in order to "store" the 4 years values, and for each year there are 3 values (A, B and C).

I need to rearrange the data, so that each name will be shown in just ONE LINE. So the 4 years that originaly are shown in lines must be shown in new columns, like this:

NAME A/Y1 A/Y2 A/Y3 A/Y4 B/Y1 B/Y2 B/Y3 B/Y4 C/Y1 C/Y2 C/Y3 C/Y4
JOHN 10,00 11,00 12,00 13,00 19,00 23,00 33,00 34,00 65,00 64,00 34,00 32,00
PAUL 14,00 15,00 16,00 20,00 43,00 90,00 32,00 45,00 23,00 34,00 56,00 65,00
RINGO 25,00 24,00 31,00 75,00 60,00 30,00 20,00 10,00 87,00 23,00 54,00 12,00

Also, an acceptable output format could be:

NAME Y1/A Y1/B Y1/C Y2/A Y2/B Y2/C Y3/A Y3/B Y3/C Y4/A Y4/B Y4/C

I'm not shure which one would be "easier" to be implemented but, both output formats are OK.

As far as I can see, its not a "direct transpose", and I have not found any similar question, and thats why I made the question again with more details.

It is not advisable to request members to give you an answer. This site is to help people to refine/correct your attempts. — Fazlin, Sep 21 '16 at 13:25
There are NO blank lines between each input / outtput, sorry, my fault when pasting the data. — silvio, Sep 21 '16 at 13:49

score 3 · Answer 1 · answered Sep 21 '16 at 14:00

Using GNU awk for true multi-dimensional arrays:

$ cat tst.awk
NR==1 { split($0,hdr); next }
{
    idx = (NR-2)%4+1
    val[idx][0]
    split($0,val[idx])
}
NR==5 {
    printf "%s", hdr[1]
    for (j=3; j in hdr; j++) {
        for (i=1; i<=idx; i++) {
            printf "%s%s", OFS, hdr[j]"/"val[i][2]
        }
    }
    print ""
}
idx==4 {
    printf "%s", $1
    for (j=3; j<=NF; j++) {
        for (i=1; i<=idx; i++) {
            printf "%s%s", OFS, val[i][j]
        }
    }
    print ""
}

$ awk -f tst.awk file
NAME A/Y1 A/Y2 A/Y3 A/Y4 B/Y1 B/Y2 B/Y3 B/Y4 C/Y1 C/Y2 C/Y3 C/Y4
JOHN 10,00 11,00 12,00 13,00 19,00 23,00 33,00 34,00 65,00 64,00 34,00 32,00
PAUL 14,00 15,00 16,00 20,00 43,00 90,00 32,00 45,00 23,00 34,00 56,00 65,00
RINGO 25,00 24,00 31,00 75,00 60,00 30,00 20,00 10,00 87,00 23,00 54,00 12,00

James Brown · Answer 2 · 2016-09-21T20:15:56.703

2

$ cat foo.awk
NR==1{next}                                              # skip the header
{
    printf "%s", (b!=$1?(b==""?"":ORS) $1:"") OFS; b=$1  # print name or OFS
} 
{
    printf "%s", $3 OFS $4 OFS $5                        # print fields
} 
END {print ""}                                           # finish with ORS

Spin it:

$ awk -f foo.awk foo.txt
JOHN 10,00 19,00 65,00 11,00 23,00 64,00 12,00 33,00 34,00 13,00 34,00 32,00
PAUL 14,00 43,00 23,00 15,00 90,00 34,00 16,00 32,00 56,00 20,00 45,00 65,00
RINGO 25,00 60,00 87,00 24,00 30,00 23,00 31,00 20,00 54,00 75,00 10,00 12,00

edited Sep 21 '16 at 20:15

answered Sep 21 '16 at 14:42

James Brown

36,089
7
43
59

you've missed a detail too.. all values from A first, followed by B and C.. can't simply concatenate values... – Sundeep Sep 22 '16 at 02:09
1

@Sundeep _Also, an acceptable output format could be: `NAME Y1/A Y1/B Y1/C Y2/A...`_ – James Brown Sep 22 '16 at 03:47
oh! that makes for easier solution indeed! and I wrongly seem to have got another answer deleted! :-/ :-/ – Sundeep Sep 22 '16 at 03:51

Jean-François Fabre · Answer 3 · 2016-09-21T13:41:13.013

My proposal in python (sed is clearly not up to the task, maybe awk, but that's a challenge). I have hardcoded the 4*3 aspect of the "matrix". Maybe something more elegant could have been done:

import collections

nb_year = 4
d = collections.defaultdict(lambda: [None]*nb_year*3)    


with open("input_file") as infile:

    next(infile)  # skip title

    for l in infile:  # read line by line
        fields = l.strip().split()  # extract blank-separated fields
        if len(fields)<3: continue  # protection against "accidental" blank lines
        target = d[fields[0]]       # name
        offset = int(fields[1][1])-1    # extract year index 1 to 4
        for i,f in enumerate(fields[2:]):  # interleaved matrix fill
            target[offset+i*nb_year] = f      # fill "matrix"

    print("NAME A/Y1 A/Y2 A/Y3 A/Y4 B/Y1 B/Y2 B/Y3 B/Y4 C/Y1 C/Y2 C/Y3 C/Y4")
    for k,v in sorted(d.items()):
        print("{} {}".format(k," ".join(v)))

output

NAME A/Y1 A/Y2 A/Y3 A/Y4 B/Y1 B/Y2 B/Y3 B/Y4 C/Y1 C/Y2 C/Y3 C/Y4
JOHN 10,00 11,00 12,00 13,00 19,00 23,00 33,00 34,00 65,00 64,00 34,00 32,00
PAUL 14,00 15,00 16,00 20,00 43,00 90,00 32,00 45,00 23,00 34,00 56,00 65,00
RINGO 25,00 24,00 31,00 75,00 60,00 30,00 20,00 10,00 87,00 23,00 54,00 12,00

Can't imagine why you'd think that'd be a challenge for awk but anyway - more important than hard-coding the matrix, it's worth mentioning that you're printing a hard-coded header line instead of reading it from the input file! — Ed Morton, Sep 21 '16 at 14:02
it's a challenge for me+awk :) you're right, that could have been more generic. — Jean-François Fabre, Sep 21 '16 at 14:07

score 1 · Answer 4 · answered Sep 21 '16 at 14:18

awk solution:

$ cat script.awk
#!/bin/awk

{
    if( length($1) > 0 )
    {
        if( prev != $1 )
        {
            str = ""
            n = 0
        }

        str = str FS $0

        n = n + 1

        if( n == 4 )
        {
            split( str, a, FS )
            print a[1],a[3],a[8],a[13],a[18],a[4],a[9],a[14],a[19],a[5],a[10],a[15],a[20]
        }

        prev = $1
    }

}

# eof #

Testing:

$ awk -f script.awk -- input.txt 
JOHN 10,00 11,00 12,00 13,00 19,00 23,00 33,00 34,00 65,00 64,00 34,00 32,00
PAUL 14,00 15,00 16,00 20,00 43,00 90,00 32,00 45,00 23,00 34,00 56,00 65,00
RINGO 25,00 24,00 31,00 75,00 60,00 30,00 20,00 10,00 87,00 23,00 54,00 12,00

Hope It Helps!

agc · Answer 5 · 2017-03-06T12:43:27.497

Almost complete solution (the header line was done manually) with datamash's collapse function:

echo \
  "NAME    A/Y1  A/Y2  A/Y3  A/Y4  B/Y1  B/Y2  B/Y3  B/Y4  C/Y1  C/Y2  C/Y3  C/Y4"
tr ',' '.' < input.txt | \
datamash --header-in -W -g1 collapse A collapse B collapse C | \
tr '[.,]' '[, ]'

Output:

    NAME    A/Y1  A/Y2  A/Y3  A/Y4  B/Y1  B/Y2  B/Y3  B/Y4  C/Y1  C/Y2  C/Y3  C/Y4
    JOHN    10,00 11,00 12,00 13,00 19,00 23,00 33,00 34,00 65,00 64,00 34,00 32,00
    PAUL    14,00 15,00 16,00 20,00 43,00 90,00 32,00 45,00 23,00 34,00 56,00 65,00
    RINGO   25,00 24,00 31,00 75,00 60,00 30,00 20,00 10,00 87,00 23,00 54,00 12,00

Notes: the tr stuff is because collapse uses commas as output separators, so to avoid too many commas, those are shuffled around a bit, then moved back.

If need be the header could be generated from input.txt with code, (but it's longer and uglier than a simple hard-coded echo):

echo -n "NAME    " ; set -- `head -n 1 input.txt` ; shift 2 ; \
tail -n +2 input.txt | cut -d' ' -f2 | head -n 4 | \
while read x ; do for f in $@ ; do echo -n "$f/$x  " ; done ; done ; echo

score 1 · Answer 6 · answered Sep 22 '16 at 04:33

With perl, generic in the sense there can be different number of years and different number of columns

$ cat ip.txt 
NAME YEAR A B C
JOHN Y1 10,00 19,00 65,00
JOHN Y2 11,00 23,00 64,00
JOHN Y3 12,00 33,00 34,00
JOHN Y4 13,00 34,00 32,00
PAUL Y1 14,00 43,00 23,00
PAUL Y2 15,00 90,00 34,00
PAUL Y3 16,00 32,00 56,00
PAUL Y4 20,00 45,00 65,00
RINGO Y1 25,00 60,00 87,00
RINGO Y2 24,00 30,00 23,00
RINGO Y3 31,00 20,00 54,00
RINGO Y4 75,00 10,00 12,00

Assumes that sorting names while printing output is good enough..

$ perl -ae '
@h = @F[0,2..$#F] if $. == 1;
if($. > 1)
{
    $d{$F[0]} .= " ".join(" ",@F[2..$#F]);
    $hh[$i++] = $F[1] if !$seen{$F[1]}++;
}
END
{
     print "$h[0] ";
     foreach (@hh){ for($j=1; $j <= $#h; $j++) {print "$_/$h[$j] "} }
     print "\n";
     print "$_$d{$_}\n" foreach (sort keys %d);
}
' ip.txt
NAME Y1/A Y1/B Y1/C Y2/A Y2/B Y2/C Y3/A Y3/B Y3/C Y4/A Y4/B Y4/C 
JOHN 10,00 19,00 65,00 11,00 23,00 64,00 12,00 33,00 34,00 13,00 34,00 32,00
PAUL 14,00 43,00 23,00 15,00 90,00 34,00 16,00 32,00 56,00 20,00 45,00 65,00
RINGO 25,00 60,00 87,00 24,00 30,00 23,00 31,00 20,00 54,00 75,00 10,00 12,00

Input lines are split on space and saved to @F array
From header line, get columns except 2nd one
For other lines
- use 1st column as key and append columns 3rd one onwards
- save the 2nd column uniquely to construct output header later on
At the end
- construct header in the format saved
- print keys (sorted alphabetically) and corresponding values from hash variable

For modified example:

$ cat ip1.txt 
NAME YEAR A B
JOHN Y1 10,00 19,00
JOHN Y2 11,00 23,00
PAUL Y1 14,00 43,00
PAUL Y2 15,00 90,00

$ perl -ae '
@h = @F[0,2..$#F] if $. == 1;
if($. > 1)
{
    $d{$F[0]} .= " ".join(" ",@F[2..$#F]);
    $hh[$i++] = $F[1] if !$seen{$F[1]}++;
}
END
{
     print "$h[0] ";
     foreach (@hh){ for($j=1; $j <= $#h; $j++) {print "$_/$h[$j] "} }
     print "\n";
     print "$_$d{$_}\n" foreach (sort keys %d);
}
' ip1.txt
NAME Y1/A Y1/B Y2/A Y2/B 
JOHN 10,00 19,00 11,00 23,00
PAUL 14,00 43,00 15,00 90,00

rearrange data on a file (NOT a direct transpose)

6 Answers6