2

I have a file (more than 2.5k lines) like this:

NAME YEAR A B C
JOHN Y1 10,00 19,00 65,00
JOHN Y2 11,00 23,00 64,00
JOHN Y3 12,00 33,00 34,00
JOHN Y4 13,00 34,00 32,00
PAUL Y1 14,00 43,00 23,00
PAUL Y2 15,00 90,00 34,00
PAUL Y3 16,00 32,00 56,00
PAUL Y4 20,00 45,00 65,00
RINGO Y1 25,00 60,00 87,00
RINGO Y2 24,00 30,00 23,00
RINGO Y3 31,00 20,00 54,00
RINGO Y4 75,00 10,00 12,00

As you can see, each name repeats 4 times (4 lines) in order to "store" the 4 years values, and for each year there are 3 values (A, B and C).

I need to rearrange the data, so that each name will be shown in just ONE LINE. So the 4 years that originaly are shown in lines must be shown in new columns, like this:

NAME A/Y1 A/Y2 A/Y3 A/Y4 B/Y1 B/Y2 B/Y3 B/Y4 C/Y1 C/Y2 C/Y3 C/Y4
JOHN 10,00 11,00 12,00 13,00 19,00 23,00 33,00 34,00 65,00 64,00 34,00 32,00
PAUL 14,00 15,00 16,00 20,00 43,00 90,00 32,00 45,00 23,00 34,00 56,00 65,00
RINGO 25,00 24,00 31,00 75,00 60,00 30,00 20,00 10,00 87,00 23,00 54,00 12,00

Also, an acceptable output format could be:

NAME Y1/A Y1/B Y1/C Y2/A Y2/B Y2/C Y3/A Y3/B Y3/C Y4/A Y4/B Y4/C

I'm not shure which one would be "easier" to be implemented but, both output formats are OK.

As far as I can see, its not a "direct transpose", and I have not found any similar question, and thats why I made the question again with more details.

silvio
  • 85
  • 6

6 Answers6

3

Using GNU awk for true multi-dimensional arrays:

$ cat tst.awk
NR==1 { split($0,hdr); next }
{
    idx = (NR-2)%4+1
    val[idx][0]
    split($0,val[idx])
}
NR==5 {
    printf "%s", hdr[1]
    for (j=3; j in hdr; j++) {
        for (i=1; i<=idx; i++) {
            printf "%s%s", OFS, hdr[j]"/"val[i][2]
        }
    }
    print ""
}
idx==4 {
    printf "%s", $1
    for (j=3; j<=NF; j++) {
        for (i=1; i<=idx; i++) {
            printf "%s%s", OFS, val[i][j]
        }
    }
    print ""
}

$ awk -f tst.awk file
NAME A/Y1 A/Y2 A/Y3 A/Y4 B/Y1 B/Y2 B/Y3 B/Y4 C/Y1 C/Y2 C/Y3 C/Y4
JOHN 10,00 11,00 12,00 13,00 19,00 23,00 33,00 34,00 65,00 64,00 34,00 32,00
PAUL 14,00 15,00 16,00 20,00 43,00 90,00 32,00 45,00 23,00 34,00 56,00 65,00
RINGO 25,00 24,00 31,00 75,00 60,00 30,00 20,00 10,00 87,00 23,00 54,00 12,00
Ed Morton
  • 188,023
  • 17
  • 78
  • 185
2
$ cat foo.awk
NR==1{next}                                              # skip the header
{
    printf "%s", (b!=$1?(b==""?"":ORS) $1:"") OFS; b=$1  # print name or OFS
} 
{
    printf "%s", $3 OFS $4 OFS $5                        # print fields
} 
END {print ""}                                           # finish with ORS

Spin it:

$ awk -f foo.awk foo.txt
JOHN 10,00 19,00 65,00 11,00 23,00 64,00 12,00 33,00 34,00 13,00 34,00 32,00
PAUL 14,00 43,00 23,00 15,00 90,00 34,00 16,00 32,00 56,00 20,00 45,00 65,00
RINGO 25,00 60,00 87,00 24,00 30,00 23,00 31,00 20,00 54,00 75,00 10,00 12,00
James Brown
  • 36,089
  • 7
  • 43
  • 59
1

My proposal in python (sed is clearly not up to the task, maybe awk, but that's a challenge). I have hardcoded the 4*3 aspect of the "matrix". Maybe something more elegant could have been done:

import collections

nb_year = 4
d = collections.defaultdict(lambda: [None]*nb_year*3)    


with open("input_file") as infile:

    next(infile)  # skip title

    for l in infile:  # read line by line
        fields = l.strip().split()  # extract blank-separated fields
        if len(fields)<3: continue  # protection against "accidental" blank lines
        target = d[fields[0]]       # name
        offset = int(fields[1][1])-1    # extract year index 1 to 4
        for i,f in enumerate(fields[2:]):  # interleaved matrix fill
            target[offset+i*nb_year] = f      # fill "matrix"

    print("NAME A/Y1 A/Y2 A/Y3 A/Y4 B/Y1 B/Y2 B/Y3 B/Y4 C/Y1 C/Y2 C/Y3 C/Y4")
    for k,v in sorted(d.items()):
        print("{} {}".format(k," ".join(v)))

output

NAME A/Y1 A/Y2 A/Y3 A/Y4 B/Y1 B/Y2 B/Y3 B/Y4 C/Y1 C/Y2 C/Y3 C/Y4
JOHN 10,00 11,00 12,00 13,00 19,00 23,00 33,00 34,00 65,00 64,00 34,00 32,00
PAUL 14,00 15,00 16,00 20,00 43,00 90,00 32,00 45,00 23,00 34,00 56,00 65,00
RINGO 25,00 24,00 31,00 75,00 60,00 30,00 20,00 10,00 87,00 23,00 54,00 12,00
Jean-François Fabre
  • 137,073
  • 23
  • 153
  • 219
  • Can't imagine why you'd think that'd be a challenge for awk but anyway - more important than hard-coding the matrix, it's worth mentioning that you're printing a hard-coded header line instead of reading it from the input file! – Ed Morton Sep 21 '16 at 14:02
  • it's a challenge for me+awk :) you're right, that could have been more generic. – Jean-François Fabre Sep 21 '16 at 14:07
1

awk solution:

$ cat script.awk
#!/bin/awk

{
    if( length($1) > 0 )
    {
        if( prev != $1 )
        {
            str = ""
            n = 0
        }

        str = str FS $0

        n = n + 1

        if( n == 4 )
        {
            split( str, a, FS )
            print a[1],a[3],a[8],a[13],a[18],a[4],a[9],a[14],a[19],a[5],a[10],a[15],a[20]
        }

        prev = $1
    }

}

# eof #

Testing:

$ awk -f script.awk -- input.txt 
JOHN 10,00 11,00 12,00 13,00 19,00 23,00 33,00 34,00 65,00 64,00 34,00 32,00
PAUL 14,00 15,00 16,00 20,00 43,00 90,00 32,00 45,00 23,00 34,00 56,00 65,00
RINGO 25,00 24,00 31,00 75,00 60,00 30,00 20,00 10,00 87,00 23,00 54,00 12,00

Hope It Helps!

Lacobus
  • 1,590
  • 12
  • 20
1

Almost complete solution (the header line was done manually) with datamash's collapse function:

echo \
  "NAME    A/Y1  A/Y2  A/Y3  A/Y4  B/Y1  B/Y2  B/Y3  B/Y4  C/Y1  C/Y2  C/Y3  C/Y4"
tr ',' '.' < input.txt | \
datamash --header-in -W -g1 collapse A collapse B collapse C | \
tr '[.,]' '[, ]'

Output:

    NAME    A/Y1  A/Y2  A/Y3  A/Y4  B/Y1  B/Y2  B/Y3  B/Y4  C/Y1  C/Y2  C/Y3  C/Y4
    JOHN    10,00 11,00 12,00 13,00 19,00 23,00 33,00 34,00 65,00 64,00 34,00 32,00
    PAUL    14,00 15,00 16,00 20,00 43,00 90,00 32,00 45,00 23,00 34,00 56,00 65,00
    RINGO   25,00 24,00 31,00 75,00 60,00 30,00 20,00 10,00 87,00 23,00 54,00 12,00

Notes: the tr stuff is because collapse uses commas as output separators, so to avoid too many commas, those are shuffled around a bit, then moved back.

If need be the header could be generated from input.txt with code, (but it's longer and uglier than a simple hard-coded echo):

echo -n "NAME    " ; set -- `head -n 1 input.txt` ; shift 2 ; \
tail -n +2 input.txt | cut -d' ' -f2 | head -n 4 | \
while read x ; do for f in $@ ; do echo -n "$f/$x  " ; done ; done ; echo
agc
  • 7,973
  • 2
  • 29
  • 50
1

With perl, generic in the sense there can be different number of years and different number of columns

$ cat ip.txt 
NAME YEAR A B C
JOHN Y1 10,00 19,00 65,00
JOHN Y2 11,00 23,00 64,00
JOHN Y3 12,00 33,00 34,00
JOHN Y4 13,00 34,00 32,00
PAUL Y1 14,00 43,00 23,00
PAUL Y2 15,00 90,00 34,00
PAUL Y3 16,00 32,00 56,00
PAUL Y4 20,00 45,00 65,00
RINGO Y1 25,00 60,00 87,00
RINGO Y2 24,00 30,00 23,00
RINGO Y3 31,00 20,00 54,00
RINGO Y4 75,00 10,00 12,00

Assumes that sorting names while printing output is good enough..

$ perl -ae '
@h = @F[0,2..$#F] if $. == 1;
if($. > 1)
{
    $d{$F[0]} .= " ".join(" ",@F[2..$#F]);
    $hh[$i++] = $F[1] if !$seen{$F[1]}++;
}
END
{
     print "$h[0] ";
     foreach (@hh){ for($j=1; $j <= $#h; $j++) {print "$_/$h[$j] "} }
     print "\n";
     print "$_$d{$_}\n" foreach (sort keys %d);
}
' ip.txt
NAME Y1/A Y1/B Y1/C Y2/A Y2/B Y2/C Y3/A Y3/B Y3/C Y4/A Y4/B Y4/C 
JOHN 10,00 19,00 65,00 11,00 23,00 64,00 12,00 33,00 34,00 13,00 34,00 32,00
PAUL 14,00 43,00 23,00 15,00 90,00 34,00 16,00 32,00 56,00 20,00 45,00 65,00
RINGO 25,00 60,00 87,00 24,00 30,00 23,00 31,00 20,00 54,00 75,00 10,00 12,00
  • Input lines are split on space and saved to @F array
  • From header line, get columns except 2nd one
  • For other lines
    • use 1st column as key and append columns 3rd one onwards
    • save the 2nd column uniquely to construct output header later on
  • At the end
    • construct header in the format saved
    • print keys (sorted alphabetically) and corresponding values from hash variable

For modified example:

$ cat ip1.txt 
NAME YEAR A B
JOHN Y1 10,00 19,00
JOHN Y2 11,00 23,00
PAUL Y1 14,00 43,00
PAUL Y2 15,00 90,00

$ perl -ae '
@h = @F[0,2..$#F] if $. == 1;
if($. > 1)
{
    $d{$F[0]} .= " ".join(" ",@F[2..$#F]);
    $hh[$i++] = $F[1] if !$seen{$F[1]}++;
}
END
{
     print "$h[0] ";
     foreach (@hh){ for($j=1; $j <= $#h; $j++) {print "$_/$h[$j] "} }
     print "\n";
     print "$_$d{$_}\n" foreach (sort keys %d);
}
' ip1.txt
NAME Y1/A Y1/B Y2/A Y2/B 
JOHN 10,00 19,00 11,00 23,00
PAUL 14,00 43,00 15,00 90,00
Sundeep
  • 23,246
  • 2
  • 28
  • 103