5

I have been working on a script to concatenate multiple csv files into a single, large csv. The csv's contain names of folders and their respective sizes, in a 2-column setup with the format "Size, Projectname"

Example of a single csv file:

49747851728,ODIN
32872934580,_WORK
9721820722,LIBRARY
4855839655,BASELIGHT
1035732096,ARCHIVE
907756578,USERS
123685100,ENV
3682821,SHOTGUN
1879186,SALT
361558,SOFTWARE
486,VFX
128,DNA

For my current test I have 25 similar files, with different numbers in the first column.

I am trying to get this script to do the following:

  • Read each csv file
  • For each Project it sees, scan the outputfile if that Project was already printed to the file. If not, print the Projectname
  • For each file, for each Project, if the Project was found, print the Size to the output csv.

However, I need the Projects to all be on textline 1, comma separated, so I can use this outputfile as input for a javascript graph. The Sizes should be added in the column below their projectname.

My current script:

csv_folder=$(echo "$1" | sed 's/^[ \t]*//;s/\/[ \t]*$//')
csv_allfiles="$csv_folder/*.csv"
csv_outputfile=$csv_folder.csv
echo -n "" > $csv_outputfile

for csv_inputfile in $csv_allfiles; do
  while read line && [[ $line != "" ]]; do
    projectname=$(echo $line | sed 's/^\([^,]*\),//')
    projectfound1=$(cat $csv_outputfile | grep -w $projectname)
if [[ ! $projectfound1 ]]; then
  textline=1
  sed "${textline}s/$/${projectname}, /" >> $csv_outputfile
    for csv_foundfile in $csv_allfiles; do
    textline=$(echo $textline + 1 | bc )
    projectfound2=$(cat $csv_foundfile | grep -w $projectname)
    projectdata=$(echo $projectfound2 | sed 's/\,.*$//')
        if [[ $projectfound2 ]]; then
          sed "${textline}s/$/$projectdata, /" >> $csv_outputfile
        fi
      done
    fi
  done < $csv_inputfile
done

My current script finds the right information (projectname, projectdata) and if I just 'echo' those variables, it prints the correct data to a file. However, with echo it only prints in a long list per project. I want it to 'jump back' to line 1 and print the new project at the end of the current line, then run the loop to print data at the end of each next line.

I was thinking this should be possible with sed or awk. sed should have a way of inserting text to a specific line with

sed '{n}s/search/replace/'

where {n} is the line to insert to

awk should be able to do the same thing with something like

awk -v l2="$textline" -v d="$projectdata" 'NR == l2 {print d} {print}' >> $csv_outputfile

However, while replacing the sed commands in the script with

echo $projectname 
echo $projectdata 

spit out the correct information (so I know my variables are filled correctly) the sed and awk commands tend to spit out the entire contents of their current inputcsv; not just the line that I want them to.

Pastebin outputs per variant of writing to file

As you see, the sed output tends to paste the whole contents of inputcsv, making the loop stop after one iteration. (since it finds the other Projects after one loop)

So my question is one of these;

  • How do I make sed / awk behave the way I want it to; i.e. print only the info in my var to the current textline, instead of the whole input csv. Is sed capable of this, printing just one line of variable? Or
  • Should I output the variables through 'echo' into a temp file, then loop over the temp file to make sed sort the lines the way I want them to? (Bear in mind that more .csv files will be added in the future, I can't just make it loop x times to sort the info)
  • Is there a way to echo/print text to a specific text line without using sed or awk? Is there a printf option I'm missing? Other thoughts?

Any help would be very much appreciated.

JasonD
  • 51
  • 3
  • 2
    this is an X-Y problem. You're trying to solve the problems that are caused by the choice of your solution. Still not clear to me what you're trying to do other than concatenate the files? Is a project in multiple files? If so, do you add up the sizes? Or, pick the first one? What is the expected output given the input file(s)? – karakfa Mar 20 '18 at 17:39
  • 1
    Whatever it is you're trying to do will be very trivial to solve with a small awk script but we can't tell what it is you're trying to do as you haven't provided clear, concise, testable sample input and expected output. Just do that and you'll get answers. – Ed Morton Mar 20 '18 at 23:41
  • I had no idea I was being unclear in my request, really. A sample input is given near the top of my post. I have a lot of csv files with a list containing foldersizes in the format of 'size, foldername'. I want these folders displayed in columns, with their sizes below them, one line for each .csv file that I have in my folder. Desired output: https://pastebin.com/5wiuq53n I will add that pastebin to my original post, too. – JasonD Mar 21 '18 at 10:07

3 Answers3

1

A way to accomplish this transposition is to save the data to an associative array.

In the following example, we use a two dimensional array to keep track of our data. Because ordering seems to be important, we create a col array and create a new increment whenever we see a new projectname -- this col array ends up being our first index into our data. We also create a row array which we increment whenever we see a new data for the current column. The row number is our second index into data. At the end, we print out all the records.

#! /usr/bin/awk -f
BEGIN {
    FS  = ","
    OFS = ", "
    rows=0
    cols=0
    head=""
    split("", data)
    split("", row)
    split("", col)
}
!($2 in col) { # new project
    if (head == "")
        head = $2
    else
        head = head OFS $2
    i = col[$2] = cols++
    row[i] = 0
}
{
    i = col[$2]
    j = row[i]++
    data[i,j] = $1
    if (j > rows)
        rows = j
}
END {
    print head
    for (j=0; j<=rows; ++j) {
        if ((0,j) in data)
            x = data[0,j]
        else
            x = ""
        for (i=1; i<cols; ++i) {
            if ((i,j) in data)
                x = x OFS data[i,j]
            else
                x = x OFS
        }
        print x
    }
}

As a bonus, here is a script to reproduce the detailed output from one of your pastebins.

#! /usr/bin/awk -f
BEGIN {
    FS = ","
    split("", data) # accumulated data for a project
    split("", line) # keep track of textline for data
    split("", idx)  # index into above to maintain input order
    sz = 0
}
$2 in idx { # have seen this projectname
    i = idx[$2]
    x =   ORS "textline = " ++line[i]
    x = x ORS "textdata = " $1
    data[i] = data[i] x
    next
}
{ # new projectname
    i = sz++
    idx[$2] = i
    x =       "textline = 1"
    x = x ORS "projectname = " $2
    x = x ORS "textline = 2"
    x = x ORS "projectdata = " $1
    data[i] = x
    line[i] = 2
}
END {
    for (i=0; i<sz; ++i)
        print data[i]
} 
Michael Back
  • 1,821
  • 1
  • 16
  • 17
0

Fill parray with project names and array with values, then print them with bash printf, You can choose column width in printf command (currently 13 characters - %13s)

#!/bin/bash
declare -i index=0
declare -i pindex=0
while read project; do
  parray[$pindex]=$project
  index=0
  while read;do
    array[$pindex,$index]="$REPLY"
    index+=1
  done <<< $(grep -h "$project" *.csv|cut -d, -f1)
  pindex+=1
done <<< $(cat *.csv|cut -d, -f 2|sort -u)
maxi=$index
maxp=$pindex

for (( pindex=0; $pindex < $maxp ; pindex+=1 ));do
 STR="%13s $STR"
 VAL="$VAL ${parray[$pindex]}"
done
printf "$STR\n" $VAL
for (( index=0; $index < $maxi;index+=1 ));do
  STR=""; VAL=""
  for (( pindex=0; $pindex < $maxp;pindex+=1 )); do
    STR="%13s $STR"
    VAL="$VAL ${array[$pindex,$index]}"
  done
  printf "$STR\n" $VAL
done
Saboteur
  • 1,331
  • 5
  • 12
0

If you are OK with the output being sorted by name this one-liner might be of use:

awk 'BEGIN {FS=",";OFS=","} {print $2,$1}' * | sort | uniq

The files have to be in the same directory. If not a list of files replaces the *. First it exchanges the two fields. Awk will take a list of files and do the concatenation. Then sort the lines and print just the unique lines. This depends on the project size always being the same.

The simple one-liner above gives you one line for each project. If you really want to do it all in awk and use awk write the two lines, then the following would be needed. There is a second awk at the end that accumulates each column entry in an array then spits it out at the end:

awk 'BEGIN {FS=","} {print $2,$1}' *| sort |uniq | awk 'BEGIN {n=0}
{p[n]=$1;s[n++]=$2}
END {for (i=0;i<n;i++) printf "%s,",p[i];print "";
for (i=0;i<n;i++) printf "%s,",s[i];print ""}'

If you have the rs utility then this can be simplified to

awk 'BEGIN {FS=","} {print $2,$1}' *| sort |uniq | rs -C',' -T
kd4ttc
  • 1,075
  • 1
  • 10
  • 28