
I have a text file that looks like the following:

1000000    45    M    This is a line        This is another line  Another line
                      that breaks into      that also breaks      that has a blank
                      multiple rows         into multiple rows -  row below.
                                            How annoying!

1000001    50    F    I am another          I am well behaved.    
                      column that has
                      text spanning
                      multiple rows

I would like to convert this into a csv file that looks like:

1000000, 45, M, This is a line that breaks into multiple rows, This is another line that also breaks into multiple rows - How annoying!
1000001, 50, F, I am another column that has text spanning multiple rows, I am well behaved.

The text file output comes from a program that was written in 1984, and I have no way to modify the output. I want it in csv format so that I can convert it to Excel as painlessly as possible. I am not sure where to start, and rather than reinvent the wheel, was hoping someone could point me in the right direction. Thanks!

== EDIT ==

I've modified the text file to have \n between rows - maybe this will be helpful?

== EDIT 2 ==

I've modified the text file to have a blank row.

oort
  • Is text file tab separated or just spaces? – anubhava Sep 09 '14 at 18:10
  • Are the widths of all the columns known in advance? Also, is python scripting legitimate? – Barak Itkin Sep 09 '14 at 18:18
  • The only way I can think of to do this is to read the file line by line and process it as you go. – Bacon Bits Sep 09 '14 at 18:19
  • Can every column do this or just some columns? Are the widths known? Or do they at least always follow the first column? Are any columns ever missing? – Etan Reisner Sep 09 '14 at 18:21
  • I am not sure whether the column positions are fixed, but a rule that seems to hold is that each column is separated from the next by at least 2 spaces; the first column starts at the first character and extends to at most 2 spaces before the next column – oort Sep 09 '14 at 18:22
  • And yes, some columns are blank – oort Sep 09 '14 at 18:22
  • One thing is that we cannot assume that the column widths are the same for each text file created by the program – oort Sep 09 '14 at 20:15
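
Putting the rule from these comments into code: if each record's first line has its columns separated by at least two spaces, the column start positions can be worked out from that line and then used to slice every row at fixed offsets. Below is a rough Python sketch of that idea (the helper names column_starts and split_fixed are made up for illustration; the awk and Perl answers below do essentially the same thing):

import re

def column_starts(first_line):
    # Offsets where a column begins: position 0, or any non-space character
    # preceded by at least two spaces (the "2 spaces" rule from the comments).
    return [m.start() for m in re.finditer(r'(?:^|(?<=  ))\S', first_line)]

def split_fixed(line, starts):
    # Slice a row at those offsets and trim the padding.
    ends = starts[1:] + [None]
    return [line[a:b].strip() for a, b in zip(starts, ends)]

header = "1000000    45    M    This is a line        This is another line  Another line"
starts = column_starts(header)
print(split_fixed(header, starts))
# ['1000000', '45', 'M', 'This is a line', 'This is another line', 'Another line']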

4 Answers


Using GNU awk

gawk '
    BEGIN { FIELDWIDTHS = "11 6 5 22 22" }      # fixed column widths (gawk-specific)
    length($1) == 11 {                          # only rows wide enough to fill column 1 (skips blank rows)
        if ($1 ~ /[^[:blank:]]/) {              # first column has data: a new record starts
            if (f1) print_line()                # flush the previous record
            f1=$1; f2=$2; f3=$3; f4=$4; f5=$5
        }
        else {                                  # continuation row: extend the two text columns
            f4 = f4" "$4; f5 = f5" "$5
        }
    }
    function rtrim(str) {
        sub(/[[:blank:]]+$/, "", str)
        return str
    }
    function print_line() {
        # squeeze runs of blanks to one space and double any embedded quotes
        gsub(/[[:blank:]]{2,}/, " ", f4); gsub(/"/, "&&", f4)
        gsub(/[[:blank:]]{2,}/, " ", f5); gsub(/"/, "&&", f5)
        printf "%s,%s,%s,\"%s\",\"%s\"\n", rtrim(f1), rtrim(f2), rtrim(f3), f4, f5
    }
    END { if (f1) print_line() }                # flush the last record
' file
1000000,45,M,"This is a line that breaks into multiple rows ","This is another line that also breaks into multiple rows - How annoying!"
1000001,50,F,"I am another column that has text spanning multiple rows","I am well behaved. "

I've quoted the last 2 columns in case they contain commas, and doubled any potential inner double quotes.
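
As a side note, this doubled-quote convention is exactly what standard CSV tools expect. A quick Python illustration (not part of the awk script; the field text is made up) showing the csv module producing the same escaping:

import csv, io

out = io.StringIO()
csv.writer(out).writerow(
    ["1000000", "45", "M", 'He said "how annoying", and left, twice'])
print(out.getvalue(), end="")
# 1000000,45,M,"He said ""how annoying"", and left, twice"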

glenn jackman

Here's a Perl script that does what you want. It uses unpack to split the fixed-width columns into fields, appending to the previous record's fields when there is no data in the first column.

As you've mentioned that the widths vary between files, the script works out the widths for itself, based on the content of the first line. The assumption is that there are at least two space characters between each field. It creates an unpack format string like A11 A6 A5 A22 A21, where A means a space-padded ASCII string (unpack strips the trailing whitespace) and the numbers specify the width of each field.

Inspired by glenn's version, I have wrapped any field containing spaces in double quotes. Whether that's useful or not depends on how you're going to end up using the data. For example, if you want to parse it using another tool and there are commas within the input, it may be helpful. If you don't want it to happen, you can remove the map block in both places and keep just grep { $_ ne "" }:

use strict;
use warnings;

# Work out the field widths from the first line: each field starts where a
# non-space character follows at least two spaces.
chomp (my $first_line = <>);
my @fields = split /(?<=\s{2})(?=\S)/, $first_line;
my $format = join " ", map { "A" . length } @fields;
my @cols = unpack $format, $first_line;

while (<>) {
    chomp( my $line = $_ );
    my @tmp = unpack $format, $line;
    if ($tmp[0] ne '') {
        # New record: print the previous one, quoting any field that contains spaces
        print join(", ", map { /\s/ ? qq/"$_"/ : $_ } grep { $_ ne "" } @cols), "\n";
        @cols = @tmp;
    }
    else {
        # Continuation line: append each non-empty field to the previous record
        for (1..$#tmp) {
            $cols[$_] .= " $tmp[$_]" if $tmp[$_] ne "";
        }
    }
}

print join(", ", map { /\s/ ? qq/"$_"/ : $_ } grep { $_ ne "" } @cols), "\n";

Output:

1000000, 45, M, "This is a line that breaks into multiple rows", "This is another line that also breaks into multiple rows - How annoying!"
1000001, 50, F, "I am another column that has text spanning multiple rows", "I am well behaved."
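
For what it's worth, the quoted output also reads back cleanly with an ordinary CSV parser. A minimal check using Python's csv module (just pasting in the first output row; skipinitialspace is needed because of the ", " separator):

import csv

line = ('1000000, 45, M, "This is a line that breaks into multiple rows", '
        '"This is another line that also breaks into multiple rows - How annoying!"')
row = next(csv.reader([line], skipinitialspace=True))
print(row)
# ['1000000', '45', 'M', 'This is a line that breaks into multiple rows',
#  'This is another line that also breaks into multiple rows - How annoying!']
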
Tom Fenech
  • This one almost works, except that in the cases where one column is blank it doesn't have the correct output; is there any way to overcome this? Also some of the strings are quoted, others are not. – oort Sep 11 '14 at 16:16
  • If you edit your question to include a sample of some data that this approach doesn't work for, I'll take a look at it. – Tom Fenech Sep 11 '14 at 16:18
  • I modified the example as you requested. Thanks. – oort Sep 13 '14 at 04:13
  • I've updated the script so that it does what you want. I've also explained the reason behind using double quotes and shown how you can get rid of them. Hope that helps. – Tom Fenech Sep 13 '14 at 11:31

Using this awk:

awk -F ' {2,}' -v OFS=', ' 'NF==5{if (p) print a[1], a[2], a[3], a[4], a[5]; 
   for (i=1; i<=NF; i++) a[i]=$i; p=index($0,$4)}
   NF<4 {for(i=2; i<=NF; i++) index($0,$i) == p ? a[4]=a[4] " " $i : a[5]=a[5] " " $i}
   END { print a[1], a[2], a[3], a[4], a[5] }' file
1000000, 45, M, This is a line that breaks into multiple rows, This is another line that also breaks into multiple rows - How annoying!
1000001, 50, F, I am another column that has text spanning multiple rows, I am well behaved.
anubhava

You can write a Python script that does this. Read each line and split it; a blank line means the next non-blank line starts a new record, otherwise the pieces belong to the previous record. Finally, use the csv module's writer to write the result set to a file.

Something along the lines of:

import csv
from itertools import zip_longest

inputFile = open(filename, 'r')          # filename assumed to be defined elsewhere
isNewItem = True
results = []
for line in inputFile:
    line = line.rstrip('\n')
    if line.strip() == '':
        # A blank row separates records: the next non-blank line starts a new one
        isNewItem = True
        continue

    temp = line.split()
    if isNewItem:
        results.append(temp)
        isNewItem = False
    else:
        # Continuation row: combine each piece with the corresponding piece of the last record
        lastRow = results[-1]
        combinedRow = []
        for leftColumn, rightColumn in zip_longest(lastRow, temp, fillvalue=''):
            combinedRow.append((leftColumn + ' ' + rightColumn).strip())
        results[-1] = combinedRow
inputFile.close()

with open(csvOutputFileName, 'w', newline='') as outFile:   # output file name assumed defined
    csv.writer(outFile).writerows(results)
ventsyv