5

Intro:

I have been given a CSV file in which the field delimiter is the pipe characted (i.e., |). This file has a pre-defined number of fields (say N). I can discover the value of N by reading the header of the CSV file, which we can assume to be correct.

Problem:

Some of the fields contain a newline character by mistake, which makes the line appear shorter than required (i.e., it has M fields, with M < N).

What I need to create is a sh script (not bash) to fix those lines.

Attempted solution:

I tried creating the following script to try fixing the file:

if [ $# -ne 1 ]
then
    echo "Usage: $0 <filename>"
    exit
fi

# get first line
first_line=$(head -n 1 $1)

# get number of fields
num_separators=$(echo "$first_line" | tr -d -c '|' | awk '{print length}')

cat $1  | awk -v numFields=$(( num_separators + 1 )) -F '|' '
{
    totRecords = NF/numFields
    # loop over lines
    for (record=0; record < totRecords; record++) {
        output = ""
        # loop over fields
        for (i=0; i<numFields; i++) {
            j = (numFields*record)+i+1 
            # replace newline with question mark
            sub("\n", "?", $j)
            output = output (i > 0 ? "|" : "") $j 
        }
        print output
    }
}
'

However, the newline character is still present. How can I fix that problem?

Example of the CSV:

FIRST_NAME|LAST_NAME|NOTES
John|Smith|This is a field with a
newline
Foo|Bar|Baz

Expected output:

FIRST_NAME|LAST_NAME|NOTES
John|Smith|This is a field with a * newline
Foo|Bar|Baz

* I don't care about the replacement, it could be a space, a question mark, whatever except a newline or a pipe (which would create a new field)
user2340612
  • 10,053
  • 4
  • 41
  • 66
  • Can you include a sample input and expected output? – anubhava Jun 27 '16 at 16:42
  • Is it safe to assume that if a field contains a line feed that the record will contain less fields then the header as will the next record(s)? – JNevill Jun 27 '16 at 16:49
  • @anubhava Yes, I will include it ASAP – user2340612 Jun 27 '16 at 16:51
  • @JNevill if I understood it right, yes: if line X contains a newline then lines X and X+1 will be shorter too. If line X contains 3 line feeds than lines X, X+1, X+2 and ax+3 will be shorter than required – user2340612 Jun 27 '16 at 16:56
  • As a clarification (emerged in comments on @JNevill's answer): all lines have the same number of fields, but some of the fields contain a newline/line feed. As a result, these lines are "split" across multiple lines. In other words, instead of having 1 line with 10 fields, you have 2 lines with, say, 7 and 3 fields. – user2340612 Jun 27 '16 at 18:07
  • If multiple lines are split after each other, it becomes really messy. Challenging problem, this. Why, by the way, do you call it a CSV file? It doesn't look comma-separated to me. – Kusalananda Jun 27 '16 at 18:13
  • [edit] your question to show the expected output given that input. – Ed Morton Jun 27 '16 at 18:18
  • Well, it's a "pipe-separated value". It's equivalent to a CSV, but with a different separator :) – user2340612 Jun 27 '16 at 18:21
  • @EdMorton I added the expected output, thanks! – user2340612 Jun 27 '16 at 18:23
  • 1
    W.r.t. the sample input, without further information I don't think there's anyway to know whether the extra newline was in the last field of the first record, or the first field of the second record. Of course as humans we can see you mean for it to be in the last field of the first record, but how would a program know? – jas Jun 27 '16 at 18:23
  • 1
    @jas Good observation, I forgot to mention it. For what I know (I am not the producer of such file), the first field **should not** be inputed by a human (which has other input fields and can insert newlines by mistake), but it a sort of ID, so we can safely assume that the first field does not contain any weird character. In this scenario the error must reside in the last field. – user2340612 Jun 27 '16 at 18:30

2 Answers2

7
$ cat tst.awk
BEGIN { FS=OFS="|" }
NR==1 { reqdNF = NF; printf "%s", $0; next }
{ printf "%s%s", (NF < reqdNF ? " " : ORS), $0 }
END { print "" }

$ awk -f tst.awk file.csv
FIRST_NAME|LAST_NAME|NOTES
John|Smith|This is a field with a newline
Foo|Bar|Baz

If that's not what you want then edit your question to provide more truly representative sample input and associated output.

Ed Morton
  • 188,023
  • 17
  • 78
  • 185
  • 1
    This is much better +1 – anubhava Jun 27 '16 at 18:32
  • 1
    Excellent answer, I do not even need the "initialization" script. I didn't know `awk`, but I must admit that it is really powerful. – user2340612 Jun 27 '16 at 18:37
  • 2
    Yes it is. If you might have to do any text manipulation and can use a CLI I highly recommend the book Effective Awk Programming, 4th Edition, by Arnold Robbins. – Ed Morton Jun 27 '16 at 20:12
  • 1
    This is beautiful. I use awk for simple field cutting, but this is so elegant. Just bought the text @EdMorton recommended. – Peter Dec 04 '16 at 03:31
1

Based on the assumption that the last field may contain one newline. Using tac and sed:

tac file.csv | sed -n '/|/!{h;n;x;H;x;s/\n/ * /p;b};p' | tac 

Output:

FIRST_NAME|LAST_NAME|NOTES
John|Smith|This is a field with a * newline
Foo|Bar|Baz

How it works. Read the file backwards, sed is easier without forward references. If a line has no '|' separator, /|/!, run the block of code in curly braces {};, otherwise just p print the line. The block of code:

  1. h; stores the delimiter-less line in sed's hold buffer.
  2. n; fetches another line, since we're reading backwards, this is the line that should be appended to.
  3. x; exchange hold buffer and pattern buffer.
  4. H; append pattern buffer to hold buffer.
  5. x; exchange newly appended lines to pattern buffer, now there's two lines in one buffer.
  6. s/\n/ * /p; replace the middle linefeed with a " * ", now there's only one longer line; and print.
  7. b start again, leave the code block.

Re-reverse the file with tac; done.

agc
  • 7,973
  • 2
  • 29
  • 50