How to parse CSV with pgloader or PostgreSQL copy when there's a \n (line break) inside double quotes?

Question

I have a csv line like this:

"aaa"|"bbb"|"ccc"|"dddd
eeeee"

I want to import this data using pgloader (http://pgloader.io/) or PostgreSQL COPY (https://www.postgresql.org/docs/9.6/static/sql-copy.html). My issue here is the fact tha according to the CSV standards it's possible to have a line break (\n) inside a quoted field value. But pgloader and also the COPY command treat it like a brand new line of data, instead of one column with line break inside.

COPY schema.table (
    col_aaa,
    col_bbb,
    col_ccc,
    col_ddd
) 
    FROM 'file.csv' WITH DELIMITER '|' ENCODING 'LATIN1' CSV;

My COPY command My pgloader command

LOAD CSV
FROM 'file.csv' 
INTO postgresql://user:password@host:5432/database?schema.table (col_aaa, col_bbb, col_ccc, col_ddd)

WITH   
skip header = 0,
fields optionally enclosed by '"',
fields escaped by double-quote,
fields terminated by '|'  

SET client_encoding to 'latin1'
BEFORE LOAD DO
    $$ TRUNCATE anac.aerodromos_csv RESTART IDENTITY; $$;

I've search a lot from the PostgreSQL docs and also from google.

The only thing that I've found was this: Parsing CSV file with \n in double quoted fields but awk is too slow for a file with more than 1 million lines.

Any hint about how to do it?

My preference is for pgloader, but I can accept the use of sed or perl as regex agent to treat file at a linux shell script.

Any clues about how to do it?

You have to check the column numbers in every line If it is less than the actual number of columns, You should add next line to the current line and check again. — AbhiNickz, Jun 07 '17 at 18:28
Sorry @AbhiNickz I can't see how that idea solves the problem. The matter is that I have an extra line break inside the field data, and adding a new one doesn't seems to help. — Jonathas Pacífico, Jun 08 '17 at 11:13
Please check the answer, I am not adding a new line, I am basically concatenating the lines which are broken into two lines due to **\n**. — AbhiNickz, Jun 08 '17 at 17:54

score 0 · Answer 1 · answered Jun 08 '17 at 07:27

To give you an idea to solve this problem, I am writing this example.

I have just assumed that file will contain only 4 column and will contain only 1 line break. If It isn't the case then you need to change this.

Input File:

"aaa"|"bbb"|"ccc"|"ddddeeeee"
"aaa"|"bbb"|"ccc"|"ddddeeeee"
"aaa"|"bbb"|"ccc"|"dddd
eeeee"
"aaa"|"bbb"|"cc
c"|"ddddeeeee"
"aaa"|"bbb"|"ccc"|"ddddeeeee"
"aaa"|"bbb"|"ccc"|"ddddeeeee"
"aaa"|"b
bb"|"ccc"|"ddddeeeee"
"aaa"|"bbb"|"ccc"|"ddddeeeee"
"aaa"|"bbb"|"ccc"|"ddddeeeee"
"a
aa"|"bbb"|"ccc"|"ddddeeeee"
"aaa"|"bbb"|"ccc"|"ddddeeeee"
"aaa"|"bbb"|"ccc"|"ddddeeeee"
"aaa"|"bbb"|"ccc"|"ddddeeeee"
"aaa"|"bbb"|"ccc"|"dddd
eeeee"

script.pl

#!/usr/bin/perl
use strict; use warnings; use Data::Dumper;

open ( my $RFH, '<', 'input.io' ) or die ($!);
open ( my $WFH, '>', 'output.o' ) or die ($!);

my $line_break = 0;
my $old_line = '';

while ( my $line = <$RFH> ) {
    chomp($line);
    if ( ! $line_break ){
        my @columns = split( /\|/, $line );
        if ( scalar( @columns ) == 4 && $columns[3] =~ m/"$/ ){
            print $WFH $line."\n";
        }
        else{
            $line_break = 1;
            $old_line = $line;
            next;
        }
    }
    else{
        $line = $old_line . $line;
        $old_line = '';
        $line_break = 0;
        print $WFH $line."\n";
    }
}

close($RFH);
close($WFH);

OUTPUT File:

"aaa"|"bbb"|"ccc"|"ddddeeeee"
"aaa"|"bbb"|"ccc"|"ddddeeeee"
"aaa"|"bbb"|"ccc"|"ddddeeeee"
"aaa"|"bbb"|"ccc"|"ddddeeeee"
"aaa"|"bbb"|"ccc"|"ddddeeeee"
"aaa"|"bbb"|"ccc"|"ddddeeeee"
"aaa"|"bbb"|"ccc"|"ddddeeeee"
"aaa"|"bbb"|"ccc"|"ddddeeeee"
"aaa"|"bbb"|"ccc"|"ddddeeeee"
"aaa"|"bbb"|"ccc"|"ddddeeeee"
"aaa"|"bbb"|"ccc"|"ddddeeeee"
"aaa"|"bbb"|"ccc"|"ddddeeeee"
"aaa"|"bbb"|"ccc"|"ddddeeeee"
"aaa"|"bbb"|"ccc"|"ddddeeeee"

Change this example to your need. Hope this helps.

score 0 · Accepted Answer · answered Jun 21 '17 at 19:49

The problem was that at the end of line there was an \r\n. And to solve that I removed the \r only from the line break inside the data.

That way pgloader was able to do the job.

I've done that by using an one line perl.

perl -0777 -pi -e 's/(?<="[^"|])*(?<!["|])\r\n(?=[^"]*")/\n/smg' $csv_file_name # O -0777 is explained at em https://stackoverflow.com/questions/9670426/perl-command-line-multi-line-replace

Perl command line multi-line replace

How to parse CSV with pgloader or PostgreSQL copy when there's a \n (line break) inside double quotes?

2 Answers2