0

To create two csv-files:

echo -e "123\n456" > t0.txt
echo -e '"foo","bar"\n"foo\"bar\"","baz"' > t1.txt

Now, I want append the columns in t1.txt to t0.txt, so that the result becomes this:

123,"foo","bar"
456,"foo\"bar\"","baz"

First try, using csvtool

csvtool paste t0.txt t1.txt 
Fatal error: exception Csv.Failure(2, 1, "Bad '"' in quoted field")

So, csvtool doesn't seem to handle the escaped quotation mark in "foo\"bar\"".

My real world use case has two CSV-files with +150.000.000 rows and 11 columns so I need a tool which can do the task without having all the data simultaneously in RAM.

Can csvtool be used with escaped quotation marks, or is there another tool that could solve this?

The final target for the CSV-file is a database in mariadb, so an efficient import to mariadb using t0.txt and t1.txt directly would be even better, but as far as I know LOAD DATA INFILE only works on a single CSV-file.

I definitely prefer a ready-made tool, but if there is none, then some C, Perl or Python snippets would be appreciated too.

Jonathan Leffler
  • 730,956
  • 141
  • 904
  • 1,278
Hans Ekbrand
  • 405
  • 5
  • 13
  • 2
    In normal csv, quotes are escaped by doubling them up... `"foo ""bar"""` – Shawn Jun 24 '19 at 21:53
  • It doesn't matter how many rows there are. In C you can read, modifiy and write each row of a CSV text file to a new file, one line (row) at a time. – Weather Vane Jun 24 '19 at 21:55
  • 2
    If the files have the same number of lines, `paste -d, t0.csv t1.csv`? Or fix them and use what you were originally trying. – Shawn Jun 24 '19 at 21:56
  • @Shawn doubling the quotation marks instead of escaping them seems to work! Nice. Your second comment `paste -d,` do you mean `cvstool paste -d,`? – Hans Ekbrand Jun 24 '19 at 21:59
  • 1
    I mean [paste](http://man7.org/linux/man-pages/man1/paste.1.html), which, unlike csvtool, doesn't care about the content of the lines. – Shawn Jun 24 '19 at 22:01
  • paste seems to work out of the box on my invalid CSV-file, so if you turn your comment into an answer I would accept it. – Hans Ekbrand Jun 24 '19 at 22:06
  • 1
    Does your data really have the invalid (or, at least, non-standard) CSV format, or is that only a problem with your example? Can you upgrade the data generation process to generate valid CSV data? Getting the input data correct will be far simpler than mangling it. You also need to know what MariaDB requires as the input format when processing CSV data. It may be using a non-standard variation on CSV (the 'standard' is either that used by Microsoft in Office (Excel, etc) or [RFC 4180 Common Format and MIME Type for Comma-Separated Values (CSV) Files](https://tools.ietf.org/html/rfc4180)). – Jonathan Leffler Jun 24 '19 at 22:23
  • @JonathanLeffler Thanks for you input. While I can regenerate the data (it was generated with `mysqldump`), as suggested by @Shawn `paste` seems to do the job. – Hans Ekbrand Jun 24 '19 at 22:30
  • 1
    That's good — it appears that MySQL and MariaDB work with a non-standard variant of CSV format. There is another tool, [CSVfix](https://bitbucket.org/neilb/csvfix/src/default/), that you might be able to use if you have Mercurial installed on your machine. OTOH, it has had somewhat minimal maintenance of late — there was a release 1.6 in 2014, but it isn't clear that anything has happened since then. It has had a chequered career — it was once hosted on Google Code (and a prior version is still available via https://code.google.com/archive/p/csvfix/). The URLs in the 'updates' are wrong. – Jonathan Leffler Jun 24 '19 at 22:43
  • Yeah, I was quite surprised to learn that `mysqldump` doesn't appear to be able to create valid CSV files when there are quotation marks in the content. `mysqldump` can be instructed to use double quotations, but then null values break, they will be exported as "N. – Hans Ekbrand Jun 24 '19 at 23:19

2 Answers2

2

Here's a quick perl script that reads your broken CSV files, merges them, and outputs properly escaped CSV all in one pass:

#!/usr/bin/env perl
use warnings;
use strict;
use autodie;
# Install through your OS package manager or CPAN client.
# libtext-csv-xs-perl on Debian/Ubuntu and family.
use Text::CSV_XS; 

open my $file0, "<", $ARGV[0];
open my $file1, "<", $ARGV[1];

my $csv = Text::CSV_XS->new({ binary => 1, escape_char => "\\",
                              auto_diag => 2, strict => 0});
my $out = Text::CSV_XS->new({ binary => 1 });

while ((my $row0 = $csv->getline($file0)) &&
       (my $row1 = $csv->getline($file1))) {
  push @$row0, @$row1;
  $out->say(\*STDOUT, $row0);
}

Example:

$ perl mergecsv.pl t0.txt t1.txt
123,foo,bar
456,"foo""bar""",baz
Shawn
  • 47,241
  • 3
  • 26
  • 60
1

CSV files generally escape double quotes by repetition ("" rather than \"), so your files could be considered invalid.

You could use a find and replace tool, such as sed on Unix, to fix the escaped quotes to this more common format.

If you're looking for some other command line tool to work with CSV files, I've authored one that's available at https://github.com/pjshumphreys/querycsv

Jonathan Leffler
  • 730,956
  • 141
  • 904
  • 1,278
Paul Humphreys
  • 338
  • 2
  • 9