With common shell tools, it is easy to split a big file (in my case a MySQL dump, and thus a TSV file) into smaller parts using the split command. This command also supports splitting a file after every n lines (the -l argument). However, it does not distinguish between escaped and unescaped newline characters and thus might break a single table row into two incomplete parts.

Example (TSV with 2 columns)

cool    2014-12-15 17:31:00
do not censor it ...^M\\n      2016-01-24 22:33:00
watch out ari, you've got compeition! hahah     2001-12-05 19:11:01
Oh God, the poor guy!  xD\\nCan't wait to watch this!      2011-07-11 22:01:20
wish i could do that.\\n       2001-02-07 00:24:11
Funny! I will use this reason when I drink something in other houses    2015-06-10 12:20:00

As you can see, there are two columns (the first contains the comment and the second the date), separated by a tab. Only the escaped newlines are made visible here (as \\n); tabs and unescaped newlines are not shown. If you put these lines into a file and split it (e.g., split example.tsv -l 1), you get 9 files, although there are only 6 comments (3 of them contain an escaped newline). This is because an escaped newline is a regular newline prefixed with a backslash, and split treats it like any other newline. This is a huge problem for me, because splitting the file might leave incomplete table rows in the output files.
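The effect can be reproduced with a much smaller file. The following is a minimal sketch, assuming a POSIX shell with printf, mktemp and split available; the scratch directory, the shortened sample rows and the part_ prefix are illustrative choices, not part of the original dump.

```shell
# Work in a scratch directory so the generated parts are easy to inspect.
tmp=$(mktemp -d) && cd "$tmp" || exit 1

# Three table rows. The second comment contains an escaped newline
# (a backslash followed by a real newline), so the file has four physical lines.
printf 'cool\t2014-12-15 17:31:00\n' > example.tsv
printf 'wish i could do that.\\\n\t2001-02-07 00:24:11\n' >> example.tsv
printf 'Funny!\t2015-06-10 12:20:00\n' >> example.tsv

# split counts physical lines, so it produces four parts for three rows.
split -l 1 example.tsv part_
set -- part_*
echo "$#"   # prints 4
```

The second row ends up spread over two parts, one holding the comment with its trailing backslash and one holding only the tab and the date.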

Is it somehow possible to make split ignore escaped newlines, or does someone know another command that can do this?

NaN
  • Can you show us with some sample text – Inian Dec 08 '17 at 18:00
  • @Inian voila, I added an example. I hope this helps you understand my problem. – NaN Dec 08 '17 at 18:56
  • It would help if your sample text actually showed content with the linebreaks in question -- in your current sample, *no* line at all ends in a backslash literal. – Charles Duffy Dec 08 '17 at 19:20
  • I thought you meant a "\" at the end of the line, which is what "escaping a newline" means. Your current example should have no problem with `split` – thanasisp Dec 08 '17 at 19:25

1 Answer

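This starts a new output file once the current one holds more than n lines (n=20 here, or whatever you set n to), but never directly after a line that ends with a backslash. Because the test is c>n, each full file gets at least n+1 lines; change it to c>=n if you want the limit to be exactly n: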

awk -v n=20 'NR==1 || (c>n && !(last~/\\$/)){c=0; close(f); f="file" ++count ".out"} {c++; print>f; last=$0}' file

How it works

  1. -v n=20

    This creates an awk variable n which we will use to decide when to split the file.

  2. NR==1 || (c>n && !(last~/\\$/)){c=0; close(f); f="file" ++count ".out"}

    Every time that we need to start a new file, we (a) set the line counter, c, to zero, (b) close the previous file, and (c) define a name for the next file.

    We need to start a new file when (i) we are on the first input line, NR==1, or else when (ii) the line counter c exceeds the limit n and the last line did not end with \.

  3. c++; print>f; last=$0

    This increments the line counter, c, prints the current line to file f, and updates last to the value of the current line.
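For readability, the same program can be laid out over several lines with comments. This is only a sketch of the one-liner above: the scratch directory and the small sample input are assumptions made for illustration, and the parentheses around ++count are added purely for clarity.

```shell
# Scratch directory and a small input whose third row continues onto a new line.
tmp=$(mktemp -d) && cd "$tmp" || exit 1
printf 'text1\t2014\ntext2\t2015\ntext3\\\n\t2016\ntext4\t2017\n' > file

awk -v n=2 '
# Start a new output file on the first line, or once the current file
# holds more than n lines and the previous line did not end in a backslash.
NR == 1 || (c > n && last !~ /\\$/) {
    c = 0                          # reset the per-file line counter
    close(f)                       # close the previous output file
    f = "file" (++count) ".out"    # name the next output file
}
{
    c++                            # count this line
    print > f                      # append it to the current output file
    last = $0                      # remember it for the backslash check
}' file
```

With this input, file1.out receives the first four physical lines, because the split that would otherwise happen before the date line is suppressed by the trailing backslash on the line before it, and file2.out receives the last line.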

Example

Let's try this test file:

$ cat file
text1   2014-12-15 17:31:01
text2\
        2014-12-15 17:31:02
text3   2014-12-15 17:31:03
text4a\
text4b\
        2014-12-15 17:31:04
text5   2014-12-15 17:31:05

Now, let's run our command. To keep the example short, we set n=2:

$ awk -v n=2 'NR==1 || (c>n && !(last~/\\$/)){c=0; close(f); f="file" ++count ".out"} {c++; print>f; last=$0}' file

After the command is run, new files appear in the directory:

$ ls
file  file1.out  file2.out  file3.out

The new files contain the original contents, split after every few lines (n=2 here) but never directly after a line ending in \:

$ cat file1.out
text1   2014-12-15 17:31:01
text2\
        2014-12-15 17:31:02
$ cat file2.out
text3   2014-12-15 17:31:03
text4a\
text4b\
        2014-12-15 17:31:04
$ cat file3.out
text5   2014-12-15 17:31:05
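
If the limit should apply to complete table rows rather than physical lines, one possible variant is to count only the lines that do not end in a backslash. This is a hedged sketch and not the command shown above: the names rows.tsv, rows, part and out are hypothetical, and it assumes that every line ending in a backslash is continued, so a row that legitimately ends in an escaped backslash (\\) would be misread.

```shell
# Scratch directory and a sample with 5 rows spread over 6 physical lines.
tmp=$(mktemp -d) && cd "$tmp" || exit 1
printf 'a\t1\nb\\\n\t2\nc\t3\nd\t4\ne\t5\n' > rows.tsv

awk -v n=2 '
# Open the first part, and a new one whenever the current part
# already holds n complete rows.
NR == 1 || rows >= n {
    rows = 0
    close(out)
    out = "rows" (++part) ".out"
}
{
    print > out
    # A line ending in a backslash continues on the next physical line,
    # so only lines without a trailing backslash complete a row.
    if ($0 !~ /\\$/) rows++
}' rows.tsv
```

Here rows1.out holds two rows on three physical lines, rows2.out holds two rows, and rows3.out holds the remaining one.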
John1024