
I'm having trouble splitting large files into a number of smaller ones when one column contains newlines. The CSV file I'm trying to split uses pipes (|) as delimiters, and rows are separated by a newline (\n). Because one column contains several newlines, a single row in the file can look something like this:

col1 | col2 | col3| insert something in here

that is meaning

new documents

or formats

random text

text | col5 | col6 | col7

When I split this file (whether by lines or by bytes), the split can land right in the middle of col4. If that happens, the resulting file is broken and I can't process it later to insert the data into my table.

I tried both split and csplit, but I'm not sure I can get a clean split based on lines plus delimiters. If I use a csplit regex that matches (| and newline), it only picks up this part: text | col5 | col6 | col7, so unfortunately that doesn't work either.
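To illustrate what I mean by splitting on lines plus delimiters, here is a minimal sketch in Python rather than split/csplit. It reads the file line by line and treats a record as complete once it has seen enough pipes, so a chunk never ends inside col4. It rests on assumptions I can't fully guarantee: every record has the same number of columns (7, as in the sample above), the free-text column never contains a pipe, and the last column never contains a newline. The names iter_records and split_records are made up for this example.

```python
import io


def iter_records(lines, n_cols=7, delim="|"):
    """Yield complete logical records; each may span several physical lines.

    Assumes a fixed column count and no delimiter inside field values.
    """
    needed = n_cols - 1          # delimiters in one complete record
    buf, seen = [], 0
    for line in lines:
        buf.append(line)
        seen += line.count(delim)
        if seen >= needed:       # record is complete at the end of this line
            yield "".join(buf)
            buf, seen = [], 0
    if buf:
        # Trailing lines that never reached the delimiter count are
        # passed through unchanged instead of being dropped.
        yield "".join(buf)


def split_records(lines, records_per_chunk, n_cols=7, delim="|"):
    """Group complete records into chunks of at most records_per_chunk."""
    chunk = []
    for record in iter_records(lines, n_cols, delim):
        chunk.append(record)
        if len(chunk) == records_per_chunk:
            yield "".join(chunk)
            chunk = []
    if chunk:
        yield "".join(chunk)


if __name__ == "__main__":
    sample = "a|b|c|x\ny\nz|e|f|g\n1|2|3|4|5|6|7\n"
    for i, chunk in enumerate(split_records(io.StringIO(sample), 1)):
        print(f"--- chunk {i} ---")
        print(chunk, end="")
```

For a real file, the same generator could be fed an open file object (which yields lines lazily, so the whole file is never held in memory) and each chunk written to its own output file. If the column count varies or the text column can contain pipes, this counting approach breaks and the boundary cannot be detected this way.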

I'm running out of ideas here. Maybe this isn't possible with split and csplit at all, but I'm open to suggestions. Thank you!

  • Are you sure there are no quotes around that field? – Danny_ds Apr 16 '20 at 07:47
  • You mean col4 with the new lines? I just double-checked: no quotes, only new lines between words, which makes the file look like what I described above. And unfortunately I have no control over how I receive those files, so I have to work with what I get. – Nurdin Ibrisimovic Apr 16 '20 at 08:20
  • Why not replace the new line in field value with some other char ? – Digvijay S Apr 16 '20 at 08:25
  • @Nurdin Ibrisimovic - You should add the used regex where you wrote _If I try to use csplit regex where it matches (| and newline), it would just pick up this: text | col5 | col6 | col7_. – Armali Apr 16 '20 at 08:35
  • @DigvijayS because I'm dealing with huge files, doing that would hurt performance drastically compared with not splitting the file at all. Most of the files I can split; it's the ones with this kind of content that are giving me trouble. – Nurdin Ibrisimovic Apr 17 '20 at 10:55
