0

I've got one crappy CSV file with multiple headers inside it. It looks like this:

File1:
    #HEADER COL1 COL2
    data
    data
    data
    #HEADER COL1 COL2 COL3
    data
    data
    data
    data
    data
    #HEADER COL1 COL2 COL3 COL4
    data
    data
    ...

Since I can't load it with pandas due to the in-the-file-headers, I'm looking to split the data at each header to a separate file (or separate data frames in pandas). Is there a way to do this?

This CSV is generated by sensors. If a sensor is added, the header will get a new column. This could also happen in the in-the-file-headers. So it is NOT a solution to delete those headers. (Clean wrong header inside Dataframe with Python/Pandas)

It would be really nice to do it in python/pandas, but I would also be happy with a bash command/script solution.

Expected output:

File1:
        #HEADER COL1 COL2
        data
        data
        data
File2:
        #HEADER COL1 COL2 COL3
        data
        data
        data
        data
        data
File3:
        #HEADER COL1 COL2 COL3 COL4
        data
        data
        ...

Thank you!

vince
  • 47
  • 1
  • 6

2 Answers2

1

awk to the rescue!

$ awk '/^#HEADER/{close(FILENAME "_" c); c++} {print > (FILENAME "_" c)}' file

will split input file into file_n parts, where n is the section counter.

karakfa
  • 66,216
  • 7
  • 41
  • 56
  • in my case this just worked beautifully! `awk '/^#/{close(FILENAME "_" c); c++} {print > (FILENAME "_" c)}' file` – vince Nov 13 '19 at 16:03
  • last question: how can I export the result file to another folder? – vince Nov 13 '19 at 16:09
1

With convenient csplit command (to split a file into sections by pattern):

csplit -b %d -f file -z input_file '/#HEADER.*/' '{*}'

Viewing results:

$ head file[0-9]
==> file0 <==
#HEADER COL1 COL2
data
data
data

==> file1 <==
#HEADER COL1 COL2 COL3
data
data
data
data
data

==> file2 <==
#HEADER COL1 COL2 COL3 COL4
data
data
...
RomanPerekhrest
  • 88,541
  • 4
  • 65
  • 105