Python can recognize header position and extract header info

Question

I have a csv file in which there are 3 or more headers in one csv file. I want python or pandas to be able to recognize the header position and extract the header info in the csv file. Here I give an example of a csv file that I have.

"Level and Distortion",,,,,,,,,,,,,,,
"Ch1 (F)",,"Ch1 (H2)",,"Ch1 (H3)",,"Ch1 (Total)",,"Ch2 (F)",,"Ch2 (H2)",,"Ch2 (H3)",,"Ch2 (Total)",
X,Y,X,Y,X,Y,X,Y,X,Y,X,Y,X,Y,X,Y
Hz,Vrms,Hz,Vrms,Hz,Vrms,Hz,Vrms,Hz,Vrms,Hz,Vrms,Hz,Vrms,Hz,Vrms
20,0.00772013164376534,20,5.60982648239952E-05,20,0.000389709733151927,20,0.011492581958802,20,0.00699792689186063,20,0.000151471712877565,20,0.000389940899485093,20,0.010080448380793
..
..
"THD Ratio",,,,,,,,,,,,,,,
Ch1,,Ch2,,,,,,,,,,,,,
X,Y,X,Y,,,,,,,,,,,,
Hz,%,Hz,%,,,,,,,,,,,,
20,83.009797319554,20,82.1460991930652,,,,,,,,,,,,
21.1179638886716,85.3656629417084,21.1179638886716,82.0338466400102,,,,,,,,,,,,
..
..
"Reference Waveform",,,,,,,,,,,,,,,
Ch1,,Ch2,,,,,,,,,,,,,
X,Y,X,Y,,,,,,,,,,,,
s,V,s,V,,,,,,,,,,,,
0,0,0,0,,,,,,,,,,,,
2.08333333333333E-05,6.47890208369956E-08,2.08333333333333E-05,6.47890208369956E-08,,,,,,,,,,,,

I want python to be able to recognize header position automatically.

Actually there are very many numeric values in the csv file data. — Muhammad Fauzan, May 19 '23 at 07:02
I believe that you need to iterate on the CSV file line by line to achieve that, read the whole CSV, using pd.read_csv, then with a for loop, you could extract each header with its data aside. — Ali Massoud, May 19 '23 at 07:08
I don't think the `csv` module or pandas will do this automatically. They'll only use the first line of the file automatically as the header. If you want to process each section as a separate file, you'll need to code this yourself. It's hard to see how we can write an "example" that isn't the whole thing. — Barmar, May 19 '23 at 07:28
I can see it in the question. If you aren't able to program this yourself, you need to find someone to do it for you. We'll help you fix the code for you, we won't write it for you. — Barmar, May 19 '23 at 07:37

score 0 · Answer 1 · answered May 19 '23 at 07:40

Here is one way to do it with read_csv and a very basic regex approach :

import re
from io import StringIO

with open("file.csv") as f:
    d = {m.group(1).split(",")[0].strip('"'): 
           (pd.read_csv(StringIO(m.group(2)), header=[0,1,2]).dropna(how="all", axis=1))
         for m in re.finditer("(.*?)\n(.*?)(\n..\n..\n|$)", f.read(), flags=re.DOTALL)}

Output :

d is a dict where the keys are the sub-csv/tables names :

print(d.keys())

#dict_keys(['Level and Distortion', 'THD Ratio', 'Reference Waveform'])

.. and the values are a DataFrame. To access one of the dataframes, use dictionnary-key indexing :

print(d["THD Ratio"])

    Ch1 Unnamed: 1_level_0   Ch2 Unnamed: 3_level_0
      X                  Y     X                  Y
     Hz                  %    Hz                  %
0 20.00              83.01 20.00              82.15
1 21.12              85.37 21.12              82.03

**NB**: The resulting DataFrame has a [*MultiIndex*](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.MultiIndex.html) that you can readapt with [`droplevel`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.droplevel.html) *and/or* [`set_axis`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.set_axis.html). — Timeless, May 19 '23 at 07:47

Python can recognize header position and extract header info

1 Answers1