1

I have a csv file in which there are 3 or more headers in one csv file. I want python or pandas to be able to recognize the header position and extract the header info in the csv file. Here I give an example of a csv file that I have.

"Level and Distortion",,,,,,,,,,,,,,,
"Ch1 (F)",,"Ch1 (H2)",,"Ch1 (H3)",,"Ch1 (Total)",,"Ch2 (F)",,"Ch2 (H2)",,"Ch2 (H3)",,"Ch2 (Total)",
X,Y,X,Y,X,Y,X,Y,X,Y,X,Y,X,Y,X,Y
Hz,Vrms,Hz,Vrms,Hz,Vrms,Hz,Vrms,Hz,Vrms,Hz,Vrms,Hz,Vrms,Hz,Vrms
20,0.00772013164376534,20,5.60982648239952E-05,20,0.000389709733151927,20,0.011492581958802,20,0.00699792689186063,20,0.000151471712877565,20,0.000389940899485093,20,0.010080448380793
..
..
"THD Ratio",,,,,,,,,,,,,,,
Ch1,,Ch2,,,,,,,,,,,,,
X,Y,X,Y,,,,,,,,,,,,
Hz,%,Hz,%,,,,,,,,,,,,
20,83.009797319554,20,82.1460991930652,,,,,,,,,,,,
21.1179638886716,85.3656629417084,21.1179638886716,82.0338466400102,,,,,,,,,,,,
..
..
"Reference Waveform",,,,,,,,,,,,,,,
Ch1,,Ch2,,,,,,,,,,,,,
X,Y,X,Y,,,,,,,,,,,,
s,V,s,V,,,,,,,,,,,,
0,0,0,0,,,,,,,,,,,,
2.08333333333333E-05,6.47890208369956E-08,2.08333333333333E-05,6.47890208369956E-08,,,,,,,,,,,,

I want python to be able to recognize header position automatically.

1 Answers1

0

Here is one way to do it with read_csv and a very basic regex approach :

import re
from io import StringIO

with open("file.csv") as f:
    d = {m.group(1).split(",")[0].strip('"'): 
           (pd.read_csv(StringIO(m.group(2)), header=[0,1,2]).dropna(how="all", axis=1))
         for m in re.finditer("(.*?)\n(.*?)(\n..\n..\n|$)", f.read(), flags=re.DOTALL)}

Output :

d is a dict where the keys are the sub-csv/tables names :

print(d.keys())

#dict_keys(['Level and Distortion', 'THD Ratio', 'Reference Waveform'])

.. and the values are a DataFrame. To access one of the dataframes, use dictionnary-key indexing : ​

print(d["THD Ratio"])

    Ch1 Unnamed: 1_level_0   Ch2 Unnamed: 3_level_0
      X                  Y     X                  Y
     Hz                  %    Hz                  %
0 20.00              83.01 20.00              82.15
1 21.12              85.37 21.12              82.03
Timeless
  • 22,580
  • 4
  • 12
  • 30
  • **NB**: The resulting DataFrame has a [*MultiIndex*](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.MultiIndex.html) that you can readapt with [`droplevel`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.droplevel.html) *and/or* [`set_axis`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.set_axis.html). – Timeless May 19 '23 at 07:47