I am trying to collect data from different .csv files that share the same column names. However, some of the files have their header located in a different row.
Is there a way to determine the header row dynamically based on the first row that contains "most" values (the actual header names)?
I tried the following:
import pandas as pd

def process_file(file, path, col_source, col_target):
    global df_master
    print(file)
    df = pd.read_csv(path + file, encoding="ISO-8859-1", header=None)
    # Drop rows that contain fewer than 2 non-NaN values, e.g. metadata lines.
    df = df.dropna(thresh=2)
    # Promote the first remaining row to the header.
    df.columns = df.iloc[0, :].values
    df = df.drop(df.index[0])
However, when using pandas.read_csv(), the number of fields in the very first line seems to determine the expected width of the dataframe, as I receive the following error message:
pandas.errors.ParserError: Error tokenizing data. C error: Expected 1 fields in line 4, saw 162
As you can see, in this case the header row is located in line 4.
When adding error_bad_lines=False to read_csv, only the metadata is read into the dataframe.
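One workaround I have seen suggested (I'm not sure it is the canonical fix) is to pass explicit column names that are wider than any line in the file, so read_csv no longer infers the field count from the first line; short lines are then simply padded with NaN. The width of 10 here is an arbitrary assumption, and the inline string stands in for one of my files:

```python
import io

import pandas as pd

# Stand-in for a file with metadata lines before the header.
raw = "metadata1\nmetadata2\ncol1,col2,col3\nval1,val1,val1\n"

# Supplying more names than any line has fields avoids the
# "Expected 1 fields" tokenizing error.
df = pd.read_csv(io.StringIO(raw), header=None, names=range(10))

# Drop metadata rows (fewer than 2 non-NaN values) and empty columns.
df = df.dropna(thresh=2).dropna(axis=1, how="all")
print(df)
```

After this, the first remaining row holds the header names and can be promoted as in my function above.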
The files can have either the structure of:
a "Normal" File:
row1 col1 col2 col3 col4 col5
row2 val1 val1 val1 val1 val1
row3 val2 val2 val2 val2 val2
row4
or a structure with metadata before the header:
row1 metadata1
row2 metadata2
row3 col1 col2 col3 col4 col5
row4 val1 val1 val1 val1 val1
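For what it's worth, this is the kind of dynamic detection I had in mind: scan the file once with the csv module, take the first row with the most non-empty fields as the header, and hand its index to read_csv via skiprows. This is only a sketch; find_header_row is an illustrative name, and the sample file is written to a temp directory just to make the example self-contained:

```python
import csv
import os
import tempfile

import pandas as pd

def find_header_row(filepath, encoding="ISO-8859-1"):
    """Return the index of the first row with the most non-empty fields."""
    with open(filepath, newline="", encoding=encoding) as f:
        counts = [sum(1 for cell in row if cell.strip())
                  for row in csv.reader(f)]
    return counts.index(max(counts))

# Hypothetical sample file with metadata before the header.
sample = ("metadata1\nmetadata2\n"
          "col1,col2,col3,col4,col5\n"
          "val1,val1,val1,val1,val1\n")
path = os.path.join(tempfile.mkdtemp(), "example.csv")
with open(path, "w", encoding="ISO-8859-1") as f:
    f.write(sample)

df = pd.read_csv(path, encoding="ISO-8859-1",
                 skiprows=find_header_row(path), header=0)
print(list(df.columns))  # ['col1', 'col2', 'col3', 'col4', 'col5']
```

For a "normal" file the scan returns 0, so the same call should handle both structures, but I'm unsure whether this is robust for all my files.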
Any help much appreciated!