I am trying to collect data from different .csv files that share the same column names. However, some of the files have their header located in a different row.
Is there a way to determine the header row dynamically based on the first row that contains "most" values (the actual header names)?
I tried the following:
import pandas as pd

def process_file(file, path, col_source, col_target):
    global df_master
    print(file)
    df = pd.read_csv(path + file, encoding="ISO-8859-1", header=None)
    # Drop rows that contain fewer than 2 non-NaN values, e.g. metadata lines.
    df = df.dropna(thresh=2)
    # Promote the first remaining row to the header.
    df.columns = df.iloc[0, :].values
    df = df.drop(df.index[0])
However, when using pandas.read_csv(), the number of fields in the very first line seems to determine the expected width of the dataframe, as I receive the following error message:
pandas.errors.ParserError: Error tokenizing data. C error: Expected 1 fields in line 4, saw 162
As you can see, in this case the header row is located in line 4.
When adding error_bad_lines=False to read_csv, only the metadata is read into the dataframe.
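One workaround I have seen suggested (I'm not sure it is the canonical fix) is to pass explicit column names that are wider than any line in the file, so read_csv no longer infers the field count from the first line; short lines are then simply padded with NaN. The width of 10 here is an arbitrary assumption, and the inline string stands in for one of my files:

```python
import io

import pandas as pd

# Stand-in for a file with metadata lines before the header.
raw = "metadata1\nmetadata2\ncol1,col2,col3\nval1,val1,val1\n"

# Supplying more names than any line has fields avoids the
# "Expected 1 fields" tokenizing error.
df = pd.read_csv(io.StringIO(raw), header=None, names=range(10))

# Drop metadata rows (fewer than 2 non-NaN values) and empty columns.
df = df.dropna(thresh=2).dropna(axis=1, how="all")
print(df)
```

After this, the first remaining row holds the header names and can be promoted as in my function above.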
The files can have either the structure of:
a "Normal" File:
row1 col1 col2 col3 col4 col5
row2 val1 val1 val1 val1 val1
row3 val2 val2 val2 val2 val2
row4
or a structure with metadata before the header:
row1 metadata1
row2 metadata2
row3 col1 col2 col3 col4 col5
row4 val1 val1 val1 val1 val1
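For what it's worth, this is the kind of dynamic detection I had in mind: scan the file once with the csv module, take the first row with the most non-empty fields as the header, and hand its index to read_csv via skiprows. This is only a sketch; find_header_row is an illustrative name, and the sample file is written to a temp directory just to make the example self-contained:

```python
import csv
import os
import tempfile

import pandas as pd

def find_header_row(filepath, encoding="ISO-8859-1"):
    """Return the index of the first row with the most non-empty fields."""
    with open(filepath, newline="", encoding=encoding) as f:
        counts = [sum(1 for cell in row if cell.strip())
                  for row in csv.reader(f)]
    return counts.index(max(counts))

# Hypothetical sample file with metadata before the header.
sample = ("metadata1\nmetadata2\n"
          "col1,col2,col3,col4,col5\n"
          "val1,val1,val1,val1,val1\n")
path = os.path.join(tempfile.mkdtemp(), "example.csv")
with open(path, "w", encoding="ISO-8859-1") as f:
    f.write(sample)

df = pd.read_csv(path, encoding="ISO-8859-1",
                 skiprows=find_header_row(path), header=0)
print(list(df.columns))  # ['col1', 'col2', 'col3', 'col4', 'col5']
```

For a "normal" file the scan returns 0, so the same call should handle both structures, but I'm unsure whether this is robust for all my files.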
Any help much appreciated!