1

I have many excel files that are in different formats. Some of them look like this, which is normal with one header can be read into pandas.

# First Column Second Column  Address            City      State Zip           
1 House        The Clairs     4321 Main Street   Chicago   IL    54872
2 Restaurant   The Monks      6323 East Wing     Miluakee  WI    45458 

and some of them are in various formats with multiple headers,

Table 1
Comp    ID Info
# First Column  Second Column  Address            City      State Zip           
1 Office        The Fairs      1234 Main Street   Seattle   WA    54872
2 College       The Blanks     4523 West Street   Madison   WI    45875         
3 Ground        The Brewers    895 Toronto Street Madrid    IA    56487         


Table2                                  
Comp    ID Info         
# First Column  Second Column  Address            City      State   Zip         
1 College       The Banks      568 Old Street     Cleveland OH      52125           
2 Professional  The Circuits   695 New Street     Boston    MA      36521

This looks like this in Excel (I am pasting the image here to show how it actually looks in excel), enter image description here

As you can see above there are three different levels of headers. For sure every file has a row that starts with First Column.

For an individual file like this, I can read like below, which is just fine.

xls = pd.ExcelFile(r'mypath\myfile.xlsx')    
df = pd.read_excel('xls', 'mysheet',  header=[2])

However, I need a final data frame like this (Appended with files that have only one header),

   First Column  Second Column  Address            City      State Zip
 0 House         The Clair      4321 Main Street   Chicago   IL    54872
 1 Restaurant    The Monks      6323 East Wing     Milwaukee WI    45458        
 2 Office        The Fairs      1234 Main Street   Seattle   WA    54872
 3 College       The Blanks     4523 West Street   Madison   WI    45875            
 4 Ground        The Brewers    895 Toronto Street Madrid    IA    56487
 5 College       The Banks      568 Old Street     Cleveland OH    52125            
 6 Professional  The Circuits   695 New Street     Boston    MA    36521

Since I have many files, I want to read each file in my folder and clean them up by getting only one header from a row. Had I knew the index position of the row, that I need as head, I could simply do something like in this post.

However, as some of those files can have Multiple headers (I showed 2 extra headers in above example, some have 4 headers) in different formats, I want to iterate through the file and set the row that starts with First Column to be header in the beginning of the file.

Additionally, I want to drop those rows that are in the middle of the the file that has First Column.

After I create a cleaned file headers starting with First Column, I can append each data frame and create my output file I need. How can I achieve this in pandas? Any help or suggestions would be great.

i.n.n.m
  • 2,936
  • 7
  • 27
  • 51
  • 1
    One way: read all rows in with `header=None` and your customize column-names. then filter out all rows based on the first column (i.e. `col1`): df = df[df.col1.astype(str).str.match('^\d+$')] – jxc May 26 '18 at 23:38

0 Answers0