I have many excel files that are in different formats. Some of them look like this, which is normal with one header can be read into pandas
.
# First Column Second Column Address City State Zip
1 House The Clairs 4321 Main Street Chicago IL 54872
2 Restaurant The Monks 6323 East Wing Miluakee WI 45458
and some of them are in various formats with multiple headers,
Table 1
Comp ID Info
# First Column Second Column Address City State Zip
1 Office The Fairs 1234 Main Street Seattle WA 54872
2 College The Blanks 4523 West Street Madison WI 45875
3 Ground The Brewers 895 Toronto Street Madrid IA 56487
Table2
Comp ID Info
# First Column Second Column Address City State Zip
1 College The Banks 568 Old Street Cleveland OH 52125
2 Professional The Circuits 695 New Street Boston MA 36521
This looks like this in Excel (I am pasting the image here to show how it actually looks in excel),
As you can see above there are three different levels of headers. For sure every file has a row that starts with First Column
.
For an individual file like this, I can read like below, which is just fine.
xls = pd.ExcelFile(r'mypath\myfile.xlsx')
df = pd.read_excel('xls', 'mysheet', header=[2])
However, I need a final data frame like this (Appended with files that have only one header),
First Column Second Column Address City State Zip
0 House The Clair 4321 Main Street Chicago IL 54872
1 Restaurant The Monks 6323 East Wing Milwaukee WI 45458
2 Office The Fairs 1234 Main Street Seattle WA 54872
3 College The Blanks 4523 West Street Madison WI 45875
4 Ground The Brewers 895 Toronto Street Madrid IA 56487
5 College The Banks 568 Old Street Cleveland OH 52125
6 Professional The Circuits 695 New Street Boston MA 36521
Since I have many files, I want to read each file in my folder and clean them up by getting only one header from a row. Had I knew the index position of the row, that I need as head, I could simply do something like in this post.
However, as some of those files can have Multiple headers (I showed 2 extra headers in above example, some have 4 headers) in different formats, I want to iterate through the file and set the row that starts with First Column
to be header in the beginning of the file.
Additionally, I want to drop those rows that are in the middle of the the file that has First Column
.
After I create a cleaned file headers starting with First Column
, I can append each data frame and create my output file I need. How can I achieve this in pandas
? Any help or suggestions would be great.