I am filtering an external data source that is formatted as an excel file. There is nothing I can do to change how the file is generated. I need to filter out rows that are useless and combine pairs of rows into one. The process I have so far works on the filtering, but not on the joining of related data in two successive rows into one row.
The dataframes are not converted nicely for stackoverflow, but I have hand-tweaked them below.
Data Transformations
Convert downloads into useful formats
import pandas as pd
from pandas import DataFrame
from pandas.io.excel import read_excel
cpath = os.path.join (download_path, classes_report)
print (pd.__version__)
df = pd.read_excel (cpath, sheetname=0, header=None)
df.to_string()
0.14.1
0 1 2 3 4 5
0 Session: 2014-2015 NaN NaN NaN NaN NaN
1 Class Information Age Enrolled Key Room NaN
2 Math 10 12 / 18 03396 110 09:00:00
3 Teacher: Joe M Teacher NaN NaN NaN NaN
4 NaN NaN NaN NaN NaN NaN
5 NaN NaN NaN NaN
6 NaN NaN NaN NaN
7 NaN NaN NaN NaN NaN NaN
8 NaN NaN NaN NaN NaN NaN
9 Number of Classes: 1 Number of Students: 12 / 18 NaN NaN NaN NaN
10 Class Information Ages Enrolled Key Room NaN
11 Art 18 - 80 3 / 24 03330 110 10:00:00
12 Teacher: John A Instructor NaN NaN NaN NaN
13 NaN NaN NaN NaN NaN NaN
14 NaN NaN NaN NaN
15 NaN NaN NaN NaN
# Eliminate any rows where first column is NaN, contains 'Number of Classes', 'Class Information'
# or is blank
# The 5th column is tuition.
cf = df[df[0].notnull ()][1:]
cf = cf [~cf[0].str.contains ('Number of Classes')]
bf = cf[~cf[0].isin ([' ', 'Class Information'])]
bf.to_string()
0 1 2 3 4 5
2 Math 10 12 / 18 03396 110 09:00:00
3 Teacher: Joe M Teacher NaN NaN NaN NaN
11 Art 18 - 80 3 / 24 03330 110 10:00:00
12 Teacher: John A Instructor NaN NaN NaN NaN
left = DataFrame(bf.values [::2], index=bf.index[::2])
right = DataFrame(bf.values [1::2], index=bf.index[1::2])
pd.concat([left, right], axis=1).to_string ()
0 1 2 3 4 5 0 1 2 3 4 5
2 Math 10 12 / 18 03396 110 09:00:00 NaN NaN NaN NaN NaN NaN
3 NaN NaN NaN NaN NaN NaN Teacher: Joe M Teacher NaN NaN NaN NaN
11 Art 18 - 80 3 / 24 03330 110 10:00:00 NaN NaN NaN NaN NaN NaN
12 NaN NaN NaN NaN NaN NaN Teacher: John A Instructor NaN NaN NaN NaN
The goal here was to have the last five columns of the "Math" row to contain the columns starting with "Teacher:", and similarly for the "Art" row, leaving a dataframe with two rows instead of four.