I have a data pipeline that I'm writing in Python with Pandas. It works, and I mostly like it. I'm pretty good with Python, less so with Pandas at this point. As a student of Python, though, I'm always looking to be more idiomatic and not fall back on the loops and logic I'd use in other languages. In particular, I'm always amazed at how series and frames can be iterated over with Pandas methods rather than by building loops around them.
Here is the source file. I also have demographic information that comes from an API call; that code has been truncated from my example, but assume that data['record'] holds my demographic data and that I can merge it successfully using pd.merge().
+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|StudentFirstName|StudentMiddleName|StudentLastName|UniqueIdentifier|Grade|GOM |SchoolYear|Fall_September|Fall_SeptemberDateGiven|Fall_SeptemberNationalPercentileRank|Winter_January|Winter_JanuaryDateGiven|Winter_JanuaryNationalPercentileRank|Spring_May|Spring_MayDateGiven|Spring_MayNationalPercentileRank|
+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| | | |100 |1 |LNF |2017 |29 |9/11/2017 |14 | | | | | | |
| | | |100 |1 |LSF |2017 |31 |9/11/2017 |51 | | | | | | |
| | | |100 |1 |M-COMP|2017 |8 |9/20/2017 |48 |15 |2/5/2018 |17 |42 |5/8/2018 |65 |
| | | |100 |1 |MNM |2017 |9 |9/11/2017 |36 |7 |2/1/2018 |3 |5 |5/8/2018 |1 |
| | | |100 |1 |NIM |2017 |35 |9/11/2017 |34 |62 |2/1/2018 |52 |51 |5/8/2018 |18 |
| | | |100 |1 |NWF |2017 |28 |9/11/2017 |37 |69 |2/1/2018 |71 |31 |5/8/2018 |5 |
| | | |100 |1 |OCM |2017 |62 |9/11/2017 |30 |89 |2/1/2018 |58 |94 |5/8/2018 |51 |
| | | |100 |1 |PSF |2017 |14 |9/11/2017 |10 | | | |49 |5/8/2018 |33 |
| | | |100 |1 |QDM |2017 |23 |9/11/2017 |57 |31 |2/1/2018 |46 |26 |5/8/2018 |15 |
| | | |100 |1 |R-CBM |2017 |8 |9/11/2017 |36 |17 |2/1/2018 |22 |29 |5/8/2018 |15 |
| | | |200 |1 |LNF |2017 |47 |9/11/2017 |51 | | | | | | |
| | | |200 |1 |LSF |2017 |47 |9/11/2017 |86 | | | | | | |
| | | |200 |1 |M-COMP|2017 |27 |9/22/2017 |92 |34 |2/2/2018 |67 |47 |5/8/2018 |88 |
| | | |200 |1 |MNM |2017 |11 |9/11/2017 |48 |23 |2/1/2018 |80 |21 |5/9/2018 |55 |
| | | |200 |1 |NIM |2017 |56 |9/11/2017 |81 |80 |2/1/2018 |95 |80 |5/9/2018 |92 |
| | | |200 |1 |NWF |2017 |63 |9/11/2017 |87 | | | | | | |
| | | |200 |1 |OCM |2017 |107 |9/11/2017 | |109 |2/1/2018 | |109 |5/9/2018 | |
| | | |200 |1 |PSF |2017 |50 |9/11/2017 |73 | | | | | | |
| | | |200 |1 |QDM |2017 |28 |9/11/2017 |75 |38 |2/1/2018 |78 |40 |5/9/2018 |84 |
| | | |200 |1 |R-CBM |2017 |40 |9/11/2017 |80 |76 |2/1/2018 |80 |84 |5/9/2018 |65 |
+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
Because of where this is going, I need that split into two files: one for Reading data and one for Mathematics data. Which file a row goes into is determined by the GOM column; about half of the GOM values go into Reading and the other half into Math.
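Conceptually the split is just membership in two GOM lists; a tiny sketch of what I mean (the READING_GOMS list and the toy frame here are just placeholders for my real data):

```python
import pandas as pd

READING_GOMS = ['LNF', 'LSF', 'PSF', 'NWF', 'R-CBM', 'MAZE']

scores = pd.DataFrame({'GOM': ['LNF', 'M-COMP', 'R-CBM', 'NIM'],
                       'Fall_September': [29, 8, 8, 35]})

# boolean mask: True for Reading rows, False for Math rows
is_reading = scores['GOM'].isin(READING_GOMS)
reading = scores[is_reading]
math = scores[~is_reading]
```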
Here is what the two output files look like. Reading:
+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|student_number|last_name|first_name|Grade|Season|Date |LNF 1|LNF 2|LNF 3 |LSF 1|LSF 2|LSF 3 |PSF 1|PSF 2|PSF 3 |NWF 1|NWF 2|NWF 3 |R-CBM 1|R-CBM 2|R-CBM 3|MAZE 1|MAZE 2|MAZE 3|
+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|100 | | |1 |Fall |9/11/2017 |29 |14 |Level 2 |31 |51 |Level 3 |14 |10 |Level 1 |28 |37 |Level 3|8 |36 |Level 3| | | |
|100 | | |1 |Spring|5/8/2018 | | | | | | |49 |33 |Level 3 |31 |5 |Level 1|29 |15 |Level 2| | | |
|100 | | |1 |Winter|2/1/2018 | | | | | | | | | |69 |71 |Level 3|17 |22 |Level 2| | | |
|200 | | |1 |Fall |9/11/2017 |47 |51 |Level 3 |47 |86 |Level 4 |50 |73 |Level 3 |63 |87 |Level 4|40 |80 |Level 4| | | |
|200 | | |1 |Spring|5/9/2018 | | | | | | | | | | | | |84 |65 |Level 3| | | |
|200 | | |1 |Winter|2/1/2018 | | | | | | | | | | | | |76 |80 |Level 4| | | |
+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
and Math:
+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|student_number|last_name|first_name|Grade|Season|Date |OCM 1|OCM 2|OCM 3 |NIM 1|NIM 2|NIM 3 |QDM 1|QDM 2|QDM 3 |MNM 1|MNM 2|MNM 3 |M-COMP 1|M-COMP 2|M-COMP 3|M-CAP 1|M-CAP 2|M-CAP 3|
+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|100 | | |1 |Fall |9/11/2017|62 |30 |Level 3|35 |34 |Level 3|23 |57 |Level 3|9 |36 |Level 3|8 |48 |Level 3 | | | |
|100 | | |1 |Spring|5/8/2018 |94 |51 |Level 3|51 |18 |Level 2|26 |15 |Level 2|5 |1 |Level 1|42 |65 |Level 3 | | | |
|100 | | |1 |Winter|2/1/2018 |89 |58 |Level 3|62 |52 |Level 3|31 |46 |Level 3|7 |3 |Level 1|15 |17 |Level 2 | | | |
|200 | | |1 |Fall |9/11/2017|107 | | |56 |81 |Level 4|28 |75 |Level 3|11 |48 |Level 3|27 |92 |Level 5 | | | |
|200 | | |1 |Spring|5/8/2018 |109 | | |80 |92 |Level 5|40 |84 |Level 4|21 |55 |Level 3|47 |88 |Level 4 | | | |
|200 | | |1 |Winter|2/1/2018 |109 | | |80 |95 |Level 5|38 |78 |Level 4|23 |80 |Level 4|34 |67 |Level 3 | | | |
+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
The trickiest thing was finding the right date for each subject. The kids take each GOM on roughly the same day, but the file that comes out gives a different date for each GOM. When the data go into our other system, it's one date per subject, which is composed of several GOMs. I resorted to writing a function to iterate through them, but this can probably be done more easily.
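What I'm really after is the earliest non-missing date per row across the per-GOM date columns. A sketch of that idea with made-up columns (the column names and blank-means-not-given convention are placeholders, not my real file):

```python
import pandas as pd

# hypothetical frame with one date column per GOM; '' means the GOM wasn't given
df = pd.DataFrame({
    'LNF Date': ['9/11/2017', ''],
    'LSF Date': ['', '9/12/2017'],
})
date_cols = ['LNF Date', 'LSF Date']

# parse each column (blanks become NaT), then take the row-wise minimum,
# which skips missing values by default
parsed = df[date_cols].apply(pd.to_datetime, errors='coerce')
df['Date'] = parsed.min(axis=1).dt.strftime('%m/%d/%Y')
```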
Usually, I will be running this script for just one season at a time and commenting out the others, but I'm importing last year's data now and am doing all three seasons at once.
A quick explanation of the output columns: for each GOM, 1 is the numeric score, 2 is the percentile, and 3 is an alpha description of the performance level. It's a horrid way of organizing that file, but it's what the vendor needs.
I was hoping to achieve this in fewer steps, or at least in a way that is more idiomatic. I found myself undoing some of what pivot_table produced just so I could work with the df again in the ways I know how, which are still pretty limited.
My biggest concerns are lines 65-77, where I melt and pivot_table; I'm not sure whether that could be done more efficiently. Also, lines 108-111 are really stupid from a Python standpoint: I just didn't want to list out the GOMs a second time, and I hoped to produce the list LNF 1, LNF 2, LNF 3, LSF 1, ... quickly without retyping. There has to be a better way of doing that.
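For what it's worth, the shape I'm after is just the cross product of the GOM names and 1-3; I suspect a comprehension like this would do it (shown here with the Reading GOMs):

```python
goms = ['LNF', 'LSF', 'PSF', 'NWF', 'R-CBM', 'MAZE']

# one heading per (GOM, suffix) pair, in GOM-major order
headings = [f'{gom} {i}' for gom in goms for i in (1, 2, 3)]
# first few entries: 'LNF 1', 'LNF 2', 'LNF 3', 'LSF 1', ...
```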
Please let me know if you have any suggestions for how this process could be improved, either the straight Python or (especially) the Pandas.
import pandas as pd
import os
import json
import glob
import re
import datetime
import numpy as np
import itertools
# build of list of seasons to process. usually only one.
seasons = []
seasons.append('Fall')
seasons.append('Winter')
seasons.append('Spring')
# Read demographic data into dataframe
# data['record'] comes from something that was truncated for simplicity
demographics = pd.DataFrame(data['record'], columns=['student_number', 'state_studentnumber', 'last_name', 'first_name', 'dob'])
# iterate through score files
for file in glob.glob('To Do/*.csv'):
    # read score data into dataframe
    aimsweb = pd.read_csv(file)
    aimsweb["Fall_SeptemberDateGiven"] = pd.to_datetime(aimsweb['Fall_SeptemberDateGiven']).dt.strftime('%m/%d/%Y')
    aimsweb["Winter_JanuaryDateGiven"] = pd.to_datetime(aimsweb['Winter_JanuaryDateGiven']).dt.strftime('%m/%d/%Y')
    aimsweb["Spring_MayDateGiven"] = pd.to_datetime(aimsweb['Spring_MayDateGiven']).dt.strftime('%m/%d/%Y')
    # perform left outer join of score data to demographics. this ensures these are real students.
    merged = aimsweb.merge(demographics, how='left', left_on=['UniqueIdentifier', 'StudentLastName', 'StudentFirstName'], right_on=['student_number', 'last_name', 'first_name'], indicator=True)
    renames = {}
    # rename table
    renames['Fall_September'] = 'Fall_1'
    renames['Fall_SeptemberDateGiven'] = 'Fall_Date'
    renames['Fall_SeptemberNationalPercentileRank'] = 'Fall_2'
    renames['Winter_January'] = 'Winter_1'
    renames['Winter_JanuaryDateGiven'] = 'Winter_Date'
    renames['Winter_JanuaryNationalPercentileRank'] = 'Winter_2'
    renames['Spring_May'] = 'Spring_1'
    renames['Spring_MayDateGiven'] = 'Spring_Date'
    renames['Spring_MayNationalPercentileRank'] = 'Spring_2'
    merged.rename(index=str, columns=renames, inplace=True)
    # function for converting percentile into level 1-5
    def percentile2level(percentile):
        if percentile >= 91:
            return 'Level 5'
        if percentile >= 76:
            return 'Level 4'
        if percentile >= 26:
            return 'Level 3'
        if percentile >= 11:
            return 'Level 2'
        if percentile >= 0:
            return 'Level 1'
    # convert percentile to level
    for season in ['Fall', 'Winter', 'Spring']:
        merged[season + '_3'] = merged[season + '_2'].apply(percentile2level)
    # melt data from columns into rows
    merged = merged.melt(id_vars=['student_number', 'GOM', 'last_name', 'first_name', 'Grade'], value_vars=['Fall_Date', 'Fall_1', 'Fall_2', 'Fall_3', 'Winter_Date', 'Winter_1', 'Winter_2', 'Winter_3', 'Spring_Date', 'Spring_1', 'Spring_2', 'Spring_3'])
    # split the variable into season and attribute
    merged['Season'], merged['Attribute'] = merged['variable'].str.split('_', 1).str
    # pivot data back to columns
    merged = pd.pivot_table(merged, index=['student_number', 'Season', 'last_name', 'first_name', 'Grade'], values=['value'], columns=['GOM', 'Attribute'], aggfunc=np.max, fill_value='')
    # condense the levels and reset index to get back to flat df
    merged.columns = merged.columns.droplevel(0)
    merged.columns = [' '.join(col).strip() for col in merged.columns.values]
    merged.reset_index(inplace=True)
    # build subject-GOM mappings
    subjects = {}
    subjects['Reading'] = ['LNF', 'LSF', 'PSF', 'NWF', 'R-CBM', 'MAZE']
    subjects['Mathematics'] = ['OCM', 'NIM', 'QDM', 'MNM', 'M-COMP', 'M-CAP']
    # function for finding the min non-null date from a list
    def mindate(row, headings):
        dates = list(row[headings])
        dates = list(filter(None, dates))
        if len(dates) > 0:
            return min(dates)
        return ""
    # iterate through two subjects
    for subject in subjects:
        # build list of date headers for all GOMs
        headings = []
        for gom in subjects[subject]:
            headings.append(gom + " Date")
        # create df for this subject
        df = merged
        # create date column with value of minimum from list of dates
        df['Date'] = df.apply(mindate, axis=1, args=(headings,))
        # filter out records with no real date
        df = df[df['Date'] != 'NaT']
        # build a list of GOM headers with 1, 2, 3
        headings = []
        for combo in itertools.product(subjects[subject], [1, 2, 3]):
            headings.append(combo[0] + " " + str(combo[1]))
        # save csv
        df[df['Season'].isin(seasons)][['student_number', 'last_name', 'first_name', 'Grade', 'Season', 'Date'] + headings].to_csv("./" + os.path.splitext(os.path.basename(file))[0] + "_" + "".join(seasons) + "_" + subject + ".csv", index=False)
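One idea I've wondered about but haven't wired into the script above: the cutoffs in percentile2level look like a job for pd.cut. A standalone sketch with the same boundaries, in case that's the more idiomatic route:

```python
import pandas as pd

percentiles = pd.Series([5, 14, 30, 80, 95])
# same cutoffs as percentile2level: 0-10, 11-25, 26-75, 76-90, 91+
levels = pd.cut(percentiles,
                bins=[-1, 10, 25, 75, 90, 100],
                labels=['Level 1', 'Level 2', 'Level 3', 'Level 4', 'Level 5'])
```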