I have a data pipeline that I'm writing in Python with Pandas. It works, and I mostly like it. I'm pretty good with Python, less so with Pandas at this point. As a student of Python, though, I'm always looking to be more idiomatic and not fall back on the loops and logic I'd use in other languages. In particular, I'm always amazed at how series and frames can be iterated over with Pandas methods rather than by building loops around them.
Here is the source file. I also have demographic information that comes from an API call; that code has been truncated from my example, but assume that data['record'] holds my demographic data and that I can merge it successfully using pd.merge().
+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|StudentFirstName|StudentMiddleName|StudentLastName|UniqueIdentifier|Grade|GOM |SchoolYear|Fall_September|Fall_SeptemberDateGiven|Fall_SeptemberNationalPercentileRank|Winter_January|Winter_JanuaryDateGiven|Winter_JanuaryNationalPercentileRank|Spring_May|Spring_MayDateGiven|Spring_MayNationalPercentileRank|
+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| | | |100 |1 |LNF |2017 |29 |9/11/2017 |14 | | | | | | |
| | | |100 |1 |LSF |2017 |31 |9/11/2017 |51 | | | | | | |
| | | |100 |1 |M-COMP|2017 |8 |9/20/2017 |48 |15 |2/5/2018 |17 |42 |5/8/2018 |65 |
| | | |100 |1 |MNM |2017 |9 |9/11/2017 |36 |7 |2/1/2018 |3 |5 |5/8/2018 |1 |
| | | |100 |1 |NIM |2017 |35 |9/11/2017 |34 |62 |2/1/2018 |52 |51 |5/8/2018 |18 |
| | | |100 |1 |NWF |2017 |28 |9/11/2017 |37 |69 |2/1/2018 |71 |31 |5/8/2018 |5 |
| | | |100 |1 |OCM |2017 |62 |9/11/2017 |30 |89 |2/1/2018 |58 |94 |5/8/2018 |51 |
| | | |100 |1 |PSF |2017 |14 |9/11/2017 |10 | | | |49 |5/8/2018 |33 |
| | | |100 |1 |QDM |2017 |23 |9/11/2017 |57 |31 |2/1/2018 |46 |26 |5/8/2018 |15 |
| | | |100 |1 |R-CBM |2017 |8 |9/11/2017 |36 |17 |2/1/2018 |22 |29 |5/8/2018 |15 |
| | | |200 |1 |LNF |2017 |47 |9/11/2017 |51 | | | | | | |
| | | |200 |1 |LSF |2017 |47 |9/11/2017 |86 | | | | | | |
| | | |200 |1 |M-COMP|2017 |27 |9/22/2017 |92 |34 |2/2/2018 |67 |47 |5/8/2018 |88 |
| | | |200 |1 |MNM |2017 |11 |9/11/2017 |48 |23 |2/1/2018 |80 |21 |5/9/2018 |55 |
| | | |200 |1 |NIM |2017 |56 |9/11/2017 |81 |80 |2/1/2018 |95 |80 |5/9/2018 |92 |
| | | |200 |1 |NWF |2017 |63 |9/11/2017 |87 | | | | | | |
| | | |200 |1 |OCM |2017 |107 |9/11/2017 | |109 |2/1/2018 | |109 |5/9/2018 | |
| | | |200 |1 |PSF |2017 |50 |9/11/2017 |73 | | | | | | |
| | | |200 |1 |QDM |2017 |28 |9/11/2017 |75 |38 |2/1/2018 |78 |40 |5/9/2018 |84 |
| | | |200 |1 |R-CBM |2017 |40 |9/11/2017 |80 |76 |2/1/2018 |80 |84 |5/9/2018 |65 |
+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
Because of where this is going, I need that split into two files: one for Reading data and one for Mathematics data. Which file a row goes into is determined by the GOM column; about half of the GOM values go into Reading and the other half into Math.
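Conceptually the split is just membership in two GOM lists; a tiny sketch of what I mean (the READING_GOMS list and the toy frame here are just placeholders for my real data):

```python
import pandas as pd

READING_GOMS = ['LNF', 'LSF', 'PSF', 'NWF', 'R-CBM', 'MAZE']

scores = pd.DataFrame({'GOM': ['LNF', 'M-COMP', 'R-CBM', 'NIM'],
                       'Fall_September': [29, 8, 8, 35]})

# boolean mask: True for Reading rows, False for Math rows
is_reading = scores['GOM'].isin(READING_GOMS)
reading = scores[is_reading]
math = scores[~is_reading]
```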
Here is what the two output files look like. Reading:
+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|student_number|last_name|first_name|Grade|Season|Date |LNF 1|LNF 2|LNF 3 |LSF 1|LSF 2|LSF 3 |PSF 1|PSF 2|PSF 3 |NWF 1|NWF 2|NWF 3 |R-CBM 1|R-CBM 2|R-CBM 3|MAZE 1|MAZE 2|MAZE 3|
+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|100 | | |1 |Fall |9/11/2017 |29 |14 |Level 2 |31 |51 |Level 3 |14 |10 |Level 1 |28 |37 |Level 3|8 |36 |Level 3| | | |
|100 | | |1 |Spring|5/8/2018 | | | | | | |49 |33 |Level 3 |31 |5 |Level 1|29 |15 |Level 2| | | |
|100 | | |1 |Winter|2/1/2018 | | | | | | | | | |69 |71 |Level 3|17 |22 |Level 2| | | |
|200 | | |1 |Fall |9/11/2017 |47 |51 |Level 3 |47 |86 |Level 4 |50 |73 |Level 3 |63 |87 |Level 4|40 |80 |Level 4| | | |
|200 | | |1 |Spring|5/9/2018 | | | | | | | | | | | | |84 |65 |Level 3| | | |
|200 | | |1 |Winter|2/1/2018 | | | | | | | | | | | | |76 |80 |Level 4| | | |
+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
and Math:
+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|student_number|last_name|first_name|Grade|Season|Date |OCM 1|OCM 2|OCM 3 |NIM 1|NIM 2|NIM 3 |QDM 1|QDM 2|QDM 3 |MNM 1|MNM 2|MNM 3 |M-COMP 1|M-COMP 2|M-COMP 3|M-CAP 1|M-CAP 2|M-CAP 3|
+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|100 | | |1 |Fall |9/11/2017|62 |30 |Level 3|35 |34 |Level 3|23 |57 |Level 3|9 |36 |Level 3|8 |48 |Level 3 | | | |
|100 | | |1 |Spring|5/8/2018 |94 |51 |Level 3|51 |18 |Level 2|26 |15 |Level 2|5 |1 |Level 1|42 |65 |Level 3 | | | |
|100 | | |1 |Winter|2/1/2018 |89 |58 |Level 3|62 |52 |Level 3|31 |46 |Level 3|7 |3 |Level 1|15 |17 |Level 2 | | | |
|200 | | |1 |Fall |9/11/2017|107 | | |56 |81 |Level 4|28 |75 |Level 3|11 |48 |Level 3|27 |92 |Level 5 | | | |
|200 | | |1 |Spring|5/8/2018 |109 | | |80 |92 |Level 5|40 |84 |Level 4|21 |55 |Level 3|47 |88 |Level 4 | | | |
|200 | | |1 |Winter|2/1/2018 |109 | | |80 |95 |Level 5|38 |78 |Level 4|23 |80 |Level 4|34 |67 |Level 3 | | | |
+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
The trickiest thing was finding the right date for each subject. The kids take each GOM on roughly the same day, but the file that comes out gives a different date for each GOM. When the data go into our other system, it's one date per subject, which is composed of several GOMs. I resorted to writing a function to iterate through them, but this can probably be done more easily.
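What I'm really after is the earliest non-missing date per row across the per-GOM date columns. A sketch of that idea with made-up columns (the column names and blank-means-not-given convention are placeholders, not my real file):

```python
import pandas as pd

# hypothetical frame with one date column per GOM; '' means the GOM wasn't given
df = pd.DataFrame({
    'LNF Date': ['9/11/2017', ''],
    'LSF Date': ['', '9/12/2017'],
})
date_cols = ['LNF Date', 'LSF Date']

# parse each column (blanks become NaT), then take the row-wise minimum,
# which skips missing values by default
parsed = df[date_cols].apply(pd.to_datetime, errors='coerce')
df['Date'] = parsed.min(axis=1).dt.strftime('%m/%d/%Y')
```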
Usually, I will be running this script for just one season at a time and commenting out the others, but I'm importing last year's data now and am doing all three seasons at once.
A quick explanation of the output columns: for each GOM, 1 is the numeric score, 2 is the percentile, and 3 is an alpha description of the performance level. It's a horrid way of organizing that file, but it's what the vendor needs.
I was hoping to achieve this in fewer steps, or at least in a way that is more idiomatic. I found myself undoing some of what pivot_table produced just so I could work with the df again in the ways I know how, which are still pretty limited.
My biggest concerns are lines 65-77, where I melt and pivot_table; I'm not sure whether that could be done more efficiently. Also, lines 108-111 are really stupid from a Python standpoint: I just didn't want to list out the GOMs a second time, and I hoped to produce the list LNF 1, LNF 2, LNF 3, LSF 1, ... quickly without retyping. There has to be a better way of doing that.
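For what it's worth, the shape I'm after is just the cross product of the GOM names and 1-3; I suspect a comprehension like this would do it (shown here with the Reading GOMs):

```python
goms = ['LNF', 'LSF', 'PSF', 'NWF', 'R-CBM', 'MAZE']

# one heading per (GOM, suffix) pair, in GOM-major order
headings = [f'{gom} {i}' for gom in goms for i in (1, 2, 3)]
# first few entries: 'LNF 1', 'LNF 2', 'LNF 3', 'LSF 1', ...
```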
Please let me know if you have any suggestions for how this process could be improved, either the straight Python or (especially) the Pandas.
import pandas as pd
import os
import json
import glob
import re
import datetime
import numpy as np
import itertools
# build of list of seasons to process. usually only one.
seasons = []
seasons.append('Fall')
seasons.append('Winter')
seasons.append('Spring')
# Read demographic data into dataframe
# data['record'] comes from something that was truncated for simplicity
demographics = pd.DataFrame(data['record'], columns=['student_number', 'state_studentnumber', 'last_name', 'first_name', 'dob'])
# iterate through score files
for file in glob.glob('To Do/*.csv'):
    # read score data into dataframe
    aimsweb = pd.read_csv(file)
    aimsweb["Fall_SeptemberDateGiven"] = pd.to_datetime(aimsweb['Fall_SeptemberDateGiven']).dt.strftime('%m/%d/%Y')
    aimsweb["Winter_JanuaryDateGiven"] = pd.to_datetime(aimsweb['Winter_JanuaryDateGiven']).dt.strftime('%m/%d/%Y')
    aimsweb["Spring_MayDateGiven"] = pd.to_datetime(aimsweb['Spring_MayDateGiven']).dt.strftime('%m/%d/%Y')
    # perform left outer join of score data to demographics. this ensures these are real students.
    merged = aimsweb.merge(demographics, how='left', left_on=['UniqueIdentifier', 'StudentLastName', 'StudentFirstName'], right_on=['student_number', 'last_name', 'first_name'], indicator=True)
    renames = {}
    # rename table
    renames['Fall_September'] = 'Fall_1'
    renames['Fall_SeptemberDateGiven'] = 'Fall_Date'
    renames['Fall_SeptemberNationalPercentileRank'] = 'Fall_2'
    renames['Winter_January'] = 'Winter_1'
    renames['Winter_JanuaryDateGiven'] = 'Winter_Date'
    renames['Winter_JanuaryNationalPercentileRank'] = 'Winter_2'
    renames['Spring_May'] = 'Spring_1'
    renames['Spring_MayDateGiven'] = 'Spring_Date'
    renames['Spring_MayNationalPercentileRank'] = 'Spring_2'
    merged.rename(index=str, columns=renames, inplace=True)
    # function for converting percentile into level 1-5
    def percentile2level(percentile):
        if percentile >= 91:
            return 'Level 5'
        if percentile >= 76:
            return 'Level 4'
        if percentile >= 26:
            return 'Level 3'
        if percentile >= 11:
            return 'Level 2'
        if percentile >= 0:
            return 'Level 1'
    # convert percentile to level
    for season in ['Fall', 'Winter', 'Spring']:
        merged[season + '_3'] = merged[season + '_2'].apply(percentile2level)
    # melt data from columns into rows
    merged = merged.melt(id_vars=['student_number', 'GOM', 'last_name', 'first_name', 'Grade'], value_vars=['Fall_Date', 'Fall_1', 'Fall_2', 'Fall_3', 'Winter_Date', 'Winter_1', 'Winter_2', 'Winter_3', 'Spring_Date', 'Spring_1', 'Spring_2', 'Spring_3'])
    # split the variable into season and attribute
    merged['Season'], merged['Attribute'] = merged['variable'].str.split('_', 1).str
    # pivot data back to columns
    merged = pd.pivot_table(merged, index=['student_number', 'Season', 'last_name', 'first_name', 'Grade'], values=['value'], columns=['GOM', 'Attribute'], aggfunc=np.max, fill_value='')
    # condense the levels and reset index to get back to flat df
    merged.columns = merged.columns.droplevel(0)
    merged.columns = [' '.join(col).strip() for col in merged.columns.values]
    merged.reset_index(inplace=True)
    # build subject-GOM mappings
    subjects = {}
    subjects['Reading'] = ['LNF', 'LSF', 'PSF', 'NWF', 'R-CBM', 'MAZE']
    subjects['Mathematics'] = ['OCM', 'NIM', 'QDM', 'MNM', 'M-COMP', 'M-CAP']
    # function for finding the min non-null date from a list
    def mindate(row, headings):
        dates = list(row[headings])
        dates = list(filter(None, dates))
        if len(dates) > 0:
            return min(dates)
        return ""
    # iterate through two subjects
    for subject in subjects:
        # build list of date headers for all GOMs
        headings = []
        for gom in subjects[subject]:
            headings.append(gom + " Date")
        # create df for this subject
        df = merged
        # create date column with value of minimum from list of dates
        df['Date'] = df.apply(mindate, axis=1, args=(headings,))
        # filter out records with no real date
        df = df[df['Date'] != 'NaT']
        # build a list of GOM headers with 1, 2, 3
        headings = []
        for combo in itertools.product(subjects[subject], [1, 2, 3]):
            headings.append(combo[0] + " " + str(combo[1]))
        # save csv
        df[df['Season'].isin(seasons)][['student_number', 'last_name', 'first_name', 'Grade', 'Season', 'Date'] + headings].to_csv("./" + os.path.splitext(os.path.basename(file))[0] + "_" + "".join(seasons) + "_" + subject + ".csv", index=False)
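One idea I've wondered about but haven't wired into the script above: the cutoffs in percentile2level look like a job for pd.cut. A standalone sketch with the same boundaries, in case that's the more idiomatic route:

```python
import pandas as pd

percentiles = pd.Series([5, 14, 30, 80, 95])
# same cutoffs as percentile2level: 0-10, 11-25, 26-75, 76-90, 91+
levels = pd.cut(percentiles,
                bins=[-1, 10, 25, 75, 90, 100],
                labels=['Level 1', 'Level 2', 'Level 3', 'Level 4', 'Level 5'])
```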