3

I am using Pandas module in python 3.5 to read crosstabs recursively from sub-directories and I want to concatenate the crosstabs inside the for loop after I call pd.crosstab() and after the for loop write the output to an excel file. I tried copying table1 into table3 (see code below) after I call pd.crosstab(), but if some values are not present in the latter data files then table3 shows NaN for those entries. I looked at pd.concat, but can't find an example of how to use it in a for loop.

Data files look like (there are 100s of files and with many columns, but here showing only the columns I am interested in):

    First Data File
    StudentID    Grade      
    3            A
    2            B
    1            A

    Second Data File
    StudentID   Grade
    1            B
    2            A
    3            A

    Third Data File
    StudentID   Grade
    2            C
    1            B
    3            A

    and so on ....
    At the end the output should be like:

    Grade       A   B   C
    StudentID
    1           1   2   0
    2           1   1   1
    3           3   0   0   

My python program looks like (removing the imports from the top of the file)

.....

fields = ['StudentID', 'Grade']
path= 'C:/script_testing/'
i=0

for filename in glob.glob('C:/script_testing/**/*.txt', recursive=True):
    temp = pd.read_csv(filename, sep=',', usecols=fields)
    table1 = pd.crosstab(temp.StudentID, temp.Grade)
    # Note the if condition is executed only once to initlialize table3
    if(i==0):
        table3 = table1
        i = i + 1
    table3 = table3 + table1

writer = pd.ExcelWriter('Report.xlsx', engine='xlsxwriter')
table3.to_excel(writer, sheet_name='StudentID_vs_Grade')
writer.save()
piRSquared
  • 285,575
  • 57
  • 475
  • 624
Sikander Waheed
  • 85
  • 1
  • 1
  • 4

1 Answers1

1
pd.concat([df1, df2, df3]).pipe(lambda d: pd.crosstab(d.StudentID, d.Grade))

Grade      A  B  C
StudentID         
1          1  2  0
2          1  1  1
3          3  0  0

My attempt to translate your code

fields = ['StudentID', 'Grade']
path= 'C:/script_testing/'
i=0

parse = lambda f: pd.read_csv(f, usecols=fields)
table3 = pd.concat(
    [parse(f) for f in glob.glob('C:/script_testing/**/*.txt', recursive=True)]
).pipe(lambda d: pd.crosstab(d.StudentID, d.Grade))

writer = pd.ExcelWriter('Report.xlsx', engine='xlsxwriter')
table3.to_excel(writer, sheet_name='StudentID_vs_Grade')
writer.save()
piRSquared
  • 285,575
  • 57
  • 475
  • 624
  • Thanks a lot! It worked. One follow up question please. Is there a way to exclude some files when reading from the glob.gob()? For example, I want to read all the files with filenames Data*,txt but exclude those files that have Data*Old.txt? – Sikander Waheed Sep 21 '17 at 21:29
  • You're welcome. That's a `glob` question. or you could do something like `[parse(f) for f in glob.glob('C:/script_testing/**/Data*.txt', recursive=True) if not 'DataOld' in f]` – piRSquared Sep 21 '17 at 21:32