I have written a program that reads a couple of .csv files (they are not large, a couple of thousand rows each), does some data cleaning and wrangling, and produces data frames with the following final structure (fake data for illustration purposes only).
import pandas as pd
data = [[112233, 'Rob', 99], [445566, 'John', 88]]
managers = pd.DataFrame(data)
managers.columns = ['ManagerId', 'ManagerName', 'ShopId']
print(managers)
ManagerId ManagerName ShopId
0 112233 Rob 99
1 445566 John 88
data = [[99, 'Shop1'], [88, 'Shop2']]
shops = pd.DataFrame(data)
shops.columns = ['ShopId', 'ShopName']
print(shops)
ShopId ShopName
0 99 Shop1
1 88 Shop2
data = [[99, 2000, 3000, 4000], [88, 2500, 3500, 4500]]
sales = pd.DataFrame(data)
sales.columns = ['ShopId', 'Year2010', 'Year2011', 'Year2012']
print(sales)
ShopId Year2010 Year2011 Year2012
0 99 2000 3000 4000
1 88 2500 3500 4500
Then I use the xlsxwriter and reportlab Python packages to create custom Excel sheets and .pdf reports while iterating over the data frames. Everything looks great, and all of the named packages do their job really well.
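To give an idea of the export step, here is a minimal sketch of the Excel side only (the file name, one-sheet-per-shop layout, and the use of pandas' to_excel with the xlsxwriter engine are just illustrative assumptions; the real reporting code does more formatting):
with pd.ExcelWriter('report.xlsx', engine='xlsxwriter') as writer:
    for shop_id, shop_sales in sales.groupby('ShopId'):
        # one sheet per shop; custom formatting is then applied via xlsxwriter
        shop_sales.to_excel(writer, sheet_name='Shop_{}'.format(shop_id), index=False)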
My concern, though, is that my code is getting hard to maintain, because I need to access the same data frame rows multiple times in multiple calls.
Say I need to get the names of the managers responsible for shops that had sales of more than 1500 in 2010. My code is full of calls like this:
managers[managers['ShopId'].isin(
    sales[sales['Year2010'] > 1500]['ShopId'])]['ManagerName'].values
>>> array(['Rob', 'John'], dtype=object)
I find it hard to see what is going on when reading this line of code. I could create multiple intermediate variables, but that would add multiple lines of code.
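For comparison, the same query split into intermediate variables (the names here are just my guesses) would look like this:
# same filter as above, broken into named steps
shops_over_1500 = sales[sales['Year2010'] > 1500]['ShopId']
responsible_managers = managers[managers['ShopId'].isin(shops_over_1500)]
manager_names = responsible_managers['ManagerName'].values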
How common is it to sacrifice database normalization ideology and merge all the pieces into a single data frame in order to get more maintainable code? There are obvious downsides to a single data frame: it might get messy when trying to merge in other data frames that may be needed later on, and merging of course leads to data redundancy, since the same manager can be assigned to multiple shops.
df = (managers.merge(sales, how='left', on='ShopId')
              .merge(shops, how='left', on='ShopId'))
print(df)
ManagerId ManagerName ShopId Year2010 Year2011 Year2012 ShopName
0 112233 Rob 99 2000 3000 4000 Shop1
1 445566 John 88 2500 3500 4500 Shop2
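To make the redundancy concrete: if, hypothetically, Rob were also responsible for a second shop (say ShopId 77), his ManagerId and ManagerName would be repeated on every one of his rows after the merge:
# hypothetical extra assignment for illustration only
extra_row = pd.DataFrame([[112233, 'Rob', 77]], columns=managers.columns)
managers_extended = pd.concat([managers, extra_row], ignore_index=True)
df_extended = managers_extended.merge(sales, how='left', on='ShopId')
# Rob's ManagerId and ManagerName now appear on two rows of df_extended
# (the Year columns for shop 77 are NaN because it has no sales data here)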
At least this call gets smaller:
df[df['Year2010'] > 1500]['ManagerName'].values
>>> array(['Rob', 'John'], dtype=object)
Maybe pandas is the wrong tool for this kind of job?
The C# developers at the office frown at me and tell me to use classes, but then I would end up with a bunch of methods like get_manager_sales(managerid) and so forth. Iterating over class instances for reporting also sounds troublesome, as I would need to implement some sorting and indexing myself (which I get for free with pandas).
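For what it's worth, the class-based version they have in mind would presumably look something like this sketch (class and method names are my own guesses, not an agreed design):
class Manager(object):
    def __init__(self, manager_id, name, shop_ids):
        self.manager_id = manager_id
        self.name = name
        self.shop_ids = shop_ids

    def get_manager_sales(self, sales_by_shop):
        # sales_by_shop: dict mapping ShopId -> {year: amount}
        return {shop_id: sales_by_shop[shop_id] for shop_id in self.shop_ids}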
A dictionary would work, but it also makes it difficult to modify existing data, do merges, and so on. The syntax doesn't get much better either.
data_dict = df.to_dict('records')
[{'ManagerId': 112233,
  'ManagerName': 'Rob',
  'ShopId': 99,
  'ShopName': 'Shop1',
  'Year2010': 2000,
  'Year2011': 3000,
  'Year2012': 4000},
 {'ManagerId': 445566,
  'ManagerName': 'John',
  'ShopId': 88,
  'ShopName': 'Shop2',
  'Year2010': 2500,
  'Year2011': 3500,
  'Year2012': 4500}]
Get the names of the managers responsible for shops that had sales of more than 1500 in 2010:
[row['ManagerName'] for row in data_dict if row['Year2010'] > 1500]
>>> ['Rob', 'John']
In this particular case, with the data I operate on, should I go all the way with pandas, or is there another way to write cleaner code while still taking advantage of the power of pandas?