
I have written a program that reads a couple of .csv files (they are not large, a few thousand rows each). I do some data cleaning and wrangling, and this is what the final structure of each .csv file looks like (fake data for illustration purposes only).

import pandas as pd
data = [[112233, 'Rob', 99], [445566, 'John', 88]]
managers = pd.DataFrame(data)
managers.columns = ['ManagerId', 'ManagerName', 'ShopId']
print(managers)

   ManagerId ManagerName  ShopId
0     112233         Rob      99
1     445566        John      88


data = [[99, 'Shop1'], [88, 'Shop2']]
shops = pd.DataFrame(data)
shops.columns = ['ShopId', 'ShopName']
print(shops)

   ShopId ShopName
0      99    Shop1
1      88    Shop2

data = [[99, 2000, 3000, 4000], [88, 2500, 3500, 4500]]
sales = pd.DataFrame(data)
sales.columns = ['ShopId', 'Year2010', 'Year2011', 'Year2012']
print(sales)

   ShopId  Year2010  Year2011  Year2012
0      99      2000      3000      4000
1      88      2500      3500      4500

Then I use the xlsxwriter and reportlab Python packages to create custom Excel sheets and .pdf reports while iterating over the data frames. Everything looks great, and all of the named packages do their job really well.

My concern, though, is that my code is getting hard to maintain, because I need to access the same data frame rows multiple times across multiple calls.

Say I need to get the names of managers responsible for shops that had sales of more than 1500 in 2010. My code is filled with calls like this:

managers[managers['ShopId'].isin(
    sales[sales['Year2010'] > 1500]['ShopId'])]['ManagerName'].values
>>> array(['Rob', 'John'], dtype=object)

I think it is hard to see what is going on when reading this line of code. I could create multiple intermediate variables, but that would add multiple lines of code.
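
For example, broken up into intermediate steps it might look like this (the variable names are made up for illustration; the logic is the same as the one-liner above):

shops_over_1500 = sales[sales['Year2010'] > 1500]                 # shops that beat the threshold
target_shop_ids = shops_over_1500['ShopId']                       # their ids
responsible = managers[managers['ShopId'].isin(target_shop_ids)]  # managers of those shops
manager_names = responsible['ManagerName'].values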

How common is it to sacrifice database normalization ideology and merge all the pieces into a single data frame to get more maintainable code? There are obviously cons to having a single data frame, as it might get messy when other data frames need to be merged in later on. Merging of course also leads to data redundancy, since the same manager can be assigned to multiple shops.

df = (managers.merge(sales, how='left', on='ShopId')
              .merge(shops, how='left', on='ShopId'))
print(df)

   ManagerId ManagerName  ShopId  Year2010  Year2011  Year2012 ShopName
0     112233         Rob      99      2000      3000      4000    Shop1
1     445566        John      88      2500      3500      4500    Shop2

At least this call gets shorter:

df[df['Year2010'] > 1500]['ManagerName'].values
>>> array(['Rob', 'John'], dtype=object)

Maybe pandas is the wrong tool for this kind of job?

The C# developers at the office frown at me and tell me to use classes, but then I would end up with a bunch of methods like get_manager_sales(managerid) and so forth. Iterating over class instances for reporting also sounds troublesome, as I would need to implement sorting and indexing myself (which I get for free with pandas).
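
A rough sketch of where that class-based approach would lead (the class names and data layout are made up, only get_manager_sales comes from the discussion above):

class Manager(object):
    def __init__(self, manager_id, name, shop_id):
        self.manager_id = manager_id
        self.name = name
        self.shop_id = shop_id


class ReportData(object):
    def __init__(self, managers, sales_by_shop):
        self.managers = managers              # list of Manager instances
        self.sales_by_shop = sales_by_shop    # {shop_id: {'Year2010': 2000, ...}}

    def get_manager_sales(self, manager_id):
        # manual lookup instead of a pandas merge/filter
        manager = next(m for m in self.managers if m.manager_id == manager_id)
        return self.sales_by_shop[manager.shop_id]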

A dictionary would work, but it also makes it difficult to modify existing data, do merges, etc. The syntax doesn't get much better either.

data_dict = df.to_dict('records')
[{'ManagerId': 112233,
  'ManagerName': 'Rob',
  'ShopId': 99,
  'ShopName': 'Shop1',
  'Year2010': 2000,
  'Year2011': 3000,
  'Year2012': 4000},
 {'ManagerId': 445566,
  'ManagerName': 'John',
  'ShopId': 88,
  'ShopName': 'Shop2',
  'Year2010': 2500,
  'Year2011': 3500,
  'Year2012': 4500}]

Getting the names of managers responsible for shops that had sales of more than 1500 in 2010:

[row['ManagerName'] for row in data_dict if row['Year2010'] > 1500]
>>> ['Rob', 'John']

In this particular case, with the data I operate on, should I go all the way with pandas, or is there another way to write cleaner code while still taking advantage of the power of pandas?


2 Answers


I would choose Pandas, because it's much faster, it has an excellent and extremely rich API, the source code looks much cleaner, etc.

BTW the following line can be easily rewritten:

managers[managers['ShopId'].isin(sales[sales['Year2010'] > 1500]['ShopId'])]['ManagerName'].values

as:

ShopIds = sales.loc[sales['Year2010'] > 1500, 'ShopId']
managers.query('ShopId in @ShopIds')['ManagerName'].values

IMO it's pretty easy to read and understand.

PS: you may also want to store your data in a SQL database and use SQL, or store it in an HDF store and use the where parameter - in both cases you can benefit from indexing the "search" columns.
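
For the HDF route, a minimal sketch using the merged df from the question (assumes PyTables is installed; the file name and key are made up):

df.to_hdf('reports.h5', key='flat', format='table', data_columns=['Year2010'])  # queryable column
names = pd.read_hdf('reports.h5', key='flat', where='Year2010 > 1500')['ManagerName'].values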

MaxU - stand with Ukraine
  • Cool, thanks a ton for the answer, reassured that choosing pandas for writing the report program was the right thing to do. Would you also consider merging all the csv files' dataframes into a single one to avoid cross-referencing multiple frames all the time? – Alex Tereshenkov Oct 14 '16 at 11:48
  • @AlexTereshenkov, sure you can try to denormalize your tables and put everything in one flat DF, but be aware of [possible pitfalls](http://stackoverflow.com/questions/40002355/pandas-left-join-why-more-results/40002535#40002535) – MaxU - stand with Ukraine Oct 14 '16 at 15:28

Creating classes that operate on dataframes is not a good idea, because it'll hide away the fact that you're using a data frame, and open the way to very bad decisions (like iterating over a dataframe with a for loop).

Solution 1: Denormalize the data. You don't have to keep your data in normal form. Normal form is preferable when you have to keep your entries consistent throughout the database. This is not a database; you don't do constant inserts, updates, and deletes. So just denormalize it and work with one large dataframe, as it's clearly more convenient and better suits your needs.

Solution 2: Use a database. You can dump your data into a SQLite database (pandas has a built-in function for that) and execute all kinds of crazy queries on it. In my personal opinion, SQL queries are much more readable than the stuff you posted. If you do this kind of analysis regularly and the data structure remains the same, this may be a preferable solution. You can dump the data into a db and then use SQLAlchemy to work with it.
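
A minimal sketch of that route, using pandas' built-in to_sql with the merged df from the question (the file and table names are made up):

import sqlite3
import pandas as pd

conn = sqlite3.connect('reports.db')
df.to_sql('report', conn, index=False, if_exists='replace')   # dump the flat frame

query = "SELECT ManagerName FROM report WHERE Year2010 > 1500"
print(pd.read_sql_query(query, conn)['ManagerName'].tolist())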

Solution 3: Create your own dataframe. You can inherit from pandas.DataFrame and add custom methods to it. You need to dig into the guts of pandas for that, though, to see how to implement those methods. This way you can create, for example, custom methods for accessing certain parts of a dataframe.
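
A minimal sketch of the subclassing idea (the class and method names are made up; _constructor is what keeps pandas operations returning the subclass):

import pandas as pd

class ReportFrame(pd.DataFrame):

    @property
    def _constructor(self):
        # make slicing/filtering return ReportFrame instead of a plain DataFrame
        return ReportFrame

    def managers_with_sales_over(self, threshold, year='Year2010'):
        # assumes the merged, flat layout from the question
        return self.loc[self[year] > threshold, 'ManagerName'].values

# usage: ReportFrame(df).managers_with_sales_over(1500)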

Unless you know pandas really well, I'd go for solutions 1 or 2. If you need more flexibility, and the data manipulation is different every time, use 1. If you need to execute roughly the same analysis every time, use 2 (especially if your data analysis code is a part of a bigger application).

Also, I don't understand why "adding more lines of code" is bad. By breaking up a huge one-liner into many expressions, you don't increase the actual complexity, and you decrease the perceived complexity. Maybe all you need to do is just refactor your code, and pack some operations into reusable functions?
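
For instance, something along these lines (the function name is made up):

def managers_for_shops_over(managers, sales, threshold, year='Year2010'):
    # names of managers whose shops sold more than `threshold` in `year`
    shop_ids = sales.loc[sales[year] > threshold, 'ShopId']
    return managers.loc[managers['ShopId'].isin(shop_ids), 'ManagerName'].values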

t_tia
  • Thanks for the feedback, very helpful. `1.` Speaking of classes, I meant classes that are not based on pandas but ones you can construct from csv files (like csv dictreader). `2.` `pack some operations into reusable functions?` is something I have also thought of; maybe adding some functions like `get_manager_shops()` and `get_shops_totalsale()` and putting them into a separate module could make my life easier (but inside them, there will still be calls to the data frames, right?) – Alex Tereshenkov Oct 14 '16 at 11:53
  • yes, you'll just hide those calls away. There's also a problem that many operations in `pandas` often return a view on a dataframe. This won't be obvious when calling a function `get_shops_totalsale(df)`. So, you'll have to keep this kind of stuff in mind, which will make your code hard to understand. If you ever lose track of this, you can get "magic" unexpected side-effects. – t_tia Oct 14 '16 at 12:06
  • Could you please elaborate on what you mean by "denormalize" the data? – KubiK888 Dec 05 '18 at 11:23