
I have CSV data in the following format:

+-----------------+--------+-------------+
| reservation_num |  rate  | guest_name  |
+-----------------+--------+-------------+
| B874576         | 169.95 | Bob Smith   |
| H786234         | 258.95 | Jane Doe    |
| H786234         | 258.95 | John Doe    |
| F987354         | 385.95 | David Jones |
| N097897         | 449.95 | Mark Davis  |
| H567349         | 482.95 | Larry Stein |
| N097897         | 449.95 | Sue Miller  |
+-----------------+--------+-------------+

I would like to add a feature (column) to the DataFrame called 'rate_per_person'. It would be calculated by taking the rate for a particular reservation number and dividing it by the total number of guests who share that reservation number. For example, reservation H786234 has two guests, so each of its rows should get 258.95 / 2 = 129.475.

Here is my code:

# Importing Libraries
import pandas as pd

# Importing the Dataset
ds = pd.read_csv('hotels.csv')

for index, row in ds.iterrows():
    row['rate_per_person'] = row['rate'] / ds[row['reservation_num']].count

And the error message:

Traceback (most recent call last):

  File "<ipython-input-3-0668a3165e76>", line 2, in <module>
    row['rate_per_person'] = row['rate'] / ds[row['reservation_num']].count

  File "/Users/<user_name>/anaconda/lib/python3.6/site-packages/pandas/core/frame.py", line 2062, in __getitem__
    return self._getitem_column(key)

  File "/Users/<user_name>/anaconda/lib/python3.6/site-packages/pandas/core/frame.py", line 2069, in _getitem_column
    return self._get_item_cache(key)

  File "/Users/<user_name>/anaconda/lib/python3.6/site-packages/pandas/core/generic.py", line 1534, in _get_item_cache
    values = self._data.get(item)

  File "/Users/<user_name>/anaconda/lib/python3.6/site-packages/pandas/core/internals.py", line 3590, in get
    loc = self.items.get_loc(item)

  File "/Users/<user_name>/anaconda/lib/python3.6/site-packages/pandas/core/indexes/base.py", line 2395, in get_loc
    return self._engine.get_loc(self._maybe_cast_indexer(key))

  File "pandas/_libs/index.pyx", line 132, in pandas._libs.index.IndexEngine.get_loc (pandas/_libs/index.c:5239)

  File "pandas/_libs/index.pyx", line 154, in pandas._libs.index.IndexEngine.get_loc (pandas/_libs/index.c:5085)

  File "pandas/_libs/hashtable_class_helper.pxi", line 1207, in pandas._libs.hashtable.PyObjectHashTable.get_item (pandas/_libs/hashtable.c:20405)

  File "pandas/_libs/hashtable_class_helper.pxi", line 1215, in pandas._libs.hashtable.PyObjectHashTable.get_item (pandas/_libs/hashtable.c:20359)

KeyError: 'B874576'

Based on the error message, there is clearly an issue with the ds[row['reservation_num']].count portion of the last line of code. However, I am unsure of the right way to obtain the number of guests per reservation in a manner that will let me programmatically create the new feature.
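For reference, the sample above can be rebuilt in memory like this (a reproducibility sketch; my real data is loaded from hotels.csv, and the answers below refer to the same DataFrame as df):

import pandas as pd

# Stand-in for pd.read_csv('hotels.csv'), using the sample rows shown above
df = pd.DataFrame({
    'reservation_num': ['B874576', 'H786234', 'H786234', 'F987354',
                        'N097897', 'H567349', 'N097897'],
    'rate': [169.95, 258.95, 258.95, 385.95, 449.95, 482.95, 449.95],
    'guest_name': ['Bob Smith', 'Jane Doe', 'John Doe', 'David Jones',
                   'Mark Davis', 'Larry Stein', 'Sue Miller'],
})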

HMLDude

2 Answers


Option 1
pd.Series.value_counts and map

df.rate / df.reservation_num.map(df.reservation_num.value_counts())

0    169.950
1    129.475
2    129.475
3    385.950
4    224.975
5    482.950
6    224.975
dtype: float64
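To store the result as the new column from the question (assuming the DataFrame is named df as above), assign the Series back:

df['rate_per_person'] = df.rate / df.reservation_num.map(df.reservation_num.value_counts())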

Option 2
groupby, transform, and size

df.rate / df.groupby('reservation_num').rate.transform('size')

0    169.950
1    129.475
2    129.475
3    385.950
4    224.975
5    482.950
6    224.975
dtype: float64
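Since rate has no missing values in the sample, transform('count'), which counts only non-null entries, gives the same group sizes; assuming that holds for the full file as well:

df.rate / df.groupby('reservation_num').rate.transform('count')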

Option 3
np.unique and np.bincount

import numpy as np

u, f = np.unique(df.reservation_num.values, return_inverse=True)
df.rate / np.bincount(f)[f]

0    169.950
1    129.475
2    129.475
3    385.950
4    224.975
5    482.950
6    224.975
dtype: float64
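To make the mechanics concrete, these are the intermediate values on the seven sample rows (continuing with np and df from above):

u, f = np.unique(df.reservation_num.values, return_inverse=True)
# u: sorted unique reservation numbers
#    ['B874576' 'F987354' 'H567349' 'H786234' 'N097897']
# f: position of each row's reservation number within u
#    [0 3 3 1 4 2 4]
# np.bincount(f): guests per unique reservation -> [1 1 1 2 2]
# np.bincount(f)[f]: guests on each row's reservation -> [1 2 2 1 2 1 2]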

Option 3.5
np.unique sorts, so it doesn't scale as well as pd.factorize. In this context they do the same thing, so I use a function that switches between them at an anecdotal threshold on array length, around the point where one becomes more performant than the other. It is numbered 3.5 because it is essentially the same answer as Option 3.

def factor(a):
    if len(a) > 10000:
        return pd.factorize(a)[0]
    else:
        return np.unique(a, return_inverse=True)[1]

def count(a):
    f = factor(a)
    return np.bincount(f)[f]

df.rate / count(df.reservation_num.values)  

0    169.950
1    129.475
2    129.475
3    385.950
4    224.975
5    482.950
6    224.975
dtype: float64
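To illustrate why the two labelings are interchangeable for counting, here is the pd.factorize side on the sample data (continuing with np, pd, and df from above; the labels come out in a different order, but the per-row counts are identical):

codes, uniques = pd.factorize(df.reservation_num.values)
# codes are assigned in order of first appearance: [0 1 1 2 3 4 3]
# np.bincount(codes)[codes] -> [1 2 2 1 2 1 2], the same per-row
# group sizes that the np.unique version produced
df.rate / np.bincount(codes)[codes]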

Timing

%timeit df.rate / df.reservation_num.map(df.reservation_num.value_counts())
%timeit df.rate / df.groupby('reservation_num').rate.transform('size')

1000 loops, best of 3: 650 µs per loop
1000 loops, best of 3: 768 µs per loop

%%timeit
u, f = np.unique(df.reservation_num.values, return_inverse=True)
df.rate / np.bincount(f)[f]

10000 loops, best of 3: 131 µs per loop
piRSquared

You can do this operation with groupby and transform. The lambda divides the first rate in each reservation group by the size of the group, i.e. the number of guests on that reservation:

df['rate_per_person'] = df.groupby('reservation_num')['rate'].transform(lambda x: x.iloc[0] / x.size)

Output:

  reservation_num    rate   guest_name  rate_per_person
0         B874576  169.95    Bob Smith          169.950
1         H786234  258.95     Jane Doe          129.475
2         H786234  258.95     John Doe          129.475
3         F987354  385.95  David Jones          385.950
4         N097897  449.95   Mark Davis          224.975
5         H567349  482.95  Larry Stein          482.950
6         N097897  449.95   Sue Miller          224.975
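If you prefer a named function over the lambda (a small readability sketch, assuming the rate is identical for every guest on a reservation, as in the sample):

def per_person(rates):
    # rate is the same for each guest on a reservation, so divide the
    # first rate in the group by the number of guests (group size)
    return rates.iloc[0] / rates.size

df['rate_per_person'] = df.groupby('reservation_num')['rate'].transform(per_person)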
Scott Boston