2

I have a 2-column DataFrame, column-1 corresponds to customer, column-2 corresponds to the city this customer has visited. The DataFrame looks like the following:

print(df)

    customer    visited_city
0   John        London
1   Mary        Melbourne
2   Steve       Paris
3   John        New_York
4   Peter       New_York
5   Mary        London
6   John        Melbourne
7   John        New_York

I would like to convert the above DataFrame into a row-vector format, such that each row represents a unique user with the row vector indicating the cities visited.

print(wide_format_df)

          London  Melbourne  New_York  Paris
John      1.0        1.0       1.0      0.0
Mary      1.0        1.0       0.0      0.0
Steve     0.0        0.0       0.0      1.0
Peter     0.0        0.0       1.0      0.0

Below is the code I used to generate the wide format. It iterates through each user one by one. I was wondering is there any more efficient way to do so?

import pandas as pd
import numpy as np

UNIQUE_CITIESS = np.sort(df['visited_city'].unique())
p = len(UNIQUE_CITIESS)
unique_customers = df['customer'].unique().tolist()

X = []
for customer in unique_customers:
    x = np.zeros(p)    
    city_visited = np.sort(df[df['customer'] == customer]['visited_city'].unique())
    visited_idx = np.searchsorted(UNIQUE_CITIESS, city_visited)
    x[visited_idx] = 1    
    X.append(x)
wide_format_df = pd.DataFrame(np.array(X), columns=UNIQUE_CITIESS, index=unique_customers)
wide_format_df
cwl
  • 501
  • 2
  • 7
  • 18

3 Answers3

3

You can use crosstab

pd.crosstab(df.customer, df.visited_city)

You get

visited_city    London  Melbourne   New_York    Paris
customer                
John            1       1           1           0
Mary            1       1           0           0
Peter           0       0           1           0
Steve           0       0           0           1
Vaishali
  • 37,545
  • 5
  • 58
  • 86
  • This is a good idea, but the issue is that potentially a given pair could appear multiple times in the original DataFrame, so doing `crosstab` will result in counts instead of the indicator vector. – cwl Aug 28 '17 at 03:10
  • Actually, I think one can use `df.drop_duplicates()` to remove duplicate rows in the original DataFrame, so `crosstab` should be good enough, thanks @Vaishali ! – cwl Aug 28 '17 at 03:21
  • cwl, you may want to look at @piRSquared answer that addresses duplicates and is more efficient. – Vaishali Aug 28 '17 at 18:08
3

Please note that your question has been edited such that the answers provided no longer answer your question. They must adjust to only return 1 for John in New York despite the fact he's been there twice.

Option 1 pir1
I like this answer because I think it's elegant.

pd.get_dummies(df.customer).T.dot(pd.get_dummies(df.visited_city)).clip(0, 1)

       London  Melbourne  New_York  Paris
John        1          1         1      0
Mary        1          1         0      0
Peter       0          0         1      0
Steve       0          0         0      1

Option 2 pir2
This answer should be fast.

i, r = pd.factorize(df.customer.values)
j, c = pd.factorize(df.visited_city.values)
n, m = r.size, c.size
b = np.zeros((n, m), dtype=int)
b[i, j] = 1

pd.DataFrame(b, r, c).sort_index().sort_index(1)

       London  Melbourne  New_York  Paris
John        1          1         1      0
Mary        1          1         0      0
Peter       0          0         1      0
Steve       0          0         0      1

Option 3 pir3
Practical and pretty quick

df.groupby(['customer', 'visited_city']).size().unstack(fill_value=0).clip(0, 1)

visited_city  London  Melbourne  New_York  Paris
customer                                        
John               1          1         1      0
Mary               1          1         0      0
Peter              0          0         1      0
Steve              0          0         0      1

Timing
Code Below

# Multiples of Minimum time
#
           pir1  pir2      pir3       wen       vai
10     1.392237   1.0  1.521555  4.337469  5.569029
30     1.445762   1.0  1.821047  5.977978  7.204843
100    1.679956   1.0  1.901502  6.685429  7.296454
300    1.568407   1.0  1.825047  5.556880  7.210672
1000   1.622137   1.0  1.613983  5.815970  5.396008
3000   1.808637   1.0  1.852953  4.159305  4.224724
10000  1.654354   1.0  1.502092  3.145032  2.950560
30000  1.555574   1.0  1.413612  2.404061  2.299856

enter image description here

wen = lambda d: d.pivot_table(index='customer', columns='visited_city',aggfunc=len, fill_value=0)
vai = lambda d: pd.crosstab(d.customer, d.visited_city)
pir1 = lambda d: pd.get_dummies(d.customer).T.dot(pd.get_dummies(d.visited_city)).clip(0, 1)
pir3 = lambda d: d.groupby(['customer', 'visited_city']).size().unstack(fill_value=0).clip(0, 1)

def pir2(d):
    i, r = pd.factorize(d.customer.values)
    j, c = pd.factorize(d.visited_city.values)
    n, m = r.size, c.size
    b = np.zeros((n, m), dtype=int)
    b[i, j] = 1

    return pd.DataFrame(b, r, c).sort_index().sort_index(1)

results = pd.DataFrame(
    index=[10, 30, 100, 300, 1000, 3000, 10000, 30000],
    columns='pir1 pir2 pir3 wen vai'.split(),
    dtype=float
)

for i in results.index:
    d = pd.concat([df] * i, ignore_index=True)
    for j in results.columns:
        stmt = '{}(d)'.format(j)
        setp = 'from __main__ import d, {}'.format(j)
        results.at[i, j] = timeit(stmt, setp, number=10)

print((lambda r: r.div(r.min(1), 0))(results))

results.plot(loglog=True)
piRSquared
  • 285,575
  • 57
  • 475
  • 624
2

Or you can using

df.pivot_table(index='customer', columns='visited_city',aggfunc=len, fill_value=0)

visited_city  London  Melbourne  New_York  Paris
customer                                        
John               1          1         1      0
Mary               1          1         0      0
Peter              0          0         1      0
Steve              0          0         0      1
BENY
  • 317,841
  • 20
  • 164
  • 234