Assume I have two tables: one with customer metadata, keyed by the field customer_id, and an events table recorded from website clickstream events with the fields customer_id and date. The second table can contain several events that are not unique, because date is unfortunately only a date, not a timestamp.

When I try to create an EntitySet as described in https://docs.featuretools.com/loading_data/using_entitysets.html, it fails with:

Index is not unique on dataframe (Entity transactions)

How can I either make the index unique or make this work?

Georg Heiler

1 Answer

If your table doesn't have a column that can be used as a unique index, you can have featuretools create one automatically. When calling EntitySet.entity_from_dataframe(...), pass the index parameter a column name that doesn't currently exist in the dataframe and set make_index=True. Featuretools then adds a column with that name and fills it with unique values.

For example, in the code below the event_id index is created automatically:

import pandas as pd
import featuretools as ft

df = pd.DataFrame({"customer_id": [0, 1, 0, 1, 1],
                   "date": [pd.Timestamp("1/1/2018"), pd.Timestamp("1/1/2018"),
                            pd.Timestamp("1/1/2018"), pd.Timestamp("1/2/2018"),
                            pd.Timestamp("1/2/2018")],
                   "event_type": ["view", "purchase", "view", "cancel", "purchase"]})

es = ft.EntitySet(id="customer_events")
es.entity_from_dataframe(entity_id="events",
                         dataframe=df,
                         index="event_id",
                         make_index=True,
                         time_index="date")

print(es["events"])

In the events entity you can see that event_id is now a variable, even though it wasn't in the original dataframe:

Entity: events
  Variables:
    event_id (dtype: index)
    date (dtype: datetime_time_index)
    customer_id (dtype: numeric)
    event_type (dtype: categorical)
  Shape:
    (Rows: 5, Columns: 4)
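
If you would rather make the index unique yourself before handing the dataframe to featuretools, here is a minimal pandas-only sketch. It reuses the toy data from above; the variable name events and the choice of a plain 0..n-1 integer key are assumptions for illustration, not something featuretools requires. It first confirms that neither customer_id nor the (customer_id, date) pair identifies a row, then adds an event_id surrogate key.

```python
import pandas as pd

# Same toy data as in the example above: several events share
# a customer_id and a date.
events = pd.DataFrame({
    "customer_id": [0, 1, 0, 1, 1],
    "date": pd.to_datetime(["2018-01-01", "2018-01-01", "2018-01-01",
                            "2018-01-02", "2018-01-02"]),
    "event_type": ["view", "purchase", "view", "cancel", "purchase"],
})

# Check whether existing columns could serve as a unique index.
customer_id_unique = events["customer_id"].is_unique
pair_has_duplicates = events.duplicated(subset=["customer_id", "date"]).any()
print(customer_id_unique, pair_has_duplicates)

# Add a surrogate key: one integer per row, numbered from 0.
events = events.reset_index(drop=True)
events.insert(0, "event_id", range(len(events)))

print(events["event_id"].is_unique)
```

With a column like this already in the dataframe, you should be able to pass index="event_id" and leave make_index at its default, since the column now exists and is unique.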
Max Kanter