0

Below is an example:

buy_log_df = pd.DataFrame(
    [
        ["2020-01-02", 0, 1, 2, 2],
        ["2020-01-02", 1, 1, 1, 3],
        ["2020-01-02", 2, 2, 1, 1],
        ["2020-01-02", 3, 3, 3, 1],
    ],
    columns=['date', 'sale_id', 'customer_id', "item_id", "quantity"]
)

item_df = pd.DataFrame(
    [
        [1, 100],
        [2, 200],
        [3, 300],
    ],
    columns=['item_id', 'price']
)

item_df2 = pd.DataFrame(
    [
        [1, '1 3 10'],
        [2, '1 3'],
        [3, '2 5'],
    ],
    columns=['item_id', 'tags']
)

As you can see here, each item in item_df has multiple tag values as an one feature.

Here is what I've tried:

item_df2 = pd.concat([item_df2, item_df2['tags'].str.split(expand=True)], axis=1)
item_df2 = pd.melt(
    item_df2,
    id_vars=['item_id'],
    value_vars=[0,1,2],
    value_name="tags"
)
tag_log_df = item_df2[item_df2['tags'].notna()].drop("variable", axis=1,).sort_values("item_id")
tag_log_df

>>>

   item_id tags
0        1    1
3        1    3
6        1   10
1        2    1
4        2    3
2        3    2
5        3    5

It looks like I can't normalize this item entity (from buy_log entity) because it has multiple duplicated item_ids in the table.

How can I handle this case when I design the entityset?

user3595632
  • 5,380
  • 10
  • 55
  • 111

1 Answers1

2

Thanks for the question. To handle multiple tag values, you can normalize the tags into a data frame before structuring the entity set.

buy_log_df

       date  sale_id  customer_id  item_id  quantity
 2020-01-02        0            1        2         2
 2020-01-02        1            1        1         3
 2020-01-02        2            2        1         1
 2020-01-02        3            3        3         1

item_df

 item_id  price
       1    100
       2    200
       3    300

tag_log_df

 item_id tags
       1    1
       1    3
       1   10
       2    1
       2    3
       3    2
       3    5

With the normalized data, you can then structure the entity set.

es = ft.EntitySet()

es.entity_from_dataframe(
    entity_id='buy_log',
    dataframe=buy_log_df,
    index='sale_id',
    time_index='date',
)

es.entity_from_dataframe(
    entity_id='item',
    dataframe=item_df,
    index='item_id',
)

es.entity_from_dataframe(
    entity_id='tag_log',
    dataframe=tag_log_df,
    index='tag_log_id',
    make_index=True,
)

parent = es['item']['item_id']
child = es['buy_log']['item_id']
es.add_relationship(ft.Relationship(parent, child))

child = es['tag_log']['item_id']
es.add_relationship(ft.Relationship(parent, child))

enter image description here

Jeff Hernandez
  • 2,063
  • 16
  • 20
  • Thanks for the answer. But one additional question: if item price is included `buy_log_df` and `item_df` didn't exist at the first place, can I use `normalize_entity()` for creating `item` es from `buy_log` es? For me, it looks like I can use `normalize_entity()` for 1:N relationships and `add_relationship()` for case like tags above. Am I right? (Yes, this question is related to [this question](https://stackoverflow.com/questions/64892603/do-normalize-entity-add-relationships-are-logically-same-in-featuretool)) – user3595632 Nov 19 '20 at 02:55
  • Correct, in that case, you can use `normalize_entity()` to create `item` from `buy_log` – Jeff Hernandez Nov 19 '20 at 14:56