I have a two tables of of customer information and transaction info.
Customer information includes each person's quality of health (from 0 to 100)
e.g. if I extract just the Name
and HealthQuality
columns:
John: 70
Mary: 20
Paul: 40
etc etc.
After applying featuretools I noticed a new DIFF(HealthQuality)
variable.
According to the docs, this is what DIFF does:
"Compute the difference between the value in a list and the previous value in that list."
Is featuretools calculating the difference between Mary and John's health quality in this instance?
I don't think this kind of feature synthesis really works for customer records e.g. CUM_SUM(emails_sent)
for John. John's record is one row, and he has one value for the amount of emails we sent him.
For now I am using the ignore_variables=[all_customer_info]
option to remove all of the customer data except for transactions table of course.
This also leads me into another question.
Using data from the transactions table, John now has a DIFF(MEAN(transactions.amount))
. What is the DIFF
measured in this instance?
id MEAN(transactions.amount) DIFF(MEAN(transactions.amount))
0 1 21.950000 NaN
1 2 20.000000 -1.950000
2 3 35.604581 15.604581
3 4 NaN NaN
4 5 22.782682 NaN
5 6 35.616306 12.833624
6 7 24.560536 -11.055771
7 8 331.316552 306.756016
8 9 60.565852 -270.750700