I am running into a problem during feature engineering and am looking for suggestions. Problem statement: I have usage data for multiple customers covering 3 days. Some customers have only 1 day of usage, some 2, and some 3. The data covers things like the number of emails sent and the number of contacts added on each day.

I am converting this time-series data to a column-wise layout, i.e., the number of emails a customer sent on day 1 is one feature, the number sent on day 2 is another feature, and so on. The problem is that usage can be increasing or decreasing depending on the customer.

Example 1: customer 'A' --> 'number of emails sent on 1st day' = 100, 'number of emails sent on 2nd day' = 0

Example 2: customer 'B' --> 'number of emails sent on 1st day' = 0, 'number of emails sent on 2nd day' = 100

Example 3: customer 'C' --> 'number of emails sent on 1st day' = 0, 'number of emails sent on 2nd day' = 0

Example 4: customer 'D' --> 'number of emails sent on 1st day' = 100, 'number of emails sent on 2nd day' = 100

In the first two cases, my new feature (the change from day 1 to day 2) will have the values "-100" and "100", which I guess is good for differentiating. But the problem arises for the 3rd and 4th examples, where the new feature value will be "0" in both scenarios. Can anyone suggest a way to handle this?
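To make the collision concrete, here is a minimal sketch using the four example customers. The names `usage`, `day_over_day_diff`, and `diff_with_level` are hypothetical, and pairing the difference with the mean of the two days is only one possible assumption about how to separate customers 'C' and 'D'; it is not something the original data pipeline is known to do.

```python
# Usage from the four examples above: (emails on day 1, emails on day 2).
usage = {
    "A": (100, 0),
    "B": (0, 100),
    "C": (0, 0),
    "D": (100, 100),
}


def day_over_day_diff(day1, day2):
    """Change in emails sent from day 1 to day 2."""
    return day2 - day1


# The difference alone cannot tell 'C' (never active) from 'D' (steadily active).
diffs = {name: day_over_day_diff(*days) for name, days in usage.items()}
print(diffs)  # {'A': -100, 'B': 100, 'C': 0, 'D': 0}

# Assumption: keep a level feature (here the mean of the two days) next to the
# difference, so that equal differences at different usage levels stay distinct.
diff_with_level = {
    name: (day_over_day_diff(*days), sum(days) / len(days))
    for name, days in usage.items()
}
print(diff_with_level["C"], diff_with_level["D"])  # (0, 0.0) (0, 100.0)
```

With both values kept, every one of the four customers gets a distinct feature pair, and both features remain continuous.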

SSuram
  • Instead of printing `0`, print "No Change" or something similar when that's the case. – martineau Apr 11 '19 at 00:40
  • I thought of it, but I am confused about one thing. If I do that, I will have to make the new feature categorical, which is not ideal because the other values will be continuous. Instead, I could keep absolute values in the new feature and indicate the trend separately as "+1" for increasing, "-1" for decreasing, "no change" for no change, and "0" if both values are "0". Would that be a good approach, though? – SSuram Apr 11 '19 at 00:51
  • It's hard to say because you haven't precisely defined the criteria / constraints for judging whether a given way of handling the situation is a "good" one or not. – martineau Apr 11 '19 at 01:00
  • I want to capture the 3-day usage trend of each of these customers for all the useful features. Based on that trend, I have to classify customers into different classes. Does that answer your question? – SSuram Apr 11 '19 at 01:05
  • You can take the sin(#emails_in_a_day/#max_number_of_emails). Or, you can take a mean of all days and update each day to the #of_days_more_or_less_than_mean. – rhn89 Apr 12 '19 at 20:39

1 Answer

You can extract the following features:

  1. Simple moving averages for day 2 and day 3, respectively. This gives you two extra columns.

  2. Percentage change from the previous day

  3. Percentage change from day 1 to day 3
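The three features listed above could be computed roughly as follows. This is a sketch under stated assumptions, not a definitive implementation: the answer does not specify the moving-average window, so a trailing window of 2 days is assumed; percentage change from a zero baseline is undefined, so `None` is returned in that case; and the customer is assumed to have all three days of data. The function and key names are hypothetical.

```python
def simple_moving_average(values, window=2):
    """Trailing mean over `window` days, for each day that has a full window."""
    return [
        sum(values[i - window + 1 : i + 1]) / window
        for i in range(window - 1, len(values))
    ]


def pct_change(old, new):
    """Percentage change from `old` to `new`; None when `old` is 0 (undefined)."""
    if old == 0:
        return None
    return (new - old) / old * 100.0


def trend_features(emails):
    """Build trend features from three daily counts [day1, day2, day3]."""
    # Assumption: window of 2, so there is one average ending on day 2
    # and one ending on day 3.
    sma_day2, sma_day3 = simple_moving_average(emails, window=2)
    return {
        "sma_day2": sma_day2,
        "sma_day3": sma_day3,
        "pct_change_day1_to_day2": pct_change(emails[0], emails[1]),
        "pct_change_day2_to_day3": pct_change(emails[1], emails[2]),
        "pct_change_day1_to_day3": pct_change(emails[0], emails[2]),
    }


features = trend_features([100, 50, 75])
print(features)
```

Customers with only 1 or 2 days of usage, and days where the previous count is 0 (as for customers 'B' and 'C' in the question), would still need an explicit policy such as a sentinel value or a separate indicator column.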

Pascal Zoleko