
Here is my problem: I have a table containing people's behavior information for one month (multiple features). Each person has a unique ID and a unique label (0 or 1). What I want to do is use these features to predict whether a customer belongs to group 0 or group 1.

However, the features of each ID were collected and recorded multiple times, which means I have multiple rows belonging to the same ID. How can I restructure my data and build a feature matrix where each ID corresponds to one row of features and one label?

Feature

ID feature1 feature2 feature3 ...
1  2        1.5      1        ...
2  1        3        0        ...
3  1        2        1        ...
1  2.5      1        1        ...
3  0.8      1        0        ...
...

Label

ID label
1  0
2  1
3  0
...

Sample: two dataframes

Is there a way to take these multiple rows of features into account as much as possible and create a feature matrix with a one-to-one correspondence between IDs and rows?

My personal idea so far: first, compute the number of times each ID appears and use it as a new feature. Second, cluster the rows of each ID into two clusters and use the cluster center of the majority cluster as the feature array for that ID.

Can anyone help me? Thanks a lot!

WWH98932
  • 1. What is this "table"? A text file? A pandas dataframe? Something else? 2. How about taking the mean of each feature if IDs are duplicated? – timgeb Dec 05 '18 at 13:43
  • I get your problem, but want to point out that the way you present the table is misleading. There can only be ONE ID in your list. If your features are "updated" you should indicate that by using e.g. lists or (better) numpy arrays. Next you need to become aware of how the evolution of the features leads to a classification into 0 or 1. Maybe you need the mean of those values, or maybe the min/max range, or maybe something else. It's a conceptual question rather than a coding problem, I suppose. – offeltoffel Dec 05 '18 at 14:18
  • @timgeb Sorry for misleading, those are two dataframes containing features (the first one) and labels (the second). I have numerical and categorical values, is taking the means still a good way? – WWH98932 Dec 05 '18 at 14:48
  • @offeltoffel Thanks for replying, is taking the means still a good way when I have numerical and categorical values? – WWH98932 Dec 05 '18 at 14:50
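Following the suggestion in the comments, a minimal sketch (column names here are illustrative, not from the question) of collapsing duplicated IDs by taking the mean of numeric features and the most frequent value of categorical ones:

```python
import pandas as pd

# Hypothetical mix of one numeric and one categorical feature per ID
df = pd.DataFrame({
    "ID":    [1, 1, 2, 3, 3],
    "num_f": [2.0, 2.5, 1.0, 1.0, 0.8],
    "cat_f": ["a", "a", "b", "c", "b"],
})

# Mean for the numeric feature, mode (most frequent value) for the categorical one
agg = df.groupby("ID").agg(
    num_f=("num_f", "mean"),
    cat_f=("cat_f", lambda s: s.mode().iloc[0]),
).reset_index()
```

`agg` now has exactly one row per ID, ready to be merged with the label dataframe on `ID`.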

1 Answer


Feature engineering will be influenced significantly by any hypotheses you might have about the data and the end use for the engineered features.

To start with, you can aggregate all features at the ID level using basic statistics such as MIN, MAX, NMISS (number of missing values), COUNT, SUM, MEAN, STDEV, etc. So, if you have f features and use k statistics, you will end up with f*k independent variables.
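A minimal pandas sketch of this aggregation, using the sample data from the question (with six statistics per feature, giving 3*6 = 18 derived columns):

```python
import pandas as pd

# Feature dataframe mirroring the sample in the question
features = pd.DataFrame({
    "ID":       [1, 2, 3, 1, 3],
    "feature1": [2, 1, 1, 2.5, 0.8],
    "feature2": [1.5, 3, 2, 1, 1],
    "feature3": [1, 0, 1, 1, 0],
})

# Aggregate every feature per ID with k statistics -> f*k columns
# (std is NaN for IDs with a single row, e.g. ID 2 here)
agg = features.groupby("ID").agg(["min", "max", "count", "sum", "mean", "std"])

# Flatten the MultiIndex columns to names like "feature1_mean"
agg.columns = ["_".join(col) for col in agg.columns]
agg = agg.reset_index()
```

The resulting `agg` has one row per ID and can be merged with the label dataframe on `ID` to produce the final training matrix.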

In addition, depending on the data, you might be interested in looking at special categories - e.g. the number of occurrences of feature_1 >= 10 for each ID could be an additional variable.
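Such a conditional count could be sketched like this (the threshold 10 and the data values are purely illustrative):

```python
import pandas as pd

features = pd.DataFrame({
    "ID":       [1, 2, 3, 1, 3],
    "feature1": [12, 1, 1, 2.5, 15],
})

# Per ID, count how many rows satisfy feature1 >= 10
special = (
    features.assign(f1_ge_10=(features["feature1"] >= 10).astype(int))
            .groupby("ID")["f1_ge_10"]
            .sum()
            .reset_index()
)
```

Each such indicator-then-sum column can simply be joined to the other aggregated features on `ID`.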

Mortz