
Background: I have logs that are generated while testing devices after manufacture. Each device has a serial number and a corresponding CSV log file with all the data, something like this:

DATE,TESTSTEP,READING,LIMIT,RESULT
01/01/2019 07:37:17.432 AM,1,23,10,FAIL
01/01/2019 07:37:23.661 AM,2,3,3,PASS
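For illustration, one such log could be loaded into pandas roughly as follows; the file name SN12345.csv is a hypothetical placeholder, and the timestamp format string is an assumption based on the sample rows above:

import pandas as pd

# read one device's log; the file name is hypothetical
df = pd.read_csv('SN12345.csv')
# parse timestamps like '01/01/2019 07:37:17.432 AM'
df['DATE'] = pd.to_datetime(df['DATE'], format='%m/%d/%Y %I:%M:%S.%f %p')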

There are many such log files, each with its test data. I have the serial numbers of the devices that failed in the field. I want to build a model from these log files and then use it to predict whether a given device has a chance of failing in the field, given its log file.

So far, as part of learning, I have worked with data like housing prices. Every row was complete; based on area, number of rooms, etc., it was easy to define a model for the expected selling price.

Here I wish to find a way to somehow flatten all the logs into a single row. I am thinking of having something like:

DATE_1,TESTSTEP_1,READING_1,LIMIT_1,RESULT_1,DATE_2,TESTSTEP_2,READING_2,LIMIT_2,RESULT_2
01/01/2019 07:37:17.432 AM,1,23,10,FAIL,01/01/2019 07:37:23.661 AM,2,3,3,PASS

Is this the right way to deal with this kind of data?

If so, does pandas have any built-in support for this?

I will be using scikit-learn to create models.
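For context, once each device's log is flattened to one row, the modelling step would look roughly like the sketch below. RandomForestClassifier and the toy data are illustrative assumptions, not part of the question:

from sklearn.ensemble import RandomForestClassifier
import pandas as pd

# X: one flattened row per device (numeric features only); y: 1 if the device failed in the field
X = pd.DataFrame({'READING_1': [23, 5], 'READING_2': [3, 2]})   # toy stand-in data
y = [1, 0]

clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X, y)
print(clf.predict_proba(X)[:, 1])   # estimated chance of field failure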

NotAgain

1 Answer


First convert the columns to an ordered CategoricalIndex so the output keeps the same column order. Convert the DATE column with to_datetime, reduce the datetimes to dates with Series.dt.date, and use cumcount for a per-date counter. Create a MultiIndex with set_index, reshape with unstack, and sort the second level of the column MultiIndex with sort_index. Finally flatten the columns with a list comprehension and reset_index:

import pandas as pd

# parse timestamps, then keep only the date part for grouping
df['DATE'] = pd.to_datetime(df['DATE'])
dates = df['DATE'].dt.date

# ordered CategoricalIndex keeps the original column order when sorting later
df.columns = pd.CategoricalIndex(df.columns, categories=df.columns, ordered=True)
# per-date counter: 1 for the first row of each date, 2 for the second, ...
g = df.groupby(dates).cumcount().add(1)
# reshape to one row per date; sort columns by the counter level
df = df.set_index([dates, g]).unstack().sort_index(axis=1, level=1)
# flatten the MultiIndex columns to names like DATE_1, TESTSTEP_1, ...
df.columns = [f'{a}_{b}' for a, b in df.columns]
df = df.reset_index(drop=True)
print(df)
                   DATE_1  TESTSTEP_1  READING_1  LIMIT_1 RESULT_1  \
0 2019-01-01 07:37:17.432           1         23       10     FAIL   

                   DATE_2  TESTSTEP_2  READING_2  LIMIT_2 RESULT_2  
0 2019-01-01 07:37:23.661           2          3        3     PASS

If you also need the date as a separate first column:

df['DATE'] = pd.to_datetime(df['DATE'])
dates = df['DATE'].dt.date

df.columns = pd.CategoricalIndex(df.columns, categories=df.columns, ordered=True)
g = df.groupby(dates).cumcount().add(1)
# rename the date index so it survives as a regular 'DAT' column
df = df.set_index([dates.rename('DAT'), g]).unstack().sort_index(axis=1, level=1)
df.columns = [f'{a}_{b}' for a, b in df.columns]
# keep the index as the first column instead of dropping it
df = df.reset_index()
print(df)
         DAT                  DATE_1  TESTSTEP_1  READING_1  LIMIT_1 RESULT_1  \
0 2019-01-01 2019-01-01 07:37:17.432           1         23       10     FAIL   

                   DATE_2  TESTSTEP_2  READING_2  LIMIT_2 RESULT_2  
0 2019-01-01 07:37:23.661           2          3        3     PASS    
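To go from one reshaped log to a training set, the same transform can be applied to every device's file and the resulting rows stacked, with the label taken from the list of serial numbers that failed in the field. A minimal sketch, assuming each log is named <serial>.csv under a logs/ directory; the directory, the failed_serials set, and the flatten_log wrapper are all hypothetical names:

import glob
import os
import pandas as pd

def flatten_log(df):
    # the unstack-based reshape from above, wrapped for reuse
    df['DATE'] = pd.to_datetime(df['DATE'])
    dates = df['DATE'].dt.date
    df.columns = pd.CategoricalIndex(df.columns, categories=df.columns, ordered=True)
    g = df.groupby(dates).cumcount().add(1)
    df = df.set_index([dates, g]).unstack().sort_index(axis=1, level=1)
    df.columns = [f'{a}_{b}' for a, b in df.columns]
    return df.reset_index(drop=True)

failed_serials = {'SN12345'}          # example set of serials that failed in the field

rows = []
for path in glob.glob('logs/*.csv'):
    serial = os.path.splitext(os.path.basename(path))[0]
    row = flatten_log(pd.read_csv(path))
    row['SERIAL'] = serial
    row['FAILED'] = int(serial in failed_serials)
    rows.append(row)

train = pd.concat(rows, ignore_index=True)

Note that non-numeric columns such as RESULT_1 and the DATE_* timestamps would still need encoding (e.g. one-hot) or dropping before scikit-learn can consume them.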
jezrael
  • The DATE field will have a time part also. My plan is to do away with it and club all the log data under the first timestamp. Is there any way you think we can accommodate that part too? I am a newbie; not sure if having something like DATE_1, DATE_2 will improve the accuracy. If you want, I can update the question with the time part attached to the date field. – NotAgain Apr 30 '19 at 06:29
  • @NotAgain - is it possible to change the sample data? And also, is the order of the columns important? – jezrael Apr 30 '19 at 06:30
  • I updated the data. I would like to keep the same order of columns to play safe. But I am a newbie; maybe someone with expertise in scikit-learn can comment. – NotAgain Apr 30 '19 at 06:35