I am literally months out of college with a CS BS, and my boss has me building a machine learning agent from scratch, by myself, in two months, to classify data into 23 categories. I took a single Intro to AI class, and we didn't even cover neural networks. I think I've got the basics figured out, but I'm having trouble preparing my data to feed into the model.
Feel free to comment on the (un)feasibility of this, but that's contextual info and not what my question is about. An example of the type of data I have, for a powerstrip-type device: one column DeviceID (a string of digits, unique per device), 12 columns of various integers indicating which outlets are in use and how much power each is pulling, and an integer indicating which location the device is at. I have oodles of this type of data, and I've been thinking I could use an RNN with a softmax output layer to classify it into my categories. This will be supervised learning: the columns mentioned will be the input, and an integer from 1 to 23 will be the output. I need the model to look at a timeframe and categorize it, and a timeframe contains a varying number of rows, both because the number of devices varies and because each device produces a row twice per minute. For example:
   ID   1  2  3   4    5  RSSI  Temperature  R_ID  TimeStamp
43713   0  0  0   0  118   -82           97    45  2019-08-27 15:38:00.387
49945   0  0  5   0    0   -88           89    45  2019-08-27 15:38:00.493
43711   0  0  0   0    5   -65          120    45  2019-08-27 15:38:00.557
43685  12  4  0   0    0   -76          110    45  2019-08-27 15:38:01.807
44041   0  0  0  12    0   -80          104    45  2019-08-27 15:38:02.277
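Here's roughly how I've been grouping the raw rows so far, in case it helps picture the preprocessing. This is just a toy sketch on a copy of the sample above; `o1`..`o5` are my own names for the outlet columns, since the real headers are just 1-5:

```python
import pandas as pd

# Toy copy of the sample rows above; "o1".."o5" are my names for the
# outlet columns (the real headers are 1-5).
rows = [
    (43713,  0, 0, 0,  0, 118, -82,  97, 45, "2019-08-27 15:38:00.387"),
    (49945,  0, 0, 5,  0,   0, -88,  89, 45, "2019-08-27 15:38:00.493"),
    (43711,  0, 0, 0,  0,   5, -65, 120, 45, "2019-08-27 15:38:00.557"),
    (43685, 12, 4, 0,  0,   0, -76, 110, 45, "2019-08-27 15:38:01.807"),
    (44041,  0, 0, 0, 12,   0, -80, 104, 45, "2019-08-27 15:38:02.277"),
]
cols = ["ID", "o1", "o2", "o3", "o4", "o5",
        "RSSI", "Temperature", "R_ID", "TimeStamp"]
df = pd.DataFrame(rows, columns=cols)
df["TimeStamp"] = pd.to_datetime(df["TimeStamp"])

# One sequence per device, rows in time order; each sequence would be
# part of the one sample covering this timeframe.
sequences = {
    dev: g.sort_values("TimeStamp")[["o1", "o2", "o3", "o4", "o5"]].to_numpy()
    for dev, g in df.groupby("ID")
}
print(len(sequences))  # one sequence per distinct device in the timeframe
```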
My problem is this: I pulled one sample timeframe of 35 minutes from our SQL database (timeframes can vary from one minute to several hours) and got 3,747 distinct rows. That is clearly way too much to feed the model as one sample. If the usage on a powerstrip doesn't change from one minute to the next, it produces several rows that are identical except for the timestamp. When I removed the timestamp and deduplicated, I got 333 distinct rows. That still seems like an awful lot, and it throws away the time information I need.
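One idea I had (no clue if this is standard practice): instead of dropping the timestamp entirely to deduplicate, resample each device's readings onto a fixed interval, so runs of identical readings collapse to one row per bin but the time axis survives. A toy sketch:

```python
import pandas as pd

# Toy sketch: repeated identical readings for one device, twice per minute.
ts = pd.to_datetime([
    "2019-08-27 15:38:00", "2019-08-27 15:38:30",
    "2019-08-27 15:39:00", "2019-08-27 15:39:30",
])
df = pd.DataFrame({"o5": [118, 118, 118, 120]}, index=ts)

# Resample to one row per minute instead of dropping timestamps outright:
# the sequence stays time-ordered but shrinks by the oversampling factor.
per_minute = df.resample("1min").last()
print(len(per_minute))  # 2 rows: one per 1-minute bin
```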
My questions are these:

1. Is 333 rows actually too much for one sample? I know from my googling that I can make a sample span several rows, but can I do it when I don't know how many rows in advance? That is, instead of saying "look at X rows," can I say "look at X minutes of rows" as one sample?
2. What would an experienced dev (or data scientist? Idek) do in a situation like this?
3. As an alternative to working directly with the timeframes (which are predetermined by the data/work we're doing), I was thinking I might slide a window of [please advise] minutes across the timeframe, get the output for each window, and use those outputs as the input for classifying the whole timeframe. Is that a terrible idea? Would that even work?

The model needs to be able to detect differences due to time of day, different people, etc.
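To make the sliding-window idea concrete, here's a toy sketch of what I mean; the window and stride values are placeholders I'd need advice on:

```python
import numpy as np

# Toy sketch of the sliding-window idea: a variable-length sequence
# (T timesteps x F features) becomes a stack of fixed-size windows the
# RNN can consume regardless of how long the timeframe was.
T, F = 20, 7           # e.g. 10 minutes of 30-second readings, 7 features
window, stride = 8, 4  # placeholder window length and hop

seq = np.arange(T * F, dtype=float).reshape(T, F)  # stand-in sequence

starts = range(0, T - window + 1, stride)
windows = np.stack([seq[s:s + window] for s in starts])
print(windows.shape)  # (num_windows, window, F)
```

Each window would get its own classification, and then those per-window outputs would be the inputs to a second step that labels the whole timeframe.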
Thanks!