I am literally months out of college with a CS BS, and my boss has me building a machine learning agent from scratch, by myself, in two months, to classify data into 23 categories. I took a single Intro to AI class, and we didn't even cover neural networks. I think I've got the basics figured out, but I'm having trouble preparing my data to feed into the model.
Feel free to comment on the (un)feasibility of this, but that's contextual info and not what my question is about. An example of the type of data I have, for a powerstrip-type device: one column DeviceID (a string of digits, unique per device), 12 columns of various integers indicating which outlets are in use and how much power each is pulling, and an integer indicating which location the device is at. I have oodles of this type of data, and I've been thinking I could use an RNN with a softmax output layer to classify it into my categories. This will be supervised learning: the columns mentioned will be the input, and an integer from 1 to 23 will be the output. I need the model to look at a timeframe and categorize it, and a timeframe contains a varying number of rows, both because the number of devices varies and because each device produces a row twice per minute. For example:
   ID   1  2  3   4    5  RSSI  Temperature  R_ID  TimeStamp
43713   0  0  0   0  118   -82           97    45  2019-08-27 15:38:00.387
49945   0  0  5   0    0   -88           89    45  2019-08-27 15:38:00.493
43711   0  0  0   0    5   -65          120    45  2019-08-27 15:38:00.557
43685  12  4  0   0    0   -76          110    45  2019-08-27 15:38:01.807
44041   0  0  0  12    0   -80          104    45  2019-08-27 15:38:02.277
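Here's roughly how I've been grouping the raw rows so far, in case it helps picture the preprocessing. This is just a toy sketch on a copy of the sample above; `o1`..`o5` are my own names for the outlet columns, since the real headers are just 1-5:

```python
import pandas as pd

# Toy copy of the sample rows above; "o1".."o5" are my names for the
# outlet columns (the real headers are 1-5).
rows = [
    (43713,  0, 0, 0,  0, 118, -82,  97, 45, "2019-08-27 15:38:00.387"),
    (49945,  0, 0, 5,  0,   0, -88,  89, 45, "2019-08-27 15:38:00.493"),
    (43711,  0, 0, 0,  0,   5, -65, 120, 45, "2019-08-27 15:38:00.557"),
    (43685, 12, 4, 0,  0,   0, -76, 110, 45, "2019-08-27 15:38:01.807"),
    (44041,  0, 0, 0, 12,   0, -80, 104, 45, "2019-08-27 15:38:02.277"),
]
cols = ["ID", "o1", "o2", "o3", "o4", "o5",
        "RSSI", "Temperature", "R_ID", "TimeStamp"]
df = pd.DataFrame(rows, columns=cols)
df["TimeStamp"] = pd.to_datetime(df["TimeStamp"])

# One sequence per device, rows in time order; each sequence would be
# part of the one sample covering this timeframe.
sequences = {
    dev: g.sort_values("TimeStamp")[["o1", "o2", "o3", "o4", "o5"]].to_numpy()
    for dev, g in df.groupby("ID")
}
print(len(sequences))  # one sequence per distinct device in the timeframe
```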
My problem is this: I pulled one sample timeframe of 35 minutes from our SQL database (timeframes can vary from one minute to several hours) and got 3,747 distinct rows. That is clearly way too much to feed the model as one sample. If the usage on a powerstrip doesn't change from one minute to the next, it produces several rows that are identical except for the timestamp. When I removed the timestamp and deduplicated, I got 333 distinct rows. That still seems like an awful lot, and it throws away the time information I need.
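One idea I had (no clue if this is standard practice): instead of dropping the timestamp entirely to deduplicate, resample each device's readings onto a fixed interval, so runs of identical readings collapse to one row per bin but the time axis survives. A toy sketch:

```python
import pandas as pd

# Toy sketch: repeated identical readings for one device, twice per minute.
ts = pd.to_datetime([
    "2019-08-27 15:38:00", "2019-08-27 15:38:30",
    "2019-08-27 15:39:00", "2019-08-27 15:39:30",
])
df = pd.DataFrame({"o5": [118, 118, 118, 120]}, index=ts)

# Resample to one row per minute instead of dropping timestamps outright:
# the sequence stays time-ordered but shrinks by the oversampling factor.
per_minute = df.resample("1min").last()
print(len(per_minute))  # 2 rows: one per 1-minute bin
```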
My questions are these:

1. Is 333 rows actually too much for one sample? I know from my googling that I can make a sample span several rows, but can I do it when I don't know how many rows in advance? That is, instead of saying "look at X rows," can I say "look at X minutes of rows" as one sample?
2. What would an experienced dev (or data scientist? Idek) do in a situation like this?
3. As an alternative to working directly with the timeframes (which are predetermined by the data/work we're doing), I was thinking I might slide a window of [please advise] minutes across the timeframe, get the output for each window, and use those outputs as the input for classifying the whole timeframe. Is that a terrible idea? Would that even work?

The model needs to be able to detect differences due to time of day, different people, etc.
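To make the sliding-window idea concrete, here's a toy sketch of what I mean; the window and stride values are placeholders I'd need advice on:

```python
import numpy as np

# Toy sketch of the sliding-window idea: a variable-length sequence
# (T timesteps x F features) becomes a stack of fixed-size windows the
# RNN can consume regardless of how long the timeframe was.
T, F = 20, 7           # e.g. 10 minutes of 30-second readings, 7 features
window, stride = 8, 4  # placeholder window length and hop

seq = np.arange(T * F, dtype=float).reshape(T, F)  # stand-in sequence

starts = range(0, T - window + 1, stride)
windows = np.stack([seq[s:s + window] for s in starts])
print(windows.shape)  # (num_windows, window, F)
```

Each window would get its own classification, and then those per-window outputs would be the inputs to a second step that labels the whole timeframe.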
Thanks!