Analysis of energy dataset by clustering

Question

I am fairly new to machine learning, and I am trying to write a Python script to analyse an energy dataset from a computer. In the end, the script should determine the different states of the computer (idle, standby, working, etc.) and how much energy each state uses on average.

I was wondering whether this task could be done with a clustering method such as k-means or DBSCAN.

I tinkered a bit with some clustering methods in scikit-learn, but the results so far were not as good as I expected. I researched clustering methods a lot, but I could never find a scenario similar to mine.

So my question is whether it's even worth the trouble and, if so, which clustering method (or machine learning algorithm in general) would be best suited to the task. Or are there better ways to do it?

The energy dataset is just a single column table with one cell being one energy value per second of a few days.

I would say in this form, question is too broad, you can try to share some example data from your dataset, add well formatted code of yours to show the issues to begin with. — Semih Korkmaz, Jul 14 '18 at 20:12
well i have no problem with the code.. the question is more what kind of clustering method would be best(if there is a good one for my purpose). — Brandon H., Jul 14 '18 at 21:05
@BrandonH. it would help if you could give us a plot of your time series - you already have some answers, but this would at least give us a feel of the data. — Nelewout, Jul 15 '18 at 08:06

score 0 · Accepted Answer · answered Jul 15 '18 at 07:58

The energy dataset is just a single column table with one cell being one energy value per second of a few days.

You will not be able to apply supervised learning for this dataset as you do not have labels for your dataset (there is no known state given an energy value). This means that models like SVM, decision trees, etc. are not feasible given your dataset.

What you have is a timeseries with a single variable output. As I understand it, your goal is to determine whether or not there are different energy states, and what the average value is for those state(s).

I think it would be incredibly helpful to plot the timeseries using something like matplotlib or seaborn. After plotting the data, you can have a better feel for whether your hypothesis is reasonable and how you might further want to approach the problem. You may be able to solve your problem by just plotting the timeseries and observing that there are, say, four distinct energy states (e.g. idle, standby, working, etc.), avoiding any complex statistical techniques, machine learning, etc.

To answer your question, you can in principle use k-means for one dimensional data. However, this is probably not recommended as these techniques are usually used on multidimensional data.
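For completeness, here is a minimal sketch of what k-means looks like on one-dimensional data, written with NumPy only. The function name `kmeans_1d`, the quantile-based initialisation and the choice of `k` are assumptions made for illustration; they are not taken from the question or from scikit-learn.

```python
import numpy as np

def kmeans_1d(values, k, iterations=100):
    """Lloyd's algorithm on a 1-D array (illustrative helper, not a library API)."""
    values = np.asarray(values, dtype=float)
    # Assumption: start from evenly spaced quantiles so the result is deterministic.
    centers = np.quantile(values, (np.arange(k) + 0.5) / k)
    for _ in range(iterations):
        # Assign every sample to its nearest center.
        labels = np.argmin(np.abs(values[:, None] - centers[None, :]), axis=1)
        # Move each center to the mean of its samples; keep it if the cluster is empty.
        new_centers = np.array([
            values[labels == j].mean() if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return centers, labels
```

The returned centers are the per-cluster mean energy values. Note that this treats every second independently and needs `k` to be chosen up front, which is part of why it may not be the best fit here.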

I would recommend that you look into Jenks natural breaks optimization or kernel density estimation. Similar questions to yours can be found here and here, and should help you get started.
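As a rough sketch of the density-based idea: estimate a smoothed density of the energy values and treat its local minima as break points between states. The code below uses NumPy only and a Gaussian kernel applied to binned counts; the function names, the `bandwidth` value and the number of bins are assumptions you would need to tune for your data.

```python
import numpy as np

def density_breakpoints(values, bandwidth, bins=512):
    """Local minima of a Gaussian-smoothed histogram (illustrative helper)."""
    values = np.asarray(values, dtype=float)
    counts, edges = np.histogram(values, bins=bins)
    centers = 0.5 * (edges[:-1] + edges[1:])
    # Binned kernel density estimate: each bin contributes a Gaussian bump
    # weighted by its count. The overall scale does not matter for finding minima.
    z = (centers[:, None] - centers[None, :]) / bandwidth
    density = (np.exp(-0.5 * z ** 2) * counts[None, :]).sum(axis=1)
    interior = density[1:-1]
    # "<" on the left and "<=" on the right so a flat valley yields one break.
    is_min = (interior < density[:-2]) & (interior <= density[2:])
    return centers[1:-1][is_min]

def state_means(values, breaks):
    """Mean energy of the samples falling between consecutive break points."""
    values = np.asarray(values, dtype=float)
    labels = np.digitize(values, breaks)
    return [float(values[labels == s].mean()) for s in range(len(breaks) + 1)]
```

A larger `bandwidth` merges nearby states and a smaller one splits them, so it is worth checking the resulting break points against a plot of the data.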

score 0 · Answer 2 · answered Jul 15 '18 at 10:55

Don't ignore time.

First of all, if your signal is noisy, temporal smoothing will likely help.
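One simple form of temporal smoothing is a rolling median, which suppresses short spikes while keeping sharp transitions between levels. This is a sketch with NumPy (it needs NumPy 1.20 or later for `sliding_window_view`); the helper name `rolling_median` and the default window of 31 seconds are assumptions, not values from the question.

```python
import numpy as np

def rolling_median(values, window=31):
    """Centered rolling median with edge padding (illustrative helper)."""
    if window < 1 or window % 2 == 0:
        raise ValueError("window must be a positive odd number")
    values = np.asarray(values, dtype=float)
    half = window // 2
    # Repeat the first and last sample so the output has the same length as the input.
    padded = np.pad(values, half, mode="edge")
    windows = np.lib.stride_tricks.sliding_window_view(padded, window)
    return np.median(windows, axis=1)
```

A rolling mean would also work, but it blurs the step between two states, whereas the median keeps the step in place.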

Secondly, you'll want to perform some feature extraction first, for example by using segmentation to cut your time series into separate states. You can then try to cluster these states, but I am not convinced that clustering is applicable here at all. You will probably want to use a histogram or a density plot. It's one-dimensional data: you can visualize it and choose thresholds manually instead of hoping that some automated technique may work (because it may not...)
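Once thresholds have been picked by eye from a histogram, the segmentation and the per-state averages can be sketched as follows. The function names and the threshold values used in any call are hypothetical; the thresholds have to come from looking at your own data.

```python
import numpy as np

def segment_by_thresholds(values, thresholds):
    """Cut a series into runs of constant state.

    Returns a list of (start, end, state, mean) tuples, where `end` is exclusive
    and `state` is the index of the threshold band the run falls into.
    """
    values = np.asarray(values, dtype=float)
    if values.size == 0:
        return []
    states = np.digitize(values, np.sort(thresholds))
    # Positions where the state differs from the previous sample.
    change = np.flatnonzero(np.diff(states)) + 1
    starts = np.concatenate(([0], change))
    ends = np.concatenate((change, [len(values)]))
    return [
        (int(s), int(e), int(states[s]), float(values[s:e].mean()))
        for s, e in zip(starts, ends)
    ]

def summarize_states(segments):
    """Total number of samples and average value per state."""
    totals = {}
    for start, end, state, mean in segments:
        count, energy = totals.get(state, (0, 0.0))
        n = end - start
        totals[state] = (count + n, energy + mean * n)
    return {state: (count, energy / count) for state, (count, energy) in totals.items()}
```

With one sample per second, the count per state is the time spent in that state in seconds. Very short segments are a hint that the signal should be smoothed first or that a threshold sits inside a noisy band.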
