
I am trying to apply KMeans clustering to a data set with Timestamp values. The data set has other columns like Id (int), Side (boolean), Quarter (int), Half (int). But I only want to perform clustering using the timestamp column. How can I create a pipeline using the MLContext to do this?

The data set looks like:

public class DataSet
{
    public int Contract_Id { get; set; }
    public System.DateTime TimeStamp { get; set; }
    public bool Side { get; set; }
    public int Quarter { get; set; }
    public int Half { get; set; }
}

I have a utility that returns the data set I need to work with from a SQL database. I load that data into the MLContext and then use the KMeans trainer to create a pipeline, using the TimeStamp column as the feature input and ClusterId as the name of the output.

var data = unitOfWork.Repository.GetPastFiveSeconds(); // get past 5 sec data from now

var trainData = mlContext.Data.LoadFromEnumerable(data);

var pipeline = mlContext.Clustering.Trainers.KMeans("TimeStamp", "ClusterId", Convert.ToInt32(Math.Ceiling(  (double)5 / data.Count()   ))); // C# stuff to convert to int :/

var model = pipeline.Fit(trainData);

I want to get an array of clusters with the data points properly assigned to a cluster, but I'm getting an exception - ClusterId column 'Weight' not found

EDIT: I tried removing the ClusterId argument from the KMeans call (setting it to null) and added a conversion step:

mlContext.Transforms.Conversion.ConvertType("TimeStampFloat", "TimeStamp", DataKind.Single)
.Append(mlContext.Clustering.Trainers.KMeans("TimeStampFloat",null, Convert.ToInt32(Math.Ceiling(  (double)5 / trades.Count()   ))))

but I'm getting the error "Schema mismatch for feature column 'TimeStampFloat': expected Vector, got R4\r\nParameter name: inputSchema" now

Sagar Limaye

2 Answers


The second parameter to KMeans (the "ClusterId" you are passing) is the name of the optional example weights column.

You don't seem to have a ClusterId property in your DataSet type, so ML.NET fails to find that column.

Also, the third parameter is the number of clusters you expect to find in your data. If you don't know what to expect, play with it and try a few values.

So try:

var pipeline = mlContext.Clustering.Trainers.KMeans("TimeStamp");

You will need some pre-processing of your TimeStamp, as it is of a System.DateTime type. KMeans (and most ML.NET algorithms) will expect float types. Add a Transforms.Conversion.ConvertType to your pipeline.
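
A rough sketch of what that could look like, reusing the mlContext from the question; the TimeStampFloat name follows the edit in the question, the cluster count of 2 is only a placeholder, and the Concatenate is there because KMeans wants a vector of floats (the second answer covers that part):

var pipeline = mlContext.Transforms.Conversion.ConvertType("TimeStampFloat", "TimeStamp", DataKind.Single) // DateTime -> float
    .Append(mlContext.Transforms.Concatenate("TimeStampFloat", "TimeStampFloat"))          // KMeans needs a vector column
    .Append(mlContext.Clustering.Trainers.KMeans("TimeStampFloat", numberOfClusters: 2));  // 2 is just a placeholder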

amy8374

Answer for the edit:

The KMeans Features column needs to be a vector of floats, because usually there are many feature columns concatenated together. It's a bit of a hack, but if you add a Concatenate to your pipeline after the ConvertType and before KMeans, it should succeed:

var pipeline = mlContext.Transforms.Conversion.ConvertType("TimeStampFloat", "TimeStamp", DataKind.Single)
    .Append(mlContext.Transforms.Concatenate("TimeStampFloat", new[] { "TimeStampFloat" }))
    .Append(mlContext.Clustering.Trainers.KMeans("TimeStampFloat", null, 5));
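
To actually read the per-row assignments back out (the "array of clusters with the data points properly assigned" the question asks for), something along these lines should work; the ClusterPrediction class name is made up here, but PredictedLabel and Score are the columns the KMeans trainer emits:

// Somewhere at class/namespace level:
public class ClusterPrediction
{
    [ColumnName("PredictedLabel")]
    public uint ClusterId { get; set; }      // id of the assigned cluster (keys start at 1)

    [ColumnName("Score")]
    public float[] Distances { get; set; }   // distances to each cluster centroid
}

// After building the pipeline above:
var model = pipeline.Fit(trainData);
var scored = model.Transform(trainData);
var assignments = mlContext.Data
    .CreateEnumerable<ClusterPrediction>(scored, reuseRowObject: false)
    .ToArray();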
amy8374
  • Yep that worked. But all the data is being assigned to a single cluster, even if they are more than a minute apart! – Sagar Limaye Apr 03 '19 at 18:23
  • What does the data look like? If they are big floats, one minute apart might not be enough to cluster them separately. – amy8374 Apr 03 '19 at 18:40
  • Some data timestamp values are few seconds apart, and few minutes apart like: 15:47:00 15:47:03 15:49:00 15:49:03 15:49:03 15:49:52 15:54:57 16:00:00 17:07:49 17:08:12 17:08:12 17:09:13 17:09:13 10:00:05 10:01:00 10:01:00 10:01:00 11:10:00 11:10:00 11:10:15 11:01:00 11:02:00 11:03:00 – Sagar Limaye Apr 03 '19 at 19:06
  • Actually, if I set number of clusters to 5 the KMeans throws an exception - "Too few examples" and if I set it to Math.Ceil(5 / data.count()) , which probably evaluates to 1, it puts all the examples in one cluster. For the above data, how do I make it cluster into 13 clusters - one for each minute – Sagar Limaye Apr 04 '19 at 12:31
  • If the data is contiguous, like this, it might help to pre-process and separate the minute from the second, each in its own column. You can use the minute as the Feature column, if that is your only criterion. But if you know that you want to group data in clusters, and each cluster will be the minute indicator, it might be easier to do it with simpler code (LINQ, just a switch statement?) rather than machine learning; see the sketch after these comments. Machine learning would be better if you wanted to discover your clusters and centroids. – amy8374 Apr 17 '19 at 05:56
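
For reference, the plain-LINQ route from that last comment could look roughly like this; it's only a sketch, assuming data is the enumerable of DataSet rows returned by GetPastFiveSeconds and that one group per minute is what you are after:

// Group rows by their timestamp truncated to the minute; each group plays the role of one cluster.
var clustersByMinute = data
    .GroupBy(d => new DateTime(d.TimeStamp.Year, d.TimeStamp.Month, d.TimeStamp.Day,
                               d.TimeStamp.Hour, d.TimeStamp.Minute, 0))
    .Select(g => g.ToArray())
    .ToArray();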