I am trying to apply KMeans clustering to a data set that contains timestamp values. The data set also has other columns: Id (int), Side (boolean), Quarter (int), and Half (int). I only want to cluster on the timestamp column. How can I create a pipeline with ML.NET's MLContext to do this?
The data set class looks like this:
    public class DataSet
    {
        public int Contract_Id { get; set; }
        public System.DateTime TimeStamp { get; set; }
        public bool Side { get; set; }
        public int Quarter { get; set; }
        public int Half { get; set; }
    }
I have a utility that returns the data set I need to work with from a SQL database. I load that data through the MLContext and then call the KMeans trainer to create a pipeline, passing the TimeStamp column as the feature input and ClusterId as the name of the output.
    var data = unitOfWork.Repository.GetPastFiveSeconds(); // data from the past 5 seconds
    var trainData = mlContext.Data.LoadFromEnumerable(data);
    // Cluster count: ceil(5 / row count), converted to int
    var numberOfClusters = Convert.ToInt32(Math.Ceiling((double)5 / data.Count()));
    var pipeline = mlContext.Clustering.Trainers.KMeans("TimeStamp", "ClusterId", numberOfClusters);
    var model = pipeline.Fit(trainData);
I want to get an array of clusters with each data point assigned to its cluster, but I get an exception: ClusterId column 'Weight' not found
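For reference, here is a minimal standalone sketch of what the cluster-count argument in the KMeans call above evaluates to. The row counts are made-up examples, not my real data, and `ClusterCountDemo` and `ClusterCount` are illustrative names only.

```csharp
using System;

public static class ClusterCountDemo
{
    // Same expression as the cluster-count argument in the question,
    // with the row count passed in instead of data.Count().
    public static int ClusterCount(int rowCount) =>
        Convert.ToInt32(Math.Ceiling((double)5 / rowCount));

    public static void Main()
    {
        // Hypothetical row counts, chosen only to show the arithmetic.
        foreach (var n in new[] { 1, 2, 5, 20 })
        {
            Console.WriteLine($"{n} rows -> {ClusterCount(n)} cluster(s)");
        }
        // 1 rows -> 5 cluster(s)
        // 2 rows -> 3 cluster(s)
        // 5 rows -> 1 cluster(s)
        // 20 rows -> 1 cluster(s)
    }
}
```

With five or more rows the expression yields a single cluster, so the number of clusters requested depends heavily on how many rows the five-second window returns.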
EDIT: I tried replacing the ClusterId argument of the KMeans call with null and added a type-conversion step:
    mlContext.Transforms.Conversion.ConvertType("TimeStampFloat", "TimeStamp", DataKind.Single)
        .Append(mlContext.Clustering.Trainers.KMeans("TimeStampFloat", null, Convert.ToInt32(Math.Ceiling((double)5 / trades.Count()))))
but now I get the error "Schema mismatch for feature column 'TimeStampFloat': expected Vector, got R4\r\nParameter name: inputSchema".