
I export my current ML.Net data to a CSV file with this code:

using (var stream = File.Create("c:\\temp\\aidata.csv"))
    mlContext.Data.SaveAsText(trainData, stream);

The saved data looks ok, but when I try to load the CSV with the ML.Net Model Builder I get this error:

Data Error: Unable to infer column types of the file provided.

Note: I also have vector columns in the CSV

The CSV file looks like this (I have removed some columns to make it smaller).

If I load the data file with mlContext.Data.LoadFromTextFile("c:\\temp\\aidata.csv"); it loads without any error, but the Model Builder fails to load it.

CSDev
Mario
  • What does the data look like? Just tried the `SaveAsText` on a project I had and didn't get the error. – Jon Aug 09 '19 at 11:16
  • It looks like this: https://pastebin.com/xBBrpcaM (I have removed some columns to make it smaller) – Mario Aug 09 '19 at 11:26
  • You do not have a CSV file. A CSV file has one (optional) header row, and the data is normally separated by a character such as a comma. You have lots of header rows, and the data is separated by tabs. You may be able to import it as a tab-delimited file. – jdweng Aug 12 '19 at 05:13
  • @jdweng this is the CSV file exported by ML.Net, it is a CSV file which contains vectors! (sub-arrays for each record). – Mario Aug 12 '19 at 07:17
  • You may want to change the extension from csv (comma-separated values) to tsv (tab-separated values), like the examples on the link you provided. The error may be due to the filename extension being wrong. – jdweng Aug 12 '19 at 08:49
  • @MarioM Was data removed as well as the columns? It does look like there are more data than there are columns from the paste bin. – Jon Aug 15 '19 at 22:45
  • @Jon No, the data was left there, here is a complete data file, it gets deleted after the first download https://file.io/euxcP5 – Mario Aug 16 '19 at 18:47

2 Answers


Saving a file with the .csv extension does not make it a CSV file. The file written by SaveAsText has to be transformed first, for example like this:

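using System.IO;
using System.Linq;

static class MLCsvHelper
{
    // One "col=" entry from the "#@" schema header: a column name plus the
    // range of zero-based field indexes it occupies in every data row.
    private class ColumnDefinition
    {
        private readonly int end;

        public string Name { get; }
        public int Start { get; }
        public int Count { get; }

        public ColumnDefinition(string name, int start, int count) =>
            (Name, Start, Count, end) = (name, start, count, start + count - 1);

        public override string ToString() =>
            $"{Name}:\"{Start}:{end}\"";
    }

    // Rewrites a file produced by SaveAsText as tab-separated text with a single
    // header row. The result is written next to the input as "<name>.patched.csv".
    public static void Patch(string file, out string csv)
    {
        csv = Path.ChangeExtension(file, "patched.csv");
        var lines = File.ReadAllLines(file);

        // The schema block is the leading run of "#@" lines; only "col=" lines describe columns.
        var columns = lines.TakeWhile(line => line.Contains("#@"))
            .Where(line => line.Contains("col=")).Select(line => GetColumn(line))
            .ToArray();

        // Skip the schema block and the original header row, then split each row into fields.
        var data = lines.SkipWhile(line => line.Contains("#@")).Skip(1)
            .Select(line => line.Split('\t')).ToArray();

        // One header row with the column names, followed by one line per record.
        var res = new[] { string.Join("\t", columns.Select(column => column.Name)) }
            .Concat(data.Select(item => string.Join("\t", columns.Select(column => GetValue(column, item)))));

        File.WriteAllLines(csv, res.ToArray());
    }

    // Parses a schema line of the form "col=Name:Type:first-last" (or "col=Name:Type:index").
    private static ColumnDefinition GetColumn(string line)
    {
        var items = line.Split(new[] { '=', ':' });
        var name = items[1];
        var range = items.Last().Split('-');
        var start = int.Parse(range.First());
        var last = int.Parse(range.Last());
        var count = last - start + 1;
        return new ColumnDefinition(name, start, count);
    }

    // Returns the fields belonging to one column; a vector (more than one field)
    // is collapsed into a single quoted, tab-separated value.
    private static string GetValue(ColumnDefinition column, string[] data)
    {
        var chunk = data.Skip(column.Start).Take(column.Count);
        var value = string.Join("\t", chunk);
        if (chunk.Skip(1).Any())
            value = $"\"{value}\"";
        return value;
    }
}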

MLCsvHelper.Patch("zvVEYT", out var csv);
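The helper above depends on the schema lines having the shape `col=Name:Type:first-last`. As a minimal, self-contained sketch of that assumption, the snippet below parses two such lines. The two sample lines and the names `HeaderDemo` and `ParseCol` are made up for illustration and are not taken from the original file, so compare them with the `#@` block in your own export.

```csharp
using System;
using System.Linq;

static class HeaderDemo
{
    // Parses one schema line such as "#@   col=Features:R4:1-3" into
    // (column name, first field index, last field index).
    public static (string Name, int First, int Last) ParseCol(string line)
    {
        // Keep only the part after "col=", e.g. "Features:R4:1-3".
        var spec = line.Substring(line.IndexOf("col=", StringComparison.Ordinal) + 4);
        var parts = spec.Split(':');

        // The last part is either a single index ("0") or a range ("1-3").
        var range = parts.Last().Split('-');
        return (parts[0], int.Parse(range.First()), int.Parse(range.Last()));
    }

    public static void Main()
    {
        var scalar = ParseCol("#@   col=Label:R4:0");       // hypothetical sample line
        var vector = ParseCol("#@   col=Features:R4:1-3");  // hypothetical sample line

        Console.WriteLine($"{scalar.Name}: {scalar.First}..{scalar.Last}"); // Label: 0..0
        Console.WriteLine($"{vector.Name}: {vector.First}..{vector.Last}"); // Features: 1..3
    }
}
```

A column whose first and last index differ is a vector, which is the case `GetValue` wraps in quotes.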

CSDev
  • I have tried this method, and yes the patched file is a little smaller and contains the quotes which are not in the file after the first SaveAsText....but the model builder still cannot load the file, I get the same error...:( – Mario Aug 16 '19 at 18:26
  • Here is my complete CSV file, saved with your patch https://file.io/UvlDDi – Mario Aug 16 '19 at 18:40
  • @Mario M, it's file not found. – CSDev Aug 16 '19 at 18:41
  • they have deleted the file, I don't know why – Mario Aug 16 '19 at 18:42
  • Try this now https://file.io/zvVEYT I believe after the first download it gets deleted – Mario Aug 16 '19 at 18:43
  • Did you get it ? – Mario Aug 16 '19 at 18:49
  • @MarioM, yes, thanks, got the file. Will take a look. – CSDev Aug 16 '19 at 19:00
  • @MarioM, I wrote a util to transform the file to real csv. See renewed answer. – CSDev Aug 17 '19 at 14:22
  • Let us [continue this discussion in chat](https://chat.stackoverflow.com/rooms/198085/discussion-between-alex-and-mario-m). – CSDev Aug 17 '19 at 15:29
  • @MarioM, did you try it? I even got it trained with accuracy of 75%. – CSDev Aug 17 '19 at 18:07
  • wow! yes it works now! But I have to check if the vectors are loaded as vectors and not as a single value – Mario Aug 17 '19 at 18:51
  • the float vector columns are converted to strings, and the non vector columns are float. That means the vectors are loaded into a single string which is not ok – Mario Aug 17 '19 at 19:07
  • Try to generate the code with the model builder, and look in ModelInput.cs, there are no vectors in any column – Mario Aug 17 '19 at 19:10
  • @MarioM, I see it. Will try to find out something. – CSDev Aug 18 '19 at 12:46
  • @MarioM, sadly, I was looking for the solution whole day and didn't manage to. I tried different formatting, tried `CsvHelper`, some other approaches. Nothing. The Model Builder is not open source, so one can not fix it for personal use. They got [feature request](https://github.com/dotnet/machinelearning/issues/3684). Someone else is experiencing this [string-problem](https://stackoverflow.com/questions/57454880/using-ml-net-model-builder-to-predict-value-at-date-gives-strange-results). – CSDev Aug 18 '19 at 17:01
  • I know, I was also looking for the source code of model builder. – Mario Aug 21 '19 at 10:07
  • @MarioM, they try different approaches one by one searching for the best one. One can implement it with help of [samples](https://github.com/dotnet/machinelearning-samples). – CSDev Aug 21 '19 at 10:14
  • are you saying to make my own model builder with AutoML ? – Mario Aug 21 '19 at 10:19
  • @MarioM, yes, the only way, unless they fix it or go open source. I guess it will be mostly copy-pasting from those samples and simple code generation. – CSDev Aug 21 '19 at 10:24
  • @MarioM, how is it going? Any success with the problem? – CSDev Aug 29 '19 at 15:40
  • not yet, I worked on a different part of the project. Also they released a newer version of Model Builder but it still does not support vectors. – Mario Sep 03 '19 at 22:51
  • I have created an automated AutoML Regression test, I got results for 11 tests but it freezes after FastForest without any error, I got the results using a progresshandler – Mario Sep 13 '19 at 19:55
  • @MarioM, did you try inspecting event viewer or profiling? – CSDev Sep 17 '19 at 08:00
  • no, but I tried LightGbm regression which was found by AutoML to be an accurate prediction and sadly with live data is not. Now I am looking for a C# deep learning library with GPU support and I can't find one. – Mario Sep 17 '19 at 22:30
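The "own model builder with AutoML" idea from the comments above could be sketched roughly as follows. This assumes the Microsoft.ML.AutoML package is referenced; the names `AutoMlSketch` and `FindBestRegressor` are hypothetical, and the thread does not confirm how AutoML treats the vector columns, so take it as a starting point only.

```csharp
using System;
using Microsoft.ML;
using Microsoft.ML.AutoML;
using Microsoft.ML.Data;

static class AutoMlSketch
{
    // Runs a time-boxed AutoML regression experiment on an IDataView that is
    // already loaded in code (so no CSV round trip) and returns the best run.
    public static RunDetail<RegressionMetrics> FindBestRegressor(
        MLContext mlContext, IDataView trainData, string labelColumn, uint seconds)
    {
        // Print every finished trial, so a trainer that stalls is easy to spot.
        // ValidationMetrics can be null when a trial fails.
        var progress = new Progress<RunDetail<RegressionMetrics>>(run =>
            Console.WriteLine($"{run.TrainerName}: R^2 = {run.ValidationMetrics?.RSquared}"));

        ExperimentResult<RegressionMetrics> result = mlContext.Auto()
            .CreateRegressionExperiment(seconds)
            .Execute(trainData, labelColumnName: labelColumn, progressHandler: progress);

        return result.BestRun;
    }
}
```

The returned `BestRun` exposes the trained `Model` and its `ValidationMetrics`, which still need to be checked against held-out or live data.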

As @jdweng said in the comments, the file you provided is not in a correct csv format (columns and values separated by ";"). It does, however, look like tsv format (columns and values separated by tabs).

It should work if you try saving your text as a tsv file instead.

Also, the ML.Net video example is using tsv format.
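As a hedged sketch of what "saving as tsv" might look like in code: `SaveAsText` takes optional `separatorChar`, `headerRow`, `schema` and `forceDense` arguments, so the `#@` schema block can be left out and the separator chosen explicitly. The `SaveExample` and `SaveDelimited` names are hypothetical, and whether Model Builder accepts the result for vector columns is not something this thread confirms.

```csharp
using System.IO;
using Microsoft.ML;

static class SaveExample
{
    // Writes `data` as delimited text with one header row and no "#@" schema block.
    public static void SaveDelimited(MLContext mlContext, IDataView data, string path, char separator)
    {
        using (var stream = File.Create(path))
        {
            mlContext.Data.SaveAsText(
                data,
                stream,
                separatorChar: separator, // '\t' for tsv, ',' or ';' for csv
                headerRow: true,          // keep a single row of column names
                schema: false,            // omit the "#@ TextLoader{ ... }" header
                forceDense: true);        // write sparse vectors in dense form
        }
    }
}
```

For example, `SaveExample.SaveDelimited(mlContext, trainData, "c:\\temp\\aidata.tsv", '\t');` would write a tab-separated file. Each vector still expands to one field per element.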

D. Dahlberg
  • I have renamed the file to .tsv and tried to load it in Model Builder, but it still does not recognize it. Again, it is a file saved by ML.Net with Data.SaveAsText, it should recognize its own format. – Mario Aug 12 '19 at 12:22
  • How can I save it as a tsv from ML.Net data? there is only one function, SaveAsText – Mario Aug 12 '19 at 12:23
  • Looks like there is a separator character option : https://learn.microsoft.com/en-us/dotnet/machine-learning/how-to-guides/load-data-ml-net – jdweng Aug 12 '19 at 15:24
  • I found it a lot easier to just convert to csv and run it from there. Does your file contain all the "#@" lines as in your pastebin link? If so, does it work if you remove them? – D. Dahlberg Aug 12 '19 at 15:31
  • @D.Dahlberg Yes it contains all the "#@" ...I have removed the characters and now I get 2 errors instead of one...Data Error: Unable to infer column types of the file provided. Data Error: Unable to split the file provided into multiple, consistent columns. – Mario Aug 12 '19 at 22:08
  • @D.Dahlberg if I load the data with mlContext.Data.LoadFromTextFile("c:\\temp\\aidata.csv"); it works. But the Model Builder cannot load it...I also tried a smaller file (2MB) instead of 15MB the original file and same result. – Mario Aug 12 '19 at 22:14
  • I copied the content from your link without all the "#@" parameters, separated the columns with ";", placed your data on separate lines and saved it as csv, and it worked! I could train the model and finish – D. Dahlberg Aug 13 '19 at 14:37
  • @D.Dahlberg can you paste a small portion of your CSV which have vectors to pastebin? So I can see what is the difference between yours and mine – Mario Aug 13 '19 at 18:18
  • Here's the [link](https://pastebin.com/DSgpKbST). based on your file I interpreted the columns to be like that. Was it wrong? – D. Dahlberg Aug 13 '19 at 18:49
  • It was wrong.. here's a new [link](https://pastebin.com/YvN4y7QB). Both files worked for me though – D. Dahlberg Aug 13 '19 at 18:56
  • @D.Dahlberg but you don't have vectors in this CSV, – Mario Aug 15 '19 at 08:29
  • @D.Dahlberg these normal CSV files work, but as I said I have data with vectors (more than 1 record for some of the fields) – Mario Aug 15 '19 at 11:00
  • @MarioM Hmm.. maybe [this](https://stackoverflow.com/a/3853647/8380785) answer could work then? – D. Dahlberg Aug 15 '19 at 12:03
  • @D.Dahlberg and how can I save it as TSV when there is only one SaveAsText function? – Mario Aug 16 '19 at 18:49
  • @D.Dahlberg it seems that model builder cannot load CSV files with vector columns – Mario Aug 17 '19 at 19:37
  • @MarioM Yeah sorry I can’t get that to work either! Could splitting to separate columns work for you? – D. Dahlberg Aug 17 '19 at 21:51
  • I need vectors, because I have more than half of the columns with over 1000 values each – Mario Aug 18 '19 at 07:28