using textscan to read badly formatted CSV file in Octave

Question

I am trying to read a CSV file in Octave with textscan and the CSV file isn't always correctly formatted. The following mcve should illustrate the issue:

Let's say the file is as follows:

12/01/2020,12,1,2020,0,0,Russia,RU,RUS,145872260,Europe,0
11/01/2020,11,1,2020,0,0,Russia,RU,RUS,145872260,Europe,0
10/01/2020,10,1,2020,0,0,Russia,RU,RUS,145872260,Europe,0
09/01/2020,9,1,2020,0,0,Russia,RU,RUS,145872260,Europe,0
08/01/2020,8,1,2020,0,0,Russia,RU,RUS,145872260,Europe,0
07/01/2020,7,1,2020,0,0,Russia,RU,RUS,145872260,Europe,
06/01/2020,6,1,2020,0,0,Russia,RU,RUS,145872260,Europe,
05/01/2020,5,1,2020,0,0,Russia,RU,RUS,145872260,Europe,

You will notice that the final 0 is missing in the last 3 lines. Obviously, I can go in and manually edit the CSV files in Notepad++ or similar, but we're talking several tens of thousands of lines to go through and I just feel there must be a better solution.

My code would be something like this (note that I have tried using %*f for the last element to tell Octave to skip it but that doesn't seem to work):

fname = 'mcve.csv'; % the above file
fid = fopen(fname);
csv_data = textscan(fid,'%s %d %d %d %d %d %s %s %s %d %s %*f','Delimiter',',');
fclose(fid);

If you then look at csv_data, you will see that the dates are not correct (the rest of the data looks OK):

>> csv_data{1}
ans =
{
  [1,1] = 12/01/2020
  [2,1] = 11/01/2020
  [3,1] = 09/01/2020
  [4,1] = 08/01/2020
  [5,1] = 07/01/2020
  [6,1] = /01/2020
  [7,1] = /01/2020
}

Any idea on how to solve this or what else to try other than the %*f I already tried?

is the only variation you need to accommodate the presence or lack of a trailing 0? Notepad++ with an intelligent find/replace might be much faster anyway. Replace "Europe," with "Europe,0", then replace all ",00" with ",0". other variations may be just as fast. — Nick J, Jul 13 '20 at 20:25
haven't tried this, but another approach might be to remove the delimiter statement, add the commas into the format string, and have it include the ',0' in the %s with the 'Europe'. parse and correct the missing 0 separately after the data is read. — Nick J, Jul 13 '20 at 20:30
Yes, the missing trailing 0 is the only thing I am trying to accomodate. The `csv2cell` option suggested in the answer seems to work well. — am304, Jul 14 '20 at 08:39

score 2 · Accepted Answer · answered Jul 13 '20 at 23:10

2

Use csv2cell from the io package.

answered Jul 13 '20 at 23:10

Tasos Papastylianou

21,371
2
28
57

Thank you, it worked. I had to rearange some of the downstream code because the resulting cell from `csv2cell` is structured differently from the one from `textscan`, but essentially it worked. – am304 Jul 14 '20 at 08:38
@am304 glad to hear it. Honestly, I feel like I give this answer to a variant of the textscan question in the comments every other week. Textscan is horrible. I think I might just start linking to this answer instead from now on. :) – Tasos Papastylianou Jul 14 '20 at 10:38

using textscan to read badly formatted CSV file in Octave

1 Answers1

Linked