I need to read in a lot of data (~10^6 data points) from a *.csv-file.

  • the data is stored in lines
  • I can't know how many data points per line or how many lines there are before reading the file
  • the number of data points per line can differ from line to line

So the *.csv-file could look like this:

x Header
x1,x2
y Header
y1,y2,y3, ...
z Header
z1,z2
...

Right now I read in every line as a string and split it at every comma. This is what my code looks like:

index = 1;
% read the first header line as a whole string (textscan returns a cell array)
headerLine = textscan(csvFileHandle,'%s',1,'Delimiter','\n');

while ~isempty(headerLine{1})

    % read the data line that follows the header
    dummy = textscan(csvFileHandle,'%s',1,'Delimiter','\n', ...
                'BufSize',2^31 - 1);
    % split the data line at the commas and convert to doubles
    rawData(index) = textscan(dummy{1}{1},'%f','Delimiter',',');
    % read the next header line (empty at end of file)
    headerLine = textscan(csvFileHandle,'%s',1,'Delimiter','\n');

    index = index + 1;
end

It's working, but it's pretty slow: most of the time (~95%) is spent splitting the strings with textscan. I preallocated rawData with sample data, but that barely improved the speed.

Is there a better way than mine to read in something like this?

If not, is there a faster way to split this string?

Fugu_Fish

1 Answer

First suggestion: to read a single line as a string when looping over a file, just use fgetl (it returns a plain string, so there's no faffing with cell arrays).
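
A minimal sketch of such a loop (variable names mirror the question; the parsing uses sscanf, which is discussed further down):

index = 1;
headerLine = fgetl(csvFileHandle);            % char row vector, or -1 at end of file
while ischar(headerLine)
    dataLine = fgetl(csvFileHandle);          % the numeric line after the header
    rawData{index} = sscanf(dataLine, '%f,'); % parse the comma-separated floats
    headerLine = fgetl(csvFileHandle);        % next header (or -1 when done)
    index = index + 1;
end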

Also, you might consider (if possible) reading everything in a single go rather than making repeated reads from the file:

output = textscan(fid, '%*s%s','Delimiter','\n');  % skips headers with *
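
In context, that might look like this (a sketch; 'data.csv' is a placeholder filename):

fid = fopen('data.csv', 'r');
output = textscan(fid, '%*s%s', 'Delimiter', '\n'); % output{1} is a cell array of the data lines
fclose(fid);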

If the file is so big that you can't do everything at once, try to read in blocks (e.g. tackle 1000 lines at a time, parsing data as you go).
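
A rough sketch of that blockwise approach (assuming fid is an open file handle; the block size is arbitrary):

blockSize = 1000;                                   % lines per read, tune as needed
while ~feof(fid)
    block = textscan(fid, '%s', blockSize, 'Delimiter', '\n');
    lines = block{1};                               % up to blockSize raw lines
    % ... parse each element of lines here before the next read
end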

For converting the string, there are the options of str2num or strsplit + str2double, but the only thing I can think of that might be slightly quicker than textscan is sscanf. Since sscanf doesn't accept the delimiter as a separate input, put it in the format string (true, the last value on a line doesn't end with a comma, but sscanf can handle that):

lines = output{1};                      % textscan wraps its result in a cell array
data = cell(1, numel(lines));           % preallocate
for n = 1:numel(lines)
    data{n} = sscanf(lines{n}, '%f,');  % column vector of the values on line n
end
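
Note that sscanf returns a column vector, so each data{n} is N-by-1; transpose if you need row vectors.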

Tests with a limited patch of test data suggest sscanf is a bit quicker (but this might depend on machine/version/data sizes).

nkjt
  • I can't try it at my workplace until next week, but I tested it on another system and so far it seems to be even slower: if I read in the whole file, it's ~5-10% slower; if I read line by line with fgetl, it's ~85% slower. I'll get back to you after I've tested it on the important system. – Fugu_Fish Jan 08 '15 at 17:41
  • Works like a charm on the important system, which runs R2009b. Your method only needs ~5% of the time of my code. Thanks a lot. – Fugu_Fish Jan 15 '15 at 06:21
  • Good to know! There can be a lot of variation depending on MATLAB version + system (I think it's partly due to optimisation tweaks by Mathworks). – nkjt Jan 15 '15 at 10:23