
I have a 1.6 GB CSV file that I need to feed into MATLAB. I will have to do this frequently, so it needs to run quickly. The file is of the form:

20111205    00:00.2 99.18   6   E
20111205    00:00.2 99.18   5   E
20111205    00:00.2 99.18   1   E
20111205    00:00.2 99.195  5   E
20111205    00:00.2 99.195  5   E
20111205    01:27.0 99.19   5   E
20111205    02:01.4 99.185  1   E
20111205    02:01.4 99.185  1   E
20111205    02:01.4 99.185  1   E
20111205    02:01.4 99.185  1   E

The code I have right now is the following:

tic;
format long g
fid = fopen('C:\Program Files\MATLAB\R2013a\EDU13.csv','r');
c = fscanf(fid, '%d,%d:%d.%d,%f,%d,%c');
fclose(fid);
c = reshape(c, 7, length(c)/7);
toc;

But this is far too slow. I would appreciate a method of getting this CSV file into MATLAB in the most efficient manner possible. Thank you!

siegel
  • "far too slow" / "run quickly" -> can you express it with a unit of time? – Franck Dernoncourt Jun 12 '13 at 01:33
  • Also, how long does the reshape command take? I imagine the vast majority of the time is in reading the file, but could you check? – Engineero Jun 12 '13 at 03:33
  • Do you have sufficient RAM to hold the entire array in memory? 1.6 GB in lines of about 35 bytes (eyeballing) means about 50 million lines - times 7 numbers of 8 bytes = 2.8 GB. While you are reshaping, it's possible there are two copies of the array needed. Big enough to check that isn't an issue... – Floris Jun 12 '13 at 05:47
  • Possible duplicate: http://stackoverflow.com/q/9440592/931379 – Pursuit Jun 12 '13 at 06:02

3 Answers


Consider using a binary file format. Binary files are much smaller and don't have to be parsed from text into MATLAB's internal representation, so they are much faster to read and write. They may also preserve more precision than a text format.

http://www.mathworks.com.au/help/matlab/ref/fread.html
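
As a rough illustration (a minimal sketch, not tested against the real file: the format string is taken from the question, and the file names are placeholders), both the one-time conversion and the fast re-read could be done in MATLAB itself:

% One-time conversion: parse the text file once, then dump all
% columns as a flat binary stream of doubles (7 values per record).
fid = fopen('EDU13.csv', 'r');
c = textscan(fid, '%d,%d:%d.%d,%f,%d,%c');
fclose(fid);
data = [double(c{1}), double(c{2}), double(c{3}), double(c{4}), ...
        c{5}, double(c{6}), double(c{7})];  % %c column stored as char code
fo = fopen('EDU13.bin', 'wb');
fwrite(fo, data.', 'double');   % transpose so each record is contiguous
fclose(fo);

% Subsequent fast reads:
fid = fopen('EDU13.bin', 'rb');
d = fread(fid, [7, Inf], 'double').';   % N-by-7 matrix
fclose(fid);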

axon
  • +1 for a very sensible suggestion especially when you "need to do this a lot", even though it wasn't what the OP asked for... Not having to convert ASCII characters will save a lot of time - with binary files you are typically I/O speed limited. – Floris Jun 12 '13 at 05:49
  • Thanks axon! How should I convert my CSV files to binary? – siegel Jun 12 '13 at 15:47

The recommended function for this is textscan (http://www.mathworks.com/help/matlab/ref/textscan.html).

Your code would look like this:

fid = fopen('C:\Program Files\MATLAB\R2013a\EDU13.csv','r');
c = textscan(fid, '%d,%d:%d.%d,%f,%d,%c');
fclose(fid);

You end up with a cell array... whether it's worth converting that to another shape really depends on how you want to access the data afterwards.
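
For instance, if you want a plain numeric matrix afterwards, something like this would work (a hypothetical sketch; the column order simply follows the format string above):

% c is a 1-by-7 cell array, one cell per conversion specifier.
% Concatenate the numeric columns into an N-by-6 double matrix;
% the final %c column stays separate as characters.
M = [double(c{1}), double(c{2}), double(c{3}), double(c{4}), ...
     c{5}, double(c{6})];
flags = c{7};   % char column, e.g. 'E'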

It is quite likely that this would be faster if you include a loop that lets you work with a smaller, fixed amount of memory for most of the operation. One problem with reading large files is that you don't know ahead of time how big they will be, which very likely means MATLAB has to guess the amount of memory it needs and repeatedly reallocate and copy as its guess is exceeded. That is a very slow operation: if it happens every 1 MB, say, then it copies 1 MB once, then 2 MB, then 3 MB, and so on, so the total work is quadratic in the size of the array.

If instead you allocate a fixed amount of memory for the final result, and process in smaller batches, you avoid all that overhead. I'm pretty sure it will be much faster - but you would have to experiment a bit with the block size. That would look something like this:

block = 1000;   % lines per textscan call - experiment with this
Nlines = 35E6;  % estimated total number of lines, used for preallocation
fid = fopen('C:\Program Files\MATLAB\R2013a\EDU13.csv','r');
% Preallocate storage for the final result all at once, e.g. as a
% structure array (the field names are placeholders - pick your own):
c = struct('date', cell(1, Nlines), 'time', cell(1, Nlines), ...
           'effort', cell(1, Nlines), 'flag', cell(1, Nlines));
c_offset = 0;
while ~feof(fid)
    temp = textscan(fid, '%d,%d:%d.%d,%f,%d,%c', block);
    bt = size(temp{1}, 1); % rows actually read - equals block except on the last pass
    %... extract, process, store in c(c_offset + (1:bt)) ...
    c_offset = c_offset + bt;
end
fclose(fid);
Floris
  • Thanks for the response! I think you are correct about reallocation taking up much of the time. I'm trying to figure out how to run your code; can you tell me what the struct line is doing? What should the field names be? – siegel Jun 12 '13 at 15:46
  • The idea of that line is to create the storage space all at once. I don't know what you want the format of the data in memory to be, so I suggested a "general" structure initialization. The following line is syntactically correct, and shows the idea: `c = struct('date', cell(1,Nlines), 'time', cell(1,Nlines), 'effort', cell(1,Nlines), 'flag', cell(1,Nlines));` - it creates an `Nlines` element structure array and you can access elements with `c(4).date` etc. But see my other answer for more speed... – Floris Jun 12 '13 at 16:52

Inspired by @Axon's answer, I implemented a "fast" C program to convert the file to binary, then read it in using Matlab's fread function. Spoiler alert: reading is then 20x faster... although the initial conversion takes a little bit of time.

To make the job in MATLAB easier, and the file size smaller, I am converting each of the number fields into an int16 (short integer). For the first field, which looks like a yyyymmdd date, that involves splitting it into two smaller numbers; similarly, the decimal numbers are converted to two short integers (given the apparent range, I think that is valid). All of this reflects the maxim that "to really optimize, you must really know your problem" - so if those assumptions are invalid, the results will be too.

Here is the C code:

#include <stdio.h>
int main(){
  FILE *fp, *fo;
  long int ld1;
  int d2, d3, d4, d5, d6, d7;
  short int buf[9];
  char c8;
  short int year, monthday;
  fp = fopen("bigdata.txt", "r");
  fo = fopen("bigdata.bin", "wb");
  if (fp == NULL || fo == NULL) {
    printf("unable to open file\n");
    return 1;
  }
  /* loop on the return value of fscanf rather than feof(), so the
     last line is not processed twice */
  while (fscanf(fp, "%ld %d:%d.%d %d.%d %d %c\n",
                &ld1, &d2, &d3, &d4, &d5, &d6, &d7, &c8) == 8) {
    year = ld1 / 10000;
    monthday = ld1 - 10000 * year;
    // move everything into buffer for single call to fwrite:
    buf[0] = year;
    buf[1] = monthday;
    buf[2] = d2;
    buf[3] = d3;
    buf[4] = d4;
    buf[5] = d5;
    buf[6] = d6;
    buf[7] = d7;
    buf[8] = c8;
    fwrite(buf, sizeof(short int), 9, fo);
  }
  fclose(fp);
  fclose(fo);
  return 0;
}

The resulting file is about half the size of the original - which is encouraging and will speed up access. Note that it would be a good idea if the output file could be written to a different disk than the input file - it really helps keep data streaming without a lot of time wasted in seek operations.

Benchmark: using a file of 2 M lines as input, this ran in about 2 seconds (same disk). The resulting binary file is read in Matlab with the following:

tic
fid = fopen('bigdata.bin');
d = fread(fid, 'int16');
fclose(fid);
d = reshape(d, 9, []);
toc

Of course, if you now want to recover the numbers as floating point values, you will have to do a little bit of work; but I think it's worth it. One possible problem you will have to solve is the situation where the value after the decimal point has a varying number of digits: converting (a,b) into a float isn't as simple as "a + b/100" when b > 100... "exercise for the student"?
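
Here is one possible starting point for that exercise (a hedged sketch, assuming the row layout written by the C program above; note it does not handle fractional parts with leading zeros such as .05, which is exactly the kind of complication hinted at):

% d is the 9-by-N matrix from the fread/reshape step above (values
% read as int16); per the C buffer layout, row 6 is the integer part
% of the price and row 7 holds its fractional digits.
a = double(d(6, :));
b = double(d(7, :));
ndig = floor(log10(max(b, 1))) + 1;   % number of digits in b (b=0 -> 1)
price = a + b ./ 10.^ndig;            % 99+18/100 -> 99.18, 99+195/1000 -> 99.195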

A little benchmarking: the above code took about 0.4 seconds. By comparison, my first suggestion with textscan took about 9 seconds on the same file, and your original code took a little over 11 seconds. The difference may get bigger as the file gets bigger.

If you do this a lot (as you said), it clearly is worth converting your files once to binary format, and using them that way. Especially if the file needs to be converted only once, and read many times, the savings will be considerable.

Update

I repeated the benchmark with a 13M line file. The conversion took 13 seconds and the binary read under 3 seconds; by contrast, each of the other two methods took over a minute (textscan: 61 s; fscanf: 77 s). Things seem to scale linearly (file size: 470 MB text, 240 MB binary).

Floris
  • Using this code: `tic; format long g; fid = fopen('C:\Program Files\MATLAB\R2013a\TUM12.txt','r'); Nrows = numel(textread('TUM12.txt','%1c%*[^\n]')); [c] = fscanf(fid, '%d,%d:%d:%d.%d,%f,%d,%c',Nrows); toc;` on a file 10.5 million lines long (400,000 KB) takes 25 seconds. – siegel Jun 12 '13 at 20:04
  • Is that considerably faster than before? Or is that what you call "much too slow"? It's about 2x faster than I measured (I got ~210k lines per second vs your 420k) - but there will be hardware differences (I was running on a single core - CPU = 100%). I would be curious to see if the binary solution helps you - it ought to make a difference especially if you read the same file many times. – Floris Jun 12 '13 at 20:38
  • I really appreciate your C solution! Do you know if it's possible to convert to a binary file using matlab? This only has to be done once so I don't mind if the conversion to binary takes a little longer. I'll store the files in binary and read them many times. `fid = fopen('myFile.txt', 'r'); F = fread(fid, 'char=>uint32')'` seems to do the trick but I'm not sure how to convert back to floating point again. – siegel Jun 12 '13 at 21:29
  • Yes, that was 1/3 the time as before. – siegel Jun 12 '13 at 21:30
  • Matlab to binary would use `fwrite(fid, data)`. But if you have all the data read into a variable in Matlab, you can do `save(fileName, 'variableName');` (note - the variable name is given as a STRING - this is important). Then you can load the variable back with `load(fileName);`. The variable will re-appear, with the same name it had when you saved it. That may just be the easiest approach for you... – Floris Jun 13 '13 at 00:22