Inspired by @Axon's answer, I implemented a "fast" C program to convert the file to binary, then read it in using Matlab's fread
function. Spoiler alert: reading is then 20x faster... although the initial conversion takes a little bit of time.
To make the job in Matlab easier, and the file size smaller, I am converting each of the number fields into an int16
(short integer). For the first field - which looks like a yyyymmdd field - that involves splitting into two smaller numbers; similarly the decimal numbers are converted to two short integers (given the apparent range I think that is valid). All this is recognizing that "to really optimize, you must really know your problem" - so if assumptions are invalid, the results will be too.
Here is the C code:
#include <stdio.h>
int main(){
FILE *fp, *fo;
long int ld1;
int d2, d3, d4, d5, d6, d7;
short int buf[9];
char c8;
int n;
short int year, monthday;
fp = fopen("bigdata.txt", "r");
fo = fopen("bigdata.bin", "wb");
if (fp == NULL || fo == NULL) {
printf("unable to open file\n");
return 1;
}
while(!feof(fp)) {
n = fscanf(fp, "%ld %d:%d.%d %d.%d %d %c\n", \
&ld1, &d2, &d3, &d4, &d5, &d6, &d7, &c8);
year = d1 / 10000;
monthday = d1 - 10000 * year;
// move everything into buffer for single call to fwrite:
buf[0] = year;
buf[1] = monthday;
buf[2] = d2;
buf[3] = d3;
buf[4] = d4;
buf[5] = d5;
buf[6] = d6;
buf[7] = d7;
buf[8] = c8;
fwrite(buf, sizeof(short int), 9, fo);
}
fclose(fp);
fclose(fo);
return 0;
}
The resulting file is about half the size of the original - which is encouraging and will speed up access. Note that it would be a good idea if the output file could be written to a different disk than the input file - it really helps keep data streaming without a lot of time wasted in seek operations.
Benchmark: using a file of 2 M lines as input, this ran in about 2 seconds (same disk). The resulting binary file is read in Matlab with the following:
tic
fid = fopen('bigdata.bin');
d = fread(fid, 'int16');
d = reshape(d, 9, []);
toc
Of course, now if you want to recover the numbers as floating point numbers, you will have to do a little bit of work; but I think it's worth it. One possible problem you will have to solve is the situation where the value after the decimal point has a different number of digits: converting (a,b) into float isn't as simple as "a + b/100" when b > 100... "exercise for the student"?
A little benchmarking: The above code took about 0.4 seconds. By comparison, my first suggestion with textread
took about 9 seconds on the same file; and your original code took a little over 11 seconds. The difference may get bigger when the file gets bigger.
If you do this a lot (as you said), it clearly is worth converting your files once to binary format, and using them that way. Especially if the file needs to be converted only once, and read many times, the savings will be considerable.
update
I repeated the benchmark with a 13M line file. The conversion took 13 seconds, the binary read < 3 seconds. By contrast each of the other two methods took over a minute (textscan: 61s; fscanf: 77s). It seems that things are scaling linearly (file size 470M text, 240M binary)