38

My MATLAB program is reading a file about 7 million lines long and wasting far too much time on I/O. I know that each line is formatted as two integers, but I don't know exactly how many characters they take up. str2num is deathly slow; what MATLAB function should I be using instead?

Catch: I have to operate on each line one at a time without storing the whole file in memory, so none of the commands that read entire matrices are on the table.

fid = fopen('file.txt');
tline = fgetl(fid);
while ischar(tline)
    nums = str2num(tline);    
    %do stuff with nums
    tline = fgetl(fid);
end
fclose(fid);
user714403
  • 572
  • 2
  • 7
  • 15
  • 3
    How do you know that it's the I/O that's the bottleneck? I'm inclined to think that the bottleneck is more likely the operation you're doing on the numbers. If you could vectorize that operation by processing the data in chunks, you may see better performance. – gnovice Feb 25 '12 at 03:29
  • Currently using sscanf(tline, '%d %d', 2) and it's working quite a bit faster, but this still isn't great. – user714403 Feb 25 '12 at 03:30
  • 1
    @gnovice, because when I simply read the file (i.e. leave %do stuff commented out) it takes almost the same amount of time. – user714403 Feb 25 '12 at 03:31
  • you might consider investing in a SSD if you are on a HDD – zamazalotta Feb 27 '12 at 14:25
  • 3
    Instead of sscanf, try using `fscanf(fid, '%d %d', 100000)` to read a big chunk and then looping over the numbers in that chunk. And use `profile on -timer real` to confirm where you're spending your time. – Andrew Janke Feb 27 '12 at 16:48
  • some related links: [High Performance File I/O](http://blogs.mathworks.com/loren/2006/04/19/high-performance-file-io/), [Handling Large Data Sets Efficiently in MATLAB](http://www.mathworks.com/matlabcentral/fileexchange/9060) – Amro Aug 28 '12 at 16:26

4 Answers

63

Problem statement

This is a common struggle, and there is nothing like a test to answer the question. Here are my assumptions:

  1. A well-formatted ASCII file containing two columns of numbers; no headers, no inconsistent lines, etc.

  2. The method must scale to reading files that are too large to fit in memory (although my patience is limited, so my test file is only 500,000 lines).

  3. The actual operation (what the OP calls "do stuff with nums") must be performed one row at a time and cannot be vectorized.

Discussion

With that in mind, the answers and comments seem to be encouraging efficiency in three areas:

  • reading the file in larger batches
  • performing the string to number conversion more efficiently (either via batching, or using better functions)
  • making the actual processing more efficient (which I have ruled out via rule 3, above).

Results

I put together a quick script to test the ingestion speed (and consistency of results) of eight variations on these themes. The results are:

  • Initial code. 68.23 sec. 582582 check
  • Using sscanf, once per line. 27.20 sec. 582582 check
  • Using fscanf in large batches. 8.93 sec. 582582 check
  • Using textscan in large batches. 8.79 sec. 582582 check
  • Reading large batches into memory, then sscanf. 8.15 sec. 582582 check
  • Using java single line file reader and sscanf on single lines. 63.56 sec. 582582 check
  • Using java single item token scanner. 81.19 sec. 582582 check
  • Fully batched operations (non-compliant). 1.02 sec. 508680 check (violates rule 3)

Summary

More than half of the original time (68 -> 27 sec) was consumed by inefficiencies in the str2num call, which can be removed by switching to sscanf.

Another roughly 2/3 of the remaining time (27 -> 8 sec) can be eliminated by using larger batches for both file reading and string-to-number conversion.

If we are willing to violate rule number three in the original post, another 7/8 of the time can be eliminated by switching to fully batched, numeric processing. However, some algorithms do not lend themselves to this, so we leave it alone. (Note that the "check" value does not match for the last entry.)

Finally, in direct contradiction to a previous edit of mine within this response, no savings are available from switching to the buffered Java single-line readers. In fact, that solution is 2 to 3 times slower than the comparable single-line result using the native readers (63 vs. 27 seconds).

Sample code for all of the solutions described above is included below.


Sample code

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%% Create a test file
cd(tempdir);
fName = 'demo_file.txt';
fid = fopen(fName,'w');
for ixLoop = 1:5
    d = randi(1e6, 1e5,2);
    fprintf(fid, '%d, %d \n',d);
end
fclose(fid);


%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%% Initial code
CHECK = 0;
tic;
fid = fopen('demo_file.txt');
tline = fgetl(fid);
while ischar(tline)
    nums = str2num(tline);
    CHECK = round((CHECK + mean(nums) ) /2);
    tline = fgetl(fid);
end
fclose(fid);
t = toc;
fprintf(1,'Initial code.  %3.2f sec.  %d check \n', t, CHECK);


%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%% Using sscanf, once per line
CHECK = 0;
tic;
fid = fopen('demo_file.txt');
tline = fgetl(fid);
while ischar(tline)
    nums = sscanf(tline,'%d, %d');
    CHECK = round((CHECK + mean(nums) ) /2);
    tline = fgetl(fid);
end
fclose(fid);
t = toc;
fprintf(1,'Using sscanf, once per line.  %3.2f sec.  %d check \n', t, CHECK);


%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%% Using fscanf in large batches
CHECK = 0;
tic;
bufferSize = 1e4;
fid = fopen('demo_file.txt');
scannedData = reshape(fscanf(fid, '%d, %d', bufferSize),2,[])' ;
while ~isempty(scannedData)
    for ix = 1:size(scannedData,1)
        nums = scannedData(ix,:);
        CHECK = round((CHECK + mean(nums) ) /2);
    end
    scannedData = reshape(fscanf(fid, '%d, %d', bufferSize),2,[])' ;
end
fclose(fid);
t = toc;
fprintf(1,'Using fscanf in large batches.  %3.2f sec.  %d check \n', t, CHECK);


%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%% Using textscan in large batches
CHECK = 0;
tic;
bufferSize = 1e4;
fid = fopen('demo_file.txt');
scannedData = textscan(fid, '%d, %d \n', bufferSize) ;
while ~isempty(scannedData{1})
    for ix = 1:size(scannedData{1},1)
        nums = [scannedData{1}(ix) scannedData{2}(ix)];
        CHECK = round((CHECK + mean(nums) ) /2);
    end
    scannedData = textscan(fid, '%d, %d \n', bufferSize) ;
end
fclose(fid);
t = toc;
fprintf(1,'Using textscan in large batches.  %3.2f sec.  %d check \n', t, CHECK);



%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%% Reading in large batches into memory, incrementing to end-of-line, sscanf
CHECK = 0;
tic;
fid = fopen('demo_file.txt');
bufferSize = 1e4;
eol = sprintf('\n');

dataBatch = fread(fid,bufferSize,'uint8=>char')';
dataIncrement = fread(fid,1,'uint8=>char');
while ~isempty(dataIncrement) && (dataIncrement(end) ~= eol) && ~feof(fid)
    dataIncrement(end+1) = fread(fid,1,'uint8=>char');  %This can be slightly optimized
end
data = [dataBatch dataIncrement];

while ~isempty(data)
    scannedData = reshape(sscanf(data,'%d, %d'),2,[])';
    for ix = 1:size(scannedData,1)
        nums = scannedData(ix,:);
        CHECK = round((CHECK + mean(nums) ) /2);
    end

    dataBatch = fread(fid,bufferSize,'uint8=>char')';
    dataIncrement = fread(fid,1,'uint8=>char');
    while ~isempty(dataIncrement) && (dataIncrement(end) ~= eol) && ~feof(fid)
        dataIncrement(end+1) = fread(fid,1,'uint8=>char');%This can be slightly optimized
    end
    data = [dataBatch dataIncrement];
end
fclose(fid);
t = toc;
fprintf(1,'Reading large batches into memory, then sscanf.  %3.2f sec.  %d check \n', t, CHECK);


%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%% Using Java single line readers + sscanf
CHECK = 0;
tic;
bufferSize = 1e4;
reader =  java.io.LineNumberReader(java.io.FileReader('demo_file.txt'),bufferSize );
tline = char(reader.readLine());
while ~isempty(tline)
    nums = sscanf(tline,'%d, %d');
    CHECK = round((CHECK + mean(nums) ) /2);
    tline = char(reader.readLine());
end
reader.close();
t = toc;
fprintf(1,'Using java single line file reader and sscanf on single lines.  %3.2f sec.  %d check \n', t, CHECK);

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%% Using Java scanner for file reading and string conversion
CHECK = 0;
tic;
jFile = java.io.File('demo_file.txt');
scanner = java.util.Scanner(jFile);
scanner.useDelimiter('[\s\,\n\r]+');
while scanner.hasNextInt()
    nums = [scanner.nextInt() scanner.nextInt()];
    CHECK = round((CHECK + mean(nums) ) /2);
end
scanner.close();
t = toc;
fprintf(1,'Using java single item token scanner.  %3.2f sec.  %d check \n', t, CHECK);


%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%% Reading in large batches into memory, vectorized operations (non-compliant solution)
CHECK = 0;
tic;
fid = fopen('demo_file.txt');
bufferSize = 1e4;
eol = sprintf('\n');

dataBatch = fread(fid,bufferSize,'uint8=>char')';
dataIncrement = fread(fid,1,'uint8=>char');
while ~isempty(dataIncrement) && (dataIncrement(end) ~= eol) && ~feof(fid)
    dataIncrement(end+1) = fread(fid,1,'uint8=>char');  %This can be slightly optimized
end
data = [dataBatch dataIncrement];

while ~isempty(data)
    scannedData = reshape(sscanf(data,'%d, %d'),2,[])';
    CHECK = round((CHECK + mean(scannedData(:)) ) /2);

    dataBatch = fread(fid,bufferSize,'uint8=>char')';
    dataIncrement = fread(fid,1,'uint8=>char');
    while ~isempty(dataIncrement) && (dataIncrement(end) ~= eol) && ~feof(fid)
        dataIncrement(end+1) = fread(fid,1,'uint8=>char');%This can be slightly optimized
    end
    data = [dataBatch dataIncrement];
end
fclose(fid);
t = toc;
fprintf(1,'Fully batched operations.  %3.2f sec.  %d check \n', t, CHECK);

(original answer)

To expand on the point made by Ben ... your bottleneck will always be file I/O if you are reading these files line by line.

I understand that sometimes you cannot fit a whole file into memory. I typically read in a large batch of characters (1e5, 1e6 or thereabouts, depending on the memory of your system), then read additional single characters (or back off single characters) to get a round number of lines, and then run the string parsing (e.g. sscanf). This is the "Reading large batches into memory, then sscanf" variant in the sample code above.

Then if you want you can process the resulting large matrix one row at a time, before repeating the process until you read the end of the file.

It's a little bit tedious, but not that hard. I typically see a 90%-plus improvement in speed over single-line readers.


(terrible idea using Java batched line readers removed in shame)

Pedro77
  • 5,176
  • 7
  • 61
  • 91
Pursuit
  • 12,285
  • 1
  • 25
  • 41
  • 2
    Did you test this Java thing? Matlab's fopen I/O is already buffered, just like C's stdio; switching to calling Java classes just adds overhead. It's 4x slower than OP's original fgetl for me. Overhead is probably not disk I/O per se, but the overhead of operations in the loop operating on small chunks of data. – Andrew Janke Feb 27 '12 at 16:39
  • I tested it for basic functionality, but not speed. You are right, this is a terrible idea. Major edit coming. – Pursuit Feb 27 '12 at 19:47
  • Are you sure your fully batched code is really calculating the `CHECK` value? It refers to the variable `nums`, which is never set by that code cell. – Max Feb 27 '12 at 22:49
  • @Max. True. Fixed. The `CHECK` value still does not match which is intentional. But it should at least represent actually doing something with all the data which is read. The tic/toc times did not change enough to warrant a change to the summary, (actually slightly faster, probably due to other activities on my computer.) – Pursuit Feb 27 '12 at 23:02
  • I'm trying to use your method, but the data format needs to be known in advance ('%d, %d'). I want to read a text file containing NxK values, with columns separated by spaces and lines by \n. Can you add a "generic" "FastFileRead" function to your answer? – Pedro77 Mar 30 '17 at 14:20
  • @Pursuit, doesn't the Textscan test read only 10,000 lines? Is fopen necessary? – gciriani Aug 19 '17 at 14:42
  • That example reads 10000 lines for each iteration through the WHILE loop. It continues in the while loop until the whole file is read. TEXTSCAN needs a file pointer, so FOPEN is needed. – Pursuit Aug 19 '17 at 15:57
  • @Pursuit, I was trying to run your code (thank you for sharing it), in my PC, but I get an error when it runs the line of code nums = [scannedData{1}(ix) scannedData{2}(ix)]; the error is: scannedData(1): out of bound 0; so I was checking the syntax of textscan. Any idea why I would get that error? – gciriani Aug 20 '17 at 02:24
4

I have had good results (speedwise) using memmapfile(). This minimises the amount of in-memory data copying, and makes use of the kernel's I/O buffering. You need enough free address space (though not actual free memory) to map the entire file, and enough free memory to hold the output variable (obviously!)

The example code below reads a text file into a two-column int32 matrix named data.

fname = 'file.txt';
fstats = dir(fname);
% Map the file as one long character string
m = memmapfile(fname, 'Format', {'uint8' [ 1 fstats.bytes] 'asUint8'});
textdata = char(m.Data(1).asUint8);
% Use textscan() to parse the string and convert to an int32 matrix
data = textscan(textdata, '%d %d', 'CollectOutput', 1);
data = data{:};
% Tidy up!
clear('m')

You may need to fiddle with the parameters to textscan() to get exactly what you want - see the online docs.

Max
  • 2,121
  • 3
  • 16
  • 20
  • I don't think `memmapfile` gives an advantage when it's just being used to slurp the whole file sequentially like this. You could just do the same `textscan()` call directly on the file and get the same result using less memory. Memmapfile is more for scattered (nonsequential) access on large files. – Andrew Janke Feb 27 '12 at 19:51
  • 2
    The advantage of memmapfile here is that it saves the overhead of a memory copy of the file data from kernel address space to user address space - the kernel simply allocates pages in user space backed directly by the disk blocks comprising the file. However - as always, don't guess, benchmark! – Max Feb 27 '12 at 22:05
  • But don't you just end up doing a copy anyway when you call `char(m.Data(1).asUint8)`? – Andrew Janke Feb 27 '12 at 22:23
  • 2
    That is the line that actually 'reads' the data into memory, although the data is faulted in by the virtual memory system when an access is made to the page. In buffered IO, the kernel reads from disc into a buffer in kernel space, then copies the data to user space. When memory mapping, the data goes straight to user space. – Max Feb 27 '12 at 22:54
3

Even if you can't fit the whole file in memory, you should read a large batch using the matrix read functions.

Maybe you can even use vector operations for some of the data processing, which would speed things along further.
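
For concreteness, here is a minimal sketch of what that could look like (not from the original answer); it assumes the two integers are whitespace-separated, as in the question, and uses a placeholder file name and chunk size:

fid = fopen('file.txt');
% Read up to 1e5 integers (5e4 rows) per call; fscanf returns a column
% vector, so reshape it into an N-by-2 matrix (assumes complete pairs).
chunk = reshape(fscanf(fid, '%d %d', 1e5), 2, [])';
while ~isempty(chunk)
    for ix = 1:size(chunk,1)
        nums = chunk(ix,:);
        % do stuff with nums, one row at a time
    end
    chunk = reshape(fscanf(fid, '%d %d', 1e5), 2, [])';
end
fclose(fid);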

Ben Voigt
  • 277,958
  • 43
  • 419
  • 720
  • +1 Using `fscanf` to read it in chunks like this would be close to a drop-in replacement in the original code and a lot faster than repeated num2str or sscanf calls. – Andrew Janke Feb 27 '12 at 16:43
1

I have found that MATLAB reads csv files significantly faster than text files, so if it's possible to convert your text file to csv using some other software, it may significantly speed up Matlab's operations.
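
As a minimal sketch of that workflow, assuming the converted file is named 'file.csv' and, unlike the file in the question, is small enough to load in one call:

% csvread loads the entire csv file as a numeric matrix in one call.
data = csvread('file.csv');   % N-by-2 matrix of doubles
for ix = 1:size(data,1)
    nums = data(ix,:);
    % do stuff with nums, one row at a time
end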

prototoast
  • 598
  • 1
  • 4
  • 9