MATLAB's Textscan Format Complexity vs post-processing

Question

I have a question about MATLAB's textscan. I have a file with a very large number of columns and anywhere from 1 to 32 rows (very few compared to the number of columns). Here is a simple example:

Test1 1 2 3 4 5
Test2 6 7 8 9 10

The number of columns isn't known ahead of time, and neither is the length of the single string that begins each line. I read the first line, count the number of columns, and build the format string used to read the rest of the file, as shown below:

function output = parseFile(filename) 

%% Calculate Number of Samples
% Reads in first line
tic
fid = fopen(filename);
line = fgetl(fid);
fclose(fid);
firstLine = textscan(line, '%s', 'CollectOutput', true);
numSamples = size(firstLine{1},1) - 1;
toc

%% Parse File
tic
fid = fopen(filename);
format = ['%s ' repmat('%f', [1 numSamples]) '%*[^\n\r]'];
fileData = textscan(fid, format, 'CollectOutput', true);
fclose(fid);
toc

%% Format Output
% Note: cell2mat only works here if all names have the same length
output.names = cell2mat(fileData{1});
output.values = fileData{2};


end     % end function

I've tried a couple of examples, and each time I see the same pattern. Say I have a file with 100,000 columns and 3 lines. The tic/toc output tells me that reading the first line finishes in 0.16 seconds. When I then build the format string and read the entire file, it takes 9 seconds. Why is the first line read so quickly with %s, while reading the entire file (only 2 more lines) takes dramatically longer? Is it because of the more complicated format string I use the second time around? Would it make more sense to parse the entire file as space-separated strings and then post-process (e.g., with str2double) to get my matrix of doubles?

EDIT: Clarification on specifics of file format:

(string of unknown length)(1 space)(-123.001)(1 space)(41.341)(1 space)...
...

So the numbers are not integers, and they can be positive or negative.

NOTE: What I'm essentially confused about is why textscan was able to read the first line of the file very quickly, while the next two lines took much longer than the first.

Maybe. I've found textscan to be slow sometimes too. str2double takes time too though. You can run the profiler (not tic/toc) and see what is taking so long inside textscan. It may very well be the str2double function called within textscan. — Frederick, Feb 27 '14 at 06:51
Tuning string parsing in MATLAB can be a sport. If your input is as in your example, my bet is that the fastest way is to chop off the initial string, and then use sscanf(s, '%i') on the rest of the line. Parsing ints is much faster than parsing floats. — Wolfgang Kuehn, Feb 27 '14 at 08:57
Some questions: Does the range of numbers include negative values? Do you always have one space between numbers? — tashuhka, Feb 27 '14 at 09:42
Clarified above with the format of the file, plus a follow-up question. — Diego, Feb 27 '14 at 16:10

score 1 · Answer 1 · answered Feb 27 '14 at 10:23

A common trick for converting a digit character into a number is to subtract double('0') (that is, 48) from it. This is also much faster than str2double. For example, running this code:

% Using -double('0')
tic, for i=1:1e5;  aux='9'-48;      end, toc
% Using str2double()
tic, for i=1:1e5;  str2double('9'); end, toc
% Using str2num()
tic, for i=1:1e5;  str2num('9');    end, toc

I get:

Elapsed time is 0.000480 seconds.
Elapsed time is 2.445741 seconds.
Elapsed time is 2.524999 seconds.

Hence, you can construct a function that parses each line (this function could probably be optimized further):

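function num = parseText(str)
    % Split on single spaces; the first token is the line's name
    strCell = strsplit(str,' ');
    % Turn each remaining token into a vector of digit values ('0' is char code 48)
    strNum = cellfun(@(s) s-48, strCell(2:end),'UniformOutput', false);
    nNum = numel(strNum);  num = zeros(1,nNum);
    for idxNum=1:nNum
        % Weight the digits by powers of 10 (valid for non-negative integers only)
        num(idxNum) = strNum{idxNum}*10.^(length(strNum{idxNum})-1:-1:0).';
    end
end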

If you try for one line, the result is fine:

str = 'Test1 0 1 2 3 4 5 10';
num = parseText(str);
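
To make the arithmetic inside parseText concrete, here is a small hand trace for a single token. It is only an illustration: the variable names (token, digits, weights, value, bad) are made up for this sketch and are not part of the function above. The last line also shows what the same subtraction does to a signed decimal token.

```matlab
% Trace of the digit arithmetic for one token (values chosen for illustration)
token   = '10';
digits  = token - '0';                    % char codes minus 48 -> [1 0]
weights = 10.^(numel(digits)-1:-1:0);     % powers of 10 -> [10 1]
value   = digits * weights.';             % 1*10 + 0*1 -> 10

% The same subtraction on a signed decimal does not yield digits:
% '-' is char code 45 and '.' is 46, so they map to -3 and -2.
bad = '-3.5' - '0';                       % [-3 3 -2 5]
```

So the trick holds as long as every token consists only of the characters '0' through '9'.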

If you try for several lines, it also seems fine:

% Create text
L = 10;   str = cell(L,1);
for idx1=1:L, 
    strAux = []; for idx2=1:randi(10),  strAux = [strAux,' ',num2str(randi(10))];  end
    str{idx1} = ['Test',num2str(idx1),strAux];
end

% Parse text
num = cell(L,1);
for idx=1:L, 
    num{idx} = parseText(str{idx});
end

Thanks for your answer! That would work great if I had `int`s, but in general the file is composed of numbers that look like this: `-321.304 123.032` — Diego, Feb 27 '14 at 16:13
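
For signed decimals like these, one option in the spirit of the sscanf suggestion from the question's comments is to cut off the leading name and hand the rest of the line to sscanf with '%f' instead of '%i'. The following is a minimal sketch, not a benchmarked solution: the sample line and the variable names (sampleLine, firstSpace, name, values) are hypothetical, and it assumes exactly one space after the name, as described in the question.

```matlab
% Sketch only: assumes a single space separates the name from the numbers
sampleLine = 'Test1 -123.001 41.341 7';       % hypothetical sample line
firstSpace = find(sampleLine == ' ', 1);      % index of the first space
name   = sampleLine(1:firstSpace-1);          % leading string of unknown length
rest   = sampleLine(firstSpace+1:end);        % everything after the name
values = sscanf(rest, '%f').';                % sscanf returns a column; transpose to a row
```

Whether this is faster than the single textscan call on a file with 100,000 columns would have to be measured, for example with the profiler as suggested in the comments.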
