I have a question about MATLAB's textscan. I have a file with a very large number of columns and anywhere from 1 to 32 rows (very few compared with the number of columns). Here is a simple example:
Test1 1 2 3 4 5
Test2 6 7 8 9 10
The number of columns isn't known ahead of time, and neither is the length of the single string that begins each line. I read in the first line, count the number of columns, and build the format string used to read in the rest of the file, as shown below:
function output = parseFile(filename)
%% Calculate Number of Samples
% Reads in first line
tic
fid = fopen(filename);
line = fgetl(fid);
fclose(fid);
firstLine = textscan(line, '%s', 'CollectOutput', true);
numSamples = size(firstLine{1},1) - 1;
toc
%% Parse File
tic
fid = fopen(filename);
format = ['%s ' repmat('%f', [1 numSamples]) '%*[^\n\r]'];
fileData = textscan(fid, format, 'CollectOutput', true);
fclose(fid);
toc
%% Format Output
output.names = char(fileData{1}); % char pads names of unequal length; cell2mat would error on them
output.values = fileData{2};
end % end function
I've tried a couple of examples, and each time I see the following. Say I have a file with 100,000 columns and 3 lines. The tic/toc timings tell me that reading the first line finishes in 0.16 seconds. When I then build the format string and read the entire file, it takes 9 seconds. Why is the first line read so quickly when parsed as %s, while reading the entire file (only 2 more lines) takes dramatically longer? Is it because of the more complicated format string I use to parse the file the second time around? Would it make more sense to parse the entire file as a space-separated string and then post-process it (e.g. with str2double) to get my matrix of doubles?
EDIT: Clarification on specifics of file format:
(string of unknown length)(1 space)(-123.001)(1 space)(41.341)(1 space)...
...
So the numbers are not int, and they can be positive or negative.
NOTE: What I'm essentially confused about is why textscan was able to read the first line of the file very quickly, while the next two lines took much longer than the first.