I need to read a big CSV file in MATLAB which contains rows like this:

1, 0, 1, 0, 1
1, 0, 1, 0, 1, 0, 1, 0, 1
1, 0, 1
1, 0, 1
1, 0, 1, 0, 1
0, 1, 0, 1, 0, 1, 0, 1, 0

To read big files I'm using textscan; however, I have to define the number of expected fields in each line of the text file.

Using csvread helps, but it is too slow and seems inefficient. Is there any way to use textscan with an unknown number of values in each line? Or do you have any other suggestion for this situation?

VSB
  • For your example, what do you expect the created MATLAB variable to look like? Will it be a numeric matrix (or perhaps logical matrix) with "short" rows padded with zeros? Or a cell array with each row being in a separate element of the cell array? – Phil Goddard May 12 '19 at 18:36
  • @PhilGoddard Numerical matrix padded with zeros would be good. Cell array with one column which each cell contains a row would be good too. – VSB May 13 '19 at 05:23
  • Have you tried iterating yourself line by line, counting the commas (+ 1), so that you have the needed number of expected parameters - before the `textscan`? Maybe, it's faster than `csvread`!? And/Or, you really have no upper boundary for the amount of values per line, not even a silly one like 1000? Cleaning your (maybe too big initialized) array afterwards might also be faster!? – HansHirse May 13 '19 at 05:36
  • Please show your slow implementation of `csvread` – Sardar Usama May 13 '19 at 05:38

1 Answer

Since you said "Numerical matrix padded with zeros would be good", there is a solution using textscan which can give you that. The catch, however, is that you have to know the maximum number of elements a line can have (i.e. the length of the longest line in your file).
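If the longest line is not known in advance, one way to find it (along the lines of what HansHirse suggests in the comments) is to count the commas on each line before building the format. The following is only a minimal sketch, assuming fields are separated by exactly one comma and no field contains a comma itself; `textLines` and `nFieldsPerLine` are illustrative names, and the sample rows from the question are held in memory so the snippet runs without the file:

```matlab
% Sample rows from the question, joined with line feeds (illustration only)
rows = {'1, 0, 1, 0, 1', '1, 0, 1, 0, 1, 0, 1, 0, 1', '1, 0, 1', ...
        '1, 0, 1', '1, 0, 1, 0, 1', '0, 1, 0, 1, 0, 1, 0, 1, 0'};
txt  = strjoin(rows, char(10));

% Split into lines (handles \n and \r\n) and drop blank lines
textLines = regexp(txt, '\r?\n', 'split');
textLines = textLines(~cellfun(@isempty, strtrim(textLines)));

% Number of fields on a line = number of commas + 1
nFieldsPerLine = cellfun(@(s) sum(s == ',') + 1, textLines);

% Length of the longest line, usable to build the textscan format
maxNumberOfElementPerLine = max(nFieldsPerLine);
```

For a real file, `txt = fileread('differentRows.txt')` could replace the in-memory sample. That loads the whole file as text, so it may not suit very large files, where a generous overestimate of the maximum might be the cheaper choice.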

Provided you know that, a combination of additional textscan parameters allows you to read incomplete lines:

If you set the parameter 'EndOfLine','\r\n', the documentation explains:

If there are missing values and an end-of-line sequence at the end of the last line in a file, then textscan returns empty values for those fields. This ensures that individual cells in output cell array, C, are the same size.

So with the example data in your question saved as differentRows.txt, the following code:

% be sure about this, better to overestimate than underestimate
maxNumberOfElementPerLine = 10 ;

% build a reading format which can accommodate the longest line
readFormat = repmat('%f',1,maxNumberOfElementPerLine) ;

fidcsv = fopen('differentRows.txt','r') ;

M = textscan( fidcsv , readFormat , Inf ,...
    'delimiter',',',...
    'EndOfLine','\r\n',...
    'CollectOutput',true) ;

fclose(fidcsv) ;
M = cell2mat(M) ; % convert to numerical matrix

will return:

>> M
M =
     1     0     1     0     1   NaN   NaN   NaN   NaN   NaN
     1     0     1     0     1     0     1     0     1   NaN
     1     0     1   NaN   NaN   NaN   NaN   NaN   NaN   NaN
     1     0     1   NaN   NaN   NaN   NaN   NaN   NaN   NaN
     1     0     1     0     1   NaN   NaN   NaN   NaN   NaN
     0     1     0     1     0     1     0     1     0   NaN
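Since the question asked for zero padding rather than NaN, the padded entries can be replaced once the import is done. A small sketch on a hand-written matrix standing in for the imported `M` (two rows copied from the output above, cut to five columns for brevity):

```matlab
% Stand-in for the imported matrix: NaN marks the padded entries
M = [1 0 1 NaN NaN ; ...
     1 0 1   0   1 ] ;

% Replace every padded entry with zero using logical indexing
M(isnan(M)) = 0 ;
```

Note that after this step a padded zero can no longer be told apart from a genuine zero in the data, so this only makes sense if the row lengths are not needed later.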

As an alternative, if it makes a significant speed difference, you could import your data as integers instead of doubles. The trouble is that NaN is not defined for integers, so you have two options:

  • 1) Leave the empty entries at the default value, 0

Just replace the line which defines the format specifier with:

% build a reading format which can accommodate the longest line
readFormat = repmat('%d',1,maxNumberOfElementPerLine) ;

This will return:

>> M
M =
1   0   1   0   1   0   0   0   0   0
1   0   1   0   1   0   1   0   1   0
1   0   1   0   0   0   0   0   0   0
1   0   1   0   0   0   0   0   0   0
1   0   1   0   1   0   0   0   0   0
0   1   0   1   0   1   0   1   0   0

  • 2) Replace the empty entries with a placeholder (e.g. 99)

Define a value which you are sure you'll never have in your original data (for quick identification of empty cells), then use the EmptyValue parameter of the textscan function:

readFormat = repmat('%d',1,maxNumberOfElementPerLine) ;
DefaultEmptyValue = 99 ; % placeholder for "empty values"

fidcsv = fopen('differentRows.txt','r') ;
M = textscan( fidcsv , readFormat , Inf ,...
    'delimiter',',',...
    'EndOfLine','\r\n',...
    'CollectOutput',true,...
    'EmptyValue',DefaultEmptyValue) ;

fclose(fidcsv) ;
M = cell2mat(M) ; % convert to numerical matrix

will yield:

>> M
M =
1   0   1   0   1   99  99  99  99  99
1   0   1   0   1   0   1   0   1   99
1   0   1   99  99  99  99  99  99  99
1   0   1   99  99  99  99  99  99  99
1   0   1   0   1   99  99  99  99  99
0   1   0   1   0   1   0   1   0   99
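
With a placeholder in place, the padded entries can be located with a simple comparison, which also makes it possible to recover the "one row per cell" layout mentioned in the comments. A rough sketch, assuming the placeholder only ever appears as trailing padding; `isPadding`, `rowLengths` and `C` are illustrative names, and the matrix is a hand-written stand-in for the first three rows of the output above, cut to six columns:

```matlab
DefaultEmptyValue = 99 ; % same placeholder as used for the import

% Stand-in for the imported int32 matrix
M = int32([1 0 1 0 1 99 ; ...
           1 0 1 0 1  0 ; ...
           1 0 1 99 99 99]) ;

% Logical mask of padded entries, and the true length of each row
isPadding  = (M == DefaultEmptyValue) ;
rowLengths = sum(~isPadding, 2) ;

% One cell per row, keeping only the real (non-padded) values
C = arrayfun(@(r) M(r, 1:rowLengths(r)), (1:size(M,1)).', ...
             'UniformOutput', false) ;
```

Here `C{3}` holds only the three real values of the third row, while `M` itself is left unchanged.
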
Hoki