I am working with a CSV file that contains information in the following format:

      col1      col2          col3
row1  id1  , text1 (year1) , a|b|c
row2  id2  , text2 (year2) , a|b|c|d|e
row3  id3  , text3 (year3) , a|b
 ...

The CSV has a very large number of rows. The year is embedded in col2, in parentheses, and col3 can hold a variable number of elements.

I would like to read the CSV file efficiently and end up, for each item (id), with an array like this:

For the item with id_i:

A = [id_i, text_i, year_i, 101010001]

where, if the set of all possible features in col3 is [a,b,c,d,...,z], the binary vector marks each feature as present (1) or absent (0) for that item.

I am interested in an efficient implementation of this in MATLAB. Ideas are more than welcome. Thank you.

YAS
    Possible duplicate of [Massive CSV file into Matlab](http://stackoverflow.com/questions/17055958/massive-csv-file-into-matlab) – kmac Nov 23 '15 at 03:17

2 Answers

I would like to add what I have found to be one of the fastest ways of reading a CSV file:

importdata()

This reads both numeric and non-numeric data, but it assumes the file starts with some number of header lines. You can pass the number of header lines as an input argument to importdata(), or let it detect them on its own, although auto-detection has not worked for my files in the past. For me it was much faster than xlsread(): it took about 1/6th of the time to read a file 6 times larger.

If you are reading only numeric data, you can use csvread(), which actually calls dlmread(). The thing is, there are about 10 ways to read these files, and the right one depends not only on your goals but also on the file contents.
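As a rough illustration of the two calls mentioned above, here is a minimal sketch. The file name and its contents are made up for the example, and it assumes a single header line followed by purely numeric, comma-separated rows.

```matlab
% Write a tiny sample file so the sketch is self-contained (hypothetical data).
fname = [tempname '.csv'];
fid = fopen(fname, 'w');
fprintf(fid, 'x,y\n1,2\n3,4\n');
fclose(fid);

% importdata(filename, delimiter, nHeaderLines): pass the header count
% explicitly instead of relying on auto-detection.
S = importdata(fname, ',', 1);
numericBlock = S.data;        % numeric part of the file
headerNames  = S.colheaders;  % names taken from the header line

% dlmread(filename, delimiter, R1, C1): numeric data only; R1 and C1 are
% zero-based offsets, so R1 = 1 skips the header line.
M = dlmread(fname, ',', 1, 0);

delete(fname);
```

Neither call is a good fit for mostly-text files like the one in the question; there, importdata() puts the non-numeric content in the textdata field and dlmread() fails on it.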

Raj

You can use T = readtable(filename). It has the option 'ReadVariableNames', which treats the first row as the variable names (header), and 'ReadRowNames', which treats the first column as the row names.
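To connect this to the question, here is a minimal sketch of going from readtable to the year and the binary feature matrix. It assumes a headerless, comma-delimited file in which every col2 entry ends with a four-digit year in parentheses; the sample rows, the years, and the variable names (ids, years, allFeat, B) are invented for illustration.

```matlab
% Write a tiny sample file in the question's layout (hypothetical data).
fname = [tempname '.csv'];
fid = fopen(fname, 'w');
fprintf(fid, 'id1,text1 (1995),a|b|c\n');
fprintf(fid, 'id2,text2 (1995),a|b|c|d|e\n');
fprintf(fid, 'id3,text3 (1996),a|b\n');
fclose(fid);

% No header row, so readtable names the columns Var1, Var2, Var3.
T = readtable(fname, 'Delimiter', ',', 'ReadVariableNames', false);
delete(fname);

ids   = strtrim(T.Var1);
texts = strtrim(T.Var2);
feats = strtrim(T.Var3);

% Pull the year out of the trailing "(yyyy)" in col2.
% Assumption: every row matches; a row without a year would error here.
tok   = regexp(texts, '\((\d{4})\)\s*$', 'tokens', 'once');
years = cellfun(@(t) str2double(t{1}), tok);

% Split col3 on '|' and collect the sorted set of all features.
parts   = cellfun(@(s) strsplit(s, '|'), feats, 'UniformOutput', false);
allFeat = unique([parts{:}]);

% One row per item, one logical column per feature.
B = false(numel(parts), numel(allFeat));
for i = 1:numel(parts)
    B(i, :) = ismember(allFeat, parts{i});
end
```

Keeping B as a logical matrix rather than a string like 101010001 makes later filtering cheap, for example ids(B(:, 1)) lists the items that have the first feature.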

Shuaib Ahmed
  • Thanks. readtable is pretty interesting and fast. When I apply T=readtable(filename) and then T.col3, I get the list of features ('a', 'b', 'c', ...) for each row. Given a cell array with one string per row, I want to find a particular type of string. For example, suppose I want to find every row that contains the feature 'a', whether it appears as 'a', 'a|', '|a' or '|a|'. Ideas are more than welcome (e.g. strcmp, strfind, regexp, ...) – YAS Nov 24 '15 at 15:00
  • OK, I think regexp is efficient for searching for this type of pattern. – YAS Nov 25 '15 at 10:12
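
One way the regexp idea from the comments might look, as a sketch only: the sample cell array col3 and the variable names are hypothetical, and the pattern treats a feature as matched only when it is delimited by '|' or by the start or end of the string.

```matlab
% Hypothetical col3 contents; 'ab|c' and 'ba' must NOT count as containing 'a'.
col3 = {'a|b'; 'b|a|c'; 'ab|c'; 'c|a'; 'ba'};
feat = 'a';

% (^|\|) : start of string or a literal '|'
% (\||$) : a literal '|' or end of string
% regexptranslate escapes any regexp metacharacters in the feature name.
pat = ['(^|\|)' regexptranslate('escape', feat) '(\||$)'];

% With 'once', regexp returns one start index (or empty) per cell.
hits = ~cellfun(@isempty, regexp(col3, pat, 'once'));

rowsWithFeat = find(hits);
```

A plain strfind(col3, 'a') would also flag 'ab|c' and 'ba', which is why the delimiters are part of the pattern.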