2

I have a dataset of 1911518 values. I have used the textscan function. But the function returns only 1424458 values. I again created a new variable to get the remaining values but this time it gave me around 9000 values. Does anyone know what the reason for this can be?

n=9
N=1911518

file_id=fopen('CRSP.csv');

C=textscan(file_id,'%s',n,'delimiter', ','); %To get the headers
C_text=textscan(file_id,'%s %s %s %d %d %d %d %f %f',N, 'delimiter' , ','); 

%Returns 1424458

C_text1=textscan(file_id,'%s %s %s %d %d %d %d %f %f',N, 'delimiter' , ','); 

%Returns only 9000 values    

fclose(file_id);

Sample Data

DATE,COMNAM,TICKER,PERMNO,PERMCO,SHROUT,VOL,RET,sprtrn

01/02/1996, ACME CLEVELAND CORP NEW, AMT, 10057, 20020, 6313, 23700,  0.000000,  0.007793
01/03/1996, ACME CLEVELAND CORP NEW, AMT, 10057, 20020, 6313, 20800,  0.020000,  0.000950
01/04/1996, ACME CLEVELAND CORP NEW, AMT, 10057, 20020, 6313, 65300, -0.026144, -0.005826
01/05/1996, ACME CLEVELAND CORP NEW, AMT, 10057, 20020, 6313, 340600, 0.000000, -0.001587
01/08/1996, ACME CLEVELAND CORP NEW, AMT, 10057, 20020, 6313, 3400,   0.000000,  0.002821
01/09/1996, ACME CLEVELAND CORP NEW, AMT, 10057, 20020, 6313, 27200, -0.006711, -0.014568
01/10/1996, ACME CLEVELAND CORP NEW, AMT, 10057, 20020, 6313, 25400, -0.033784, -0.018000
01/11/1996, ACME CLEVELAND CORP NEW, AMT, 10057, 20020, 6313, 14000,  0.000000,  0.007034
Brian Tompsett - 汤莱恩
  • 5,753
  • 72
  • 57
  • 129
user2848498
  • 57
  • 2
  • 7

2 Answers2

2

I would assume, that there is some error in the data, or format pattern does not match the data. Try to extract these lines:

file_id=fopen('CRSP.csv');
for idx=1:1424456
    fgetl(file_id); %dump data
end
for idx=1:10
    fprintf('%s\n',fgetl(file_id));
end

If there is an error, it should be at the 2rd or 3nd printed line. Anything special there? Maybe a COMNAM with some special character?

To read the file, i would use the following code to read line by line:

file_id=fopen('CRSP.csv');
line=fgetl(file_id);
data={};
int ix=1;
while(ischar(line))
    [parsed,sindex,eindex] = regexpi(line,'(\d\d/\d\d/\d\d\d\d)\s*, ([\w ]+), ([\w ]+), ([\d]+), ([\d]+), ([\d]+), ([\d]+), ([\d \.]+), ([\d \.]+)','tokens')
    if ~isempty(sindex)&&numel(sindex)==1&&(sindex==1)&&(eindex==numel(x))
        data{end+1}=parsed{1};
    else
        fprintf('Unable to parse line %d with content: %S',ix,line);
    end
    line=fgetl(file_id);
    ix=ix+1;
end

Short summary of regular expressions:

'(...)' Everything between is a "token" which is returned

'([\d .]+)' Numbers, white space and "."

'([\d .]+)' Numbers and white space

'([\w ]+)' Word, including white space

'(\d\d/\d\d/\d\d\d\d)' date

This expression is a bit "lazy". It not only accepts "0.000" as a number but also "0.0 00." or some other combinations, but it should be enough to detect all errors. If not, the expression has to be improved.

Daniel
  • 36,610
  • 3
  • 36
  • 69
  • You are right, there are certain values in the data that have errors in them. for example the sprrtn, there are string values. But what I thought was to just extract all values into matlab and then check for NAs or infs – user2848498 Oct 18 '13 at 20:57
  • Now that I find there are error values, how do I remove them and store the remaining values without spoiling the order – user2848498 Oct 18 '13 at 21:20
  • I updated my answer. There might be a simpler solution using textscan as well, but I am simply used to regexp. – Daniel Oct 18 '13 at 22:10
  • I kind of understood the code. Now, if the IF condition is satisfied, I don't have to do anything. But how do I now completely remove the column at the ELSE condition?? – user2848498 Oct 18 '13 at 23:02
  • I added some additional lines how the solution could look like. To read the file I basically used the example code here: http://www.mathworks.de/de/help/matlab/ref/fgets.html At the end data should contain the whole valid data and every corrupted data should be printed. – Daniel Oct 18 '13 at 23:11
  • Hey Daniel, Will it possible for me to just use **fgetl** to get all the data into a variable and then use comma separation to seperate the data? – user2848498 Oct 19 '13 at 00:11
  • @user2848498: Basically this is possible, but nothing is checked. You can compare the output of the regexp-line I proposed and a simple split, for any valid line both are equal. – Daniel Oct 22 '13 at 12:49
1

Daniel R's answer is basically correct. To elaborate (I would have posted this as a comment but I lack the reputation), textscan in MATLAB is very finicky and it basically bails out whenever it hits something that does not PRECISELY match the format you specify.

If you have a datafile that is likely to contain some errors or inconsistencies, your main options are to pre-process the file somehow to prune out those errors, or (as Daniel suggests) read the file in yourself line-by-line and parse it however you want. The former is probably about as much work as the latter unless you plan to do it manually and there aren't many errors to fix, so in most cases it might just be easier to write your own parser.

The only other thing you could potentially do -- if the only errors are errors of type (e.g. a column is supposed to be an integer but sometimes a floating-point number slips in), you could still use textscan and replace format specifiers with more generic ones. E.g., in that example, you could replace %d (integer) with %f (floating-point number). Since all integers are floating-point numbers, that should work OK. In the most extreme case, you could read in all columns as strings (%s), but then you'd need to parse them all anyway and you're probably better off just doing that without textscan.

Matt
  • 1,101
  • 11
  • 11