Use textscan to read datablock

Question

How can I extract the "Depth" and "Mean" data for each month from a file like the following?

MEAN, S.D., NO. OF OBSERVATIONS


                      January                February       ...            
 Depth       Mean   S.D.  #Obs       Mean   S.D.  #Obs       ...
     0      32.92   0.43     9      32.95   0.32    21      
    10      32.92   0.43    14      33.06   0.37    48      
    20      32.88   0.46    10      33.06   0.37    50      
    30      32.90   0.51     9      33.12   0.35    48      
    50      33.05   0.54     6      33.20   0.42    41      
    75      33.70   1.11     7      33.53   0.67    37      
   100      34.77            1      34.47   0.42    10      
   150                                                                                           
   200

                         July                  August               
 Depth       Mean   S.D.  #Obs       Mean   S.D.  #Obs       
     0      32.76   0.45    18      32.75   0.80    73      
    10      32.76   0.40    23      32.65   0.92   130      
    20      32.98   0.53    24      32.84   0.84   121     
    30      32.99   0.50    24      32.93   0.59   120      
    50      33.21   0.48    16      33.05   0.47   109      
    75      33.70   0.77    10      33.41   0.73    80      
   100      34.72   0.54     3      34.83   0.62    20      
   150                              34.69            1                                                     
   200

The values are separated by a variable number of spaces, and there is an introductory line at the beginning.

Thank you!

score 0 · Answer 1 · answered Jul 09 '12 at 05:29

Here is an example of how to read a file line by line:

fid = fopen('yourfile.txt');

tline = fgetl(fid);
while ischar(tline)
    disp(tline)
    tline = fgetl(fid);
end

fclose(fid);

Inside the while loop you'll want to use strtok (or something like it) to break up each line into string tokens delimited by spaces.
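As a rough sketch of that step, the loop below splits one sample row (copied from the question) into tokens with strtok and converts them to numbers. The variable names tokens, values, depth and janMean are only illustrative, and it assumes the row has no blank cells; in the reading loop, tline would come from fgetl instead of a literal.

```matlab
% Sample data row copied from the question.
tline = '     0      32.92   0.43     9      32.95   0.32    21      ';

tokens = {};
remainder = tline;
while true
    [tok, remainder] = strtok(remainder); % default delimiter is whitespace
    if isempty(tok)
        break;                            % nothing but whitespace left
    end
    tokens{end+1} = tok;                  %#ok<AGROW> fine for short lines
end

values  = str2double(tokens); % numeric row vector; non-numeric tokens become NaN
depth   = values(1);          % first column
janMean = values(2);          % second column (January mean in this row)
```

Header lines produce NaN values here, so a check such as any(isnan(values)) could be used to skip them.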

score 0 · Answer 2 · answered Jul 26 '12 at 16:08

MATLAB's regexp is powerful for pulling data out of loosely structured text. It's really worth getting familiar with regular expressions in general: http://www.mathworks.com/help/techdoc/ref/regexp.html

In this case, you would define the pattern to capture each observation group (Mean SD Obs), e.g.: 32.92 0.43 9

Here I see a pattern for each group of data: each group is preceded by 6 spaces (regular expression = \s{6}), and the 3 data points are separated by one or more whitespace characters (\s+), always fewer than 6 in this sample. The data itself consists of two floats (\d+\.\d+) and one integer (\d+):

So, putting this together, your capture pattern would look something like this (the parentheses surround each piece of data to capture):

expr = '\s{6}(\d+\.\d+)\s+(\d+\.\d+)\s+(\d+)';

We can add names for each token (i.e. each data point to capture in the group) by adding '?<name>' just inside the opening parenthesis:

expr = '\s{6}(?<mean>\d+\.\d+)\s+(?<sd>\d+\.\d+)\s+(?<obs>\d+)';

Then, just read your file into one string variable 'strFile' and extract the data using this defined pattern:

strFile = fileread('mydata.txt'); % read the whole file into one string
[tokens, data] = regexp(strFile, expr, 'tokens', 'names');

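The variable 'tokens' will then contain a sequence of observation groups, and 'data' is a structure array with one element per match and fields .mean, .sd and .obs (because these are the token names in 'expr'). The captured values are strings, so convert them with str2double before doing arithmetic. Note that a group with a blank field (e.g. a missing S.D.) does not match this pattern and is skipped.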
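A minimal sketch of using the named tokens, run on a single sample row from the question rather than on the whole file. The names sample, means, sds and nObs are illustrative only.

```matlab
expr = '\s{6}(?<mean>\d+\.\d+)\s+(?<sd>\d+\.\d+)\s+(?<obs>\d+)';

% One data row copied from the question (January and February groups).
sample = '     0      32.92   0.43     9      32.95   0.32    21      ';

data = regexp(sample, expr, 'names'); % struct array, one element per match

% The fields hold text, so convert them to numbers.
means = str2double({data.mean});
sds   = str2double({data.sd});
nObs  = str2double({data.obs});
```

For this row, data has two elements, one per month group. The depth is not captured by this pattern, so it would have to be read separately, for example from the first token of each line.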

score 0 · Answer 3 · answered Jul 26 '12 at 17:57

If you just want to get, for example, the first two columns, then textscan() is a great choice.

fid = fopen('yourfile.txt');

tline = fgetl(fid);
while ischar(tline)
    oneCell = textscan(tline, '%n'); % parse every number on the line into one cell
    allTheNums = oneCell{1}; % open up the cell to get the column vector of numbers
    tline = fgetl(fid); % advance here, so that 'continue' cannot loop forever

    if numel(allTheNums) < 2 % header line, or a row with a depth but no data
        continue;
    end

    usefulNums = allTheNums(1:2) % first two columns; no semicolon, so they are displayed
end

fclose(fid);

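textscan automatically splits the strings you feed it wherever there is whitespace, so the variable number of spaces between columns isn't an issue. A line with no numbers gives an empty array, and a row that holds only a depth gives a single element, so check the length before indexing to avoid out-of-bounds errors. Be aware that blank cells collapse under whitespace splitting: in a row with a missing value, the later numbers shift to the left.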

If you need to programmatically figure out which columns to get, you can scan the header line for the words 'Depth' and 'Mean' to find their indices. Regular expressions might be helpful here, but textscan should work fine too.
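One way to sketch the header-scanning idea is to treat the file as fixed-width text. This is a different technique from whitespace splitting, but it copes with blank cells. The example assumes the columns are right-aligned under their labels with field widths of 6, 11, 7 and 6 characters, as they appear in the question's sample. The header and rows are rebuilt with sprintf so the alignment is explicit, and fieldWidth and the variable names are assumptions for illustration.

```matlab
% Build an aligned header and two rows in the assumed fixed-width layout.
header = sprintf('%6s%11s%7s%6s%11s%7s%6s', ...
                 'Depth', 'Mean', 'S.D.', '#Obs', 'Mean', 'S.D.', '#Obs');
rows = { sprintf('%6d%11.2f%7s%6d%11.2f%7.2f%6d', ...
                 100, 34.77, ' ', 1, 34.47, 0.42, 10), ... % blank S.D. cell
         sprintf('%6d', 150) };                             % depth only

meanEnds   = regexp(header, 'Mean', 'end'); % column where each 'Mean' label ends
fieldWidth = 8;                             % assumed width of a Mean field

depth = nan(numel(rows), 1);
means = nan(numel(rows), numel(meanEnds));
for r = 1:numel(rows)
    % Pad short rows so that every Mean column can be indexed safely.
    padded = [rows{r}, blanks(max(0, meanEnds(end) - numel(rows{r})))];
    depth(r) = str2double(padded(1:6));
    for k = 1:numel(meanEnds)
        field = padded(meanEnds(k) - fieldWidth + 1 : meanEnds(k));
        means(r, k) = str2double(field);    % a blank field becomes NaN
    end
end
```

Each row of means then lines up with the matching entry of depth, with NaN marking months that have no value at that depth. For a real file, header and rows would come from fgetl, and the header would need to be re-read at the start of each block of months.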
