1

I have a csv file that consists of text or number. But some columns are corrupted as seen in the image below("<<"K.O). When I open the csv file via Matlab (without importing), it converts them to number and define undefined values such as "<<"K.O as NaN as I wanted. But when I read the file via script I wrote:

opts = detectImportOptions(filedir);
table = readtable(filedir,opts);

It reads them as char arrays. Since I have many different csv files (columns are different), I want to do it automatically rather than using textscan(since it requires file format and my file format is different for each csv file). Is there any way to convert the contents of columns containing numeric text to numbers automatically?

data

Pumpkin
  • 209
  • 2
  • 12

2 Answers2

0

Assuming you have your data stored in a table, you can attempt to convert each column of character arrays to numeric values using str2double. Any values that don't convert to a numeric value (empty entries, words, non-numeric strings, etc.) will be converted to NaN.

Since you want to do the conversions automatically, we'll have to make one key assumption: any column that converts to all NaN values should remain unchanged. In such a case, the data was likely either all non-convertable character arrays, or already numeric. Given that assumption, this generic conversion could be applied to any table T:

for varName = T.Properties.VariableNames
  numData = str2double(T.(varName{1}));
  if ~all(isnan(numData))
    T.(varName{1}) = numData;
  end
end

As a test, the following sample data:

T = table((1:5).', {'Y'; 'N'; 'Y'; 'Y'; 'N'}, {'pi'; ''; '1.4e5'; '1'; 'A'});

T = 

    Var1    Var2     Var3  
    ____    ____    _______

    1       'Y'     'pi'   
    2       'N'     ''     
    3       'Y'     '1.4e5'
    4       'Y'     '1'    
    5       'N'     'A'

Will be converted to the following by the above code:

T = 

    Var1    Var2     Var3 
    ____    ____    ______

    1       'Y'        NaN
    2       'N'        NaN
    3       'Y'     140000
    4       'Y'          1
    5       'N'        NaN
gnovice
  • 125,304
  • 15
  • 256
  • 359
  • The problem is I have many csv files and many columns in them(some texts some double values) and I want to do the same thing for all dataset (in a for loop). Thus it shouldn't be column specific such as table.grade. Actually, in matlab the table produced by uiimport is same as what I want but I could not find any function produces the same output without user interface. – Pumpkin Nov 11 '17 at 09:51
  • @Pumpkin: Do you want to convert *every* non-numeric column to numeric, or are there columns that should remain non-numeric? – gnovice Nov 11 '17 at 14:39
  • Yes there are columns that should remain non-numeric (e.g. the major of student or status ) – Pumpkin Nov 11 '17 at 22:59
0

As far as I can understand from your comments, this is what you are actually looking for:

for i = 1:numel(files)
    file = fullfile(folder,files(i).name));

    opts = detectImportOptions(file);
    idx = strcmp(opts.VariableNames,'Grade');

    if (any(idx))
        opts.VariableTypes(idx) = {'double'};
    end

    tabs(i) = readtable(file,opts);
end
Tommaso Belluzzo
  • 23,232
  • 8
  • 74
  • 98
  • There is no simple solution for this. You could pick the first non null row value of each column and match it against a regexp like "^[0-9]+(\\.[0-9]+)?" (see https://it.mathworks.com/help/matlab/ref/regexp.html). If the match is positive, you can apply my method to change the variable type to 'double' for the current column. – Tommaso Belluzzo Nov 11 '17 at 23:12