
From the beginning.

I have data in a csv file like:

La Loi des rues,/m/0gw3lmk,/m/0gw1pvm
L'Étudiante,/m/0j9vjq5,/m/0h6hft_
The Kid From Borneo,/m/04lrdnn,/m/04lrdnt,/m/04lrdn5,/m/04lrdnh,/m/04lrdnb

etc.

This is in UTF-8 format. I import this file as follows (taken from somewhere else):

feature('DefaultCharacterSet','UTF-8');
fid = fopen(filename,'rt');         %# Open the file
lineArray = cell(100,1);            %# Preallocate a cell array (ideally slightly
                                    %# larger than is needed)
lineIndex = 1;                      %# Index of cell to place the next line in
nextLine = fgetl(fid);              %# Read the first line from the file
while ~isequal(nextLine,-1)         %# Loop while not at the end of the file
  lineArray{lineIndex} = nextLine;  %# Add the line to the cell array
  lineIndex = lineIndex+1;          %# Increment the line index
  nextLine = fgetl(fid);            %# Read the next line from the file
end
fclose(fid);                        %# Close the file

This produces a {3x1} cell array containing the UTF-8 text:

'La Loi des rues,/m/0gw3lmk,/m/0gw1pvm'
'L''Étudiante,/m/0j9vjq5,/m/0h6hft_'
'The Kid From Borneo,/m/04lrdnn,/m/04lrdnt,/m/04lrdn5,/m/04lrdnh,/m/04lrdnb'

Now the next part separates each value into an array:

lineArray = lineArray(1:lineIndex-1);            %# Remove empty cells, if needed
for iLine = 1:lineIndex-1                        %# Loop over lines
  lineData = textscan(lineArray{iLine},'%s',...  %# Read strings
                      'Delimiter',',');
  lineData = lineData{1};                        %# Remove cell encapsulation
  if strcmp(lineArray{iLine}(end),',')           %# Account for when the line
    lineData{end+1} = '';                        %# ends with a delimiter
  end
  lineArray(iLine,1:numel(lineData)) = lineData; %# Overwrite line data
end

This outputs:

'La Loi des rues'   '/m/0gw3lmk'    '/m/0gw1pvm'    []  []  []
'L''�tudiante'  '/m/0j9vjq5'    '/m/0h6hft_'    []  []  []
'The Kid From Borneo'    '/m/04lrdnn'    '/m/04lrdnt'    '/m/04lrdn5'    '/m/04lrdnh'    '/m/04lrdnb'

The problem is that the UTF-8 encoding is lost in the textscan call (note the '�' replacement character I now get, whereas it was fine in the previous array).

Question: How do I maintain the UTF-8 encoding when the {3x1} array is split into a 3xN array?

I can't find anything on how to keep UTF-8 encoding when running textscan on an array already in the workspace. Everything I can find is about importing a text file, which I have no problem with; it is the second step that breaks.
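
One quick check (run before the second loop overwrites lineArray; the raw/tmp/out names below are just for illustration) is to compare character codes before and after the textscan call, since É should show up as code 201 if it survives as a single character:

raw = double(lineArray{2});                         %# codes of the raw line (illustrative names)
tmp = textscan(lineArray{2},'%s','Delimiter',',');  %# same call as in the loop above
out = double(tmp{1}{1});                            %# compare these codes with raw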

Thanks!

Griff

2 Answers


Try the following code:

%# read whole file as a UTF-8 string
fid = fopen('utf8.csv', 'rb');
b = fread(fid, '*uint8')';
str = native2unicode(b, 'UTF-8');
fclose(fid);

%# split into lines
lines = textscan(str, '%s', 'Delimiter','', 'Whitespace','\n');
lines = lines{1};

%# split each line into values
C = cell(numel(lines),6);
for i=1:numel(lines)
    vals = textscan(lines{i}, '%s', 'Delimiter',',');
    vals = vals{1};
    C(i,1:numel(vals)) = vals;
end

The result:

>> C
C = 
    'La Loi des rues'        '/m/0gw3lmk'    '/m/0gw1pvm'              []              []              []
    'L'Étudiante'            '/m/0j9vjq5'    '/m/0h6hft_'              []              []              []
    'The Kid From Borneo'    '/m/04lrdnn'    '/m/04lrdnt'    '/m/04lrdn5'    '/m/04lrdnh'    '/m/04lrdnb'

Note that when I tested this, I encoded the input CSV file as "UTF-8 without BOM" (I was using Notepad++ as the editor).
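
If the file does carry a BOM, a small tweak before decoding should cover it (a sketch, not part of the original answer):

%# sketch: strip a UTF-8 BOM (bytes EF BB BF), if present, before decoding
if numel(b) >= 3 && isequal(b(1:3), uint8([239 187 191]))
    b = b(4:end);
end
str = native2unicode(b, 'UTF-8');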

Amro
  • The question mark appears to have disappeared but now I have: 'L''tudiante' which has removed the 'E' altogether. I saved it in TextWrangler as I'm on OSX. Just the regular UTF-8 which I've read is without BOM. – Griff Jul 25 '12 at 02:17
  • OK - your code works on Windows 7 Matlab but not OSX Matlab! I tried the exact same code on my desktop and it worked! I would like to use my laptop for this however so could you describe what I must change to make it work on OSX Matlab? Thanks. – Griff Jul 25 '12 at 02:30
  • @Griff: That's interesting, are you sure it is not a font-related issue? Also, where exactly in the code does the 'E accent aigu' disappear? Does `str` contain the correct characters, and what about `lines`? One thing to try is to write the `str` variable back to a new file; try the following and report back: `fid = fopen('newfile.csv', 'wb'); fwrite(fid, unicode2native(str,'UTF-8'), '*uint8'); fclose(fid);`. For what it's worth, I am on a 32-bit WinXP box with MATLAB R2012a. Note that I haven't changed `feature('DefaultCharacterSet')`, which tells me it's using the default `windows-1252` encoding – Amro Jul 25 '12 at 13:01
  • @Griff: If for some reason TEXTSCAN was the culprit, perhaps you can use alternative low-level functions such as `strtok` and the like. – Amro Jul 25 '12 at 13:05
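
A minimal sketch of that strtok route, splitting one line without textscan (the s/vals/tok names are illustrative, and note that consecutive commas, i.e. empty fields, get collapsed):

s = lineArray{2};               %# one line already read into the workspace (illustrative names)
vals = {};
while ~isempty(s)
    [tok, s] = strtok(s, ',');  %# next field, plus the remainder of the line
    vals{end+1} = tok;          %# grow the list of values
end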

Try using the following fopen command instead of the one you are currently using. It specifies UTF-8 encoding for the file.

f = fopen(filename,'rt', 'UTF-8');

You can probably shorten some of the code by using this as well:

text = fscanf(f,'%c');
Lines = textscan(text,'%s','Delimiter',',');

That should also let you avoid the manual preallocation you're doing there.
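
A minimal end-to-end sketch along these lines (fopen needs a machine-format argument, e.g. 'n', before the encoding):

f = fopen(filename, 'rt', 'n', 'UTF-8');      %# open with an explicit UTF-8 encoding
text = fscanf(f,'%c');                        %# read the whole file as one char array
fclose(f);                                    %# close the file
Lines = textscan(text,'%s','Delimiter',',');  %# split on commas; newlines also end fields
Lines = Lines{1};                             %# one long cell column of values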

Ben A.
  • You are missing an argument: `fopen(filename, 'rt', 'native', 'UTF-8');`. Still, this won't work with TEXTSCAN directly; you would have to read the whole file as a string, then parse that with TEXTSCAN. – Amro Jul 24 '12 at 17:58
  • That does not work. I still get: 'L''�tudiante' using fopen(filename, 'rt', 'native', 'UTF-8'); – Griff Jul 25 '12 at 02:14