I have an .asc file, 'test.asc', whose lines differ in length and content.

  my name is blalala
  This is my home and I live in here
  12 13 10 is he
  he is my brother 12 13 14

How can I import the contents of the file into a MATLAB cell array in which each row holds one line, split on the space delimiter?

  resultCellarray={
    'my'   'name' 'is'  'blalala' []    []   []     []   []
    'This' 'is'   'my'  'home'    'and' 'I' 'live' 'in' 'here'
    '12'   '13'   '10'  'is'      'he'  []   []     []   []
    'he'   'is'   'my'  'brother' '12' '13'  '14'   []   []
    }

I have tried inserting each line as one cell:

   content = textread('test.asc','%s','delimiter','\n','whitespace','');     

and then splitting each cell into several columns, as described in "separating cell array into several columns MATLAB", but this takes a long time when the file is large. What is the fastest way to do this?
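For reference, the two-step approach described above (one cell per line, then split into columns) might look like the following. This is a minimal sketch, not the code from the linked question: the lines are held in a hypothetical in-memory `content` variable instead of being read from 'test.asc', and `tokens`, `counts` and `result` are illustrative names.

```matlab
% Hypothetical stand-in for the lines read from the file.
content = {'my name is blalala'; ...
           'This is my home and I live in here'; ...
           '12 13 10 is he'; ...
           'he is my brother 12 13 14'};

% Split every line on runs of whitespace; each element of tokens is a 1-by-n cell row.
tokens = regexp(strtrim(content), '\s+', 'split');

% Copy the rows into one rectangular cell array; unused cells stay [].
counts = cellfun(@numel, tokens);
result = cell(numel(tokens), max(counts));
for k = 1:numel(tokens)
    result(k, 1:counts(k)) = tokens{k};
end
```

The per-line cell handling in a pattern like this is the kind of work that becomes slow on files with around a million lines.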

beaker
ryh12

1 Answer

This code should run very fast (it splits 1M characters in 0.2 s):

%generate random file
% w=[10,13,32*ones(1,10),97:122,97:122];
% FILE_LENGTH=10*1000*1000;mytext=char(w(randi(length(w),1,FILE_LENGTH))); 
% fileID = fopen('z:\mytest.asc','w');fprintf(fileID,'%s',mytext);fclose(fileID);
clear
tic
%settings
Filename='z:\test.asc';
LineDelimiter=newline; %=char(10); newline requires R2016b or later
WordDelimiter=' ';

%read file
fid=fopen(Filename,'r');
text=fread(fid,'*char')';
fclose(fid);

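%fix text: normalize whitespace so every word ends at exactly one delimiter
text(text==char(9))=WordDelimiter; %replace tabs with spaces
text(text==char(13))=[]; %remove carriage returns ('\r')
if isempty(text) || text(end)~=LineDelimiter, text(end+1)=LineDelimiter; end %add a final end-of-line if missing
IdxWords=find(text==WordDelimiter);
text(IdxWords(diff(IdxWords)==1))=[]; %collapse runs of spaces into a single space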

%count words per line
IdxNewline=find(text==LineDelimiter);
NumOfLines=length(IdxNewline); %one line per end-of-line, so empty lines count too
WordsPerLine=zeros(1,NumOfLines);
IdxWords=find(text==WordDelimiter|text==LineDelimiter); %every word ends at a space or an end-of-line
iword=1; iword_max=length(IdxWords);
for i=1:NumOfLines
    %count the delimiters up to and including this line's end-of-line
    while iword<=iword_max && IdxWords(iword)<=IdxNewline(i)
        WordsPerLine(i)=WordsPerLine(i)+1;
        iword=iword+1;
    end
end
MaxWords=max(WordsPerLine); %number of columns in the output

%split
Output=cell(NumOfLines,MaxWords); %unused cells stay []
pos=1;iword=0; %pos = start of the current word
for i=1:NumOfLines
    for j=1:WordsPerLine(i)
        iword=iword+1;
        Output{i,j}=text(pos:IdxWords(iword)-1); %characters up to the next delimiter
        pos=IdxWords(iword)+1; %skip the delimiter
    end
end
toc

% disp(Output)
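
The word-counting loop above can also be written without a loop. The following is a minimal sketch of that alternative, not part of the original answer; it assumes the text has already been normalized as in the 'fix text' stage (single spaces, final end-of-line), and `txt`, `nl`, `isDelim`, `delimCount` and `wordsPerLine` are hypothetical names used only here.

```matlab
% Hypothetical, already-normalized text: two lines, each ending in a newline.
nl  = char(10);
txt = ['a bb' nl 'ccc d e' nl];

isDelim    = (txt == ' ') | (txt == nl);   % every word ends at a delimiter
idxNewline = find(txt == nl);              % position of each end-of-line
delimCount = cumsum(isDelim);              % delimiters seen up to each position

% The number of delimiters between consecutive newlines is the word count of that line.
wordsPerLine = diff([0, delimCount(idxNewline)]);
```

The cumulative sum replaces the nested `for`/`while` counting, so only the final split into cells still needs a loop.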
Mendi Barel
  • It is also taking a lot of time. The test file consists of around 1016000 lines – ryh12 Jul 29 '17 at 05:27
  • Lot of time? how much exactly? What is the total file size? – Mendi Barel Jul 29 '17 at 05:31
  • Changed to a custom split, 40% faster. You cannot expect splitting into a 2D cell array to perform like a 2D double array: a separate memory allocation is done for every cell. This code is close to the fastest you can get in MATLAB code for this problem. – Mendi Barel Jul 29 '17 at 06:08
  • Thanks for the help. In some cases there is more than one space between two words, and sometimes there is one space. As a result, in some cases it is saving the space in an empty cell; I only want to keep the words. Is it possible to deal with such cases? @Mendi Barel – ryh12 Jul 29 '17 at 09:08
  • I added some code that removes consecutive spaces in the 'fix text' phase (that fix can slow the performance); this should fix your problem. – Mendi Barel Jul 29 '17 at 09:23
  • But it is still faster than the older way :-). It took 39 secs. Thank you. It would be great if you can please add more comments for the explanation of the code. @Mendi Barel – ryh12 Jul 29 '17 at 09:30
  • Why is the first character of the first word in each line always missing? @Mendi Barel – ryh12 Jul 29 '17 at 10:02
  • Oh, I see: a small bug in the last split loop. Fixed. Can you write how much time the other methods took? – Mendi Barel Jul 29 '17 at 10:18
  • Other methods are taking up to 700 seconds. But in cases of one space, sometimes it is working fine and sometimes it is combining the two words before and after the space into one cell. @Mendi Barel – ryh12 Jul 29 '17 at 11:18
  • I changed the 'fix text' section; now we are back close to the fastest code possible in MATLAB. How long does it run now? – Mendi Barel Jul 29 '17 at 12:36
  • Also look again at the split loop; I changed it again. – Mendi Barel Jul 29 '17 at 12:48
  • The code is now taking up to 12 secs. But I am still getting the same problem when there is only one space: in some cases it is combining the words before and after the space into one word, although in the old method it was working fine. @Mendi Barel – ryh12 Jul 30 '17 at 01:24
  • I didn't see this problem. You had better copy all of the code, or give a text line that produces the error. – Mendi Barel Jul 30 '17 at 01:55
  • Oh, I see what might be the reason now. The problem is that these words are separated by a tab instead of a space, and your code is only searching for the space delimiter. I hope it is possible to deal with such cases? @Mendi Barel – ryh12 Jul 30 '17 at 02:06
  • The idea is to replace tabs with spaces in the 'fix text' stage. Look at the code. – Mendi Barel Jul 30 '17 at 02:15
  • It is working perfectly now and taking only 10 seconds :-). Thank you very much for your help and time @Mendi Barel – ryh12 Jul 30 '17 at 02:20
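
The whitespace cases raised in the comments (repeated spaces, tabs, '\r' line endings) can be reproduced on a small string. The following is a minimal sketch that normalizes them with `regexprep` instead of the index-based 'fix text' code in the answer; the sample `raw` string and the `clean` variable are hypothetical, and the last step additionally trims spaces at line edges, which the answer's code does not do.

```matlab
% Hypothetical sample: leading spaces, a tab, a run of spaces, a trailing space and '\r\n'.
raw = sprintf('  he\tis   my brother \r\n12 13\n');

clean = raw;
clean(clean == char(13)) = [];                           % drop carriage returns
clean = regexprep(clean, '[ \t]+', ' ');                 % tabs and space runs become one space
clean = regexprep(clean, '^ | $', '', 'lineanchors');    % trim a space at each line edge
```

After this, every word on a line is followed by exactly one space or an end-of-line, which is the form the counting and splitting stages rely on.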