I'm trying to read individual .xml files from a .zip containing 60,000+ .xml files without extracting the whole archive. Each .xml file is named HMDB#.xml, where # is a 5-digit number.

Each .xml file is around 25 kB in size (±5 kB).

I am using the following code to do this at the moment. path is a string containing the path to the .zip file and hmdbid is a string containing the 5-digit number:

%// Opens the zip file and creates temporary directories for the files so data
%// can be extracted.

function data=partzip(path,hmdbid)

    zipFilename = path;
    zipJavaFile = java.io.File(zipFilename);
    zipFile=org.apache.tools.zip.ZipFile(zipJavaFile);
    entries=zipFile.getEntries;
    cnt=1;

    while entries.hasMoreElements
        tempObj=entries.nextElement;
        file{cnt,1}=tempObj.getName.toCharArray';
        cnt=cnt+1;
    end

    ind=regexp(file,sprintf('$*%s.xml$',hmdbid));
    ind=find(~cellfun(@isempty,ind));
    file=file(ind);
    file = cellfun(@(x) fullfile('.',x),file,'UniformOutput',false);

    data=extract_data(file{1});
    zipFile.close;
end

When testing the code with a .zip file containing:

  • HMDB00002.xml
  • HMDB00005.xml
  • HMDB00008.xml
  • HMDB00010.xml
  • HMDB00012.xml

The code works fine when hmdbid is 00002, 00005 or 00008. For any file after these, my data extraction function returns a file-not-found error.

I have tried several combinations of files with different file names, with the same result. The first 3 files work fine but the others don't, regardless of the file name.

I have tried creating a .zip containing 100 test .xml files, each containing only its file name, and extracting from these works fine. This leads me to believe it's a memory issue, but I'm not sure how to fix it.

Dev-iL
Clanrat
  • Why do you think it's a memory problem? Did you try [profiling the memory usage](http://undocumentedmatlab.com/blog/profiling-matlab-memory-usage) of your script, or [changing the Java heap size](http://blogs.mathworks.com/community/2010/04/26/controlling-the-java-heap-size/) to see if it changes anything? – Dev-iL Jul 02 '15 at 16:16
  • Could it be a problem when the number of digits is more than one? I would check the name of files you generate. – Navan Jul 02 '15 at 18:46
  • @Dev-iL Mostly just because it's the only thing I could think of. I have tried doubling the heap size (from 250 MB to 500 MB) but it still doesn't work, even on the small test file. @Navan I have checked with different combinations of files with the same result. I also stopped the function from running before it removes the other file names, and all the file names do get put into the `file` array. – Clanrat Jul 03 '15 at 07:47
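
One thing that may be worth checking: the function in the question lists the entry names and builds a path with fullfile, but it never reads the entry's bytes out of the archive, so extract_data can only succeed if a file with that name already exists on disk. Below is a minimal sketch of reading one entry straight into memory. It is not the asker's method: readZipEntry is a hypothetical helper, it swaps org.apache.tools.zip.ZipFile for the JDK's java.util.zip.ZipFile, and it assumes the entry is UTF-8 text stored under exactly the name passed in (no folder prefix inside the archive).

```matlab
function txt = readZipEntry(zipPath, entryName)
% readZipEntry  Hypothetical helper: return the text of one zip entry
% without writing anything to disk.
% Assumptions: the entry is UTF-8 text and entryName matches the stored
% name exactly (including any folder prefix inside the archive).

    zipFile = java.util.zip.ZipFile(java.io.File(zipPath));
    cleanup = onCleanup(@() zipFile.close());   % close even if an error occurs

    entry = zipFile.getEntry(entryName);        % null comes back as [] in MATLAB
    if isempty(entry)
        error('readZipEntry:notFound', ...
              'No entry named "%s" in "%s".', entryName, zipPath);
    end

    inStream = zipFile.getInputStream(entry);

    % Read the whole stream as one token: '\A' matches only at the start
    % of input, so the Scanner never finds a second delimiter.
    scanner = java.util.Scanner(inStream, 'UTF-8');
    scanner.useDelimiter('\A');
    if scanner.hasNext()
        txt = char(scanner.next());
    else
        txt = '';                               % empty entry
    end
    scanner.close();                            % also closes inStream
end
```

With the names from the question, a call might look like `xmlText = readZipEntry(path, sprintf('HMDB%s.xml', hmdbid));`. The returned text would then need to go to a parser that accepts a string rather than a file name, which may mean adapting extract_data.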
