I have a bunch of zip archives that I have to extract in order to read the data, which is stored in a single file in each archive. The problem is that, because of some error, some of these archives contain two files with the same name instead of one. When I use the MATLAB command "unzip", one of the files is overwritten by the other. The two files are not the same: one of them has the information I need, and the other one is almost empty. So I would like to rename these two files to file_a and file_b, extract them, and once both are extracted, keep only the larger one.

Do you know if there is any way to rename files inside a zip?

  • 1
    I don't think you can modify the contents of a compressed file. Any tiny modification of the contents completely changes the compression result, i.e. you need to decompress it, change it, then re-compress it. You cannot see what is inside the box without opening the box. – Ander Biguri Aug 20 '20 at 12:52
  • 3
    An entry name appears twice in a Zip file, in clear, so you could replace it with another name of the same length. Otherwise a tool can do the renaming by creating a new Zip file where the compressed contents are copied and only the names are changed. The 7Zip GUI does it, for instance. – Zerte Aug 20 '20 at 12:54
  • 2
    You could also rename your file using a shorter filename if you pad the missing bytes with `00`. I've tried it (on `HexEd.it`) and it works. If you need a longer filename it's going to be more complicated. – obchardon Aug 20 '20 at 13:03
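
The claim in the comments that an entry name is stored in clear, and more than once, can be checked with a short script. This is only a sketch: the folder, file name, and payload below are made up for the demonstration, and it assumes the archive is written by MATLAB's own `zip` with a plain ASCII entry name.

```matlab
% Build a small archive holding a single file (all names here are made up).
workDir = tempname ;
mkdir(workDir) ;
entryName = 'demo_entry.txt' ;
fid = fopen(fullfile(workDir,entryName),'w') ;
fwrite(fid,'some payload that does not contain the entry name') ;
fclose(fid) ;
zipName = fullfile(workDir,'demo.zip') ;
zip(zipName, entryName, workDir) ;

% Read the archive as raw bytes and search for the entry name.
fid = fopen(zipName,'r') ;
bytes = fread(fid, Inf, '*uint8').' ;   % row vector of raw bytes
fclose(fid) ;
offsets = strfind(char(bytes), entryName) - 1 ;   % 0-based byte offsets

fprintf('"%s" found %d time(s), at byte offsets: %s\n', ...
        entryName, numel(offsets), mat2str(offsets)) ;
```

For a simple archive like this one, the name should normally show up once in the entry's local header and once in the central directory at the end of the file, which is why a same-length rename has to patch both places.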

1 Answer

I made a function which modifies the filenames inside the zip file so that the files can be extracted without overwriting each other.

The function locates the file names in the zip file and replaces the first letter of each file name it encounters with the next letter of the sequence "A, B, C, D, etc.".

function differentiateFileNames(zipFilename)

    %% get the filenames contained in the zip file
    filenames = getZipFileNames(zipFilename) ;
    nFiles = numel(filenames) ;

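    %% Find the positions of the file name fields
    % read the raw bytes (fileread may decode the data as text, which can
    % shift the character positions relative to the byte offsets)
    fid = fopen(zipFilename,'r') ;
    str = char( fread(fid,Inf,'*uint8').' ) ;
    fclose(fid) ;
    % all filenames are assumed identical, so we only need to search for the
    % first name in our list
    idx = strfind( str , filenames{1} ) ;

    %% group indices by physical file
    % Each filename appears twice in the zip file:
    % ex for 2 files: file1 ... file2 ... file1 ... file2
    if numel(idx) ~= 2*nFiles
        error('Expected %d occurrences of "%s", found %d.', 2*nFiles, filenames{1}, numel(idx)) ;
    end
    idx = reshape(idx,nFiles,2)-1 ; % -1: fseek offsets are 0-based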

    %% Now modify each filename
    % (replace the first character of each filename)
    fid = fopen(zipFilename,'r+') ;

    for k=1:nFiles
        char2write = uint8('A'+(k-1)) ; % will be: A, B, C, D, etc.
        fseek(fid,idx(k,1),'bof') ;
        fwrite(fid,char2write,'uint8') ;

        fseek(fid,idx(k,2),'bof') ;
        fwrite(fid,char2write,'uint8') ;
    end

    fclose(fid) ;
end

function filenames = getZipFileNames(zipFilename)
    zipFile = [] ; % so the catch block can test it even if construction fails
    try
       % Create a Java file of the ZIP filename.
       zipJavaFile  = java.io.File(zipFilename);
       % Create a Java ZipFile and validate it.
       zipFile = org.apache.tools.zip.ZipFile(zipJavaFile);
       % Extract the entries from the ZipFile.
       entries = zipFile.getEntries;

    catch exception
       if ~isempty(zipFile)
           zipFile.close;
       end
       error(message('MATLAB:unzip:invalidZipFile', zipFilename));
    end
    % close the archive when the function exits (normally or on error)
    cleanUpObject = onCleanup(@()zipFile.close);

    % collect the name of every entry, in archive order
    k = 0 ;
    filenames = cell(0,1) ;
    while entries.hasMoreElements
        k=k+1;
        filenames{k,1} = char(entries.nextElement.getName) ;
    end
end

Be aware that this script assumes that all the files in the zip file have the same name. When it locates the file name positions, it only searches for the first file name found.
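
If you are not sure that assumption holds for a given archive, it can be checked on the list of entry names before anything is patched. A minimal sketch, where the `filenames` cell array is hard-coded as a stand-in for the output of getZipFileNames:

```matlab
% Stand-in for: filenames = getZipFileNames(zipFilename) ;
filenames = { 'New Text Document1.txt' ; ...
              'New Text Document1.txt' } ;

% True only if every entry has exactly the same name as the first one.
allSame = all( strcmp(filenames, filenames{1}) ) ;

if ~allSame
    error('The archive contains entries with different names; the patching logic does not apply as is.') ;
end
```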

The subfunction getZipFileNames is just a stripped-down copy of parts of unzip.m, keeping only what is needed to read the file names contained in the zip file.


For testing: I made a zip file containing 2 files:

New Text Document1.txt
New Text Document2.txt

I modified the file names inside the zip file with a hex editor, in order to have:

New Text Document1.txt
New Text Document1.txt

so both files have the same name in the archive. If I try to unzip that file, as you described, I only get one file in the output (the last file overwrites the other).

If I run differentiateFileNames(zipFilename), then unzip the file, I get 2 files in the output directory:

Aew Text Document1.txt
Bew Text Document1.txt

I know it can look a bit cryptic, but it ensures the files are differentiated. If you want, as an exercise, it wouldn't take much to extend the script to directly unzip the files, find the largest one, delete the other, then rename the remaining file with the proper original name.
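
The exercise suggested above could be approached along these lines. This is a sketch rather than a finished tool: the first section fakes the extraction step with two dummy files so that the script runs on its own, and the variable names (`outDir`, `originalName`, `extracted`, `keptFile`) are made up for the example. It also assumes a flat archive with no subfolders.

```matlab
% --- Demo setup (stand-in for the real extraction) ----------------------
% In real use, this section would be replaced by something like:
%   differentiateFileNames(zipFilename) ;
%   extracted = unzip(zipFilename, outDir) ;
outDir = tempname ;
mkdir(outDir) ;
originalName = 'New Text Document1.txt' ;  % entry name before patching
extracted = { fullfile(outDir,'Aew Text Document1.txt') , ...
              fullfile(outDir,'Bew Text Document1.txt') } ;
fid = fopen(extracted{1},'w') ; fwrite(fid,'x') ; fclose(fid) ;  % almost empty
fid = fopen(extracted{2},'w') ; fwrite(fid,repmat('data ',1,100)) ; fclose(fid) ;

% --- Keep only the largest file -----------------------------------------
nBytes = zeros(1,numel(extracted)) ;
for k = 1:numel(extracted)
    info = dir(extracted{k}) ;
    nBytes(k) = info.bytes ;
end
[~, iMax] = max(nBytes) ;

% delete the smaller file(s)
for k = setdiff(1:numel(extracted), iMax)
    delete(extracted{k}) ;
end

% restore the original name on the file that was kept
keptFile = fullfile(outDir, originalName) ;
movefile(extracted{iMax}, keptFile) ;
```

Since differentiateFileNames edits the archive in place, you may prefer to run it on a copy of the zip file (for example one made with copyfile) so that the original stays untouched.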

Hoki