
I have a complex file-reading problem. I need to read a DOCX file with an embedded file system, extract a ZIP file from it, and walk the ZIP's internal directory to pull out the files I actually need. I have already written this in Java, so I know it can be done, but I want to do it in Rust.

Currently, I can read the DOCX file and iterate through the OLE10 objects to locate the file I need. The OLE10 stream (which is actually the ZIP) starts with a weird 256-byte extraction command header, which I seek past. If I read the rest of the stream and write it to the filesystem, the result is a ZIP that I can open with 7-Zip to see all the contents.

The problem is that no matter which Rust ZIP crate I use (zip, zip_extract, zip_extensions, rc-zip), I cannot extract the ZIP contents. I keep running into the error "cannot find end of central directory". I have stepped through the file, and the EOCD signature "50 4B 05 06" is definitely there. If I end the stream at the EOCD signature, I get an "early end of file exit" error instead. The file is larger than 9 MB, and I wonder whether that might be the issue.

Does anyone have any ideas on how to use Rust to extract the ZIP's contents into a buffer or onto the filesystem?

Here's the code that just won't extract:

use std::io::{Cursor, Read, Seek, SeekFrom};
use std::path::{Path, PathBuf};

let docx_path = Path::new(docx_filename);

// Capture the files from the embedded CFB filesystem
let mut comp_file = cfb::open(docx_path).unwrap();

// Collect the Ole10Native entry (if any) from each storage under /ObjectPool
let objpool_entries_vec: Vec<_> = comp_file
    .read_storage(Path::new("/ObjectPool"))
    .unwrap()
    .filter_map(|subdir| {
        comp_file
            .read_storage(subdir.path().to_owned())
            .unwrap()
            .find(|entry| entry.name().contains("Ole10Native"))
    })
    .collect();

// Create a stream over the OLE10 object that holds the ZIP
let mut ole10_stream = comp_file
    .open_stream(objpool_entries_vec[5].path())
    .unwrap();
// Skip the 256-byte header
ole10_stream.seek(SeekFrom::Start(256)).unwrap();

// Read the rest of the stream into memory
let mut ole_buffer = Vec::new();
ole10_stream.read_to_end(&mut ole_buffer).unwrap();

let zip_cursor = Cursor::new(ole_buffer);

zip_extract::extract(
    zip_cursor,
    &PathBuf::from("C:\\Users\\ra069466\\Documents\\Software_Projects\\Rust_projects\\ha420_maint_app\\test_files\\"),
    false)
    .unwrap();

When I run the following, it writes the ZIP out to the directory, and I can extract that file with 7-Zip. But it still panics when trying to extract to the filesystem.

use std::fs::{File, OpenOptions};
use std::io::{Cursor, Read, Seek, SeekFrom, Write};
use std::path::{Path, PathBuf};

let docx_path = Path::new(docx_filename);

// Capture the files from the embedded CFB filesystem
let mut comp_file = cfb::open(docx_path).unwrap();

// Collect the Ole10Native entry (if any) from each storage under /ObjectPool
let objpool_entries_vec: Vec<_> = comp_file
    .read_storage(Path::new("/ObjectPool"))
    .unwrap()
    .filter_map(|subdir| {
        comp_file
            .read_storage(subdir.path().to_owned())
            .unwrap()
            .find(|entry| entry.name().contains("Ole10Native"))
    })
    .collect();

// Create a stream over the OLE10 object that holds the ZIP
let mut ole10_stream = comp_file
    .open_stream(objpool_entries_vec[5].path())
    .unwrap();
// Skip the 256-byte header
ole10_stream.seek(SeekFrom::Start(256)).unwrap();

// Read the rest of the stream into memory
let mut ole_buffer = Vec::new();
ole10_stream.read_to_end(&mut ole_buffer).unwrap();

let zip_cursor = Cursor::new(ole_buffer);

// Write the buffered bytes out as test.zip
let mut zip_file = OpenOptions::new()
    .write(true)
    .create(true)
    .open("C:\\Users\\ra069466\\Documents\\Software_Projects\\Rust_projects\\ha420_maint_app\\test_files\\test.zip")?;
zip_file.write_all(zip_cursor.get_ref())?;
zip_file.flush()?;

// Reopen the file for reading
let zip_file = File::open("C:\\Users\\ra069466\\Documents\\Software_Projects\\Rust_projects\\ha420_maint_app\\test_files\\test.zip")?;

let zip_archive = zip::ZipArchive::new(&zip_file)?;

zip_extract::extract(
    zip_file,
    &PathBuf::from("C:\\Users\\ra069466\\Documents\\Software_Projects\\Rust_projects\\ha420_maint_app\\test_files\\"),
    false)
    .unwrap();
flyinggreg

2 Answers

AWESOME!! I figured it out!! I needed to loop through the file until I hit the 4-byte EOCD signature "0x50 0x4B 0x05 0x06", then continue for 18 more bytes, which hold:

  • "current disk #" (2 bytes),
  • "CD disk #" (2 bytes),
  • "# of CD entries on this disk" (2 bytes),
  • "total entries of CD" (2 bytes),
  • "CD size" (4 bytes),
  • "CD start offset" (4 bytes),
  • "# of bytes for following comments" (2 bytes),
  • comments (# of character bytes = previous field)
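
To make the layout above concrete, here is a minimal sketch of decoding those fields with only the standard library. The `Eocd` struct and `parse_eocd` function are hypothetical names for illustration (they are not part of any ZIP crate), and the sketch assumes a plain, non-Zip64 record with all multi-byte values stored little-endian.

```rust
/// Fixed-size part of a ZIP end-of-central-directory (EOCD) record.
#[derive(Debug, PartialEq)]
struct Eocd {
    disk_number: u16,     // "current disk #"
    cd_disk: u16,         // "CD disk #"
    entries_on_disk: u16, // "# of CD entries on this disk"
    total_entries: u16,   // "total entries of CD"
    cd_size: u32,         // "CD size"
    cd_offset: u32,       // "CD start offset"
    comment_len: u16,     // "# of bytes for following comments"
}

const EOCD_SIG: [u8; 4] = [0x50, 0x4B, 0x05, 0x06];

/// Parse an EOCD record that starts at the beginning of `bytes`.
/// Returns `None` if fewer than 22 bytes are available or the signature is wrong.
fn parse_eocd(bytes: &[u8]) -> Option<Eocd> {
    // 4 signature bytes + 18 bytes of fixed-size fields
    if bytes.len() < 22 || !bytes.starts_with(&EOCD_SIG) {
        return None;
    }
    let u16_at = |i: usize| u16::from_le_bytes([bytes[i], bytes[i + 1]]);
    let u32_at =
        |i: usize| u32::from_le_bytes([bytes[i], bytes[i + 1], bytes[i + 2], bytes[i + 3]]);
    Some(Eocd {
        disk_number: u16_at(4),
        cd_disk: u16_at(6),
        entries_on_disk: u16_at(8),
        total_entries: u16_at(10),
        cd_size: u32_at(12),
        cd_offset: u32_at(16),
        comment_len: u16_at(20),
    })
}
```

The `comment_len` value is what tells you how many bytes, if any, legitimately follow the 22-byte record.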

I excluded any comments, so my comment-length field is 0x00 and the comment itself is empty. Here's the code that copies the stream up to the end of the EOCD record so that I could extract it with zip_extensions::read::zip_extract():

// Create the output file; truncate it so bytes from an earlier, longer test.zip cannot linger
let mut zip_file = OpenOptions::new()
    .write(true)
    .create(true)
    .truncate(true)
    .open("C:\\Users\\ra069466\\Documents\\Software_Projects\\Rust_projects\\ha420_maint_app\\test_files\\test.zip")?;
let mut ole_iter = ole10_stream.bytes();

// Copy the ZIP byte by byte, stopping right after the EOCD record
let mut output_buffer = Vec::new();
while let Some(byte) = ole_iter.next() {
    let data = byte.unwrap();
    output_buffer.push(data);

    // Every ZIP signature starts with 'P' (0x50), so look for a PK tag here
    if data == 0x50 {
        let mut pk_array = [0u8; 4];
        pk_array[0] = data;
        for pk_idx in 1..4 {
            pk_array[pk_idx] = match ole_iter.next() {
                None => break,
                Some(x) => x.unwrap(),
            };
            output_buffer.push(pk_array[pk_idx]);
        }

        // Found the EOCD signature
        if pk_array == [0x50, 0x4B, 0x05, 0x06] {
            // Copy the 18 bytes of fixed-size fields that follow the signature
            for _ in 0..18 {
                match ole_iter.next() {
                    None => break,
                    Some(x) => output_buffer.push(x.unwrap()),
                }
            }
            break;
        }
    }
}
zip_file.write_all(&output_buffer)?;
zip_file.flush()?;

// Confirm that the rewritten file now opens as a ZIP archive
let zip = zip::read::ZipArchive::new(
    File::open("C:\\Users\\ra069466\\Documents\\Software_Projects\\Rust_projects\\ha420_maint_app\\test_files\\test.zip")?
)
    .unwrap();

zip_extensions::read::zip_extract(
    &PathBuf::from("C:\\Users\\ra069466\\Documents\\Software_Projects\\Rust_projects\\ha420_maint_app\\test_files\\test.zip"),
    &PathBuf::from("C:\\Users\\ra069466\\Documents\\Software_Projects\\Rust_projects\\ha420_maint_app\\test_files"),
);
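
One caveat with a forward scan like the one above is that the bytes "50 4B 05 06" could, in principle, also occur inside compressed file data. A possible variation, sketched here with only the standard library, is to work on the in-memory buffer, scan backwards, and accept a candidate only when its "CD start offset" plus "CD size" lands exactly on the candidate's own position. `trim_to_eocd` is a hypothetical helper name, and the sketch assumes the archive starts at offset 0 of the buffer and is not a Zip64 archive.

```rust
const EOCD_SIG: [u8; 4] = [0x50, 0x4B, 0x05, 0x06];
const EOCD_MIN_LEN: usize = 22; // signature + 18 bytes of fixed-size fields

/// Return the prefix of `buf` that ends right after the EOCD record
/// (including its comment), or `None` if no plausible record is found.
/// Assumption: the archive starts at offset 0 of `buf` and is not Zip64.
fn trim_to_eocd(buf: &[u8]) -> Option<&[u8]> {
    let last_start = buf.len().checked_sub(EOCD_MIN_LEN)?;
    // Scan backwards so the last plausible record in the buffer is found first
    for pos in (0..=last_start).rev() {
        if !buf[pos..].starts_with(&EOCD_SIG) {
            continue;
        }
        let rec = &buf[pos..pos + EOCD_MIN_LEN];
        let cd_size = u32::from_le_bytes([rec[12], rec[13], rec[14], rec[15]]) as usize;
        let cd_offset = u32::from_le_bytes([rec[16], rec[17], rec[18], rec[19]]) as usize;
        let comment_len = u16::from_le_bytes([rec[20], rec[21]]) as usize;
        let end = pos + EOCD_MIN_LEN + comment_len;
        // Sanity checks: the central directory must end exactly where this
        // record starts, and the comment must fit inside the buffer.
        if cd_offset.checked_add(cd_size) == Some(pos) && end <= buf.len() {
            return Some(&buf[..end]);
        }
    }
    None
}
```

The returned slice could then be wrapped in a `std::io::Cursor` and handed to a ZIP reader directly, which would avoid the round trip through a temporary test.zip file.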
flyinggreg

I can't speak for the other crates, but zip automatically seeks to the end of whatever reader (io::Read + io::Seek) you give it and then searches backwards. Without seeing your code, I'd guess that you're passing a reader that extends past the end of the contents of the ZIP file, so zip fails to recognise the contents.
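
To illustrate why extra bytes after the archive can matter to a search that starts at the end, here is a deliberately strict sketch. It is not the zip crate's actual code: `find_eocd_from_end` is a hypothetical function, and it assumes a reader that only accepts an EOCD record whose comment runs exactly to the end of the input.

```rust
const EOCD_SIG: [u8; 4] = [0x50, 0x4B, 0x05, 0x06];
const EOCD_MIN_LEN: usize = 22; // signature + 18 bytes of fixed-size fields

/// Search backwards from the end of `buf` for an EOCD record whose comment
/// ends exactly at the end of the input. Returns the record's offset.
fn find_eocd_from_end(buf: &[u8]) -> Option<usize> {
    let last_start = buf.len().checked_sub(EOCD_MIN_LEN)?;
    // A comment is at most u16::MAX bytes long, so a record that reaches the
    // end of the input cannot start any further back than this.
    let first_start = last_start.saturating_sub(u16::MAX as usize);
    (first_start..=last_start).rev().find(|&pos| {
        let comment_len = u16::from_le_bytes([buf[pos + 20], buf[pos + 21]]) as usize;
        buf[pos..].starts_with(&EOCD_SIG) && pos + EOCD_MIN_LEN + comment_len == buf.len()
    })
}
```

Under that assumption, a buffer that ends with the EOCD record is accepted, while the same buffer with unrelated bytes appended is rejected even though the signature is still present in it.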

Feel free to make an issue on our issue tracker if there's a specific feature you need. I'm happy to extend the crate's API if need be ^^

Edit: I looked into the other crates you've used, and they'd share this issue. rc-zip (the only one that doesn't use zip under the hood) has a ReadZip trait that starts searching at the end of whatever buffer you give it. You'd need to call ArchiveReader::new with the size you expect the internal ZIP file to be.

Plecra
  • Thanks, I tried rc-zip also and still have the issue of "not finding the EOCD". But like mentioned, I can hex dump and search through the file and see the EOCD. And, if I write the stream out, I can open it with 7zip. – flyinggreg Nov 20 '20 at 15:26
  • It's possible that you've encountered https://github.com/zip-rs/zip/pull/178 , which will be part of the next release of the crate. It'd be great if you could share the `ole_buffer` in [a new issue](https://github.com/zip-rs/zip/issues) so that we can investigate the issue :) – Plecra Nov 20 '20 at 16:40
  • Unfortunately I cannot provide that :( But, after reading the pull, I'm going to see if I can find the end of comment mark and stream to that point. Maybe I can get it to work. Do you know why I can read the EOCD and open the rewritten ZIP in 7zip, but rs-zip does not recognize the EOCD? Is it because of the comments? – flyinggreg Nov 20 '20 at 16:53