
Ok, so this is a unique question.

We receive files daily from a company; they are downloaded from their servers to ours over SFTP. That company works with a third-party provider that creates the files and reduces their size, both to make downloads faster and to save space on their servers.

We download 9 files daily from the server, in 3 groups of 3 files.
Each group consists of 2 XML files and one "image" file.
One of these XML files describes the 'image' file. The information we need from it:

  • offset: Gives us where a section of data starts
  • length: Used with offset, gives us the end of that section
  • count: Gives us the number of elements held in the file


The 'image' file itself is unusable until we split the file into pieces based on the offset and length of each image in the file. The images are basically concatenated together. We need to extract these images to be able to view them.

Example offset, length and count values:

offset: 0
length: 2670

offset: 2670
length: 2670

offset: 5340
length: 2670

offset: 8010
length: 2670

count: 4

This means that there are 4 (count) items. The first item begins at offset[0] and is length[0] bytes long. The second item begins at offset[1] and is length[1] bytes long, etc.

I need to split the file at PRECISELY these points, with no room for error. The third-party provider will not provide us with the code, so we have to figure this out ourselves. The image file is not readable until it has been split and is essentially useless until then.


My question: Does anyone have a way of splitting files at a specific byte?

P.S. I do not have any code yet. I don't even know where to begin with this one. I am not new to coding, but I have never done file splitting by the byte.

I don't care which language this uses. I just need to make it work.


EDIT
The OS is Windows
ctwheels
  • What OS is your server running? – lurker Jun 08 '15 at 15:24
  • You should ask this question on SuperUser instead. This is not a programming-specific question. It can be easily done with some common linux utilities. – d3dave Jun 08 '15 at 15:27
  • @lurker I edited my question to include OS information. Windows – ctwheels Jun 08 '15 at 15:27
  • I think your question is a bit too broad ATM. Can you suggest a language you would most likely use to implement the solution? Also, just to be clear, you are simply asking how to accomplish file splitting. You are NOT asking for a working solution, correct? – FGreg Jun 08 '15 at 16:39
  • Here's an example of [reading a file n bytes at a time in Java](http://stackoverflow.com/a/5077349/953327). – FGreg Jun 08 '15 at 16:43
  • @FGreg Yes that is correct, I am not looking for a working solution. That is asking a bit much. If someone has one, I won't complain, but I just want an idea for moving forward. This project has me stumped and the 3rd party provider that created the concatenated file will not share any code with us (rendering their files useless to us until we can find a solution). The language we are looking to use: Any. I don't care which language, I just need it to work. Batch, java, c#, javascript, I really don't care, pick one haha – ctwheels Jun 08 '15 at 17:13
  • @FGreg I am trying to implement something with the solution from the post you commented as an example. My batch file is not reading special characters as they should be. By making the image file a text file, special characters are everywhere and it breaks my script: `½ÒÚÿ¶úŽq__ø_Âø]ælá˜3ç™À††^8Ð3dp̃/2ìÍœ2ñ ´³` that is just an example of a line that makes up the image file – ctwheels Jun 08 '15 at 17:44
  • @FGreg So we downloaded about 30 different programs today, with viruses most likely as well, and finally figured out that the filetype is not in a picture format, but is instead encoded or compressed into the CCITT 4 Fax format. This means that the underlying file is a TIFF format. We cannot open the images because we cannot decompress the CCITT 4 Fax format into a TIFF. Our issue is less the separation of the file itself and more the decompression of the file. – ctwheels Jun 10 '15 at 00:26
  • To add to the previous comment, the filetype is not given to us. It adopted an extension from its filename (given to us by the 3rd party provider) so the file's type ended up becoming something stupid like G8989 and stuff because of the naming convention used. One of those random programs we downloaded recognized the file as being CCITT 4 Fax format. – ctwheels Jun 10 '15 at 00:28
  • This might help? http://stackoverflow.com/questions/17770071/splitting-a-multipage-tiff-image-into-individual-images-java – FGreg Jun 10 '15 at 15:44
  • @FGreg Ok, so I built a PHP script around this. What an absolute pain in the @$$ this was. The file we received was a TIFF file with concatenated images (idk why they did that) where only the first image was viewable until the image files were split. I created a script to do this. I will be looking to sell our code to the company we deal with, and for that reason I will not be posting it here as an answer. I appreciate you looking into solutions for us though. It did help us find a solution in the end, even though we are using a different programming language – ctwheels Jun 17 '15 at 19:48

1 Answer


You hooked me. Here's a rough Java method that can split a file based on offset and length. This requires at least Java 8.

A few of the classes used: SeekableByteChannel, ByteBuffer, FileChannel and StandardOpenOption.

And an article I found helpful in producing this example.

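import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.channels.SeekableByteChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

/**
 * Method that splits the data provided in fileToSplit into outputDirectory based on the
 * collection of offsets and lengths provided in offsetAndLength.
 * 
 * Example of input offsetAndLength:
 *      Long[][] data = new Long[][]{
 *          {0L, 2670L},
 *          {2670L, 2670L},
 *          {5340L, 2670L},
 *          {8010L, 2670L}
 *      };
 * 
 * Output files will be placed in outputDirectory and named img0, img1... imgN
 * 
 * @param fileToSplit the concatenated file to read from
 * @param outputDirectory an existing directory that receives the pieces
 * @param offsetAndLength one {offset, length} pair per piece, in bytes
 * @throws IOException if reading or writing fails, or the file ends before a piece is complete
 */
public static void split( Path fileToSplit, Path outputDirectory, Long[][] offsetAndLength ) throws IOException{

    try (SeekableByteChannel sbc = Files.newByteChannel(fileToSplit, StandardOpenOption.READ )){
        for(int x = 0; x < offsetAndLength.length; x++){

            // Index 0 of each pair is the offset, index 1 is the length.
            ByteBuffer buffer = ByteBuffer.allocate(offsetAndLength[x][1].intValue());
            sbc.position(offsetAndLength[x][0]);

            // A single read() may return fewer bytes than requested, so loop until the buffer is full.
            while(buffer.hasRemaining()){
                if(sbc.read(buffer) < 0){
                    throw new IOException("Unexpected end of file while reading piece " + x);
                }
            }

            buffer.flip();
            Path img = outputDirectory.resolve("img"+x);

            // CREATE + TRUNCATE_EXISTING replaces output from an earlier run instead of overwriting it in place.
            try(FileChannel output = FileChannel.open(img, StandardOpenOption.CREATE,
                    StandardOpenOption.WRITE, StandardOpenOption.TRUNCATE_EXISTING)){
                while(buffer.hasRemaining()){
                    output.write(buffer);
                }
            }
        }
    }

}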
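
Before splitting, it may be worth sanity-checking the offset/length values against the count and the size of the image file. The following is only a minimal sketch: the class name OffsetCheck and the method check are made up for illustration, and it assumes the pieces are stored back to back with no gaps, as in the example values from the question.

```java
import java.util.ArrayList;
import java.util.List;

public class OffsetCheck {

    /**
     * Returns a list of problems found in the offset/length pairs. An empty list means
     * the pairs agree with the declared count and fit inside a file of fileSize bytes.
     */
    public static List<String> check(long[][] offsetAndLength, int count, long fileSize) {
        List<String> problems = new ArrayList<>();
        if (offsetAndLength.length != count) {
            problems.add("expected " + count + " pairs but got " + offsetAndLength.length);
        }
        long expectedOffset = 0;
        for (int i = 0; i < offsetAndLength.length; i++) {
            long offset = offsetAndLength[i][0];
            long length = offsetAndLength[i][1];
            if (offset < 0 || length < 0) {
                problems.add("item " + i + " has a negative offset or length");
            }
            // Item i covers bytes offset .. offset + length - 1.
            if (offset + length > fileSize) {
                problems.add("item " + i + " runs past the end of the file");
            }
            // Assumption: pieces are stored back to back, as in the example values.
            if (offset != expectedOffset) {
                problems.add("item " + i + " starts at " + offset + ", expected " + expectedOffset);
            }
            expectedOffset = offset + length;
        }
        return problems;
    }

    public static void main(String[] args) {
        long[][] pairs = { {0, 2670}, {2670, 2670}, {5340, 2670}, {8010, 2670} };
        // The example values add up to 4 * 2670 = 10680 bytes.
        System.out.println(check(pairs, 4, 10680L));
    }
}
```

If the real files contain padding or gaps between pieces, the back-to-back check would report false problems and should be dropped.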

I leave parsing the XML file to you.
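
That said, here is a rough sketch of what that step might look like with the DOM parser that ships with the JDK. The real XML layout is not shown in the question, so the element names "offset" and "length", the sample document, and the class name OffsetXmlReader are all hypothetical; adjust them to whatever the actual file uses.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.SAXException;

public class OffsetXmlReader {

    /**
     * Reads {offset, length} pairs from XML. The element names "offset" and "length"
     * are hypothetical; substitute whatever the real index file uses.
     */
    public static Long[][] read(InputStream xml)
            throws ParserConfigurationException, SAXException, IOException {
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().parse(xml);
        // Both lists come back in document order, so entry i of each belongs to item i.
        NodeList offsets = doc.getElementsByTagName("offset");
        NodeList lengths = doc.getElementsByTagName("length");
        if (offsets.getLength() != lengths.getLength()) {
            throw new IOException("offset and length counts differ");
        }
        Long[][] pairs = new Long[offsets.getLength()][2];
        for (int i = 0; i < pairs.length; i++) {
            pairs[i][0] = Long.valueOf(offsets.item(i).getTextContent().trim());
            pairs[i][1] = Long.valueOf(lengths.item(i).getTextContent().trim());
        }
        return pairs;
    }

    public static void main(String[] args) throws Exception {
        // Made-up sample document, only to exercise the method.
        String sample = "<index><item><offset>0</offset><length>2670</length></item>"
                + "<item><offset>2670</offset><length>2670</length></item></index>";
        Long[][] pairs = read(new ByteArrayInputStream(sample.getBytes(StandardCharsets.UTF_8)));
        for (Long[] pair : pairs) {
            System.out.println(pair[0] + " " + pair[1]);
        }
    }
}
```

The returned Long[][] has the same shape as the offsetAndLength argument of split, so the two can be chained directly.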

FGreg
  • I have one-upped your answer. It seems there is more to the issue I am having. I added another comment under my question – ctwheels Jun 10 '15 at 00:23