Read binary data (image) from separated file

Question

I have a file that holds employee records, both text data and images. Each record covers one employee: his data, his image, and his wife's image. I can't change the file structure.

There are separators between text data and images.

Here is a sample of one record:

record number D01= employee name !=IMG1= employee image ~\IMG2= wife image ^! \r\n

The separators are D01=, !=IMG1=, ~\IMG2= and ^!

This is the code that wrote the file:

FileStream fs = new FileStream(filePath, FileMode.Create);
StreamWriter sw = new StreamWriter(fs, Encoding.UTF8);
BinaryWriter bw = new BinaryWriter(fs);

sw.Write(employeeDataString);
sw.Write("!=IMG1=");
sw.Flush();

bw.Write(employeeImg, 0, employeeImg.Length);
bw.Flush();

sw.Write(@"~\IMG2=");
sw.Flush();

bw.Write(wifeImg, 0, wifeImg.Length);
bw.Flush();

sw.Write("^!");
sw.Flush();

sw.Write(@"\r\n");
sw.Flush();

So how do I read that file?

Fundamentally, you've got problems there - unless you read the image data to determine the length of the file there (if it even supports it!) you can't reliably detect when the file ends and the next part of the entry starts. It's a broken file format, basically. — Jon Skeet, Sep 23 '14 at 14:01
And what makes you think that an image file can't contain the bytes which - when interpreted as text - are `\r\n`? — Jon Skeet, Sep 23 '14 at 14:13
\r\n are reserved for new line! image can't have it as far as i know! — 404_File_Not_Found, Sep 23 '14 at 14:16
No, you're completely incorrect. An image format can do whatever it wants to. Imagine a "raw" image type which has a header consisting of just the dimensions, then 3 bytes per RGB pixel (one byte red, one byte green, one byte blue). Now imagine a pixel which has a red value of 13 and a green value of 10. You'd end up with the bytes which are the ASCII equivalent of `\r\n` being embedded in the file. You need to understand that fundamentally, a file's contents is *just* a sequence of bytes. It's up to the reader to interpret it appropriately. — Jon Skeet, Sep 23 '14 at 14:17
There is also an end separator for the image, which is **^!**, so the complete record ends with **^!\r\n**. What are the odds of having that sequence in an image? — 404_File_Not_Found, Sep 23 '14 at 14:22
Oh it may well be *unlikely* - but that's a long way from being *impossible*. The file format is fundamentally broken - and even if you could somehow guarantee that that byte sequence would never come up, it's painful to scan for it vs knowing how much data to read to start with. — Jon Skeet, Sep 23 '14 at 14:24
Separators make no sense in binary data. What image format is it? BMP with fixed dimensions? Then you've got a good chance. JPG? That can't work reliably. But you still may (and probably will need to) write lenient code that can read the broken data in a semi-automatic way and convert it to something proper. — TaW, Sep 23 '14 at 14:24
So let's assume that the file is not broken for the sake of the question please, so how to read that file? — 404_File_Not_Found, Sep 23 '14 at 14:26
Well I would read it in as a whole and split it by \r\n. Then test each chunk if there is a recognizable image in it and if not re-insert the `\r\n` and join with the next chunk.. — TaW, Sep 23 '14 at 14:29
In a jpg file the first bytes are this header: `JPEG SOI marker (FFD8 hex)`. That can be the (1st) test; if a chunk fails it, join it with the next; if it passes, either display the image to a human being and let it be confirmed or keep going and see what happens. — TaW, Sep 23 '14 at 14:32
Looking closer at the code, I guess you should say that the full record always ends on `^!\r\n`. Now four bytes is an almost pretty good separator. Of course it would be easy to create images that break the format, so it is hackable, but that's not your concern I guess. — TaW, Sep 23 '14 at 14:49
@TaW yes, that's what i said the separator is pretty good and i know it is working because the issuing machine accepted that file without any problem and read the data and the image. — 404_File_Not_Found, Sep 23 '14 at 18:13
Do you know how big the file can be? Can you read it into memory completely? Since it only contains separators and no length indicators, reading it in good chunks isn't possible. — TaW, Sep 23 '14 at 20:20
Just because this didn't break so far, it doesn't mean it's a good format. It's actually far from good, and you've been pretty lucky no image happened to have one of these sequences. Also note that there are several pixel combinations (presuming raw BMP) which might have the end separator (`5E 21 0D 0A`) sequence (e.g. two RGB pixels `xxxx5E` `210D0A`, or `xx5E21` `0D0Axx`, or `5E210D` `0Axxxx`). Mathematically speaking, the odds are low, but with every new employee you are one tiny step closer to failure. Not to mention performance issues (i.e. no possibility for random access). — vgru, Sep 24 '14 at 07:24
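
The pixel combinations mentioned in the last comment can be checked directly. The following is only a minimal sketch, assuming the record terminator is the four bytes 5E 21 0D 0A and using made-up pixel values; `SeparatorCollisionDemo` and `IndexOf` are hypothetical names that do not appear in the original code.

```csharp
using System;
using System.Text;

static class SeparatorCollisionDemo
{
    // Returns the index of the first occurrence of pattern in data, or -1.
    public static int IndexOf(byte[] data, byte[] pattern)
    {
        for (int i = 0; i + pattern.Length <= data.Length; i++)
        {
            int j = 0;
            while (j < pattern.Length && data[i + j] == pattern[j]) j++;
            if (j == pattern.Length) return i;
        }
        return -1;
    }

    static void Main()
    {
        // Two hypothetical 24-bit RGB pixels: (12, 34, 5E) and (21, 0D, 0A).
        byte[] pixels = { 0x12, 0x34, 0x5E, 0x21, 0x0D, 0x0A };

        // Assumed terminator: '^', '!', CR, LF.
        byte[] separator = Encoding.ASCII.GetBytes("^!\r\n");

        // The terminator starts at index 2, in the middle of the pixel data.
        Console.WriteLine(IndexOf(pixels, separator));
    }
}
```

A reader that scans for the terminator would cut this image short after its second byte, which is why the odds being low does not make the format safe.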

TaW · Answer 1 · 2014-09-24T07:48:26.687

There are many kinds of files; the three most common ways to store records are:

Fixed-size records, ideally with fixed-size fields. Random access is very simple to implement.
Tagged files, with tags and data interwoven. A bit complicated, but highly flexible and still rather efficient to read, since the tags hold the positions and lengths of the data.
Separated files. Always a pain.
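
For comparison only, here is a rough sketch of a layout that stores lengths instead of separators. It is not the asker's format (which cannot be changed), and `LengthPrefixedRecords`, `WriteRecord` and `ReadRecord` are hypothetical names chosen for illustration.

```csharp
using System;
using System.IO;
using System.Text;

static class LengthPrefixedRecords
{
    // Writes one record: the name, then each image as a 4-byte length plus its bytes.
    public static void WriteRecord(BinaryWriter bw, string name, byte[] img1, byte[] img2)
    {
        bw.Write(name);          // BinaryWriter stores the string with a length prefix
        bw.Write(img1.Length);
        bw.Write(img1);
        bw.Write(img2.Length);
        bw.Write(img2);
    }

    // Reads one record back; the stored lengths say exactly how much to read.
    public static void ReadRecord(BinaryReader br, out string name, out byte[] img1, out byte[] img2)
    {
        name = br.ReadString();
        img1 = br.ReadBytes(br.ReadInt32());
        img2 = br.ReadBytes(br.ReadInt32());
    }

    static void Main()
    {
        var ms = new MemoryStream();
        using (var bw = new BinaryWriter(ms, Encoding.UTF8, true)) // true = leave ms open
        {
            // The first "image" deliberately contains the bytes of "^!" CR LF.
            WriteRecord(bw, "Alice", new byte[] { 0x5E, 0x21, 0x0D, 0x0A }, new byte[] { 1, 2, 3 });
        }

        ms.Position = 0;
        using (var br = new BinaryReader(ms, Encoding.UTF8))
        {
            ReadRecord(br, out string name, out byte[] a, out byte[] b);
            Console.WriteLine($"{name}: {a.Length} and {b.Length} bytes");
        }
    }
}
```

Because the reader never interprets the image bytes, any byte sequence inside an image round-trips unchanged.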

Separated files have two issues:

You must be sure that the separators never occur in the data. That cannot be fully guaranteed when the data is binary, like images.
There is no efficient way to access individual records.

Ignoring the first issue, here is a piece of code that reads all records into a list of ARecord objects.

using System.Collections.Generic;
using System.Drawing;
using System.IO;
using System.Text;

FileStream fs;
BinaryReader br;
List<ARecord> theRecords;

class ARecord
{
    public string name { get; set; }
    public Image img1 { get; set; }
    public Image img2 { get; set; }
}

int readFile(string filePath)
{
    fs = new FileStream(filePath, FileMode.Open, FileAccess.Read);
    br = new BinaryReader(fs, Encoding.UTF8);

    theRecords = new List<ARecord>();
    try
    {
        ARecord record = getNextRecord();
        while (record != null)
        {
            theRecords.Add(record);
            record = getNextRecord();
        }
    }
    finally
    {
        br.Close();   // also closes the underlying FileStream
    }
    return theRecords.Count;
}

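ARecord getNextRecord()
{
    ARecord record = new ARecord ();

    MemoryStream ms;
    System.Text.UTF8Encoding enc = new System.Text.UTF8Encoding();
    // Verbatim (@) strings: \r\n below is the four characters \ r \ n,
    // matching the sw.Write(@"\r\n") call in the code that wrote the file.
    byte[] sepImg1 = enc.GetBytes(@"!=IMG1=");
    byte[] sepImg2 = enc.GetBytes(@"~\IMG2=");
    byte[] sepRec = enc.GetBytes(@"^!\r\n");

    // A StreamWriter created with Encoding.UTF8 normally emits a byte order
    // mark at the start of the file, so strip it from the first record's text.
    record.name = enc.GetString(readToSep(sepImg1)).TrimStart('\uFEFF');

    // Image.FromStream needs its stream to stay open for the lifetime of the
    // Image, so the MemoryStreams are deliberately not disposed here.
    ms = new MemoryStream(readToSep(sepImg2));
    if (ms.Length <= 0) return null;             // check for EOF
    record.img1 = Image.FromStream(ms);

    ms = new MemoryStream(readToSep(sepRec));
    record.img2 = Image.FromStream(ms);

    return record;
}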

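byte[] readToSep(byte[] sep)
{
    List<byte> data = new List<byte>();
    bool eor = false;                 // set once the full separator has been read
    int sLen = sep.Length;
    int sPos = 0;                     // number of separator bytes matched so far
    while (br.BaseStream.Position < br.BaseStream.Length && !eor)
    {
        byte b = br.ReadByte();
        data.Add(b);
        if (b == sep[sPos])
        {
            sPos++;
            if (sPos == sLen) eor = true;
        }
        else
        {
            // Restart the match; the current byte may itself begin a separator.
            // This simple restart is enough for the separators used here, because
            // their first byte does not occur again inside them.
            sPos = (b == sep[0]) ? 1 : 0;
        }
    }
    // Strip the separator only if it was actually found (not when EOF was hit).
    if (eor) data.RemoveRange(data.Count - sLen, sLen);
    return data.ToArray();
}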

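Notes:

There is no error checking whatsoever.
Watch those separators! The @ makes \r\n four literal characters rather than CR LF; that matches the writer code in the question, but is it really what the file contains?
Extending the code to extract the record number is left to you.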
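
To make the note about the @ concrete, this small sketch prints the bytes produced by the verbatim and the regular form of the record terminator. `VerbatimStringDemo` and `Hex` are hypothetical helper names used only for this illustration.

```csharp
using System;
using System.Text;

static class VerbatimStringDemo
{
    // Returns the UTF-8 bytes of s as dash-separated hex.
    public static string Hex(string s)
    {
        return BitConverter.ToString(Encoding.UTF8.GetBytes(s));
    }

    static void Main()
    {
        Console.WriteLine(Hex(@"^!\r\n")); // 5E-21-5C-72-5C-6E (backslashes kept literally)
        Console.WriteLine(Hex("^!\r\n"));  // 5E-21-0D-0A (real CR LF)
    }
}
```

Whichever form the file actually contains is the one the reader has to search for, so it is worth checking the file in a hex viewer before settling on either.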

This should work, but I must say your naming convention is a bit weird for C# (class name with camel casing, fields in uppercase). — vgru, Sep 24 '14 at 07:31
You are right. I have corrected it. It came from having two versions in one solution in parallel. Sometimes I like to use short Caps names for local object references, though, to make them stand out like `Label L = (Label)sender` — TaW, Sep 24 '14 at 07:51
