Separate data in a text file

Question

I have a big chunk of data (hexdump) which includes thousands of small images and the structure of the data is something like this.

20 00 20 00 00 10 00 00 <data> 20 00 20 00 00 10 00 00 <data> ...

Where the (20 00 20 00 00 10 00 00) is the separation between each section of data (image).

The file myfile including the whole hexdump looks something like this

3C 63 9E FF 38 5F 9E FF
31 59 91 FF 20 00 20 00
00 10 00 00 55 73 A2 FF
38 5D 9C FF 3A 5E 95 FF

What I want to do is basically separate it. I want to take the part which is separated by 20 00 20 00 00 10 00 00 and put each part in a txt file as 1.txt, 2.txt ... n.txt

I tried reading line by line, but that causes problems because the 20 00 ... sequence is sometimes split across two lines, as in the example above, so it won't find every occurrence.

while (getline(myfile,line,'\n')){
    if (line == "20 00 20 00 00 10 00 00")
        ...
}

So the actual contents of the file is hexadecimal numbers in text-form? It's not a binary file? — Some programmer dude, Jun 15 '16 at 08:07
Correct, the content is in a text file. I thought that would be easier to work with so I dumped it in a text file. I have access to the binary file if that's better. — Michael, Jun 15 '16 at 08:10
Also, it doesn't seem that the records are necessarily line-separated, which means you can't use line-by-line reading. And even if there are line breaks at the correct place you can't use the `==` comparison, since the lines contain more than the separator and comparing two strings using `==` looks for an *exact* match. And if the file is really not a text file but all binary data, you can't use `std::getline` and string comparison at all. — Some programmer dude, Jun 15 '16 at 08:11
It is text, and I used getline. I got only about 168 matches (should be thousands) but that's because it found the exact match, but as I said it will also occur in overlapping lines (as the example above). — Michael, Jun 15 '16 at 08:12
You're actually dealing with some kind of stream (you don't know when data starts or ends), so you should read it as a stream, detect your "magic sequences" and thus split it into chunks of data (images). — matthias, Jun 15 '16 at 08:16
Not sure how you can tell if the byte sequence is a genuine delimiter or part of one of the embedded images. Maybe use an image library to read images from the file one by one and manually skipping over the delimiters after each image read? — Galik, Jun 15 '16 at 10:43
How big is the original binary file? What OS do you have available? — Mark Setchell, Jun 18 '16 at 08:02
The binary file is about 35 MB. I'm running Arch Linux but I also got Windows available — Michael, Jun 18 '16 at 08:03

score 2 · Answer 1 · answered Jun 15 '16 at 08:16

My suggestion is to read the binary file. If it's small enough you can read it all into memory in one go, otherwise I suggest you use the operating system to map the file into memory (or at least a "window" of it).

Then it's quite easy to find the 8-byte sequence separating the records. First simply search for 0x20, and whenever that is found you see if it's the start of the whole separator sequence.

When you find the separator sequence you take the saved position of the previous separator, and the position of the newly found separator, and the data between is the data you want. Save the position of the newly found separator as the old position, and continue searching for the next separator.
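The steps above can be sketched roughly as follows. This is only an illustration, assuming the whole binary file has already been read into a `std::string`; the helper name `splitRecords` is made up for the example, and `std::search` stands in for the manual "look for 0x20, then check the rest" loop.

```cpp
#include <algorithm>
#include <array>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Hypothetical helper: returns the [begin, end) offsets of every record,
// i.e. the data between consecutive separators (separators excluded).
std::vector<std::pair<std::size_t, std::size_t>>
splitRecords(const std::string& data) {
    static const std::array<char, 8> sep = {
        0x20, 0x00, 0x20, 0x00, 0x00, 0x10, 0x00, 0x00 };

    std::vector<std::pair<std::size_t, std::size_t>> records;
    auto recordStart = data.begin();  // position just after the previous separator
    while (true) {
        // Find the next separator at or after recordStart
        auto hit = std::search(recordStart, data.end(), sep.begin(), sep.end());
        if (hit != recordStart) {     // skip empty records
            records.emplace_back(
                static_cast<std::size_t>(recordStart - data.begin()),
                static_cast<std::size_t>(hit - data.begin()));
        }
        if (hit == data.end()) break; // no further separator: last record done
        recordStart = hit + sep.size();
    }
    return records;
}
```

Each pair can then be written out with `data.substr(first, second - first)` to a file opened in binary mode. For a 35 MB file, reading everything into memory first should not be a problem.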

score 1 · Accepted Answer · answered Jun 18 '16 at 13:22

Definitely work from the binary file, with actual bytes, rather than the text dump. The text form takes about three times the space (two hex digits plus a separator for every byte), and the code to read binary is easier to write.

That being said, if your file is in binary, this is the solution:

#include <cstdio>
#include <cstring>
#include <fstream>

using std::ifstream;
using std::ofstream;

// Turn "N.dat" into "(N+1).dat" in place
void incrementFilename(char* filename) {
  int iFile = 0;
  std::sscanf(filename, "%d.dat", &iFile);
  std::sprintf(filename, "%d.dat", ++iFile);
}

int main() {
  char outputFilename[16] = "1.dat";
  ifstream input("myfile.dat", ifstream::binary);
  if (!input.is_open()) {
    return 1;  // nothing to split
  }
  ofstream output(outputFilename, ofstream::binary);

  const char testcase[7] = { 0x00, 0x20, 0x00, 0x00, 0x10, 0x00, 0x00 };
  char readbyte;
  while (input.get(readbyte)) {  // stops cleanly at end of file
    if (readbyte == 0x20) {
      // Peek at the next 7 bytes, then rewind if they are not the separator
      const std::streampos mark = input.tellg();
      char remaining[7];
      input.read(remaining, 7);
      // memcmp, not strncmp: the separator contains zero bytes
      if (input.gcount() == 7 && std::memcmp(remaining, testcase, 7) == 0) {
        incrementFilename(outputFilename);
        output.close();
        output.open(outputFilename, ofstream::binary);
        continue;
      }
      input.clear();      // a short read near the end of the file sets failbit
      input.seekg(mark);  // re-scan these bytes; one of them may start a separator
    }
    output.write(&readbyte, 1);
  }

  return 0;
}

John Burger · Answer 3 · 2016-06-18T08:42:47.697

Given that the actual data sequence you're after is potentially split across lines, you need to read the data in the smallest "bite" you can - two-character arrays - and ignore whitespace (the space or newline delimiters).

Once you do this, you can keep track of what you've read as you write it to your sub-file. Once you get your "magic sequence", start a new sub-file.

Two complexities that you don't cover:

Is the "magic sequence" at all possible to exist in a file as part of the normal data? If so, you're going to split an otherwise-single file.
I assume you don't want the "magic sequence" at the end of every sub-file. That's going to add a little complexity to your comparison:
- If you start to match, you need to suspend writing to the sub-file.
- If you get halfway through and suddenly stop matching, you're going to have to write out the partial match before writing out the new non-matching entry.
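The suspend-and-flush logic in those two points can be sketched on its own, away from any file handling. This is an illustration under assumptions, not the program below: the class name `Splitter` and the tables `kMagic` and `kFail` are invented for the example, and the small fallback table (the prefix table used by Knuth-Morris-Pratt matching) is there so that a separator starting inside a failed partial match is still found.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// The separator, and for each matched length the longest shorter prefix
// that is still a possible match start (both names are illustrative).
const std::uint8_t kMagic[8] = { 0x20, 0x00, 0x20, 0x00, 0x00, 0x10, 0x00, 0x00 };
const std::size_t  kFail[8]  = { 0, 0, 1, 2, 0, 0, 0, 0 };

// Hypothetical streaming splitter: feed() takes one byte at a time and
// collects finished chunks; the separator itself is never stored.
class Splitter {
public:
    std::vector<std::vector<std::uint8_t>> chunks;

    void feed(std::uint8_t b) {
        // Mismatch: release the held-back bytes that can no longer start a match
        while (held > 0 && b != kMagic[held]) {
            std::size_t keep = kFail[held - 1];
            current.insert(current.end(), kMagic, kMagic + (held - keep));
            held = keep;
        }
        if (b == kMagic[held]) {              // matching: suspend writing
            if (++held == sizeof(kMagic)) {   // whole separator seen
                chunks.push_back(current);    // may be empty between two separators
                current.clear();
                held = 0;
            }
        } else {
            current.push_back(b);             // ordinary data byte
        }
    }

    void finish() {  // end of input: a partial match was ordinary data after all
        current.insert(current.end(), kMagic, kMagic + held);
        held = 0;
        if (!current.empty()) chunks.push_back(current);
        current.clear();
    }

private:
    std::vector<std::uint8_t> current;  // chunk being built
    std::size_t held = 0;               // separator bytes matched but not yet written
};
```

Feed it every byte parsed from the dump, call `finish()` at the end, and write each entry of `chunks` to its own sub-file.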

One advantage in doing it this way: if a sub-file, while still inside the main file, started near the end of a line, it will start at a new line and break after 16 two-characters rather than mimic its position in the main file. Or did you want the sub-files output in true bytes, without space delimiters?

I'm going to go away and write this program: it sounds like fun!

OK, I wrote the following. Hopefully the Usage describes what to do. I didn't particularly want to use streams - I find them horribly inefficient - but you started it...

//
// SubFile.cpp
//

#include <string>
#include <fstream>
#include <iostream>
#include <iomanip>

using namespace std;

const unsigned MaxBytesPerLine = 16;

const unsigned char magic[] = { '\x20','\x00','\x20','\x00','\x00','\x10','\x00','\x00' };

class OutFile : private ofstream {
public: // Methods
    using ofstream::is_open; // Let others see whether I'm open
    OutFile(const string &fileName, bool bin);
    bool Write(unsigned b);
    ~OutFile();
private: // Variables
    unsigned num; // Number bytes in line
    bool bin; // Whether to output binary
}; // OutFile

OutFile::OutFile(const string &filename, bool bin) :
         ofstream(filename, bin ? ios::out | ios::binary : ios::out), // No newline translation in binary mode
         num(0),
         bin(bin) {
    if (!bin) {
        setf(uppercase);
    } // if
} // OutFile::OutFile(name, bin)

bool OutFile::Write(unsigned b) {
    if (bin) {
        char c = (char)b; // Endian fix!
        return write(&c, 1).good();
    } // if
    if (num > 0) {
        *this << " ";
    } // if
    *this << setbase(16) << setw(2) << setfill('0') << b;
    if (++num == MaxBytesPerLine) {
        *this << endl;
        num = 0;
    } // if
    return good();
} // OutFile::Write(b)

OutFile::~OutFile() {
    if (bin) {
        return;
    } // if
    if (num == 0) {
        return;
    } // if
    if (!good()) {
        return;
    } // if
    *this << endl;
} // OutFile::~OutFile

void Usage(char *argv0) {
    cout << "Usage:" << endl;
    cout << "     " << argv0 << " <filename.txt> [bin]" << endl;
    cout << "  Read <filename.txt> in hex char pairs, ignoring whitespace." << endl;
    cout << "  Write pairs out to multiple sub-files, called \"1.txt\", \"2.txt\" etc." << endl;
    cout << "  New files are started when the following sequence is detected: " << endl << " ";
    for (unsigned i = 0; i < sizeof(magic); ++i) {
        cout << ' ' << hex << setw(2) << setfill('0') << (int)magic[i];
    } // for
    cout << endl;
    cout << "  If bin is specified: write out in binary, and files have a '.bin' extension" << endl;
} // Usage(argv0)

int main(int argc, char *argv[]) {
    if (argc < 2) {
        Usage(argv[0]);
        return 1;
    } // if
    ifstream inFile(argv[1]);
    if (!inFile.is_open()) {
        cerr << "Could not open '" << argv[1] << "'!" << endl;
        Usage(argv[0]);
        return 2;
    } // if

    bool bin = (argc >= 3) &&
               (argv[2][0] == 'b'); // Close enough!
    unsigned fileNum = 0; // Current output file number

    inFile >> setbase(16); // All inFile accesses will be like this
    while (inFile.good()) { // Let's get started!
        string outFileName = to_string(++fileNum) + (bin ? ".bin" : ".txt");
        OutFile outFile(outFileName, bin);
        if (!outFile.is_open()) {
            cerr << "Could not create " << outFileName << "!" << endl;
            return (int)(fileNum + 2);
        } // if

        unsigned b; // byte read in
        unsigned pos = 0; // Position in 'magic'
        while (inFile >> b) {
            if (b > 0xFF) {
                cerr << argv[1] << " contains illegal value: "
                     << hex << uppercase << showbase << b << endl;
                return -1;
            } // if
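            // KMP failure table for 'magic': fail[i] is the length of the longest
            // proper prefix of magic that is also a suffix of magic[0..i]
            static const unsigned fail[sizeof(magic)] = { 0, 0, 1, 2, 0, 0, 0, 0 };
            while (pos > 0 && b != magic[pos]) { // Uh oh. No more magic!
                unsigned keep = fail[pos - 1];   // Tail that may still start a match
                for (unsigned i = 0; i < pos - keep; ++i) {
                    outFile.Write(magic[i]);     // So write out what can't match
                } // for
                pos = keep;
            } // while
            if (b == magic[pos]) {            // Found some magic!
                if (++pos == sizeof(magic)) { // ALL the magic?
                    break;                    // Leave!
                } // if
                continue;                     // Otherwise go back for more
            } // if
            outFile.Write(b);
        } // while
        if (pos < sizeof(magic)) {           // Input ended inside a partial match
            for (unsigned i = 0; i < pos; ++i) {
                outFile.Write(magic[i]);     // Those bytes were data after all
            } // for
        } // if
    } // while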
    if (inFile.eof()) {
        return 0; // Success!
    } // if

    string s;
    inFile.clear();
    getline(inFile, s);
    cerr << argv[1] << " contains invalid data: " << s << endl;
    return -2;
} // main(argc,argv)

Whenever someone posts code, there are invariably comments posted:
"Why didn't you do this?"
"Why did you do that?"
Let the floodgates open!

score 0 · Answer 4 · answered Jun 18 '16 at 08:17

Here's my solution. It's a bit inefficient, but I may rewrite it once I'm done with my finals. I assume the file contains bytes of data separated by whitespace. The problem is then simply a pattern-matching problem. I could use more sophisticated techniques, but the pattern has a small, fixed size, so even a brute-force approach runs in linear time.

The code is self-explanatory. I read the file one byte at a time and append each byte to a buffer. This is not very efficient: keeping only a window of the data, with index boundaries into the file, would allow more efficient I/O when creating the new files. Once a terminating sequence is found, we pop it off the buffer and save the rest to a file (I assumed that we don't want empty files).

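#include <fstream>
#include <iomanip>
#include <sstream>
#include <string>
#include <vector>

void save(const std::vector<short>& bytes, std::string filename, int sequenceLength)
{
    if (bytes.empty()) return; // Don't want empty files

    std::ofstream outputFile(filename);
    outputFile << std::uppercase << std::hex << std::setfill('0');
    int i = 0;
    for (short byte : bytes)
    {
        // setw(2) keeps the leading zero, so 0x05 is written as "05" rather than "5"
        outputFile << std::setw(2) << byte;

        i = (i + 1) % sequenceLength;
        if (i) outputFile << " ";
        else   outputFile << std::endl;
    }
}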
}

std::string getFilename(int number)
{
    std::stringstream ss;
    ss << number << ".txt";
    return ss.str();
}

short getIntFromHex(const char* buffer)
{
    short result;
    std::stringstream ss;
    ss << std::hex << buffer;
    ss >> result;
    return result;
}

bool findTerminatingSequence(const std::vector<short>& bytes, short terminatingSequence[], int sequenceLength)
{
    int i = 0;
    int startIndex = bytes.size() - sequenceLength;
    for (; i < sequenceLength; i++)
        if (terminatingSequence[i] != bytes[startIndex + i])
            break;
    return i == sequenceLength;
}

void popSequence(std::vector<short>& bytes, int sequenceLength)
{
    for (int j = 0; j < sequenceLength; j++)
        bytes.pop_back();
}

int main()
{
    std::vector<short> bytes;
    std::ifstream inputFile("input.txt");
    int outputFileIndex = 1;
    int sequenceLength = 8;
    short terminatingSequence[] = { 0x20, 0x00, 0x20, 0x00, 0x00, 0x10, 0x00, 0x00 };
    short nextByte;
    std::string buffer; // One whitespace-delimited token, e.g. "3C"; cannot overflow

    while (inputFile >> buffer)
    {
        nextByte = getIntFromHex(buffer.c_str());
        bytes.push_back(nextByte);
        if (bytes.size() < sequenceLength || 
            !findTerminatingSequence(bytes, terminatingSequence, sequenceLength)) 
            continue;

        popSequence(bytes, sequenceLength);
        save(bytes, getFilename(outputFileIndex++), sequenceLength);
        bytes.clear();
    }

    save(bytes, getFilename(outputFileIndex), sequenceLength);

    return 0;
}

Mark Setchell · Answer 5 · 2016-06-18T12:55:50.590

I would go with Perl along these lines:

#!/usr/bin/perl
use warnings;
use strict;

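# Slurp entire file from stdin into variable $data
# (undefining $/ makes the read return everything, not just the first "line")
binmode(STDIN);
my $data = do { local $/; <STDIN> };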

# Find offsets of all occurrences of marker in file
my @matches;
my $marker='\x20\x00\x20\x00\x00\x10\x00\x00';
while ($data =~ /($marker)/gi){
    # Save offset of this match - you may want to add length($marker) here to avoid including marker in output file
    push @matches, $-[0];
}

# Extract data between pairs of markers and write to file
for(my $i=0;$i<scalar @matches -1;$i++){
   my $image=substr $data, $matches[$i], $matches[$i+1] - $matches[$i];
   my $filename=sprintf("file-%05d",$i);
   printf("Saving match at offset %d to file %s\n",$matches[$i],$filename);
   open(my $fh, '>:raw', $filename) or die "Cannot open $filename: $!";
   print $fh $image;
   close($fh);
}

Output

Saving match at offset 12 to file file-00000
Saving match at offset 44 to file file-00001

Run like this:

./perlscript < binaryData

I use more or less exactly this technique to recover damaged flash memory cards from cameras. You just search through the entire flash card for some bytes that look like the start of a JPEG/raw file and grab the following 10-12MB and save it as a file.

halit · Answer 6 · 2016-06-22T01:32:02.493

Your problem can be solved with a simple finite state machine, since the sequence to match is short. Read the hex values separated by spaces and check them one by one against the separator. If the whole separator matches, create a new file and carry on; if not, write what you have read to the current file. Here is the solution; the reading part can be optimized by changing the loop.

(assumed input filename as input.txt)

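#include <fstream>
#include <iomanip>
#include <sstream>
#include <string>

using namespace std;

void writeChunk(ostream& output, int value) {
    // Always emit two uppercase hex digits, so 0x05 becomes "05" and 0x00 becomes "00"
    output << hex << uppercase << setw(2) << setfill('0') << value << " ";
}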

bool readNext(fstream& input, int& value, stringstream* keep = NULL) {
    // Succeed only if a value was actually extracted; this also handles end of file
    if (!(input >> hex >> value))
        return false;
    if (keep != NULL)
        writeChunk(*keep, value);
    return true;
}

string getFileName(int count) {
    stringstream fileName;
    fileName << count << ".txt";
    return fileName.str();
}

int main() {
    int fileCount = 1;
    fstream inputFile, outputFile;

    inputFile.open("input.txt");
    outputFile.open(getFileName(fileCount), ios::out);

    int hexValue;
    while (readNext(inputFile, hexValue)) {

        if (hexValue == 0x20) {
            stringstream ifFails;
            ifFails << "20 ";
            if (readNext(inputFile, hexValue, &ifFails) && hexValue == 0x00 &&
                    readNext(inputFile, hexValue, &ifFails) && hexValue == 0x20 &&
                    readNext(inputFile, hexValue, &ifFails) && hexValue == 0x00 &&
                    readNext(inputFile, hexValue, &ifFails) && hexValue == 0x00 &&
                    readNext(inputFile, hexValue, &ifFails) && hexValue == 0x10 &&
                    readNext(inputFile, hexValue, &ifFails) && hexValue == 0x00 &&
                    readNext(inputFile, hexValue, &ifFails) && hexValue == 0x00) {
                outputFile.close();
                outputFile.open(getFileName(++fileCount), ios::out);
                continue;
            }
            outputFile << ifFails.str();
        } else {
            writeChunk(outputFile, hexValue);
        }
    }

    return 0;
}

score 0 · Answer 7 · answered Jun 23 '16 at 15:19

You can also use a regex tokenizer for this. First read "myfile" into a string. This is needed because iterating over a file directly gives you only a single-pass input iterator, while the regex needs a bidirectional one:

auto const& str(dynamic_cast<ostringstream&> (ostringstream().operator<<(ifstream("myfile").rdbuf())).str());

Then you need a pattern to split on. With the extended grammar, '.' also matches a newline:

auto const& re(regex(".?20.00.20.00.00.10.00.00.?", regex_constants::extended));

And finally iterate over the tokenized string and write it into the file 0.txt and so on.

auto i(0u);
for_each(sregex_token_iterator(str.cbegin(), str.cend(), re, -1),
         sregex_token_iterator(),
         [&i] (string const& s) {ofstream(to_string(i++) + ".txt") << s; });

Please note that the output files are not fully formatted. For example, 1.txt looks like this:

55 73 A2 FF
38 5D 9C FF 3A 5E 95 FF

It is just the contents without the delimiter.
