
I'm trying to ask a question similar to this post: C: read binary file to memory, alter buffer, write buffer to file, but the answers didn't help me (I'm new to C++, so I couldn't understand all of them).

How do I have a loop access the data in memory, and go through line by line so that I can write it to a file in a different format?

This is what I have:

#include <fstream>
#include <iostream>
#include <string>
#include <sstream>
#include <vector>
#include <stdio.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <unistd.h>
#include <stdlib.h>

using namespace std;

int main()
{
    char* buffer;
    char linearray[250];
    int lineposition;
    long filesize;      // ftell returns long
    string linedata;
    string a;

    //obtain the file
    FILE *inputfile;
    inputfile = fopen("S050508-v3.txt", "r");

    //find the filesize
    fseek(inputfile, 0, SEEK_END);
    filesize = ftell(inputfile);
    rewind(inputfile);

    //load the file into memory
    buffer = (char*) malloc (sizeof(char)*filesize);      //allocate mem
    fread (buffer,filesize,1,inputfile);         //read the file into memory
    fclose(inputfile);

    //check that the file loaded into memory correctly
    cout.write(buffer,filesize);

    free(buffer);
}

I appreciate any help!

Edit (More info on the data):

My data is different files that vary between 5 and 10 GB. There are about 300 million lines of data. Each line looks like:

M359

T359 3520 359

M400

A3592 zng 392

Where the first element is a character, and the remaining items could be numbers or characters. I'm trying to read this into memory since it will be a lot faster to loop through it line by line than to read a line, process it, and then write it out. I am compiling on 64-bit Linux. Let me know if I need to clarify further. Again, thank you.

Edit 2: I am using a switch statement to process each line, where the first character of each line determines how to format the rest of the line. For example, 'M' means millisecond, and I put the next three numbers into a structure. Each line has a different first character that requires different handling.
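
A minimal sketch of what that switch might look like, assuming whitespace-separated fields after the type character; the Millisecond structure, its fields, and the sample line are invented for illustration:

#include <iostream>
#include <sstream>
#include <string>
#include <vector>

// hypothetical record type -- the real fields depend on the actual format
struct Millisecond {
    int a = 0, b = 0, c = 0;
};

std::vector<Millisecond> milliseconds;

void process_line(const std::string& line)
{
    if (line.empty())
        return;

    // everything after the leading type character
    std::istringstream fields(line.substr(1));

    switch (line[0]) {
    case 'M': {                       // millisecond record: store the next numbers
        Millisecond m;
        if (fields >> m.a >> m.b >> m.c)
            milliseconds.push_back(m);
        break;
    }
    case 'T':                         // handle 'T' records here
    case 'A':                         // handle 'A' records here
    default:
        break;
    }
}

int main()
{
    process_line("M359 3520 359");    // sample shaped like the question's data
    std::cout << milliseconds.size() << " millisecond record(s)\n";
    return 0;
}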

    This is a big mix of C++ and C. – Joseph Mansfield Feb 16 '13 at 18:53
  • A pointer can be accessed just like an array. If you want to access line-by-line, you should look into [`std::istringstream`](http://en.cppreference.com/w/cpp/io/basic_istringstream). – Some programmer dude Feb 16 '13 at 18:56
  • First time I've seen `cout` and `malloc` together in the same function. – us2012 Feb 16 '13 at 19:02
  • Drop the `sizeof(char)`. The size of an object is defined in multiples of the size of a char, so `sizeof (char)` is one by definition. – Ulrich Eckhardt Feb 16 '13 at 19:02
  • What exactly happens when you *do* run this, just offhand? – WhozCraig Feb 16 '13 at 19:03
  • When I run it, it takes about a minute to load into memory (the file size is 6 GB; I have 32 GB of RAM), and then the cout.write will display my data to the console. – BrianR Feb 16 '13 at 19:08
  • Are you tabulating some specific data from the lines being read? Perhaps a better understanding of that would help. Are the lines in groups of data (i.e. every 5 lines represents a record of blah..)? I think we can help with this, but need some more info on the format of the incoming data (i.e. what it represents, and how that format is represented in the inbound file). Also, make damn sure you're compiling 64-bit, because there's no way a 6 GB alloc is going to work on a 32-bit platform, even with typical addressing extensions. – WhozCraig Feb 16 '13 at 19:11
  • Start from scratch. Use these headers: `string` `iostream` `fstream` `list` *and nothing else*. Do not use pointers, `new` or `delete`. Loop over a file, read it line by line with `getline`, add each line to a list of strings. Then loop over the list and write it to the output file. Once you get this working, add your transformation code (you may need more headers at this point). – n. m. could be an AI Feb 16 '13 at 19:20

3 Answers


So pardon the potentially blatantly obvious, but if you want to process this line by line, then...

#include <iostream>
#include <fstream>
#include <string>
using namespace std;

int main(int argc, char *argv[])
{
    // read lines one at a time
    ifstream inf("S050508-v3.txt");
    string line;
    while (getline(inf, line))
    {
        // ... process line ...
    }
    inf.close();

    return 0;
}

And just fill in the body of the while loop? Maybe I'm not seeing the real problem (a forest for the trees kinda thing).

EDIT

The OP is on board with using a custom streambuf, which may not necessarily be the most portable thing in the world, but he's more interested in avoiding flipping back and forth between input and output files. With enough RAM, this should do the trick.

#include <iostream>
#include <fstream>
#include <string>
#include <cstdlib>
#include <memory>
using namespace std;

struct membuf : public std::streambuf
{
    explicit membuf(size_t len)
        : len(len)
        , src(new char[ len ])
    {
        // expose the raw block as this streambuf's get-area
        setg(src.get(), src.get(), src.get() + len);
    }

    // direct buffer access for file load.
    char * get() { return src.get(); }
    size_t size() const { return len; }

private:
    size_t len;                      // declared before src so the initializer list runs in order
    std::unique_ptr<char[]> src;     // array form so the buffer is released with delete[]
};

int main(int argc, char *argv[])
{
    // open file in binary, retrieve length-by-end-seek
    ifstream inf(argv[1], ios::in|ios::binary);
    inf.seekg(0,inf.end);
    size_t len = inf.tellg();
    inf.seekg(0, inf.beg);

    // allocate a stream buffer with an internal block
    //  large enough to hold the entire file.
    membuf mb(len+1);

    // use our membuf buffer for our file read-op.
    inf.read(mb.get(), len);
    mb.get()[len] = 0;

    // use iss for your nefarious purposes
    std::istream iss(&mb);
    std::string s;
    while (iss >> s)
        cout << s << endl;

    return EXIT_SUCCESS;
}
WhozCraig
  • I have done that before, but my files can vary between 5 and 10 GB. The program took about an hour to complete. – BrianR Feb 16 '13 at 19:13
  • @BrianR OK, i think I understand now. You want to avoid the butterfly'ing on your likely-single-spindle disk system. Since you have enough ram you want a single sequential read of the entire source file, then use that read buffer as a formatted stream source to do your processing, writing your result data to output in a long line of output-only ops. Is that accurate? – WhozCraig Feb 16 '13 at 19:43
  • Yes that is what I am trying to do! I'm sorry for not being able to articulate it better. – BrianR Feb 16 '13 at 19:46
  • I am open to anything. If it isn't the standard, I am ok with it (Completing this project takes priority of learning standards, since I have months to learn the standards, but not so with completing my project). – BrianR Feb 16 '13 at 19:57
  • @BrianR Dunno if you're still monitoring this, but I think the update may do what you want. It is a little f'ugly, but it avoids a duplicate memory allocation for your massive file buffer and allows you to do the whole pre-load which I think you were looking for. Hope it helps, and thanks for the up-vote. – WhozCraig Feb 17 '13 at 00:36

If I had to do this, I'd probably use code something like this:

std::ifstream in("S050508-v3.txt");

std::istringstream buffer;

buffer << in.rdbuf();

std::string data = buffer.str();

if (check_for_good_data(data))
    std::cout << data;

This assumes you really need the entire contents of the input file in memory at once to determine whether it should be copied to output or not. If (for example) you can look at the data one byte at a time, and determine whether that byte should be copied without looking at the others, you could do something more like:

std::ifstream in(...);

std::copy_if(std::istreambuf_iterator<char>(in),
             std::istreambuf_iterator<char>(),
             std::ostream_iterator<char>(std::cout, ""),
             is_good_char);

...where is_good_char is a function that returns a bool saying whether that char should be included in the output or not.
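
For instance, a minimal is_good_char might keep only printable bytes and line breaks; this particular predicate is invented for illustration, and yours would encode whatever "good" means for your data:

#include <cctype>

// hypothetical predicate: keep printable characters and newlines
bool is_good_char(char c)
{
    return std::isprint(static_cast<unsigned char>(c)) || c == '\n';
}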

Edit: the size of files you're dealing with mostly rules out the first possibility I've given above. You're also correct that reading and writing large chunks of data will almost certainly improve speed over working on one line at a time.
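
A minimal sketch of that chunked approach, assuming the file names from the question; the 64 MB chunk size is an arbitrary illustration, not a tuned value:

#include <cstddef>
#include <fstream>
#include <vector>

int main()
{
    std::ifstream in("S050508-v3.txt", std::ios::binary);
    std::ofstream out("out.txt", std::ios::binary);

    const std::size_t chunk = 64 * 1024 * 1024;   // 64 MB per read
    std::vector<char> buf(chunk);

    while (in)
    {
        in.read(buf.data(), buf.size());
        std::streamsize got = in.gcount();        // the last chunk may be short
        if (got > 0)
        {
            // ... transform buf[0..got) here ...
            out.write(buf.data(), got);
        }
    }
    return 0;
}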

Jerry Coffin

You should look into fgets and sscanf, which let you pull matched pieces of data out of each line so they are easier to manipulate, assuming that is what you want to do. That could look something like this:

#include <cstdio>

int main()
{
    FILE *input = fopen("file.txt", "r");
    FILE *output = fopen("out.txt", "w");

    const int bufferSize = 64;
    char buffer[bufferSize];

    // fgets returns NULL at end-of-file or on error, not EOF
    while (fgets(buffer, bufferSize, input) != NULL) {
        char data[16];
        // a scanf conversion specification such as "%15s" goes here, not a regex
        sscanf(buffer, "format", data);
        //manipulate data
        fprintf(output, "%s", data);
    }
    fclose(output);
    fclose(input);
    return 0;
}

That would be more of the C way to do it; C++ handles things a little more elegantly using istream: http://www.cplusplus.com/reference/istream/istream/
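
For comparison, a rough istream-based equivalent of the loop above, reusing the same hypothetical file names and pulling out the first whitespace-separated piece of each line:

#include <fstream>
#include <sstream>
#include <string>

int main()
{
    std::ifstream input("file.txt");
    std::ofstream output("out.txt");

    std::string line;
    while (std::getline(input, line)) {
        std::istringstream fields(line);   // split the line into pieces
        std::string data;
        if (fields >> data)                // extract the first field
            output << data << '\n';        // manipulate and write it
    }
    return 0;
}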