
Suppose you want to read the data from a large text file (~300 MB) into an array of vectors: vector<string> *Data (assume that the number of columns is known).

//file is opened with ifstream; initial value of s is set up, etc...


Data = new vector<string>[col];
string u;
int i = 0;

do
{       
    istringstream iLine = istringstream(s);

    i=0;
    while(iLine >> u)
    {
        Data[i].push_back(u);
        i++;
    }
}
while(getline(file, s));

This code works fine for small files (<50 MB), but memory usage increases exponentially when reading a large file. I'm pretty sure the problem is the creation of an istringstream object on every loop iteration. However, defining istringstream iLine; outside both loops, feeding each line into the stream with iLine.str(s);, and clearing the stream after the inner while-loop (iLine.str(""); iLine.clear();) causes the same order of memory explosion. The questions that arise:

  1. Why does istringstream behave this way?
  2. If this is the intended behavior, how can the above task be accomplished?

Thank you

EDIT: Regarding the first answer - I do free the memory allocated for the array later in the code:

for(long i=0;i<col;i++)
    Data[i].clear();
delete []Data;

FULL COMPILE-READY CODE (add headers):

int _tmain(int argc, _TCHAR* argv[])
{
ofstream testfile;
testfile.open("testdata.txt");

srand(time(NULL));

for(int i = 1; i<1000000; i++)
{
    for(int j=1; j<100; j++)
    {
        testfile << rand()%100 << " ";
    }

    testfile << endl;
}

testfile.close();

vector<string> *Data;

clock_t begin = clock();

ifstream file("testdata.txt"); 

string s;

getline(file,s);

istringstream iss = istringstream(s);

string nums;

int col=0;

while(iss >> nums)
{
    col++;
}

cout << "Columns #: " << col << endl;

Data = new vector<string>[col];

string u;
int i = 0;

do
{

    istringstream iLine = istringstream(s);

    i=0;

    while(iLine >> u)
    {
        Data[i].push_back(u);
        i++;

    }

}
while(getline(file, s));

cout << "Rows #: " << Data[0].size() << endl;

for(long i=0; i<col; i++)
    Data[i].clear();
delete []Data;

clock_t end = clock();

double elapsed_secs = double(end - begin) / CLOCKS_PER_SEC;

cout << elapsed_secs << endl;

getchar();
return 0;
}
Oleg Shirokikh
  • The real question is _why are you using a separate `vector` for each line?!_ – Lightness Races in Orbit Jan 31 '13 at 06:53
  • Why aren't you just using your `ifstream` ? – WhozCraig Jan 31 '13 at 06:54
  • Why are you `new`ing your `vector`? – Mark Garcia Jan 31 '13 at 06:55
  • 1. I use separate vector for each COLUMN. The behavior reading row-wise is similar (I've checked) – Oleg Shirokikh Jan 31 '13 at 06:57
  • I think an even realer (is that a word?) question is, since you are using vector, which is a replacement for manually managing dynamic arrays, why are you making a dynamic array of vectors instead of a vector of vectors? – Benjamin Lindley Jan 31 '13 at 06:57
  • 2. I can't just use ifstream because I need to parse different types of data using istringstream – Oleg Shirokikh Jan 31 '13 at 06:57
  • Benjamin, I've tried vector of vectors - the same memory issue... question is about the behavior of istringstream – Oleg Shirokikh Jan 31 '13 at 06:59
  • @user2028058: Then go back to it, it wasn't the problem, and now you've created more. – Benjamin Lindley Jan 31 '13 at 07:00
  • @BenjaminLindley: it won't solve the main problem: memory leak caused by `istringstream iLine = istringstream(s);` – Oleg Shirokikh Jan 31 '13 at 07:01
  • @Uri: read the comment above do{} loop – Oleg Shirokikh Jan 31 '13 at 07:02
  • Why do you believe that it is the istringstream causing the problem? And do you have some numbers to back up your claim that the growth is exponential? – JoergB Jan 31 '13 at 07:03
  • @MarkGarcia: 'cause it is array of vectors... – Oleg Shirokikh Jan 31 '13 at 07:04
  • @JoergB: Good comment. I've tried just to push random strings of the same size to the vectors while not reading from stringstream. For 280MB txt file filled with 2-digit numbers, memory used by the process goes to 2GB in a few seconds. Same file - no stringstream - everything is fine – Oleg Shirokikh Jan 31 '13 at 07:06
  • If it is indeed a memory leak in your istringstream implementation, then it should be replicable without file input. You should be able to simply `main()` yourself a test rig that just eats data from a similar loop, assuming each line is roughly the same size, but invoking the loop repeatedly. Can you do that in an SSCCE? (or have you tried to?) – WhozCraig Jan 31 '13 at 07:07
  • How about a full compile-able example that demonstrates the problem? – Benjamin Lindley Jan 31 '13 at 07:09
  • 2 digit numbers, stored as individual strings, may well occupy a lot more memory than the original file. – JasonD Jan 31 '13 at 07:10
  • You seem to have left some lines out of your source code. Can you post a **complete** minimal sample program? – Robᵩ Jan 31 '13 at 07:17
  • I'm not seeing any leaks with just an [istringstream stress](http://ideone.com/cHGe2a) on ideone.com, and after 42-million tokens and 113MB of test data, the footprint never climbed above 3MB. If you get similar numbers, you can probably conclude it isn't the string stream that is the core issue. – WhozCraig Jan 31 '13 at 07:38
  • I don't see any exponential growth. But it definitely fails for me, and it has nothing to do with stringstream, just a bad allocation for the vectors. I'd imagine the OS is just having a difficult time finding one hundred contiguous 12MB locations, in addition to 100 million small locations for the string data. – Benjamin Lindley Jan 31 '13 at 07:54
  • @BenjaminLindley: Yes, I understood.. Problem is not in stringstreams... Any ideas how to overcome this issue and read the file? – Oleg Shirokikh Jan 31 '13 at 08:00
  • Trying to understand customs here: why has this been downvoted? The poster had made a wrong guess about the origin of his memory problems. But there was a real problem. And it appears an answer has been found - and maybe the poster learned a bit about how to analyze such problems. – JoergB Jan 31 '13 at 08:02
  • @JoergB: Thank you. You helped a lot. Somebody is trying to be "smart" not by helping and tackling the problem but downvoting the post :) – Oleg Shirokikh Jan 31 '13 at 08:06

2 Answers


I seriously suspect this is not an istringstream problem (especially given that you get the same result with the iLine construction outside the loop).

Possibly, this is normal behavior of std::vector. To test that, run the exact same code but comment out Data[i].push_back(u); and see whether your memory still grows this way. If it doesn't, then you know where the problem is.

Depending on your library, vector::push_back will expand its capacity by a factor of 1.5 (Microsoft) or 2 (GCC's libstdc++) every time it needs more room.

Uri London

vector<> grows memory geometrically. A typical pattern would be that it doubles the capacity whenever it needs to grow. That may leave a lot of extra space allocated but unused, if your loop ends right after such a threshold. You could try calling shrink_to_fit() on each vector when you are done.

Additionally, memory allocated by the C++ allocators (or even plain malloc()) is often not returned to the OS but kept in a process-internal free-memory pool. This may lead to further apparent growth, and it may make the effect of shrink_to_fit() invisible from outside the process.

Finally, if you have lots of small strings ("2-digit numbers"), the overhead of a string object may be considerable. Even if the implementation uses a small-string optimization, I'd assume that a typical string uses no less than 16 or 24 bytes (size, capacity, data pointer or small-string buffer) - probably more on a platform where size_type is 64 bits. That is a lot of memory for 3 bytes of payload.

So I assume you are seeing the normal behaviour of vector<>.

JoergB
  • Hm.. This really makes sense for me, but: 1. The memory grows so fast that exception raised before loop is done, so I cannot shrink the size of vector 2. 2-digit numbers are just for example. I need to handle any input atomic type – Oleg Shirokikh Jan 31 '13 at 07:22
  • With your test code, does it change anything (fail sooner, fail later, fail never), if you call `Data[i].reserve(1000000);` for each vector. (Taking advantage of the fact that you know that you'll have 999999 lines of input.) – JoergB Jan 31 '13 at 07:37
  • Yes, you are right... it fails even on the stage of allocating this much memory for all the vectors... Do you know the alternative solution for my task? – Oleg Shirokikh Jan 31 '13 at 07:54
  • The most efficient use of memory would be to load all the file content into an array of chars and use something equivalent to `strtok()` to break it into words in place. Your column vectors would then be `vector`. That is assuming you need all the data in memory in text form at once. – JoergB Jan 31 '13 at 07:59
  • @user2028058: That depends upon what your task is. Reading data into memory is certainly not the task, but a means to an end. So what exactly are you trying to do with the data? – Benjamin Lindley Jan 31 '13 at 08:00
  • basically, I need the functionality of stringstream to determine the type of the data later and perform conversion using stringstream. – Oleg Shirokikh Jan 31 '13 at 08:03
  • If your final data has a smaller representation than as a string (for example, if you will convert to numbers), you may want to do that conversion right away and store the converted data. If your memory isn't even sufficient for that, you'll need to consider an algorithm that uses disk for temp storage. – JoergB Jan 31 '13 at 08:06
  • @user2028058: No, that is not your task. That is the functionality you've decided to use to accomplish your task. A task is more specific. Something like *"I'm trying to sum all the numbers in each column"*, or *"I'm trying to find the person in this database that would be best suited to be my butler"*. – Benjamin Lindley Jan 31 '13 at 08:07
  • @JoergB: Thank you very much. Basically, you resolved the root problem. Coming up with the way of handling such data is out of scope of my question already. Although if you have any other suggestions, I would greatly appreciate this. – Oleg Shirokikh Jan 31 '13 at 08:13
  • @BenjaminLindley: :) True... Suppose I want to run multiple regression on the data stored in file. I don't know what type of data is there. If there are only numeric data, then i need just one structure (e.g. array of vecs or vec of vecs). If there are some categorical variables (non-numeric), then I'll need to factorize them (e.g. assign unique double number to each categorical class) but still keep original string representation. – Oleg Shirokikh Jan 31 '13 at 08:13