
I have a space-separated data file with 4.5 million entries in the following format:

CO_1 A 0 0 0 0 1

CO_2 A 0 0 0 0 1

This data file is used as input to the Self-Organizing Map (SOM) algorithm, which iterates over the file 100 times (in my case).

I use the following `readFile` function to copy the file completely into a temp string, and pass that string on to the SOM algorithm.

public String readFile()
{
    String temp = "";

    try
    {
        FileReader file = new FileReader(FILE_LOCATION);
        BufferedReader BR = new BufferedReader(file);
        String strLine = null;

        while((strLine = BR.readLine()) != null)
        {
            temp += strLine + "\n";
        }
        BR.close();
    }
    catch(Exception e)
    {
        e.printStackTrace(); // don't swallow exceptions silently
    }

    return temp;
}

However, I feel the above method puts a heavy burden on memory and slows down the iterations, which could result in memory overruns. I'm currently running this code on a cluster with a 30 GB memory allocation, and the execution has not completed even a single iteration in about 36 hours.

I cannot partially read the file (as in blocks of lines), since the SOM would have to poll for data once the initial block is done, which could result in even further complications.

Any ideas how I could improve this so that I can successfully iterate over 4.5 million entries 100 times?

EDIT

The whole file is read into the string using the above method only once. The string variable is then used throughout the 100 iterations. However, on every iteration a string tokenizer is used to process each line, so the tokenizing work is repeated for every line in the file × the number of iterations.
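For illustration, a minimal sketch of parsing the file once into a list of records, so the tokenizing cost is paid once instead of once per iteration. The `Entry` class and its field names are assumptions based on the sample lines above, not part of the original code:

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

public class SomInput
{
    // One parsed entry: "CO_1 A 0 0 0 0 1" -> id, label, numeric features.
    public static class Entry
    {
        final String id;
        final char label;
        final double[] features;

        Entry(String id, char label, double[] features)
        {
            this.id = id;
            this.label = label;
            this.features = features;
        }
    }

    public static List<Entry> readEntries(String fileLocation) throws IOException
    {
        List<Entry> entries = new ArrayList<Entry>();
        try (BufferedReader br = new BufferedReader(new FileReader(fileLocation)))
        {
            String line;
            while ((line = br.readLine()) != null)
            {
                if (line.isEmpty())
                    continue;
                String[] parts = line.split(" ");
                double[] features = new double[parts.length - 2];
                for (int i = 0; i < features.length; i++)
                    features[i] = Double.parseDouble(parts[i + 2]);
                entries.add(new Entry(parts[0], parts[1].charAt(0), features));
            }
        }
        return entries;
    }
}

The SOM loop would then iterate over the returned `List<Entry>` 100 times with no further tokenizing.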

Synex
  • Can you represent your file with a more efficient data structure? For example, what about a `Map` with entries of the form `map.put(1, new BitSet())`, where you use the key `1` to represent the string `CO_1` and a bitset containing `0 0 0 0 1` to represent the rest of your string? (See the sketch after these comments.) – gdiazc Feb 22 '14 at 12:26
  • @Synex have you tried profiling your code to see what part is taking the longest? – Alan Feb 22 '14 at 12:41
  • @Alan no I have not. Any suggestions? I'm using the Eclipse IDE – Synex Feb 22 '14 at 12:43
  • @Synex, I suspect Eclipse might not be allocating enough heap space. If I'm right, then even if your machine has 30GB of memory available, that memory isn't being made available to your Java code. You should add a JVM flag: `-Xmx4G`. I'll post an answer with this suggestion. – gdiazc Feb 22 '14 at 13:25
  • @Synex A simple place to start would be to use `System.out.println` to report the time between start and completion of each section as calculated by taking the difference between calls to `currentTimeMillis` – Alan Feb 22 '14 at 14:06
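A minimal sketch of the `Map`/`BitSet` idea from the first comment. Class and variable names are illustrative, and it assumes the five flags are always 0/1 as in the sample lines:

import java.util.BitSet;
import java.util.HashMap;
import java.util.Map;

public class BitSetSketch
{
    public static void main(String[] args)
    {
        // The numeric suffix of "CO_1" becomes the key; the positions of
        // the 1s in "0 0 0 0 1" become the set bits.
        Map<Integer, BitSet> data = new HashMap<Integer, BitSet>();

        BitSet bits = new BitSet(5);   // represents "0 0 0 0 1"
        bits.set(4);                   // only the last flag is 1
        data.put(1, bits);             // key 1 stands for "CO_1"
    }
}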

4 Answers


Don't ever use string concatenation for this kind of purpose: each `+=` copies the entire string built so far, so the cost grows quadratically with file size. Use the `StringBuffer` class instead.
Consider the following example:

public StringBuffer readFile()
{
    StringBuffer tempSB = new StringBuffer();

    try
    {
        FileReader file = new FileReader(FILE_LOCATION);
        BufferedReader BR = new BufferedReader(file);
        String strLine = null;

        while((strLine = BR.readLine()) != null)
        {
            tempSB.append(strLine);
            tempSB.append("\n");
        }
        BR.close();
    }
    catch(Exception e)
    {
        e.printStackTrace();
    }

    return tempSB;
}

This will save heap memory and avoid the quadratic copying cost.

unknown
  • In this situation, since you know roughly how much data to expect it might be wise to specify an initial capacity for the StringBuffer, so it doesn't have to spend ages re-sizing. E.g.: `= new StringBuffer(typical_line_length * rough_number_of_lines);` – Alan Feb 22 '14 at 12:34
  • You can use a `StringBuilder` here, since you need no synchronization. – qqilihq Feb 22 '14 at 12:41
  • If you need to search within the text, then you should go for `Lucene` indexing. – unknown Feb 22 '14 at 12:47

I'd like to complement the other answers. Even though I think you should store your data in a more efficient data structure than just a string, there might be another reason your code is slow.

Since your file size seems to be around 100 MB, your code might be slowing down because Eclipse has not allocated enough heap space for it. Try adding the following JVM flag:

-Xmx4G

This will give your code 4 GB of heap space to work with. To do this, in Eclipse go to:

// Run -> Run Configurations -> <Select your main class on the left>
// -> <Select the 'Arguments' tab>
// -> <Add the string "-Xmx4G" to the 'VM arguments' text area>

This might speed it up!
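Outside Eclipse (e.g. on the cluster), the same flag goes directly on the `java` command line. One way to verify that the flag actually took effect is to print the maximum heap from inside the program; a small self-contained check (class name is illustrative):

public class HeapCheck
{
    public static void main(String[] args)
    {
        // Prints the maximum heap the JVM was actually given, so you can
        // verify that the -Xmx setting took effect.
        long maxMb = Runtime.getRuntime().maxMemory() / (1024 * 1024);
        System.out.println("Max heap: " + maxMb + " MB");
    }
}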

gdiazc

Reading a file with String += is very expensive. I suggest you parse the entries into a data structure instead; this should take about 1-10 seconds, and each subsequent iteration over it should take less than a second. 4.5 million entries at, say, 110 bytes per entry come to about 0.5 GB (4.5e6 × 110 B ≈ 495 MB), perhaps 1 GB for a more complex structure, which shouldn't be enough to worry about.
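A sketch of one compact layout along those lines. The column split and array names are assumptions for illustration, not part of the answer: keeping one array per column rather than one object per entry keeps per-entry overhead low:

public class CompactStore
{
    // Illustrative column-oriented layout for ~4.5 million entries.
    static final int N = 4500000;
    static final int FEATURES = 5;

    static final String[] ids    = new String[N];          // "CO_1", "CO_2", ...
    static final char[]   labels = new char[N];            // 'A', ...
    static final byte[]   flags  = new byte[N * FEATURES]; // 5 flags per entry, flattened

    static byte flag(int entry, int feature)
    {
        return flags[entry * FEATURES + feature];
    }
}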

Peter Lawrey

If you need to parse the serial text file and also be able to read it randomly, use persistent storage such as a SQL database, a NoSQL one, or even the Lucene search engine. This gives you benefits such as:

  • you don't have to load the whole file into RAM
  • you can use stream processing: read the file line by line and keep only the current line in RAM (see the sketch after this list)
  • parsing and persisting the source file costs a bit more time, but random access is way faster
  • you can even parse and read your data in several threads independently
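A minimal sketch of the stream-processing variant from the second bullet. `processLine` is a hypothetical placeholder for the per-entry work (e.g. persisting the entry to the DB):

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;

public class StreamingPass
{
    // One streaming pass over the file: only the current line is held in RAM.
    public static void pass(String fileLocation) throws IOException
    {
        try (BufferedReader br = new BufferedReader(new FileReader(fileLocation)))
        {
            String line;
            while ((line = br.readLine()) != null)
            {
                processLine(line);
            }
        }
    }

    private static void processLine(String line)
    {
        // placeholder for per-entry work
    }
}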
injecteer
  • This might be a good idea if there were more data, but Synex is only dealing with a relatively small set here, so I think using a DB would be overkill. Each record is a short string, a character and 5 numbers, which works out to less than 100 bytes per entry and thus less than 0.5 GB. On a decent machine this should not be a problem. – Alan Feb 22 '14 at 12:39
  • well, the 36++ hours of processing time and 30 GB RAM *ARE* already the overkill :) anyway, one would have to set up a DB only once, so even from a short-term perspective it should pay off – injecteer Feb 22 '14 at 12:53
  • I agree with @Alan. Since the file size seems to be around 100 MB, an in-memory solution is more appropriate in this case. – gdiazc Feb 22 '14 at 13:22
  • @injecteer It is quite unlikely that the 36+ hours is due to the data input step - at least not once the String/StringBuffer issue is solved. – Alan Feb 22 '14 at 14:08