
I have a locally stored file, around 2.3 MB in size and about 500,000 lines altogether, that I would like to load into a HashSet in memory. Since the file is large and reading it is slow, I have split it into 5 smaller files of less than 100,000 lines each.
My idea is to instantiate 5 separate threads from the Application class. Each thread would read its own file and store the data in its own set. Upon completion, it would return the obtained subset to the main thread, i.e. the Application class, which would then store it in the main set. The thread code is as follows:

private class LoadFileThread extends Thread {
    private String filename;
    private Set<String> subSet;
    private MyApplication application;

    public LoadFileThread(String filename, MyApplication ctx) {
        this.filename = filename;
        this.application = ctx;
        this.subSet = new HashSet<String>();
    }

    @Override
    public void run() {
        AssetManager am = application.getAssets();
        BufferedReader reader = null;
        try {
            InputStream is = am.open(filename);
            reader = new BufferedReader(new InputStreamReader(is, "UTF-8"));
            String line;
            while ((line = reader.readLine()) != null) {
                subSet.add(line.toUpperCase());
            }
        } catch (IOException e) {
            e.printStackTrace();
        } finally {
            // guard against NPE: reader stays null if am.open() throws
            if (reader != null) {
                try { reader.close(); } catch (IOException ignorable) {}
            }
        }
        application.setSubSet(subSet, this.getName());
    }

}

Method setSubSet in the Application class:

public synchronized void setSubSet(Set<String> subSet, String name) {
    myMainSet.addAll(subSet);
    Log.d("Thread finished", name);
}
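
The fan-out/merge pattern described above (worker threads each filling a private set, then a single synchronized merge per worker) can be sketched in plain Java like this; the Android classes and file reading are replaced with in-memory stand-ins, so the names here are illustrative only:

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class ParallelLoad {
    // Stand-in for MyApplication.myMainSet plus the synchronized setSubSet method.
    static final Set<String> mainSet = Collections.synchronizedSet(new HashSet<>());

    // Each worker builds its own subset, then merges exactly once at the end,
    // mirroring the LoadFileThread / setSubSet pattern.
    static Thread worker(List<String> lines) {
        return new Thread(() -> {
            Set<String> sub = new HashSet<>();
            for (String line : lines) {
                sub.add(line.toUpperCase());
            }
            mainSet.addAll(sub); // single merge per worker, thread-safe set
        });
    }

    public static void main(String[] args) throws InterruptedException {
        // Two tiny "files" standing in for the five split files.
        List<List<String>> parts = List.of(
                List.of("alpha", "beta"),
                List.of("beta", "gamma"));
        List<Thread> threads = new ArrayList<>();
        for (List<String> p : parts) threads.add(worker(p));
        for (Thread t : threads) t.start();
        for (Thread t : threads) t.join();
        System.out.println(mainSet.size()); // duplicates collapse: 3 entries
    }
}
```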

Two problems occur:

  1. Reading is still way too slow.
  2. I get an OutOfMemoryError when calling addAll on the main set.

Is there a better way to do this? How?

Maggie

1 Answer


With 500,000 lines and readLine() you are doing 500,000 reads.

Create a 64 KB buffer and read into that.

Process each full line you can, then read another 64 KB.

That should cut your reads to a fraction of 500,000.
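
One way to get this effect without hand-rolling the line splitting is to give BufferedReader an explicit 64 KiB buffer, so each underlying InputStream read pulls a large block; a minimal sketch (fed from an in-memory stream here, since the asset file isn't available):

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.util.HashSet;
import java.util.Set;

public class ChunkedLoad {
    // Read the stream with an explicit 64 KiB buffer, so readLine()
    // serves lines from memory instead of issuing many small reads.
    static Set<String> load(InputStream is) throws IOException {
        // Pre-sizing the set avoids repeated rehashing for ~500,000 entries.
        Set<String> set = new HashSet<>(600_000);
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(is, StandardCharsets.UTF_8), 64 * 1024)) {
            String line;
            while ((line = reader.readLine()) != null) {
                set.add(line.toUpperCase());
            }
        }
        return set;
    }

    public static void main(String[] args) throws IOException {
        byte[] data = "alpha\nbeta\ngamma\n".getBytes(StandardCharsets.UTF_8);
        Set<String> set = load(new ByteArrayInputStream(data));
        System.out.println(set.size());
    }
}
```

On Android the same call would wrap `am.open(filename)` instead of the ByteArrayInputStream.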

MikeHelland