
In Java, here is my code to read a file containing a table of integers:

public static int[][] getDataset() {

    // open data file to read n and m size parameters
    BufferedReader br = null;
    try {
        br = new BufferedReader(new FileReader(filePath));
    } catch (FileNotFoundException e) {
        e.printStackTrace();
        System.exit(1);
    }

    // count the number of lines
    int i = -1;
    String line = null, firstLine = null;
    do {

        // read line
        try {
            line = br.readLine();
            i++;
            if (i == 0) firstLine = line;
        } catch (IOException e) {
            e.printStackTrace();
            System.exit(1);
        }

    } while (line != null);

    // close data file
    try {
        br.close();
    } catch (IOException e) {
        e.printStackTrace();
        System.exit(1);
    }

    // check the data for emptiness
    if (i == 0) {
        System.out.println("The dataset is empty!");
        System.exit(1);
    }

    // initialize n and m (at least the first line exists)
    n = i; m = firstLine.split(" ").length;
    firstLine = null;

    // open data file to read the dataset
    br = null;
    try {
        br = new BufferedReader(new FileReader(filePath));
    } catch (FileNotFoundException e) {
        e.printStackTrace();
        System.exit(1);
    }

    // initialize dataset
    int[][] X = new int[n][m];

    // process data
    i = -1;
    while (true) {

        // read line
        try {
            line = br.readLine();
            i++;
        } catch (IOException e) {
            e.printStackTrace();
            System.exit(1);
        }

        // exit point
        if (line == null) break;

        // convert a line (string of integers) into a dataset row
        String[] stringList = line.split(" ");
        for (int j = 0; j < m; j++) {
            X[i][j] = Integer.parseInt(stringList[j]);
        }

    }

    // close data file
    try {
        br.close();
    } catch (IOException e) {
        e.printStackTrace();
        System.exit(1);
    }

    return X;

}

The dataset size parameters n and m are static int fields declared outside the method (they cannot be final, since this method assigns them), along with the static final String filePath.
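
For completeness, the surrounding declarations would look roughly like this (the class name and path value are placeholders):

import java.io.*;

public class DatasetReader {

    // filled in by getDataset(): number of rows and columns
    static int n;
    static int m;

    // path to the space-separated integer table (placeholder value)
    static final String filePath = "data.txt";

    // getDataset() from above goes here
}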

I am sharing my solution (it may be useful for newcomers who read this later) and asking whether it can be made faster and/or made to consume less memory. I'm interested in thorough micro-optimization, so any advice would be great. In particular, I do not like the way the file is opened twice.

Sophie Sperner
  • Maybe it is faster to read the file byte by byte (or in chunks of bytes) instead of whole lines, and then immediately check the bytes for spaces and line breaks. That way you could save the "split". – Adrian Mar 08 '13 at 12:29
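
A rough sketch of that byte-level idea, reading single characters and accumulating digits until a space or line break; it assumes non-negative integers and that n and m are already known, and the method name is only illustrative:

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;

static int[][] readByChars(String filePath, int n, int m) throws IOException {
    int[][] X = new int[n][m];
    BufferedReader br = new BufferedReader(new FileReader(filePath));
    int row = 0, col = 0, value = 0;
    boolean inNumber = false;
    int ch;
    while ((ch = br.read()) != -1) {
        if (ch >= '0' && ch <= '9') {
            value = value * 10 + (ch - '0');      // accumulate digits of the current number
            inNumber = true;
        } else {
            if (inNumber) {                       // space or line break ends the number
                X[row][col++] = value;
                value = 0;
                inNumber = false;
            }
            if (ch == '\n' && col > 0) {          // line break ends the row
                row++;
                col = 0;
            }
        }
    }
    if (inNumber) X[row][col] = value;            // last number if the file lacks a trailing newline
    br.close();
    return X;
}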

1 Answer


Read the file only once and add all lines to an ArrayList<String>; the ArrayList grows automatically. Then process that ArrayList to split the lines.
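
A minimal sketch of that single-pass version, keeping the question's static fields n, m and filePath and its style of error handling:

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.ArrayList;

public static int[][] getDataset() {

    // read the file once; the ArrayList grows as needed
    ArrayList<String> lines = new ArrayList<String>();
    try {
        BufferedReader br = new BufferedReader(new FileReader(filePath));
        String line;
        while ((line = br.readLine()) != null) {
            lines.add(line);
        }
        br.close();
    } catch (IOException e) {
        e.printStackTrace();
        System.exit(1);
    }

    // check the data for emptiness
    if (lines.isEmpty()) {
        System.out.println("The dataset is empty!");
        System.exit(1);
    }

    // sizes are now known without a second pass over the file
    n = lines.size();
    m = lines.get(0).split(" ").length;

    // convert the buffered lines into the dataset
    int[][] X = new int[n][m];
    for (int i = 0; i < n; i++) {
        String[] stringList = lines.get(i).split(" ");
        for (int j = 0; j < m; j++) {
            X[i][j] = Integer.parseInt(stringList[j]);
        }
    }

    return X;
}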

Further optimisation: String.split uses a full regular-expression engine. Try StringTokenizer or your own string-splitting method instead.
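
For example, the inner parsing step could look like this with StringTokenizer (a sketch; line, X, i and m are as in the question's code):

import java.util.StringTokenizer;

// split one line on spaces without the regex machinery of String.split
StringTokenizer tokens = new StringTokenizer(line, " ");
for (int j = 0; j < m && tokens.hasMoreTokens(); j++) {
    X[i][j] = Integer.parseInt(tokens.nextToken());
}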

Instead of an ArrayList you could use a GrowingIntArray or GrowingStringArray; these avoid some overhead but are less handy.
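
These classes are not part of the JDK; a minimal sketch of what a growing int array could look like (the name and initial capacity are just for illustration):

// grows like ArrayList but stores primitive ints, avoiding Integer boxing
public class GrowingIntArray {
    private int[] data = new int[16];
    private int size = 0;

    public void add(int value) {
        if (size == data.length) {
            data = java.util.Arrays.copyOf(data, data.length * 2);  // double the capacity
        }
        data[size++] = value;
    }

    public int get(int index) { return data[index]; }

    public int size() { return size; }
}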

Speed and memory usage are often conflicting goals; frequently you cannot optimize both.

You can save memory by using a one-dimensional array. In Java, a 2D array needs more space because each row is a separate object. Access the one-dimensional array with X[col + row * rowSize].
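
A sketch of that layout, assuming the lines have already been read into an ArrayList<String> called lines and that n and m are known:

// one flat array of n * m ints instead of n separate row objects
int[] X = new int[n * m];

for (int row = 0; row < n; row++) {
    String[] parts = lines.get(row).split(" ");
    for (int col = 0; col < m; col++) {
        // element (row, col) lives at index col + row * m
        X[col + row * m] = Integer.parseInt(parts[col]);
    }
}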

AlexWien
  • Why should this be faster than processing the data immediately at read time? – Adrian Mar 08 '13 at 12:24
  • @Adrian Because it reads the file only once; file reading is slow. – AlexWien Mar 08 '13 at 12:26
  • @AlexWien I had this solution and it was slower because an extra container had to be created. – Sophie Sperner Mar 08 '13 at 12:27
  • Yes, but processing immediately also reads the file only once. It is not guaranteed that the method call always returns the same result, since the file could be altered elsewhere. – Adrian Mar 08 '13 at 12:27
  • @SophieSperner I doubt that you have measured correctly. Especially if the file is bigger than the cache, reading it twice is slower. – AlexWien Mar 08 '13 at 12:28
  • @Adrian You don't know the number of lines in advance; if the result must end up in X, you have to know the number of lines first. Of course you could turn X into an ArrayList, too. – AlexWien Mar 08 '13 at 12:30
  • @AlexWien Thanks for the explanation. Maybe you can add some points to your answer to make it clearer. – Adrian Mar 08 '13 at 12:32
  • @AlexWien I measured on a file with `n = 405342; m = 43; fileSize = 55.5Mb`. My approach took `2.4` seconds and `290` megabytes; yours took `3` seconds and `330` Mb. Just try it yourself. Maybe you are talking about extremely huge files? – Sophie Sperner Mar 08 '13 at 12:38
  • @SophieSperner Run your solution first, then mine. Which times do you get? – AlexWien Mar 08 '13 at 12:49
  • @AlexWien I took a bigger dataset: `n = 605338; m = 43; fileSize = 83Mb`. Mine: `3.6` seconds and `356Mb`; yours: `4.4` seconds and `391Mb`. – Sophie Sperner Mar 08 '13 at 12:56
  • If you try that on an embedded device, things may change. – AlexWien Mar 08 '13 at 13:02