
I have an input text file that is basically a TSV of people. I need to sort all the records (by last name, then first name) and then store them in a binary file. What I've done so far is create a DataRecord class with all the appropriate fields, getters/setters, and compareTo. In main, I've got an ArrayList of DataRecord for sorting purposes.

public class DataRecord implements Comparable<DataRecord>{

private String lastName, firstName, middleName, suffix, cityOfBirth;
private int monthOfBirth, dayOfBirth, yearOfBirth;
private char gender;


//getters
public String getLastName() { return this.lastName;}
public String getFirstName() { return this.firstName;}
public String getMiddleName() { return this.middleName;}
public String getSuffix() { return this.suffix;}
public String getCityOfBirth() { return this.cityOfBirth;}
public int getMonthOfBirth() { return this.monthOfBirth;}
public int getDayOfBirth() { return this.dayOfBirth;}
public int getYearOfBirth() { return this.yearOfBirth;}
public char getGender() { return this.gender;}

//setters
public void setLastName(String lastName) { this.lastName = lastName;}
public void setFirstName(String firstName) { this.firstName = firstName;}
public void setMiddleName(String middleName) { this.middleName = middleName;}
public void setSuffix(String suffix) { this.suffix = suffix;}
public void setCityOfBirth(String cityOfBirth) { this.cityOfBirth = cityOfBirth;}
public void setMonthOfBirth(int monthOfBirth) { this.monthOfBirth = monthOfBirth;}
public void setDayOfBirth(int dayOfBirth) { this.dayOfBirth = dayOfBirth;}
public void setYearOfBirth(int yearOfBirth) { this.yearOfBirth = yearOfBirth;}
public void setGender(char gender) { this.gender = gender;}

public DataRecord(){

}

//constructor to make copy of record passed in
public DataRecord(DataRecord copyFrom){
    this.lastName = copyFrom.getLastName();
    this.firstName = copyFrom.getFirstName();
    this.middleName = copyFrom.getMiddleName();
    this.suffix = copyFrom.getSuffix();
    this.monthOfBirth = copyFrom.getMonthOfBirth();
    this.dayOfBirth = copyFrom.getDayOfBirth();
    this.yearOfBirth = copyFrom.getYearOfBirth();
    this.gender = copyFrom.getGender();
    this.cityOfBirth = copyFrom.getCityOfBirth();
}

@Override
public int compareTo(DataRecord arg0) {
    // TODO Auto-generated method stub
    int lastNameCompare;

    //check if the last names are the same, if so return the first name comparison
    if ((lastNameCompare = this.getLastName().compareTo(arg0.getLastName())) == 0){
        return this.getFirstName().compareTo(arg0.getFirstName());
    }

    //otherwise return the last name comparison
    return lastNameCompare;

}

public String toString(){
    return this.getLastName() + ' ' + this.getFirstName();
}

}

public class IOController {

  public static void main(String[] args) throws IOException {
    File inputFile; // input file
    RandomAccessFile dataStream = null; // output stream
    ArrayList<DataRecord> records = new ArrayList<DataRecord>();

    BufferedReader reader = new BufferedReader(new FileReader(args[0]));
    try {
        String sb;
        String line = reader.readLine();
        String[] fields;

        // loop through and read all the lines in the input file
        while (line != null) {
            DataRecord currentRecord = new DataRecord();

            // store the current line into a local string
            sb = line;

            // create an array of all the fields
            fields = sb.split("\t");
            // set the fields for the DataRecord object
            currentRecord.setLastName(fields[0]);
            currentRecord.setFirstName(fields[1]);

            // check that all the remaining fields exist (indices 2..8 are read below)
            if (fields.length >= 9) {
                currentRecord.setMiddleName(fields[2]);
                currentRecord.setSuffix(fields[3]);
                currentRecord.setMonthOfBirth(Integer.parseInt(fields[4]));
                currentRecord.setDayOfBirth(Integer.parseInt(fields[5]));
                currentRecord.setYearOfBirth(Integer.parseInt(fields[6]));
                currentRecord.setGender(fields[7].charAt(0));
                currentRecord.setCityOfBirth(fields[8]);
            }

            // add the current record to the array list of records
            records.add(currentRecord);
            line = reader.readLine();
        }
    } finally {
        reader.close();
      //Collections.sort(records);
    }

    for (int i = 0; i < 5; i++) {
        System.out.println(records.get(i));
    }

}

}

My issue is that if I use a temporary DataRecord (named currentRecord) to read the fields and then add it to the ArrayList, every record in the ArrayList ends up with the same data. If I copy that data to another DataRecord object (using a constructor that takes a DataRecord), I run out of heap space.

records.add(new DataRecord(currentRecord));
line = reader.readLine();

Is my mistake using an ArrayList?

Jarryd Goodman
  • How big is your file? If your file is larger than the available memory, you're going to have to use an external sort approach. https://en.wikipedia.org/wiki/External_sorting If you have sufficient memory, you can increase the Java heap size. – p.magalhaes Sep 02 '15 at 22:52
  • The code looks good! Like @LuiggiMendonca said, I imagine the error is in "\t". – p.magalhaes Sep 02 '15 at 23:01
  • Why switch to "\\t"? It won't split on tab characters in that case, will it? – Jarryd Goodman Sep 02 '15 at 23:42
  • There's no problem using either "\\t" or "\t". Read this post: http://stackoverflow.com/questions/3762347/understanding-regex-in-java-split-t-vs-split-t-when-do-they-both-wor. I still imagine that the problem is in your file. Did you try running your program with another input file? – p.magalhaes Sep 03 '15 at 00:08

2 Answers


You're adding the same object reference to the ArrayList and updating that object on every iteration. Just create a new instance of the object on each iteration:

while (line != null) {
    DataRecord currentRecord = new DataRecord();
    // rest of the code...
    records.add(currentRecord);
}
//sort the list

As a best practice, declare your variables in the narrowest possible scope.

Since you run out of heap space, you may try giving your process more memory by using the -Xmx argument. If the PC running the process does not have enough RAM, use an alternative such as splitting the file into small chunks, sorting each chunk into its own file, and then merging the sorted files with a variant of merge sort.
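The chunk-and-merge idea can be sketched roughly as follows. This is only an illustration under stated assumptions, not a drop-in solution: the class name ExternalSortSketch, its method names, and the maxLines parameter are made up here, and it sorts whole lines by natural String order. That matches a last name, first name ordering only when those are the first two tab-separated columns; a comparator on parsed records could be substituted.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.PriorityQueue;

// Hypothetical sketch of an external merge sort over the lines of a text file.
class ExternalSortSketch {

    // One open chunk file plus the line currently at its head.
    static final class Source implements Comparable<Source> {
        final BufferedReader reader;
        String head;

        Source(BufferedReader reader) throws IOException {
            this.reader = reader;
            this.head = reader.readLine();
        }

        @Override
        public int compareTo(Source other) {
            return head.compareTo(other.head);
        }
    }

    // Sorts the lines of input into output, buffering at most maxLines lines at a time.
    static void sort(Path input, Path output, int maxLines) throws IOException {
        List<Path> chunks = new ArrayList<>();
        try (BufferedReader in = Files.newBufferedReader(input, StandardCharsets.UTF_8)) {
            List<String> buffer = new ArrayList<>(maxLines);
            String line;
            while ((line = in.readLine()) != null) {
                buffer.add(line);
                if (buffer.size() == maxLines) {
                    chunks.add(writeSortedChunk(buffer));
                    buffer.clear();
                }
            }
            if (!buffer.isEmpty()) {
                chunks.add(writeSortedChunk(buffer));
            }
        }
        merge(chunks, output);
        for (Path chunk : chunks) {
            Files.deleteIfExists(chunk);
        }
    }

    // Sorts one in-memory chunk and writes it to a temporary file.
    static Path writeSortedChunk(List<String> lines) throws IOException {
        Collections.sort(lines);
        Path chunk = Files.createTempFile("chunk", ".tsv");
        Files.write(chunk, lines, StandardCharsets.UTF_8);
        return chunk;
    }

    // k-way merge: repeatedly emit the smallest head line among all chunk files.
    static void merge(List<Path> chunks, Path output) throws IOException {
        PriorityQueue<Source> queue = new PriorityQueue<>();
        try (BufferedWriter out = Files.newBufferedWriter(output, StandardCharsets.UTF_8)) {
            for (Path chunk : chunks) {
                Source s = new Source(Files.newBufferedReader(chunk, StandardCharsets.UTF_8));
                if (s.head != null) queue.add(s); else s.reader.close();
            }
            while (!queue.isEmpty()) {
                Source s = queue.poll();
                out.write(s.head);
                out.newLine();
                s.head = s.reader.readLine();
                if (s.head != null) queue.add(s); else s.reader.close();
            }
        }
    }
}
```

With this sketch, only maxLines lines (plus one line per chunk during the merge) are held in memory at once, so the memory needed no longer grows with the size of the input file.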

Luiggi Mendoza
  • When I do this, I still run out of heapspace: Exception in thread "main" java.lang.OutOfMemoryError: Java heap space – Jarryd Goodman Sep 02 '15 at 22:53
  • Sorry, I think that heapspace error is when I call `Collections.sort` – Jarryd Goodman Sep 02 '15 at 22:54
  • Nvm, I commented out the sort and it still has a heap space error. The file is 43.7mb, no way around that. – Jarryd Goodman Sep 02 '15 at 22:56
  • And the error is raised on the line: `fields = sb.split("/t");` – Jarryd Goodman Sep 02 '15 at 22:56
  • Did you override the equals method of the record class? I recommend overriding the hashCode method too. Post the full code. – p.magalhaes Sep 02 '15 at 22:56
  • @JarrydGoodman use the following: `split("\\t");` and initialize your `ArrayList` with an initial size that can hold all the elements that you will read or at least 2/3 of it – Luiggi Mendoza Sep 02 '15 at 22:58
  • I made those changes, still running out of heap space. – Jarryd Goodman Sep 02 '15 at 23:05
  • How much memory are you using as max? Are you sure the problem is not somewhere else or that you're doing more stuff? – Luiggi Mendoza Sep 02 '15 at 23:06
  • @JarrydGoodman: I imagine that the problem is with the input file. The only way to run out of Java heap space in this code is if the variable line never becomes null, so you are in an "infinite loop" (or a loop that is too large). Did you check whether your file has stray (CR+LF) line endings inside the lines? Use, for example, Notepad++ to examine it. Try the same code with another file, for example one with just 5 lines. – p.magalhaes Sep 02 '15 at 23:10
  • @JarrydGoodman try changing from `ArrayList` to `LinkedList`. – Luiggi Mendoza Sep 02 '15 at 23:14
  • @LuiggiMendoza It works with a file of 35 lines no problem, and LinkedList did not fix it. – Jarryd Goodman Sep 02 '15 at 23:30
  • The reason I'm skeptical about adjusting heap size is that nobody else in my class made mention of that so it seems to be that there is another solution. – Jarryd Goodman Sep 02 '15 at 23:30

Is my mistake using an ArrayList?

No.

Your mistake is one or both of the following:

  • Attempting to hold the entire contents of a large file in memory at the same time. The alternative is to stream the data; e.g. read record, write record, read record, write record, etc. (Of course, the feasibility of doing that depends on the nature of your "binary" file representation.)

  • Attempting to run with a heap that is too small. The java command documentation explains how to increase the heap size, but obviously there are practical limits to that approach.
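To check how much heap the JVM is actually allowed to use, a minimal sketch along these lines may help. The class name HeapInfo and the method maxHeapMiB are made up for illustration; Runtime.maxMemory() is the standard call that reports the limit.

```java
// Hypothetical helper that reports the maximum heap the JVM will attempt to use.
class HeapInfo {

    // Maximum heap size in MiB, as reported by the running JVM.
    static long maxHeapMiB() {
        return Runtime.getRuntime().maxMemory() / (1024 * 1024);
    }

    public static void main(String[] args) {
        System.out.println("Max heap: " + maxHeapMiB() + " MiB");
    }
}
```

Running it with, say, `java -Xmx512m HeapInfo` should report roughly the requested size; without -Xmx it reports the JVM's default maximum, which depends on the machine and JVM version.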

And for the record, this is also a mistake:

    records.add(currentRecord);

If currentRecord is a single object that you reuse for every row, you will end up with a list containing (just) N references to that one object, which holds the last record in your TSV input file. If you are going to build an in-memory copy in a list, then you need to create a new DataRecord object for each row.
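The difference between reusing one object and creating one per row can be seen in a small sketch. The names AliasingDemo, Rec, loadShared and loadFresh are made up for illustration, and Rec is a one-field stand-in for DataRecord.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical demo of the shared-reference problem.
class AliasingDemo {

    // Minimal stand-in for DataRecord; one field is enough to show the effect.
    static class Rec {
        String lastName;
    }

    // Buggy: one object is created once and mutated on every iteration.
    static List<Rec> loadShared(String[] names) {
        List<Rec> out = new ArrayList<>();
        Rec shared = new Rec();
        for (String n : names) {
            shared.lastName = n;
            out.add(shared); // adds another reference to the same object
        }
        return out;
    }

    // Correct: a new object is created for each row.
    static List<Rec> loadFresh(String[] names) {
        List<Rec> out = new ArrayList<>();
        for (String n : names) {
            Rec r = new Rec();
            r.lastName = n;
            out.add(r);
        }
        return out;
    }

    public static void main(String[] args) {
        String[] names = {"Adams", "Baker", "Clark"};
        System.out.println(loadShared(names).get(0).lastName); // prints "Clark"
        System.out.println(loadFresh(names).get(0).lastName);  // prints "Adams"
    }
}
```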


For the record, changing to LinkedList won't help in the long term. The peak space usage for an ArrayList built by appending to a list created using new ArrayList() is roughly 3 x the size of a reference per entry (while the backing array is being grown, the old and new arrays exist at the same time). For a LinkedList the space usage is 3 x the size of a reference + 2 additional words per entry.

Stephen C