
I have a class which reads a CSV file, but when the file is large the program throws a Java heap space error, so I need to split the file into pieces and transfer the lines to other files according to a line count.

For example, I have a file of 500,000 lines and I want to divide it into 5 files of 100,000 lines each, so that I can read them one at a time.

I couldn't find a way to do that, so it would be nice to see some example code.

asked by Bora Ulu (edited by dagelf)
  • Do you have to have all lines in memory? Otherwise you could read line by line and do your processing. – bwright Mar 17 '20 at 14:09
  • You could also try increasing the heap size. – ControlAltDel Mar 17 '20 at 14:11
  • @bwright I created a list of DTOs which consists of the lines, as you said. This question is about my other option for reading that large CSV file. Do you have another option rather than splitting the file into pieces? – Bora Ulu Mar 17 '20 at 14:18
  • @ControlAltDel That is not a good option, as the size of the file changes. I can increase it, but tomorrow it could throw the exception again; there is no guarantee. – Bora Ulu Mar 17 '20 at 14:19
  • You are supposed to show an honest attempt. The goals are to prove that you have researched and to ensure that any solution provided by someone else will fit smoothly into your application. – Serge Ballesta Mar 17 '20 at 14:23
  • Also, you could avoid splitting the file and just process manageable chunks of it. – Akin Okegbile Mar 17 '20 at 14:23
  • Java has mechanisms (for example [`Files.lines`](https://docs.oracle.com/javase/8/docs/api/java/nio/file/Files.html#lines-java.nio.file.Path-)) to work with these files. Process it as a stream by reading line by line. – KarelG Mar 17 '20 at 14:24
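
For reference, a minimal sketch of the streaming approach KarelG suggests, with a hypothetical file name big.csv and the per-line processing left as a comment:

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.stream.Stream;

public class StreamCsv {
    public static void main(String[] args) throws IOException {
        // Files.lines reads lazily, so the whole file never has to fit in memory
        try (Stream<String> lines = Files.lines(Paths.get("big.csv"))) {
            lines.forEach(line -> {
                String[] data = line.split(",");
                // process one row at a time here
            });
        }
    }
}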

3 Answers

public static void splitLargeFile(final String fileName, 
                                   final String extension, 
                                   final int maxLines,
                                   final boolean deleteOriginalFile) {

    try (Scanner s = new Scanner(new FileReader(String.format("%s.%s", fileName, extension)))) {
        int file = 0;
        int cnt = 0;
        BufferedWriter writer = new BufferedWriter(new FileWriter(String.format("%s_%d.%s", fileName, file, extension)));

        while (s.hasNextLine()) {   // read whole lines, not whitespace-delimited tokens
            writer.write(s.nextLine() + System.lineSeparator());
            if (++cnt == maxLines && s.hasNextLine()) {
                writer.close();
                writer = new BufferedWriter(new FileWriter(String.format("%s_%d.%s", fileName, ++file, extension)));
                cnt = 0;
            }
        }
        writer.close();
    } catch (Exception e) {
        e.printStackTrace();
    }

    if (deleteOriginalFile) {
        try {
            File f = new File(String.format("%s.%s", fileName, extension));
            f.delete();
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}
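
A hypothetical call, assuming a file named big.csv in the working directory and 100,000 lines per chunk (neither is part of the answer):

splitLargeFile("big", "csv", 100000, false);  // writes big_0.csv, big_1.csv, ...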
answered by Ryan

If you're on Linux and you can run the CSV through a script first, then you can use "split":

$ split -l 100000 big.csv small-

This generates files named small-aa, small-ab, small-ac, and so on. To rename these to .csv files if needed:

$ for a in small-*; do 
    mv $a $a.csv;                # rename split files to .csv 
    java MyCSVProcessor $a.csv;  # or just process them anyways 
done

Try this for additional options:

$ split --help

-a, --suffix-length=N   use suffixes of length N (default 2)
-b, --bytes=SIZE        put SIZE bytes per output file
-C, --line-bytes=SIZE   put at most SIZE bytes of lines per output file
-d, --numeric-suffixes  use numeric suffixes instead of alphabetic
-l, --lines=NUMBER      put NUMBER lines per output file

This is, however, a poor mitigation for your problem: the reason your CSV reader is running out of memory is that it is either reading the whole file into memory before parsing it, or it is doing that and also keeping your processed output in memory. To make your code more portable and robust, you should consider processing one line at a time and splitting the input yourself, line by line. (From https://stackabuse.com/reading-and-writing-csvs-in-java/)

BufferedReader csvReader = new BufferedReader(new FileReader(pathToCsv));
String row;
while ((row = csvReader.readLine()) != null) {
    String[] data = row.split(",");
    // do something with the data
}
csvReader.close();

A caveat with the above code is that commas inside quoted fields will be treated as column separators - you will have to add some additional processing if your CSV data contains quoted commas.
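
If a full CSV library is overkill, one common stopgap is a regex that only splits on commas outside double quotes. This is just a sketch; it assumes no escaped quotes and no newlines inside quoted fields:

// split on commas followed by an even number of double quotes,
// i.e. commas that are not inside a quoted field
String[] data = row.split(",(?=(?:[^\"]*\"[^\"]*\")*[^\"]*$)", -1);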

Of course, if you really want to use your existing code, and just want to split the file, you can adapt the above:

import java.io.*;

public class split {

    static String CSVFile="test.csv";
    static String row;
    static BufferedReader csvReader;
    static PrintWriter csvWriter;

    public static void main(String[] args) throws IOException {

        csvReader = new BufferedReader(new FileReader(CSVFile));

        int line = 0;
        while ((row = csvReader.readLine()) != null) {
            if (line % 100000 == 0) {                 // maximum lines per file
                if (line > 0) { csvWriter.close(); }  // close the previous chunk
                csvWriter = new PrintWriter("cut-" + line + CSVFile);
            }
            csvWriter.println(row);
            // String[] data = row.split(",");
            // do something with the data
            line++;
        }
        if (csvWriter != null) { csvWriter.close(); } // guard against an empty input file
        csvReader.close();
    }
}

I chose PrintWriter over FileWriter or BufferedWriter because it automatically prints the relevant newlines - and I would presume that it's buffered... I've not written anything in Java in 20 years, so I bet you can improve on the above.
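
If you would rather make the buffering explicit than presume it, you can chain the writers yourself; this sketch (not part of the original code) is a drop-in replacement for the PrintWriter line above:

csvWriter = new PrintWriter(new BufferedWriter(
        new FileWriter("cut-" + line + CSVFile)));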

answered by dagelf

I created a simple function that creates a child CSV from a parent CSV based on a start and end range. It can be used as a splitter based on a line range.

public static void createcsv(String csvPath, String newcsvPath, int startRange, int lastRange) {
    csvPath = csvPath.trim();
    String childcsvPath = newcsvPath.trim();
    Scanner sc = null;
    FileWriter writer = null;
    int count = 0;

    try {
        sc = new Scanner(new File(csvPath));
        ArrayList<String> newCsv = new ArrayList<String>();

        // collect the lines between startRange and lastRange (inclusive)
        while (sc.hasNextLine()) {
            String value = sc.nextLine();
            count++;
            if (count > lastRange) {
                break;
            }
            if (count >= startRange) {
                newCsv.add(value);
            }
        }

        // write the collected lines to the child CSV
        writer = new FileWriter(childcsvPath);
        for (int j = 0; j < newCsv.size(); j++) {
            writer.append(newCsv.get(j));
            writer.append("\n");
        }
    } catch (Exception e) {
        System.out.print("Exception found: " + e);
    } finally {
        if (sc != null) {
            sc.close();
        }
        if (writer != null) {
            try {
                writer.close();   // close() also flushes the buffered output
            } catch (Exception e) {
            }
        }
    }
}
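
A hypothetical call, copying lines 100,001 through 200,000 of parent.csv into child.csv (the file names and range are only examples):

createcsv("parent.csv", "child.csv", 100001, 200000);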