
I have a 35GB CSV file. I want to read each line, and write the line out to a new CSV if it matches a condition.

try (BufferedWriter writer = Files.newBufferedWriter(Paths.get("target.csv"))) {
    try (BufferedReader br = Files.newBufferedReader(Paths.get("source.csv"))) {
        br.lines().parallel()
            .filter(line -> StringUtils.isNotBlank(line)) //bit more complex in real world
            .forEach(line -> {
                try {
                    writer.write(line + "\n");
                } catch (IOException e) {
                    throw new UncheckedIOException(e);
                }
            });
    }
}

This takes approx. 7 minutes. Is it possible to speed up that process even more?

membersound (edited by AbdelAziz AbdelLatef)
  • Yes, you could try not doing this from Java but rather do it directly from your Linux/Windows/etc. operating system. Java is interpreted, and there will always be an overhead in using it. Besides this, no, I don't see any obvious way to speed it up, and 7 minutes for 35GB seems reasonable to me. – Tim Biegeleisen Oct 22 '19 at 09:50
  • Maybe removing the `parallel` makes it faster? And doesn't that shuffle the lines around? – Thilo Oct 22 '19 at 09:51
  • Removing `parallel()` gives +1min longer on top. I don't care about shuffled lines in a csv. – membersound Oct 22 '19 at 09:52
  • Create the `BufferedWriter` yourself, using the [constructor](https://docs.oracle.com/en/java/javase/11/docs/api/java.base/java/io/BufferedWriter.html#%3Cinit%3E(java.io.Writer,int)) that lets you set the buffer size. Maybe a bigger (or smaller) buffer size will make a difference. I would try to match the `BufferedWriter` buffer size to the host operating system buffer size. – Abra Oct 22 '19 at 09:59
  • How can I know what buffer size is suitable? The default is `8192`. – membersound Oct 22 '19 at 10:04
  • By trial and error – Sterconium Oct 22 '19 at 10:15
  • @TimBiegeleisen: "Java is interpreted" is misleading at best and almost always wrong as well. Yes, for some optimizations you might need to leave the JVM world, but doing this quicker in Java is *definitely* doable. – Joachim Sauer Oct 22 '19 at 10:16
  • @JoachimSauer Then why didn't you post an answer? ^ ^ – Tim Biegeleisen Oct 22 '19 at 10:17
  • You should profile the application to see if there are any hotspots that you can do something about. You won't be able to do much about the raw IO (the default 8192 byte buffer isn't that bad, since there are sector sizes etc. involved), but there might be things happening (internally) that you might be able to work with. – Kayaman Oct 22 '19 at 10:38
  • On the non-functional side: how large is the output after the filtering logic you're performing? And how about splitting the file into chunks, performing the operation, and merging the results? – Naman Oct 22 '19 at 10:45
  • The resulting file is about 30GB. – membersound Oct 22 '19 at 10:46
  • Try `java.util.Scanner`. It allows pattern matching right within the mutable buffer, rather than creating immutable `String` instances. Take care to extract only the intended portions, without superfluous intermediate substring operations. This could be improved even more by a custom implementation that allows passing the input buffer directly to the output writer (the fragment specified by offsets). Don't use `BufferedReader`/`BufferedWriter`. (A rough sketch follows after these comments.) – Holger Oct 22 '19 at 12:22
  • This sounds promising, could you give an example of scanner pattern matching? I mean: `scanner.nextLine()` still returns a `String`, so the conversion has already taken place, even if I apply `scanner.skipPattern()` beforehand... – membersound Oct 24 '19 at 09:34
  • @membersound Could you shed more light on the filtering? Is it something like `>#> some text # some more text` and you want to read the delimiter `#` and then substring say from `#` to the end of the line? – diginoise Oct 24 '19 at 14:29
  • In one of my cases (there are many), I want to skip anything that is contained within two separators, like `#`. – membersound Oct 24 '19 at 15:24
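
Following up on Holger's Scanner suggestion in the comments above, a minimal sketch could look like the following. It is only an illustration built on the hypothetical case from the last comment (drop whatever sits between two `#` separators): the pattern, the assumption that every relevant line contains exactly two `#`, and the source.csv/target.csv names are all placeholders to adapt; lines that don't match the pattern are simply skipped over.

import java.io.IOException;
import java.io.Writer;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Scanner;
import java.util.regex.MatchResult;
import java.util.regex.Pattern;

public class ScannerFilterSketch {
    // hypothetical line shape: keep what is before the first '#' and after the
    // second '#', drop what sits between them; adjust to the real format
    private static final Pattern LINE =
            Pattern.compile("([^#\\r\\n]*)#[^#\\r\\n]*#([^\\r\\n]*)\\R?");

    public static void main(String[] args) throws IOException {
        try (Scanner scanner = new Scanner(Paths.get("source.csv"));
             Writer writer = Files.newBufferedWriter(Paths.get("target.csv"))) {
            // findWithinHorizon(..., 0) matches against the Scanner's internal buffer
            // and advances past each match; non-matching lines are skipped over
            while (scanner.findWithinHorizon(LINE, 0) != null) {
                MatchResult match = scanner.match();
                writer.write(match.group(1));
                writer.write(match.group(2));
                writer.write('\n');
            }
        }
    }
}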

3 Answers


If it is an option, you could use GZIPInputStream/GZIPOutputStream to minimize disk I/O.

Files.newBufferedReader/Writer use a default buffer size, 8 KB I believe. You might try a larger buffer.

Converting to String (Unicode) slows things down (and uses twice the memory). The UTF-8 being used is not as simple as StandardCharsets.ISO_8859_1.

Best would be if you can work with bytes for the most part and only for specific CSV fields convert them to String.

A memory mapped file might be the most appropriate. Parallelism might be used by file ranges, splitting up the file.

try (FileChannel sourceChannel = new RandomAccessFile("source.csv","r").getChannel(); ...
MappedByteBuffer buf = sourceChannel.map(...);

This will end up being a fair amount of code, getting lines right on (byte) '\n', but it is not overly complex (a rough sketch follows after the comments below).

Joop Eggen
  • The problem with reading bytes is that in the real world I have to evaluate the beginning of the line, substring on a specific character, and only write the remaining part of the line into the outfile. So I probably cannot read the lines as bytes only? – membersound Oct 22 '19 at 10:31
  • I just tested `GZipInputStream + GZipOutputStream` fully in-memory on a ramdisk. Performance was much worse... – membersound Oct 22 '19 at 10:35
  • On Gzip: then it is not a slow disk. Yes, bytes is an option: newlines, comma, tab, semicolon all can be handled as bytes, and will be considerably faster than as String. Otherwise it goes bytes as UTF-8, to UTF-16 chars, to String, back to UTF-8, back to bytes. – Joop Eggen Oct 22 '19 at 10:41
  • While this sounds promising, how could I use the `MappedByteBuffer` beyond the 2GB filesize limit? – membersound Oct 22 '19 at 11:24
  • Just map different parts of the file over time. When you reach the limit, just create a new `MappedByteBuffer` from the last known-good position (`FileChannel.map` takes longs). – Joachim Sauer Oct 22 '19 at 11:32
  • @JoachimSauer is right, with the problem of the last line broken over 2 buffers. Thanks for the tip. – Joop Eggen Oct 22 '19 at 11:35
  • So there is no example on the net how to read/write files > 2GB with `MappedByteBuffer`? – membersound Oct 22 '19 at 11:46
  • In 2019, there is no need to use `new RandomAccessFile(…).getChannel()`. Just use `FileChannel.open(…)`. – Holger Oct 24 '19 at 07:56
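
Putting the answer together with the comments (mapping the file in chunks to stay under the 2 GB limit, and using FileChannel.open), a rough sketch might look like the following. It is only an outline, not a drop-in solution: the keep check is a stand-in for the real filter, the file names are the usual placeholders, and it assumes the file ends with a newline and that no single line is longer than the mapping window.

import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;

public class MappedFilterSketch {
    private static final long WINDOW = 256L * 1024 * 1024; // well below the 2 GB map limit

    public static void main(String[] args) throws IOException {
        try (FileChannel in = FileChannel.open(Paths.get("source.csv"), StandardOpenOption.READ);
             FileChannel out = FileChannel.open(Paths.get("target.csv"),
                     StandardOpenOption.CREATE, StandardOpenOption.WRITE,
                     StandardOpenOption.TRUNCATE_EXISTING)) {
            long size = in.size();
            long pos = 0;
            while (pos < size) {
                MappedByteBuffer buf = in.map(FileChannel.MapMode.READ_ONLY, pos, Math.min(WINDOW, size - pos));
                int lineStart = 0;
                int lastNewline = -1;
                for (int i = 0; i < buf.limit(); i++) {
                    if (buf.get(i) == '\n') {
                        if (keep(buf, lineStart, i)) {
                            ByteBuffer line = buf.duplicate();
                            line.position(lineStart);
                            line.limit(i + 1); // include the '\n'
                            out.write(line);   // bytes go straight out, no String involved
                        }
                        lineStart = i + 1;
                        lastNewline = i;
                    }
                }
                if (lastNewline < 0) {
                    throw new IllegalStateException("line longer than mapping window at " + pos);
                }
                // re-map starting right after the last complete line, so a line that is
                // split across two windows gets processed again in the next iteration
                pos += lastNewline + 1;
            }
        }
    }

    // stand-in for the real condition: keep lines that are not blank
    private static boolean keep(ByteBuffer buf, int from, int to) {
        for (int i = from; i < to; i++) {
            byte b = buf.get(i);
            if (b != ' ' && b != '\t' && b != '\r') {
                return true;
            }
        }
        return false;
    }
}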

You can try this:

try (BufferedWriter writer = new BufferedWriter(new FileWriter(targetFile), 1024 * 1024 * 64);
     BufferedReader br = new BufferedReader(new FileReader(sourceFile), 1024 * 1024 * 64)) {
    // same filter/write logic as in the question
}

I think it will save you one or two minutes. On my machine the test completes in about 4 minutes when the buffer size is specified.

Could it be faster? Try this:

final char[] cbuf = new char[1024 * 1024 * 128];

try (Writer writer = new FileWriter(targetFile)) {
  try (Reader br = new FileReader(sourceFile)) {
    int cnt = 0;
    while ((cnt = br.read(cbuf)) > 0) {
      // add your code to process/split the buffer into lines.
      writer.write(cbuf, 0, cnt);
    }
  }
}

This should save you three or four minutes.
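
The tricky part in the second variant is the "split the buffer into lines" step, because a line can straddle two read() calls. A minimal sketch of that carry-over handling (with a blank-line check standing in for the real filter; it extends the snippet above, so targetFile/sourceFile and the java.io types are the same as there):

final char[] cbuf = new char[1024 * 1024 * 128];

try (Writer writer = new FileWriter(targetFile);
     Reader br = new FileReader(sourceFile)) {
  StringBuilder carry = new StringBuilder(); // unfinished line from the previous read
  int cnt;
  while ((cnt = br.read(cbuf)) > 0) {
    int lineStart = 0;
    for (int i = 0; i < cnt; i++) {
      if (cbuf[i] == '\n') {
        String line = carry.append(cbuf, lineStart, i - lineStart).toString();
        carry.setLength(0);
        if (!line.trim().isEmpty()) { // stand-in for the real condition
          writer.write(line);
          writer.write('\n');
        }
        lineStart = i + 1;
      }
    }
    carry.append(cbuf, lineStart, cnt - lineStart); // keep the unfinished tail
  }
  // a final line without a trailing '\n' would still be sitting in 'carry' here
}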

If that's still not enough (I guess the reason you ask is that you need to perform this task repeatedly) and you want to get it done in one minute or even a couple of seconds, then you should process the data and save it into a DB, and handle the task with multiple servers.

user_3380739
  • To your last example: how can I then evaluate the `cbuf` content, and only write portions out? And would I have to reset the buffer once full? (how can I know the buffer is full?) – membersound Oct 24 '19 at 10:22

Thanks to all your suggestions, the fastest solution I came up with was exchanging the writer for a BufferedOutputStream, which gave approx. 25% improvement:

try (BufferedReader reader = Files.newBufferedReader(Paths.get("sample.csv"))) {
    try (BufferedOutputStream writer = new BufferedOutputStream(Files.newOutputStream(Paths.get("target.csv")), 1024 * 16)) {
        reader.lines().parallel()
                .filter(line -> StringUtils.isNotBlank(line)) //bit more complex in real world
                .forEach(line -> {
                    try {
                        writer.write((line + "\n").getBytes());
                    } catch (IOException e) {
                        throw new UncheckedIOException(e);
                    }
                });
    }
}

Still, the BufferedReader performs better than a BufferedInputStream in my case.

membersound