
I have some medium-large CSV files (about 140 MB each) and I'm trying to turn them into an array of structs.
I don't want to load the whole file into memory, so I'm using a stream reader.
For each line I read the data, turn the line into my struct, and append the struct to the array. Because there are more than 5_000_000 lines in total, I used reserveCapacity for better memory management.

    var dataArray: [inputData] = []
    dataArray.reserveCapacity(5_201_014)

Unfortunately, that doesn't help at all; there is no performance difference. The memory graph in the debug session rises to 1.54 GB and then stays there. I'm wondering what I'm doing wrong, because I can't imagine that it takes 1.54 GB of RAM to store an array of structs built from a file with an original size of 140 MB.
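For scale: assuming inputData is just three Float32 fields (the append call below suggests this, though the struct definition isn't shown), the raw element storage should come to roughly 60 MB:

    struct inputData { var a, b, c: Float32 }   // assumed layout, inferred from the append call below

    // Three 4-byte floats pack with no padding, so stride == 12 bytes.
    let expectedBytes = 5_201_014 * MemoryLayout<inputData>.stride
    print(Double(expectedBytes) / 1_048_576)     // ≈ 59.5 MB of raw element storage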

I use the following code to create the array:

    var dataArray: [inputData] = []
    dataArray.reserveCapacity(5_201_014)

    let stream = StreamReader(path: "pathToDocument")
    defer { stream!.close() }

    while let line = stream!.nextLine() {
        // Skip header rows; the force-unwraps assume well-formed CSV.
        if !line.isHeader() {
            let array = line.components(separatedBy: ",")
            dataArray.append(inputData(a: Float32(array[0])!,
                                       b: Float32(array[1])!,
                                       c: Float32(array[2])!))
        }
    }

I know that there are several CSV reader packages on GitHub, but they are extremely slow.

Here is a screenshot of the debug session:

[screenshot: memory graph rising to 1.54 GB]

Thanks for any advice.

DoTryCatch
  • Are you doing this conversion on the main thread or a background thread? – Gulfam Khan Jul 06 '21 at 13:22
  • @GulfamKhan Yes, but there is no real difference between running on the main thread and running at utility or background QoS; there it just took about 5 sec longer. – DoTryCatch Jul 06 '21 at 13:26
  • Putting it in a background thread might not solve the memory usage problem. I suspect that the problem is that the autoreleasepool isn't being drained, because that happens in the main run loop. Try nesting the body of your while loop in `autoreleasepool { ... }` – Chip Jarred Jul 06 '21 at 13:34
  • If `stream!.nextLine()` is where all of the allocations are happening, you might need to move it inside the `autoreleasepool` too. In that case, it will be `if let line = stream!.nextLine() { ... ; return true } else { return false }`, then use the value returned by `autoreleasepool` to control the loop (see the combined sketch after these comments). – Chip Jarred Jul 06 '21 at 13:41
  • @ChipJarred Thanks a lot. This saves about 1 GB of RAM. It is still 514 MB higher than expected, but much better. – DoTryCatch Jul 06 '21 at 13:43
  • 60% is a good first pass. You are reserving space for roughly 5 million Float32, which comes to 20+ MB. StreamReader is probably retaining some references internally that could possibly be released. – Chip Jarred Jul 06 '21 at 13:47
  • What about `let array = line.components(separatedBy: ",")`? This allocates a new array for each line. You could save space by iterating over the string yourself and parsing the floats on the fly. – jraufeisen Jul 06 '21 at 13:57
  • I miscounted. The space reserved is for 15-ish million Float32, so 60+ MB. Still, that's not the problem. I'm thinking it's something internal to StreamReader. Is it doing work on a concurrent queue (maybe some double buffering or something)? – Chip Jarred Jul 06 '21 at 14:15
  • @JoRa, doesn't `line.components` return `[Substring]`? `Substring` is pretty lightweight, but yeah, an array of them would build up. After nesting it in the `autoreleasepool`, though, they would be deallocated on each iteration too. – Chip Jarred Jul 06 '21 at 14:17
  • I now switched from stream reading to reading the whole file at once, creating an array from it, and looping over that. For some reason this saves another 160 MB of RAM, even though it now has to store three times the full size in memory. No idea what's going on there. – DoTryCatch Jul 06 '21 at 14:24
  • Is [https://github.com/hectr/swift-stream-reader](https://github.com/hectr/swift-stream-reader) the `StreamReader` you were using? I was reading over its source code, and it does allocate some buffers internally, but it does so synchronously, so they should be deallocated by the `autoreleasepool` too. If you don't want to read the whole thing, you could write your own using `FileHandle` directly (a sketch follows these comments). – Chip Jarred Jul 06 '21 at 14:30
  • @ChipJarred Yeah, that's exactly the one I use, but I found the code somewhere here on Stack Overflow. I'll maybe try writing my own stream reader; hopefully that saves some more memory. Thanks for your great help/ideas. – DoTryCatch Jul 06 '21 at 14:37
  • I had a thought about the 514 MB over-expected memory use after using `autoreleasepool`. When the array needs to re-allocate to append more data, it doubles its current capacity, so if you append one more element than the reserved 5,201,014, it will jump to 10,402,028... and double again when you reach that limit. If you don't end up using a significant fraction of that additional capacity, it's basically wasted memory and might account for some of the unexpected overage. – Chip Jarred Jul 08 '21 at 13:57
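Putting the comment suggestions together (the `autoreleasepool` wrapping from Chip Jarred, plus jraufeisen's point about not building a fresh `[String]` per line), here is a minimal sketch of the rewritten loop. It assumes the same `StreamReader`, `inputData`, and `isHeader()` as the question, and uses `split(separator:)` in place of hand-rolled parsing:

    while true {
        let more: Bool = autoreleasepool {
            // Pull the next line inside the pool so its temporaries are released each pass.
            guard let line = stream!.nextLine() else { return false }
            if !line.isHeader() {
                // split(separator:) yields Substrings into the original line,
                // avoiding the fresh [String] that components(separatedBy:) builds.
                let fields = line.split(separator: ",")
                if fields.count >= 3,
                   let a = Float32(fields[0]),
                   let b = Float32(fields[1]),
                   let c = Float32(fields[2]) {
                    dataArray.append(inputData(a: a, b: b, c: c))
                }
            }
            return true
        }
        if !more { break }
    }

Returning a Bool from the pool closure lets each iteration drain its temporaries while still signaling when the stream is exhausted.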
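And if `StreamReader` itself turns out to retain buffers, a hand-rolled replacement over `FileHandle`, as suggested in the comments, is short. A minimal sketch assuming UTF-8 input; the class name `LineReader` and the 64 KB chunk size are placeholders, not part of the original code:

    import Foundation

    final class LineReader {
        private let handle: FileHandle
        private var buffer = Data()
        private var atEOF = false
        private let chunkSize = 65_536   // read 64 KB at a time

        init?(path: String) {
            guard let h = FileHandle(forReadingAtPath: path) else { return nil }
            handle = h
        }

        func nextLine() -> String? {
            let newline = UInt8(ascii: "\n")
            while true {
                // Return a complete line if one is already buffered.
                if let i = buffer.firstIndex(of: newline) {
                    var line = String(decoding: buffer[buffer.startIndex..<i], as: UTF8.self)
                    buffer.removeSubrange(buffer.startIndex...i)
                    if line.hasSuffix("\r") { line.removeLast() }   // tolerate CRLF endings
                    return line
                }
                // Otherwise pull in the next chunk, or flush the remainder at EOF.
                if atEOF {
                    guard !buffer.isEmpty else { return nil }
                    let line = String(decoding: buffer, as: UTF8.self)
                    buffer.removeAll()
                    return line
                }
                let chunk = handle.readData(ofLength: chunkSize)
                if chunk.isEmpty { atEOF = true } else { buffer.append(chunk) }
            }
        }

        func close() { handle.closeFile() }
    }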
