I have multiple .arrow files, each about 1GB (their total size is larger than my RAM). I tried to open all of them as a single dataframe using vaex.open_many(), and saw memory usage increase by gigabytes; opening also took longer than I expected.

These arrow files were generated by first making Elasticsearch queries and storing the results as a pandas dataframe (df_pd). I then did a fillna() and set the datatype of each column (converting to Arrow had raised errors when a column had NaN values or mixed datatypes). Finally, I converted each df_pd dataframe to an arrow file using vaex:

```python
vaex_df = vaex.from_pandas(df=df_pd)
vaex_df.export("file1.arrow")
```
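
For context, the cleaning step looked roughly like the sketch below; the column names, fill values, and dtypes are hypothetical placeholders, not the actual ES schema:

```python
import pandas as pd

# Hypothetical stand-in for a dataframe built from Elasticsearch results.
df_pd = pd.DataFrame({"count": [1.0, None, 3.0], "name": ["a", None, "c"]})

# Replace NaNs and pin each column's dtype, since converting to Arrow
# raised errors on NaNs and on mixed-type columns.
df_pd["count"] = df_pd["count"].fillna(0).astype("int64")
df_pd["name"] = df_pd["name"].fillna("").astype(str)
```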

This was repeated for different ES query time periods. It was only after I had created the arrow files that I tried to open them with vaex.

I also tried opening just one file, using the code below.

```python
%%time
df = vaex.open("file1.arrow")
```
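
To put a number on the memory growth instead of just watching free -h, a sketch like this (assuming the psutil package is installed) can report the process's resident set size around the open:

```python
import psutil
import vaex

proc = psutil.Process()
rss_before = proc.memory_info().rss

df = vaex.open("file1.arrow")

rss_after = proc.memory_info().rss
print(f"RSS grew by {(rss_after - rss_before) / 1e9:.2f} GB")
```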

I noticed that it takes about 4-5 seconds to open the file, and that the free memory (as shown in the free column of free -h) kept decreasing until it was ~1GB lower.

I thought that when opening arrow files, vaex would use memory-mapping and thus wouldn't actually use up so much memory, and that opening would also be faster. Is my understanding correct, or am I doing something wrong?
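
For completeness, the multi-file open that showed the same behaviour was essentially the following; the file names are placeholders:

```python
import vaex

# Open several ~1GB arrow files as one dataframe; with true
# memory-mapping I expected this to be nearly instant and cheap.
df = vaex.open_many(["file1.arrow", "file2.arrow", "file3.arrow"])
```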

Rayne
  • Memory mapping uses memory. Stuff gets read from disk using a different path, but it still gets read in. There's no magic involved. – Tim Roberts Jan 20 '22 at 07:23
  • Does memory-mapping use the same amount of memory as the filesize? I thought the memory usage would be much lower, hence I was interested in using vaex for large datasets that are too big to fit into my RAM all at once. – Rayne Jan 20 '22 at 07:30
  • It depends. If you access all of the memory then yes, it will all need to be loaded into RAM. However, you can memory map a large file and then access small parts of it without needing to load the entire thing in memory. I don't know `vaex` but intuitively I would think that merely opening a memory mapped file wouldn't actually load the full thing into RAM. That being said, it is possible vaex is trying to preload with the assumption that you will be soon accessing that data. – Pace Jan 20 '22 at 07:48
  • Are you doing anything at all with `df`? Like trying to print it or convert it to pandas or anything? – Pace Jan 20 '22 at 07:50
  • @Pace No, I was only opening the file. I noticed the memory usage before working with the dataframe using the vaex APIs. – Rayne Jan 20 '22 at 08:57
  • I thought vaex would not read the file into memory on open, based on its documentation (https://vaex.io/docs/example_arrow.html#Opens-instantly). And looking at the timing, `%time` does report microseconds, but if I time how long the whole cell takes to finish running (`%%time`), it takes seconds. – Rayne Jan 20 '22 at 08:59
  • 3
    Memory mapped files use the same operating system components as the swap file. When you map a file, it is assigned virtual address space, equal to the size of the mapping, just as if you had done a malloc. It won't consume any PHYSICAL memory until you touch the pages, at which point they will be swapped in just like pages from your working set that had been swapped out. – Tim Roberts Jan 20 '22 at 18:04
  • @TimRoberts Thanks for the explanation. So does that mean I should not open more files than the amount of total memory I have? Also, my files reside in a different disk partition than where my Anaconda and vaex are installed. Does this affect the loading time? – Rayne Jan 22 '22 at 02:13
  • 1
    Well, you CAN "oversubscribe", but as stuff gets read in, other pages get swapped out. You can end up with thrashing. It's a tricky balance. Spreading the files around only matters if you're interleaving them. With Python, you don't have that much control. – Tim Roberts Jan 22 '22 at 07:34
  • I tried converting the arrow file to HDF5 (a conversion sketch follows these comments), and it turns out that opening the HDF5 file takes only milliseconds, and the memory usage is also minimal. Seems like there's a difference between the 2 file formats. – Rayne Jan 24 '22 at 03:46
  • Seems like a good question for someone on the Vaex team. You might ask on their github. Arrow's IPC format is intended for batch-wise random access (e.g. you can access any batch of data without reading the entire file into memory) but this doesn't really have anything to do with memory mapping. – Pace Jan 25 '22 at 02:53
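
For reference, the Arrow-to-HDF5 conversion mentioned in the comments can be done with vaex itself; a minimal sketch, with placeholder file names:

```python
import vaex

# Open the arrow file and re-export it as HDF5; per the comments above,
# vaex.open() on the resulting .hdf5 file takes only milliseconds.
df = vaex.open("file1.arrow")
df.export_hdf5("file1.hdf5")
```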

0 Answers