I noticed that when saving large dataframes as CSVs, the memory allocations are at least an order of magnitude higher than the size of the dataframe in memory (or of the CSV file on disk). Why is this the case, and is there a way to prevent it? That is, is there a way to save a dataframe to disk without allocating much more memory than the dataframe itself occupies?
In the example below I generate a dataframe with a single integer column and 10 million rows. It occupies about 76 MiB in memory, yet writing it to CSV allocates 1.35 GiB.
using DataFrames, CSV
function generate_df(n::Int64)
    # a single integer column with n rows
    DataFrame(a = 1:n)
end
julia> @time tmp = generate_df(10000000);
0.671053 seconds (2.45 M allocations: 199.961 MiB)
julia> Base.summarysize(tmp) / 1024 / 1024
76.29454803466797
julia> @time CSV.write(expanduser("~/tmp/test.csv"), tmp)
3.199506 seconds (60.11 M allocations: 1.351 GiB)
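For comparison, here is a minimal hand-rolled writer that I would expect to allocate far less, since it streams each value straight to the buffered file handle instead of building intermediate row strings. This is an untested sketch: write_plain_csv is just a name I made up, and it only handles the simple case with no quoting or escaping.

function write_plain_csv(path::AbstractString, df::DataFrame)
    open(path, "w") do io
        # header row
        println(io, join(names(df), ','))
        # stream each value directly; no per-row string is built
        for row in eachrow(df)
            for (i, name) in enumerate(names(df))
                i > 1 && print(io, ',')
                print(io, row[name])
            end
            println(io)
        end
    end
end

If CSV.write's allocations are inherent to its quoting and escaping machinery, a loop like this might sidestep them, but I would rather understand the behavior than work around it.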