I'm an R user with a great interest in Julia, and I'm strongly willing to switch to Julia in the long term. I looked for a large CSV file on the internet, found this US government website, and downloaded the College Scorecard dataset. I tried to read the CSV file in Juno with the following commands:
using CSV
using DataFrames
@time df = CSV.read("/Path/to/Most-Recent-Cohorts-Scorecard-Elements.csv", rows_for_type_detect = 7175);
I got the following output:
212.333866 seconds (43.84 M allocations: 2.244 GiB, 0.25% gc time)
I inserted the keyword argument rows_for_type_detect = 7175 because otherwise I got an error message and could not open the file. See this other question for why that might happen.
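In case it helps others who hit the same type-detection problem: if I read the CSV.jl documentation correctly, one can also pass column types explicitly via the types keyword, so the parser does not have to guess them at all. A minimal sketch, where the column names and types are illustrative placeholders rather than the actual Scorecard schema:

using CSV
using DataFrames

# Hypothetical example: fix the types of troublesome columns up front,
# so that type detection cannot fail partway through the file.
df = CSV.read("/Path/to/Most-Recent-Cohorts-Scorecard-Elements.csv",
              types = Dict("UNITID" => String,
                           "SAT_AVG" => Union{Missing, Float64}))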
When doing the same operation in R
with
start_time <- Sys.time()
df_try = read.csv("/Path/to/Most-Recent-Cohorts-Scorecard-Elements.csv")
end_time <- Sys.time()
end_time - start_time
I get the following output:
Time difference of 0.3337972 secs
Is there a way to read a large CSV file into a DataFrame more efficiently in Julia?
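One caveat for the timing comparison above: in Julia, the first call to a function also includes just-in-time compilation, so a single @time in a fresh session overstates the steady-state cost. A minimal sketch of a fairer measurement; the BenchmarkTools part assumes that extra package is installed:

using CSV
using DataFrames

path = "/Path/to/Most-Recent-Cohorts-Scorecard-Elements.csv"

@time CSV.read(path);  # first call: includes compilation of the parsing code
@time CSV.read(path);  # second call: closer to the pure parsing cost

# More robust timing, averaged over many samples (requires BenchmarkTools):
using BenchmarkTools
@btime CSV.read($path);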
MAJOR EDIT
As pointed out by @BogumiłKamiński, this difference between R and Julia for this particular task LARGELY DECREASES when using the newest versions of Julia and CSV. So please read my message above (which I frankly hesitated to simply delete) with a significant grain of salt, and read Bogumił Kamiński's comment! And a big thank you to all the developers who give their free time to build and improve a wonderful language like Julia, free of charge!
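For anyone who wants to reproduce this, one way to check which CSV.jl version is installed, and to upgrade it, is the Pkg API (equivalently, ] status CSV and ] update CSV in the REPL's pkg> mode):

using Pkg

Pkg.status("CSV")  # show the currently installed version of CSV.jl
Pkg.update("CSV")  # upgrade CSV.jl to the newest compatible release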
EDIT N°2
Now, when performing
@time df = CSV.read(joinpath("/path/to/file.csv"))
here is the result:
0.184593 seconds (223.44 k allocations: 5.321 MiB)
Brilliant! Thank you @Bogumił Kamiński!