
I'm an R user with a great interest in Julia and a strong willingness to switch to it in the long term. I looked for a large CSV file on the internet, found this US government website, and downloaded the College Scorecard dataset. I tried to read the CSV file in Juno with the following command:

using CSV
using DataFrames

@time df = CSV.read("/Path/to/Most-Recent-Cohorts-Scorecard-Elements.csv", rows_for_type_detect = 7175);

I got the following output:

212.333866 seconds (43.84 M allocations: 2.244 GiB, 0.25% gc time)

I inserted the option rows_for_type_detect = 7175 because otherwise I get an error message and cannot open the file. See this other question for why that might happen.
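If type detection is the bottleneck, another workaround (a sketch, assuming CSV.jl's `types` keyword and a hypothetical column name) is to declare the problematic columns' types up front, so the parser does not have to scan thousands of rows to guess them:

```julia
using CSV
using DataFrames

# Hypothetical example: "SOME_COLUMN" stands in for a column whose type
# the detector cannot resolve from the first few rows
df = CSV.read("/Path/to/Most-Recent-Cohorts-Scorecard-Elements.csv",
              types = Dict("SOME_COLUMN" => String))
```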

When doing the same operation in R with

start_time <- Sys.time()
df_try = read.csv("/Path/to/Most-Recent-Cohorts-Scorecard-Elements.csv")
end_time <- Sys.time()
end_time - start_time

I get the following output

Time difference of 0.3337972 secs

Is there a way to read large dataframes more efficiently in Julia?

MAJOR EDIT

As pointed out by @BogumiłKamiński, the difference between R and Julia for this particular task LARGELY DECREASES when using the newest versions of Julia and CSV.jl. So please take my message above (which I frankly hesitated to simply delete) with a significant grain of salt, and read Bogumił Kamiński's comment! And a big thank you to all the developers who give their free time to build and improve a wonderful language like Julia, for free!

EDIT N°2

Now, when performing

@time df = CSV.read(joinpath("/path/to/file.csv"))

here is the result:

0.184593 seconds (223.44 k allocations: 5.321 MiB)

Brilliant! Thank you @Bogumił Kamiński!
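The first-call/second-call gap mentioned in the comments comes from Julia's JIT compilation: the first call to `CSV.read` pays a one-off compilation cost. A simple way to see the steady-state parsing time (the path here is a placeholder) is to time the call twice:

```julia
using CSV
using DataFrames

@time df = CSV.read("/path/to/file.csv")  # includes one-off compilation cost
@time df = CSV.read("/path/to/file.csv")  # closer to the actual parsing time
```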

ecjb
  • I will answer here (as you have asked a similar question in the other thread). I have tested this file on my Julia and R installations. R takes 1.2 seconds, and Julia takes 2.3 seconds on the first call and 1.4 seconds on the second (the difference is due to precompilation), so they are comparable. CSV.jl has undergone significant improvements over the last few weeks. I recommend you wait until a new release of CSV.jl is tagged (probably in a few days) or use the master version of the package: `add CSV#master`. – Bogumił Kamiński Sep 24 '18 at 09:42
  • Dear @BogumiłKamiński, thank you very much for your answer. You're totally right: I actually ran the code above with an old version of `Julia`: `Julia-Pro-Juno-0.6.2.2`. When running the code with `Julia 1.0`, it takes 9.6 seconds on the first call and 1.4 seconds on the second. Sorry for the inconvenience. – ecjb Sep 24 '18 at 10:10

1 Answer


You should probably try using https://github.com/queryverse/CSVFiles.jl

This package provides load and save support for CSV files via the FileIO.jl package.

It's part of the Queryverse ecosystem.

using CSVFiles, DataFrames
fname = "/Path/to/Most-Recent-Cohorts-Scorecard-Elements.csv"
df = DataFrame(load(fname))

But as Bogumił Kamiński suggested, simply using the latest master version of CSV.jl may be enough to improve performance. Still, I think being aware of the fairly new Queryverse ecosystem is important for the Julia community.

If you have to load the same big file several times, you should consider another kind of format, because textual data is slow to parse. A binary format, possibly with compression, may be worth considering if fast loading is really important to you.

Maybe you should have a look at:
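One possibility is a minimal sketch using Julia's `Serialization` standard library (note that its format is not guaranteed to be stable across Julia versions, so it suits caching rather than long-term storage):

```julia
using DataFrames
using Serialization

df = DataFrame(a = 1:3, b = ["x", "y", "z"])

# Write the DataFrame once in Julia's native binary format...
serialize("df.bin", df)

# ...then reload it on later runs without re-parsing the CSV
df2 = deserialize("df.bin")
```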

scls
  • What's the relationship between CSV.jl and CSVFiles.jl? Which one is faster? Which one should I use if I want to stream a large dataset that doesn't fit in memory (for example, to later use it with OnlineStats)? – skan Nov 21 '18 at 23:58