
I'm an R user with a great interest in Julia. I don't have a computer science background. I just tried to read a CSV file in Juno with the following command:

using CSV
using DataFrames

df = CSV.read(joinpath(Pkg.dir("DataFrames"), 
"path/to/database.csv"));

and got the following error message:

CSV.CSVError("error parsing a 'Int64' value on column 26, row 289; encountered '.'")
in read at CSV/src/Source.jl:294
in #read#29 at CSV/src/Source.jl:299
in stream! at DataStreams/src/DataStreams.jl:145
in stream!#5 at DataStreams/src/DataStreams.jl:151
in stream! at DataStreams/src/DataStreams.jl:187
in streamto! at DataStreams/src/DataStreams.jl:173
in streamfrom at CSV/src/Source.jl:195
in parsefield at CSV/src/parsefield.jl:107
in parsefield at CSV/src/parsefield.jl:127
in checknullend at CSV/src/parsefield.jl:56

I looked at the entries indicated in the data frame: rows 287 and 288 contain 30 and 33 respectively (which seem to be of type Integer), and row 289 contains 30.445 (which is a float).

Is the problem that DataFrames was filling the column with Int values and stopped when it saw a Float?

Many thanks in advance

  • Would you make a [MCVE](https://stackoverflow.com/help/mcve)? I want to see it for myself. – rickhg12hs Sep 23 '18 at 22:13
  • Thank you very much for your answer @rickhg12hs. I just tried to create an MCVE, but it didn't reproduce the problem (there was no issue importing the data from the MCVE with the entry `30, 31, 32, 22.51234` from the csv). But the fact remains that this is the pattern at the row indicated by Julia. As it is a research database, I cannot put it on the internet, sorry. Accordingly, I removed the last phrase of the question. – ecjb Sep 23 '18 at 22:27
  • For example, `io = IOBuffer("1,2,3,4\n5,6,7,8\n9,10,11,12\n13,14.2,15,16");df=CSV.read(io, datarow=1)` works fine for me. – rickhg12hs Sep 23 '18 at 22:28
  • Out of curiosity, what happens if you try `df = CSV.read(joinpath(Pkg.dir("DataFrames"), "path/to/database.csv"), datarow=289)`? – rickhg12hs Sep 23 '18 at 22:34
  • There is the following error output: `function joinpath does not accept keyword arguments [1] kwfunc(::Any) at ./boot.jl:237 [2] eval(::Module, ::Any) at ./boot.jl:235 [3] eval(::Any) at ./boot.jl:234 [4] macro expansion at /Applications/JuliaPro-0.6.2.2.app/Contents/Resources/pkgs-0.6.2.2/v0.6/Atom/src/repl.jl:117 [inlined] [5] anonymous at ./:?` – ecjb Sep 23 '18 at 22:41
  • I think there's a `(` and/or `)` placement error. What I mean to try is `df = CSV.read(wherever/your/CSV/is, datarow=289)`. – rickhg12hs Sep 23 '18 at 22:46
  • Ok sorry. I got the following error message `ERROR: CSV.CSVError("error parsing a 'Int64' value on column 9, row 112; encountered 'N'")` – ecjb Sep 23 '18 at 22:54
  • Are you confident about this file's CSV format? Does `readdlm` show similar problems? – rickhg12hs Sep 23 '18 at 23:07
  • I'm confident that this file is in CSV format. I generated it after cleaning up a database in R with the command `write.csv()`, and I had no trouble re-opening it with `read.csv()` in R. Its size is 811 x 105, if that helps. I tried the command `readdlm("path/to/file.csv")` in Julia but got the following error message: `ERROR: unexpected character ',' after quoted field at row 1 column 1` – ecjb Sep 24 '18 at 06:15
  • I'd try with the new CSV.jl release tagged a few days ago, it should be much more robust regarding type detection. – Milan Bouchet-Valat Sep 28 '18 at 17:44

1 Answer


The problem is that the float value appears too late in the data set. By default CSV.jl uses a `rows_for_type_detect` value equal to 100, which means that only the first 100 rows are used to determine the type of each column in the output. Set the `rows_for_type_detect` keyword argument of `CSV.read` to e.g. 300 and all should work correctly.
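
For example, a minimal sketch of that fix applied to the original call (the path is a placeholder, and `rows_for_type_detect` is the keyword of the CSV.jl release discussed in this thread):

using CSV

# Placeholder path; scan the first 300 rows when inferring column types,
# so the float at row 289 is seen before the column type is fixed.
df = CSV.read("path/to/database.csv"; rows_for_type_detect = 300);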

Alternatively, you can pass the `types` keyword argument to set the column type manually (in this case `Float64` would be appropriate for this column).
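
A sketch of that alternative, assuming column 26 (the one named in the error message) is the offending column and that this CSV.jl version accepts the `Dict` form of `types`:

using CSV

# Hypothetical: override type detection for column 26 only, forcing Float64.
df = CSV.read("path/to/database.csv"; types = Dict(26 => Float64));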

Bogumił Kamiński
  • Thank you very much @BogumilKaminski. That solved the problem indeed. However, as the problem persisted after setting `rows_for_type_detect = 300`, I set it to 811 (the length of the database). This time it worked, but it took a good 26.8 seconds according to the command `@time df = CSV.read("path/tofile.csv")`. Isn't there a way to read a csv file more easily and efficiently (cf. the command `read.csv("path/tofile.csv")` of R, which takes 0.023 sec, 1000x faster, without having to specify a particular number of rows)? – ecjb Sep 24 '18 at 06:40