MethodError when trying to get a row from an Arrow Dataframe in Julia

Question

I have a dataset that looks like this:

I am reading a CSV file, converting it to Parquet, and then writing it to Arrow (there is a reason why I am doing it this way). My goal is to access the data in the row for "Algeria". This is my code:

using CSV, DataFrames, Parquet, Arrow, Statistics

df = CSV.read("temp.csv", DataFrame)
write_parquet("data_file.parquet", df)
df = DataFrame(read_parquet("data_file.parquet"))
Arrow.write("data_file.arrow", df)
df = DataFrame(Arrow.Table("data_file.arrow"))

dates = names(df)[5:end]
countries = unique(df[:, :"Country/Region"])

algeria = df[df."Country/Region" .== "Algeria", 4:end]
# Print(sum(eachcol(algeria)))
Print(Statistics.mean(eachcol(algeria)))

But the last part, which tries to retrieve the data from Arrow, throws this error:

MethodError: no method matching +(::Float64, ::String)

Closest candidates are:

+(::Any, ::Any, !Matched::Any, !Matched::Any...) at operators.jl:538

+(::Float64, !Matched::Float64) at float.jl:401

+(!Matched::ChainRulesCore.One, ::Any) at /home/onur/.julia/packages/ChainRulesCore/7d1hl/src/differential_arithmetic.jl:94

What am I doing wrong?

This is what I get when I type in "Algeria" to the REPL

Update: Implementation of Gabriel's suggestion:

begin
    algeria = df[df."Country/Region" .== "Algeria", 4:end]
    
    for i = 1:size(algeria, 2)
        if eltype(algeria[!, i]) == String
            algeria[!, i] = parse.(Float64, algeria[!, i])
        end
    end
    
    Statistics.mean(eachcol(algeria))
end

This is the error:

MethodError: no method matching +(::Float64, ::String)

Closest candidates are:

+(::Any, ::Any, !Matched::Any, !Matched::Any...) at operators.jl:538

+(::Float64, !Matched::Float64) at float.jl:401

+(!Matched::ChainRulesCore.One, ::Any) at /home/onur/.julia/packages/ChainRulesCore/7d1hl/src/differential_arithmetic.jl:94

Please remove the `begin` `end` blocks, which are specific to Pluto and make it unnecessarily hard for others to read your code. – Przemyslaw Szufel, Mar 17 '21 at 21:39
Can you show us what is output when you type `algeria` into the REPL? – Gabriel Hassler, Mar 18 '21 at 00:20
@oo92 Sorry for not being clearer, I was hoping to see what the type of each column in `algeria` is, which is usually output in the REPL (although apparently not in your particular editor). For `mean` to work, all elements in `algeria` should be of type `Float64`. Try this: `all(eltype.(df[!, i] for i = 1:size(df, 2)) .== Float64)` to see if it returns `true`. If not, figure out which columns are of the wrong type with `findall(eltype.(df[!, i] for i = 1:size(df, 2)) .!= Float64)` and use some version of `parse(Float64, x)` to convert them to the right type. – Gabriel Hassler, Mar 18 '21 at 00:59

score 3 · Accepted Answer · answered Mar 17 '21 at 21:38

You need to broadcast `mean` over the columns (note the dot in `mean.`); see the code below:

julia> df = DataFrame(a=1:3, b=1.5:1:3.5)
3×2 DataFrame
 Row │ a      b
     │ Int64  Float64
─────┼────────────────
   1 │     1      1.5
   2 │     2      2.5
   3 │     3      3.5

julia> Statistics.mean.(eachcol(df))
2-element Vector{Float64}:
 2.0
 2.5
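
As a rough illustration of why the unbroadcast call produces an error about `+`, here is a minimal sketch on toy data (the column names `a`, `b`, and `c` are made up and are not from the question's dataset). Without the dot, `mean` averages the column vectors element by element, so a `String` column ends up being added to a `Float64` column:

```julia
using DataFrames, Statistics

# Toy data; the column names `a`, `b`, and `c` are made up for illustration.
df = DataFrame(a = [1.0, 2.0, 3.0], b = [1.5, 2.5, 3.5])

# Broadcast form: one mean per column.
col_means = mean.(eachcol(df))      # [2.0, 2.5]

# Unbroadcast form: averages the column vectors element by element,
# which amounts to one mean per row.
row_means = mean(eachcol(df))       # [1.25, 2.25, 3.25]

# With a String column in the mix, the unbroadcast form has to add a
# Float64 to a String and fails.
df.c = ["1", "2", "3"]
err = try
    mean(eachcol(df))
catch e
    e
end
println(typeof(err))                # MethodError
```

The broadcast form would also fail on the `String` column, but with an error raised for that single column, which makes the offending column easier to spot.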

Przemyslaw Szufel

Sure but that's not what I'm stuck with – Onur-Andros Ozbek Mar 17 '21 at 21:41
You wrote "the last chunk" and this was the code there and it was incorrect with regard to vectorization. Perhaps you could in that case do some MWE to show more clearly what you need :-) – Przemyslaw Szufel Mar 17 '21 at 21:45
This is what is throwing the error mentioned: `algeria = df[df."Country/Region" .== "Algeria", 4:end]` – Onur-Andros Ozbek Mar 17 '21 at 21:46
There is no addition (`+`) involved here. Maybe the commented-out line was the reason? – Przemyslaw Szufel Mar 17 '21 at 21:57
Nope. Just removed that and makes no difference. – Onur-Andros Ozbek Mar 17 '21 at 22:07
Btw, your code doesn't make much of a difference. It turns out it is my `Print(Statistics.mean(eachcol(algeria)))` that was throwing the error. I was trying to calculate the mean of all values in Algeria but I was getting the error above. – Onur-Andros Ozbek Mar 17 '21 at 23:29
I do not believe it is possible to say anything more from my side without a MWE. You can calculate means for rows by writing: `mean(Matrix(df),dims=2)` – Przemyslaw Szufel Mar 17 '21 at 23:31
Why are you using df? That's the entire dataframe. I am only trying to calculate a single row from that df – Onur-Andros Ozbek Mar 17 '21 at 23:36

score 0 · Answer 2 · answered Mar 18 '21 at 01:06

So it looks like one of the columns in `algeria` contains strings rather than floating-point numbers.

Try doing this before calculating the mean:

for i = 1:size(algeria, 2)
    if eltype(algeria[!, i]) == String
        algeria[!, i] = parse.(Float64, algeria[!, i])
    end
end
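
One possible reason a check like `eltype(...) == String` might not fire is that the column's element type is `Union{Missing, String}` or some other `AbstractString` subtype rather than exactly `String`. Whether that is the case here is an assumption; the question does not show the column types. The sketch below uses a toy stand-in for `algeria` with made-up column names and values, prints the element types first, and then converts any string-like column:

```julia
using DataFrames, Statistics

# Toy stand-in for `algeria`; the column names and values are made up.
algeria = DataFrame(
    "1/22/20" => ["0", "1"],        # eltype String
    "1/23/20" => ["2", missing],    # eltype Union{Missing, String}
    "1/24/20" => [4.0, 6.0],        # already Float64
)

# Inspect the element type of every column before converting.
println(eltype.(eachcol(algeria)))

for name in names(algeria)
    col = algeria[!, name]
    # Matches String, Union{Missing, String}, and other AbstractString eltypes.
    if nonmissingtype(eltype(col)) <: AbstractString
        algeria[!, name] = [ismissing(x) ? missing : parse(Float64, x) for x in col]
    end
end

# One mean per column, ignoring missing values.
col_means = [mean(skipmissing(col)) for col in eachcol(algeria)]   # [0.5, 2.0, 5.0]
```

A column whose element type is `Any` would not be matched by this check and would need to be inspected value by value.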

Gabriel Hassler

Check the edits. I tried to implement your logic and I got the same error. – Onur-Andros Ozbek Mar 18 '21 at 02:08
Have you tried `findall(eltype.(df[!, i] for i = 1:size(df, 2)) .!= Float64)` to see if there are columns of the wrong type? Ultimately, `mean` is trying to add a `Float64` to a `String`, so there must be a `String` somewhere. – Gabriel Hassler Mar 18 '21 at 03:18
It just returns a list of integers from 1 to n where n is the size of the row – Onur-Andros Ozbek Mar 18 '21 at 23:56