2

I have a dataset that looks like this: [image: screenshot of the dataset]

I am reading a CSV file, converting it to Parquet, and then writing it to Arrow; there is a reason for doing it this way. My goal is to access the data in the "Algeria" row. This is my code:

df = CSV.read("temp.csv", DataFrame)
write_parquet("data_file.parquet", df)
df = DataFrame(read_parquet("data_file.parquet"))
Arrow.write("data_file.arrow", df)
df = DataFrame(Arrow.Table("data_file.arrow"))

dates = names(df)[5:end]
countries = unique(df[:, :"Country/Region"])

algeria = df[df."Country/Region" .== "Algeria", 4:end]
# println(sum(eachcol(algeria)))
println(Statistics.mean(eachcol(algeria)))

But the last part, which tries to retrieve the data from Arrow, throws this error:

MethodError: no method matching +(::Float64, ::String)

Closest candidates are:

+(::Any, ::Any, !Matched::Any, !Matched::Any...) at operators.jl:538

+(::Float64, !Matched::Float64) at float.jl:401

+(!Matched::ChainRulesCore.One, ::Any) at /home/onur/.julia/packages/ChainRulesCore/7d1hl/src/differential_arithmetic.jl:94

What am I doing wrong?

This is what I get when I type `algeria` into the REPL:

[image: screenshot of the REPL output for `algeria`]

Update: Implementation of Gabriel's suggestion:

begin
    algeria = df[df."Country/Region" .== "Algeria", 4:end]
    
    for i = 1:size(algeria, 2)
        if eltype(algeria[!, i]) == String
            algeria[!, i] = parse.(Float64, algeria[!, i])
        end
    end
    
    Statistics.mean(eachcol(algeria))
end

This is the error:

MethodError: no method matching +(::Float64, ::String)

Closest candidates are:

+(::Any, ::Any, !Matched::Any, !Matched::Any...) at operators.jl:538

+(::Float64, !Matched::Float64) at float.jl:401

+(!Matched::ChainRulesCore.One, ::Any) at /home/onur/.julia/packages/ChainRulesCore/7d1hl/src/differential_arithmetic.jl:94
Onur-Andros Ozbek
  • Please remove the `begin` ... `end` blocks, which are specific to Pluto and make your code unnecessarily hard for others to read. – Przemyslaw Szufel Mar 17 '21 at 21:39
  • Can you show us what is output when you type `algeria` into the REPL? – Gabriel Hassler Mar 18 '21 at 00:20
  • @GabrielHassler Check the edits – Onur-Andros Ozbek Mar 18 '21 at 00:46
  • @oo92 Sorry for not being clearer. I was hoping to see the type of each column in `algeria`, which the REPL usually prints (although apparently not in your particular editor). For `mean` to work, all elements in `algeria` should be of type `Float64`. Try `all(eltype.(df[!, i] for i = 1:size(df, 2)) .== Float64)` to see whether it returns `true`. If not, find the columns of the wrong type with `findall(eltype.(df[!, i] for i = 1:size(df, 2)) .!= Float64)` and use some version of `parse(Float64, x)` to convert them to the right type. – Gabriel Hassler Mar 18 '21 at 00:59

2 Answers

3

You need to broadcast `mean` over the columns with the dot syntax (`mean.`), so that it is applied to each column separately. See the code below:

julia> df = DataFrame(a=1:3, b=1.5:1:3.5)
3×2 DataFrame
 Row │ a      b
     │ Int64  Float64
─────┼────────────────
   1 │     1      1.5
   2 │     2      2.5
   3 │     3      3.5

julia> Statistics.mean.(eachcol(df))
2-element Vector{Float64}:
 2.0
 2.5
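
To see why the dot matters, here is a minimal sketch (the data frame and variable names are made up for illustration). Without the dot, `mean` appears to treat the columns as a collection of vectors and average them elementwise, which is where a `+` between a `Float64` and a `String` can come from if one column holds strings. With the dot, each column gets its own mean.

```julia
using DataFrames, Statistics

# Illustrative data only: two numeric columns, three rows
df = DataFrame(a = [1.0, 2.0, 3.0], b = [1.5, 2.5, 3.5])

# No dot: the columns themselves are averaged together,
# so the result has one entry per row
row_means = mean(eachcol(df))

# With the dot: `mean` is broadcast over the columns,
# so the result has one entry per column
col_means = mean.(eachcol(df))

println(row_means)
println(col_means)
```

Here `row_means` has three entries (one per row) while `col_means` has two (one per column), so the two calls answer different questions.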
Przemyslaw Szufel
0

So it looks like at least one of the columns in `algeria` contains strings rather than floating-point numbers.

Try doing this before calculating the mean:

for i = 1:size(algeria, 2)
    if eltype(algeria[!, i]) == String
        algeria[!, i] = parse.(Float64, algeria[!, i])
    end
end
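
If the loop above converts nothing, one possible cause is that the element type is not exactly `String`. After a Parquet or Arrow round trip it might be `Union{Missing, String}` or another `AbstractString` subtype; that is an assumption here, not something confirmed by the question. The sketch below uses a made-up stand-in for `algeria` with hypothetical column names and checks the type more loosely:

```julia
using DataFrames, Statistics

# Hypothetical stand-in for the `algeria` slice; real column names and values will differ
algeria = DataFrame(
    "Long"    => [1.0, 3.0],
    "1/22/20" => ["0", "2"],
    "1/23/20" => Union{Missing, String}["3", "5"],
)

for name in names(algeria)
    col = algeria[!, name]
    # nonmissingtype strips Missing from Union{Missing, T}, and <: also
    # matches string types other than String
    if nonmissingtype(eltype(col)) <: AbstractString
        # Parse each entry, passing missing values through unchanged
        algeria[!, name] = [ismissing(x) ? missing : parse(Float64, x) for x in col]
    end
end

# One mean per column, now that every column is numeric
col_means = mean.(eachcol(algeria))
println(col_means)
```

Running `eltype.(eachcol(algeria))` before and after the loop is a quick way to confirm which columns were actually converted.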