6

Using a DataFrame in Julia, I want to select rows on the basis of the value taken in a column.

With the following example

using DataFrames, DataFramesMeta
DT = DataFrame(ID = [1, 1, 2,2,3,3, 4,4], x1 = rand(8))

I want to extract the rows with ID taking the values 1 and 4. For the moment, I came out with that solution.

@where(DT, findall(x -> (x==4 || x==1), DT.ID))

When using only two values, it is manageable.

However, I want to make it applicable to a case with many rows and a large set of value for the ID to be selected. Therefore, this solution is unrealistic if I need to write down all the value to be selected

Any fancier solution to make this selection generic?

Damien

djourd1
  • 459
  • 4
  • 14

1 Answers1

8

Here is a way to do it using standard DataFrames.jl indexing and using @where from DataFramesMeta.jl:

julia> DT
8×2 DataFrame
│ Row │ ID    │ x1        │
│     │ Int64 │ Float64   │
├─────┼───────┼───────────┤
│ 1   │ 1     │ 0.433397  │
│ 2   │ 1     │ 0.963775  │
│ 3   │ 2     │ 0.365919  │
│ 4   │ 2     │ 0.325169  │
│ 5   │ 3     │ 0.0495252 │
│ 6   │ 3     │ 0.637568  │
│ 7   │ 4     │ 0.391051  │
│ 8   │ 4     │ 0.436209  │

julia> DT[in([1,4]).(DT.ID), :]
4×2 DataFrame
│ Row │ ID    │ x1       │
│     │ Int64 │ Float64  │
├─────┼───────┼──────────┤
│ 1   │ 1     │ 0.433397 │
│ 2   │ 1     │ 0.963775 │
│ 3   │ 4     │ 0.391051 │
│ 4   │ 4     │ 0.436209 │

julia> @where(DT, in([1,4]).(:ID))
4×2 DataFrame
│ Row │ ID    │ x1       │
│     │ Int64 │ Float64  │
├─────┼───────┼──────────┤
│ 1   │ 1     │ 0.433397 │
│ 2   │ 1     │ 0.963775 │
│ 3   │ 4     │ 0.391051 │
│ 4   │ 4     │ 0.436209 │

In non performance critical code you can also use filter, which is - at least for me a bit simpler to digest (but it has a drawback, that it is slower than the methods discussed above):

julia> filter(row -> row.ID in [1,4], DT)
4×2 DataFrame
│ Row │ ID    │ x1       │
│     │ Int64 │ Float64  │
├─────┼───────┼──────────┤
│ 1   │ 1     │ 0.433397 │
│ 2   │ 1     │ 0.963775 │
│ 3   │ 4     │ 0.391051 │
│ 4   │ 4     │ 0.436209 │

Note that in the approach you mention in your question you could omit DT in front of ID like this:

julia> @where(DT, findall(x -> (x==4 || x==1), :ID))
4×2 DataFrame
│ Row │ ID    │ x1       │
│     │ Int64 │ Float64  │
├─────┼───────┼──────────┤
│ 1   │ 1     │ 0.433397 │
│ 2   │ 1     │ 0.963775 │
│ 3   │ 4     │ 0.391051 │
│ 4   │ 4     │ 0.436209 │

(this is a beauty of DataFramesMeta.jl that it knows the context of the DataFrame you want to refer to)

Bogumił Kamiński
  • 66,844
  • 3
  • 80
  • 107
  • Thank you for these different options. Is it me or the help documentation for "in" does not mention this use ? – djourd1 Oct 04 '19 at 07:12
  • 1
    It mentions it by saying "`in(x)` Create a function that checks whether its argument is in `x`, i.e. a function equivalent to `y -> y in x`.". But I agree that it is non obvious that next you can use it in broadcasting. Soon a documentation of DataFrames.jl will be updated with such an example, see https://github.com/JuliaData/DataFrames.jl/pull/1971. – Bogumił Kamiński Oct 04 '19 at 08:03