Given column names and column types like these:

col_names = ["A", "B", "C"]
col_types = ["String", "Int64", "Bool"]

I want to create an empty DataFrame like this:

desired_DF = DataFrame(A = String[], B = Int64[], C = Bool[]) # but I cannot write out every column name and type by hand each time

How do I do this?

I am looking for either a code snippet that does this or, if you like the solution I've copied below, an explanation of how it works.

I've seen a solution here. It works, but I do not understand it, especially the third line, in particular the semicolon at the beginning and the three dots at the end.

col_names = [:A, :B] # needs to be a vector of Symbols
col_types = [Int64, Float64]
# Create a NamedTuple (A=Int64[], ....) by doing
named_tuple = (; zip(col_names, type[] for type in col_types )...)

df = DataFrame(named_tuple) # 0×2 DataFrame

Also, is there perhaps an even more elegant way to do this?

  • 1
    Maybe change the *Does not work* to `desired_DF = DataFrame(A = String[], B = Int64[], C = Bool[])` – GKi Jul 20 '23 at 07:35

1 Answer

Let us start with the input:

julia> col_names = ["A", "B", "C"]
3-element Vector{String}:
 "A"
 "B"
 "C"

julia> col_types = [String, Int64, Bool]
3-element Vector{DataType}:
 String
 Int64
 Bool

Note the difference: col_types needs to contain types, not strings. col_names is fine the way you proposed.
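If the type names arrive as strings, as in the question, they have to be mapped to types first. Here is a minimal sketch of one way to do that, assuming the set of allowed types is known in advance. The names type_lookup and col_type_names are made up for this illustration and are not part of any library:

```julia
# Hypothetical lookup table: only the types listed here are accepted.
type_lookup = Dict("String" => String, "Int64" => Int64, "Bool" => Bool)

# Type names given as strings, as in the question.
col_type_names = ["String", "Int64", "Bool"]

# Translate every name into the corresponding type;
# an unknown name raises a KeyError instead of being guessed.
col_types = [type_lookup[name] for name in col_type_names]
```

An explicit table keeps the conversion predictable, since a misspelled or unexpected type name fails immediately.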

Now there are many ways to solve your problem. Let me show you the simplest one in my opinion:

First, create a vector of vectors that will be columns of your data frame:

julia> [T[] for T in col_types]
3-element Vector{Vector}:
 String[]
 Int64[]
 Bool[]

Now you just need to pass it to the DataFrame constructor, where this vector of vectors is the first argument and the column names are the second argument:

julia> DataFrame([T[] for T in col_types], col_names)
0×3 DataFrame
 Row │ A       B      C
     │ String  Int64  Bool
─────┴─────────────────────

and you are done.

If you do not have column names, you can generate them automatically by passing :auto as the second argument:

julia> DataFrame([T[] for T in col_types], :auto)
0×3 DataFrame
 Row │ x1      x2     x3
     │ String  Int64  Bool
─────┴─────────────────────

This is a simple way to get what you want.
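The steps above could be wrapped in a small helper, for example. This is only a sketch: the function name empty_dataframe and the sample row are assumptions made for illustration, and it requires the DataFrames.jl package to be installed:

```julia
using DataFrames

# Hypothetical helper: builds a 0-row data frame from names and types.
empty_dataframe(col_names, col_types) =
    DataFrame([T[] for T in col_types], col_names)

df = empty_dataframe(["A", "B", "C"], [String, Int64, Bool])

# The columns are typed, so rows can be appended later;
# the values must match the column types in order.
push!(df, ("x", 1, true))
```

Because each column keeps its element type, pushing a row whose values cannot be converted to those types throws an error rather than silently widening the column.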


Now let us decompose the approach you mentioned above:

(; zip(col_names, type[] for type in col_types )...)

To understand it you need to know how keyword arguments can be passed to functions. See this:

julia> f(; kwargs...) = kwargs
f (generic function with 1 method)

julia> f(; [(:a, 10), (:b, 20), (:c, 30)]...)
pairs(::NamedTuple) with 3 entries:
  :a => 10
  :b => 20
  :c => 30

Now, the expression from your example:

(; zip(col_names, type[] for type in col_types )...)

uses exactly this trick. Since you do not pass a function name, a NamedTuple is created (this is how Julia syntax works). The zip part just produces the tuples of names and values, like the vector of tuples in my example call above. Note that the names must be Symbols here, as in the snippet you quoted, not strings:

julia> collect(zip(col_names, type[] for type in col_types ))
3-element Vector{Tuple{Symbol, Vector}}:
 (:A, String[])
 (:B, Int64[])
 (:C, Bool[])

So the example is the same as passing:

julia> (; [(:A, String[]), (:B, Int64[]), (:C, Bool[])]...)
(A = String[], B = Int64[], C = Bool[])

Which is, given what we have said, the same as passing:

julia> (; :A => String[], :B => Int64[], :C => Bool[])
(A = String[], B = Int64[], C = Bool[])

Which is, in turn, the same as just writing:

julia> (; A = String[], B = Int64[], C = Bool[])
(A = String[], B = Int64[], C = Bool[])

So this is how and why the example you quoted works. However, I believe that what I propose is simpler.
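If you prefer the NamedTuple approach but have the column names as strings, one option is to convert them to Symbols first. A minimal sketch, assuming DataFrames.jl is installed. The variable name nt is made up here, and the generator is wrapped in parentheses only for readability:

```julia
using DataFrames

col_names = ["A", "B", "C"]
col_types = [String, Int64, Bool]

# Symbol.(col_names) broadcasts Symbol over the strings, giving [:A, :B, :C].
# Splatting the (name, empty vector) tuples after `;` builds the NamedTuple.
nt = (; zip(Symbol.(col_names), (T[] for T in col_types))...)

df = DataFrame(nt)
```

This produces the same empty data frame as the constructor call with the vector of vectors, just by a longer route.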

Bogumił Kamiński
  • 1
    Would `col_names = ["A", "B", "C"]; col_types = [String[], Int64[], Bool[]]; DataFrame(col_types, col_names)` be an alternative option? – GKi Jul 20 '23 at 07:30
  • 2
    yes, but then it is not `col_types` but just `cols` (since it is actual columns and not their types only). – Bogumił Kamiński Jul 20 '23 at 08:40