
When I'm running simulations, I like to initialize a big, empty array and fill it up as the simulation iterates through to the end. I do this with something like res = Array(Real,(n_iterations,n_parameters)). However, it would be nice to have named columns, which I think means using a DataFrame. Yet when I try something like res_df = convert(DataFrame,res), it throws an error. I would like a more concise approach than res_df = DataFrame(a=Array(Real,N),b=Array(Real,N),c=Array(Real,N),....), as suggested by the answers to "julia create an empty dataframe and append rows to it".

Will Townes

1 Answer


To preallocate a data frame, you must preallocate its columns. You can create three columns full of missing values with [fill(missing, 10000) for _ in 1:3], but that reserves no usable storage: those vectors have element type Missing, so they can hold only one value, missing, and cannot be changed to hold other values later. One way around this is to use Vector constructors whose element type can hold either Missing or Float64:

julia> DataFrame([Vector{Union{Missing, Float64}}(missing, 10000) for _ in 1:3], [:a, :b, :c])
10000×3 DataFrame
   Row │ a         b         c
       │ Float64?  Float64?  Float64?
───────┼──────────────────────────────
     1 │  missing   missing   missing
     2 │  missing   missing   missing
   ⋮   │    ⋮         ⋮         ⋮
 10000 │  missing   missing   missing
                     9997 rows omitted
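
The difference between the two kinds of column can be checked in plain Julia, without DataFrames. The following is a small sketch; the variable names v, w, and failed are made up for illustration:

```julia
# A vector created with fill(missing, n) has element type Missing,
# so it cannot later store a Float64.
v = fill(missing, 3)
@show eltype(v)            # Missing

# Trying to store a number fails; we only record that an error was thrown.
failed = try
    v[1] = 1.5
    false
catch
    true
end
@show failed

# A vector whose element type is Union{Missing, Float64} accepts both.
w = Vector{Union{Missing, Float64}}(missing, 3)
w[1] = 1.5
@show w
```

After this runs, w holds 1.5 in its first slot and missing in the other two, while v is unchanged.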

Note that this uses the concrete type Float64 rather than the abstract Real, which will give significantly better performance.
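
Once the columns are preallocated, a simulation can overwrite the rows in place. Here is a minimal sketch, assuming DataFrames v1.x; n_iterations, col_names, and simulate_step are hypothetical stand-ins for the real simulation:

```julia
using DataFrames

# Hypothetical sizes and column names, chosen only for illustration.
n_iterations = 5
col_names = [:a, :b, :c]

# Preallocate one Union{Missing, Float64} column per parameter.
res_df = DataFrame(
    [Vector{Union{Missing, Float64}}(missing, n_iterations) for _ in col_names],
    col_names,
)

# Stand-in for one simulation step; a real simulation would go here.
simulate_step(i) = (a = 1.0 * i, b = 2.0 * i, c = 3.0 * i)

# Overwrite each preallocated row in place.
for i in 1:n_iterations
    out = simulate_step(i)
    res_df[i, :a] = out.a
    res_df[i, :b] = out.b
    res_df[i, :c] = out.c
end
```

Any row that the loop never reaches stays missing, which makes an incomplete run easy to spot afterwards.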

(this answer was edited to reflect DataFrames v1.0 syntax)

mbauman
  • If you know that all your columns are of the same type and that there will never be unpopulated (`NA`) elements, there may be other data structures that you can use. Take a look at [NamedArrays.jl](https://github.com/davidavdav/NamedArrays.jl), or if you're willing to fly by the seat of your pants and working on the unstable 0.4, you can try my recent work-in-progress [AxisArrays.jl](https://github.com/mbauman/AxisArrays.jl). Both projects aim to more directly augment the built-in `Array` with dimension names and axis metadata, whereas DataFrames uses a collection-of-columns approach. – mbauman Feb 23 '15 at 20:01
  • This method is now deprecated. – Jake Ireland Oct 13 '21 at 10:26