I'm creating a Pandas DataFrame in which each (index, column) location can hold a numpy ndarray of arbitrary shape, or even a simple number.

This works:

import numpy as np, pandas as pd
x = pd.DataFrame([[np.random.rand(100, 100, 20, 2), 3], [2, 2], [3, 3], [4, 4]],
                 index=['A1', 'B2', 'C3', 'D4'], columns=['data', 'data2'])
print(x)

but takes 50 seconds to create on my computer! Why?

`np.random.rand(100, 100, 20, 2)` alone is super fast (< 1 second to create).

How to speed up the creation of Pandas datasets containing ndarrays of various shapes?

Basj
  • When a pandas DataFrame is a homogeneous type, the whole thing can be a single numpy array. When you create a list like this where the columns are heterogeneous, pandas has to do a bunch of bookkeeping and reformatting to keep track of the different datatypes. – Tim Roberts Jun 23 '22 at 23:25
  • Yes probably @TimRoberts but here I only have ~400 000 coefficients to store in the dataframe. 50 seconds for this is really problematic! Is there an easy fix here? – Basj Jun 23 '22 at 23:27
  • It's not the creation taking time, it's the `print`. The creation is pretty much instantaneous on my computer, as is `print(x['data2'])`. But `print(x['data'])` takes about 15 seconds – Nick Jun 23 '22 at 23:41
  • Oh you're right @Nick, solved! You can post as an answer! – Basj Jun 23 '22 at 23:42
  • In fact `print(x['data']['A1'])` and `print(x['data']['B2'])` are likewise super fast. So I guess `print` is just having trouble putting together elements of vastly different size. A bug perhaps? – Nick Jun 23 '22 at 23:43

1 Answer

The creation isn't actually the problem; the `print` call is. On my computer, 1000 loops of the creation take 2.8 seconds, but a single `print` takes about 26 seconds.
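One way to check this on your own machine is to time the two steps separately. The following is only a rough sketch: the smaller `shape` is my assumption so that it runs quickly (swap in `(100, 100, 20, 2)` to reproduce the original case), and `repr(x)` stands in for `print(x)` since printing a DataFrame builds that same string.

```python
import time

import numpy as np
import pandas as pd

# Assumption: a smaller array than in the question, so the sketch finishes quickly.
shape = (20, 20, 5, 2)

# Step 1: time only the DataFrame construction.
start = time.perf_counter()
x = pd.DataFrame([[np.random.rand(*shape), 3], [2, 2], [3, 3], [4, 4]],
                 index=['A1', 'B2', 'C3', 'D4'], columns=['data', 'data2'])
creation_seconds = time.perf_counter() - start

# Step 2: time only the string formatting that print(x) would trigger.
start = time.perf_counter()
text = repr(x)
repr_seconds = time.perf_counter() - start

print(f"creation: {creation_seconds:.4f} s, formatting: {repr_seconds:.4f} s")
```

The absolute numbers will vary with the machine and the pandas version; the point is only to see which of the two steps dominates.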

Interestingly, `print(x['data2'])`, `print(x['data']['A1'])` and `print(x['data']['B2'])` are all basically instantaneous. So it seems `print` has trouble working out how to display items of vastly different size. Perhaps a bug?
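If you only need an overview of the DataFrame, one possible workaround (a sketch, not a fix for the underlying slowness) is to print a summarized copy in which each ndarray is replaced by a short label. The `summarize` helper and the `summary` name are hypothetical, introduced here just for illustration.

```python
import numpy as np
import pandas as pd


def summarize(value):
    """Hypothetical helper: label ndarrays by their shape, leave other values as-is."""
    if isinstance(value, np.ndarray):
        return f"ndarray{value.shape}"
    return value


x = pd.DataFrame([[np.random.rand(100, 100, 20, 2), 3], [2, 2], [3, 3], [4, 4]],
                 index=['A1', 'B2', 'C3', 'D4'], columns=['data', 'data2'])

# Build a lightweight copy column by column; the original x is left untouched.
summary = pd.DataFrame({col: x[col].map(summarize) for col in x.columns})
print(summary)
```

The full arrays are still available through `x`, for example `x['data']['A1']`, which prints quickly on its own.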

Nick