
So basically, I'm trying to transform a list into a DataFrame.

Here are the two ways of doing it that I am trying, but I cannot come up with a good performance benchmark.

import pandas as pd

mylist = [1,2,3,4,5,6]
names = ["name","name","name","name","name","name"]

# Way 1
pd.DataFrame([mylist], columns=names)

# Way 2
pd.DataFrame.from_records([mylist], columns=names)

I also tried dask but I did not find anything that could work for me.
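A minimal way to compare the two constructors is `timeit` (a sketch; the iteration count is arbitrary, and on inputs this small the difference is mostly constructor overhead):

```python
import timeit

import pandas as pd

mylist = [1, 2, 3, 4, 5, 6]
names = ["name"] * 6

# Time both constructors on the same tiny input
t_ctor = timeit.timeit(
    lambda: pd.DataFrame([mylist], columns=names), number=1_000
)
t_records = timeit.timeit(
    lambda: pd.DataFrame.from_records([mylist], columns=names), number=1_000
)

print(f"pd.DataFrame:              {t_ctor:.4f}s")
print(f"pd.DataFrame.from_records: {t_records:.4f}s")
```

Note that at this scale the numbers mostly reflect fixed per-call overhead, not how either approach scales to large data.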

Guillem
  • Calling `pd.DataFrame` is marginally faster. What is the scale of data here? Why is performance so important? – cs95 Jan 25 '19 at 10:48
  • Hmm, like millions of columns. I am looking for bottlenecks in my code and trying to make all the code faster overall. – Guillem Jan 25 '19 at 10:54
  • If you are trying to optimize performance of `pd.DataFrame` I see struggles down the line.. possibly Pandas as a framework will not be efficient for your needs. Profile your code to confirm it's *actually* your bottleneck. – jpp Jan 25 '19 at 11:15
  • I had similar issues, and I would often try to replace all strings with ints using .cat or creating a dict. Again as jpp said pandas isn't the program for big data, save it as a csv and dump it into a SQL dB, clean it and then export the data you need for analysis. – Umar.H Jan 25 '19 at 12:08
  • Also, I'm no expert, but would declaring the dtypes speed this up? Especially if you have datetimes, declaring the format is much better for pandas. – Umar.H Jan 25 '19 at 12:09
  • Thanks! I will try this approach! – Guillem Jan 28 '19 at 10:01

1 Answer


So I just made up an example with 10 columns of 1 million random values each, and I got the maximum result very quickly. Does this maybe give you a start for working with dask? They proposed the approach here, which is also related to this question.

import dask.dataframe as dd
from dask.delayed import delayed
import pandas as pd
import numpy as np

# Create a list of arrays of random floats (scaled per array)
list_large = [np.random.random_sample(int(1e6)) * i for i in range(10)]

# Convert it to a dask dataframe, one delayed pandas frame per array
dfs = [delayed(pd.DataFrame)(i) for i in list_large]
df = dd.from_delayed(dfs)

# Calculate the maximum (avoid shadowing the built-in `max`)
max_values = df.max().compute()
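For a sanity check, the same per-column maximum can be computed in plain pandas as a baseline (a sketch with the array sizes scaled down and a seeded generator; the names here are illustrative, not from the answer above):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Same shape of input as above, just smaller: 10 arrays of random floats
list_large = [rng.random(int(1e4)) * i for i in range(10)]

# Plain-pandas baseline: one frame per array, concatenated, then max per column
df = pd.concat([pd.DataFrame(arr) for arr in list_large], ignore_index=True)
baseline_max = df.max()
print(baseline_max)
```

If the dask and pandas results agree, the delayed construction is working as intended; the dask version only starts paying off once the arrays no longer fit comfortably in memory.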
Moritz