
So basically, I'm trying to transform a list into a DataFrame.

Here are the two ways of doing it that I am trying, but I cannot come up with a good performance benchmark.

import pandas as pd

mylist = [1,2,3,4,5,6]
names = ["name","name","name","name","name","name"]

# Way 1
pd.DataFrame([mylist], columns=names)

# Way 2
pd.DataFrame.from_records([mylist], columns=names)

I also tried dask but I did not find anything that could work for me.
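A minimal way to compare the two constructors is `timeit` (a sketch; the iteration count is arbitrary, and on inputs this small the difference is mostly constructor overhead):

```python
import timeit

import pandas as pd

mylist = [1, 2, 3, 4, 5, 6]
names = ["name"] * 6

# Time both constructors on the same tiny input
t_ctor = timeit.timeit(
    lambda: pd.DataFrame([mylist], columns=names), number=1_000
)
t_records = timeit.timeit(
    lambda: pd.DataFrame.from_records([mylist], columns=names), number=1_000
)

print(f"pd.DataFrame:              {t_ctor:.4f}s")
print(f"pd.DataFrame.from_records: {t_records:.4f}s")
```

Note that at this scale the numbers mostly reflect fixed per-call overhead, not how either approach scales to large data.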

Guillem
  • Calling `pd.DataFrame` is marginally faster. What is the scale of data here? Why is performance so important? – cs95 Jan 25 '19 at 10:48
  • Hmm, like millions of columns. I am looking for bottlenecks in my code and trying to make all the code faster overall. – Guillem Jan 25 '19 at 10:54
  • If you are trying to optimize performance of `pd.DataFrame` I see struggles down the line.. possibly Pandas as a framework will not be efficient for your needs. Profile your code to confirm it's *actually* your bottleneck. – jpp Jan 25 '19 at 11:15
  • I had similar issues, and I would often try to replace all strings with ints using .cat or creating a dict. Again as jpp said pandas isn't the program for big data, save it as a csv and dump it into a SQL dB, clean it and then export the data you need for analysis. – Umar.H Jan 25 '19 at 12:08
  • Also, I'm no expert, but would declaring the dtypes speed this up? Especially if you have datetimes, declaring the format is much better for pandas. – Umar.H Jan 25 '19 at 12:09
  • Thanks! I will try this approach! – Guillem Jan 28 '19 at 10:01

1 Answer


So I just made up an example with 10 columns of 1 million random values each, and I got the maximum result very quickly. Does this maybe give you a start for working with dask? They proposed the approach here, which is also related to this question.

import dask.dataframe as dd
from dask.delayed import delayed
import pandas as pd
import numpy as np

# Create a list of arrays of random floats (scaled per array)
list_large = [np.random.random_sample(int(1e6)) * i for i in range(10)]

# Convert it to a dask dataframe, one delayed pandas frame per array
dfs = [delayed(pd.DataFrame)(i) for i in list_large]
df = dd.from_delayed(dfs)

# Calculate the maximum (avoid shadowing the built-in `max`)
max_values = df.max().compute()
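For a sanity check, the same per-column maximum can be computed in plain pandas as a baseline (a sketch with the array sizes scaled down and a seeded generator; the names here are illustrative, not from the answer above):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Same shape of input as above, just smaller: 10 arrays of random floats
list_large = [rng.random(int(1e4)) * i for i in range(10)]

# Plain-pandas baseline: one frame per array, concatenated, then max per column
df = pd.concat([pd.DataFrame(arr) for arr in list_large], ignore_index=True)
baseline_max = df.max()
print(baseline_max)
```

If the dask and pandas results agree, the delayed construction is working as intended; the dask version only starts paying off once the arrays no longer fit comfortably in memory.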
Moritz