1

Suppose I have this code:

import pandas as pd

mylist = [item for item in range(100000)]
df = pd.DataFrame()
df["col1"] = mylist

Is the data in mylist copied when it is assigned to df["col1"] ? If so, is there a way to avoid this copy?

Edit: My list in this case is a list of strings. One thing I am getting from these answers is that if I instead create a NumPy array of these strings, no data duplication will occur when I call df["col1"] = mynparray?

mmnormyle

2 Answers

1

When you assign your list to a series, a new NumPy array is created. This data structure permits vectorised computations for numeric types. Such series are laid out in contiguous memory blocks. See "Why NumPy instead of Python lists?" for more details.

Therefore, you will need enough memory to hold duplicate data. This is unavoidable. There is no way to "convert" a list into a Pandas series in place.
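
As a rough check of the claim above, the following sketch (using the question's own `mylist` and `df`) mutates the list after the assignment and then looks at the column. It only shows that the column does not follow changes to the list; it is not a measurement of memory use.

```python
import numpy as np
import pandas as pd

mylist = list(range(100000))
df = pd.DataFrame()
df["col1"] = mylist

# Change the list after the assignment. If the column were backed by the
# list itself, this change would be visible through the DataFrame.
mylist[0] = -1

first_value = df["col1"].iloc[0]
print(first_value)  # 0: the column kept the original value

# The column's values are exposed as a NumPy array, not as a Python list.
print(type(df["col1"].to_numpy()))
```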

Note: the above does not relate to what happens when you assign a NumPy array to a series.

jpp
  • But the obvious question for me then is whether the values were in an array to begin with and whether the values in the df are the same object. I'm just about to test it, but I think that might be the heart of the question. – roganjosh Jul 13 '18 at 17:19
  • The values in the list and the values in the series can't be the same object. They can't be because they must be laid out in very different memory layouts. For one, there are no pointers in a `float` or `int` array, while there are in a Python list. – jpp Jul 13 '18 at 17:21
  • Seems not true: `a = np.arange(10)`; `df = pd.DataFrame(a)`; `a[0] = 15` – roganjosh Jul 13 '18 at 17:22
  • But `a` is not a list here ! – jpp Jul 13 '18 at 17:22
  • As my original comment says "in an array to begin with". I will accept in the case of a list :) – roganjosh Jul 13 '18 at 17:23
  • @roganjosh, OK, I get you now. Except OP's code + description indicates they mean list. An array is another question :). – jpp Jul 13 '18 at 17:23
  • Yeah, I cheated and gave a poorly-worded extension of the question sorry. But I was actually curious myself whether a numpy array remains the same object once it goes into a df. Your answer is correct but it might be worth mentioning? – roganjosh Jul 13 '18 at 17:27
  • @roganjosh, Yup, I've updated the answer while we mull over your question. – jpp Jul 13 '18 at 17:35
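
For the NumPy array case raised in the comments, one way to check on your own installation is `np.shares_memory`. This is only a sketch: whether the result is `True` or `False` depends on the pandas version and its copy settings, so no expected output is given, and the names `df_from_array` and `df_assigned` are made up for illustration.

```python
import numpy as np
import pandas as pd

a = np.arange(10)

# Case 1: pass the array to the DataFrame constructor, as in the comment above.
df_from_array = pd.DataFrame(a)
shared_ctor = np.shares_memory(a, df_from_array[0].to_numpy())

# Case 2: assign the array as a new column, as in the question.
df_assigned = pd.DataFrame()
df_assigned["col1"] = a
shared_assign = np.shares_memory(a, df_assigned["col1"].to_numpy())

# True means the DataFrame column and the array use the same buffer.
print("constructor shares memory:", shared_ctor)
print("column assignment shares memory:", shared_assign)
```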
0

Just a thought: if memory is a concern, can you delete the list after creating the DataFrame?

import pandas as pd
mylist = [item for item in range(100000)]
df = pd.Series(mylist).to_frame()
del mylist
gregV