
I have a Jupyter Notebook. I know it's not optimal for large projects, but in many circumstances it is the tool I have to use.

After some computations, I end up with several pandas DataFrames in memory that I would like to pickle. So I do

df_name.to_pickle(filename)

However, I wanted to create a list of all DataFrame using

 df_list = %who DataFrame

And then I wanted to do something like

for varname in df_list:
    varname.to_pickle(f'{varname}.pickle')

This of course doesn't work because varname is a string, not a DataFrame object with the associated .to_pickle method.

So my stupid question is, what's the best way to access the actual object varname and not just the string with its name?

Note: If I create a list of the actual DataFrame, these are quite big objects in memory, so I will probably run into memory issues.

Thanks

phollox
  • "Note: If I create a list of the actual `DataFrame`, these are quite big objects in memory, so I will probably run into memory issues." That's completely untrue. List stores only references, so it takes very small amount of additional memory to store all your dataframes in a list. Trying to do it like you described is a **really bad idea**. – matszwecja Jul 25 '22 at 09:36
  • Thanks for the heads up. Besides manually, how can I create such list? `%who DataFrame` is not an option – phollox Jul 25 '22 at 09:50
  • Manually is the most reasonable way to do it. Add your dataframes to a list when defining them, so you know exactly what this list contains. – matszwecja Jul 25 '22 at 10:48
  • Hello @matszwecja. Could you post this as an answer so I can credit you? Fill the `df_list` manually because it doesn't have memory issues, etc. Thanks for the help – phollox Jul 26 '22 at 11:14
  • "So my stupid question is, what's the best way to access the actual object varname and not just the string with it's name?" the best way is not to try to dynamically access variables at all, instead, your code should have organized your data into some sort of container (e.g. a `list`, or a `dict`) to begin with. – juanpa.arrivillaga Jul 26 '22 at 19:19
  • [The issue seen here by this OP](https://stackoverflow.com/q/73386782/8508004) is a good example of why the Jul 25 comment by @matszwecja is sound advice. – Wayne Aug 17 '22 at 18:23

1 Answer


As @matszwecja pointed out in the comments, the most reasonable way is to collect the dataframes as you make them. That will also be clearest to you and others later, and it is more robust and easier to debug as you develop the code.
However, you seemed to be thinking more abstractly about iterating over the dataframes in the kernel's namespace, and it is possible to do that and pickle them all automatically. It's just not that easy. For example, you already found that you cannot make a usable list with df_list = %who DataFrame. (%who prints the names in the output cell but does not return them in a form Python can use.)

Here's an option that would work if you really did want to do it. This first part sets up some dummy dataframes and then builds a list of one-item dictionaries, each mapping a dataframe's name to the dataframe:

import pandas as pd
from io import StringIO
input ='''
River_Level Rainfall
0.876       0.0
0.877       0.8
0.882       0.0
0.816       0.0
0.826       0.0
0.836       0.0
0.817       0.8
0.812       0.0
0.816       0.0
0.826       0.0
0.836       0.0
0.807       0.8
0.802       0.0
''' 
df_name_one = pd.read_table(StringIO(input), header=0, index_col=None, sep=r'\s+')
input ='''
River_Level Rainfall
0.976       0.1
0.977       0.5
0.982       0.0
0.916       0.3
0.926       0.0
0.996       9.0
0.917       0.8
0.912       0.0
0.916       0.0
0.926       0.1
0.836       0.0
0.907       0.6
0.902       0.0
''' 
df_name_two = pd.read_table(StringIO(input), header=0, index_col=None, sep=r'\s+')
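list_of_dfs_dicts = []
# dir() returns the names defined in the notebook's namespace as strings
for obj_name in dir():
    # eval() turns each name back into the object so its type can be checked
    if isinstance(eval(obj_name), pd.DataFrame):
        # store the name together with a reference to the dataframe
        list_of_dfs_dicts.append({obj_name: eval(obj_name)})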

Now each entry in the list is a one-item dictionary mapping the name of a dataframe to the dataframe itself. That can be iterated on and pickled via a single line in a notebook:

[df.to_pickle(f'{varname}.pkl') for d in list_of_dfs_dicts for varname,df in d.items()];

That actually equates to this, which is easier to read:

for d in list_of_dfs_dicts:
    for varname,df in d.items():
        df.to_pickle(f'{varname}.pkl')

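For this self-contained answer, I stored each dataframe object alongside its name in the collected list of dictionaries. Memory wasn't a concern with these small dataframes (and, as noted in the comments, a list or dictionary holds only references to the objects, not copies), and I wanted to illustrate things in small steps.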
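To make the point from the comments concrete, here is a minimal sketch showing that a list of dataframes holds references rather than copies. The names big_one and big_two and their contents are made up for illustration; they are not from your notebook.

```python
import sys

import pandas as pd

# Hypothetical "big" dataframes, for illustration only.
big_one = pd.DataFrame({"value": range(100_000)})
big_two = pd.DataFrame({"value": range(100_000)})

df_list = [big_one, big_two]

# Size of the list object itself: a small header plus one pointer per entry.
list_bytes = sys.getsizeof(df_list)

# Memory actually held by the dataframes the list points to.
data_bytes = sum(int(df.memory_usage(deep=True).sum()) for df in df_list)

# The list entry is the very same object as the original variable, not a copy.
same_object = df_list[0] is big_one

print(f"list: {list_bytes} bytes, dataframes: {data_bytes} bytes")
```

The list itself should come out far smaller than the data it refers to, so keeping dataframes in a list should not add a meaningful amount of memory.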

However, memory was a concern of yours. You can vary the collection step so that only the names, not the dataframe objects, are added to the list, like so:

import pandas as pd
from io import StringIO
input ='''
River_Level Rainfall
0.876       0.0
0.877       0.8
0.882       0.0
0.816       0.0
0.826       0.0
0.836       0.0
0.817       0.8
0.812       0.0
0.816       0.0
0.826       0.0
0.836       0.0
0.807       0.8
0.802       0.0
''' 
df_name_one = pd.read_table(StringIO(input), header=0, index_col=None, sep=r'\s+')
input ='''
River_Level Rainfall
0.976       0.1
0.977       0.5
0.982       0.0
0.916       0.3
0.926       0.0
0.996       9.0
0.917       0.8
0.912       0.0
0.916       0.0
0.926       0.1
0.836       0.0
0.907       0.6
0.902       0.0
''' 
df_name_two = pd.read_table(StringIO(input), header=0, index_col=None, sep=r'\s+')
df_list = []
for obj_name in dir():
    # keep only the name; the dataframe is looked up again when pickling
    if isinstance(eval(obj_name), pd.DataFrame):
        df_list.append(obj_name)
for df_name in df_list:
    eval(df_name).to_pickle(f'{df_name}.pkl')

Bear in mind, though, that eval() is something to use with care. In particular, it opens the door to code injection.
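If you want the same namespace scan without eval(), one possibility is to look the names up in a dictionary such as the one returned by globals(). The following is only a sketch: the helper name dataframe_names is made up, and the hand-built namespace dictionary stands in for your notebook's globals() so the example is self-contained.

```python
import tempfile
from pathlib import Path

import pandas as pd


def dataframe_names(namespace):
    """Return the sorted names in `namespace` that are bound to a DataFrame."""
    return sorted(
        name
        for name, obj in list(namespace.items())
        # IPython keeps output-cache names such as _ and _3, which can point
        # at the same dataframes, so names starting with "_" are skipped.
        if isinstance(obj, pd.DataFrame) and not name.startswith("_")
    )


# Hypothetical stand-in for the notebook namespace; in a notebook you would
# pass globals() instead of building a dictionary by hand.
namespace = {
    "df_name_one": pd.DataFrame({"River_Level": [0.876, 0.877], "Rainfall": [0.0, 0.8]}),
    "df_name_two": pd.DataFrame({"River_Level": [0.976, 0.977], "Rainfall": [0.1, 0.5]}),
    "threshold": 0.85,  # not a dataframe, so it is ignored
}

found = dataframe_names(namespace)

out_dir = Path(tempfile.mkdtemp())  # temporary folder, just for this sketch
for df_name in found:
    # A plain dictionary lookup replaces eval(df_name).
    namespace[df_name].to_pickle(out_dir / f"{df_name}.pkl")
```

In a notebook, dataframe_names(globals()) would give the list of names, and globals()[df_name] would give each object.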
And by scanning the namespace this way, you aren't checking what gets pickled. For example, while developing you could erroneously make a lot of dataframes at some point, and if those were still in your kernel's namespace, they'd ALL get pickled by the pickling step. That's why collecting what you want as you go along is more practical and more robust in the long run. I just thought your idea of using df_list = %who DataFrame was intriguing.
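For comparison, the collect-as-you-go approach recommended in the comments could look roughly like this. It is a sketch under assumptions: the dictionary name frames, the keys, and the tiny dataframes are invented for illustration, and the files go to a temporary folder rather than wherever you would actually want them.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Register each dataframe under a chosen name at the moment it is created.
# The dictionary holds references to the dataframes, not copies.
frames = {}
frames["rivers_one"] = pd.DataFrame({"River_Level": [0.876, 0.877], "Rainfall": [0.0, 0.8]})
frames["rivers_two"] = pd.DataFrame({"River_Level": [0.976, 0.977], "Rainfall": [0.1, 0.5]})

# Temporary folder for this sketch; in a notebook you would pick your own.
out_dir = Path(tempfile.mkdtemp())

# Only what was registered explicitly gets pickled.
for name, df in frames.items():
    df.to_pickle(out_dir / f"{name}.pkl")

# Read one file back to confirm the round trip.
restored = pd.read_pickle(out_dir / "rivers_one.pkl")
```

Because the dictionary is filled explicitly, stray dataframes left over in the kernel never end up on disk.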

Wayne