
At the end of my Python script in a Jupyter notebook, I check all available dataframes that were saved. I use the following code:

# list of all DFs in memory
All_DFs = [var for var in dir() if isinstance(eval(var), pd.core.frame.DataFrame)]
All_DFs

Other than the dataframes I've created myself, I see the following dataframes that are unknown to me.

['_14',
 '_24',
 '_38',
 '_39',
 '_40',
 '_41',
 '_8',
 '__',
 '___']

I wonder why these are created, and I would like to know how to avoid having them created.

Thank you.

  • Read the docs, please. Use `;` at the end of a line to avoid caching its output. Use `%who` or `%whos` to see your variables. See `%xdel` as an alternative to `del`. Use `print` or `display` to avoid caching big dataframes. – Vitalizzare Aug 17 '22 at 19:38
  • Better question is why would you care about "all DFs in memory"? Collect the ones you care about yourself, and let other dfs do their job in the background. – matszwecja Aug 18 '22 at 07:43

1 Answer


Those are cell numbers where you ran code and got output you could show again, for example by putting _14 in a Jupyter cell. (Specifically, they are created as part of IPython/Jupyter's output caching system.) Without seeing what you ran in the pertinent cells, I'd be totally guessing as to why they are in that list.
One way I was able to reproduce the phenomenon was to put something like this on the final line of a cell:

pd.read_table(StringIO(input), header=0, index_col=None,  delim_whitespace=True)

Where I had basically defined a dataframe but not assigned it to a variable. I'm not saying that is what you did, but it was one way I could get something like what you saw and then test getting rid of such cases. Maybe there's mixed output in the cases you have and your evaluation code senses a dataframe in there? You'd have to provide much more information for specifics. (And because collecting dataframes like this after the fact is not advised [see below], I don't have experience encountering it myself.)
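If you want to watch the caching mechanism itself in action, here's a sketch that drives IPython's machinery directly from a plain Python script. This is just an illustration and assumes the IPython package is installed; none of it comes from the original question:

```python
import pandas as pd
from IPython.core.interactiveshell import InteractiveShell

shell = InteractiveShell.instance()
shell.run_cell("import pandas as pd")
# A cell that ends in a bare expression gets its result cached by the
# display hook under names like _1, _2, plus the shorthands _, __, ___
shell.run_cell("pd.DataFrame({'a': [1, 2]})")

cached = [name for name in shell.user_ns
          if name.startswith("_")
          and isinstance(shell.user_ns[name], pd.DataFrame)]
print(cached)  # '_' plus a numbered entry whose number matches the cell
```

Those cached names live in the same namespace as your own variables, which is why a dir()-based sweep picks them up.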

If everything else is fine with your notebook, I'd suggest you not bother trying to prevent them. You can instead filter them out, which accomplishes your goal of keeping your notebook code from saving them: simply remove the names that have an underscore as the first character. (That assumes you also have some cases where you assigned dataframes to variables that you just left out of what you've shown.)

Illustrative Example of the Suggestion

I'm going to build on my example code here to set up the problem and show how it can be remedied after the fact. Let's put the following code in a cell and execute it:

import pandas as pd
try:
    from StringIO import StringIO
except ImportError:
    from io import StringIO
input ='''
River_Level Rainfall
0.876       0.0
0.877       0.8
0.882       0.0
0.816       0.0
0.826       0.0
0.836       0.0
0.817       0.8
0.812       0.0
0.816       0.0
0.826       0.0
0.836       0.0
0.807       0.8
0.802       0.0
''' 
df_name_one = pd.read_table(StringIO(input), header=0, index_col=None,  delim_whitespace=True)
input ='''
River_Level Rainfall
0.976       0.1
0.977       0.5
0.982       0.0
0.916       0.3
0.926       0.0
0.996       9.0
0.917       0.8
0.912       0.0
0.916       0.0
0.926       0.1
0.836       0.0
0.907       0.6
0.902       0.0
''' 
df_name_two = pd.read_table(StringIO(input), header=0, index_col=None,  delim_whitespace=True)
pd.read_table(StringIO(input), header=0, index_col=None,  delim_whitespace=True)

And then in the next Jupyter cell, I run a variation on your code:

All_DFs = [var for var in dir() if isinstance(eval(var), pd.core.frame.DataFrame)]
All_DFs = [obj_name for obj_name in All_DFs if not obj_name.startswith("_")]
All_DFs

I'll just see the following output:

['df_name_one', 'df_name_two']

If instead I ran your original code, I'd see:

['_1', 'df_name_one', 'df_name_two']

That '_1' is similar to the examples in your list. You don't see such entries once you add the filter that removes names starting with an underscore.

(If you were doing the collecting from dir() in a for loop, incorporating the filter can be done simply by adding the line `if not obj_name.startswith("_"):` before the append line.)
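To spell that loop version out, here's a self-contained sketch; the dataframes are made-up stand-ins, and `_1` simulates a name the output cache would create:

```python
import pandas as pd

df_name_one = pd.DataFrame({"River_Level": [0.876, 0.877]})
df_name_two = pd.DataFrame({"River_Level": [0.976, 0.977]})
_1 = pd.DataFrame({"River_Level": [0.5]})  # stands in for a cached-output name

All_DFs = []
for obj_name in dir():
    if not obj_name.startswith("_"):  # filter out cached-output names like _1
        if isinstance(eval(obj_name), pd.DataFrame):
            All_DFs.append(obj_name)
print(All_DFs)  # ['df_name_one', 'df_name_two']
```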



Note that collecting dataframes this way is a bad idea, as matszwecja touched on in the comments here. The issue you stumbled upon in your notebook is a good example of why avoiding it is sage advice. Sure, with some hindsight and understanding you can filter the extras out; however, you could probably have easily designed in collecting what you need as you went along, giving clearer, more robust code that is more easily debugged during development.
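For example, rather than reconstructing the list from dir() afterward, you could register each dataframe in a dictionary as you make it. The names below are invented purely for illustration:

```python
import pandas as pd

dfs = {}  # explicit registry of the dataframes you care about
dfs["levels_aug"] = pd.DataFrame(
    {"River_Level": [0.876, 0.877], "Rainfall": [0.0, 0.8]})
dfs["levels_sep"] = pd.DataFrame(
    {"River_Level": [0.976, 0.977], "Rainfall": [0.1, 0.5]})

print(sorted(dfs))  # only names you chose; nothing from the output cache
```

With this pattern there is nothing to filter later, and each dataframe has a meaningful name you picked yourself.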

An aside on writing better questions and searching for answers yourself: the fact that you are running Python code in a Jupyter notebook was important here, yet your title says only 'Python script' and you tagged only 'Python' and 'Pandas'.

Wayne