
I am spinning up on Python and PySpark. While (slowly) working through the beginning of this tutorial, I found that help(rdd.toDF) was the right incantation to access the toDF documentation. Here, rdd is the variable name in the tutorial code for an object whose class includes the toDF method, which itself is shorthand for spark.createDataFrame. In turn, help(spark.createDataFrame) says that the method "Creates a :class:`DataFrame`".
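
For concreteness, here is a minimal sketch of the kind of setup such tutorials use; the data and the variable names are my own illustration, not the tutorial's:

>>> from pyspark.sql import SparkSession
>>> spark = SparkSession.builder.getOrCreate()
>>> rdd = spark.sparkContext.parallelize([(1, "a"), (2, "b")])
>>> help(rdd.toDF)               # the incantation that worked
>>> help(spark.createDataFrame)  # the method that toDF is shorthand for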

The problem is that the documentation doesn't specify how to prefix DataFrame for the help() function (the following do not work):

>>> help(DataFrame)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
NameError: name 'DataFrame' is not defined

>>> help(pyspark.sql.session.DataFrame)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
NameError: name 'pyspark' is not defined

>>> help(spark.DataFrame)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: 'SparkSession' object has no attribute 'DataFrame'

Does Python or Spark have a convention whereby the user can determine the prefixing needed to access the help() documentation for a class that is cited in the help() documentation?

The immediate use case is for the DataFrame class, but I was hoping that there was a way to always be able to figure out the required prefixing.

Troubleshooting

As per the comment under Myron_Ben4's answer, I tried searching for the class definition of DataFrame in the *.py files of the current Conda environment. Unfortunately, it yielded many ambiguous results.

# Go to the conda environment
cd /c/Users/User.Name/anaconda3/envs/py39

# Find *.py files and grep for definition of DataFrame class
find * -type f -name "*.py" \
   -exec grep -E "^\s*class\s+DataFrame\b" {} +

   Lib/site-packages/dask/dataframe/core.py:class DataFrame(_Frame):
   Lib/site-packages/pandas/core/frame.py:class DataFrame(NDFrame, OpsMixin):
   Lib/site-packages/pandas/core/interchange/dataframe_protocol.py:class DataFrame(ABC):
   Lib/site-packages/panel/pane/markup.py:class DataFrame(HTML):
   Lib/site-packages/panel/widgets/tables.py:class DataFrame(BaseTable):
   Lib/site-packages/param/__init__.py:class DataFrame(ClassSelector):
   Lib/site-packages/pyspark/pandas/frame.py:class DataFrame(Frame, Generic[T]):
   Lib/site-packages/pyspark/sql/connect/dataframe.py:class DataFrame:
   Lib/site-packages/pyspark/sql/dataframe.py:class DataFrame(PandasMapOpsMixin, PandasConversionMixin):

I tried to look at the context of each occurrence using Vim's vimgrep:

cd /c/Users/User.Name/anaconda3/envs/py39 " The Conda environment

" This command is minimal code, but there are too many files to
" search in the Conda environment:
"
"    vimgrep '^\s*class\s\+DataFrame\>' **/*.py

" This command lets me navigate through the occurrences in the
" above files
vimgrep /^\s*class\s\+DataFrame\>/
\ Lib/site-packages/dask/dataframe/core.py
\ Lib/site-packages/pandas/core/frame.py
\ Lib/site-packages/pandas/core/interchange/dataframe_protocol.py
\ Lib/site-packages/panel/pane/markup.py
\ Lib/site-packages/panel/widgets/tables.py
\ Lib/site-packages/param/__init__.py
\ Lib/site-packages/pyspark/pandas/frame.py
\ Lib/site-packages/pyspark/sql/connect/dataframe.py
\ Lib/site-packages/pyspark/sql/dataframe.py

Since the Vim command line won't accept continued lines, I had to put the VimScript code in a file and issue :source #59, where 59 is the Vim buffer number of the file (use whatever the buffer number is for your situation). This allowed me to surf through the various definitions of the DataFrame class within the context of their *.py files.

From this examination, however, I could see no obvious indication of which class definition is the one in effect in the context of my question. I am guessing that it is the last one, and that I can read its documentation. In general, however, this just shows that an unambiguous way is needed to access the correct help content.
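
One way to check that guess is to ask a live object which class and file it comes from. The following is a sketch assuming the same PySpark session as above; df is a hypothetical object created just for the test, and the printed path is elided:

>>> import inspect
>>> df = spark.createDataFrame([(1, "a")], ["id", "val"])
>>> type(df).__module__           # module defining the class in effect
'pyspark.sql.dataframe'
>>> inspect.getfile(type(df))     # file containing that definition
'.../Lib/site-packages/pyspark/sql/dataframe.py'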

Work-around

Having just read up on Python's type hinting/annotation, I see that one work-around is to look at the return-type annotation for createDataFrame:

>>> help(spark.createDataFrame)
Help on method createDataFrame in module pyspark.sql.session:

createDataFrame(data: Union[pyspark.rdd.RDD[Any], Iterable[Any],
ForwardRef('PandasDataFrameLike'), ForwardRef('ArrayLike')],
schema: Union[pyspark.sql.types.AtomicType,
pyspark.sql.types.StructType, str, NoneType] = None,
samplingRatio: Optional[float] = None, verifySchema: bool = True)
-> pyspark.sql.dataframe.DataFrame method of
pyspark.sql.session.SparkSession instance

    Creates a :class:`DataFrame` from an :class:`RDD`, a list, a
    :class:`pandas.DataFrame` or a :class:`numpy.ndarray`.

The annotation -> pyspark.sql.dataframe.DataFrame indicates what to submit to help(). This is only useful, of course, if the annotation is present. Note that the name pyspark must first be bound, e.g., via import pyspark.sql.dataframe; otherwise help() raises the NameError seen earlier.
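
If one would rather extract the annotation programmatically than read it off the help() text, inspect.signature offers one possibility. This is a sketch assuming the same live session; the exact repr may vary across versions:

>>> import inspect
>>> sig = inspect.signature(spark.createDataFrame)
>>> sig.return_annotation
<class 'pyspark.sql.dataframe.DataFrame'>
>>> help(sig.return_annotation)   # no dotted "path" needed at all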

>>> help(pyspark.sql.dataframe.DataFrame)
Help on class DataFrame in module pyspark.sql.dataframe:

class DataFrame(pyspark.sql.pandas.map_ops.PandasMapOpsMixin,
pyspark.sql.pandas.conversion.PandasConversionMixin)
 | DataFrame(jdf: py4j.java_gateway.JavaObject, sql_ctx:
 | Union[ForwardRef('SQLContext'), ForwardRef('SparkSession')])
 |
 | A distributed collection of data grouped into named columns.

The only confusing thing about the above help documentation is that it opens with what appears to be a constructor, whose signature differs from the one in /c/Users/User.Name/anaconda3/envs/py39/Lib/site-packages/pyspark/sql/dataframe.py:

class DataFrame(PandasMapOpsMixin, PandasConversionMixin):
    <...snip...>
    def __init__(
        self,
        jdf: JavaObject,
        sql_ctx: Union["SQLContext", "SparkSession"],
    ):

This Q&A helps with understanding that the discrepancy is due to the presence of ForwardRef in the help() output. It doesn't actually describe the keyword ForwardRef, however, and I couldn't find a good description of it online. This Python documentation says that "class typing.ForwardRef...[is] used for internal typing representation of string forward references". It is not clear why it appears explicitly in the help(pyspark.sql.dataframe.DataFrame) output as sql_ctx: Union[ForwardRef('SQLContext'), ForwardRef('SparkSession')], i.e., instead of just sql_ctx: Union["SQLContext", "SparkSession"] as written in dataframe.py. It also isn't clear whether the ForwardRef from help(pyspark.sql.dataframe.DataFrame) is the same as the class typing.ForwardRef in the aforementioned Python documentation.
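
A quick experiment suggests an answer to both puzzles: subscripting typing.Union with string arguments wraps each string in a typing.ForwardRef when the annotation is constructed, so by the time help() renders the signature, the strings are already ForwardRef objects, and they are indeed instances of the documented typing.ForwardRef class:

>>> import typing
>>> t = typing.Union["SQLContext", "SparkSession"]
>>> t
typing.Union[ForwardRef('SQLContext'), ForwardRef('SparkSession')]
>>> type(t.__args__[0]) is typing.ForwardRef
True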

An alternative to relying on the above documentation from help() is to find the help page online. The createDataFrame page links to DataFrame, which provides the full "path" pyspark.sql.DataFrame.

user2153235

1 Answer


To read the manual of a module, class, or method with help() in Python, you need to import it first, since you must pass it as an argument to the help() function.

As far as I know, in Python it is not possible to retrieve the top-level module of a class/method… without prior knowledge. The import system doesn't provide a straightforward way to determine it.
For instance, if you want to read the guide for scikit-learn's PolynomialFeatures class, but you don't know that it is part of the sklearn module, and you are not able to check online where it comes from...you simply cannot use it.

But once you know the top-level module(s) you can always do:

from sklearn import preprocessing
help(preprocessing.PolynomialFeatures)

or...

from sklearn.preprocessing import PolynomialFeatures
help(PolynomialFeatures)

However, you can still try to find an IDE that, with certain plugins installed, is capable of identifying and suggesting missing imports based on code context.

Myron_Ben4
  • Thanks, Myron. It's really a shame that one has to know the entire sequence of dot-separated identifiers before one can use the help. For PySpark, one can have classes within classes within modules within subpackages within packages. Classes can be nested to multiple levels, as can subpackages. Furthermore, [sub]package `__init__.py` files import things under different names, sometimes using the `from` clause, which changes the prefixing "path" required for help. – user2153235 Aug 14 '23 at 04:16
  • Learning an entire IDE just to get help support might be worth it, but I'm spinning up on so many different fronts right now that I think I'll rely on Google to find the right prefix path. Failing that, I might use Cygwin's `find` command to find the packages/modules/classes/methods. It's far from ideal. – user2153235 Aug 14 '23 at 04:16
  • Thanks @user2153235 for the reply! Actually, it is not a bad idea to exploit Linux... you can use a command like `find . -type f -name "*.py" -exec grep -El "class\s+SparkSession" {} +` to get the full path from your current directory. – Myron_Ben4 Aug 14 '23 at 08:26
  • Agreed that one can use Linux file navigation, search, text processing, etc. to work around the problem, but it just seems so barbaric and not efficient in terms of cognitive capacity and time. That's why I thought that there must be a better way, and that it was only my newbiness that kept me from using smarter ways. – user2153235 Aug 14 '23 at 13:52
  • I tried the `find`+`[vim]grep` approach, but unfortunately, it is far from ideal. Many definitions of the same class are found. :( I documented this in the *Troubleshooting* section of my question. – user2153235 Aug 28 '23 at 07:06
  • Hi again! Your effort clearly deserves my upvote! ...I still need to read your considerations another 5 times to really catch your point, though! =) – Myron_Ben4 Aug 30 '23 at 07:22
  • Thanks, Myron_Ben4. What were the points that were obscured? How can I make them clearer? – user2153235 Aug 30 '23 at 14:21