I am spinning up on Python and PySpark. While (slowly) working through the beginning of this tutorial, I found that help(rdd.toDF) was the right incantation to access the toDF documentation. Here, rdd is the variable name in the tutorial code for an object whose class includes the toDF method, which itself is shorthand for spark.createDataFrame. In turn, help(spark.createDataFrame) says that the method "Creates a :class:`DataFrame`".
The problem is that the documentation doesn't specify how to prefix DataFrame for the help() function (the following do not work):
>>> help(DataFrame)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
NameError: name 'DataFrame' is not defined
>>> help(pyspark.sql.session.DataFrame)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
NameError: name 'pyspark' is not defined
>>> help(spark.DataFrame)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: 'SparkSession' object has no attribute 'DataFrame'
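For completeness: importing the name first does make help() work, but that just moves the problem, since it presupposes already knowing the very module path I am trying to discover:

>>> from pyspark.sql import DataFrame
>>> help(DataFrame)
Help on class DataFrame in module pyspark.sql.dataframe:
<...snip...>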
Does Python or Spark have a convention whereby the user can determine the prefixing needed to access the help() documentation for a class that is cited in the help() documentation? The immediate use case is the DataFrame class, but I was hoping that there is a way to always be able to figure out the required prefixing.
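One general-purpose probe I am aware of works when an instance is at hand: type() returns the class object itself, which can be passed straight to help() with no prefix, and its __module__ and __qualname__ attributes spell out the full prefix. A minimal sketch, assuming the tutorial's rdd and a live SparkSession:

>>> df = rdd.toDF()    # any instance of the class in question
>>> type(df)           # the class object itself
<class 'pyspark.sql.dataframe.DataFrame'>
>>> f"{type(df).__module__}.{type(df).__qualname__}"
'pyspark.sql.dataframe.DataFrame'
>>> help(type(df))     # same as help(pyspark.sql.dataframe.DataFrame)

This doesn't help, though, for a class that is only cited in a docstring with no instance available, which is the general case the question is about.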
Troubleshooting
As per the comment under Myron_Ben4's answer, I tried searching for the class definition of DataFrame in the *.py files of the current Conda environment. Unfortunately, it yields many ambiguous results.
# Go to the conda environment
cd /c/Users/User.Name/anaconda3/envs/py39
# Find *.py files and grep for definition of DataFrame class
find * -type f -name "*.py" \
    -exec grep -E "^\s*class\s+DataFrame\b" {} +
Lib/site-packages/dask/dataframe/core.py:class DataFrame(_Frame):
Lib/site-packages/pandas/core/frame.py:class DataFrame(NDFrame, OpsMixin):
Lib/site-packages/pandas/core/interchange/dataframe_protocol.py:class DataFrame(ABC):
Lib/site-packages/panel/pane/markup.py:class DataFrame(HTML):
Lib/site-packages/panel/widgets/tables.py:class DataFrame(BaseTable):
Lib/site-packages/param/__init__.py:class DataFrame(ClassSelector):
Lib/site-packages/pyspark/pandas/frame.py:class DataFrame(Frame, Generic[T]):
Lib/site-packages/pyspark/sql/connect/dataframe.py:class DataFrame:
Lib/site-packages/pyspark/sql/dataframe.py:class DataFrame(PandasMapOpsMixin, PandasConversionMixin):
I tried to look at the context of each occurrence using Vim's :vimgrep command:
cd /c/Users/User.Name/anaconda3/envs/py39 " The Conda environment
" This command is minimal code, but there are too many files to
" search in the Conda environment:
"
" vimgrep '^\s*class\s\+DataFrame\>' **/*.py
" This command lets me navigate through the occurrences in the
" above files
vimgrep /^\s*class\s\+DataFrame\>/
\ Lib/site-packages/dask/dataframe/core.py
\ Lib/site-packages/pandas/core/frame.py
\ Lib/site-packages/pandas/core/interchange/dataframe_protocol.py
\ Lib/site-packages/panel/pane/markup.py
\ Lib/site-packages/panel/widgets/tables.py
\ Lib/site-packages/param/__init__.py
\ Lib/site-packages/pyspark/pandas/frame.py
\ Lib/site-packages/pyspark/sql/connect/dataframe.py
\ Lib/site-packages/pyspark/sql/dataframe.py
Since the Vim command line won't accept continued lines, I had to put the VimScript code in a file and issue :source #59, where 59 is the Vim buffer number of the file (use whatever the buffer number is for your situation). This allowed me to surf through the various definitions of the DataFrame class within the context of their *.py files.
From this examination, however, I could see no obvious indication of which class definition is the one in effect in the context of my question. I am guessing that it is the last one, whose documentation I can then read. In general, though, this just shows that an unambiguous way to reach the correct help content is needed.
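A more surgical probe than grepping the whole environment, assuming an instance is available, is to ask the inspect module which file defines the class actually in effect. A sketch, again using a hypothetical df from the tutorial's rdd.toDF():

>>> import inspect
>>> df = rdd.toDF()
>>> inspect.getsourcefile(type(df))  # the path will vary with the environment
'C:\\Users\\User.Name\\anaconda3\\envs\\py39\\Lib\\site-packages\\pyspark\\sql\\dataframe.py'

This would single out Lib/site-packages/pyspark/sql/dataframe.py from the nine grep hits above.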
Work-around
Having just read up on Python's type hinting/annotation, I see that one work-around is to look at the return-type annotation for createDataFrame:
>>> help(spark.createDataFrame)
Help on method createDataFrame in module pyspark.sql.session:
createDataFrame(data: Union[pyspark.rdd.RDD[Any], Iterable[Any],
ForwardRef('PandasDataFrameLike'), ForwardRef('ArrayLike')],
schema: Union[pyspark.sql.types.AtomicType,
pyspark.sql.types.StructType, str, NoneType] = None,
samplingRatio: Optional[float] = None, verifySchema: bool = True)
-> pyspark.sql.dataframe.DataFrame method of
pyspark.sql.session.SparkSession instance
Creates a :class:`DataFrame` from an :class:`RDD`, a list, a
:class:`pandas.DataFrame` or a :class:`numpy.ndarray`.
The annotation -> pyspark.sql.dataframe.DataFrame indicates what to submit to help(). This is only useful, of course, if the method has a return annotation.
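The return annotation can also be pulled out programmatically rather than read off the help() text. A sketch, assuming a live SparkSession named spark as in the tutorial; here the annotation happens to be an already-resolved class object, so it can be handed straight to help():

>>> import inspect
>>> sig = inspect.signature(spark.createDataFrame)
>>> sig.return_annotation
<class 'pyspark.sql.dataframe.DataFrame'>
>>> help(sig.return_annotation)  # same page as help(pyspark.sql.dataframe.DataFrame)

If the annotation were instead a bare string (a forward reference), return_annotation would be that string, and an extra resolution step would be needed.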
>>> help(pyspark.sql.dataframe.DataFrame)
Help on class DataFrame in module pyspark.sql.dataframe:
class DataFrame(pyspark.sql.pandas.map_ops.PandasMapOpsMixin,
pyspark.sql.pandas.conversion.PandasConversionMixin)
| DataFrame(jdf: py4j.java_gateway.JavaObject, sql_ctx:
| Union[ForwardRef('SQLContext'), ForwardRef('SparkSession')])
|
| A distributed collection of data grouped into named columns.
The only confusing thing about the above help documentation is that it opens with what appears to be a constructor, the signature of which differs from that in /c/Users/User.Name/anaconda3/envs/py39/Lib/site-packages/pyspark/sql/dataframe.py:
class DataFrame(PandasMapOpsMixin, PandasConversionMixin):
<...snip...>
def __init__(
self,
jdf: JavaObject,
sql_ctx: Union["SQLContext", "SparkSession"],
):
This Q&A helps with
understanding the discrepancy due to presence of ForwardRef
in the
help()
output. It doesn't actually describe the keyword
ForwardRef
, but I couldn't find a good description of the latter
online. This Python
documentation
says that "class typing.ForwardRef...[is] used for internal typing
representation of string forward references". It is not clear why it
needs to be explicit in the help(pyspark.sql.dataframe.DataFrame)
output sql_ctx: Union[ForwardRef('SQLContext'), ForwardRef('SparkSession')]
,
i.e., instead of just sql_ctx: Union["SQLContext", "SparkSession"]
,
as shown in dataframe.py
. It also isn't clear whether the ForwardRef
from help(pyspark.sql.dataframe.DataFrame)
is the same as the
class typing.ForwardRef
in the aforementioned Python documentation.
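A quick REPL experiment suggests it is indeed the same class: the typing module wraps string arguments to Union in typing.ForwardRef objects, and the repr of the result is exactly the form that help() displays. A minimal sketch, independent of PySpark:

>>> from typing import Union
>>> Union["SQLContext", "SparkSession"]
typing.Union[ForwardRef('SQLContext'), ForwardRef('SparkSession')]

So the ForwardRef in the help() output appears to be just how the string annotations from dataframe.py print once typing has processed them.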
An alternative to relying on the above documentation from help() is to find the help page online. The createDataFrame page links to DataFrame, which provides the full "path" pyspark.sql.DataFrame.
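Note that the online "path" pyspark.sql.DataFrame and the annotation's pyspark.sql.dataframe.DataFrame appear to name one and the same class; pyspark.sql evidently re-exports it, which a quick identity check confirms:

>>> import pyspark.sql
>>> pyspark.sql.DataFrame is pyspark.sql.dataframe.DataFrame
True

so either spelling can be passed to help().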