
This is the PySpark DataFrame:

[screenshot: the DataFrame and its schema — just two rows]

Then I want to convert it to a pandas DataFrame.

But it gets stuck at stage 3, with no result and no progress information. Why can this happen?


And when I use pandas_api, the result is the same.

Why could this happen? It has bothered me the whole day.

Could anyone help me?

This is the package version.

[screenshot: package versions]

2 Answers


Try using this in the first cell of the notebook:

```python
import findspark

findspark.init()
findspark.find()
```

This will initialize Spark in the Jupyter notebook.

  • Thank you. I tried it, but it doesn't work. – Sparrow Jack Aug 24 '23 at 06:01

After trying again and again, I found the reason: although Spark runs in local mode, the source directory contains several parquet files. I needed to convert the DataFrame to an RDD, coalesce it into one partition, and then convert the RDD back into a PySpark DataFrame. After that, pandas_api works fine. I hope this answer helps someone who runs into the same problem.


  • Please read [Why should I not upload images of code/data/errors?](https://meta.stackoverflow.com/q/285551/354577). Instead, format code as a [code block]. The easiest way to do this is to paste the code as text directly into your question, then select it and click the code block button. – ChrisGPT was on strike Aug 27 '23 at 12:55