
This is the PySpark DataFrame:

[screenshot: the DataFrame and its schema — just two rows]

Then I want to convert it to a pandas DataFrame.

But it gets stuck at stage 3, with no result and no progress information. Why can this happen?


And when I use pandas_api, the result is the same.

Why could this happen? It has bothered me the whole day.

Could anyone help me?

This is the package version.

[screenshot: package versions]

2 Answers


Try using this in the first cell of the notebook:

```python
import findspark

findspark.init()
findspark.find()
```

This will initialize Spark in the Jupyter notebook.

  • Thank you. I tried it, but it doesn't work. – Sparrow Jack Aug 24 '23 at 06:01

After trying again and again, I found the reason: although Spark runs in local mode, the source directory contains several parquet files. I needed to convert the DataFrame to an RDD, coalesce it into one partition, and then convert the RDD back into a PySpark DataFrame. After that, pandas_api works fine. I hope this answer helps someone who runs into the same problem.


  • Please read [Why should I not upload images of code/data/errors?](https://meta.stackoverflow.com/q/285551/354577). Instead, format code as a [code block]. The easiest way to do this is to paste the code as text directly into your question, then select it and click the code block button. – ChrisGPT was on strike Aug 27 '23 at 12:55