Questions tagged [pyspark-pandas]
131 questions
1
vote
1 answer
Create multiple columns by pivoting even when the pivoted value doesn't exist
I have a PySpark…

Scope
- 727
- 4
- 15
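One way to get the missing pivot columns, sketched under the assumption of hypothetical columns id, status, and amount: passing an explicit value list to pivot() makes Spark create a column for every listed value, even ones absent from the data (those columns are simply filled with nulls).

from pyspark.sql import functions as F

# Listing the pivot values explicitly guarantees one output column per value,
# whether or not that value actually occurs in the data.
pivoted = (
    df.groupBy("id")
      .pivot("status", ["A", "B", "C"])   # hypothetical pivot column and values
      .agg(F.first("amount"))
)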
1
vote
0 answers
write dynamic frame to S3 in XML format with custom rowTag and rootTag specified
I used the below code, but I am getting rootTag as 'root' and rowTag as 'record'. But I want rootTag as 'SET' and rowTag as 'TRECORD'.
repartitioned_df = df.repartition(1)
datasink4 = glueContext.write_dynamic_frame.from_options(frame =…

Ashish Lekhi
- 11
- 2
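If the Glue XML writer keeps emitting the default tags, one workaround — only a sketch, assuming the spark-xml package (com.databricks:spark-xml) is available on the cluster and using an illustrative S3 path — is to write the repartitioned DataFrame directly with custom root and row tags:

# spark-xml exposes rootTag and rowTag as writer options.
(repartitioned_df.write
    .format("xml")
    .option("rootTag", "SET")
    .option("rowTag", "TRECORD")
    .mode("overwrite")
    .save("s3://my-bucket/xml-output/"))   # hypothetical output location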
1
vote
1 answer
how to read data from multiple folders from ADLS into a Databricks dataframe
file path format is data/year/weeknumber/no of…

heena shaikh
- 23
- 4
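Spark readers accept glob patterns and lists of paths, so one possible sketch (assuming Parquet files and an illustrative ADLS base path, not the asker's real layout) is:

# A glob pattern lets a single read span many year/weeknumber folders;
# alternatively, .load() also accepts an explicit list of folder paths.
df = (spark.read
      .format("parquet")
      .load("abfss://container@account.dfs.core.windows.net/data/*/*/"))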
1
vote
1 answer
PySpark: how to perform a conditional calculation on each element of a long string
I have a dataframe that looks like this:
+--------+-------------------------------------+-----------+
| Worker | Schedule | Overtime |
+--------+-------------------------------------+-----------+
| 1 |…

DPatrick
- 59
- 1
- 7
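Without the full example it is hard to be specific, but a common pattern for per-character logic is to split the string into an array and use higher-order functions instead of a UDF (F.filter with a Python lambda needs Spark 3.1+). A sketch with a hypothetical condition — counting the characters of Schedule that equal "1":

from pyspark.sql import functions as F

# split(..., "") breaks the string into single characters;
# filter/size then apply the per-element test and count the matches.
df = df.withColumn(
    "scheduled_slots",
    F.size(F.filter(F.split("Schedule", ""), lambda c: c == F.lit("1")))
)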
1
vote
0 answers
pyspark.pandas.exceptions.PandasNotImplementedError: The method `pd.Series.__iter__()` is not implemented
I am trying to replace the pandas library with the pyspark.pandas library.
I tried this:
NOTE: df is a pyspark.pandas dataframe
import pyspark.pandas as pd
print(set(df["horizon"].unique()))
But got the below error :
print(set(df["horizon"].unique()))
…

user19930511
- 299
- 2
- 15
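pandas-on-Spark deliberately leaves __iter__ unimplemented so a frame is never iterated (and collected) by accident. If the set of distinct values is known to be small, one sketch is to collect it explicitly:

# unique() returns a pandas-on-Spark Series; to_numpy() collects its values
# to the driver, which sidesteps the unimplemented __iter__().
horizons = set(df["horizon"].unique().to_numpy())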
1
vote
1 answer
Pandas on Spark 3.2 - NLP.pipe - pd.Series.__iter__() is not implemented
I'm currently trying to migrate some processes from Python to (pandas on) Spark to measure performance. Everything went well until this point:
df_info is of type pyspark.pandas
nlp is defined as:
nlp = spacy.load('es_core_news_sm',…

Alejandro
- 519
- 1
- 6
- 32
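Since pandas-on-Spark objects cannot be iterated directly, one possible pattern is to hand plain pandas batches to spaCy via the pandas_on_spark.apply_batch accessor. A rough sketch, assuming a hypothetical text column named "text" and that the loaded model serializes to the executors without trouble:

import pandas as pd
import spacy

nlp = spacy.load('es_core_news_sm')

def lemmatize_batch(pdf: pd.DataFrame) -> pd.DataFrame:
    # pdf is an ordinary pandas DataFrame here, so nlp.pipe can iterate it freely.
    pdf["lemmas"] = [" ".join(tok.lemma_ for tok in doc) for doc in nlp.pipe(pdf["text"])]
    return pdf

# apply_batch passes each internal batch to the function as a pandas DataFrame.
df_info = df_info.pandas_on_spark.apply_batch(lemmatize_batch)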
1
vote
1 answer
'DataFrame' object has no attribute 'to_delta'
My code used to work, but it no longer does. I updated to the newer Databricks Runtime 10.2, so I had to change some earlier code to use pandas on PySpark.
# Drop customer ID for AutoML
automlDF = churn_features_df.drop(key_id)
# Write…

Climbs_lika_Spyder
- 6,004
- 3
- 39
- 53
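That error usually means the object is a plain Spark DataFrame, which has no to_delta; only pandas-on-Spark frames do. A sketch for Databricks Runtime 10.2 (Spark 3.2), with a hypothetical output path — newer runtimes use pandas_api() instead of to_pandas_on_spark():

# Drop customer ID for AutoML, then convert to pandas-on-Spark so the
# pandas-style to_delta writer is available.
automlDF = churn_features_df.drop(key_id).to_pandas_on_spark()
automlDF.to_delta("/mnt/churn/automl_features")   # hypothetical path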
0
votes
1 answer
Solving a system of multi-variable equations using PySpark on Databricks
Any suggestions, help, or references are most welcome for the problem statement below. I am performing big data analysis on data that is currently stored on Azure. The actual implementation is more complex than the set of equations provided…

lord_mendonca
- 9
- 4
0
votes
0 answers
How to parallelize work in PySpark over chunks of a dataset when each chunk needs to be a pandas df
I have a question about the best way to implement the following problem.
I have an LGBM model on my driver. I need to run this model against a very large dataset distributed over the executors.
In order to run the model, I need to transform the…

Vinícius Matheus Olivieri
- 85
- 1
- 4
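A common fit for this is DataFrame.mapInPandas, which gives the function an iterator of pandas DataFrames per partition, so the model can score one chunk at a time. A sketch assuming hypothetical names lgbm_model and feature_cols, with the model broadcast to the executors:

import pandas as pd
from pyspark.sql.types import StructType, StructField, DoubleType

# Broadcasting ships the fitted model to each executor once.
model_bc = spark.sparkContext.broadcast(lgbm_model)

def predict_chunks(pdf_iter):
    # Each element of pdf_iter is a pandas DataFrame chunk of one partition.
    for pdf in pdf_iter:
        pdf["prediction"] = model_bc.value.predict(pdf[feature_cols])
        yield pdf

out_schema = StructType(df.schema.fields + [StructField("prediction", DoubleType())])
scored = df.mapInPandas(predict_chunks, schema=out_schema)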
0
votes
0 answers
Using Koalas, how do I save to an external table?
I have the code below to save a Koalas dataframe to an ORC table. How can I modify it to save to an EXTERNAL table?
df.reset_index().to_orc(
    f"/corporativo/mydatabase/mytable",
    mode="overwrite",
    partition_cols=["year", "month"]
)

neves
- 33,186
- 27
- 159
- 192
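to_orc only writes files; an external table also needs a metastore entry. One sketch is to drop down to the Spark writer, where saveAsTable combined with an explicit path creates an external (unmanaged) table — the table name here is hypothetical:

(df.reset_index()
   .to_spark()
   .write
   .format("orc")
   .mode("overwrite")
   .partitionBy("year", "month")
   .option("path", "/corporativo/mydatabase/mytable")
   .saveAsTable("mydatabase.mytable"))   # registering with an explicit path makes the table EXTERNAL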
0
votes
2 answers
PySpark on Jupyter Notebook: why can't a dataframe of two rows be converted to a pandas dataframe?
This is the pyspark dataframe,
and this is the schema of the dataframe. Just two rows.
Then I want to convert it to a pandas dataframe.
But it is suspended at stage 3. No result, and no information about the procedure. Why can this happen?
And when I use…

Sparrow Jack
- 45
- 9
0
votes
2 answers
manipulating multiple sum() values in pyspark pivot table
I'm having a little difficulty further manipulating a PySpark pivot table to give me a reduced result. My data is a little more complex than the example below, but it's the best example I can come up with to illustrate what I'm trying to…

zenith7
- 151
- 1
- 3
- 8
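Without the full example it is hard to match the exact output wanted, but a generic sketch (hypothetical columns store, category, qty, and amount): aliasing each aggregation inside pivot controls the generated column names, which makes further arithmetic on them straightforward.

from pyspark.sql import functions as F

# With aliases, the pivoted columns come out as e.g. "A_qty" and "A_amount"
# for pivot value "A", so they are easy to combine afterwards.
pivoted = (
    df.groupBy("store")
      .pivot("category")
      .agg(F.sum("qty").alias("qty"), F.sum("amount").alias("amount"))
)
pivoted = pivoted.withColumn("A_unit_price", F.col("A_amount") / F.col("A_qty"))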
0
votes
0 answers
how to find getNumPartitions on a pyspark.pandas dataframe
The PySpark documentation says that pandas-on-Spark is distributed. If I create a dataframe using pyspark.pandas.read_csv('file.csv'), how can I know the number of partitions of the pandas dataframe? Do we have an equivalent to…

shankar
- 196
- 14
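A pandas-on-Spark frame is backed by a Spark DataFrame, so one sketch is to convert back and ask the underlying RDD:

import pyspark.pandas as ps

psdf = ps.read_csv('file.csv')
# to_spark() exposes the backing Spark DataFrame, whose RDD knows its partition count.
print(psdf.to_spark().rdd.getNumPartitions())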
0
votes
0 answers
Find similar rows of a pyspark dataframe based on a particular column using fuzzywuzzy library
I am trying to find "similar" rows in a dataframe based on a particular column. For example, let's say we have this data -
+---+------+
| id| fruit|
+---+------+
| 1| apple|
| 2| appl|
| 3|banana|
| 4| ora|
| 5| banan|
| 6|…

DonkeyKong
- 1
- 1
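One possible shape for this is a self cross join scored by a pandas UDF wrapping fuzz.ratio, keeping only pairs above some threshold. The column names follow the example data, the cutoff is hypothetical, and this is only a sketch — a full cross join gets expensive on large data:

import pandas as pd
from pyspark.sql import functions as F
from pyspark.sql.functions import pandas_udf
from fuzzywuzzy import fuzz

@pandas_udf("long")
def fuzz_ratio(a: pd.Series, b: pd.Series) -> pd.Series:
    # fuzz.ratio scores each pair of strings from 0 (different) to 100 (identical).
    return pd.Series([fuzz.ratio(x, y) for x, y in zip(a, b)])

pairs = (df.alias("l")
           .crossJoin(df.alias("r"))
           .where(F.col("l.id") < F.col("r.id"))          # skip self-pairs and mirrored duplicates
           .withColumn("score", fuzz_ratio(F.col("l.fruit"), F.col("r.fruit")))
           .where(F.col("score") >= 85))                   # hypothetical similarity cutoff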
0
votes
1 answer
Conversion from Spark to Pandas using pandas_api and toPandas
df = spark.table("data").limit(100)
df = df.toPandas()
This conversion using .toPandas works just fine since df is limited to just a few rows. If I get rid of the limit and call toPandas on the whole df, I get the error "Job aborted due to stage failure".
I've…

dhk02
- 1
- 2
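toPandas() collects the entire table onto the driver, which is exactly what fails once the limit is removed. If the goal is pandas-style syntax rather than a driver-local copy, pandas_api() keeps the data distributed; a minimal sketch:

# pandas_api() (Spark 3.2+) returns a pandas-on-Spark frame without collecting
# anything to the driver, unlike toPandas().
psdf = spark.table("data").pandas_api()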