Questions tagged [pyspark-pandas]
131 questions
1
vote
1 answer
Create multiple columns by pivoting even when the pivoted value doesn't exist
I have a PySpark…

Scope
- 727
- 4
- 15
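One way to get the missing pivot columns, sketched under the assumption of hypothetical columns id, status, and amount: passing an explicit value list to pivot() makes Spark create a column for every listed value, even ones absent from the data (those columns are simply filled with nulls).

from pyspark.sql import functions as F

# Listing the pivot values explicitly guarantees one output column per value,
# whether or not that value actually occurs in the data.
pivoted = (
    df.groupBy("id")
      .pivot("status", ["A", "B", "C"])   # hypothetical pivot column and values
      .agg(F.first("amount"))
)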
1
vote
0 answers
write dynamic frame to S3 in XML format with custom rowTag and rootTag specified
I used the below code, but I am getting rootTag as 'root' and rowTag as 'record'. But I want rootTag as 'SET' and rowTag as 'TRECORD'.
repartitioned_df = df.repartition(1)
datasink4 = glueContext.write_dynamic_frame.from_options(frame =…

Ashish Lekhi
- 11
- 2
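If the Glue XML writer keeps emitting the default tags, one workaround — only a sketch, assuming the spark-xml package (com.databricks:spark-xml) is available on the cluster and using an illustrative S3 path — is to write the repartitioned DataFrame directly with custom root and row tags:

# spark-xml exposes rootTag and rowTag as writer options.
(repartitioned_df.write
    .format("xml")
    .option("rootTag", "SET")
    .option("rowTag", "TRECORD")
    .mode("overwrite")
    .save("s3://my-bucket/xml-output/"))   # hypothetical output location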
1
vote
1 answer
how to read data from multiple folders from ADLS into a Databricks dataframe
file path format is data/year/weeknumber/no of…

heena shaikh
- 23
- 4
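Spark readers accept glob patterns and lists of paths, so one possible sketch (assuming Parquet files and an illustrative ADLS base path, not the asker's real layout) is:

# A glob pattern lets a single read span many year/weeknumber folders;
# alternatively, .load() also accepts an explicit list of folder paths.
df = (spark.read
      .format("parquet")
      .load("abfss://container@account.dfs.core.windows.net/data/*/*/"))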
1
vote
1 answer
PySpark: how to perform a conditional calculation on each element of a long string
I have a dataframe that looks like this:
+--------+-------------------------------------+-----------+
| Worker | Schedule | Overtime |
+--------+-------------------------------------+-----------+
| 1 |…

DPatrick
- 59
- 1
- 7
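Without the full example it is hard to be specific, but a common pattern for per-character logic is to split the string into an array and use higher-order functions instead of a UDF (F.filter with a Python lambda needs Spark 3.1+). A sketch with a hypothetical condition — counting the characters of Schedule that equal "1":

from pyspark.sql import functions as F

# split(..., "") breaks the string into single characters;
# filter/size then apply the per-element test and count the matches.
df = df.withColumn(
    "scheduled_slots",
    F.size(F.filter(F.split("Schedule", ""), lambda c: c == F.lit("1")))
)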
1
vote
0 answers
pyspark.pandas.exceptions.PandasNotImplementedError: The method `pd.Series.__iter__()` is not implemented
I am trying to replace the pandas library with the pyspark.pandas library.
I tried this:
NOTE: df is a pyspark.pandas dataframe
import pyspark.pandas as pd
print(set(df["horizon"].unique()))
But got the below error :
print(set(df["horizon"].unique()))
…

user19930511
- 299
- 2
- 15
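pandas-on-Spark deliberately leaves __iter__ unimplemented so a frame is never iterated (and collected) by accident. If the set of distinct values is known to be small, one sketch is to collect it explicitly:

# unique() returns a pandas-on-Spark Series; to_numpy() collects its values
# to the driver, which sidesteps the unimplemented __iter__().
horizons = set(df["horizon"].unique().to_numpy())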
1
vote
1 answer
Pandas on Spark 3.2 - NLP.pipe - pd.Series.__iter__() is not implemented
I'm currently trying to migrate some processes from Python to (pandas on) Spark to measure performance. Everything went well until this point:
df_info is of type pyspark.pandas
nlp is defined as:
nlp = spacy.load('es_core_news_sm',…

Alejandro
- 519
- 1
- 6
- 32
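Since pandas-on-Spark objects cannot be iterated directly, one possible pattern is to hand plain pandas batches to spaCy via the pandas_on_spark.apply_batch accessor. A rough sketch, assuming a hypothetical text column named "text" and that the loaded model serializes to the executors without trouble:

import pandas as pd
import spacy

nlp = spacy.load('es_core_news_sm')

def lemmatize_batch(pdf: pd.DataFrame) -> pd.DataFrame:
    # pdf is an ordinary pandas DataFrame here, so nlp.pipe can iterate it freely.
    pdf["lemmas"] = [" ".join(tok.lemma_ for tok in doc) for doc in nlp.pipe(pdf["text"])]
    return pdf

# apply_batch passes each internal batch to the function as a pandas DataFrame.
df_info = df_info.pandas_on_spark.apply_batch(lemmatize_batch)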
1
vote
1 answer
'DataFrame' object has no attribute 'to_delta'
My code used to work, but it no longer does. I updated to the newer Databricks Runtime 10.2, so I had to change some earlier code to use pandas on PySpark.
# Drop customer ID for AutoML
automlDF = churn_features_df.drop(key_id)
# Write…

Climbs_lika_Spyder
- 6,004
- 3
- 39
- 53
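That error usually means the object is a plain Spark DataFrame, which has no to_delta; only pandas-on-Spark frames do. A sketch for Databricks Runtime 10.2 (Spark 3.2), with a hypothetical output path — newer runtimes use pandas_api() instead of to_pandas_on_spark():

# Drop customer ID for AutoML, then convert to pandas-on-Spark so the
# pandas-style to_delta writer is available.
automlDF = churn_features_df.drop(key_id).to_pandas_on_spark()
automlDF.to_delta("/mnt/churn/automl_features")   # hypothetical path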
0
votes
1 answer
Solving a system of multi-variable equations using PySpark on Databricks
Any suggestions, help, or references are most welcome for the problem statement below. I am performing big data analysis on data that is currently stored on Azure. The actual implementation is more complex than the set of equations provided…

lord_mendonca
- 9
- 4
0
votes
0 answers
How to parallelize work in PySpark over chunks of a dataset when each chunk needs to be a pandas df
I have a question about the best way to implement the following problem.
I have an LGBM model on my driver. I need to run this model against a very large dataset distributed over the executors.
In order to run the model, I need to transform the…

Vinícius Matheus Olivieri
- 85
- 1
- 4
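A common fit for this is DataFrame.mapInPandas, which gives the function an iterator of pandas DataFrames per partition, so the model can score one chunk at a time. A sketch assuming hypothetical names lgbm_model and feature_cols, with the model broadcast to the executors:

import pandas as pd
from pyspark.sql.types import StructType, StructField, DoubleType

# Broadcasting ships the fitted model to each executor once.
model_bc = spark.sparkContext.broadcast(lgbm_model)

def predict_chunks(pdf_iter):
    # Each element of pdf_iter is a pandas DataFrame chunk of one partition.
    for pdf in pdf_iter:
        pdf["prediction"] = model_bc.value.predict(pdf[feature_cols])
        yield pdf

out_schema = StructType(df.schema.fields + [StructField("prediction", DoubleType())])
scored = df.mapInPandas(predict_chunks, schema=out_schema)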
0
votes
0 answers
Using Koalas, how do I save to an external table?
I have the code below to save a Koalas dataframe to an ORC table. How can I modify it to save to an EXTERNAL table?
df.reset_index().to_orc(
    f"/corporativo/mydatabase/mytable",
    mode="overwrite",
    partition_cols=["year", "month"]
)

neves
- 33,186
- 27
- 159
- 192
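to_orc only writes files; an external table also needs a metastore entry. One sketch is to drop down to the Spark writer, where saveAsTable combined with an explicit path creates an external (unmanaged) table — the table name here is hypothetical:

(df.reset_index()
   .to_spark()
   .write
   .format("orc")
   .mode("overwrite")
   .partitionBy("year", "month")
   .option("path", "/corporativo/mydatabase/mytable")
   .saveAsTable("mydatabase.mytable"))   # registering with an explicit path makes the table EXTERNAL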
0
votes
2 answers
PySpark on Jupyter Notebook: why can't a dataframe of two rows be converted to a pandas dataframe?
This is the pyspark dataframe,
and this is the schema of the dataframe. Just two rows.
Then I want to convert it to a pandas dataframe.
But it is suspended at stage 3. No result, and no information about the procedure. Why can this happen?
And when I use…

Sparrow Jack
- 45
- 9
0
votes
2 answers
manipulating multiple sum() values in pyspark pivot table
I'm having a little difficulty further manipulating a PySpark pivot table to give me a reduced result. My data is a little more complex than the example below, but it's the best example I can come up with to illustrate what I'm trying to…

zenith7
- 151
- 1
- 3
- 8
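Without the full example it is hard to match the exact output wanted, but a generic sketch (hypothetical columns store, category, qty, and amount): aliasing each aggregation inside pivot controls the generated column names, which makes further arithmetic on them straightforward.

from pyspark.sql import functions as F

# With aliases, the pivoted columns come out as e.g. "A_qty" and "A_amount"
# for pivot value "A", so they are easy to combine afterwards.
pivoted = (
    df.groupBy("store")
      .pivot("category")
      .agg(F.sum("qty").alias("qty"), F.sum("amount").alias("amount"))
)
pivoted = pivoted.withColumn("A_unit_price", F.col("A_amount") / F.col("A_qty"))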
0
votes
0 answers
how to find getNumPartitions on a pyspark.pandas dataframe
The PySpark documentation says that pandas-on-Spark is distributed. If I create a dataframe using pyspark.pandas.read_csv('file.csv'), how can I know the number of partitions of the pandas dataframe? Do we have an equivalent to…

shankar
- 196
- 14
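A pandas-on-Spark frame is backed by a Spark DataFrame, so one sketch is to convert back and ask the underlying RDD:

import pyspark.pandas as ps

psdf = ps.read_csv('file.csv')
# to_spark() exposes the backing Spark DataFrame, whose RDD knows its partition count.
print(psdf.to_spark().rdd.getNumPartitions())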
0
votes
0 answers
Find similar rows of a pyspark dataframe based on a particular column using fuzzywuzzy library
I am trying to find "similar" rows in a dataframe based on a particular column. For example, let's say we have this data -
+---+------+
| id| fruit|
+---+------+
| 1| apple|
| 2| appl|
| 3|banana|
| 4| ora|
| 5| banan|
| 6|…

DonkeyKong
- 1
- 1
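One possible shape for this is a self cross join scored by a pandas UDF wrapping fuzz.ratio, keeping only pairs above some threshold. The column names follow the example data, the cutoff is hypothetical, and this is only a sketch — a full cross join gets expensive on large data:

import pandas as pd
from pyspark.sql import functions as F
from pyspark.sql.functions import pandas_udf
from fuzzywuzzy import fuzz

@pandas_udf("long")
def fuzz_ratio(a: pd.Series, b: pd.Series) -> pd.Series:
    # fuzz.ratio scores each pair of strings from 0 (different) to 100 (identical).
    return pd.Series([fuzz.ratio(x, y) for x, y in zip(a, b)])

pairs = (df.alias("l")
           .crossJoin(df.alias("r"))
           .where(F.col("l.id") < F.col("r.id"))          # skip self-pairs and mirrored duplicates
           .withColumn("score", fuzz_ratio(F.col("l.fruit"), F.col("r.fruit")))
           .where(F.col("score") >= 85))                   # hypothetical similarity cutoff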
0
votes
1 answer
Conversion from Spark to Pandas using pandas_api and toPandas
df = spark.table("data").limit(100)
df = df.toPandas()
This conversion using .toPandas works just fine since df is limited to just a few rows. If I get rid of the limit and call toPandas on the whole df, I get the error "Job aborted due to stage failure".
I've…

dhk02
- 1
- 2
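toPandas() collects the entire table onto the driver, which is exactly what fails once the limit is removed. If the goal is pandas-style syntax rather than a driver-local copy, pandas_api() keeps the data distributed; a minimal sketch:

# pandas_api() (Spark 3.2+) returns a pandas-on-Spark frame without collecting
# anything to the driver, unlike toPandas().
psdf = spark.table("data").pandas_api()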