Questions tagged [pyspark-pandas]

131 questions
1 vote · 0 answers

Write a dynamic frame to S3 in XML format with a custom rowTag and rootTag specified

I used the code below, but I am getting rootTag as 'root' and rowTag as 'record'. I want rootTag to be 'SET' and rowTag to be 'TRECORD'. repartitioned_df = df.repartition(1) datasink4 = glueContext.write_dynamic_frame.from_options(frame =…
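A hedged sketch of one way around this, assuming the spark-xml package (com.databricks.spark.xml) is available on the cluster: convert the DynamicFrame to a Spark DataFrame and use its XML writer, which accepts rootTag and rowTag options directly. The frame name and bucket path below are placeholders.

```python
# Sketch only: assumes the spark-xml package is installed on the Glue/Spark job.
spark_df = dyf.toDF().repartition(1)   # dyf: the DynamicFrame from the question (placeholder name)

(spark_df.write
    .format("xml")                     # provided by spark-xml
    .option("rootTag", "SET")          # outer wrapping element
    .option("rowTag", "TRECORD")       # element name for each record
    .mode("overwrite")
    .save("s3://my-bucket/output/"))   # hypothetical S3 output path
```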
1 vote · 1 answer

How to read data from multiple folders from ADLS into a Databricks dataframe

The file path format is data/year/weeknumber/no of…
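A minimal sketch of one approach, assuming the files are Parquet and live in an ADLS Gen2 container laid out as data/&lt;year&gt;/&lt;weeknumber&gt;/...; the account and container names are placeholders. Wildcards, or an explicit list of paths, let a single read span many folders.

```python
base = "abfss://mycontainer@myaccount.dfs.core.windows.net/data"

# One read across every year/weeknumber folder via wildcards.
df = spark.read.format("parquet").load(f"{base}/*/*/")

# Or pass an explicit list of folders, e.g. selected weeks of one year.
weeks = [f"{base}/2023/{wk}/" for wk in (1, 2, 3)]
df_weeks = spark.read.format("parquet").load(weeks)
```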
1 vote · 1 answer

PySpark: how to perform a conditional calculation on each element of a long string

I have a dataframe that looks like this (columns Worker, Schedule, Overtime): | 1 |…
DPatrick • 59 • 1 • 7
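The excerpt does not show the actual overtime rule, so the following is illustrative only: it counts how many characters of the Schedule string equal a flag value and derives overtime from a hypothetical 40-slot threshold, using built-in column functions instead of looping over the string in Python.

```python
from pyspark.sql import functions as F

# Count occurrences of "1" in Schedule by comparing lengths before/after removal.
result = (
    df.withColumn(
        "worked_slots",
        F.length("Schedule") - F.length(F.regexp_replace("Schedule", "1", "")),
    )
    # Hypothetical rule: anything beyond 40 worked slots counts as overtime.
    .withColumn(
        "overtime_slots",
        F.greatest(F.col("worked_slots") - F.lit(40), F.lit(0)),
    )
)
```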
1 vote · 0 answers

pyspark.pandas.exceptions.PandasNotImplementedError: The method `pd.Series.__iter__()` is not implemented

I am trying to replace the pandas library with the pyspark.pandas library. I tried this (note: df is a pyspark.pandas dataframe): import pyspark.pandas as pd print(set(df["horizon"].unique())) But I got the error below: print(set(df["horizon"].unique())) …
user19930511 • 299 • 2 • 15
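A possible workaround, assuming the goal is simply a plain Python set of the distinct values: pandas-on-Spark Series deliberately do not implement __iter__, so collect the unique values to the driver explicitly before wrapping them in set().

```python
import pyspark.pandas as ps

psdf = ps.read_csv("file.csv")                       # hypothetical source for df
horizons = set(psdf["horizon"].unique().to_numpy())  # to_numpy() pulls the distinct values to the driver
print(horizons)
```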
1 vote · 1 answer

Pandas on Spark 3.2 - nlp.pipe - pd.Series.__iter__() is not implemented

I'm currently trying to migrate some processes from Python to (pandas on) Spark to measure performance. Everything went well until this point: df_info is of type pyspark.pandas, and nlp is defined as: nlp = spacy.load('es_core_news_sm',…
Alejandro • 519 • 1 • 6 • 32
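A sketch under two assumptions: df_info has a text column (called "text" here as a placeholder) and the goal is to feed it to spaCy. nlp.pipe needs a plain Python iterable, so either collect the column to the driver (fine for small data) or keep the work distributed with a pandas UDF that loads the model on each executor.

```python
from pyspark.sql import functions as F
import pandas as pd

# Option 1: collect the column locally, then iterate with nlp.pipe on the driver.
texts = df_info["text"].to_numpy().tolist()
docs = list(nlp.pipe(texts))

# Option 2: stay distributed; each executor loads its own spaCy model.
@F.pandas_udf("array<string>")
def lemmas(texts: pd.Series) -> pd.Series:
    import spacy
    nlp_local = spacy.load("es_core_news_sm")
    return pd.Series([[tok.lemma_ for tok in doc] for doc in nlp_local.pipe(texts)])

sdf = df_info.to_spark().withColumn("lemmas", lemmas("text"))
```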
1 vote · 1 answer

'DataFrame' object has no attribute 'to_delta'

My code used to work. Why does it not work anymore? I updated to the newer Databricks runtime 10.2, so I had to change some earlier code to use pandas on PySpark. # Drop customer ID for AutoML automlDF = churn_features_df.drop(key_id) # Write…
Climbs_lika_Spyder • 6,004 • 3 • 39 • 53
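A hedged explanation and sketch: to_delta() lives on pyspark.pandas.DataFrame, so this error usually means churn_features_df (or the result of .drop()) is still a plain Spark DataFrame. Converting first, or writing Delta through the Spark writer, are both options; the paths below are placeholders.

```python
# Convert to pandas-on-Spark before using pandas-style writers (Spark 3.2 / DBR 10.x).
automl_psdf = churn_features_df.to_pandas_on_spark().drop(columns=[key_id])
automl_psdf.to_delta("/mnt/automl/churn_features", mode="overwrite")

# Or keep the plain Spark DataFrame and use its Delta writer directly.
(churn_features_df.drop(key_id)
    .write.format("delta")
    .mode("overwrite")
    .save("/mnt/automl/churn_features"))
```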
0 votes · 1 answer

Solving a system of multi-variable equations using PySpark on Databricks

Any suggestions, help, or references are most welcome for the problem statement below. I am performing big data analysis on data that is currently stored on Azure. The actual implementation is more complex than the set of equations provided…
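The actual equations are not shown, so as an illustration only: a common pattern on Databricks is to solve one independent linear system per group with applyInPandas, letting each executor call numpy.linalg.solve on its own pandas chunk. All column names below (group_id, a11..a22, b1, b2) are placeholders for a 2x2 system per group.

```python
import numpy as np
import pandas as pd

def solve_group(pdf: pd.DataFrame) -> pd.DataFrame:
    # Build the 2x2 coefficient matrix and right-hand side for this group.
    A = pdf[["a11", "a12", "a21", "a22"]].iloc[0].to_numpy().reshape(2, 2)
    b = pdf[["b1", "b2"]].iloc[0].to_numpy()
    x = np.linalg.solve(A, b)
    return pd.DataFrame({"group_id": [pdf["group_id"].iloc[0]], "x1": [x[0]], "x2": [x[1]]})

solutions = (
    df.groupBy("group_id")
      .applyInPandas(solve_group, schema="group_id long, x1 double, x2 double")
)
```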
0 votes · 0 answers

How to parallelize work in PySpark over chunks of a dataset when each chunk needs to be a pandas df

I have a question about the best way to implement the following problem. I have an LGBM model on my driver. I need to run this model against a very large dataset that is distributed over the executors. In order to run the model, I need to transform the…
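A minimal sketch of one way to do this, assuming model is an already-fitted LightGBM model on the driver and features lists its input columns (both are placeholders): broadcast the model and use mapInPandas, which hands each executor its partition as pandas chunks to score locally.

```python
from typing import Iterator
import pandas as pd

bc_model = spark.sparkContext.broadcast(model)   # ship the driver-side model once

def predict_chunks(batches: Iterator[pd.DataFrame]) -> Iterator[pd.DataFrame]:
    m = bc_model.value
    for pdf in batches:                          # each chunk arrives as a pandas DataFrame
        pdf["prediction"] = m.predict(pdf[features])
        yield pdf

scored = df.mapInPandas(predict_chunks, schema=df.schema.add("prediction", "double"))
```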
0 votes · 0 answers

Using Koalas, how do I save to an external table?

I have the code below to save a Koalas dataframe to an ORC table. How can I modify it to save to an EXTERNAL table? df.reset_index().to_orc( f"/corporativo/mydatabase/mytable", mode="overwrite", partition_cols=["year", "month"] )
neves • 33,186 • 27 • 159 • 192
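One possible approach, assuming a Hive-compatible metastore: keep writing the ORC files to the explicit location as the question already does, then register an external (unmanaged) table over that path and recover its partitions. Database and table names are taken from the question's path.

```python
location = "/corporativo/mydatabase/mytable"
df.reset_index().to_orc(location, mode="overwrite", partition_cols=["year", "month"])

spark.sql(f"""
    CREATE TABLE IF NOT EXISTS mydatabase.mytable
    USING ORC
    LOCATION '{location}'
""")
# May be needed so the metastore picks up the existing year/month partition folders.
spark.sql("MSCK REPAIR TABLE mydatabase.mytable")
```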
0 votes · 2 answers

PySpark on Jupyter Notebook: why can't a dataframe of two rows be converted to a pandas dataframe?

This is the PySpark dataframe, along with its schema. It has just two rows. I want to convert it to a pandas dataframe, but it gets stuck at stage 3: no result, and no information about the progress. Why can this happen? And when I use…
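Not an answer, but a hedged way to narrow this down: toPandas() executes the dataframe's entire lineage, so even a two-row result can hang on an expensive upstream join or shuffle. Materializing and inspecting the plan before collecting usually shows where the time goes.

```python
df.explain(True)      # the physical plan toPandas() will actually execute
df.cache()
print(df.count())     # forces the computation; if this hangs, the problem is upstream of the collect
pdf = df.toPandas()   # now only moves the two cached rows to the driver
```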
0 votes · 2 answers

Manipulating multiple sum() values in a PySpark pivot table

I'm having a little difficulty further manipulating a PySpark pivot table to give me a reduced result. My data is a little more complex than the example below, but it's the best example I can come up with to illustrate what I'm trying to…
zenith7 • 151 • 1 • 3 • 8
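Illustrative only, since the real data isn't shown: when a pivot carries several sum() aggregates, aliasing each aggregate makes the generated column names predictable ("&lt;pivotValue&gt;_&lt;alias&gt;"), which then makes further manipulation straightforward. The column names here (customer, month, amount, qty) are placeholders.

```python
from pyspark.sql import functions as F

pivoted = (
    df.groupBy("customer")
      .pivot("month")
      .agg(F.sum("amount").alias("amount"), F.sum("qty").alias("qty"))
)

# Derived column combining two of the pivoted sums, e.g. an average price for "jan".
pivoted = pivoted.withColumn("jan_avg_price", F.col("jan_amount") / F.col("jan_qty"))
```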
0 votes · 0 answers

How to find getNumPartitions on a pyspark.pandas dataframe

The PySpark documentation says that pandas-on-Spark dataframes are distributed. If I create a dataframe using pyspark.pandas.read_csv('file.csv'), how can I find the number of partitions of that dataframe? Do we have an equivalent to…
shankar • 196 • 14
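A minimal sketch: the pandas-on-Spark API does not expose a partition count directly (as far as I know), but the underlying Spark DataFrame and its RDD are always reachable, so the usual getNumPartitions() still works.

```python
import pyspark.pandas as ps

psdf = ps.read_csv("file.csv")
num_partitions = psdf.to_spark().rdd.getNumPartitions()
print(num_partitions)
```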
0 votes · 0 answers

Find similar rows of a pyspark dataframe based on a particular column using fuzzywuzzy library

I am trying to find "similar" rows in a dataframe based on a particular column. For example, let's say we have this data (columns id and fruit): | 1 | apple | | 2 | appl | | 3 | banana | | 4 | ora | | 5 | banan | | 6 |…
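A sketch of one straightforward (but O(n²)) approach: self cross-join the dataframe and score each pair with fuzz.ratio in a Python UDF, keeping pairs above a hypothetical threshold. This assumes fuzzywuzzy is installed on the cluster and only really suits small-to-medium dataframes.

```python
from pyspark.sql import functions as F
from pyspark.sql.types import IntegerType
from fuzzywuzzy import fuzz

ratio_udf = F.udf(lambda a, b: fuzz.ratio(a, b), IntegerType())

pairs = (
    df.alias("a")
      .crossJoin(df.alias("b"))
      .where(F.col("a.id") < F.col("b.id"))                      # compare each pair once
      .withColumn("similarity", ratio_udf("a.fruit", "b.fruit"))
      .where(F.col("similarity") >= 80)                          # hypothetical cut-off
)
```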
0 votes · 1 answer

Conversion from Spark to Pandas using pandas_api and toPandas

df = spark.table("data").limit(100) df = df.toPandas() This conversion using .toPandas() works just fine since df.limit keeps it to just a few rows. If I get rid of the limit and call toPandas on the whole df, I get the error "Job aborted due to stage failure". I've…
dhk02 • 1 • 2
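A hedged note on the difference, assuming the goal is pandas-style syntax rather than a local pandas object: pandas_api() wraps the Spark DataFrame without collecting it, so the full table never has to fit on the driver, whereas toPandas() genuinely moves everything into driver memory.

```python
# Stays distributed: pandas-like API backed by Spark.
psdf = spark.table("data").pandas_api()
psdf.groupby("some_col").size()              # hypothetical pandas-style operation

# Collects to the driver: only safe when the full result fits there.
pdf = spark.table("data").limit(100).toPandas()
```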