Questions tagged [pyspark-pandas]
131 questions
1
vote
1 answer
pyspark df.filter for dynamic columns based on user arguments
I am trying to perform isin() using df.filter, but I need to filter columns dynamically based on the user arguments passed. I am appending custom columns to a list and joining them later, which converts the list into a string, since isin() is also getting converted into…

avinash reddy
- 25
- 4
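One possible shape for the dynamic filter described above, as a sketch: build one isin() condition per column and combine them, instead of joining anything into a string. The dict of user arguments (user_filters) and the column names are placeholders.
from functools import reduce
from pyspark.sql import functions as F

# hypothetical user input: column name -> allowed values
user_filters = {"country": ["US", "IN"], "status": ["active"]}

# one isin() condition per column, ANDed together
conditions = [F.col(c).isin(vals) for c, vals in user_filters.items()]
filtered_df = df.filter(reduce(lambda a, b: a & b, conditions))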
1
vote
1 answer
Pandas API on Spark runs far slower than plain pandas
I am applying a transform to my dataframe. While the process takes just 3 seconds with pandas, when I use PySpark and the Pandas API on Spark it takes approximately 30 minutes, yes 30 minutes! My data is 10k rows.
The following is my pandas approach:
def…

Ugur Selim Ozen
- 159
- 3
- 10
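For a dataset of roughly 10k rows, one option worth noting (a sketch under the assumption that the data genuinely fits on the driver) is to skip the distributed path and run the original pandas transform locally, since the Spark scheduling overhead usually dominates at that size. psdf and the column names below are placeholders.
# collect the pandas-on-Spark DataFrame to plain pandas on the driver
pdf = psdf.to_pandas()
# stand-in for the real transform, now running in ordinary pandas
result = pdf.assign(total=pdf["a"] + pdf["b"])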
1
vote
2 answers
Pyspark - Transpose distinct row values into column headers, where I'll insert values from the same row but a different column
This is the table I want to transpose.
I created a list of the distinct values in DESC_INFO using this:
columnsToPivot = list(dict.fromkeys(df.filter(F.col("DESC_INFO") != '').rdd.map(lambda x: (x.DESC_INFO, x.RAW_INFO)).collect()))
And then I tried…

José Bastos
- 11
- 2
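A sketch of the pivot the question above appears to need, assuming a grouping key column (called ID here, a placeholder) and that keeping one RAW_INFO value per DESC_INFO header is acceptable.
from pyspark.sql import functions as F

# distinct DESC_INFO values become column headers, filled with RAW_INFO
pivoted = (
    df.filter(F.col("DESC_INFO") != "")
      .groupBy("ID")                     # assumed grouping key
      .pivot("DESC_INFO")
      .agg(F.first("RAW_INFO"))
)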
1
vote
0 answers
How to flatten a complex nested JSON file using pyspark
I have a complex nested JSON file which has struct types, array types, lists, and dicts nested within each other.
I have a function which flattens the columns with struct type, but when it encounters any other type it fails.
Is there any recursive function…

harshith
- 71
- 9
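A generic flattening sketch for struct and array columns (map/dict fields would still need separate handling); the underscore-joined naming is an assumption, not the asker's function.
from pyspark.sql import functions as F
from pyspark.sql.types import StructType, ArrayType

def flatten(df):
    # repeatedly expand structs and explode arrays until neither remains
    while True:
        struct_cols = [f.name for f in df.schema.fields if isinstance(f.dataType, StructType)]
        array_cols = [f.name for f in df.schema.fields if isinstance(f.dataType, ArrayType)]
        if not struct_cols and not array_cols:
            return df
        for c in struct_cols:
            # promote each struct field to a top-level column, prefixed with the parent name
            expanded = [F.col(c + "." + sub.name).alias(c + "_" + sub.name)
                        for sub in df.schema[c].dataType.fields]
            df = df.select([col for col in df.columns if col != c] + expanded)
        for c in array_cols:
            # one row per array element; structs inside are flattened on the next pass
            df = df.withColumn(c, F.explode_outer(c))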
1
vote
1 answer
PySpark 3.3.0 is not using cached DataFrame when performing a concat with Pandas API
Since we upgraded our job to PySpark 3.3.0, we have issues with cached ps.DataFrames that are then concatenated using pyspark pandas: ps.concat([df1, df2])
The issue is that the concatenated DataFrame is not using the cached data but is re-reading the…

frco9
- 11
- 2
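One thing worth trying, sketched here without any guarantee that it avoids the re-read seen in 3.3.0: cache each pandas-on-Spark frame explicitly through the .spark accessor before concatenating. The paths are placeholders.
import pyspark.pandas as ps

# materialize each frame's plan once before the concat
df1 = ps.read_parquet("/path/one").spark.cache()
df2 = ps.read_parquet("/path/two").spark.cache()
combined = ps.concat([df1, df2])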
1
vote
1 answer
How to replace any null in a pyspark df with the value from the row below, same column
Let's say I have a pyspark DF:
| Column A | Column B |
| -------- | -------- |
| val1 | val1B |
| null | val2B |
| val2 | null |
| val3 | val3B |
Can someone help me with replacing any null value in any column (for the…

newmon
- 25
- 3
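One way to fill from the row below, sketched under the assumption that an explicit ordering column exists (called row_id here, a placeholder), since Spark rows have no inherent order.
from pyspark.sql import functions as F, Window

w = Window.orderBy("row_id")   # assumed ordering column
# for each column, take the value from the next row when the current one is null;
# runs of consecutive nulls would need first(..., ignorenulls=True) over a following frame instead
filled = df.select(
    *[F.coalesce(F.col(c), F.lead(c).over(w)).alias(c) for c in ["Column A", "Column B"]]
)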
1
vote
0 answers
How to filter out groupby groups for which not all rows are present in another DataFrame
The purpose of this code is to filter out the groups of a pyspark DataFrame that contain rows which do not have a matching row in the hub DataFrame. The groups are determined based on the primary key.
To check if all rows of the group have matching…

jcnouwens
- 11
- 3
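A sketch of one way to drop every group that contains an unmatched row, using two anti-joins; pk, group_id, df and hub are placeholder names.
# rows whose primary key has no match in the hub DataFrame
missing = df.join(hub, on="pk", how="left_anti")
# the groups those rows belong to
bad_groups = missing.select("group_id").distinct()
# keep only groups in which every row matched
complete_only = df.join(bad_groups, on="group_id", how="left_anti")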
1
vote
1 answer
How to replace text in a column with the values contained in the columns named in that text
In PySpark, I'm trying to replace multiple text values in a column with the values present in the columns whose names appear in the calc column (a formula).
So to be clear, here is an example:
Input:
|param_1|param_2|calc…

Cazau
- 13
- 3
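A sketch of substituting each referenced column's value into the formula text with the SQL replace function; the column names come from the truncated example, and overlapping names or special characters are not handled.
from pyspark.sql import functions as F

result = df.withColumn("calc_resolved", F.col("calc"))
for name in ["param_1", "param_2"]:
    # replace the literal column name in the text with that column's value
    result = result.withColumn(
        "calc_resolved",
        F.expr(f"replace(calc_resolved, '{name}', cast({name} as string))"),
    )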
1
vote
1 answer
How to groupby on an index column in pyspark pandas?
I have a pandas API on Spark (Koalas) dataframe as below:
         c1  c2
id name
a1 a      1   1
a1 b      2   2
b1 c      3   3
How can I do a groupby like the below?
df.groupby(level=1).sum()

Selva
- 951
- 7
- 23
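If groupby(level=...) is not supported by the installed pyspark.pandas version, one workaround (sketched) is to reset the index so the level becomes an ordinary column and group by its name.
import pyspark.pandas as ps

psdf = psdf.reset_index()                       # "id" and "name" become columns
out = psdf.groupby("name")[["c1", "c2"]].sum()  # same result as df.groupby(level=1).sum()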
1
vote
1 answer
AttachDistributedSequence is not supported in Unity Catalog
I'm trying to read a table on Databricks into a DataFrame using pyspark.pandas.read_table and receive the following error:
AnalysisException: [UC_COMMAND_NOT_SUPPORTED] AttachDistributedSequence is not supported in Unity…

Toivo Mattila
- 377
- 1
- 9
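A possible workaround, sketched: the default distributed-sequence index is what introduces AttachDistributedSequence, so switching the default index type may avoid the unsupported operator, at the cost of non-consecutive index values. The table name is a placeholder.
import pyspark.pandas as ps

ps.set_option("compute.default_index_type", "distributed")
psdf = ps.read_table("catalog.schema.my_table")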
1
vote
1 answer
ArrowInvalid: Could not convert ... with type DataFrame: did not recognize Python value type when inferring an Arrow data type
Using the IForest library, I'm implementing a function for detecting outliers with the following code:
import pyspark.pandas as pd
import numpy as np
from alibi_detect.od import IForest
# **************** Modelo IForest…

Daniel Vera
- 77
- 1
- 10
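One common cause of this Arrow error is handing a pandas-on-Spark object to a library that expects numpy or plain pandas; a sketch of converting first (psdf and the feature column names are placeholders).
import pyspark.pandas as ps
from alibi_detect.od import IForest

# collect the features to a plain numpy array on the driver
X = psdf[["feature_1", "feature_2"]].to_numpy()
od = IForest(threshold=0.0)   # threshold value is a placeholder
od.fit(X)
preds = od.predict(X)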
1
vote
0 answers
How to use parallel processing with concurrent jobs in Databricks?
A. Background:
I have to do text manipulation using Python (like concatenation, converting to a spaCy doc, getting verbs from the spaCy doc, etc.) for 1 million records.
Each record takes 1 second, meaning it will take roughly 10 days for 1 million records!
There's no…

newbie101
- 65
- 7
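A sketch of pushing the per-record spaCy work out to the executors with a pandas UDF instead of looping on the driver; the text column name, the model name, and the verb extraction are placeholder choices.
import pandas as pd
from pyspark.sql import functions as F
from pyspark.sql.types import ArrayType, StringType

@F.pandas_udf(ArrayType(StringType()))
def extract_verbs(texts: pd.Series) -> pd.Series:
    import spacy
    # loaded once per Arrow batch; caching the model per worker is a further optimisation
    nlp = spacy.load("en_core_web_sm")
    return texts.apply(lambda t: [tok.lemma_ for tok in nlp(t) if tok.pos_ == "VERB"])

result = df.withColumn("verbs", extract_verbs("text"))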
1
vote
0 answers
pyspark calculate custom metric on grouped data
I have a large dataframe (40 billion+ rows) which can be grouped by a key. I want to apply a custom calculation on a few fields of each group and derive a single value for that group. E.g., the dataframe below has a group_key and I want to derive a single value…

user14297339
- 33
- 5
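A sketch using groupBy(...).applyInPandas, which runs a pandas function once per group and can emit a single row per group; the column names, the metric, and the output schema are placeholders.
import pandas as pd

def custom_metric(pdf: pd.DataFrame) -> pd.DataFrame:
    # placeholder metric: weighted average of x by y
    value = (pdf["x"] * pdf["y"]).sum() / pdf["y"].sum()
    return pd.DataFrame({"group_key": [pdf["group_key"].iloc[0]], "metric": [value]})

result = df.groupBy("group_key").applyInPandas(
    custom_metric, schema="group_key string, metric double"
)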
1
vote
1 answer
How do I distribute application of a function that returns a scalar over a grouped dataframe using pandas API on Spark with Azure Databricks?
[I managed to answer my own question in a narrow sense, but hopefully someone who knows more than I do can explain why my solution works and give a more general answer.]
I am new to Databricks, Spark, and modern distributed computing. (I do…

jtolle
- 7,023
- 2
- 28
- 50
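With the pandas API on Spark itself, one sketch is GroupBy.apply with a function that wraps the per-group scalar in a one-row pandas DataFrame; without type hints the output schema is inferred from a sample. Names are placeholders.
import pandas as pd

def my_stat(group):
    # the per-group scalar, wrapped so each group yields one output row
    return pd.DataFrame({"metric": [(group["x"] * group["y"]).sum()]})

result = psdf.groupby("group_key").apply(my_stat)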
1
vote
0 answers
pyspark.pandas - type Series doesn't define __round__ method
I have a Spark DF that I've converted to a pyspark pandas DF using df.to_pandas_on_spark().
I have some logic that rounds a column, i.e.:
df[Column_Name] = round(df[Income] - df[Fees], 2)
but I get the following error:
TypeError: type Series doesn't…

mikelowry
- 1,307
- 4
- 21
- 43
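A sketch of the usual fix: the builtin round() needs __round__, which the pandas-on-Spark Series does not define, but the Series.round() method does the same job (Column_Name, Income and Fees are the question's own variables).
df[Column_Name] = (df[Income] - df[Fees]).round(2)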