Questions tagged [pandas-udf]
41 questions
0
votes
1 answer
Pandas UDF into an array-type column
My assignment is to store the output of the following into an array-type column:
def sample_udf(df: SparkDataFrame):
    device_issues = []
    if df['altitude'] == 0:
        return "alt"
    elif (df['latitude'] <= -90
          or df['latitude'] >= 90):
        …
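A minimal sketch of one way to produce an array-type column with a pandas UDF, assuming the columns are named altitude and latitude and that each row should collect every applicable issue label (the "lat" label is an assumption) rather than returning on the first match:

import pandas as pd
from pyspark.sql.functions import pandas_udf
from pyspark.sql.types import ArrayType, StringType

@pandas_udf(ArrayType(StringType()))
def device_issues(altitude: pd.Series, latitude: pd.Series) -> pd.Series:
    # Build one list of issue labels per input row.
    issues = []
    for alt, lat in zip(altitude, latitude):
        row_issues = []
        if alt == 0:
            row_issues.append("alt")
        if lat <= -90 or lat >= 90:  # "lat" label is an assumption
            row_issues.append("lat")
        issues.append(row_issues)
    return pd.Series(issues)

df = df.withColumn("issues", device_issues("altitude", "latitude"))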

Jilinnie Park
- 11
- 3
0
votes
1 answer
Correct type hints for PandasUDFType.GROUPED_AGG that returns an array of doubles
I am using a Grouped Agg Pandas UDF to average the values of an array column element-wise (aka mean pooling). I keep getting the following warning and have not been able to find the correct type hints to provide for PandasUDFType.GROUPED_AGG with…
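With the Spark 3 type-hint style, a grouped-agg UDF is inferred from a Series-to-scalar-like signature, so one way to avoid the PandasUDFType.GROUPED_AGG deprecation warning is a sketch like this (column and group names are assumptions):

import numpy as np
import pandas as pd
from pyspark.sql.functions import pandas_udf

# A Series parameter with a non-Series return annotation is inferred
# as a grouped-agg pandas UDF in Spark 3.x.
@pandas_udf("array<double>")
def mean_pool(v: pd.Series) -> list:
    # Each element of v is one row's array; stack and average element-wise.
    return np.mean(np.stack(v.tolist()), axis=0).tolist()

result = df.groupBy("group_id").agg(mean_pool("embeddings").alias("pooled"))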

David
- 2,200
- 1
- 12
- 22
0
votes
0 answers
Pyspark PandasUDF: One pd.Series element per Dataframe row
I work with a couple of PySpark UDFs which slow down my code, hence I want to convert some of them to Pandas UDFs. One UDF takes a list of strings as an argument (which comes from another column of the Spark DF). If I declare the python function as…
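For reference, in a scalar pandas UDF an array<string> column arrives as a single pd.Series whose elements are the per-row lists, so the hint is pd.Series -> pd.Series rather than List[str]; a sketch with an assumed column name:

import pandas as pd
from pyspark.sql.functions import pandas_udf

@pandas_udf("long")
def count_tokens(tokens: pd.Series) -> pd.Series:
    # Each element of `tokens` is the list of strings from one DF row.
    return tokens.map(len)

df = df.withColumn("n_tokens", count_tokens("tokens"))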

Moritz
- 495
- 1
- 7
- 17
0
votes
0 answers
PySpark PandasUDF with 2 different argument data types
I have a dataframe A with a column containing a string. I want to compare this string with another dataframe B, which only has one column that contains a list of tuples of strings. What I did so far: I transformed B into a list, through which I…
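One hedged sketch for this shape of problem: collect the small dataframe B once, broadcast it, and let the pandas UDF do a vectorized membership test (all names below are hypothetical):

import pandas as pd
from pyspark.sql.functions import pandas_udf

# Hypothetical: B has a single column "pairs" holding tuples of strings.
pairs = [tuple(row["pairs"]) for row in B.collect()]
bc_pairs = spark.sparkContext.broadcast(pairs)

@pandas_udf("boolean")
def matches_any(s: pd.Series) -> pd.Series:
    # Flatten the broadcast tuples into one lookup set, then test per batch.
    lookup = {item for pair in bc_pairs.value for item in pair}
    return s.isin(lookup)

A = A.withColumn("has_match", matches_any("text"))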

Moritz
- 495
- 1
- 7
- 17
0
votes
1 answer
Pandas UDF StructField return
I am trying to return a StructField from a Pandas UDF in PySpark, used with aggregation, with the following function signature:
def parcel_to_polygon(geom: pd.Series, entity_ids: pd.Series) -> Tuple[int, str, List[List[str]]]:
But it turns out that the…
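Grouped-agg pandas UDFs are awkward with struct results; a common workaround is groupBy().applyInPandas, where the output schema can declare the nested fields directly. A sketch under assumed column names, with a placeholder aggregation body:

import pandas as pd

schema = "entity_id long, status string, rings array<array<string>>"

def parcel_to_polygon(pdf: pd.DataFrame) -> pd.DataFrame:
    # Placeholder geometry handling; the real aggregation over `geom` goes here.
    rings = [pdf["geom"].astype(str).tolist()]
    return pd.DataFrame(
        [{"entity_id": pdf["entity_id"].iloc[0], "status": "ok", "rings": rings}]
    )

result = df.groupBy("entity_id").applyInPandas(parcel_to_polygon, schema=schema)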

Tarique
- 463
- 3
- 16
0
votes
0 answers
separating dates and getting all permutations of products in Pandas UDF
I am trying to get permutations of all possible pairs of dates using a pandas_udf. As I understand it, the dataframe has to be grouped before being sent to a pandas_udf, so I am adding an ID and grouping by it, but I get an error. Here is a small example to…
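A sketch of the grouped-map route with a synthetic ID, using itertools.permutations for the date pairs (column names are assumptions):

import itertools
import pandas as pd
from pyspark.sql import functions as F

schema = "id long, date_a date, date_b date"

def date_pairs(pdf: pd.DataFrame) -> pd.DataFrame:
    # All ordered pairs of distinct dates within one group.
    pairs = itertools.permutations(pdf["date"].unique(), 2)
    return pd.DataFrame(
        [{"id": pdf["id"].iloc[0], "date_a": a, "date_b": b} for a, b in pairs]
    )

result = (
    df.withColumn("id", F.lit(1))
      .groupBy("id")
      .applyInPandas(date_pairs, schema=schema)
)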

Matt
- 85
- 6
0
votes
0 answers
pandas_udf failing for large dataset
I have a problem with a large dataframe (40 billion+ rows) where I am creating a new column by passing an array column to a UDF. A plain PySpark UDF works for a smaller dataset but stops working beyond a few thousand records. So I am trying pandas_udf…
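When a pandas_udf dies only at scale, one common suspect is the Arrow batch size rather than the UDF logic; a sketch of that mitigation (the UDF body and column name are assumptions, not a confirmed fix for this case):

import pandas as pd
from pyspark.sql.functions import pandas_udf

# Smaller Arrow batches reduce per-batch memory for wide array columns.
spark.conf.set("spark.sql.execution.arrow.maxRecordsPerBatch", "1000")

@pandas_udf("double")
def array_mean(values: pd.Series) -> pd.Series:
    return values.map(lambda xs: float(sum(xs)) / len(xs) if len(xs) else 0.0)

df = df.withColumn("mean_val", array_mean("values"))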

user14297339
- 33
- 5
0
votes
1 answer
Databricks notebook runs faster when triggered manually compared to when run as a job
I don't know if this question has been covered earlier, but here goes - I have a notebook that I can run manually using the 'Run' button in the notebook, or as a job.
The runtime for running the notebook directly is roughly 2 hours. But when I…

Vidisha Kanodia
- 21
- 5
0
votes
1 answer
Dividing a set of columns by its average in Pyspark
I have to divide a set of columns in a pyspark.sql.dataframe by their respective column averages, but I am not able to find a correct way to do it. Below is sample data and my present code.
Input Data
columns = ["Col1", "Col2", "Col3","Name"]
data…
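For this one a UDF may not even be needed: one aggregation pass collects every column mean, then a single select divides through (a sketch against the sample columns):

from pyspark.sql import functions as F

cols = ["Col1", "Col2", "Col3"]

# One pass for all the averages, returned as a single Row.
means = df.select([F.avg(c).alias(c) for c in cols]).first()

df_scaled = df.select(
    *[(F.col(c) / F.lit(means[c])).alias(c) for c in cols], "Name"
)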

Deb
- 499
- 2
- 15
0
votes
1 answer
PySpark SparseVector dataframe columns: .dot product or other vector-type column computation using @udf or @pandas_udf
I am trying to compute the .dot product between 2 columns of a given dataframe.
SparseVectors already have this ability in Spark, so I am trying to do it in an easy & scalable way without converting to RDDs or to
DenseVectors, but I'm stuck; I've spent the past 3…
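pyspark.ml vectors implement .dot directly, and VectorUDT columns generally do not pass through Arrow into pandas UDFs, so the plain @udf route is the safer sketch here (column names are assumptions):

from pyspark.sql.functions import udf
from pyspark.sql.types import DoubleType

# Row-wise dot product; SparseVector.dot handles sparse-sparse and
# sparse-dense pairs without converting to DenseVector first.
dot_udf = udf(lambda u, v: float(u.dot(v)), DoubleType())

df = df.withColumn("uv_dot", dot_udf("features_a", "features_b"))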

n1tk
- 2,406
- 2
- 21
- 35
0
votes
1 answer
PySpark UDF to Pandas UDF for string columns
I have a UDF that is slow for a large dataset, and I am trying to improve execution time and scalability by leveraging pandas_udfs. All my searching and the official documentation focus on the scalar and mapping approaches, which I have already used, but I do…
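For a string column, the scalar pandas UDF shape is pd.Series in, pd.Series out, which lets vectorized .str methods replace the per-row Python call; a sketch with an assumed transformation and column name:

import pandas as pd
from pyspark.sql.functions import pandas_udf

@pandas_udf("string")
def normalize(s: pd.Series) -> pd.Series:
    # The whole Arrow batch arrives as one Series, so .str ops are vectorized.
    return s.str.strip().str.lower()

df = df.withColumn("clean_text", normalize("text"))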

n1tk
- 2,406
- 2
- 21
- 35