Questions tagged [pandas-udf]

41 questions
0
votes
1 answer

Pandas UDF output into an array-type column

My assignment is to store the following into an array-type column: def sample_udf(df:SparkDataFrame): device_issues = [] if (df['altitude'] == 0): return "alt" elif (df['latitude'] <= -90 or df['latitude'] >= 90): …
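A minimal sketch of the pandas-level logic for this kind of check, assuming hypothetical `altitude`/`latitude` columns from the excerpt; the Spark wiring is shown only in comments because it depends on the asker's actual schema.

```python
import pandas as pd

def find_device_issues(altitude: pd.Series, latitude: pd.Series) -> pd.Series:
    """Return, for each row, a (possibly empty) list of issue labels."""
    def issues(alt, lat):
        out = []
        if alt == 0:
            out.append("alt")
        if lat <= -90 or lat >= 90:
            out.append("lat")
        return out
    return pd.Series([issues(a, l) for a, l in zip(altitude, latitude)])

# With Spark available, the same function becomes a Series-to-Series pandas UDF
# whose return type is an array column:
# from pyspark.sql.functions import pandas_udf
# from pyspark.sql.types import ArrayType, StringType
# issues_udf = pandas_udf(find_device_issues, ArrayType(StringType()))
# df = df.withColumn("device_issues", issues_udf("altitude", "latitude"))

print(find_device_issues(pd.Series([0, 10]), pd.Series([45.0, 95.0])).tolist())
# → [['alt'], ['lat']]
```

Note the key difference from the excerpt's code: the function must return one list per row, not a single string for the whole frame.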
0
votes
1 answer

Correct type hints for PandasUDFType.GROUPED_AGG that returns an array of doubles

I am using a Grouped Agg Pandas UDF to average the values of an array column element-wise (aka mean pooling). I keep getting the following warning and have not been able to find the correct type hints to provide for PandasUDFType.GROUPED_AGG with…
David
  • 2,200
  • 1
  • 12
  • 22
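A sketch of the mean-pooling aggregation itself, in plain pandas; the commented Spark wiring uses the type-hinted pandas UDF style that replaces the deprecated `PandasUDFType.GROUPED_AGG` constant (the warning in the question). Column names are assumptions.

```python
import numpy as np
import pandas as pd

def mean_pool(v: pd.Series) -> list:
    """Element-wise mean of an array column within one group (mean pooling)."""
    return np.mean(np.stack(v.to_numpy()), axis=0).tolist()

# With Spark, a type-hinted grouped-agg pandas UDF avoids the deprecation warning:
# from pyspark.sql.functions import pandas_udf
# @pandas_udf("array<double>")
# def mean_pool_udf(v: pd.Series) -> list:
#     return np.mean(np.stack(v.to_numpy()), axis=0).tolist()
# df.groupBy("id").agg(mean_pool_udf("embedding"))

print(mean_pool(pd.Series([[1.0, 2.0], [3.0, 4.0]])))
# → [2.0, 3.0]
```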
0
votes
0 answers

Pyspark PandasUDF: One pd.Series element per Dataframe row

I work with a couple of pyspark UDFs which slow down my code, hence I want to transform some of them to PandasUDFs. One UDF takes a list of strings as an argument (which comes from another column of the Spark DF). If I declare the python function as…
Moritz
  • 495
  • 1
  • 7
  • 17
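The point of confusion in this question is worth illustrating: in a Series-to-Series pandas UDF, an array column arrives as a `pd.Series` whose *elements* are the per-row lists, so the function operates on a Series of lists, not on one list. A minimal sketch with assumed column names:

```python
import pandas as pd

def count_tokens(tokens: pd.Series) -> pd.Series:
    """Each element of `tokens` is the entire list for one row,
    not an individual string."""
    return tokens.apply(len)

# Spark wiring (hypothetical names):
# from pyspark.sql.functions import pandas_udf
# count_udf = pandas_udf(count_tokens, "int")
# df = df.withColumn("n_tokens", count_udf("string_list_col"))

print(count_tokens(pd.Series([["a", "b"], ["c"]])).tolist())
# → [2, 1]
```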
0
votes
0 answers

PySpark PandasUDF with 2 different argument data types

I have a dataframe A with a column containing a string. I want to compare this string with another dataframe B, which only has one column that contains a list of tuples of strings. What I did so far: I transformed B into a list, through which I…
Moritz
  • 495
  • 1
  • 7
  • 17
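One common pattern for this situation (a column argument plus a constant collected from another dataframe) is to close over the constant list rather than pass it as a second column. A sketch under assumed names:

```python
import pandas as pd

def make_matcher(pairs):
    """Build a Series-to-Series function that closes over a constant list of
    string tuples instead of receiving it as a second column argument."""
    def matcher(s: pd.Series) -> pd.Series:
        return s.apply(lambda x: any(x == a or x == b for a, b in pairs))
    return matcher

# Spark wiring (hypothetical): collect B once, close over it, wrap as a pandas UDF.
# pairs = [tuple(r) for r in df_b.collect()]
# match_udf = pandas_udf(make_matcher(pairs), "boolean")
# df_a = df_a.withColumn("matched", match_udf("text"))

m = make_matcher([("foo", "bar")])
print(m(pd.Series(["foo", "baz"])).tolist())
# → [True, False]
```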
0
votes
1 answer

Pandas UDF StructField return

I am trying to return a StructField from a Pandas UDF in PySpark, used with aggregation, with the following function signature: def parcel_to_polygon(geom:pd.Series, entity_ids:pd.Series) -> Tuple[int,str,List[List[str]]]: But it turns out that the…
Tarique
  • 463
  • 3
  • 16
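For a scalar pandas UDF, a StructType result is expressed by returning a `pd.DataFrame` with one column per struct field, not a Python tuple. A sketch with made-up field names (the asker's real schema is not shown):

```python
import pandas as pd

def parcel_summary(geom: pd.Series, entity_ids: pd.Series) -> pd.DataFrame:
    """A StructType result from a scalar pandas UDF is returned as a
    pd.DataFrame whose columns are the struct fields (hypothetical fields)."""
    return pd.DataFrame({
        "n": entity_ids.apply(len),
        "first_id": entity_ids.apply(lambda ids: ids[0] if ids else ""),
    })

# Spark wiring (assumed schema string):
# from pyspark.sql.functions import pandas_udf
# summary_udf = pandas_udf(parcel_summary, "n int, first_id string")
# df = df.withColumn("summary", summary_udf("geom", "entity_ids"))

out = parcel_summary(pd.Series(["g1"]), pd.Series([["a", "b"]]))
print(out.to_dict("records"))
# → [{'n': 2, 'first_id': 'a'}]
```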
0
votes
0 answers

Separating dates and getting all permutations of products in a Pandas UDF

I am trying to get a permutation of all possible couples of dates using a pandas_udf. As I understand it, the dataframe has to be grouped to be sent to a pandas_udf, so I am adding an ID and grouping by that, but I get an error. Here is a small example to…
Matt
  • 85
  • 6
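A sketch of the per-group pair generation, assuming a hypothetical `date` column; in Spark 3.x the grouped-map style the question describes is written with `applyInPandas` rather than a `pandas_udf` decorator.

```python
from itertools import combinations

import pandas as pd

def date_pairs(pdf: pd.DataFrame) -> pd.DataFrame:
    """For one group, emit every unordered pair of distinct dates."""
    pairs = list(combinations(sorted(pdf["date"].unique()), 2))
    return pd.DataFrame(pairs, columns=["date_a", "date_b"])

# Spark wiring:
# df.groupBy("id").applyInPandas(date_pairs, schema="date_a string, date_b string")

pdf = pd.DataFrame({"id": [1, 1, 1],
                    "date": ["2021-01-01", "2021-01-02", "2021-01-03"]})
print(date_pairs(pdf).values.tolist())
# → [['2021-01-01', '2021-01-02'], ['2021-01-01', '2021-01-03'],
#    ['2021-01-02', '2021-01-03']]
```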
0
votes
0 answers

pandas_udf failing for large dataset

I have a problem with a large dataframe (40 billion+ rows) where I am creating a new column by passing an array column to a UDF. A PySpark UDF works for a smaller dataset but not for more than a few thousand records. So I am trying pandas_udf…
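For inputs this large, one relevant variant is the iterator-of-Series pandas UDF, which processes one Arrow batch at a time instead of materializing the whole column. A sketch of the pattern (the per-row logic here is a placeholder, not the asker's computation):

```python
from typing import Iterator

import pandas as pd

def mean_batches(batches: Iterator[pd.Series]) -> Iterator[pd.Series]:
    """Yield results batch by batch so the full column is never held at once."""
    for batch in batches:
        yield batch.apply(lambda arr: sum(arr) / len(arr))

# Spark wiring:
# from pyspark.sql.functions import pandas_udf
# mean_udf = pandas_udf(mean_batches, "double")
# df = df.withColumn("mean_val", mean_udf("array_col"))

out = list(mean_batches(iter([pd.Series([[1.0, 3.0]]), pd.Series([[2.0]])])))
print([s.tolist() for s in out])
# → [[2.0], [2.0]]
```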
0
votes
1 answer

Databricks notebook runs faster when triggered manually compared to when run as a job

I don't know if this question has been covered earlier, but here it goes: I have a notebook that I can run manually using the 'Run' button in the notebook, or as a job. The runtime for running the notebook directly is roughly 2 hours. But when I…
0
votes
1 answer

Dividing a set of columns by its average in Pyspark

I have to divide a set of columns in a pyspark.sql.dataframe by their respective column averages but I am not able to find a correct way to do it. Below is sample data and my present code. Input Data columns = ["Col1", "Col2", "Col3","Name"] data…
Deb
  • 499
  • 2
  • 15
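A sketch of the computation in plain pandas, using the column names from the excerpt; the commented Spark version needs no UDF at all, which is usually the faster route for a simple aggregate-then-divide.

```python
import pandas as pd

def divide_by_mean(pdf: pd.DataFrame, cols) -> pd.DataFrame:
    """Divide each listed column by its own mean."""
    out = pdf.copy()
    for c in cols:
        out[c] = out[c] / out[c].mean()
    return out

# Equivalent PySpark without any UDF:
# from pyspark.sql import functions as F
# cols = ["Col1", "Col2", "Col3"]
# avgs = df.agg(*[F.avg(c).alias(c) for c in cols]).first()
# df = df.select(*[(F.col(c) / avgs[c]).alias(c) for c in cols], "Name")

print(divide_by_mean(pd.DataFrame({"Col1": [1.0, 3.0]}), ["Col1"])["Col1"].tolist())
# → [0.5, 1.5]
```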
0
votes
1 answer

PySpark SparseVector dataframe columns: .dot product or other vector-type column computation using @udf or @pandas_udf

I am trying to compute the .dot product between 2 columns of a given dataframe. SparseVectors already have this ability in Spark, so I am trying to execute this in an easy & scalable way without converting to RDDs or to DenseVectors, but I'm stuck; I have spent the past 3…
n1tk
  • 2,406
  • 2
  • 21
  • 35
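A sketch of the row-wise dot product. To keep the example self-contained, sparse vectors are stood in for by `{index: value}` dicts; with real `pyspark.ml.linalg.SparseVector` objects the per-row body collapses to `float(a.dot(b))`, as shown in the comments.

```python
import pandas as pd

def sparse_dot(u: pd.Series, v: pd.Series) -> pd.Series:
    """Dot product of two sparse-vector columns, each element given as an
    {index: value} dict (a stand-in for pyspark.ml.linalg.SparseVector)."""
    def dot(a, b):
        return sum(val * b.get(idx, 0.0) for idx, val in a.items())
    return pd.Series([dot(a, b) for a, b in zip(u, v)])

# With real SparseVectors the UDF body is just the built-in .dot:
# from pyspark.sql.functions import pandas_udf
# @pandas_udf("double")
# def dot_udf(u: pd.Series, v: pd.Series) -> pd.Series:
#     return pd.Series([float(a.dot(b)) for a, b in zip(u, v)])

print(sparse_dot(pd.Series([{0: 1.0, 2: 3.0}]), pd.Series([{2: 2.0}])).tolist())
# → [6.0]
```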
0
votes
1 answer

PySpark UDF to Pandas UDF for string columns

I have a UDF that is slow for large datasets, and I am trying to improve execution time and scalability by leveraging pandas_udfs. All my searching and the official documentation focus on the scalar and mapping approaches, which I have already used, but I do…
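For string columns specifically, the usual speedup comes from doing the work with pandas' vectorized `.str` methods inside the UDF instead of row-at-a-time Python. A sketch (the cleanup rule here is a made-up example, not the asker's logic):

```python
import pandas as pd

def clean_strings(s: pd.Series) -> pd.Series:
    """Vectorized string cleanup via pandas .str accessors rather than a
    per-row Python UDF."""
    return s.str.strip().str.lower()

# Spark wiring:
# from pyspark.sql.functions import pandas_udf
# clean_udf = pandas_udf(clean_strings, "string")
# df = df.withColumn("clean", clean_udf("raw"))

print(clean_strings(pd.Series(["  Foo ", "BAR"])).tolist())
# → ['foo', 'bar']
```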