Questions tagged [pandas-udf]

41 questions
0
votes
1 answer

Pandas UDF output into an array-type column

My assignment is to store the following into an array-type column: def sample_udf(df:SparkDataFrame): device_issues = [] if (df['altitude'] == 0): return "alt" elif (df['latitude'] <= -90 or df['latitude'] >= 90): …
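A minimal sketch of the pandas-level logic for this kind of check, assuming hypothetical `altitude`/`latitude` columns from the excerpt; the Spark wiring is shown only in comments because it depends on the asker's actual schema.

```python
import pandas as pd

def find_device_issues(altitude: pd.Series, latitude: pd.Series) -> pd.Series:
    """Return, for each row, a (possibly empty) list of issue labels."""
    def issues(alt, lat):
        out = []
        if alt == 0:
            out.append("alt")
        if lat <= -90 or lat >= 90:
            out.append("lat")
        return out
    return pd.Series([issues(a, l) for a, l in zip(altitude, latitude)])

# With Spark available, the same function becomes a Series-to-Series pandas UDF
# whose return type is an array column:
# from pyspark.sql.functions import pandas_udf
# from pyspark.sql.types import ArrayType, StringType
# issues_udf = pandas_udf(find_device_issues, ArrayType(StringType()))
# df = df.withColumn("device_issues", issues_udf("altitude", "latitude"))

print(find_device_issues(pd.Series([0, 10]), pd.Series([45.0, 95.0])).tolist())
# → [['alt'], ['lat']]
```

Note the key difference from the excerpt's code: the function must return one list per row, not a single string for the whole frame.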
0
votes
1 answer

Correct type hints for PandasUDFType.GROUPED_AGG that returns an array of doubles

I am using a Grouped Agg Pandas UDF to average the values of an array column element-wise (aka mean pooling). I keep getting the following warning and have not been able to find the correct type hints to provide for PandasUDFType.GROUPED_AGG with…
David
  • 2,200
  • 1
  • 12
  • 22
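A sketch of the mean-pooling aggregation itself, in plain pandas; the commented Spark wiring uses the type-hinted pandas UDF style that replaces the deprecated `PandasUDFType.GROUPED_AGG` constant (the warning in the question). Column names are assumptions.

```python
import numpy as np
import pandas as pd

def mean_pool(v: pd.Series) -> list:
    """Element-wise mean of an array column within one group (mean pooling)."""
    return np.mean(np.stack(v.to_numpy()), axis=0).tolist()

# With Spark, a type-hinted grouped-agg pandas UDF avoids the deprecation warning:
# from pyspark.sql.functions import pandas_udf
# @pandas_udf("array<double>")
# def mean_pool_udf(v: pd.Series) -> list:
#     return np.mean(np.stack(v.to_numpy()), axis=0).tolist()
# df.groupBy("id").agg(mean_pool_udf("embedding"))

print(mean_pool(pd.Series([[1.0, 2.0], [3.0, 4.0]])))
# → [2.0, 3.0]
```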
0
votes
0 answers

Pyspark PandasUDF: One pd.Series element per Dataframe row

I work with a couple of pyspark UDFs which slow down my code, hence I want to transform some of them to PandasUDFs. One UDF takes a list of strings as an argument (which comes from another column of the Spark DF). If I declare the python function as…
Moritz
  • 495
  • 1
  • 7
  • 17
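The point of confusion in this question is worth illustrating: in a Series-to-Series pandas UDF, an array column arrives as a `pd.Series` whose *elements* are the per-row lists, so the function operates on a Series of lists, not on one list. A minimal sketch with assumed column names:

```python
import pandas as pd

def count_tokens(tokens: pd.Series) -> pd.Series:
    """Each element of `tokens` is the entire list for one row,
    not an individual string."""
    return tokens.apply(len)

# Spark wiring (hypothetical names):
# from pyspark.sql.functions import pandas_udf
# count_udf = pandas_udf(count_tokens, "int")
# df = df.withColumn("n_tokens", count_udf("string_list_col"))

print(count_tokens(pd.Series([["a", "b"], ["c"]])).tolist())
# → [2, 1]
```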
0
votes
0 answers

PySpark PandasUDF with 2 different argument data types

I have a dataframe A with a column containing a string. I want to compare this string with another dataframe B, which only has one column that contains a list of tuples of strings. What I did so far: I transformed B into a list, through which I…
Moritz
  • 495
  • 1
  • 7
  • 17
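One common pattern for this situation (a column argument plus a constant collected from another dataframe) is to close over the constant list rather than pass it as a second column. A sketch under assumed names:

```python
import pandas as pd

def make_matcher(pairs):
    """Build a Series-to-Series function that closes over a constant list of
    string tuples instead of receiving it as a second column argument."""
    def matcher(s: pd.Series) -> pd.Series:
        return s.apply(lambda x: any(x == a or x == b for a, b in pairs))
    return matcher

# Spark wiring (hypothetical): collect B once, close over it, wrap as a pandas UDF.
# pairs = [tuple(r) for r in df_b.collect()]
# match_udf = pandas_udf(make_matcher(pairs), "boolean")
# df_a = df_a.withColumn("matched", match_udf("text"))

m = make_matcher([("foo", "bar")])
print(m(pd.Series(["foo", "baz"])).tolist())
# → [True, False]
```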
0
votes
1 answer

Pandas UDF StructField return

I am trying to return a StructField from a Pandas UDF in PySpark, used with aggregation, with the following function signature: def parcel_to_polygon(geom:pd.Series, entity_ids:pd.Series) -> Tuple[int,str,List[List[str]]]: But it turns out that the…
Tarique
  • 463
  • 3
  • 16
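For a scalar pandas UDF, a StructType result is expressed by returning a `pd.DataFrame` with one column per struct field, not a Python tuple. A sketch with made-up field names (the asker's real schema is not shown):

```python
import pandas as pd

def parcel_summary(geom: pd.Series, entity_ids: pd.Series) -> pd.DataFrame:
    """A StructType result from a scalar pandas UDF is returned as a
    pd.DataFrame whose columns are the struct fields (hypothetical fields)."""
    return pd.DataFrame({
        "n": entity_ids.apply(len),
        "first_id": entity_ids.apply(lambda ids: ids[0] if ids else ""),
    })

# Spark wiring (assumed schema string):
# from pyspark.sql.functions import pandas_udf
# summary_udf = pandas_udf(parcel_summary, "n int, first_id string")
# df = df.withColumn("summary", summary_udf("geom", "entity_ids"))

out = parcel_summary(pd.Series(["g1"]), pd.Series([["a", "b"]]))
print(out.to_dict("records"))
# → [{'n': 2, 'first_id': 'a'}]
```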
0
votes
0 answers

Separating dates and getting all permutations of products in a Pandas UDF

I am trying to get a permutation of all possible couples of dates using a pandas_udf. As I understand it, the dataframe has to be grouped to be sent to a pandas_udf, so I am adding an ID and grouping by that, but I get an error. Here is a small example to…
Matt
  • 85
  • 6
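A sketch of the per-group pair generation, assuming a hypothetical `date` column; in Spark 3.x the grouped-map style the question describes is written with `applyInPandas` rather than a `pandas_udf` decorator.

```python
from itertools import combinations

import pandas as pd

def date_pairs(pdf: pd.DataFrame) -> pd.DataFrame:
    """For one group, emit every unordered pair of distinct dates."""
    pairs = list(combinations(sorted(pdf["date"].unique()), 2))
    return pd.DataFrame(pairs, columns=["date_a", "date_b"])

# Spark wiring:
# df.groupBy("id").applyInPandas(date_pairs, schema="date_a string, date_b string")

pdf = pd.DataFrame({"id": [1, 1, 1],
                    "date": ["2021-01-01", "2021-01-02", "2021-01-03"]})
print(date_pairs(pdf).values.tolist())
# → [['2021-01-01', '2021-01-02'], ['2021-01-01', '2021-01-03'],
#    ['2021-01-02', '2021-01-03']]
```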
0
votes
0 answers

pandas_udf failing for large dataset

I have a problem with a large dataframe (40 billion+ rows) where I am creating a new column by passing an array column to a UDF. A PySpark UDF works for a smaller dataset but not for more than a few thousand records. So I am trying pandas_udf…
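For inputs this large, one relevant variant is the iterator-of-Series pandas UDF, which processes one Arrow batch at a time instead of materializing the whole column. A sketch of the pattern (the per-row logic here is a placeholder, not the asker's computation):

```python
from typing import Iterator

import pandas as pd

def mean_batches(batches: Iterator[pd.Series]) -> Iterator[pd.Series]:
    """Yield results batch by batch so the full column is never held at once."""
    for batch in batches:
        yield batch.apply(lambda arr: sum(arr) / len(arr))

# Spark wiring:
# from pyspark.sql.functions import pandas_udf
# mean_udf = pandas_udf(mean_batches, "double")
# df = df.withColumn("mean_val", mean_udf("array_col"))

out = list(mean_batches(iter([pd.Series([[1.0, 3.0]]), pd.Series([[2.0]])])))
print([s.tolist() for s in out])
# → [[2.0], [2.0]]
```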
0
votes
1 answer

Databricks notebook runs faster when triggered manually compared to when run as a job

I don't know if this question has been covered earlier, but here it goes: I have a notebook that I can run manually using the 'Run' button in the notebook, or as a job. The runtime for running the notebook directly is roughly 2 hours. But when I…
0
votes
1 answer

Dividing a set of columns by its average in Pyspark

I have to divide a set of columns in a pyspark.sql.dataframe by their respective column averages but I am not able to find a correct way to do it. Below is sample data and my present code. Input Data columns = ["Col1", "Col2", "Col3","Name"] data…
Deb
  • 499
  • 2
  • 15
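A sketch of the computation in plain pandas, using the column names from the excerpt; the commented Spark version needs no UDF at all, which is usually the faster route for a simple aggregate-then-divide.

```python
import pandas as pd

def divide_by_mean(pdf: pd.DataFrame, cols) -> pd.DataFrame:
    """Divide each listed column by its own mean."""
    out = pdf.copy()
    for c in cols:
        out[c] = out[c] / out[c].mean()
    return out

# Equivalent PySpark without any UDF:
# from pyspark.sql import functions as F
# cols = ["Col1", "Col2", "Col3"]
# avgs = df.agg(*[F.avg(c).alias(c) for c in cols]).first()
# df = df.select(*[(F.col(c) / avgs[c]).alias(c) for c in cols], "Name")

print(divide_by_mean(pd.DataFrame({"Col1": [1.0, 3.0]}), ["Col1"])["Col1"].tolist())
# → [0.5, 1.5]
```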
0
votes
1 answer

PySpark SparseVector dataframe columns: .dot product or other vector-type column computation using @udf or @pandas_udf

I am trying to compute the .dot product between 2 columns of a given dataframe. SparseVectors already have this ability in Spark, so I am trying to execute this in an easy & scalable way without converting to RDDs or to DenseVectors, but I'm stuck; I have spent the past 3…
n1tk
  • 2,406
  • 2
  • 21
  • 35
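A sketch of the row-wise dot product. To keep the example self-contained, sparse vectors are stood in for by `{index: value}` dicts; with real `pyspark.ml.linalg.SparseVector` objects the per-row body collapses to `float(a.dot(b))`, as shown in the comments.

```python
import pandas as pd

def sparse_dot(u: pd.Series, v: pd.Series) -> pd.Series:
    """Dot product of two sparse-vector columns, each element given as an
    {index: value} dict (a stand-in for pyspark.ml.linalg.SparseVector)."""
    def dot(a, b):
        return sum(val * b.get(idx, 0.0) for idx, val in a.items())
    return pd.Series([dot(a, b) for a, b in zip(u, v)])

# With real SparseVectors the UDF body is just the built-in .dot:
# from pyspark.sql.functions import pandas_udf
# @pandas_udf("double")
# def dot_udf(u: pd.Series, v: pd.Series) -> pd.Series:
#     return pd.Series([float(a.dot(b)) for a, b in zip(u, v)])

print(sparse_dot(pd.Series([{0: 1.0, 2: 3.0}]), pd.Series([{2: 2.0}])).tolist())
# → [6.0]
```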
0
votes
1 answer

PySpark UDF to Pandas UDF for string columns

I have a UDF that is slow for large datasets, and I am trying to improve execution time and scalability by leveraging pandas_udfs. All my searching and the official documentation focus on the scalar and mapping approaches, which I have already used, but I do…
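For string columns specifically, the usual speedup comes from doing the work with pandas' vectorized `.str` methods inside the UDF instead of row-at-a-time Python. A sketch (the cleanup rule here is a made-up example, not the asker's logic):

```python
import pandas as pd

def clean_strings(s: pd.Series) -> pd.Series:
    """Vectorized string cleanup via pandas .str accessors rather than a
    per-row Python UDF."""
    return s.str.strip().str.lower()

# Spark wiring:
# from pyspark.sql.functions import pandas_udf
# clean_udf = pandas_udf(clean_strings, "string")
# df = df.withColumn("clean", clean_udf("raw"))

print(clean_strings(pd.Series(["  Foo ", "BAR"])).tolist())
# → ['foo', 'bar']
```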