Questions tagged [pandas-udf]

41 questions
1 vote · 0 answers

pyspark calculate custom metric on grouped data

I have a large DataFrame (40 billion+ rows) that can be grouped by key. I want to apply a custom calculation to a few fields of each group and derive a single value for that group. E.g., the dataframe below has a group_key and I want to derive a single value…
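A per-group reduction like this is usually expressed with `groupBy().applyInPandas` rather than a scalar `pandas_udf`. A minimal sketch of the per-group function, runnable on plain pandas (the column names `group_key` and `value`, and the range-style metric, are assumptions, not the asker's actual logic):

```python
import pandas as pd

# Per-group logic written against an ordinary pandas DataFrame; with
# applyInPandas, Spark hands each group to this function as one DataFrame.
def score_group(pdf: pd.DataFrame) -> pd.DataFrame:
    # hypothetical custom metric: range of 'value' within the group
    metric = pdf["value"].max() - pdf["value"].min()
    return pd.DataFrame({"group_key": [pdf["group_key"].iloc[0]],
                         "metric": [metric]})

# On Spark (sketch, not run here):
# df.groupBy("group_key").applyInPandas(
#     score_group, schema="group_key string, metric double")
```

Each group is collected onto one executor before the function runs, so this works at 40B rows only if individual groups fit in memory.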
1 vote · 0 answers

Pyspark error - Invalid argument, not a string or column while implementing inside pandas_udf

This code works fine outside the pandas_udf, but I get this error when implementing the same logic inside the UDF. To avoid conflicts between pyspark and Python function names, I have explicitly imported specific functions from pyspark. Using…
user22 · 112
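The "not a string or column" error typically appears when a pyspark function such as `sum` or `max` shadows the Python builtin of the same name inside the UDF body. A miniature of the shadowing mechanism, in plain Python (the `Column<…>` string is a stand-in, not pyspark's real representation):

```python
# Keep a handle on the builtin before it gets shadowed.
builtin_max = max

# Stand-in for `from pyspark.sql.functions import max`, which replaces
# the builtin with a function expecting a column name or Column object.
def max(col):
    return f"Column<max({col})>"

# Inside a pandas_udf body, calling max([...]) now hits the Spark-style
# version and fails with "Invalid argument, not a string or column".
spark_style = max("x")

# The usual fix: `import pyspark.sql.functions as F` and call F.max(...),
# leaving the builtin max untouched.
plain = builtin_max([1, 2, 3])
```

Aliasing the module (`F.max`, `F.sum`) avoids the clash entirely, which is why it is the prevailing pyspark convention.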
1 vote · 1 answer

Iterating through a DataFrame using Pandas UDF and outputting a dataframe

I have a piece of code that I want to translate into a Pandas UDF in PySpark but I'm having a bit of trouble understanding whether or not you can use conditional statements. def is_pass_in(df): x = list(df["string"]) result = [] for i in…
AndronikMk · 151
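Conditional statements are fine inside a pandas UDF, because the function body receives an ordinary pandas Series or DataFrame per batch. A runnable sketch in the spirit of the asker's `is_pass_in` (the `"pass" in v` condition and the 0/1 output are assumptions standing in for the truncated logic):

```python
import pandas as pd

# Plain Python control flow works inside a pandas_udf body: per batch,
# `s` is just a pandas Series of strings.
def is_pass_in(s: pd.Series) -> pd.Series:
    result = []
    for v in s:
        # hypothetical condition: does the string contain "pass"?
        result.append(1 if "pass" in v else 0)
    return pd.Series(result)

# On Spark (sketch, not run here):
# flag_udf = pandas_udf(is_pass_in, returnType="int")
# df.withColumn("flag", flag_udf(df["string"]))
```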
1 vote · 1 answer

PySpark: Pandas UDF for scipy statistical transformations

I'm trying to create a standardized (z-score) column from a column x on a Spark dataframe, but I am missing something because none of it is working. Here's my example: import pandas as pd from pyspark.sql.functions import pandas_udf,…
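A common pitfall here: a scalar `pandas_udf` sees one batch at a time, so computing mean and std inside the UDF standardizes per batch, not over the whole column. A sketch of the batch-safe shape, where the global statistics are computed on the Spark side and passed in (column name `x` is from the excerpt; the wiring in comments is an assumption):

```python
import pandas as pd

# Standardization for one batch, with the column-wide mean/std supplied
# from outside -- computing them inside a scalar pandas_udf would give
# per-batch z-scores instead.
def zscore(x: pd.Series, mean: float, std: float) -> pd.Series:
    return (x - mean) / std

# On Spark (sketch, not run here) -- no UDF needed at all:
# stats = df.select(F.mean("x"), F.stddev("x")).first()
# df.withColumn("z", (F.col("x") - stats[0]) / stats[1])
```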
0 votes · 0 answers

How to reduce the execution time of multiple models' inference on a large dataset in pyspark?

I have a PySpark DataFrame with a huge number of rows (80–100 million). I am running model inference on it to obtain the model score (probability) for each row, like in the code below: import tensorflow as tf from tensorflow import keras from…
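For large-scale inference, the usual pattern is an iterator-style UDF (`mapInPandas` or an `Iterator[pd.Series]` pandas_udf) so the model loads once per executor instead of once per batch. A runnable sketch with a dummy model standing in for the Keras one (the column name `feature` and the `load_model` helper are assumptions):

```python
import pandas as pd
from typing import Iterator

def load_model():
    # stand-in for an expensive keras.models.load_model(path) call;
    # in the real job this runs once per executor process, not per row
    class Model:
        def predict(self, features: pd.Series) -> pd.Series:
            return features * 2.0  # dummy scoring
    return Model()

# mapInPandas-style iterator UDF: load the model once, then score each
# arriving batch of rows.
def predict_batches(batches: Iterator[pd.DataFrame]) -> Iterator[pd.DataFrame]:
    model = load_model()
    for pdf in batches:
        pdf = pdf.copy()
        pdf["score"] = model.predict(pdf["feature"])
        yield pdf

# On Spark (sketch, not run here):
# df.mapInPandas(predict_batches, schema="feature double, score double")
```

Amortizing the model load this way is often the single biggest win when scoring 80–100M rows.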
0 votes · 0 answers

Pandas UDF that take a list of integers from a window and returns an adjusted list of integers

Below I have a PySpark dataset (test), a function (func), and a window specification. I wish to make func a pandas_udf and apply it over the window I defined below. func takes a list of values; I want that list to be the four values of the window…
Henri · 1,077
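Scalar pandas_udfs cannot consume a window directly, but one workable route is to materialize the window into an array column with `collect_list(...).over(window)`, then transform that array element-wise. A sketch (the mean-capping adjustment inside `func` is an assumption, since the asker's function is truncated):

```python
import pandas as pd

# The asker's func: takes the list of window values and returns an
# adjusted list (hypothetical rule: cap each value at the window mean).
def func(values):
    m = sum(values) / len(values)
    return [min(v, m) for v in values]

# pandas_udf body: each cell of the Series is one window's list.
def adjust(col: pd.Series) -> pd.Series:
    return col.apply(func)

# On Spark (sketch, not run here):
# w = Window.partitionBy("id").orderBy("t").rowsBetween(-3, 0)
# df = df.withColumn("vals", F.collect_list("x").over(w))
# df.withColumn("adj", pandas_udf(adjust, "array<double>")(F.col("vals")))
```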
0 votes · 1 answer

Pyspark Error due to data type in pandas_udf

I'm trying to write a filter_words function as a pandas_udf. Here are the functions I am using: @udf_annotator(returnType=ArrayType(StructType([StructField("position", IntegerType(), True), …
Rory · 471
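With a returnType of `ArrayType(StructType([...]))`, each returned cell must be a list whose elements match the struct fields exactly (dicts or tuples); a mismatch between the Python values and the declared field types is what raises this kind of error. A sketch of the shape the declared type expects (the length-based filter and the `word` field are assumptions from the truncated excerpt):

```python
# Declared returnType (sketch):
#   ArrayType(StructType([StructField("position", IntegerType(), True),
#                         StructField("word", StringType(), True)]))
# so each cell must be a list of {"position": int, "word": str} records.
def filter_words(words):
    # hypothetical filter: keep words longer than 3 characters,
    # remembering their original positions
    return [{"position": i, "word": w}
            for i, w in enumerate(words) if len(w) > 3]
```

Returning, say, a plain string or a float where an `IntegerType` field is declared produces exactly the data-type error in the title.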
0 votes · 1 answer

(Spark 3.3.2 OpenJDK19 PySpark Pandas_UDF Python3.10 Ubuntu22.04 Dockerized) Test Script producing TypeError: 'JavaPackage' object is not callable

I've created a Docker container that installs Ubuntu 22.04, Python 3.10, Spark 3.3.2, Hadoop 3, Scala 13, and OpenJDK 19. I'm currently using it as a test environment before deploying code in AWS. This container was working swimmingly for the past 2-3…
Bamu · 11
0 votes · 1 answer

pyspark pandas udf not able to return any object

I am moving my code from pandas to PySpark for an NLP task. I have figured out how to apply tokenization (using Keras' built-in library) via a pandas UDF. However, I also want to return the fitted tokenizer (for later use on test data). As with pandas…
Abdul Wahab · 137
0 votes · 1 answer

pyspark with pandas udf giving java.io.EOFException while writing to CSV

PySpark code using pandas UDF functions works fine with df.limit(20).collect() and writing 20 records to CSV. But when I try to write 100 records to CSV, it fails with a java.io.EOFException error. The same code works fine with regular UDF functions (not…
Mohan Rayapuvari · 289
0 votes · 0 answers

Trying to parallelize hyperparameter tuning using pandas udf, but no success

I've been trying to parallelize hyperparameter tuning for my Prophet model across around 100 combinations of hyperparameters saved in the dataframe params_df. I want to parallelize the tuning operation and have done the following: schema…
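The standard recipe is to make each hyperparameter combination its own group and let `applyInPandas` fan the fits out across executors. A runnable sketch with a dummy objective in place of the Prophet fit (the column names, `param_id` grouping key, and `fit_and_score` helper are all assumptions):

```python
import pandas as pd

def fit_and_score(row) -> float:
    # stand-in for: fit Prophet with these hyperparameters, then
    # cross-validate and return the error metric
    return row["changepoint_prior"] * row["seasonality_prior"]  # dummy

# One group = one row of params_df = one hyperparameter combination.
def tune_one(pdf: pd.DataFrame) -> pd.DataFrame:
    out = pdf.iloc[[0]].copy()
    out["score"] = fit_and_score(pdf.iloc[0])
    return out

# On Spark (sketch, not run here):
# spark.createDataFrame(params_df) \
#      .groupBy("param_id") \
#      .applyInPandas(tune_one, schema="... , score double")
```

If tuning still runs serially, the usual culprits are too few partitions (fewer than the ~100 groups) or the time series itself being captured in the closure instead of broadcast.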
0 votes · 0 answers

RuntimeError: Result vector from pandas_udf was not the required length: expected 100, got 229

I am trying to run simple pandas_udf code to build my understanding; in reality I need to implement complex logic in it. But even for this simple UDF, where I am converting to upper case and appending "hello" at the end, I am getting the error. Just to mention…
pbh · 186
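This RuntimeError means the UDF returned a different number of rows than it received: a scalar `pandas_udf` must produce exactly one output value per input value. A sketch of the fix versus the likely bug, matching the "upper case plus hello" description (the exact concatenation is an assumption):

```python
import pandas as pd

# Correct: transform each element, preserving the batch length.
def upper_hello(s: pd.Series) -> pd.Series:
    return s.str.upper() + " hello"      # same length as the input

# Likely bug: appending "hello" as an extra element makes the result one
# row longer than the batch -> "Result vector ... not the required length".
def broken(s: pd.Series) -> pd.Series:
    out = list(s.str.upper())
    out.append("hello")
    return pd.Series(out)
```

The "expected 100, got 229" numbers reflect Spark's batch size versus whatever length the function actually returned.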
0 votes · 1 answer

Pyspark Pandas UDF Series operation on Array column

I have a dataframe like this: data_df = spark.createDataFrame([([1,2,3],'val1'),([4,5,6],'val2')],['col1','col2'])
col1      col2
[1,2,3]   val1
[4,5,6]   val2
I want to get the minimum value from the col1 arrays. The expected result looks…
lserlohn · 5,878
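In a pandas_udf, each cell of an array column arrives as a Python list inside the Series, so a plain `.apply(min)` does the job. A minimal sketch (noting that Spark's built-in `array_min` makes the UDF unnecessary here):

```python
import pandas as pd

# Each element of `col` is one row's array (a Python list), so the
# per-row minimum is just .apply(min).
def array_min(col: pd.Series) -> pd.Series:
    return col.apply(min)

# On Spark (sketch, not run here) -- the built-in avoids a UDF entirely:
# data_df.withColumn("min_val", F.array_min("col1"))
```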
0 votes · 0 answers

How to convert Python nested loops into a pandas UDF

I'm quite new to pyspark, and not a skilled Python engineer, trying to understand how a pandas UDF applies to my case. I have developed an ArimaX model which, for each "id", performs 4 outlook forecasts (M1 till M4 ahead), while for each outlook 12 models are…
KubaS · 1
0 votes · 0 answers

pandasUDF: how to process a complex column (array of object) and return a string column

I've got this dataframe:
root
 |-- trip_id: string (nullable = true)
 |-- vehicle_id: string (nullable = true)
 |-- points: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- time: long (nullable = true)
 |    |    …
4mla1fn · 169
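Inside a pandas_udf, an array-of-struct column arrives as a Series whose cells are lists of dicts (one dict per struct), so reducing it to a string is ordinary Python. A sketch joining the `time` fields, the only field visible in the truncated schema (the comma-joined format is an assumption):

```python
import pandas as pd

# Each cell of `points` is a list of dicts like {"time": ..., ...};
# collapse it to one string per row.
def points_to_str(points: pd.Series) -> pd.Series:
    return points.apply(lambda ps: ",".join(str(p["time"]) for p in ps))

# On Spark (sketch, not run here):
# summarize = pandas_udf(points_to_str, "string")
# df.withColumn("summary", summarize("points"))
```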