Questions tagged [pandas-udf]

41 questions
1 vote · 0 answers

pyspark calculate custom metric on grouped data

I have a large DataFrame (40 billion+ rows) that can be grouped by key. I want to apply a custom calculation to a few fields of each group and derive a single value for that group. E.g., the dataframe below has a group_key and I want to derive a single value…
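A per-group reduction like this is usually expressed with `groupBy().applyInPandas` rather than a scalar `pandas_udf`. A minimal sketch of the per-group function, runnable on plain pandas (the column names `group_key` and `value`, and the range-style metric, are assumptions, not the asker's actual logic):

```python
import pandas as pd

# Per-group logic written against an ordinary pandas DataFrame; with
# applyInPandas, Spark hands each group to this function as one DataFrame.
def score_group(pdf: pd.DataFrame) -> pd.DataFrame:
    # hypothetical custom metric: range of 'value' within the group
    metric = pdf["value"].max() - pdf["value"].min()
    return pd.DataFrame({"group_key": [pdf["group_key"].iloc[0]],
                         "metric": [metric]})

# On Spark (sketch, not run here):
# df.groupBy("group_key").applyInPandas(
#     score_group, schema="group_key string, metric double")
```

Each group is collected onto one executor before the function runs, so this works at 40B rows only if individual groups fit in memory.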
1 vote · 0 answers

Pyspark error - Invalid argument, not a string or column while implementing inside pandas_udf

This code works fine outside the pandas_udf, but I get this error when implementing the same logic inside the UDF. To avoid conflicts between pyspark and Python function names, I have explicitly imported specific functions from pyspark. Using…
user22 · 112
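The "not a string or column" error typically appears when a pyspark function such as `sum` or `max` shadows the Python builtin of the same name inside the UDF body. A miniature of the shadowing mechanism, in plain Python (the `Column<…>` string is a stand-in, not pyspark's real representation):

```python
# Keep a handle on the builtin before it gets shadowed.
builtin_max = max

# Stand-in for `from pyspark.sql.functions import max`, which replaces
# the builtin with a function expecting a column name or Column object.
def max(col):
    return f"Column<max({col})>"

# Inside a pandas_udf body, calling max([...]) now hits the Spark-style
# version and fails with "Invalid argument, not a string or column".
spark_style = max("x")

# The usual fix: `import pyspark.sql.functions as F` and call F.max(...),
# leaving the builtin max untouched.
plain = builtin_max([1, 2, 3])
```

Aliasing the module (`F.max`, `F.sum`) avoids the clash entirely, which is why it is the prevailing pyspark convention.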
1 vote · 1 answer

Iterating through a DataFrame using Pandas UDF and outputting a dataframe

I have a piece of code that I want to translate into a Pandas UDF in PySpark but I'm having a bit of trouble understanding whether or not you can use conditional statements. def is_pass_in(df): x = list(df["string"]) result = [] for i in…
AndronikMk · 151
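Conditional statements are fine inside a pandas UDF, because the function body receives an ordinary pandas Series or DataFrame per batch. A runnable sketch in the spirit of the asker's `is_pass_in` (the `"pass" in v` condition and the 0/1 output are assumptions standing in for the truncated logic):

```python
import pandas as pd

# Plain Python control flow works inside a pandas_udf body: per batch,
# `s` is just a pandas Series of strings.
def is_pass_in(s: pd.Series) -> pd.Series:
    result = []
    for v in s:
        # hypothetical condition: does the string contain "pass"?
        result.append(1 if "pass" in v else 0)
    return pd.Series(result)

# On Spark (sketch, not run here):
# flag_udf = pandas_udf(is_pass_in, returnType="int")
# df.withColumn("flag", flag_udf(df["string"]))
```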
1 vote · 1 answer

PySpark: Pandas UDF for scipy statistical transformations

I'm trying to create a standardized (z-score) column from a column x on a Spark dataframe, but I am missing something because none of it is working. Here's my example: import pandas as pd from pyspark.sql.functions import pandas_udf,…
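A common pitfall here: a scalar `pandas_udf` sees one batch at a time, so computing mean and std inside the UDF standardizes per batch, not over the whole column. A sketch of the batch-safe shape, where the global statistics are computed on the Spark side and passed in (column name `x` is from the excerpt; the wiring in comments is an assumption):

```python
import pandas as pd

# Standardization for one batch, with the column-wide mean/std supplied
# from outside -- computing them inside a scalar pandas_udf would give
# per-batch z-scores instead.
def zscore(x: pd.Series, mean: float, std: float) -> pd.Series:
    return (x - mean) / std

# On Spark (sketch, not run here) -- no UDF needed at all:
# stats = df.select(F.mean("x"), F.stddev("x")).first()
# df.withColumn("z", (F.col("x") - stats[0]) / stats[1])
```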
0 votes · 0 answers

How to reduce the execution time of multiple models' inference on a large dataset in pyspark?

I have a PySpark DataFrame with a huge number of rows (80–100 million). I am running model inference on it to obtain the model score (probability) for each row, like in the code below: import tensorflow as tf from tensorflow import keras from…
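For large-scale inference, the usual pattern is an iterator-style UDF (`mapInPandas` or an `Iterator[pd.Series]` pandas_udf) so the model loads once per executor instead of once per batch. A runnable sketch with a dummy model standing in for the Keras one (the column name `feature` and the `load_model` helper are assumptions):

```python
import pandas as pd
from typing import Iterator

def load_model():
    # stand-in for an expensive keras.models.load_model(path) call;
    # in the real job this runs once per executor process, not per row
    class Model:
        def predict(self, features: pd.Series) -> pd.Series:
            return features * 2.0  # dummy scoring
    return Model()

# mapInPandas-style iterator UDF: load the model once, then score each
# arriving batch of rows.
def predict_batches(batches: Iterator[pd.DataFrame]) -> Iterator[pd.DataFrame]:
    model = load_model()
    for pdf in batches:
        pdf = pdf.copy()
        pdf["score"] = model.predict(pdf["feature"])
        yield pdf

# On Spark (sketch, not run here):
# df.mapInPandas(predict_batches, schema="feature double, score double")
```

Amortizing the model load this way is often the single biggest win when scoring 80–100M rows.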
0 votes · 0 answers

Pandas UDF that take a list of integers from a window and returns an adjusted list of integers

Below I have a PySpark dataset (test), a function (func), and a window specification. I wish to make func a pandas_udf and apply it over the window I defined below. func takes a list of values; I want that list to be the four values of the window…
Henri · 1,077
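Scalar pandas_udfs cannot consume a window directly, but one workable route is to materialize the window into an array column with `collect_list(...).over(window)`, then transform that array element-wise. A sketch (the mean-capping adjustment inside `func` is an assumption, since the asker's function is truncated):

```python
import pandas as pd

# The asker's func: takes the list of window values and returns an
# adjusted list (hypothetical rule: cap each value at the window mean).
def func(values):
    m = sum(values) / len(values)
    return [min(v, m) for v in values]

# pandas_udf body: each cell of the Series is one window's list.
def adjust(col: pd.Series) -> pd.Series:
    return col.apply(func)

# On Spark (sketch, not run here):
# w = Window.partitionBy("id").orderBy("t").rowsBetween(-3, 0)
# df = df.withColumn("vals", F.collect_list("x").over(w))
# df.withColumn("adj", pandas_udf(adjust, "array<double>")(F.col("vals")))
```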
0 votes · 1 answer

Pyspark Error due to data type in pandas_udf

I'm trying to write a filter_words function as a pandas_udf. Here are the functions I am using: @udf_annotator(returnType=ArrayType(StructType([StructField("position", IntegerType(), True), …
Rory · 471
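With a returnType of `ArrayType(StructType([...]))`, each returned cell must be a list whose elements match the struct fields exactly (dicts or tuples); a mismatch between the Python values and the declared field types is what raises this kind of error. A sketch of the shape the declared type expects (the length-based filter and the `word` field are assumptions from the truncated excerpt):

```python
# Declared returnType (sketch):
#   ArrayType(StructType([StructField("position", IntegerType(), True),
#                         StructField("word", StringType(), True)]))
# so each cell must be a list of {"position": int, "word": str} records.
def filter_words(words):
    # hypothetical filter: keep words longer than 3 characters,
    # remembering their original positions
    return [{"position": i, "word": w}
            for i, w in enumerate(words) if len(w) > 3]
```

Returning, say, a plain string or a float where an `IntegerType` field is declared produces exactly the data-type error in the title.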
0 votes · 1 answer

(Spark 3.3.2 OpenJDK19 PySpark Pandas_UDF Python3.10 Ubuntu22.04 Dockerized) Test Script producing TypeError: 'JavaPackage' object is not callable

I've created a Docker container that installs Ubuntu 22.04, Python 3.10, Spark 3.3.2, Hadoop 3, Scala 13, and OpenJDK 19. I'm currently using it as a test environment before deploying code in AWS. This container was working swimmingly for the past 2-3…
Bamu · 11
0 votes · 1 answer

pyspark pandas udf not able to return any object

I am moving my code from pandas to PySpark for an NLP task. I have figured out how to apply tokenization (using Keras' built-in library) via a pandas UDF. However, I also want to return the fitted tokenizer (for later use on test data). As with pandas…
Abdul Wahab · 137
0 votes · 1 answer

pyspark with pandas udf giving java.io.EOFException while writing to CSV

PySpark code using pandas UDF functions works fine with df.limit(20).collect() and writing 20 records to CSV. But when I try to write 100 records to CSV, it fails with a java.io.EOFException error. The same code works fine with regular UDF functions (not…
Mohan Rayapuvari · 289
0 votes · 0 answers

Trying to parallelize hyperparameter tuning using pandas udf, but no success

I've been trying to parallelize hyperparameter tuning for my Prophet model across around 100 combinations of hyperparameters saved in the dataframe params_df. I want to parallelize the tuning operation and have done the following: schema…
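The standard recipe is to make each hyperparameter combination its own group and let `applyInPandas` fan the fits out across executors. A runnable sketch with a dummy objective in place of the Prophet fit (the column names, `param_id` grouping key, and `fit_and_score` helper are all assumptions):

```python
import pandas as pd

def fit_and_score(row) -> float:
    # stand-in for: fit Prophet with these hyperparameters, then
    # cross-validate and return the error metric
    return row["changepoint_prior"] * row["seasonality_prior"]  # dummy

# One group = one row of params_df = one hyperparameter combination.
def tune_one(pdf: pd.DataFrame) -> pd.DataFrame:
    out = pdf.iloc[[0]].copy()
    out["score"] = fit_and_score(pdf.iloc[0])
    return out

# On Spark (sketch, not run here):
# spark.createDataFrame(params_df) \
#      .groupBy("param_id") \
#      .applyInPandas(tune_one, schema="... , score double")
```

If tuning still runs serially, the usual culprits are too few partitions (fewer than the ~100 groups) or the time series itself being captured in the closure instead of broadcast.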
0 votes · 0 answers

RuntimeError: Result vector from pandas_udf was not the required length: expected 100, got 229

I am trying to run simple pandas_udf code to build my understanding; in reality I need to implement complex logic in it. But even for this simple UDF, where I am converting to upper case and appending "hello" at the end, I am getting the error. Just to mention…
pbh · 186
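This RuntimeError means the UDF returned a different number of rows than it received: a scalar `pandas_udf` must produce exactly one output value per input value. A sketch of the fix versus the likely bug, matching the "upper case plus hello" description (the exact concatenation is an assumption):

```python
import pandas as pd

# Correct: transform each element, preserving the batch length.
def upper_hello(s: pd.Series) -> pd.Series:
    return s.str.upper() + " hello"      # same length as the input

# Likely bug: appending "hello" as an extra element makes the result one
# row longer than the batch -> "Result vector ... not the required length".
def broken(s: pd.Series) -> pd.Series:
    out = list(s.str.upper())
    out.append("hello")
    return pd.Series(out)
```

The "expected 100, got 229" numbers reflect Spark's batch size versus whatever length the function actually returned.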
0 votes · 1 answer

Pyspark Pandas UDF Series operation on Array column

I have a dataframe like this: data_df = spark.createDataFrame([([1,2,3],'val1'),([4,5,6],'val2')],['col1','col2'])
col1      col2
[1,2,3]   val1
[4,5,6]   val2
I want to get the minimum value from the col1 arrays. The expected result looks…
lserlohn · 5,878
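In a pandas_udf, each cell of an array column arrives as a Python list inside the Series, so a plain `.apply(min)` does the job. A minimal sketch (noting that Spark's built-in `array_min` makes the UDF unnecessary here):

```python
import pandas as pd

# Each element of `col` is one row's array (a Python list), so the
# per-row minimum is just .apply(min).
def array_min(col: pd.Series) -> pd.Series:
    return col.apply(min)

# On Spark (sketch, not run here) -- the built-in avoids a UDF entirely:
# data_df.withColumn("min_val", F.array_min("col1"))
```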
0 votes · 0 answers

How to convert Python nested loops into a pandas UDF

I'm quite new to pyspark, and not a skilled Python engineer, trying to understand how a pandas UDF applies to my case. I have developed an ArimaX model which, for each "id", performs 4 outlook forecasts (M1 till M4 ahead), while for each outlook 12 models are…
KubaS · 1
0 votes · 0 answers

pandasUDF: how to process a complex column (array of object) and return a string column

I've got this dataframe:
root
 |-- trip_id: string (nullable = true)
 |-- vehicle_id: string (nullable = true)
 |-- points: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- time: long (nullable = true)
 |    |    …
4mla1fn · 169
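Inside a pandas_udf, an array-of-struct column arrives as a Series whose cells are lists of dicts (one dict per struct), so reducing it to a string is ordinary Python. A sketch joining the `time` fields, the only field visible in the truncated schema (the comma-joined format is an assumption):

```python
import pandas as pd

# Each cell of `points` is a list of dicts like {"time": ..., ...};
# collapse it to one string per row.
def points_to_str(points: pd.Series) -> pd.Series:
    return points.apply(lambda ps: ",".join(str(p["time"]) for p in ps))

# On Spark (sketch, not run here):
# summarize = pandas_udf(points_to_str, "string")
# df.withColumn("summary", summarize("points"))
```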