Questions tagged [pyspark-pandas]
131 questions
0
votes
0 answers
How to ignore the error "ERROR *** Token 0x2d (AreaN) found in NAME formula" from pyspark.pandas.read_excel(engine='xlrd') when reading an xls file with #REF
I am trying to read an xls file which contains #REF values.
When I try to read the file with "pyspark.pandas.read_excel(file_path, sheet_name = 'sheet_name', engine='xlrd', convert_float=False, dtype='str').to_spark()" I get the error "ERROR ***…

Chris
- 13
- 2
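For the question above, one possible workaround (a sketch, untested across every pandas/xlrd combination) is to open the workbook with xlrd directly and redirect its log output, since xlrd often writes the "ERROR *** Token 0x2d (AreaN) found in NAME formula" message to its logfile rather than raising. Plain pandas accepts an xlrd Book object as input; the result can then be converted to a Spark DataFrame. The file path and sheet name are placeholders:

import os
import pandas as pd
import xlrd  # assumes xlrd < 2.0, which still handles .xls files

# Redirect xlrd's NAME-formula error messages to /dev/null
book = xlrd.open_workbook("file.xls", logfile=open(os.devnull, "w"))

# pandas accepts an xlrd Book as io when engine='xlrd'
pdf = pd.read_excel(book, sheet_name="sheet_name", engine="xlrd", dtype="str")
sdf = spark.createDataFrame(pdf)  # assumes an active SparkSession `spark`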
0
votes
0 answers
Pandas API support on Spark Connect
I am trying to use the pandas API on Spark with Spark Connect, but I am getting an assertion error:
assert isinstance(spark_frame, SparkDataFrame)
AssertionError
I don't get any error if I use the Spark DataFrame API.
Is the pandas API on Spark supported on Spark…
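The pandas API on Spark only gained Spark Connect support in Spark 3.5; on Spark 3.4 and earlier the AssertionError above is expected, because pyspark.pandas asserts it is wrapping a classic DataFrame. A minimal check, assuming a Spark Connect server on the default port:

from pyspark.sql import SparkSession
import pyspark.pandas as ps

spark = SparkSession.builder.remote("sc://localhost:15002").getOrCreate()
print(spark.version)  # needs to be >= 3.5.0 for pandas-on-Spark over Connect

psdf = ps.DataFrame({"a": [1, 2, 3]})
print(psdf.sum())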
0
votes
1 answer
Unable to deploy Spark ML models on Databricks inference
I'm trying to deploy Spark models (sparkxgbregressor, rfregressor) in Databricks. Is model inferencing available only for scikit-learn models? If so, is there any other way to deploy Spark models in Databricks?
As per the ask, adding code for…

Rayzee
- 3
- 1
- 5
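For the deployment question above, one commonly suggested route (a sketch, not a guarantee that every serving tier accepts Spark flavors) is to log the fitted model through MLflow, since Databricks Model Serving loads models via their MLflow flavors; mlflow.spark.log_model also records a pyfunc flavor that generic serving can call. pipeline_model and the registry name are placeholders:

import mlflow
import mlflow.spark

with mlflow.start_run():
    mlflow.spark.log_model(
        pipeline_model,                           # hypothetical fitted pyspark.ml model
        artifact_path="model",
        registered_model_name="spark_regressor",  # hypothetical registry name
    )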
0
votes
0 answers
Retrieving timestamp data from kafka using pyspark
I need to parse data from Kafka which includes one timestamp column. Unfortunately, my code returns null for the timestamp column.
Here is my sample timestamp, 2023-06-18T14:49:11.8545562+03:30, which is saved in the CreationAt column, and my entire JSON…

Ali Moayed
- 33
- 5
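The null likely comes from the fractional seconds: the sample 2023-06-18T14:49:11.8545562+03:30 carries seven fraction digits plus a +03:30 offset, which the default timestamp parser does not handle. Spark 3 datetime patterns allow up to nine 'S' fraction characters and 'XXX' for the offset, so an explicit pattern is one fix (a sketch; df and CreationAt are taken from the question, and Spark truncates the fraction to microseconds):

from pyspark.sql import functions as F

parsed = df.withColumn(
    "CreationAt",
    F.to_timestamp(F.col("CreationAt"), "yyyy-MM-dd'T'HH:mm:ss.SSSSSSSXXX"),
)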
0
votes
1 answer
Cast a string field to datetime64[ns] in a parquet file using pandas-on-Spark
My input is a parquet file which I need to recast as below:
df=spark.read.parquet("input.parquet")
psdf=df.to_pandas_on_spark()
psdf['reCasted'] = psdf['col1'].astype('float64')
psdf['reCasted'] = psdf['col2'].astype('int32')
psdf['reCasted'] =…

user2531569
- 609
- 4
- 18
- 36
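For the recasting question above, .astype() does not cover string-to-datetime conversion; pyspark.pandas ships ps.to_datetime for that. A minimal sketch, where col3 and the format string are assumptions about the data:

import pyspark.pandas as ps

psdf["reCasted"] = ps.to_datetime(psdf["col3"], format="%Y-%m-%d")
print(psdf["reCasted"].dtype)  # datetime64[ns]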
0
votes
1 answer
Writing a PySpark dataframe as parquet with partitionBy becomes very slow
I have a PySpark dataframe that goes through multiple groupby- and pivot-style transformations to produce a final dataframe. I then write that dataframe back as parquet with partitioning.
This takes close to 1.67…

Raja Sabarish PV
- 115
- 1
- 14
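A common cause of slow partitionBy writes is that every task holds open a writer for every partition value it sees; repartitioning on the partition key first routes each value to a single task. A sketch, with partition_col and the output path as placeholders:

(
    df.repartition("partition_col")
      .write
      .mode("overwrite")
      .partitionBy("partition_col")
      .parquet("/path/to/output")
)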
0
votes
0 answers
How to run 10 different models on 10 different partitions using PySpark, without using Spark MLlib or standard parallelizing implementations?
I have broadcast the training dataset to all partitions.
Now I want to send the information for 10 different hyperparameters/models to 10 different partitions and train them independently. How do I share this model/hyperparameter information?
Is this…

thiran509
- 11
- 2
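One way to do this without MLlib, sketched below, is to broadcast the list of hyperparameter configurations and let each partition select its own entry by partition index via mapPartitionsWithIndex. train_model, data_rdd, and the parameter values are hypothetical:

hyperparams = [{"lr": 0.1 * (i + 1)} for i in range(10)]
bc_params = spark.sparkContext.broadcast(hyperparams)

def train_partition(idx, rows):
    params = bc_params.value[idx]            # one config per partition
    model = train_model(list(rows), params)  # hypothetical training routine
    yield (idx, model)

models = data_rdd.repartition(10).mapPartitionsWithIndex(train_partition).collect()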
0
votes
2 answers
Create new columns with a running count based on categorical column value counts in PySpark
Suppose a given dataframe:
Model   Color
Car     Red
Car     Red
Car     Blue
Truck   Red
Truck   Blue
Truck   Yellow
SUV     Blue
SUV     Blue
Car     Blue
Car     Yellow
I want to add color columns that keep a count of each color across each model to…

jay-elliot
- 9
- 1
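A window-based sketch for the running count above; the row_id column is an assumption, since the sample data has no explicit ordering column to anchor "running":

from pyspark.sql import functions as F
from pyspark.sql.window import Window

df = df.withColumn("row_id", F.monotonically_increasing_id())
w = Window.partitionBy("Model", "Color").orderBy("row_id")
df = df.withColumn("color_count", F.row_number().over(w))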
0
votes
1 answer
How can I merge multiple part files into a single file in Databricks?
I am trying to merge multiple part files into a single file. It iterates over all the files in the staging folder; the schema is the same. We convert the part files to .Tab files. Files are generated based on sales org code, e.g. 7001, 600, 8002; every country having…

KIRAN KUMAR
- 7
- 2
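A sketch for merging the part files, assuming tab-separated files with a shared schema under a staging folder; coalesce(1) forces a single output part file per write. The paths and the 7001 sales org code folder are placeholders:

df = (
    spark.read
    .option("sep", "\t")
    .option("header", "true")
    .csv("/mnt/staging/7001/")
)

(
    df.coalesce(1)
      .write
      .mode("overwrite")
      .option("sep", "\t")
      .option("header", "true")
      .csv("/mnt/output/7001/")
)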
0
votes
1 answer
PySpark error due to data type in pandas_udf
I'm trying to write a filter_words function as a pandas_udf.
Here are the functions I am using:
@udf_annotator(returnType=ArrayType(StructType([StructField("position", IntegerType(), True),
…

Rory
- 471
- 2
- 11
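Since udf_annotator in the question looks like a custom wrapper, a sketch with the built-in @udf decorator, which accepts nested array-of-struct return types, may help isolate the data-type error; the word field and the function body are assumptions:

from pyspark.sql.functions import udf
from pyspark.sql.types import (ArrayType, IntegerType, StringType,
                               StructField, StructType)

schema = ArrayType(StructType([
    StructField("position", IntegerType(), True),
    StructField("word", StringType(), True),    # hypothetical second field
]))

@udf(returnType=schema)
def filter_words(words):
    # assumes `words` is a list of strings; each dict must match the struct
    return [{"position": i, "word": w} for i, w in enumerate(words) if w]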
0
votes
1 answer
TypeError in PySpark UDF functions
I've got this function:
def ead(lista):
    ind_mmff, isdebala, isfubala, k1, k2, ead = lista
    try:
        isdebala = float(isdebala)
        isfubala = float(isfubala)
        k1 = float(k1)
        k2 = float(k2)
        ead = float(ead)
        …

JMP
- 38
- 8
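A sketch of how ead might be wired up as a UDF; the column names come from the tuple unpacking in the function, and the DoubleType return is an assumption about what the truncated body eventually returns:

from pyspark.sql import functions as F
from pyspark.sql.types import DoubleType

ead_udf = F.udf(ead, DoubleType())

df = df.withColumn(
    "ead_result",
    ead_udf(F.array("ind_mmff", "isdebala", "isfubala", "k1", "k2", "ead")),
)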
0
votes
1 answer
Retrieve the non-null value from a PySpark dataframe row and store it in a new column
I have a PySpark dataframe whose column names are unique IDs generated by the UUID library, so I cannot query using column names. Each row in this PySpark dataframe has exactly one non-null value. How do I create a new column which only has this 1…

pscodes
- 11
- 1
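Since exactly one value per row is non-null, coalesce across every column picks it out without needing to know the UUID column names; casting to string sidesteps mixed column types. A minimal sketch:

from pyspark.sql import functions as F

df = df.withColumn(
    "non_null_value",
    F.coalesce(*[F.col(c).cast("string") for c in df.columns]),
)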
0
votes
0 answers
Changing Python version of pyspark worker node
I need help changing the Python version of a Spark worker node to get rid of the following error message:
RuntimeError: Python in worker has different version 3.10 than that in driver 3.9, PySpark cannot run with different minor versions. Please…
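Both driver and workers need to resolve to the same interpreter before the SparkSession is created; the PYSPARK_PYTHON and PYSPARK_DRIVER_PYTHON environment variables control this. A sketch, with the interpreter path as a placeholder for wherever Python 3.9 lives on the nodes:

import os

os.environ["PYSPARK_PYTHON"] = "/usr/bin/python3.9"
os.environ["PYSPARK_DRIVER_PYTHON"] = "/usr/bin/python3.9"

from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()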
0
votes
1 answer
Pandas on Spark apply() seems to be reshaping columns
Can anybody explain the following behavior?
import pyspark.pandas as ps
loan_information = ps.read_sql_query([blah])
loan_information.shape
#748834, 84
loan_information.apply(lambda col: col.shape)
#Each column has 75 dimensions. The first 74 are…

Cody Dance
- 115
- 5
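A likely explanation: without a return-type hint, pandas-on-Spark infers apply()'s output schema by running the function on a limited sample of the data, so the shapes seen inside the lambda reflect sample chunks rather than the full 748834-row columns. The sample size is governed by the compute.shortcut_limit option (default 1000):

import pyspark.pandas as ps

print(ps.get_option("compute.shortcut_limit"))  # sample size used for inference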
0
votes
1 answer
I want to add days to a date in a loop 13 times using PySpark
Please help me solve this issue, as I am still new to Python/PySpark.
I want to loop 13 times, adding multiples of 7 days to a date in the same column.
I have a master table like this:
id   date
1    2019-02-21 10:00:00
2    2019-02-27…

rezha nanda
- 13
- 3
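A sketch for the 13-step loop: because the column carries a time component, INTERVAL arithmetic preserves it (date_add would return a date and drop the time). Each pass shifts the original rows by another 7 days and unions them onto the result:

from pyspark.sql import functions as F

result = df
for i in range(1, 14):
    shifted = df.withColumn("date", F.col("date") + F.expr(f"INTERVAL {7 * i} DAYS"))
    result = result.union(shifted)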