Questions tagged [pyspark-pandas]

131 questions
0
votes
0 answers

How to ignore the error "ERROR *** Token 0x2d (AreaN) found in NAME formula" from pyspark.pandas.read_excel(engine='xlrd') when reading an xls file with #REF

I am trying to read an xls file which contains #REF values. When I try to read the file with "pyspark.pandas.read_excel(file_path, sheet_name = 'sheet_name', engine='xlrd', convert_float=False, dtype='str').to_spark()" I get the error "ERROR ***…
Chris
  • 13
  • 2
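A possible workaround, assuming those xlrd messages are log lines written to stdout rather than raised exceptions (xlrd sends formula-parse complaints to its logfile, which defaults to stdout), is to redirect stdout around the read; the path and sheet name below are placeholders:

    import io
    import contextlib
    import pyspark.pandas as ps

    # Swallow xlrd's "ERROR *** Token ..." log lines without touching the data.
    with contextlib.redirect_stdout(io.StringIO()):
        psdf = ps.read_excel("file.xls", sheet_name="sheet_name",
                             engine="xlrd", dtype="str")
    sdf = psdf.to_spark()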
0
votes
0 answers

Pandas API support on Spark Connect

I am trying to use the Spark pandas API on Spark Connect, but I am getting an assertion error: assert isinstance(spark_frame, SparkDataFrame) AssertionError. I don't get any error if I use the Spark DataFrame API. Is the pandas-Spark API supported on Spark…
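For reference, a minimal sketch of the setup being described, with a placeholder Spark Connect endpoint; the pandas-on-Spark conversion is where the AssertionError would surface:

    import pyspark.pandas as ps
    from pyspark.sql import SparkSession

    # Spark Connect session (endpoint is a placeholder).
    spark = SparkSession.builder.remote("sc://localhost:15002").getOrCreate()

    sdf = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "val"])
    sdf.show()               # the plain DataFrame API works
    psdf = sdf.pandas_api()  # pandas API on Spark

Support for the pandas API over Spark Connect landed around Spark 3.5, so the assertion may simply indicate an earlier version.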
0
votes
1 answer

Spark ML models not able to deploy on Databricks inference

I'm trying to deploy Spark models (sparkxgbregressor, rfregressor) in Databricks. Is model inferencing available only for scikit-learn models? If so, is there any other way to deploy Spark models in Databricks? As per the ask, adding code for…
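One commonly suggested route, sketched here rather than prescribed, is to log the fitted Spark pipeline with MLflow's spark flavor and serve the logged model; pipeline_model is an assumed, already-fitted pyspark.ml PipelineModel:

    import mlflow
    import mlflow.spark

    with mlflow.start_run():
        # Logs the Spark ML model in MLflow format for later registration/serving.
        mlflow.spark.log_model(pipeline_model, artifact_path="model")

Note that Spark ML models need a Spark runtime at scoring time, which is one reason lightweight serving endpoints tend to favor scikit-learn style models.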
0
votes
0 answers

Retrieving timestamp data from kafka using pyspark

I need to parse data from Kafka which includes one timestamp column. Unfortunately, my code returns null for the timestamp column. Here is a sample timestamp, 2023-06-18T14:49:11.8545562+03:30, which is saved in the CreationAt column, and my entire JSON…
Ali Moayed
  • 33
  • 5
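The sample value carries seven fractional-second digits, while Spark's datetime patterns handle at most six, so one hedged fix is to trim a digit before parsing; df is assumed to already hold the extracted JSON string column:

    from pyspark.sql import functions as F

    # Keep six fractional digits, then parse with an explicit offset-aware pattern.
    parsed = df.withColumn(
        "CreationAt",
        F.to_timestamp(
            F.regexp_replace("CreationAt", r"(\.\d{6})\d+", "$1"),
            "yyyy-MM-dd'T'HH:mm:ss.SSSSSSXXX",
        ),
    )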
0
votes
1 answer

Cast String field to datetime64[ns] in parquet file using pandas-on-spark

My input is a parquet file that I need to recast as below:

    df = spark.read.parquet("input.parquet")
    psdf = df.to_pandas_on_spark()
    psdf['reCasted'] = psdf['col1'].astype('float64')
    psdf['reCasted'] = psdf['col2'].astype('int32')
    psdf['reCasted'] =…
user2531569
  • 609
  • 4
  • 18
  • 36
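For the datetime piece specifically, a small sketch, assuming a string column col3 with a known format, is to use ps.to_datetime instead of astype:

    import pyspark.pandas as ps

    df = spark.read.parquet("input.parquet")
    psdf = df.pandas_api()  # newer alias for to_pandas_on_spark()

    # ps.to_datetime produces datetime64[ns]; column name and format are assumptions.
    psdf["reCasted"] = ps.to_datetime(psdf["col3"], format="%Y-%m-%d %H:%M:%S")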
0
votes
1 answer

Writing a PySpark dataframe as parquet with partitionBy becomes very slow

I have a PySpark dataframe that goes through multiple groupBy- and pivot-style transformations, producing a final dataframe once all of them are applied. Writing that dataframe back as parquet with partitioning takes close to 1.67…
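A frequently suggested mitigation, sketched with a placeholder partition column, is to repartition on the same key before writing so each on-disk partition is produced by a single task rather than many small files:

    # Align in-memory partitioning with the on-disk layout before the write.
    (df.repartition("part_col")
       .write
       .mode("overwrite")
       .partitionBy("part_col")
       .parquet("/output/path"))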
0
votes
0 answers

How to run 10 different models on 10 different partitions using PySpark, without using Spark MLlib or standard parallelizing implementations?

I have broadcast the training dataset to all partitions. Now I want to send 10 different hyperparameter/model configurations to 10 different partitions and train them independently. How do I share this model/hyperparameter information? Is this…
thiran509
  • 11
  • 2
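One way to express this, sketched under the assumption that the partition index selects the hyperparameter set, is mapPartitionsWithIndex over a 10-partition RDD:

    # Ten hyperparameter dicts, one per partition (contents are placeholders).
    hyperparams = [{"lr": 0.1 * (i + 1)} for i in range(10)]
    b_params = sc.broadcast(hyperparams)

    def train(idx, rows):
        params = b_params.value[idx]  # each partition picks its own config
        # ... fit a model on the broadcast training data using `params` ...
        yield (idx, params)

    results = sc.parallelize(range(10), 10).mapPartitionsWithIndex(train).collect()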
0
votes
2 answers

Create new columns with running count based on categorical column value counts in pyspark

Suppose a given dataframe:

    Model  Color
    Car    Red
    Car    Red
    Car    Blue
    Truck  Red
    Truck  Blue
    Truck  Yellow
    SUV    Blue
    SUV    Blue
    Car    Blue
    Car    Yellow

I want to add color columns that keep a count of each color across each model to…
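A sketch of one approach, assuming some ordering column exists (Spark rows have no inherent order), is a running count over a window:

    from pyspark.sql import functions as F, Window

    # `order_col` is an assumed column that fixes the row order.
    w = (Window.partitionBy("Model", "Color")
               .orderBy("order_col")
               .rowsBetween(Window.unboundedPreceding, Window.currentRow))
    df = df.withColumn("color_running_count", F.count("*").over(w))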
0
votes
1 answer

How can I merge multiple part files into a single file in Databricks?

I am trying to merge multiple part files into a single file. In the staging folder it iterates over all the files, and the schema is the same. We are converting the part files to .Tab files. Files are generated based on salesorgcode, e.g. 7001, 600, 8002, and every country has…
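A common pattern for the single-file step, sketched with placeholder paths, is to coalesce to one partition before writing, trading away write parallelism:

    # One output task => one part file; only sensible for modest data sizes.
    (df.coalesce(1)
       .write
       .mode("overwrite")
       .option("sep", "\t")
       .csv("/staging/merged"))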
0
votes
1 answer

Pyspark Error due to data type in pandas_udf

I'm trying to write a filter_words function as a pandas_udf. Here are the functions I am using: @udf_annotator(returnType=ArrayType(StructType([StructField("position", IntegerType(), True), …
Rory
  • 471
  • 2
  • 11
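For contrast, a minimal pandas_udf with a simple element type that is known to work; the stop-word logic is illustrative, not the asker's:

    import pandas as pd
    from pyspark.sql.functions import pandas_udf
    from pyspark.sql.types import ArrayType, StringType

    @pandas_udf(ArrayType(StringType()))
    def filter_words(words: pd.Series) -> pd.Series:
        stop = {"the", "a", "an"}  # placeholder stop list
        return words.apply(lambda ws: [w for w in ws if w not in stop])

Nested return types such as ArrayType(StructType(...)) have historically been better supported by a plain udf than by pandas_udf, which may be relevant to the error here.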
0
votes
1 answer

TypeError in pySpark UDF functions

I've got this function:

    def ead(lista):
        ind_mmff, isdebala, isfubala, k1, k2, ead = lista
        try:
            isdebala = float(isdebala)
            isfubala = float(isfubala)
            k1 = float(k1)
            k2 = float(k2)
            ead = float(ead)
            …
JMP
  • 38
  • 8
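A sketch of how such a function is typically wrapped as a UDF, with None-safe casts, since TypeErrors in UDFs often trace back to null inputs; the DoubleType return is an assumption:

    from pyspark.sql.functions import udf
    from pyspark.sql.types import DoubleType

    def to_float(x):
        return float(x) if x is not None else None

    @udf(returnType=DoubleType())
    def ead(ind_mmff, isdebala, isfubala, k1, k2, ead_val):
        isdebala, isfubala = to_float(isdebala), to_float(isfubala)
        k1, k2, ead_val = to_float(k1), to_float(k2), to_float(ead_val)
        # ... remaining business logic from the question ...
        return ead_val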
0
votes
1 answer

Retrieve the non-null value from a PySpark dataframe row and store it in a new column

I have a PySpark dataframe whose column names are unique IDs generated by the UUID library, so I cannot query using column names. Each row in this PySpark dataframe has exactly one non-null value. How do I create a new column which only has this 1…
pscodes
  • 11
  • 1
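Since the column names are not known up front, one sketch is to coalesce across whatever df.columns contains:

    from pyspark.sql import functions as F

    # coalesce picks the first non-null value across all columns, per row.
    df = df.withColumn("value", F.coalesce(*[F.col(c) for c in df.columns]))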
0
votes
0 answers

Changing Python version of pyspark worker node

I need help changing the Python version of a Spark worker node to get rid of the following error message: RuntimeError: Python in worker has different version 3.10 than that in driver 3.9, PySpark cannot run with different minor versions. Please…
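The usual fix is to point the driver and the workers at the same interpreter before the session starts, e.g. via environment variables; the paths below are placeholders:

    import os
    from pyspark.sql import SparkSession

    # Both must resolve to the same minor Python version on every node.
    os.environ["PYSPARK_PYTHON"] = "/usr/bin/python3.10"
    os.environ["PYSPARK_DRIVER_PYTHON"] = "/usr/bin/python3.10"

    spark = SparkSession.builder.getOrCreate()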
0
votes
1 answer

Pandas on Spark apply() seems to be reshaping columns

Can anybody explain the following behavior?

    import pyspark.pandas as ps
    loan_information = ps.read_sql_query([blah])
    loan_information.shape  # (748834, 84)
    loan_information.apply(lambda col: col.shape)
    # Each column has 75 dimensions. The first 74 are…
Cody Dance
  • 115
  • 5
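This matches documented behavior: with axis 0, pandas-on-Spark feeds the function a series of batches rather than the whole column, so shape reflects batch sizes and global aggregations inside apply are unreliable. A small sketch of the distinction, using toy data:

    import pandas as pd
    import pyspark.pandas as ps

    psdf = ps.DataFrame({"a": range(10), "b": range(10)})

    # Element-wise logic is safe: batch boundaries don't change the result.
    def double(col: pd.Series) -> pd.Series:
        return col * 2

    psdf.apply(double)
    # Batch-dependent logic (len, shape, min/max over the column) is not.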
0
votes
1 answer

I want to add to a date in a loop 13 times using PySpark

Please help me solve this issue, as I am still new to Python/PySpark. I want to loop 13 times, adding multiples of 7 days to a date in the same column. I have a master table like this:

    id  date
    1   2019-02-21 10:00:00
    2   2019-02-27…
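Rather than a Python loop, one sketch generates all 13 weekly offsets in one pass with sequence plus explode; make_interval keeps the time-of-day component that date_add would drop:

    from pyspark.sql import functions as F

    # 13 rows per id: offsets 0..12 weeks from the original date.
    out = (df.withColumn("week", F.explode(F.sequence(F.lit(0), F.lit(12))))
             .withColumn("date",
                         F.expr("date + make_interval(0, 0, week, 0, 0, 0, 0)")))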