Questions tagged [pyspark-pandas]

131 questions
0
votes
0 answers

How to ignore the error "ERROR *** Token 0x2d (AreaN) found in NAME formula" from pyspark.pandas.read_excel(engine='xlrd') when reading an xls file with #REF

I am trying to read an xls file which contains #REF values. When I try to read the file with "pyspark.pandas.read_excel(file_path, sheet_name = 'sheet_name', engine='xlrd', convert_float=False, dtype='str').to_spark()" I get the error "ERROR ***…
Chris
  • 13
  • 2
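A possible workaround, assuming those xlrd messages are log lines written to stdout rather than raised exceptions (xlrd sends formula-parse complaints to its logfile, which defaults to stdout), is to redirect stdout around the read; the path and sheet name below are placeholders:

    import io
    import contextlib
    import pyspark.pandas as ps

    # Swallow xlrd's "ERROR *** Token ..." log lines without touching the data.
    with contextlib.redirect_stdout(io.StringIO()):
        psdf = ps.read_excel("file.xls", sheet_name="sheet_name",
                             engine="xlrd", dtype="str")
    sdf = psdf.to_spark()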
0
votes
0 answers

Pandas API support on Spark Connect

I am trying to use the Spark pandas API on Spark Connect, but I am getting an assertion error: assert isinstance(spark_frame, SparkDataFrame) AssertionError. I don't get any error if I use the Spark DataFrame API. Is the pandas-Spark API supported on Spark…
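For reference, a minimal sketch of the setup being described, with a placeholder Spark Connect endpoint; the pandas-on-Spark conversion is where the AssertionError would surface:

    import pyspark.pandas as ps
    from pyspark.sql import SparkSession

    # Spark Connect session (endpoint is a placeholder).
    spark = SparkSession.builder.remote("sc://localhost:15002").getOrCreate()

    sdf = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "val"])
    sdf.show()               # the plain DataFrame API works
    psdf = sdf.pandas_api()  # pandas API on Spark

Support for the pandas API over Spark Connect landed around Spark 3.5, so the assertion may simply indicate an earlier version.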
0
votes
1 answer

Spark ML models not able to deploy on Databricks inference

I'm trying to deploy Spark models (sparkxgbregressor, rfregressor) in Databricks. Is model inferencing available only for scikit-learn models? If so, is there any other way to deploy Spark models in Databricks? As per the ask, adding code for…
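One commonly suggested route, sketched here rather than prescribed, is to log the fitted Spark pipeline with MLflow's spark flavor and serve the logged model; pipeline_model is an assumed, already-fitted pyspark.ml PipelineModel:

    import mlflow
    import mlflow.spark

    with mlflow.start_run():
        # Logs the Spark ML model in MLflow format for later registration/serving.
        mlflow.spark.log_model(pipeline_model, artifact_path="model")

Note that Spark ML models need a Spark runtime at scoring time, which is one reason lightweight serving endpoints tend to favor scikit-learn style models.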
0
votes
0 answers

Retrieving timestamp data from kafka using pyspark

I need to parse data from Kafka which includes one timestamp column. Unfortunately, my code returns null for the timestamp column. Here is a sample timestamp, 2023-06-18T14:49:11.8545562+03:30, which is saved in the CreationAt column, and my entire JSON…
Ali Moayed
  • 33
  • 5
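The sample value carries seven fractional-second digits, while Spark's datetime patterns handle at most six, so one hedged fix is to trim a digit before parsing; df is assumed to already hold the extracted JSON string column:

    from pyspark.sql import functions as F

    # Keep six fractional digits, then parse with an explicit offset-aware pattern.
    parsed = df.withColumn(
        "CreationAt",
        F.to_timestamp(
            F.regexp_replace("CreationAt", r"(\.\d{6})\d+", "$1"),
            "yyyy-MM-dd'T'HH:mm:ss.SSSSSSXXX",
        ),
    )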
0
votes
1 answer

Cast String field to datetime64[ns] in parquet file using pandas-on-spark

My input is a parquet file that I need to recast as below:

    df = spark.read.parquet("input.parquet")
    psdf = df.to_pandas_on_spark()
    psdf['reCasted'] = psdf['col1'].astype('float64')
    psdf['reCasted'] = psdf['col2'].astype('int32')
    psdf['reCasted'] =…
user2531569
  • 609
  • 4
  • 18
  • 36
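For the datetime piece specifically, a small sketch, assuming a string column col3 with a known format, is to use ps.to_datetime instead of astype:

    import pyspark.pandas as ps

    df = spark.read.parquet("input.parquet")
    psdf = df.pandas_api()  # newer alias for to_pandas_on_spark()

    # ps.to_datetime produces datetime64[ns]; column name and format are assumptions.
    psdf["reCasted"] = ps.to_datetime(psdf["col3"], format="%Y-%m-%d %H:%M:%S")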
0
votes
1 answer

Writing a PySpark dataframe as parquet with partitionBy becomes very slow

I have a PySpark dataframe that goes through multiple groupBy- and pivot-style transformations, producing a final dataframe once all of them are applied. Writing that dataframe back as parquet with partitioning takes close to 1.67…
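A frequently suggested mitigation, sketched with a placeholder partition column, is to repartition on the same key before writing so each on-disk partition is produced by a single task rather than many small files:

    # Align in-memory partitioning with the on-disk layout before the write.
    (df.repartition("part_col")
       .write
       .mode("overwrite")
       .partitionBy("part_col")
       .parquet("/output/path"))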
0
votes
0 answers

How to run 10 different models on 10 different partitions using PySpark, without using Spark MLlib or standard parallelizing implementations?

I have broadcast the training dataset to all partitions. Now I want to send 10 different hyperparameter/model configurations to 10 different partitions and train them independently. How do I share this model/hyperparameter information? Is this…
thiran509
  • 11
  • 2
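One way to express this, sketched under the assumption that the partition index selects the hyperparameter set, is mapPartitionsWithIndex over a 10-partition RDD:

    # Ten hyperparameter dicts, one per partition (contents are placeholders).
    hyperparams = [{"lr": 0.1 * (i + 1)} for i in range(10)]
    b_params = sc.broadcast(hyperparams)

    def train(idx, rows):
        params = b_params.value[idx]  # each partition picks its own config
        # ... fit a model on the broadcast training data using `params` ...
        yield (idx, params)

    results = sc.parallelize(range(10), 10).mapPartitionsWithIndex(train).collect()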
0
votes
2 answers

Create new columns with running count based on categorical column value counts in pyspark

Suppose a given dataframe:

    Model  Color
    Car    Red
    Car    Red
    Car    Blue
    Truck  Red
    Truck  Blue
    Truck  Yellow
    SUV    Blue
    SUV    Blue
    Car    Blue
    Car    Yellow

I want to add color columns that keep a count of each color across each model to…
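A sketch of one approach, assuming some ordering column exists (Spark rows have no inherent order), is a running count over a window:

    from pyspark.sql import functions as F, Window

    # `order_col` is an assumed column that fixes the row order.
    w = (Window.partitionBy("Model", "Color")
               .orderBy("order_col")
               .rowsBetween(Window.unboundedPreceding, Window.currentRow))
    df = df.withColumn("color_running_count", F.count("*").over(w))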
0
votes
1 answer

How can I merge multiple part files into a single file in Databricks?

I am trying to merge multiple part files into a single file. In the staging folder it iterates over all the files, and the schema is the same. We are converting the part files to .Tab files. Files are generated based on salesorgcode, e.g. 7001, 600, 8002, and every country has…
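A common pattern for the single-file step, sketched with placeholder paths, is to coalesce to one partition before writing, trading away write parallelism:

    # One output task => one part file; only sensible for modest data sizes.
    (df.coalesce(1)
       .write
       .mode("overwrite")
       .option("sep", "\t")
       .csv("/staging/merged"))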
0
votes
1 answer

Pyspark Error due to data type in pandas_udf

I'm trying to write a filter_words function as a pandas_udf. Here are the functions I am using: @udf_annotator(returnType=ArrayType(StructType([StructField("position", IntegerType(), True), …
Rory
  • 471
  • 2
  • 11
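For contrast, a minimal pandas_udf with a simple element type that is known to work; the stop-word logic is illustrative, not the asker's:

    import pandas as pd
    from pyspark.sql.functions import pandas_udf
    from pyspark.sql.types import ArrayType, StringType

    @pandas_udf(ArrayType(StringType()))
    def filter_words(words: pd.Series) -> pd.Series:
        stop = {"the", "a", "an"}  # placeholder stop list
        return words.apply(lambda ws: [w for w in ws if w not in stop])

Nested return types such as ArrayType(StructType(...)) have historically been better supported by a plain udf than by pandas_udf, which may be relevant to the error here.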
0
votes
1 answer

TypeError in pySpark UDF functions

I've got this function:

    def ead(lista):
        ind_mmff, isdebala, isfubala, k1, k2, ead = lista
        try:
            isdebala = float(isdebala)
            isfubala = float(isfubala)
            k1 = float(k1)
            k2 = float(k2)
            ead = float(ead)
            …
JMP
  • 38
  • 8
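A sketch of how such a function is typically wrapped as a UDF, with None-safe casts, since TypeErrors in UDFs often trace back to null inputs; the DoubleType return is an assumption:

    from pyspark.sql.functions import udf
    from pyspark.sql.types import DoubleType

    def to_float(x):
        return float(x) if x is not None else None

    @udf(returnType=DoubleType())
    def ead(ind_mmff, isdebala, isfubala, k1, k2, ead_val):
        isdebala, isfubala = to_float(isdebala), to_float(isfubala)
        k1, k2, ead_val = to_float(k1), to_float(k2), to_float(ead_val)
        # ... remaining business logic from the question ...
        return ead_val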
0
votes
1 answer

Retrieve the non-null value from a PySpark dataframe row and store it in a new column

I have a PySpark dataframe whose column names are unique IDs generated by the UUID library, so I cannot query using column names. Each row in this PySpark dataframe has exactly one non-null value. How do I create a new column which only has this 1…
pscodes
  • 11
  • 1
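Since the column names are not known up front, one sketch is to coalesce across whatever df.columns contains:

    from pyspark.sql import functions as F

    # coalesce picks the first non-null value across all columns, per row.
    df = df.withColumn("value", F.coalesce(*[F.col(c) for c in df.columns]))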
0
votes
0 answers

Changing Python version of pyspark worker node

I need help changing the Python version of a Spark worker node to get rid of the following error message: RuntimeError: Python in worker has different version 3.10 than that in driver 3.9, PySpark cannot run with different minor versions. Please…
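The usual fix is to point the driver and the workers at the same interpreter before the session starts, e.g. via environment variables; the paths below are placeholders:

    import os
    from pyspark.sql import SparkSession

    # Both must resolve to the same minor Python version on every node.
    os.environ["PYSPARK_PYTHON"] = "/usr/bin/python3.10"
    os.environ["PYSPARK_DRIVER_PYTHON"] = "/usr/bin/python3.10"

    spark = SparkSession.builder.getOrCreate()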
0
votes
1 answer

Pandas on Spark apply() seems to be reshaping columns

Can anybody explain the following behavior?

    import pyspark.pandas as ps
    loan_information = ps.read_sql_query([blah])
    loan_information.shape  # (748834, 84)
    loan_information.apply(lambda col: col.shape)
    # Each column has 75 dimensions. The first 74 are…
Cody Dance
  • 115
  • 5
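This matches documented behavior: with axis 0, pandas-on-Spark feeds the function a series of batches rather than the whole column, so shape reflects batch sizes and global aggregations inside apply are unreliable. A small sketch of the distinction, using toy data:

    import pandas as pd
    import pyspark.pandas as ps

    psdf = ps.DataFrame({"a": range(10), "b": range(10)})

    # Element-wise logic is safe: batch boundaries don't change the result.
    def double(col: pd.Series) -> pd.Series:
        return col * 2

    psdf.apply(double)
    # Batch-dependent logic (len, shape, min/max over the column) is not.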
0
votes
1 answer

I want to add to a date in a loop 13 times using PySpark

Please help me solve this issue, as I am still new to Python/PySpark. I want to loop 13 times, adding multiples of 7 days to a date in the same column. I have a master table like this:

    id  date
    1   2019-02-21 10:00:00
    2   2019-02-27…
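Rather than a Python loop, one sketch generates all 13 weekly offsets in one pass with sequence plus explode; make_interval keeps the time-of-day component that date_add would drop:

    from pyspark.sql import functions as F

    # 13 rows per id: offsets 0..12 weeks from the original date.
    out = (df.withColumn("week", F.explode(F.sequence(F.lit(0), F.lit(12))))
             .withColumn("date",
                         F.expr("date + make_interval(0, 0, week, 0, 0, 0, 0)")))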