Questions tagged [pyspark-pandas]

131 questions
0
votes
1 answer

Rewrite UDF to pandas UDF Pyspark

I have a dataframe: import pyspark.sql.functions as F sdf1 = spark.createDataFrame( [ (2022, 1, ["apple", "edible"]), (2022, 1, ["edible", "fruit"]), (2022, 1, ["orange", "sweet"]), (2022, 4, ["flowering ",…
Rory
  • 471
  • 2
  • 11
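The UDF body in the question above is truncated, so the sketch below only illustrates the general recipe: a Series-to-Series pandas UDF operating on the array column from the sample data (column names are guessed, and joining the tags into one string is a stand-in transformation). Pandas UDFs receive whole Arrow batches, which is what usually makes them faster than row-at-a-time UDFs.

```python
import pandas as pd
from pyspark.sql import SparkSession
import pyspark.sql.functions as F
from pyspark.sql.types import StringType

spark = SparkSession.builder.getOrCreate()

# Sample data from the question; column names are assumptions.
sdf1 = spark.createDataFrame(
    [(2022, 1, ["apple", "edible"]),
     (2022, 1, ["edible", "fruit"]),
     (2022, 1, ["orange", "sweet"])],
    ["year", "n", "tags"],
)

# Series-to-Series pandas UDF: receives a pandas Series of tag lists and
# returns a pandas Series of strings. Spark feeds it in Arrow batches,
# avoiding the per-row Python overhead of a plain UDF.
@F.pandas_udf(StringType())
def join_tags(tags: pd.Series) -> pd.Series:
    return tags.apply(lambda xs: " ".join(xs))

sdf1.withColumn("tags_joined", join_tags("tags")).show()
```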
0
votes
4 answers

find the top n unique values of a column based on ranking of another column within groups in pyspark

I have a dataframe like below: df = pd.DataFrame({ 'region': [1,1,1,1,1,1,2,2,2,3], 'store': ['A', 'A', 'C', 'C', 'D', 'B', 'F', 'F', 'E', 'G'], 'call_date': ['2022-03-10', '2022-03-09', '2022-03-08', '2022-03-07',…
zesla
  • 11,155
  • 16
  • 82
  • 147
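The sample data above is a pandas DataFrame and is truncated, so this sketch assumes it has been recreated as a Spark DataFrame (the missing call_date values are filled in for illustration) and that the goal is the n most recently called stores per region, ranked by their latest call_date.

```python
from pyspark.sql import SparkSession, Window
import pyspark.sql.functions as F

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame(
    [(1, "A", "2022-03-10"), (1, "A", "2022-03-09"), (1, "C", "2022-03-08"),
     (1, "C", "2022-03-07"), (1, "D", "2022-03-06"), (1, "B", "2022-03-05"),
     (2, "F", "2022-03-04"), (2, "F", "2022-03-03"), (2, "E", "2022-03-02"),
     (3, "G", "2022-03-01")],
    ["region", "store", "call_date"],
)

n = 2  # keep the top-n stores per region

# Rank each store within its region by its most recent call_date,
# then keep the n highest-ranked (most recently called) stores.
latest = df.groupBy("region", "store").agg(F.max("call_date").alias("last_call"))
w = Window.partitionBy("region").orderBy(F.col("last_call").desc())
top_n = (latest
         .withColumn("rank", F.dense_rank().over(w))
         .filter(F.col("rank") <= n)
         .drop("rank"))
top_n.show()
```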
0
votes
0 answers

Cross validate in Pyspark

I need cross_validate in PySpark. I have understood the Python code, but can anyone help me with the PySpark version of it? Python code: from sklearn.model_selection import cross_validate cv_set = cross_validate(gamma_cdf(),…
Deb
  • 499
  • 2
  • 15
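sklearn's cross_validate has no drop-in PySpark equivalent; the closest built-in tool is pyspark.ml.tuning.CrossValidator, which requires a Spark ML estimator rather than the question's gamma_cdf model. A minimal sketch with a stand-in LinearRegression:

```python
from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.regression import LinearRegression
from pyspark.ml.evaluation import RegressionEvaluator
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder

spark = SparkSession.builder.getOrCreate()

# Toy data; replace with the real training DataFrame.
train = spark.createDataFrame(
    [(1.0, 2.0, 3.1), (2.0, 1.0, 4.9), (3.0, 5.0, 7.2),
     (4.0, 6.0, 9.8), (5.0, 2.0, 11.1), (6.0, 4.0, 13.0)],
    ["x1", "x2", "label"],
)
train = VectorAssembler(inputCols=["x1", "x2"], outputCol="features").transform(train)

lr = LinearRegression(featuresCol="features", labelCol="label")
grid = ParamGridBuilder().addGrid(lr.regParam, [0.01, 0.1]).build()

# 3-fold cross-validation; avgMetrics holds the mean RMSE per grid point,
# roughly what sklearn's cross_validate reports as test scores.
cv = CrossValidator(
    estimator=lr,
    estimatorParamMaps=grid,
    evaluator=RegressionEvaluator(labelCol="label", metricName="rmse"),
    numFolds=3,
)
cv_model = cv.fit(train)
print(cv_model.avgMetrics)
```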
0
votes
1 answer

Read latest file grouped by monthYear in directory in pyspark

I have multiple files in a directory. The file names are similar to those shown in picture 1. I want to read only the latest file for each month from the directory into a PySpark dataframe. The expected files to be read are shown in picture 2.
DigiLearner
  • 77
  • 1
  • 9
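The actual file-name pattern only appears in the question's screenshots, so the sketch below assumes a hypothetical scheme like sales_2022-03-28.csv: list the files, parse the date out of each name, keep the latest file per year-month, and pass only those paths to spark.read.

```python
import re
from pathlib import Path
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Hypothetical directory and naming scheme; adjust the regex to the real names.
directory = Path("/dbfs/mnt/raw/sales")
pattern = re.compile(r"sales_(\d{4})-(\d{2})-(\d{2})\.csv$")

latest_per_month = {}
for f in directory.iterdir():
    m = pattern.search(f.name)
    if not m:
        continue
    year, month, _day = m.groups()
    key = (year, month)
    # With zero-padded dates in the name, the lexicographically greatest
    # file name per (year, month) is also the latest file.
    if key not in latest_per_month or f.name > latest_per_month[key].name:
        latest_per_month[key] = f

paths = [str(f) for f in latest_per_month.values()]
df = spark.read.csv(paths, header=True, inferSchema=True)
```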
0
votes
0 answers

How to write data back to BigQuery using Databricks?

I would like to upload my data frame to a BigQuery table using Databricks. I used the code below and got the following errors. bucket = "databricks-ci" table =…
sthambi
  • 197
  • 2
  • 17
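The error message in the question is cut off, but the usual write pattern with the Google spark-bigquery connector (which must be installed on the Databricks cluster, with credentials that can write to both the staging bucket and the table) looks like the sketch below; the target table name here is hypothetical.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
# Stand-in for the real data frame to upload.
df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "value"])

bucket = "databricks-ci"                  # staging bucket from the question
table = "my-project.my_dataset.my_table"  # hypothetical BigQuery target

# The connector stages data in GCS (temporaryGcsBucket) and then loads it
# into BigQuery; without a staging bucket the write fails.
(df.write
   .format("bigquery")
   .option("table", table)
   .option("temporaryGcsBucket", bucket)
   .mode("append")
   .save())
```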
0
votes
1 answer

Convert event time into date and time in Pyspark?

I have the event_time below in my data frame and I would like to convert it into a date/time. I used the code below, but the result is not coming out properly: import pyspark.sql.functions as f df = df.withColumn("date", f.from_unixtime("Event_Time", "dd/MM/yyyy…
sthambi
  • 197
  • 2
  • 17
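The sample Event_Time values are not shown, so this sketch assumes epoch milliseconds, which is a common reason from_unixtime output "does not come out properly": the function expects seconds.

```python
import pyspark.sql.functions as F
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Hypothetical epoch-millisecond values standing in for the question's data.
df = spark.createDataFrame([(1650067200000,), (1650153600000,)], ["Event_Time"])

# from_unixtime expects seconds, so epoch milliseconds must be divided by 1000
# before formatting; a bare millisecond value produces a far-future date.
df = (df
      .withColumn("event_ts",
                  F.from_unixtime(F.col("Event_Time") / 1000).cast("timestamp"))
      .withColumn("date", F.date_format("event_ts", "dd/MM/yyyy"))
      .withColumn("time", F.date_format("event_ts", "HH:mm:ss")))
df.show(truncate=False)
```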
0
votes
0 answers

Arrow is not supported when using file-based collect while converting from pandas to Spark and vice versa

I am trying to use Arrow by enabling spark.conf.set("spark.sql.execution.arrow.pyspark.enabled", "true"), but I get the following warning: /databricks/spark/python/pyspark/sql/pandas/conversion.py:340: UserWarning: createDataFrame attempted Arrow…
qaiser
  • 2,770
  • 2
  • 17
  • 29
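The warning text above is truncated, but it indicates createDataFrame tried the Arrow path and fell back to the non-Arrow one. One common debugging step (an assumption, not a guaranteed fix) is to disable the silent fallback so Spark raises the real incompatibility, typically an unsupported column dtype:

```python
import pandas as pd
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Enable Arrow for pandas <-> Spark conversion.
spark.conf.set("spark.sql.execution.arrow.pyspark.enabled", "true")
# Disable the silent fallback so the underlying reason Arrow was rejected
# is raised as an error instead of a UserWarning followed by a slow path.
spark.conf.set("spark.sql.execution.arrow.pyspark.fallback.enabled", "false")

pdf = pd.DataFrame({"a": [1, 2, 3], "b": ["x", "y", "z"]})
sdf = spark.createDataFrame(pdf)   # uses the Arrow path when all dtypes are supported
pdf_back = sdf.toPandas()
```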
-1
votes
1 answer

Is there any efficient way to store streaming data from different stock exchanges in Python besides Parquet files while using CCXT library?

What is the best way to store streaming data from different stock exchanges in order to minimise storage size? Right now I'm using the CCXT library in Python, and in order to get the current order book information and save it into a Parquet-type file I use code…
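Whether Parquet is the "best" format is partly a matter of opinion, but much of the overhead usually comes from writing one tiny file per snapshot. A hedged sketch that buffers CCXT order-book snapshots and flushes them as one compressed Parquet file per batch (exchange, symbol, and batch size are illustrative):

```python
import time
import pandas as pd
import ccxt

exchange = ccxt.binance()   # any CCXT exchange object
rows, batch = [], 0

# Buffer snapshots in memory and flush them as one Parquet file per batch;
# many tiny Parquet files (one per snapshot) cost more space and I/O than
# fewer, larger, compressed files.
for _ in range(600):
    book = exchange.fetch_order_book("BTC/USDT", limit=5)
    rows.append({
        "timestamp": book["timestamp"],
        "best_bid": book["bids"][0][0],
        "best_ask": book["asks"][0][0],
    })
    if len(rows) >= 100:
        pd.DataFrame(rows).to_parquet(f"orderbook_batch_{batch}.parquet",
                                      compression="snappy")
        rows, batch = [], batch + 1
    time.sleep(1)
```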
-1
votes
1 answer

Execute query in parallel over a list of rows in pyspark

In Databricks I have N Delta tables of stores with their products, with this schema: store_1: store product sku 1 prod 1 abc 1 prod 2 def 1 prod 3 ghi store_2: store product sku 2 prod 1 abc 2 prod 10 xyz 2 prod…
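Assuming the store tables are registered Delta tables named store_1 … store_N (names are hypothetical), a common alternative to querying them one by one in a Python loop is to union them into a single DataFrame and let Spark scan and process all of them in parallel:

```python
from functools import reduce
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Hypothetical Delta table names store_1 ... store_N in the metastore.
store_tables = [f"store_{i}" for i in range(1, 4)]

# Union all store tables; Spark parallelizes the scan across them,
# instead of issuing one serial query per table from the driver.
all_stores = reduce(
    lambda a, b: a.unionByName(b),
    [spark.table(name) for name in store_tables],
)

result = all_stores.groupBy("store").count()
result.show()
```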
-1
votes
1 answer

Pyspark read all files and write it back it to same file after transformation

Hi, I have files in a directory: Folder/1.csv, Folder/2.csv, Folder/3.csv. I want to read all these files into a PySpark dataframe/RDD, change some column values, and write them back to the same files. I have tried it, but it creates a new file in the folder…
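Spark always writes a directory of part-files rather than overwriting a single CSV in place, which is why a new file appears in the folder. One hedged workaround (the column name and transformation are placeholders) is to process each file separately, write to a staging directory, and move the part-file back over the original:

```python
import glob
import os
import shutil
from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.getOrCreate()

for path in glob.glob("Folder/*.csv"):
    df = spark.read.csv(path, header=True, inferSchema=True)
    # Placeholder transformation on a hypothetical column.
    df = df.withColumn("some_column", F.upper("some_column"))

    # Spark writes a directory of part-files, so stage the output and then
    # move the single part-file back over the original file name.
    tmp_dir = path + "_tmp"
    df.coalesce(1).write.mode("overwrite").csv(tmp_dir, header=True)
    part_file = glob.glob(os.path.join(tmp_dir, "part-*.csv"))[0]
    shutil.move(part_file, path)
    shutil.rmtree(tmp_dir)
```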
-1
votes
2 answers

compare two dataframes and display the data that are different

I have two dataframes and I want to compare the values of two columns and display those that are different. For example: compare this table…
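The sample tables in the question are truncated, so the sketch below assumes both dataframes share a key column ("id", hypothetical) and that the goal is to show the rows where a compared column differs:

```python
from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.getOrCreate()

# Hypothetical tables keyed by "id"; the question's sample tables are truncated.
df1 = spark.createDataFrame([(1, "a"), (2, "b"), (3, "c")], ["id", "value"])
df2 = spark.createDataFrame([(1, "a"), (2, "x"), (3, "c")], ["id", "value"])

# Join on the key and keep only the rows where the compared column differs.
diff = (df1.alias("l")
        .join(df2.alias("r"), on="id", how="inner")
        .filter(F.col("l.value") != F.col("r.value"))
        .select("id",
                F.col("l.value").alias("value_df1"),
                F.col("r.value").alias("value_df2")))
diff.show()
```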