Questions tagged [pyspark-pandas]
131 questions
0
votes
1 answer
Rewrite UDF to pandas UDF in PySpark
I have a dataframe:
import pyspark.sql.functions as F
sdf1 = spark.createDataFrame(
[
(2022, 1, ["apple", "edible"]),
(2022, 1, ["edible", "fruit"]),
(2022, 1, ["orange", "sweet"]),
(2022, 4, ["flowering ",…

Rory
- 471
- 2
- 11
0
votes
4 answers
find the top n unique values of a column based on ranking of another column within groups in pyspark
I have a dataframe like below:
df = pd.DataFrame({ 'region': [1,1,1,1,1,1,2,2,2,3],
'store': ['A', 'A', 'C', 'C', 'D', 'B', 'F', 'F', 'E', 'G'],
'call_date': ['2022-03-10', '2022-03-09', '2022-03-08', '2022-03-07',…

zesla
- 11,155
- 16
- 82
- 147
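Since the sample here is a plain pandas DataFrame, one way to get the top-n stores per region (ranked by most recent call_date) is a groupby-max followed by a sort and `head`. The abbreviated data below is illustrative, as the original sample is truncated; in Spark the same logic is a `Window.partitionBy("region")` with `dense_rank` over call_date descending.

```python
import pandas as pd

df = pd.DataFrame({
    "region": [1, 1, 1, 2, 2, 3],
    "store": ["A", "A", "C", "F", "E", "G"],
    "call_date": ["2022-03-10", "2022-03-09", "2022-03-08",
                  "2022-03-07", "2022-03-06", "2022-03-05"],
})

# Latest call per (region, store), then keep the 2 most recent stores per region.
latest = (df.groupby(["region", "store"], as_index=False)["call_date"].max()
            .sort_values(["region", "call_date"], ascending=[True, False]))
top2 = latest.groupby("region").head(2)
print(top2)
```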
0
votes
0 answers
Cross validate in Pyspark
I need cross_validate in PySpark. I understand the Python code below, but can anyone help me with the PySpark version of it?
Python code:
from sklearn.model_selection import cross_validate
cv_set = cross_validate(gamma_cdf(),…

Deb
- 499
- 2
- 15
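PySpark has no drop-in replacement for sklearn's cross_validate; the closest built-in is pyspark.ml.tuning.CrossValidator. A cluster-only sketch, not runnable locally — `pipeline`, `train`, and the `label` column are assumptions to substitute with your own estimator and data:

```python
from pyspark.ml.evaluation import RegressionEvaluator
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder

cv = CrossValidator(
    estimator=pipeline,                        # assumed: a pyspark.ml Pipeline
    estimatorParamMaps=ParamGridBuilder().build(),
    evaluator=RegressionEvaluator(labelCol="label"),
    numFolds=5,
)
cv_model = cv.fit(train)                       # assumed: training DataFrame
print(cv_model.avgMetrics)                     # average metric across the folds
```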
0
votes
1 answer
Read latest file grouped by monthYear in directory in pyspark
I have multiple files in a directory.
File names are similar to those shown in picture 1.
I want to read only the latest file for each month from the directory in PySpark as a dataframe.
The files I expect to be read are shown in picture 2.

DigiLearner
- 77
- 1
- 9
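Since the pictures are not available, one common pattern is to select the newest file per month on the driver first (the date embedded in each name decides), then hand the resulting list to spark.read. The file names below are hypothetical:

```python
import re

# Hypothetical file names with an embedded yyyy-mm-dd date.
files = ["sales_2022-03-01.csv", "sales_2022-03-15.csv", "sales_2022-04-02.csv"]

latest = {}
for name in files:
    date = re.search(r"\d{4}-\d{2}-\d{2}", name).group()
    month = date[:7]                       # e.g. "2022-03"
    if month not in latest or date > latest[month][0]:
        latest[month] = (date, name)

to_read = sorted(name for _, name in latest.values())
print(to_read)   # → ['sales_2022-03-15.csv', 'sales_2022-04-02.csv']
# On a cluster, pass this list straight to spark.read.csv(to_read).
```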
0
votes
0 answers
How to write data back to BigQuery using Databricks?
I would like to upload my data frame to a BigQuery table using Databricks. I used the code below and got the following errors.
bucket = "databricks-ci"
table =…

sthambi
- 197
- 2
- 17
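The error text is truncated, but the usual shape of a Databricks-to-BigQuery write with the spark-bigquery connector is below. A cluster-only sketch: the table id is hypothetical, and writes need a GCS staging bucket:

```python
(df.write.format("bigquery")
   .option("temporaryGcsBucket", bucket)        # "databricks-ci" from above
   .mode("overwrite")
   .save("my_project.my_dataset.my_table"))     # hypothetical table id
```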
0
votes
1 answer
Convert event time into date and time in Pyspark?
I have the below event_time in my data frame.
I would like to convert event_time into a date/time. I used the code below, but the result is not coming out properly.
import pyspark.sql.functions as f
df = df.withColumn("date", f.from_unixtime("Event_Time", "dd/MM/yyyy…

sthambi
- 197
- 2
- 17
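A frequent cause of a "wrong" from_unixtime result is that the epoch is in milliseconds while from_unixtime expects seconds. A sketch under that assumption, with a hypothetical sample value; the pure-Python line shows the same conversion:

```python
from datetime import datetime, timezone

# Epoch timestamps from event streams are often in *milliseconds*; dividing
# by 1000 before converting gives the expected date.
event_time_ms = 1650000000000            # hypothetical sample value
dt = datetime.fromtimestamp(event_time_ms / 1000, tz=timezone.utc)
print(dt.strftime("%d/%m/%Y %H:%M:%S"))  # → 15/04/2022 05:20:00

# Spark equivalent (run inside a session):
#   df = df.withColumn("date",
#       F.from_unixtime(F.col("Event_Time") / 1000, "dd/MM/yyyy HH:mm:ss"))
```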
0
votes
0 answers
Arrow is not supported when using file-based collect, during conversion between pandas and Spark
I am trying to use Arrow by
setting spark.conf.set("spark.sql.execution.arrow.pyspark.enabled", "true"), but I get the following warning:
/databricks/spark/python/pyspark/sql/pandas/conversion.py:340: UserWarning: createDataFrame
attempted Arrow…

qaiser
- 2,770
- 2
- 17
- 29
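One way to debug this is to disable the silent fallback so the underlying Arrow incompatibility surfaces as an error instead of a warning. A config-only sketch to run inside the Spark session:

```python
spark.conf.set("spark.sql.execution.arrow.pyspark.enabled", "true")
# Fail loudly instead of silently falling back to the non-Arrow path:
spark.conf.set("spark.sql.execution.arrow.pyspark.fallback.enabled", "false")
```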
-1
votes
1 answer
Is there any efficient way to store streaming data from different stock exchanges in Python besides Parquet files while using CCXT library?
What is the best way to store streaming data from different stock exchanges in order to minimise storage size?
Right now I'm using the CCXT library in Python; to fetch the current order book information and save it into a Parquet file I use this code…
-1
votes
1 answer
Execute query in parallel over a list of rows in pyspark
In Databricks I have N Delta tables of stores with their products, with this schema:
store_1:

| store | product | sku |
| ----- | ------- | --- |
| 1     | prod 1  | abc |
| 1     | prod 2  | def |
| 1     | prod 3  | ghi |

store_2:

| store | product | sku |
| ----- | ------- | --- |
| 2     | prod 1  | abc |
| 2     | prod 10 | xyz |
| 2     | prod…   |     |

Andrés Bustamante
- 442
- 1
- 4
- 15
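Spark job submission from the driver is thread-safe, so one common pattern is to fan the per-table queries out with a thread pool. In this sketch `run_query` is a local stand-in for `spark.sql(...).collect()`; the table names come from the question:

```python
from concurrent.futures import ThreadPoolExecutor

tables = ["store_1", "store_2"]

def run_query(table):
    # Stand-in for spark.sql(f"SELECT ... FROM {table}").collect() on a cluster.
    return f"SELECT store, product, sku FROM {table}"

# Each query is submitted from its own thread; Spark schedules them in parallel.
with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(run_query, tables))
print(results)
```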
-1
votes
1 answer
Pyspark: read all files and write them back to the same files after transformation
Hi, I have files in a directory:
Folder/1.csv
Folder/2.csv
Folder/3.csv
I want to read all these files into a PySpark DataFrame/RDD, change some column values, and write them back to the same files.
I have tried it, but it creates a new file in the folder…

Priya p
- 1
- 2
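Spark reads lazily, so writing into the folder you are still reading from corrupts or duplicates the input. The safe pattern is: write the transformed data to a temporary path, then replace the original. A pure-Python sketch of that pattern for one CSV (the sample rows and the ×10 transformation are illustrative):

```python
import csv
import os
import tempfile

# Set up a sample input file standing in for Folder/1.csv.
folder = tempfile.mkdtemp()
path = os.path.join(folder, "1.csv")
with open(path, "w", newline="") as f:
    csv.writer(f).writerows([["a", "1"], ["b", "2"]])

# Transform into a temp file, never into the file being read.
tmp = path + ".tmp"
with open(path, newline="") as src, open(tmp, "w", newline="") as dst:
    writer = csv.writer(dst)
    for row in csv.reader(src):
        row[1] = str(int(row[1]) * 10)   # the "change some column value" step
        writer.writerow(row)
os.replace(tmp, path)                    # swap the transformed file in

print(open(path).read().splitlines())    # → ['a,10', 'b,20']
```

With Spark the equivalent is writing the DataFrame to a staging directory and moving it over the source afterwards.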
-1
votes
2 answers
compare two dataframes and display the data that are different
I have two dataframes and I want to compare the values of two columns and display those that are different. For example: compare this Table…

sunny
- 11
- 5
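One pandas approach is an outer merge with `indicator=True`: rows where both sides agree merge into one "both" row, so filtering "both" out leaves exactly the differing rows. The frames below are illustrative, since the question's tables are truncated:

```python
import pandas as pd

df1 = pd.DataFrame({"id": [1, 2, 3], "val": ["x", "y", "z"]})
df2 = pd.DataFrame({"id": [1, 2, 3], "val": ["x", "Y", "z"]})

# Merge on all shared columns; keep only rows that appear on a single side.
diff = df1.merge(df2, how="outer", indicator=True).query("_merge != 'both'")
print(diff)

# Spark equivalent for whole-row comparison: df1.exceptAll(df2).
```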