Questions tagged [pyspark-pandas]
131 questions
0
votes
0 answers
pyspark.pandas.read_sql_query from postgresql on AWS Glue
jdbcUrl = "jdbc:postgresql://"+domain+":"+str(port)+"/"+database+"?user="+user+"&password="+password+""
import pyspark.pandas as pd
pd.read_sql_query(select * from table, jdbcurl)
This gives an error on AWS Glue.
Any suggestions?
Official Documentation…

Pawan Rai
- 11
- 1
0
votes
1 answer
How to create this function in PySpark?
I have a large data frame, consisting of 400+ columns and 14000+ records, that I need to clean.
I have written Python code to do this, but due to the size of my dataset, I need to use PySpark to clean it. However, I am very unfamiliar with PySpark…
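A minimal sketch of one way to start, assuming the existing pandas cleaning logic is column-wise; the file path and column name below are illustrative:

import pyspark.pandas as ps

psdf = ps.read_csv("/path/to/data.csv")        # illustrative path
psdf = psdf.dropna(how="all")                  # drop fully empty rows
psdf["col_a"] = psdf["col_a"].str.strip()      # trim whitespace in a text column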

user21035178
- 3
- 1
0
votes
1 answer
pandas_udf with pd.Series and other object as arguments
I am having trouble creating a Pandas UDF that performs a calculation on a pd.Series based on a value in the same row of the underlying Spark DataFrame.
However, the most straightforward solution doesn't seem to be supported by the Pandas on…
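A minimal sketch with made-up data and column names, assuming the per-row value lives in a second column that is passed into the UDF alongside the series:

import pandas as pd
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(1.0, 2.0), (3.0, 4.0)], ["value", "factor"])

@F.pandas_udf("double")
def scaled(value: pd.Series, factor: pd.Series) -> pd.Series:
    # both series are row-aligned chunks of the same rows
    return value * factor

df = df.withColumn("scaled", scaled("value", "factor"))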

Wim Schmitz
- 15
- 6
0
votes
0 answers
Transforming PipelinedRDD to spark dataframe
I am trying to convert a PipelinedRDD into a Spark dataframe, but I am getting the following error:
ValueError: The truth value of a DataFrame is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().
Creating Spark session and taking a…
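A minimal sketch of the usual conversion, with made-up data standing in for the PipelinedRDD; the quoted ValueError is the message pandas raises when a pandas DataFrame is used in a boolean context, so it usually points at a pandas object slipping into the conversion somewhere.

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
rdd = spark.sparkContext.parallelize([("a", 1), ("b", 2)])   # stands in for the PipelinedRDD
df = spark.createDataFrame(rdd, ["key", "value"])            # schema as a list of column names
df.show()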

beeeZeee
- 51
- 1
- 2
- 10
0
votes
1 answer
Dropping rows in PySpark based on indexes
I'm working with a PySpark Pandas DataFrame that looks similar to this:
| col1 | col2                   | col3 |
|------|------------------------|------|
| 1    | 'C:\windows\a\folder1' | 3    |
| 2    | 'C:\windows\a\folder2' | 4    |
| 3 …
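A minimal sketch restating the sample data (the third row's values are made up, since the excerpt is truncated); pandas-on-Spark rows have no guaranteed order, so one common workaround is to filter on a column value rather than dropping by positional index:

import pyspark.pandas as ps

psdf = ps.DataFrame({"col1": [1, 2, 3],
                     "col2": [r"C:\windows\a\folder1",
                              r"C:\windows\a\folder2",
                              r"C:\windows\a\folder3"],
                     "col3": [3, 4, 5]})
psdf = psdf[~psdf["col1"].isin([1, 3])]        # keep everything except col1 values 1 and 3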

Joakim Torsvik
- 75
- 7
0
votes
0 answers
Read CSV or JSON files that were written by PySpark, from Python?
In PySpark, I use either df.write.json or df.write.csv to write a dataframe into CSV/JSON files. Below is an example:
df.write.json("s3://bucket/folder/")
df.write.csv("s3://bucket/folder/")
However, when I try to read these written files from…
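A minimal sketch, assuming the Spark output has been copied to a local folder: df.write.csv and df.write.json produce a directory of part files rather than a single file, so plain Python has to read every part; the local path is an assumption.

import glob
import pandas as pd

parts = sorted(glob.glob("/local/copy/folder/part-*.csv"))   # illustrative local copy of the S3 folder
df = pd.concat((pd.read_csv(p) for p in parts), ignore_index=True)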

armin
- 591
- 3
- 10
0
votes
1 answer
How do I reduce the run-time for Big Data PySpark scripts?
I am currently working on a project in Databricks with approximately 6 GiB of data in a single table, so you can imagine that the run-time on a table of this size is quite expensive.
I would call myself an experienced coder, but with big data I am…
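A minimal sketch of common first steps, assuming the Databricks-provided spark session and an illustrative table name: cache data that is reused across several actions, and keep the partition count in line with the cluster's cores.

df = spark.table("events")      # illustrative table name
df = df.repartition(64)         # partition count is an assumption; match it to the cluster
df.cache()
df.count()                      # materializes the cache before the expensive transformations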

Joakim Torsvik
- 75
- 7
0
votes
0 answers
Using PySpark Pandas API, how can I store multiple splits of a dataframe into a list without using a loop?
Using PySpark Pandas API, I would like to split the rows of a dataframe based on the column SEGMENT (then do some operations on each of these split dataframes later on).
d = {'col1': [1, 2, 3], 'SEGMENT': ['a', 'b', 'c']}
df = ps.DataFrame(data =…
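A minimal sketch restating the sample data; a dict comprehension keyed by segment value avoids an explicit loop body, although it still iterates over the distinct segment values:

import pyspark.pandas as ps

d = {'col1': [1, 2, 3], 'SEGMENT': ['a', 'b', 'c']}
df = ps.DataFrame(data=d)

splits = {seg: df[df["SEGMENT"] == seg]
          for seg in df["SEGMENT"].unique().to_numpy()}       # collects the distinct segments to the driver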

user10443249
- 87
- 5
0
votes
0 answers
PySpark: apply multiple groupBy UDFs
I am trying to call two UDFs within the same groupBy function.
I have one UDF that takes a group and returns a Pandas dataframe with one row and multiple columns.
I have another that takes just one feature and returns a single value.
Is there a way…
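A minimal sketch with made-up data and column names: one grouped pandas UDF via applyInPandas returning a single row with several columns, one ordinary aggregate, and a join on the group key to combine the two results.

import pandas as pd
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([("a", 1.0, 10.0), ("a", 2.0, 20.0), ("b", 3.0, 30.0)],
                           ["group", "x", "y"])

def summarize(pdf: pd.DataFrame) -> pd.DataFrame:
    # one output row per group, with as many columns as needed
    return pd.DataFrame({"group": [pdf["group"].iloc[0]],
                         "x_mean": [pdf["x"].mean()],
                         "x_max": [pdf["x"].max()]})

wide = df.groupBy("group").applyInPandas(summarize, "group string, x_mean double, x_max double")
narrow = df.groupBy("group").agg(F.avg("y").alias("y_avg"))
result = wide.join(narrow, on="group")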

Mike Anthony
- 5
- 2
0
votes
2 answers
Delete rows on the basis of another data frame if the data matches, and insert new data
I have two files: one is file1.csv and the other is file2.csv.
I have put the file1 data into one dataframe, and when the second file, file2.csv, arrives,
I have to write code in such a way that if the second file's data matches the first file's data on the basis…
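A minimal sketch, assuming the two files share a key column named "id" (an assumption, since the excerpt is truncated): keep the file1 rows whose key does not appear in file2, then append file2, so matching rows are effectively replaced by the newer data.

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df1 = spark.read.csv("file1.csv", header=True)
df2 = spark.read.csv("file2.csv", header=True)

merged = df1.join(df2.select("id"), on="id", how="left_anti").unionByName(df2)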

codetech
- 113
- 8
0
votes
1 answer
Error writing parquet to specific container in Azure Data Lake
I'm retrieving two files from container1, transforming them and merging before writing to container2 within the same Storage Account in Azure. I'm mounting container1, then unmounting it and mounting container2 before writing.
My code for writing the…
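A minimal sketch, assuming both containers stay mounted for the whole job (since Spark reads lazily, unmounting container1 before the write action runs can break the source); mount points and file names are illustrative, and spark is the Databricks-provided session.

df1 = spark.read.parquet("/mnt/container1/file_a.parquet")
df2 = spark.read.parquet("/mnt/container1/file_b.parquet")

merged = df1.unionByName(df2)
merged.write.mode("overwrite").parquet("/mnt/container2/merged/")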

Fred
- 1
- 1
0
votes
0 answers
PySpark PandasUDF with 2 different argument data types
I have a dataframe A with a column containing a string. I want to compare this string with another dataframe B, which only has one column that contains a list of tuples of strings. What I did so far: I transformed B into a list, through which I…
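A minimal sketch with made-up data, assuming B has already been collected into a plain Python list of strings: the list is captured by closure, so the pandas UDF itself only receives the string column.

import pandas as pd
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
df_a = spark.createDataFrame([("foo",), ("baz",)], ["text"])
reference = ["foo", "bar"]                      # stand-in for the list built from B

@F.pandas_udf("boolean")
def in_reference(s: pd.Series) -> pd.Series:
    return s.isin(reference)

df_a = df_a.withColumn("matched", in_reference("text"))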

Moritz
- 495
- 1
- 7
- 17
0
votes
0 answers
How to change timedelta format in pyspark?
I have a dataframe with values like those below. I'm able to achieve the expected output in pandas but not in PySpark.
SAMPLE INPUT DF
number    time
12344     5 days, 04 hours, 52 minutes, 10
14566     8 days, 16 hours, 10 minutes, 09
13477     0 days, 21 hours,…
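A minimal sketch using the first sample row, assuming the "time" strings always follow the pattern "<d> days, <hh> hours, <mm> minutes, <ss>"; the excerpt does not show the expected output, so this only parses the pieces into numeric columns.

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(12344, "5 days, 04 hours, 52 minutes, 10")], ["number", "time"])

df = (df.withColumn("days", F.regexp_extract("time", r"(\d+) days", 1).cast("int"))
        .withColumn("hours", F.regexp_extract("time", r"(\d+) hours", 1).cast("int"))
        .withColumn("minutes", F.regexp_extract("time", r"(\d+) minutes", 1).cast("int")))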

Anos
- 57
- 8
0
votes
2 answers
Pandas API on Spark - Difference between two date columns
I want the difference between two date columns as a number of days.
In a pandas dataframe, the difference between two "datetime64" columns is returned as a number of days,
but in a pyspark.pandas dataframe the difference is returned as an "int".
import…
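A minimal sketch with made-up data and column names: one workaround is to drop to the Spark API for an explicit day difference and then return to the pandas API on Spark.

import pandas as pd
import pyspark.pandas as ps
from pyspark.sql import functions as F

psdf = ps.DataFrame({"start": pd.to_datetime(["2023-01-01", "2023-02-01"]),
                     "end": pd.to_datetime(["2023-01-11", "2023-02-15"])})

sdf = psdf.to_spark()
sdf = sdf.withColumn("diff_days", F.datediff(F.col("end"), F.col("start")))
psdf = sdf.pandas_api()                        # back to pandas-on-Spark (Spark 3.2+)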

Rudra
- 138
- 6
0
votes
1 answer
PySpark Create a new lag() column from an existing column and fillna with existing column value
I am looking to convert my Pandas code to PySpark and create a new column from an existing one by grouping the data on 'session' and shifting the data to get the next row's value for 'next_timestamp'.
But for the last row in every group, I am getting…
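A minimal sketch with made-up session data and assumed column names: take the next row's timestamp within each session with lead(), and fall back to the row's own timestamp for the last row of each group instead of leaving it null.

from pyspark.sql import SparkSession, Window, functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
    [("s1", "2023-01-01 10:00:00"), ("s1", "2023-01-01 10:05:00"), ("s2", "2023-01-01 11:00:00")],
    ["session", "timestamp"],
)

w = Window.partitionBy("session").orderBy("timestamp")
df = df.withColumn("next_timestamp",
                   F.coalesce(F.lead("timestamp").over(w), F.col("timestamp")))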

user20480394
- 3
- 2