Questions tagged [pyspark-pandas]

131 questions
0
votes
0 answers

pyspark.pandas.read_sql_query from postgresql on AWS Glue

jdbcUrl = "jdbc:postgresql://"+domain+":"+str(port)+"/"+database+"?user="+user+"&password="+password+"" import pyspark.pandas as pd pd.read_sql_query(select * from table, jdbcurl) Giving error on AWS Glue Any suggestion?? Official Documentation…
0
votes
1 answer

How to create this function in PySpark?

I have a large data frame, consisting of 400+ columns and 14000+ records, that I need to clean. I have written Python code to do this, but due to the size of my dataset, I need to use PySpark to clean it. However, I am very unfamiliar with PySpark…
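A sketch under assumptions: the pandas-on-Spark API often lets existing pandas-style cleaning code run distributed with few changes. The file name and cleaning steps below are placeholders for the asker's logic:

```python
import pyspark.pandas as ps

psdf = ps.read_csv("data.csv")   # hypothetical source: 14000+ rows, 400+ columns
psdf = psdf.dropna(how="all")    # example cleaning step
psdf = psdf.drop_duplicates()    # example cleaning step
sdf = psdf.to_spark()            # back to a plain Spark DataFrame if needed
```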
0
votes
1 answer

pandas_udf with pd.Series and other object as arguments

I am having trouble creating a Pandas UDF that performs a calculation on a pd.Series based on a value in the same row of the underlying Spark DataFrame. However, the most straightforward solution doesn't seem to be supported by the Pandas on…
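A minimal sketch: a scalar pandas UDF may take several columns, each arriving as a pd.Series aligned row by row, so the per-row "other value" can simply be passed as a second column. The column names here are assumptions:

```python
import pandas as pd
from pyspark.sql.functions import pandas_udf

@pandas_udf("double")
def scale(values: pd.Series, factors: pd.Series) -> pd.Series:
    # Both Series are aligned row by row by Spark.
    return values * factors

df = df.withColumn("scaled", scale("values", "factors"))
```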
0
votes
0 answers

Transforming PipelinedRDD to spark dataframe

I am trying to convert a PipelinedRDD into a Spark dataframe, but I am getting the following error: ValueError: The truth value of a DataFrame is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all(). Creating Spark session and taking a…
beeeZeee
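A minimal conversion sketch, assuming an RDD of Row objects (the two-column rows here are hypothetical). One common trigger for the ambiguous-truth-value error is passing a pandas DataFrame in a boolean context, e.g. as a schema argument:

```python
from pyspark.sql import Row, SparkSession

spark = SparkSession.builder.getOrCreate()

rdd = spark.sparkContext.parallelize([Row(a=1, b="x"), Row(a=2, b="y")])
df = spark.createDataFrame(rdd)   # or rdd.toDF()
df.show()
```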
0
votes
1 answer

Dropping rows in PySpark based on indexes

I'm working with a PySpark Pandas DataFrame that looks similar to this:

| col1 | col2                   | col3 |
|------|------------------------|------|
| 1    | 'C:\windows\a\folder1' | 3    |
| 2    | 'C:\windows\a\folder2' | 4    |
| 3    | …
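A hedged sketch (the third row's values are assumptions, since the excerpt is truncated):

```python
import pyspark.pandas as ps

psdf = ps.DataFrame({
    "col1": [1, 2, 3],
    "col2": [r"C:\windows\a\folder1", r"C:\windows\a\folder2", r"C:\windows\a\folder3"],
    "col3": [3, 4, 5],
})

# Spark 3.4+ supports dropping by index label, mirroring pandas:
psdf = psdf.drop(index=[0, 2])

# On older versions, filter on the index instead:
# psdf = psdf[~psdf.index.to_series().isin([0, 2])]
```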
0
votes
0 answers

Read csv or json files that are written in Pyspark from Python?

In Pyspark, I use either df.write.json or df.write.csv to write a dataframe to csv/json files. Below is an example: df.write.json("s3://bucket/folder/") df.write.csv("s3://bucket/folder/") However, when I try to read these written files from…
armin
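A sketch under assumptions: Spark writes a directory of part-files rather than a single file, and its JSON output is JSON Lines (one object per line). This assumes the s3fs package is installed for S3 access from pandas:

```python
import pandas as pd
import s3fs

fs = s3fs.S3FileSystem()
# fs.glob returns keys without the "s3://" scheme, so it is re-added below.
parts = fs.glob("s3://bucket/folder/part-*.json")
df = pd.concat(pd.read_json(f"s3://{p}", lines=True) for p in parts)
```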
0
votes
1 answer

How do I reduce the run-time for Big Data PySpark scripts?

I am currently working on a project in Databricks with approximately 6 GiB of data in a single table, so the run-time of operations on a table like this is quite expensive. I would call myself an experienced coder, but with big data I am…
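A few common levers, as a hedged sketch (a Spark 3.x Databricks notebook is assumed, where `spark` and `df` already exist; effectiveness depends entirely on the actual query plan):

```python
spark.conf.set("spark.sql.adaptive.enabled", "true")  # adaptive query execution
df = df.repartition(64, "key_col")  # hypothetical join/group key; balances partitions
df.cache()                          # reuse the table across multiple actions
df.count()                          # materialize the cache once
```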
0
votes
0 answers

Using PySpark Pandas API, how can I store multiple splits of a dataframe into a list without using a loop?

Using PySpark Pandas API, I would like to split the rows of a dataframe based on the column SEGMENT (then do some operations on each of these split dataframes later on). d = {'col1': [1, 2, 3], 'SEGMENT': ['a', 'b', 'c']} df = ps.DataFrame(data =…
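A sketch: a dict comprehension over the distinct SEGMENT values (not the rows) builds one lazily evaluated filtered frame per segment, so no row-level Python loop touches the data itself:

```python
import pyspark.pandas as ps

df = ps.DataFrame({"col1": [1, 2, 3], "SEGMENT": ["a", "b", "c"]})

segments = df["SEGMENT"].unique().to_numpy()  # small: one entry per segment
splits = {seg: df[df["SEGMENT"] == seg] for seg in segments}
```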
0
votes
0 answers

PySpark: apply multiple groupBy UDFs

I am trying to call two UDFs within the same groupBy function. I have one UDF that takes a group and returns a Pandas dataframe with one row and multiple columns, and another that takes just one feature and returns a single value. Is there a way…
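A hedged sketch: grouped-map execution (applyInPandas) allows only one function per groupBy, so one option is to fold both computations into a single UDF. mean_x and max_y are hypothetical stand-ins for the asker's two UDFs:

```python
import pandas as pd

def combined(pdf: pd.DataFrame) -> pd.DataFrame:
    out = pd.DataFrame({"group_col": [pdf["group_col"].iloc[0]],
                        "mean_x": [pdf["x"].mean()]})  # the "row-returning" UDF
    out["max_y"] = pdf["y"].max()                      # the "single-value" UDF
    return out

result = df.groupBy("group_col").applyInPandas(
    combined, schema="group_col string, mean_x double, max_y double")
```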
0
votes
2 answers

Delete rows on the basis of another data frame if the data matched and insert new data

I have two files, file1.csv and file2.csv. I have put the file1 data in one dataframe, and when the second file, file2.csv, arrives, I have to write code such that if the second file's data matches the first file's data on the basis…
codetech
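A sketch of the delete-then-insert pattern, assuming "id" is the match key (the excerpt is truncated before naming it):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

df1 = spark.read.csv("file1.csv", header=True)
df2 = spark.read.csv("file2.csv", header=True)

kept = df1.join(df2, on="id", how="left_anti")  # file1 rows with no match in file2
result = kept.unionByName(df2)                  # then append the new data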
0
votes
1 answer

Error writing parquet to specific container in Azure Data Lake

I'm retrieving two files from container1, transforming them, and merging them before writing to container2 within the same Storage Account in Azure. I'm mounting container1, then unmounting it and mounting container2 before writing. My code for writing the…
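A hedged sketch: one option is to keep both mounts alive for the whole job (or skip mounting and address the containers directly with abfss:// paths and credentials) rather than unmounting container1 before the write. All paths are placeholders, and `spark` is assumed to be the Databricks session:

```python
df1 = spark.read.parquet("/mnt/container1/file1.parquet")
df2 = spark.read.parquet("/mnt/container1/file2.parquet")
merged = df1.unionByName(df2)
merged.write.mode("overwrite").parquet("/mnt/container2/merged/")
```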
0
votes
0 answers

PySpark PandasUDF with 2 different argument data types

I have a dataframe A with a column containing a string. I want to compare this string with another dataframe B, which only has one column that contains a list of tuples of strings. What I did so far: I transformed B into a list, through which I…
Moritz
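A sketch under assumptions: since dataframe B is small, collect it once on the driver and capture it in the UDF's closure, so only pd.Series arguments cross the UDF boundary. ref_tuples and text_col are hypothetical names:

```python
import pandas as pd
from pyspark.sql.functions import pandas_udf

ref_tuples = [("foo", "bar"), ("baz", "qux")]   # e.g. collected from B
flat = {s for tup in ref_tuples for s in tup}   # flatten for membership tests

@pandas_udf("boolean")
def in_ref(col: pd.Series) -> pd.Series:
    # The constant set is captured by the closure, not passed as a column.
    return col.isin(flat)

dfA = dfA.withColumn("matched", in_ref("text_col"))
```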
0
votes
0 answers

How to change timedelta format in pyspark?

I have a dataframe like the one below; I'm able to achieve the expected output in pandas but not in pyspark. Sample input:

| number | time                             |
|--------|----------------------------------|
| 12344  | 5 days, 04 hours, 52 minutes, 10 |
| 14566  | 8 days, 16 hours, 10 minutes, 09 |
| 13477  | 0 days, 21 hours,…
Anos
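A hedged sketch, assuming the time column is a string shaped like "D days, HH hours, MM minutes, SS": extract each field, convert to total seconds, then reformat however the expected output requires:

```python
from pyspark.sql import functions as F

pattern = r"(\d+) days, (\d+) hours, (\d+) minutes, (\d+)"
df = df.withColumn(
    "total_seconds",
    F.regexp_extract("time", pattern, 1).cast("long") * 86400
    + F.regexp_extract("time", pattern, 2).cast("long") * 3600
    + F.regexp_extract("time", pattern, 3).cast("long") * 60
    + F.regexp_extract("time", pattern, 4).cast("long"),
)
```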
0
votes
2 answers

Pandas API on Spark - Difference between two date columns

I want the difference between two date columns as a number of days. In a pandas dataframe, the difference between two "datetime64" columns returns the number of days, but in a pyspark.pandas dataframe the difference is returned as an "int". import…
Rudra
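A hedged sketch: pandas-on-Spark appears to return the difference as an integer number of seconds rather than a timedelta, so integer-dividing by 86400 recovers whole days. The sample dates are made up:

```python
import pyspark.pandas as ps

psdf = ps.DataFrame({"start": ["2023-01-01"], "end": ["2023-01-11"]})
psdf["start"] = ps.to_datetime(psdf["start"])
psdf["end"] = ps.to_datetime(psdf["end"])

# Integer seconds -> whole days.
psdf["days"] = (psdf["end"] - psdf["start"]) // 86400
```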
0
votes
1 answer

PySpark Create a new lag() column from an existing column and fillna with existing column value

I am looking to convert my Pandas code to PySpark and create a new column from an existing one by grouping the data on 'session' and shifting it to get the next row's value as 'next_timestamp'. But for the last row in every group, I am getting…
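A minimal sketch, assuming a 'timestamp' column orders the rows within each session: lead() fetches the next row's value, and coalesce() falls back to the current row's own timestamp for the last row of each group:

```python
from pyspark.sql import functions as F
from pyspark.sql.window import Window

w = Window.partitionBy("session").orderBy("timestamp")
df = df.withColumn(
    "next_timestamp",
    F.coalesce(F.lead("timestamp").over(w), F.col("timestamp")),
)
```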