Questions tagged [pyspark-pandas]

131 questions
0 votes · 1 answer

Pandas-on-Spark API throws a NotImplementedError even though the functionality should be implemented

I am facing a weird issue with pandas-on-Spark. I am trying to use regex to replace abbreviations with their full counterparts. The function I am using is the following (simplified a bit): def resolve_abbreviations(job_list: pspd.Series) ->…
Psychotechnopath · 2,471 · 5 · 26 · 47
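A minimal sketch of one workaround, assuming a hypothetical abbreviation map (the question's real patterns are not shown): drop to the underlying Spark DataFrame, where regexp_replace is fully implemented, then convert back to pandas-on-Spark.

```python
import pyspark.pandas as ps
from pyspark.sql import functions as F

# Hypothetical abbreviation map; the question's real patterns are not shown.
ABBREVIATIONS = {r"\bmgr\b": "manager", r"\bsr\b": "senior"}

def resolve_abbreviations(job_list: ps.Series) -> ps.Series:
    # Work on the underlying Spark column, where regexp_replace is
    # fully implemented, then come back to pandas-on-Spark.
    sdf = job_list.to_frame(name="job").to_spark()
    col = F.col("job")
    for pattern, replacement in ABBREVIATIONS.items():
        col = F.regexp_replace(col, pattern, replacement)
    return sdf.select(col.alias("job")).pandas_api()["job"]

print(resolve_abbreviations(ps.Series(["sr mgr", "mgr of ops"])))
```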
0 votes · 0 answers

PySpark code to implement logic that creates dynamic columns and applies an if condition

I currently have code in which I'm using withColumn() to create the columns and a when() statement to decide the values for each column. However, I'm looking for an approach that implements it in a function call that creates the column name…
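A hedged sketch of such a function, assuming a hypothetical rules dict (the column names and conditions are illustrative only):

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(1, 50), (2, 150)], ["id", "amount"])

# Hypothetical rules: new column name -> (condition, value if true, value if false).
rules = {
    "amount_band": (F.col("amount") > 100, "high", "low"),
}

def add_dynamic_columns(df, rules):
    # Loop over the rule set so column names never have to be hard-coded.
    for name, (cond, if_true, if_false) in rules.items():
        df = df.withColumn(name, F.when(cond, if_true).otherwise(if_false))
    return df

add_dynamic_columns(df, rules).show()
```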
0 votes · 1 answer

How to remove quotes from a column in a PySpark dataframe?

I have a CSV file in which I am getting double quotes in a column. While reading and writing I have to remove those quotes. Please guide me on how I can do it. Example- df: col1 "xyznm""cxvb" I want the output below- col1 xyznm""cxvb I have written below…
alka · 71 · 7
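A sketch of one way to strip only the outermost quotes with regexp_replace, using the sample value from the question:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([('"xyznm""cxvb"',)], ["col1"])

# Strip only the leading and trailing double quote; the doubled
# quotes inside the value survive.
cleaned = df.withColumn("col1", F.regexp_replace("col1", r'^"|"$', ""))
cleaned.show(truncate=False)  # xyznm""cxvb
```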
0 votes · 0 answers

PandasNotImplementedError in Databricks

I'm using pandas in Databricks, with import pyspark.pandas as ps. After reading two tables as dataframes, df and df_aux, I'm executing the following line: index_list = df.loc[~df['Column_A'].isin(df_aux)].index But it raises the following…
datadatadata · 119 · 6
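One possible fix, sketched with toy data: pandas-on-Spark's isin() accepts list-like values but not a whole DataFrame, so collecting the lookup column to a plain Python list sidesteps the NotImplementedError. This only scales while df_aux is small enough to fit on the driver.

```python
import pyspark.pandas as ps

df = ps.DataFrame({"Column_A": [1, 2, 3, 4]})
df_aux = ps.DataFrame({"Column_A": [2, 4]})

# Collect the lookup values to the driver; isin() is happy with a list.
lookup = df_aux["Column_A"].tolist()
index_list = df.loc[~df["Column_A"].isin(lookup)].index
print(index_list)
```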
0 votes · 0 answers

Pandas to PySpark

"file_timestamp = pd.to_datetime(CommonUtils.getLatestModifiedTime(folder_name, file_prefix,mbox_path)) print(file_timestamp) Want to convert this code to pyspark file_timestamp = (spark.read.format("csv")\ .option("header",…
0 votes · 0 answers

TypeError: < can not be applied to decimal

I get this error when doing the following pyspark.pandas dataframe operation: billplan.loc[billplan.SEQ_NO < billplan.MAX_SEQ_NO, 'GA_TYPE'] = 'RE-CONTRACT' where billplan.SEQ_NO and billplan.MAX_SEQ_NO are both of type |-- SEQ_NO: decimal(38,0)…
Ee Ann Ng · 109 · 1 · 8
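One workaround sketch, with toy data and a hypothetical GA_TYPE default: cast the decimal(38,0) columns to a plain integer type before comparing, so pandas-on-Spark has a type it can apply < to.

```python
import pyspark.pandas as ps

billplan = ps.DataFrame({"SEQ_NO": [1, 3], "MAX_SEQ_NO": [3, 3]})
billplan["GA_TYPE"] = "NEW"  # hypothetical default value

# When the columns arrive as decimal(38,0), casting to int64 first
# gives pandas-on-Spark something it can compare.
billplan["SEQ_NO"] = billplan["SEQ_NO"].astype("int64")
billplan["MAX_SEQ_NO"] = billplan["MAX_SEQ_NO"].astype("int64")
billplan.loc[billplan.SEQ_NO < billplan.MAX_SEQ_NO, "GA_TYPE"] = "RE-CONTRACT"
print(billplan)
```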
0 votes · 1 answer

How to iterate over grouped PySpark Pandas dataframe

I have a grouped pyspark.pandas dataframe ==> 'groups', and I'm trying to iterate over the groups the same way it's possible in pandas: import pyspark.pandas as ps dataframe = ps.read_excel("data.xlsx") groups = dataframe.groupby(['col1',…
elj96 · 53 · 3
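A sketch of one workaround with made-up columns: pandas-on-Spark GroupBy objects are not iterable, so collect the distinct keys to the driver and filter per group. For large data, groupby().apply() is the distributed alternative.

```python
import pyspark.pandas as ps

df = ps.DataFrame({"col1": ["a", "a", "b"], "col2": [1, 1, 2], "val": [10, 20, 30]})

# Collect only the distinct group keys, then filter once per key.
keys = df[["col1", "col2"]].drop_duplicates().to_pandas()
for col1, col2 in keys.itertuples(index=False):
    group = df[(df["col1"] == col1) & (df["col2"] == col2)]
    print((col1, col2), len(group))
```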
0 votes · 0 answers

Merge two dataframes such that one column array_contains the other

I am currently working on a task that requires me to tokenize the conference titles on DF1 (column "name"), remove stopwords, and count the number of occurrences of each word. What I have done so far is extract and filter the words (column…
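A hedged sketch with hypothetical column names (id, tokens, word): join on array membership via array_contains, then count occurrences per word.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# Hypothetical shapes: DF1 carries the tokenized titles, DF2 the words.
df1 = spark.createDataFrame([(1, ["deep", "learning", "conf"])], ["id", "tokens"])
df2 = spark.createDataFrame([("learning",), ("vision",)], ["word"])

# Join on array membership, then count how often each word appears.
joined = df1.join(df2, F.expr("array_contains(tokens, word)"))
joined.groupBy("word").count().show()
```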
0 votes · 1 answer

PySpark: Compare column values across different dataframes

We are planning to do the following: compare two dataframes, add values into the first dataframe based on the comparison, and then group by to get the combined data. We are using pyspark dataframes and the following are our dataframes. Dataframe1: | Manager |…
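A minimal sketch under assumed column names (Manager, count; the question's full tables are truncated): left-join on the key, default the missing side to 0, then aggregate.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# Hypothetical frames; the question's real tables key on Manager.
df1 = spark.createDataFrame([("Alice", 10), ("Bob", 5)], ["Manager", "count"])
df2 = spark.createDataFrame([("Alice", 3)], ["Manager", "count"])

# Left-join on the key, fill missing values with 0, then aggregate.
combined = (
    df1.join(df2.withColumnRenamed("count", "count2"), "Manager", "left")
       .withColumn("count2", F.coalesce("count2", F.lit(0)))
       .groupBy("Manager")
       .agg(F.sum(F.col("count") + F.col("count2")).alias("total"))
)
combined.show()
```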
0 votes · 2 answers

PySpark: Create a condition from a string

I have to apply conditions to pyspark dataframes based on a distribution. My distribution looks like: mp = [413, 291, 205, 169, 135] and I am generating the condition expression like this: when_decile = (F.when((F.col(colm) >= float(mp[0])), F.lit(1)) …
karan · 79 · 1 · 7
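One way to build that chained when() programmatically rather than from a string, sketched with a hypothetical score column:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(450.0,), (300.0,), (100.0,)], ["score"])

mp = [413, 291, 205, 169, 135]  # cut points from the question
colm = "score"  # hypothetical column name

# Build the chained when() in a loop instead of spelling out every branch.
when_decile = None
for i, cut in enumerate(mp, start=1):
    cond = F.col(colm) >= float(cut)
    when_decile = F.when(cond, i) if when_decile is None else when_decile.when(cond, i)
when_decile = when_decile.otherwise(len(mp) + 1)

df.withColumn("decile", when_decile).show()
```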
0 votes · 1 answer

How to create lag columns and union multiple dataframes in pyspark?

I am trying to create lag columns for several dataframes individually and then combine them into a single dataframe. Because PySpark is lazily evaluated, it calculates the lag after combining the dataframes. from pyspark.sql.functions import lag, col from…
Vinay · 1,149 · 3 · 16 · 28
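A sketch of one fix, with toy frames: tag each dataframe with its source before the union and partition the lag window by that tag, so lazy evaluation can never compute lag across frame boundaries.

```python
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.window import Window

spark = SparkSession.builder.getOrCreate()
df_a = spark.createDataFrame([(1, 10), (2, 20)], ["t", "val"])
df_b = spark.createDataFrame([(1, 30), (2, 40)], ["t", "val"])

# Tag each frame, union, then lag within a window partitioned by the tag.
tagged = [df.withColumn("src", F.lit(name))
          for name, df in [("a", df_a), ("b", df_b)]]
combined = tagged[0].unionByName(tagged[1])
w = Window.partitionBy("src").orderBy("t")
combined.withColumn("val_lag", F.lag("val").over(w)).show()
```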
0 votes · 0 answers

How does PySpark allow columns with special characters?

The dataframe df_problematic in PySpark has the following columns: +------------+-----------+------------+ |sepal@length|sepal.width|petal_length| +------------+-----------+------------+ | 5.1| 3.5| 1.4| | 4.9| …
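A short sketch of the usual answers, using the question's column names: quote such names in backticks, or rename them to safe identifiers.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
df_problematic = spark.createDataFrame(
    [(5.1, 3.5, 1.4), (4.9, 3.0, 1.4)],
    ["sepal@length", "sepal.width", "petal_length"],
)

# Backticks make Spark treat the whole name, dot and all, as one column.
df_problematic.select(F.col("`sepal@length`"), F.col("`sepal.width`")).show()

# Renaming to safe identifiers is usually the more durable fix.
safe = df_problematic.toDF(*[c.replace("@", "_").replace(".", "_")
                             for c in df_problematic.columns])
safe.printSchema()
```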
0 votes · 0 answers

Create dynamic columns in PySpark based on inputs and conditions

I have dataframe1 (see image) and dataframe2 (see image), and the inputs are dynamic for dataframe1. I have to create a final dataframe, taking into account the dynamic nature of dataframe-1 and its mapping to the columns of dataframe-2. Output1:…
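Since the question's images are not reproduced here, the sketch below only illustrates the general pattern, with a hypothetical mapping dict driving the dynamic columns:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

df1 = spark.createDataFrame([(1, "x"), (2, "y")], ["id", "attr_1"])
mapping = {"attr_1": "mapped_col_1"}  # hypothetical column mapping

out = df1
for src, dst in mapping.items():
    # Each dynamic input column produces one mapped output column.
    out = out.withColumn(dst, F.col(src))
out.show()
```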
0 votes · 1 answer

What is the best practice to handle a non-datetime timestamp column within a pandas dataframe?

Let's say I have the following pandas dataframe with a non-standard timestamp column that is not in datetime format. I need to include a new column and convert it into a 24-hourly-based timestamp for time-series visualization…
Mario · 1,631 · 2 · 21 · 51
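A minimal pandas sketch, assuming a made-up timestamp format since the question's exact one is not shown: parse with an explicit format string, then resample to hourly bins.

```python
import pandas as pd

# Hypothetical non-standard timestamps; the question's exact format
# is not shown.
df = pd.DataFrame({"ts": ["20230401-0730", "20230401-1945"],
                   "value": [1.0, 2.0]})

# Parse with an explicit format, then bin to the hour for plotting.
df["timestamp"] = pd.to_datetime(df["ts"], format="%Y%m%d-%H%M")
hourly = df.set_index("timestamp")["value"].resample("1H").mean()
print(hourly)
```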
0 votes · 1 answer

Generate a subsample based on age using PySpark

I wanted to collect a sample based on age, with a condition on the failure status. I am interested in serial numbers that are 3 days old. However, I don't need healthy serial numbers that are less than 3 days old, but I want to include all failed serial numbers…
ForestGump · 50 · 2 · 19
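A sketch of that filter under an assumed schema (serial, age_days, failed), since the question's real columns are not shown:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# Hypothetical schema: serial number, age in days, failure flag.
df = spark.createDataFrame(
    [("s1", 1, 0), ("s2", 5, 0), ("s3", 2, 1)],
    ["serial", "age_days", "failed"],
)

# Keep every failed serial regardless of age, plus healthy serials
# that are at least 3 days old.
sample = df.filter((F.col("failed") == 1) | (F.col("age_days") >= 3))
sample.show()
```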