Questions tagged [pyspark-pandas]

131 questions
0 votes · 1 answer

Pandas-on-Spark API throws a NotImplementedError even though the functionality should be implemented

I am facing a weird issue with pandas-on-Spark. I am trying to use regex to replace abbreviations with their full counterparts. The function I am using is the following (simplified a bit): def resolve_abbreviations(job_list: pspd.Series) ->…
Psychotechnopath · 2,471 · 5 · 26 · 47
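A minimal sketch of one workaround, assuming a hypothetical abbreviation map (the question's real patterns are not shown): drop to the underlying Spark DataFrame, where regexp_replace is fully implemented, then convert back to pandas-on-Spark.

```python
import pyspark.pandas as ps
from pyspark.sql import functions as F

# Hypothetical abbreviation map; the question's real patterns are not shown.
ABBREVIATIONS = {r"\bmgr\b": "manager", r"\bsr\b": "senior"}

def resolve_abbreviations(job_list: ps.Series) -> ps.Series:
    # Work on the underlying Spark column, where regexp_replace is
    # fully implemented, then come back to pandas-on-Spark.
    sdf = job_list.to_frame(name="job").to_spark()
    col = F.col("job")
    for pattern, replacement in ABBREVIATIONS.items():
        col = F.regexp_replace(col, pattern, replacement)
    return sdf.select(col.alias("job")).pandas_api()["job"]

print(resolve_abbreviations(ps.Series(["sr mgr", "mgr of ops"])))
```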
0 votes · 0 answers

PySpark code to implement logic that creates dynamic columns and applies an if condition

I currently have code in which I'm using withColumn() to create the columns and a when() statement to decide the values for each column. However, I'm looking for an approach that implements it in a function call that creates the column name…
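A hedged sketch of such a function, assuming a hypothetical rules dict (the column names and conditions are illustrative only):

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(1, 50), (2, 150)], ["id", "amount"])

# Hypothetical rules: new column name -> (condition, value if true, value if false).
rules = {
    "amount_band": (F.col("amount") > 100, "high", "low"),
}

def add_dynamic_columns(df, rules):
    # Loop over the rule set so column names never have to be hard-coded.
    for name, (cond, if_true, if_false) in rules.items():
        df = df.withColumn(name, F.when(cond, if_true).otherwise(if_false))
    return df

add_dynamic_columns(df, rules).show()
```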
0 votes · 1 answer

How to remove quotes from a column in a PySpark dataframe?

I have a CSV file in which I am getting double quotes in a column. While reading and writing I have to remove those quotes. Please guide me on how I can do it. Example- df: col1 "xyznm""cxvb" I want the output below- col1 xyznm""cxvb I have written below…
alka · 71 · 7
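A sketch of one way to strip only the outermost quotes with regexp_replace, using the sample value from the question:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([('"xyznm""cxvb"',)], ["col1"])

# Strip only the leading and trailing double quote; the doubled
# quotes inside the value survive.
cleaned = df.withColumn("col1", F.regexp_replace("col1", r'^"|"$', ""))
cleaned.show(truncate=False)  # xyznm""cxvb
```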
0 votes · 0 answers

PandasNotImplementedError in Databricks

I'm using pandas in Databricks, with import pyspark.pandas as ps. After reading two tables as dataframes, df and df_aux, I'm executing the following line: index_list = df.loc[~df['Column_A'].isin(df_aux)].index But it raises the following…
datadatadata · 119 · 6
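One possible fix, sketched with toy data: pandas-on-Spark's isin() accepts list-like values but not a whole DataFrame, so collecting the lookup column to a plain Python list sidesteps the NotImplementedError. This only scales while df_aux is small enough to fit on the driver.

```python
import pyspark.pandas as ps

df = ps.DataFrame({"Column_A": [1, 2, 3, 4]})
df_aux = ps.DataFrame({"Column_A": [2, 4]})

# Collect the lookup values to the driver; isin() is happy with a list.
lookup = df_aux["Column_A"].tolist()
index_list = df.loc[~df["Column_A"].isin(lookup)].index
print(index_list)
```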
0 votes · 0 answers

Pandas to PySpark

"file_timestamp = pd.to_datetime(CommonUtils.getLatestModifiedTime(folder_name, file_prefix,mbox_path)) print(file_timestamp) Want to convert this code to pyspark file_timestamp = (spark.read.format("csv")\ .option("header",…
0 votes · 0 answers

TypeError: < can not be applied to decimal

I get this error when doing the following pyspark.pandas dataframe operation: billplan.loc[billplan.SEQ_NO < billplan.MAX_SEQ_NO, 'GA_TYPE'] = 'RE-CONTRACT' where billplan.SEQ_NO and billplan.MAX_SEQ_NO are both of type |-- SEQ_NO: decimal(38,0)…
Ee Ann Ng · 109 · 1 · 8
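One workaround sketch, with toy data and a hypothetical GA_TYPE default: cast the decimal(38,0) columns to a plain integer type before comparing, so pandas-on-Spark has a type it can apply < to.

```python
import pyspark.pandas as ps

billplan = ps.DataFrame({"SEQ_NO": [1, 3], "MAX_SEQ_NO": [3, 3]})
billplan["GA_TYPE"] = "NEW"  # hypothetical default value

# When the columns arrive as decimal(38,0), casting to int64 first
# gives pandas-on-Spark something it can compare.
billplan["SEQ_NO"] = billplan["SEQ_NO"].astype("int64")
billplan["MAX_SEQ_NO"] = billplan["MAX_SEQ_NO"].astype("int64")
billplan.loc[billplan.SEQ_NO < billplan.MAX_SEQ_NO, "GA_TYPE"] = "RE-CONTRACT"
print(billplan)
```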
0 votes · 1 answer

How to iterate over grouped PySpark Pandas dataframe

I have a grouped pyspark.pandas dataframe ==> 'groups', and I'm trying to iterate over the groups the same way it's possible in pandas: import pyspark.pandas as ps dataframe = ps.read_excel("data.xlsx") groups = dataframe.groupby(['col1',…
elj96 · 53 · 3
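A sketch of one workaround with made-up columns: pandas-on-Spark GroupBy objects are not iterable, so collect the distinct keys to the driver and filter per group. For large data, groupby().apply() is the distributed alternative.

```python
import pyspark.pandas as ps

df = ps.DataFrame({"col1": ["a", "a", "b"], "col2": [1, 1, 2], "val": [10, 20, 30]})

# Collect only the distinct group keys, then filter once per key.
keys = df[["col1", "col2"]].drop_duplicates().to_pandas()
for col1, col2 in keys.itertuples(index=False):
    group = df[(df["col1"] == col1) & (df["col2"] == col2)]
    print((col1, col2), len(group))
```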
0 votes · 0 answers

Merge two dataframes such that one column array_contains the other

I am currently working on a task that requires me to tokenize the conference titles on DF1 (column "name"), remove stopwords, and count the number of occurrences of each word. What I have done so far is extract and filter the words (column…
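A hedged sketch with hypothetical column names (id, tokens, word): join on array membership via array_contains, then count occurrences per word.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# Hypothetical shapes: DF1 carries the tokenized titles, DF2 the words.
df1 = spark.createDataFrame([(1, ["deep", "learning", "conf"])], ["id", "tokens"])
df2 = spark.createDataFrame([("learning",), ("vision",)], ["word"])

# Join on array membership, then count how often each word appears.
joined = df1.join(df2, F.expr("array_contains(tokens, word)"))
joined.groupBy("word").count().show()
```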
0 votes · 1 answer

PySpark: Compare column values across different dataframes

We are planning to do the following: compare two dataframes, add values into the first dataframe based on the comparison, and then group by to get the combined data. We are using pyspark dataframes and the following are our dataframes. Dataframe1: | Manager |…
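A minimal sketch under assumed column names (Manager, count; the question's full tables are truncated): left-join on the key, default the missing side to 0, then aggregate.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# Hypothetical frames; the question's real tables key on Manager.
df1 = spark.createDataFrame([("Alice", 10), ("Bob", 5)], ["Manager", "count"])
df2 = spark.createDataFrame([("Alice", 3)], ["Manager", "count"])

# Left-join on the key, fill missing values with 0, then aggregate.
combined = (
    df1.join(df2.withColumnRenamed("count", "count2"), "Manager", "left")
       .withColumn("count2", F.coalesce("count2", F.lit(0)))
       .groupBy("Manager")
       .agg(F.sum(F.col("count") + F.col("count2")).alias("total"))
)
combined.show()
```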
0 votes · 2 answers

PySpark: Create a condition from a string

I have to apply conditions to pyspark dataframes based on a distribution. My distribution looks like: mp = [413, 291, 205, 169, 135] and I am generating the condition expression like this: when_decile = (F.when((F.col(colm) >= float(mp[0])), F.lit(1)) …
karan · 79 · 1 · 7
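One way to build that chained when() programmatically rather than from a string, sketched with a hypothetical score column:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(450.0,), (300.0,), (100.0,)], ["score"])

mp = [413, 291, 205, 169, 135]  # cut points from the question
colm = "score"  # hypothetical column name

# Build the chained when() in a loop instead of spelling out every branch.
when_decile = None
for i, cut in enumerate(mp, start=1):
    cond = F.col(colm) >= float(cut)
    when_decile = F.when(cond, i) if when_decile is None else when_decile.when(cond, i)
when_decile = when_decile.otherwise(len(mp) + 1)

df.withColumn("decile", when_decile).show()
```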
0 votes · 1 answer

How to create lag columns and union multiple dataframes in pyspark?

I am trying to create lag columns for several dataframes individually and then combine them into a single dataframe. Because PySpark is lazily evaluated, it calculates the lag after combining the dataframes. from pyspark.sql.functions import lag, col from…
Vinay · 1,149 · 3 · 16 · 28
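A sketch of one fix, with toy frames: tag each dataframe with its source before the union and partition the lag window by that tag, so lazy evaluation can never compute lag across frame boundaries.

```python
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.window import Window

spark = SparkSession.builder.getOrCreate()
df_a = spark.createDataFrame([(1, 10), (2, 20)], ["t", "val"])
df_b = spark.createDataFrame([(1, 30), (2, 40)], ["t", "val"])

# Tag each frame, union, then lag within a window partitioned by the tag.
tagged = [df.withColumn("src", F.lit(name))
          for name, df in [("a", df_a), ("b", df_b)]]
combined = tagged[0].unionByName(tagged[1])
w = Window.partitionBy("src").orderBy("t")
combined.withColumn("val_lag", F.lag("val").over(w)).show()
```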
0 votes · 0 answers

How does PySpark allow columns with special characters?

The dataframe df_problematic in PySpark has the following columns: +------------+-----------+------------+ |sepal@length|sepal.width|petal_length| +------------+-----------+------------+ | 5.1| 3.5| 1.4| | 4.9| …
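A short sketch of the usual answers, using the question's column names: quote such names in backticks, or rename them to safe identifiers.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
df_problematic = spark.createDataFrame(
    [(5.1, 3.5, 1.4), (4.9, 3.0, 1.4)],
    ["sepal@length", "sepal.width", "petal_length"],
)

# Backticks make Spark treat the whole name, dot and all, as one column.
df_problematic.select(F.col("`sepal@length`"), F.col("`sepal.width`")).show()

# Renaming to safe identifiers is usually the more durable fix.
safe = df_problematic.toDF(*[c.replace("@", "_").replace(".", "_")
                             for c in df_problematic.columns])
safe.printSchema()
```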
0 votes · 0 answers

Create dynamic columns in PySpark based on inputs and conditions

I have dataframe1 (see image) and dataframe2 (see image), and the inputs are dynamic for dataframe1. I have to create a final dataframe, taking into account the dynamic nature of dataframe-1 and its mapping to the columns of dataframe-2. Output1:…
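Since the question's images are not reproduced here, the sketch below only illustrates the general pattern, with a hypothetical mapping dict driving the dynamic columns:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

df1 = spark.createDataFrame([(1, "x"), (2, "y")], ["id", "attr_1"])
mapping = {"attr_1": "mapped_col_1"}  # hypothetical column mapping

out = df1
for src, dst in mapping.items():
    # Each dynamic input column produces one mapped output column.
    out = out.withColumn(dst, F.col(src))
out.show()
```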
0 votes · 1 answer

What is the best practice to handle a non-datetime timestamp column within a pandas dataframe?

Let's say I have the following pandas dataframe with a non-standard timestamp column that is not in datetime format. I need to include a new column and convert it into a 24-hourly-based timestamp for time-series visualization…
Mario · 1,631 · 2 · 21 · 51
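A minimal pandas sketch, assuming a made-up timestamp format since the question's exact one is not shown: parse with an explicit format string, then resample to hourly bins.

```python
import pandas as pd

# Hypothetical non-standard timestamps; the question's exact format
# is not shown.
df = pd.DataFrame({"ts": ["20230401-0730", "20230401-1945"],
                   "value": [1.0, 2.0]})

# Parse with an explicit format, then bin to the hour for plotting.
df["timestamp"] = pd.to_datetime(df["ts"], format="%Y%m%d-%H%M")
hourly = df.set_index("timestamp")["value"].resample("1H").mean()
print(hourly)
```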
0 votes · 1 answer

Generate a subsample based on age using PySpark

I wanted to collect a sample based on age, with a condition on the failure status. I am interested in serial numbers that are 3 days old. However, I don't need healthy serial numbers that are less than 3 days old, but I want to include all failed serial numbers…
ForestGump · 50 · 2 · 19
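A sketch of that filter under an assumed schema (serial, age_days, failed), since the question's real columns are not shown:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# Hypothetical schema: serial number, age in days, failure flag.
df = spark.createDataFrame(
    [("s1", 1, 0), ("s2", 5, 0), ("s3", 2, 1)],
    ["serial", "age_days", "failed"],
)

# Keep every failed serial regardless of age, plus healthy serials
# that are at least 3 days old.
sample = df.filter((F.col("failed") == 1) | (F.col("age_days") >= 3))
sample.show()
```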