Questions tagged [pyspark-pandas]

131 questions
1
vote
1 answer

pyspark df.filter for dynamic columns based on user arguments

I am trying to perform isin() using df.filter, but I need to filter columns dynamically based on user arguments passed in. I am appending custom columns to a list and joining it later, which converts it into a string, since isin() is also getting converted into…
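
One way to keep isin() intact is to build the predicate from Column objects rather than joined strings. A minimal sketch, assuming the user arguments arrive as a dict of column name to allowed values (all names here are hypothetical):

    # Build the filter from Column conditions with reduce() instead of
    # string concatenation, so isin() keeps its Column semantics.
    from functools import reduce
    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.getOrCreate()
    df = spark.createDataFrame([("a", 1), ("b", 2)], ["col1", "col2"])

    user_filters = {"col1": ["a", "c"], "col2": [1]}  # parsed user arguments

    condition = reduce(
        lambda acc, kv: acc & F.col(kv[0]).isin(kv[1]),
        user_filters.items(),
        F.lit(True),
    )
    df.filter(condition).show()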
1
vote
1 answer

Pandas API on Spark runs far slower than pandas

I am applying a transform to my dataframe. While the process takes just 3 seconds with pandas, when I use PySpark and the Pandas API on Spark it takes approximately 30 minutes, yes 30 minutes! My data is 10k rows. The following is my pandas approach; def…
Ugur Selim Ozen
  • 159
  • 3
  • 10
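
For a 10k-row frame, Spark's planning and shuffle overhead usually dominates, so a 3-second pandas job can legitimately take far longer under the Pandas API on Spark. A hedged sketch of the two usual mitigations, assuming nothing about the actual transform:

    import pyspark.pandas as ps

    # Avoid the global sort the default sequence index requires.
    ps.set_option("compute.default_index_type", "distributed")

    psdf = ps.DataFrame({"x": range(10_000)})
    psdf["y"] = psdf["x"] * 2      # prefer vectorized ops over row-wise apply
    small = psdf.to_pandas()       # for data this small, finishing the job in
                                   # plain pandas is often the fastest option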
1
vote
2 answers

Pyspark - Transpose distinct row values to a column header where I'll insert values from the same row I transposed but a different column

This is the table I want to transpose. I created a list of the distinct values in DESC_INFO using this: columnsToPivot = list(dict.fromkeys(df.filter(F.col("DESC_INFO") != '').rdd.map(lambda x: (x.DESC_INFO, x.RAW_INFO)).collect())) And then I tried…
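
Collecting distinct values and pivoting can usually be expressed directly with groupBy().pivot(). A minimal sketch using the question's DESC_INFO/RAW_INFO columns; the "id" key and the sample rows are assumptions:

    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.getOrCreate()
    df = spark.createDataFrame(
        [(1, "height", "180"), (1, "weight", "75"), (2, "height", "160")],
        ["id", "DESC_INFO", "RAW_INFO"],
    )

    pivoted = (
        df.filter(F.col("DESC_INFO") != "")
          .groupBy("id")
          .pivot("DESC_INFO")          # distinct DESC_INFO values become headers
          .agg(F.first("RAW_INFO"))    # fill each header from the same row
    )
    pivoted.show()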
1
vote
0 answers

How to flatten complex nested json file using pyspark

I have a complex nested JSON file which has struct types, array types, lists, and dicts nested within each other. I have a function which flattens the columns with struct type, but when it encounters any other type it fails. Is there any recursive function…
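
A hedged sketch of such a recursive flattener: structs are expanded into parent_child columns and arrays are exploded, looping until no complex types remain. The naming scheme and explode policy are assumptions to adapt to the actual schema:

    from pyspark.sql import functions as F
    from pyspark.sql.types import ArrayType, StructType

    def flatten(df):
        while True:
            complex_field = next(
                (f for f in df.schema.fields
                 if isinstance(f.dataType, (StructType, ArrayType))),
                None,
            )
            if complex_field is None:
                return df
            name, dtype = complex_field.name, complex_field.dataType
            if isinstance(dtype, StructType):
                # Expand each struct field into its own top-level column.
                expanded = [
                    F.col(f"{name}.{sub.name}").alias(f"{name}_{sub.name}")
                    for sub in dtype.fields
                ]
                df = df.select([c for c in df.columns if c != name] + expanded)
            else:
                # One row per array element; keeps rows with empty arrays.
                df = df.withColumn(name, F.explode_outer(name))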
1
vote
1 answer

PySpark 3.3.0 is not using cached DataFrame when performing a concat with Pandas API

Since we upgraded to PySpark 3.3.0 for our job, we have issues with cached ps.DataFrames that are then concatenated using pyspark pandas: ps.concat([df1, df2]). The issue is that the concatenated data frame is not using the cached data but is re-reading the…
frco9
  • 11
  • 2
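
A hedged workaround sketch: materialize the result while an explicit cache is held on each frame, so the union cannot trigger a re-read (df1/df2 here are placeholders, not the asker's data):

    import pyspark.pandas as ps

    df1 = ps.DataFrame({"a": [1, 2]})
    df2 = ps.DataFrame({"a": [3, 4]})

    # CachedDataFrame works as a context manager; evaluate the concat
    # while both caches are alive.
    with df1.spark.cache() as c1, df2.spark.cache() as c2:
        combined = ps.concat([c1, c2])
        result = combined.to_pandas()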
1
vote
1 answer

How to replace any null in a pyspark df with the value from the row below, same column

Let's say I have a pyspark DF:

| Column A | Column B |
| -------- | -------- |
| val1     | val1B    |
| null     | val2B    |
| val2     | null     |
| val3     | val3B    |

Can someone help me with replacing any null value in any column (for the…
newmon
  • 25
  • 3
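
One hedged approach: coalesce() each column with lead() over an ordering window, which pulls the value from the row below. Spark rows have no inherent order, so the "row_id" ordering column is an assumption:

    from pyspark.sql import SparkSession, Window, functions as F

    spark = SparkSession.builder.getOrCreate()
    df = spark.createDataFrame(
        [(1, "val1", "val1B"), (2, None, "val2B"),
         (3, "val2", None), (4, "val3", "val3B")],
        ["row_id", "A", "B"],
    )

    w = Window.orderBy("row_id")
    filled = df.select(
        "row_id",
        *[F.coalesce(F.col(c), F.lead(c).over(w)).alias(c) for c in ["A", "B"]],
    )
    filled.show()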
1
vote
0 answers

How to filter out groups of a groupby for which not all rows are present in another DataFrame

The purpose of this code is to filter out the groups of a pyspark DataFrame that contain rows without a matching row in the hub DataFrame. The groups are determined based on the primary key. To check if all rows of the group have matching…
jcnouwens
  • 11
  • 3
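
A hedged sketch of one way to do this: a left-anti join finds the rows with no match in the hub DataFrame, and a second anti join drops every group that owns such a row (column names and sample data are assumptions):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    df = spark.createDataFrame([(1, "x"), (1, "y"), (2, "z")], ["group_key", "pk"])
    hub = spark.createDataFrame([("x",), ("y",)], ["pk"])

    # Groups containing at least one row that is missing from hub.
    bad_groups = df.join(hub, on="pk", how="left_anti") \
                   .select("group_key").distinct()

    # Keep only groups whose rows all matched.
    fully_matched = df.join(bad_groups, on="group_key", how="left_anti")
    fully_matched.show()    # group 1 survives, group 2 is dropped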
1
vote
1 answer

How to replace text in a column with the values contained in the columns named in that text

In pyspark, I'm trying to replace multiple text values in a column with the values present in the columns whose names appear in the calc column (formula). So to be clear, here is an example: Input: |param_1|param_2|calc…
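
One hedged way to do this is to substitute each parameter column's name inside calc with that row's value, one regexp_replace per column (the sample data and column list are assumptions; beware of names that are prefixes of other names):

    from functools import reduce
    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.getOrCreate()
    df = spark.createDataFrame([(2, 3, "param_1 + param_2")],
                               ["param_1", "param_2", "calc"])

    param_cols = ["param_1", "param_2"]
    resolved = reduce(
        lambda d, c: d.withColumn(
            "calc", F.expr(f"regexp_replace(calc, '{c}', cast({c} as string))")
        ),
        param_cols,
        df,
    )
    resolved.show(truncate=False)   # calc becomes "2 + 3"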
1
vote
1 answer

How to groupby on an index column in pyspark pandas?

I have a pandas API on Spark (Koalas) dataframe as below:

           id  name
    c1 c2
    a1 a    1     1
    a1 b    2     2
    b1 c    3     3

How can I do a groupby like the below?

    df.groupby(level=1).sum()
Selva
  • 951
  • 7
  • 23
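
The pandas-on-Spark groupby does not appear to accept level=, so one hedged workaround is to promote the index level to a column first (index names c1/c2 follow the question's example):

    import pandas as pd
    import pyspark.pandas as ps

    pdf = pd.DataFrame(
        {"id": [1, 2, 3], "name": [1, 2, 3]},
        index=pd.MultiIndex.from_tuples(
            [("a1", "a"), ("a1", "b"), ("b1", "c")], names=["c1", "c2"]
        ),
    )
    psdf = ps.from_pandas(pdf)

    # Equivalent of df.groupby(level=1).sum() in plain pandas.
    result = psdf.reset_index().groupby("c2")[["id", "name"]].sum()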
1
vote
1 answer

AttachDistributedSequence is not supported in Unity Catalog

I'm trying to read a table on Databricks into a DataFrame using pyspark.pandas.read_table and receive the following error: AnalysisException: [UC_COMMAND_NOT_SUPPORTED] AttachDistributedSequence is not supported in Unity…
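
The error tends to come from generating the default sequence index; passing index_col so no distributed sequence has to be attached is one hedged workaround (the table and column names below are placeholders):

    import pyspark.pandas as ps

    # Use an existing column as the index instead of a generated sequence.
    psdf = ps.read_table("my_catalog.my_schema.my_table", index_col="id")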
1
vote
1 answer

ArrowInvalid: Could not convert ... with type DataFrame: did not recognize Python value type when inferring an Arrow data type

Using the IForest library, I am implementing a function for detecting outliers using the following code: import pyspark.pandas as pd import numpy as np from alibi_detect.od import IForest # **************** IForest model…
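
This ArrowInvalid usually means a DataFrame object reached a spot where a NumPy array was expected. A hedged sketch, assuming alibi_detect's IForest and made-up feature columns, converting before fitting:

    import pyspark.pandas as ps
    from alibi_detect.od import IForest

    psdf = ps.DataFrame({"f1": [1.0, 2.0, 100.0], "f2": [0.5, 0.4, 9.9]})
    X = psdf.to_numpy()          # materializes the data on the driver

    od = IForest(threshold=0.0)  # threshold value is illustrative
    od.fit(X)
    preds = od.predict(X)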
1
vote
0 answers

How to do parallel processing using concurrent jobs in Databricks?

A. Background: I have to do text manipulation using Python (like concatenation, converting to a spaCy doc, getting verbs from the spaCy doc, etc.) for 1 million records. Each record takes 1 sec, meaning it will take over 10 days for 1 million records! There's no…
newbie101
  • 65
  • 7
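
Rather than driver-side concurrency, one hedged pattern is to let Spark parallelize the per-record spaCy work with a pandas UDF (the model name and the availability of spacy on the executors are assumptions):

    import pandas as pd
    from pyspark.sql import SparkSession
    from pyspark.sql.functions import pandas_udf

    spark = SparkSession.builder.getOrCreate()
    df = spark.createDataFrame([("I ran home",), ("she eats",)], ["text"])

    @pandas_udf("string")
    def verbs(texts: pd.Series) -> pd.Series:
        import spacy                        # imported on the executor
        nlp = spacy.load("en_core_web_sm")  # loaded once per batch, not per row
        return texts.map(
            lambda t: " ".join(tok.lemma_ for tok in nlp(t) if tok.pos_ == "VERB")
        )

    df.withColumn("verbs", verbs("text")).show()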
1
vote
0 answers

pyspark calculate custom metric on grouped data

I have a large dataframe (40 billion+ rows) which can be grouped by a key. I want to apply a custom calculation to a few fields of each group and derive a single value for that group. E.g., the dataframe below has group_key and I want to derive a single value…
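
A hedged sketch with applyInPandas: each group arrives as a plain pandas DataFrame and the function returns one row holding the derived value (the metric, columns, and schema are assumptions):

    import pandas as pd
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    df = spark.createDataFrame(
        [("g1", 1.0, 2.0), ("g1", 3.0, 4.0), ("g2", 5.0, 6.0)],
        ["group_key", "f1", "f2"],
    )

    def metric(pdf: pd.DataFrame) -> pd.DataFrame:
        value = (pdf["f1"] * pdf["f2"]).sum() / len(pdf)   # any custom formula
        return pd.DataFrame({"group_key": [pdf["group_key"].iloc[0]],
                             "metric": [value]})

    df.groupBy("group_key") \
      .applyInPandas(metric, schema="group_key string, metric double") \
      .show()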
1
vote
1 answer

How do I distribute application of a function that returns a scalar over a grouped dataframe using pandas API on Spark with Azure Databricks?

[I managed to answer my own question in a narrow sense, but hopefully someone who knows more than I do can explain why my solution works and give a more general answer.] I am new to Databricks, Spark, and modern distributed computing. (I do…
jtolle
  • 7,023
  • 2
  • 28
  • 50
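
In the pandas API on Spark itself, groupby().apply() with a function returning a scalar yields one value per group, with the return type inferred by sampling when no hint is given. A hedged sketch with made-up names:

    import pyspark.pandas as ps

    psdf = ps.DataFrame({"key": ["a", "a", "b"], "x": [1.0, 2.0, 5.0]})

    # Each group is passed as a pandas DataFrame; a scalar comes back.
    result = psdf.groupby("key").apply(lambda g: g["x"].max() - g["x"].min())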
1
vote
0 answers

pyspark.pandas - type Series doesn't define __round__ method

I have a Spark DF that I've converted to a pyspark.pandas DF using df.to_pandas_on_spark(). I have some logic that rounds a column, i.e.: df[Column_Name] = round(df[Income] - df[Fees], 2), but I get the following error: TypeError: type Series doesn't…
mikelowry
  • 1,307
  • 4
  • 21
  • 43
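
Python's built-in round() delegates to __round__, which the pandas-on-Spark Series does not implement; the Series.round() method does the same job. A minimal sketch with hypothetical data matching the question's column names:

    import pyspark.pandas as ps

    psdf = ps.DataFrame({"Income": [10.123, 20.456], "Fees": [1.111, 2.222]})
    psdf["Net"] = (psdf["Income"] - psdf["Fees"]).round(2)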