Questions tagged [pyspark-pandas]
131 questions
1
vote
1 answer
pyspark df.filter for dynamic columns based on user arguments
I am trying to perform isin() using df.filter, but I need to filter columns dynamically based on the user arguments passed. I am appending custom columns to a list and joining them later, which converts the list into a string, since isin() is also getting converted into…

avinash reddy
- 25
- 4
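One possible shape for the dynamic filter described above, as a sketch: build one isin() condition per column and combine them, instead of joining anything into a string. The dict of user arguments (user_filters) and the column names are placeholders.
from functools import reduce
from pyspark.sql import functions as F

# hypothetical user input: column name -> allowed values
user_filters = {"country": ["US", "IN"], "status": ["active"]}

# one isin() condition per column, ANDed together
conditions = [F.col(c).isin(vals) for c, vals in user_filters.items()]
filtered_df = df.filter(reduce(lambda a, b: a & b, conditions))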
1
vote
1 answer
Pandas API on Spark runs far slower than plain pandas
I am applying a transform to my dataframe. While the process takes just 3 seconds with pandas, when I use PySpark and the Pandas API on Spark it takes approximately 30 minutes, yes 30 minutes! My data is 10k rows.
The following is my pandas approach:
def…

Ugur Selim Ozen
- 159
- 3
- 10
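For a dataset of roughly 10k rows, one option worth noting (a sketch under the assumption that the data genuinely fits on the driver) is to skip the distributed path and run the original pandas transform locally, since the Spark scheduling overhead usually dominates at that size. psdf and the column names below are placeholders.
# collect the pandas-on-Spark DataFrame to plain pandas on the driver
pdf = psdf.to_pandas()
# stand-in for the real transform, now running in ordinary pandas
result = pdf.assign(total=pdf["a"] + pdf["b"])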
1
vote
2 answers
Pyspark - Transpose distinct row values into column headers, where I'll insert values from the same row but a different column
This is the table I want to transpose.
I created a list of the distinct values in DESC_INFO using this:
columnsToPivot = list(dict.fromkeys(df.filter(F.col("DESC_INFO") != '').rdd.map(lambda x: (x.DESC_INFO, x.RAW_INFO)).collect()))
And then I tried…

José Bastos
- 11
- 2
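A sketch of the pivot the question above appears to need, assuming a grouping key column (called ID here, a placeholder) and that keeping one RAW_INFO value per DESC_INFO header is acceptable.
from pyspark.sql import functions as F

# distinct DESC_INFO values become column headers, filled with RAW_INFO
pivoted = (
    df.filter(F.col("DESC_INFO") != "")
      .groupBy("ID")                     # assumed grouping key
      .pivot("DESC_INFO")
      .agg(F.first("RAW_INFO"))
)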
1
vote
0 answers
How to flatten a complex nested JSON file using pyspark
I have a complex nested JSON file which has struct types, array types, lists, and dicts nested within each other.
I have a function which flattens the columns with struct type, but when it encounters any other type it fails.
Is there any recursive function…

harshith
- 71
- 9
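A generic flattening sketch for struct and array columns (map/dict fields would still need separate handling); the underscore-joined naming is an assumption, not the asker's function.
from pyspark.sql import functions as F
from pyspark.sql.types import StructType, ArrayType

def flatten(df):
    # repeatedly expand structs and explode arrays until neither remains
    while True:
        struct_cols = [f.name for f in df.schema.fields if isinstance(f.dataType, StructType)]
        array_cols = [f.name for f in df.schema.fields if isinstance(f.dataType, ArrayType)]
        if not struct_cols and not array_cols:
            return df
        for c in struct_cols:
            # promote each struct field to a top-level column, prefixed with the parent name
            expanded = [F.col(c + "." + sub.name).alias(c + "_" + sub.name)
                        for sub in df.schema[c].dataType.fields]
            df = df.select([col for col in df.columns if col != c] + expanded)
        for c in array_cols:
            # one row per array element; structs inside are flattened on the next pass
            df = df.withColumn(c, F.explode_outer(c))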
1
vote
1 answer
PySpark 3.3.0 is not using cached DataFrame when performing a concat with Pandas API
Since we upgraded our job to PySpark 3.3.0, we have issues with cached ps.DataFrames that are then concatenated using pyspark pandas: ps.concat([df1, df2])
The issue is that the concatenated DataFrame is not using the cached data but is re-reading the…

frco9
- 11
- 2
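One thing worth trying, sketched here without any guarantee that it avoids the re-read seen in 3.3.0: cache each pandas-on-Spark frame explicitly through the .spark accessor before concatenating. The paths are placeholders.
import pyspark.pandas as ps

# materialize each frame's plan once before the concat
df1 = ps.read_parquet("/path/one").spark.cache()
df2 = ps.read_parquet("/path/two").spark.cache()
combined = ps.concat([df1, df2])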
1
vote
1 answer
How to replace any null in a pyspark df with the value from the row below, same column
Let's say I have a pyspark DF:
| Column A | Column B |
| -------- | -------- |
| val1 | val1B |
| null | val2B |
| val2 | null |
| val3 | val3B |
Can someone help me with replacing any null value in any column (for the…

newmon
- 25
- 3
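One way to fill from the row below, sketched under the assumption that an explicit ordering column exists (called row_id here, a placeholder), since Spark rows have no inherent order.
from pyspark.sql import functions as F, Window

w = Window.orderBy("row_id")   # assumed ordering column
# for each column, take the value from the next row when the current one is null;
# runs of consecutive nulls would need first(..., ignorenulls=True) over a following frame instead
filled = df.select(
    *[F.coalesce(F.col(c), F.lead(c).over(w)).alias(c) for c in ["Column A", "Column B"]]
)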
1
vote
0 answers
How to filter out groupby groups for which not all rows are present in another DataFrame
The purpose of this code is to filter out the groups of a pyspark DataFrame that contain rows which do not have a matching row in the hub DataFrame. The groups are determined based on the primary key.
To check if all rows of the group have matching…

jcnouwens
- 11
- 3
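A sketch of one way to drop every group that contains an unmatched row, using two anti-joins; pk, group_id, df and hub are placeholder names.
# rows whose primary key has no match in the hub DataFrame
missing = df.join(hub, on="pk", how="left_anti")
# the groups those rows belong to
bad_groups = missing.select("group_id").distinct()
# keep only groups in which every row matched
complete_only = df.join(bad_groups, on="group_id", how="left_anti")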
1
vote
1 answer
How to replace text in a column with the values contained in the columns named in that text
In PySpark, I'm trying to replace multiple text values in a column with the values present in the columns whose names appear in the calc column (a formula).
So to be clear, here is an example:
Input:
|param_1|param_2|calc…

Cazau
- 13
- 3
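A sketch of substituting each referenced column's value into the formula text with the SQL replace function; the column names come from the truncated example, and overlapping names or special characters are not handled.
from pyspark.sql import functions as F

result = df.withColumn("calc_resolved", F.col("calc"))
for name in ["param_1", "param_2"]:
    # replace the literal column name in the text with that column's value
    result = result.withColumn(
        "calc_resolved",
        F.expr(f"replace(calc_resolved, '{name}', cast({name} as string))"),
    )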
1
vote
1 answer
How to groupby on an index column in pyspark pandas?
I have a pandas API on Spark (Koalas) dataframe as below:
         c1  c2
id name
a1 a      1   1
a1 b      2   2
b1 c      3   3
How can I do a groupby like the below?
df.groupby(level=1).sum()

Selva
- 951
- 7
- 23
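If groupby(level=...) is not supported by the installed pyspark.pandas version, one workaround (sketched) is to reset the index so the level becomes an ordinary column and group by its name.
import pyspark.pandas as ps

psdf = psdf.reset_index()                       # "id" and "name" become columns
out = psdf.groupby("name")[["c1", "c2"]].sum()  # same result as df.groupby(level=1).sum()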
1
vote
1 answer
AttachDistributedSequence is not supported in Unity Catalog
I'm trying to read a table on Databricks into a DataFrame using pyspark.pandas.read_table and receive the following error:
AnalysisException: [UC_COMMAND_NOT_SUPPORTED] AttachDistributedSequence is not supported in Unity…

Toivo Mattila
- 377
- 1
- 9
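A possible workaround, sketched: the default distributed-sequence index is what introduces AttachDistributedSequence, so switching the default index type may avoid the unsupported operator, at the cost of non-consecutive index values. The table name is a placeholder.
import pyspark.pandas as ps

ps.set_option("compute.default_index_type", "distributed")
psdf = ps.read_table("catalog.schema.my_table")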
1
vote
1 answer
ArrowInvalid: Could not convert ... with type DataFrame: did not recognize Python value type when inferring an Arrow data type
Using the IForest library, I'm implementing a function for detecting outliers with the following code:
import pyspark.pandas as pd
import numpy as np
from alibi_detect.od import IForest
# **************** Modelo IForest…

Daniel Vera
- 77
- 1
- 10
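One common cause of this Arrow error is handing a pandas-on-Spark object to a library that expects numpy or plain pandas; a sketch of converting first (psdf and the feature column names are placeholders).
import pyspark.pandas as ps
from alibi_detect.od import IForest

# collect the features to a plain numpy array on the driver
X = psdf[["feature_1", "feature_2"]].to_numpy()
od = IForest(threshold=0.0)   # threshold value is a placeholder
od.fit(X)
preds = od.predict(X)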
1
vote
0 answers
How to use parallel processing with concurrent jobs in Databricks?
A. Background:
I have to do text manipulation using Python (like concatenation, converting to a spaCy doc, getting verbs from the spaCy doc, etc.) for 1 million records.
Each record takes 1 second, meaning it will take roughly 10 days for 1 million records!
There's no…

newbie101
- 65
- 7
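A sketch of pushing the per-record spaCy work out to the executors with a pandas UDF instead of looping on the driver; the text column name, the model name, and the verb extraction are placeholder choices.
import pandas as pd
from pyspark.sql import functions as F
from pyspark.sql.types import ArrayType, StringType

@F.pandas_udf(ArrayType(StringType()))
def extract_verbs(texts: pd.Series) -> pd.Series:
    import spacy
    # loaded once per Arrow batch; caching the model per worker is a further optimisation
    nlp = spacy.load("en_core_web_sm")
    return texts.apply(lambda t: [tok.lemma_ for tok in nlp(t) if tok.pos_ == "VERB"])

result = df.withColumn("verbs", extract_verbs("text"))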
1
vote
0 answers
pyspark calculate custom metric on grouped data
I have a large dataframe (40 billion+ rows) which can be grouped by a key. I want to apply a custom calculation on a few fields of each group and derive a single value for that group. E.g., the dataframe below has a group_key and I want to derive a single value…

user14297339
- 33
- 5
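A sketch using groupBy(...).applyInPandas, which runs a pandas function once per group and can emit a single row per group; the column names, the metric, and the output schema are placeholders.
import pandas as pd

def custom_metric(pdf: pd.DataFrame) -> pd.DataFrame:
    # placeholder metric: weighted average of x by y
    value = (pdf["x"] * pdf["y"]).sum() / pdf["y"].sum()
    return pd.DataFrame({"group_key": [pdf["group_key"].iloc[0]], "metric": [value]})

result = df.groupBy("group_key").applyInPandas(
    custom_metric, schema="group_key string, metric double"
)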
1
vote
1 answer
How do I distribute application of a function that returns a scalar over a grouped dataframe using pandas API on Spark with Azure Databricks?
[I managed to answer my own question in a narrow sense, but hopefully someone who knows more than I do can explain why my solution works and give a more general answer.]
I am new to Databricks, Spark, and modern distributed computing. (I do…

jtolle
- 7,023
- 2
- 28
- 50
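With the pandas API on Spark itself, one sketch is GroupBy.apply with a function that wraps the per-group scalar in a one-row pandas DataFrame; without type hints the output schema is inferred from a sample. Names are placeholders.
import pandas as pd

def my_stat(group):
    # the per-group scalar, wrapped so each group yields one output row
    return pd.DataFrame({"metric": [(group["x"] * group["y"]).sum()]})

result = psdf.groupby("group_key").apply(my_stat)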
1
vote
0 answers
pyspark.pandas - type Series doesn't define __round__ method
I have a Spark DF that I've converted to a pyspark pandas DF using df.to_pandas_on_spark().
I have some logic that rounds a column, i.e.:
df[Column_Name] = round(df[Income] - df[Fees], 2)
but I get the following error:
TypeError: type Series doesn't…

mikelowry
- 1,307
- 4
- 21
- 43
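A sketch of the usual fix: the builtin round() needs __round__, which the pandas-on-Spark Series does not define, but the Series.round() method does the same job (Column_Name, Income and Fees are the question's own variables).
df[Column_Name] = (df[Income] - df[Fees]).round(2)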