Questions tagged [pyspark]

The Spark Python API (PySpark) exposes the Apache Spark programming model to Python.

39058 questions
71
votes
8 answers

Pyspark dataframe operator "IS NOT IN"

I would like to rewrite this from R to PySpark; any nice-looking suggestions? array <- c(1,2,3) dataset <- filter(!(column %in% array))
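
A minimal PySpark sketch of the same filter, assuming the DataFrame and column are named dataset and column as in the R snippet: isin() combined with the ~ negation plays the role of !(... %in% ...).

from pyspark.sql import functions as F

array = [1, 2, 3]
# keep only the rows whose value is NOT in the list ("IS NOT IN")
dataset = dataset.filter(~F.col("column").isin(array))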
Babu
  • 4,324
  • 6
  • 41
  • 60
71
votes
2 answers

Pyspark replace strings in Spark dataframe column

I'd like to perform some basic stemming on a Spark Dataframe column by replacing substrings. What's the quickest way to do this? In my current use case, I have a list of addresses that I want to normalize. For example this dataframe: id …
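
One hedged approach (a sketch, not necessarily the fastest in every case) is to chain regexp_replace calls; the column name address and the replacement pairs below are assumptions for illustration.

from pyspark.sql import functions as F

replacements = [("\\bSt\\b", "Street"), ("\\bRd\\b", "Road")]  # hypothetical substitutions
for pattern, repl in replacements:
    # replace each pattern in the address column with its normalized form
    df = df.withColumn("address", F.regexp_replace("address", pattern, repl))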
Luke
  • 6,699
  • 13
  • 50
  • 88
70
votes
4 answers

How to split Vector into columns - using PySpark

Context: I have a DataFrame with 2 columns: word and vector, where the column type of "vector" is VectorUDT. An example: word | vector assert | [435,323,324,212...] And I want to get this: word | v1 | v2 | v3 | v4 | v5 | v6 ...... assert |…
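
On Spark 3.0+ one possible sketch uses pyspark.ml.functions.vector_to_array and then selects each array position as its own column; the vector length n below is an assumption.

from pyspark.ml.functions import vector_to_array
from pyspark.sql import functions as F

n = 6  # assumed number of dimensions
df = df.withColumn("arr", vector_to_array("vector"))
# one output column per vector position: v1 ... vn
df = df.select("word", *[F.col("arr")[i].alias(f"v{i + 1}") for i in range(n)])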
sedioben
  • 935
  • 1
  • 10
  • 16
69
votes
1 answer

Spark load data and add filename as dataframe column

I am loading some data into Spark with a wrapper function: def load_data( filename ): df = sqlContext.read.format("com.databricks.spark.csv")\ .option("delimiter", "\t")\ .option("header", "false")\ .option("mode",…
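
A sketch of one way to attach the source path: input_file_name() records, for each row, the file it was read from (assuming a SparkSession named spark; the reader options mirror the question).

from pyspark.sql import functions as F

def load_data(filename):
    return (spark.read
            .option("delimiter", "\t")
            .option("header", "false")
            .csv(filename)
            .withColumn("filename", F.input_file_name()))  # path the row came from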
yee379
  • 6,498
  • 10
  • 56
  • 101
69
votes
6 answers

Retrieve top n in each group of a DataFrame in pyspark

There's a DataFrame in pyspark with data as below: user_id object_id score user_1 object_1 3 user_1 object_1 1 user_1 object_2 2 user_2 object_1 5 user_2 object_2 2 user_2 object_2 6 What I expect is returning 2 records in each group…
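
A common sketch ranks rows inside each group with a window function and keeps the first n; the column names match the question and n = 2 matches the expected output.

from pyspark.sql import functions as F
from pyspark.sql.window import Window

w = Window.partitionBy("user_id").orderBy(F.col("score").desc())
# number the rows within each user by descending score and keep the top 2
top2 = (df.withColumn("rn", F.row_number().over(w))
          .filter(F.col("rn") <= 2)
          .drop("rn"))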
KAs
  • 1,818
  • 4
  • 19
  • 37
69
votes
5 answers

PySpark: multiple conditions in when clause

I would like to modify the cell values of a dataframe column (Age) where it is currently blank, and I would only do so if another column (Survived) has the value 0 for the corresponding row where Age is blank. If it is 1 in the Survived…
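
A hedged sketch of combining both tests in a single when() clause with &; it assumes "blank" means null, and the fill value is a placeholder.

from pyspark.sql import functions as F

fill_value = 0  # placeholder replacement value
df = df.withColumn(
    "Age",
    F.when((F.col("Survived") == 0) & (F.col("Age").isNull()), F.lit(fill_value))
     .otherwise(F.col("Age")))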
sjishan
  • 3,392
  • 9
  • 29
  • 53
66
votes
8 answers

Pyspark: Pass multiple columns in UDF

I am writing a user-defined function which will take all the columns except the first one in a dataframe and sum them (or perform any other operation). Now the dataframe can sometimes have 3 columns or 4 columns or more. It will vary. I know I can hard code…
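
One sketch passes a variable number of columns by unpacking a list into the UDF call; the summing UDF itself is purely illustrative.

from pyspark.sql import functions as F
from pyspark.sql.types import DoubleType

# sum an arbitrary number of columns inside a Python UDF
sum_udf = F.udf(lambda *cols: float(sum(c for c in cols if c is not None)), DoubleType())

other_cols = df.columns[1:]  # every column except the first
df = df.withColumn("total", sum_udf(*[F.col(c) for c in other_cols]))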
sjishan
  • 3,392
  • 9
  • 29
  • 53
66
votes
3 answers

PySpark: withColumn() with two conditions and three outcomes

I am working with Spark and PySpark. I am trying to achieve the result equivalent to the following pseudocode: df = df.withColumn('new_column', IF fruit1 == fruit2 THEN 1, ELSE 0. IF fruit1 IS NULL OR fruit2 IS NULL 3.) I am trying to do this…
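
A sketch of the nested when()/otherwise() that mirrors the pseudocode, checking for nulls first so that branch takes precedence over the equality test.

from pyspark.sql import functions as F

df = df.withColumn(
    "new_column",
    F.when(F.col("fruit1").isNull() | F.col("fruit2").isNull(), 3)
     .when(F.col("fruit1") == F.col("fruit2"), 1)
     .otherwise(0))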
user2205916
  • 3,196
  • 11
  • 54
  • 82
66
votes
9 answers

spark dataframe drop duplicates and keep first

Question: in pandas when dropping duplicates you can specify which columns to keep. Is there an equivalent in Spark Dataframes? Pandas: df.sort_values('actual_datetime', ascending=False).drop_duplicates(subset=['scheduled_datetime',…
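
One hedged equivalent sorts within each duplicate key using a window and keeps only the newest row; the subset list is truncated in the excerpt, so only scheduled_datetime is used as the key here.

from pyspark.sql import functions as F
from pyspark.sql.window import Window

w = Window.partitionBy("scheduled_datetime").orderBy(F.col("actual_datetime").desc())
# keep the most recent row per key, mimicking sort_values + drop_duplicates(keep="first")
df = (df.withColumn("rn", F.row_number().over(w))
        .filter(F.col("rn") == 1)
        .drop("rn"))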
ad_s
  • 1,560
  • 4
  • 15
  • 16
66
votes
3 answers

Spark DataFrame TimestampType - how to get Year, Month, Day values from field?

I have Spark DataFrame with take(5) top rows as follows: [Row(date=datetime.datetime(1984, 1, 1, 0, 0), hour=1, value=638.55), Row(date=datetime.datetime(1984, 1, 1, 0, 0), hour=2, value=638.55), Row(date=datetime.datetime(1984, 1, 1, 0, 0),…
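
A sketch using the built-in date functions on the date column from the excerpt.

from pyspark.sql import functions as F

df = (df.withColumn("year", F.year("date"))
        .withColumn("month", F.month("date"))
        .withColumn("day", F.dayofmonth("date")))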
curtisp
  • 2,227
  • 3
  • 30
  • 62
66
votes
3 answers

How to convert a DataFrame back to normal RDD in pyspark?

I need to use the (rdd.)partitionBy(npartitions, custom_partitioner) method that is not available on the DataFrame. All of the DataFrame methods refer only to DataFrame results. So then how to create an RDD from the DataFrame data? Note: this is…
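
A minimal sketch: df.rdd exposes the underlying RDD of Row objects, which can be mapped to key/value tuples before calling partitionBy; npartitions and custom_partitioner are the question's own placeholders.

# df.rdd is an RDD[Row]; turn it into (key, value) pairs first
pair_rdd = df.rdd.map(lambda row: (row[0], tuple(row[1:])))
pair_rdd = pair_rdd.partitionBy(npartitions, custom_partitioner)  # placeholders from the question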
WestCoastProjects
  • 58,982
  • 91
  • 316
  • 560
65
votes
5 answers

How to count unique ID after groupBy in pyspark

I'm using the following code to aggregate students per year. The purpose is to know the total number of students for each year. from pyspark.sql.functions import col import pyspark.sql.functions as fn gr = Df2.groupby(['Year']) df_grouped =…
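
A hedged sketch using countDistinct; the student id column name is an assumption, since it is not shown in the excerpt.

import pyspark.sql.functions as fn

# number of distinct students per year
df_grouped = Df2.groupBy("Year").agg(fn.countDistinct("student_id").alias("n_students"))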
Lizou
  • 863
  • 1
  • 11
  • 16
65
votes
8 answers

get datatype of column using pyspark

We are reading data from a MongoDB collection. The collection column has two different values (e.g. (bson.Int64,int), (int,float)). I am trying to get the datatype using pyspark. My problem is that some columns have a different datatype. Assume quantity and…
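
A short sketch of the two usual ways to inspect column types; the column name quantity comes from the excerpt.

# list of (column name, type string) pairs for the whole DataFrame
print(df.dtypes)

# DataType object for a single column
print(df.schema["quantity"].dataType)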
Sreenuvasulu
  • 653
  • 1
  • 5
  • 9
65
votes
6 answers

How to melt Spark DataFrame?

Is there an equivalent of the Pandas melt function in Apache Spark, in PySpark or at least in Scala? Until now I was running a sample dataset in Python, and now I want to use Spark for the entire dataset.
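
There is no built-in melt in older Spark releases (DataFrame.unpivot/melt only arrived around Spark 3.4), but a common sketch builds an array of structs and explodes it; the id_vars/value_vars parameters below mirror the pandas signature and are otherwise hypothetical.

from pyspark.sql import functions as F

def melt(df, id_vars, value_vars, var_name="variable", value_name="value"):
    # one struct per value column, exploded into long format
    pairs = F.explode(F.array(*[
        F.struct(F.lit(c).alias(var_name), F.col(c).alias(value_name))
        for c in value_vars
    ])).alias("_pair")
    return (df.select(*id_vars, pairs)
              .select(*id_vars, f"_pair.{var_name}", f"_pair.{value_name}"))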
64
votes
2 answers

Total size of serialized results of 16 tasks (1048.5 MB) is bigger than spark.driver.maxResultSize (1024.0 MB)

I get the following error when I add --conf spark.driver.maxResultSize=2050 to my spark-submit command. 17/12/27 18:33:19 ERROR TransportResponseHandler: Still have 1 requests outstanding when connection from /XXX.XX.XXX.XX:36245 is closed 17/12/27…
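
For reference, a hedged sketch of setting the limit programmatically rather than on the command line; it has to be applied before the SparkSession (and its SparkContext) is created, and the 2g value simply mirrors the question.

from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("example")  # hypothetical app name
         .config("spark.driver.maxResultSize", "2g")
         .getOrCreate())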
Markus
  • 3,562
  • 12
  • 48
  • 85