Questions tagged [pyspark]

The Spark Python API (PySpark) exposes the Apache Spark programming model to Python.

39058 questions
71
votes
8 answers

Pyspark dataframe operator "IS NOT IN"

I would like to rewrite this from R to PySpark; any nice-looking suggestions? array <- c(1,2,3) dataset <- filter(!(column %in% array))
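
A minimal PySpark sketch of the same filter, assuming the DataFrame and column are named dataset and column as in the R snippet: isin() combined with the ~ negation plays the role of !(... %in% ...).

from pyspark.sql import functions as F

array = [1, 2, 3]
# keep only the rows whose value is NOT in the list ("IS NOT IN")
dataset = dataset.filter(~F.col("column").isin(array))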
Babu
  • 4,324
  • 6
  • 41
  • 60
71
votes
2 answers

Pyspark replace strings in Spark dataframe column

I'd like to perform some basic stemming on a Spark Dataframe column by replacing substrings. What's the quickest way to do this? In my current use case, I have a list of addresses that I want to normalize. For example this dataframe: id …
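
One hedged approach (a sketch, not necessarily the fastest in every case) is to chain regexp_replace calls; the column name address and the replacement pairs below are assumptions for illustration.

from pyspark.sql import functions as F

replacements = [("\\bSt\\b", "Street"), ("\\bRd\\b", "Road")]  # hypothetical substitutions
for pattern, repl in replacements:
    # replace each pattern in the address column with its normalized form
    df = df.withColumn("address", F.regexp_replace("address", pattern, repl))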
Luke
  • 6,699
  • 13
  • 50
  • 88
70
votes
4 answers

How to split Vector into columns - using PySpark

Context: I have a DataFrame with 2 columns: word and vector, where the column type of "vector" is VectorUDT. An example: word | vector assert | [435,323,324,212...] And I want to get this: word | v1 | v2 | v3 | v4 | v5 | v6 ...... assert |…
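
On Spark 3.0+ one possible sketch uses pyspark.ml.functions.vector_to_array and then selects each array position as its own column; the vector length n below is an assumption.

from pyspark.ml.functions import vector_to_array
from pyspark.sql import functions as F

n = 6  # assumed number of dimensions
df = df.withColumn("arr", vector_to_array("vector"))
# one output column per vector position: v1 ... vn
df = df.select("word", *[F.col("arr")[i].alias(f"v{i + 1}") for i in range(n)])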
sedioben
  • 935
  • 1
  • 10
  • 16
69
votes
1 answer

Spark load data and add filename as dataframe column

I am loading some data into Spark with a wrapper function: def load_data( filename ): df = sqlContext.read.format("com.databricks.spark.csv")\ .option("delimiter", "\t")\ .option("header", "false")\ .option("mode",…
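
A sketch of one way to attach the source path: input_file_name() records, for each row, the file it was read from (assuming a SparkSession named spark; the reader options mirror the question).

from pyspark.sql import functions as F

def load_data(filename):
    return (spark.read
            .option("delimiter", "\t")
            .option("header", "false")
            .csv(filename)
            .withColumn("filename", F.input_file_name()))  # path the row came from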
yee379
  • 6,498
  • 10
  • 56
  • 101
69
votes
6 answers

Retrieve top n in each group of a DataFrame in pyspark

There's a DataFrame in pyspark with data as below: user_id object_id score user_1 object_1 3 user_1 object_1 1 user_1 object_2 2 user_2 object_1 5 user_2 object_2 2 user_2 object_2 6 What I expect is returning 2 records in each group…
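
A common sketch ranks rows inside each group with a window function and keeps the first n; the column names match the question and n = 2 matches the expected output.

from pyspark.sql import functions as F
from pyspark.sql.window import Window

w = Window.partitionBy("user_id").orderBy(F.col("score").desc())
# number the rows within each user by descending score and keep the top 2
top2 = (df.withColumn("rn", F.row_number().over(w))
          .filter(F.col("rn") <= 2)
          .drop("rn"))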
KAs
  • 1,818
  • 4
  • 19
  • 37
69
votes
5 answers

PySpark: multiple conditions in when clause

I would like to modify the cell values of a dataframe column (Age) where it is currently blank, and I would only do so if another column (Survived) has the value 0 for the corresponding row where Age is blank. If it is 1 in the Survived…
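
A hedged sketch of combining both tests in a single when() clause with &; it assumes "blank" means null, and the fill value is a placeholder.

from pyspark.sql import functions as F

fill_value = 0  # placeholder replacement value
df = df.withColumn(
    "Age",
    F.when((F.col("Survived") == 0) & (F.col("Age").isNull()), F.lit(fill_value))
     .otherwise(F.col("Age")))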
sjishan
  • 3,392
  • 9
  • 29
  • 53
66
votes
8 answers

Pyspark: Pass multiple columns in UDF

I am writing a user-defined function which will take all the columns except the first one in a dataframe and sum them (or perform any other operation). Now the dataframe can sometimes have 3 columns or 4 columns or more. It will vary. I know I can hard code…
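
One sketch passes a variable number of columns by unpacking a list into the UDF call; the summing UDF itself is purely illustrative.

from pyspark.sql import functions as F
from pyspark.sql.types import DoubleType

# sum an arbitrary number of columns inside a Python UDF
sum_udf = F.udf(lambda *cols: float(sum(c for c in cols if c is not None)), DoubleType())

other_cols = df.columns[1:]  # every column except the first
df = df.withColumn("total", sum_udf(*[F.col(c) for c in other_cols]))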
sjishan
  • 3,392
  • 9
  • 29
  • 53
66
votes
3 answers

PySpark: withColumn() with two conditions and three outcomes

I am working with Spark and PySpark. I am trying to achieve the result equivalent to the following pseudocode: df = df.withColumn('new_column', IF fruit1 == fruit2 THEN 1, ELSE 0. IF fruit1 IS NULL OR fruit2 IS NULL 3.) I am trying to do this…
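
A sketch of the nested when()/otherwise() that mirrors the pseudocode, checking for nulls first so that branch takes precedence over the equality test.

from pyspark.sql import functions as F

df = df.withColumn(
    "new_column",
    F.when(F.col("fruit1").isNull() | F.col("fruit2").isNull(), 3)
     .when(F.col("fruit1") == F.col("fruit2"), 1)
     .otherwise(0))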
user2205916
  • 3,196
  • 11
  • 54
  • 82
66
votes
9 answers

spark dataframe drop duplicates and keep first

Question: in pandas when dropping duplicates you can specify which columns to keep. Is there an equivalent in Spark Dataframes? Pandas: df.sort_values('actual_datetime', ascending=False).drop_duplicates(subset=['scheduled_datetime',…
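
One hedged equivalent sorts within each duplicate key using a window and keeps only the newest row; the subset list is truncated in the excerpt, so only scheduled_datetime is used as the key here.

from pyspark.sql import functions as F
from pyspark.sql.window import Window

w = Window.partitionBy("scheduled_datetime").orderBy(F.col("actual_datetime").desc())
# keep the most recent row per key, mimicking sort_values + drop_duplicates(keep="first")
df = (df.withColumn("rn", F.row_number().over(w))
        .filter(F.col("rn") == 1)
        .drop("rn"))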
ad_s
  • 1,560
  • 4
  • 15
  • 16
66
votes
3 answers

Spark DataFrame TimestampType - how to get Year, Month, Day values from field?

I have Spark DataFrame with take(5) top rows as follows: [Row(date=datetime.datetime(1984, 1, 1, 0, 0), hour=1, value=638.55), Row(date=datetime.datetime(1984, 1, 1, 0, 0), hour=2, value=638.55), Row(date=datetime.datetime(1984, 1, 1, 0, 0),…
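
A sketch using the built-in date functions on the date column from the excerpt.

from pyspark.sql import functions as F

df = (df.withColumn("year", F.year("date"))
        .withColumn("month", F.month("date"))
        .withColumn("day", F.dayofmonth("date")))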
curtisp
  • 2,227
  • 3
  • 30
  • 62
66
votes
3 answers

How to convert a DataFrame back to normal RDD in pyspark?

I need to use the (rdd.)partitionBy(npartitions, custom_partitioner) method that is not available on the DataFrame. All of the DataFrame methods refer only to DataFrame results. So then how to create an RDD from the DataFrame data? Note: this is…
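
A minimal sketch: df.rdd exposes the underlying RDD of Row objects, which can be mapped to key/value tuples before calling partitionBy; npartitions and custom_partitioner are the question's own placeholders.

# df.rdd is an RDD[Row]; turn it into (key, value) pairs first
pair_rdd = df.rdd.map(lambda row: (row[0], tuple(row[1:])))
pair_rdd = pair_rdd.partitionBy(npartitions, custom_partitioner)  # placeholders from the question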
WestCoastProjects
  • 58,982
  • 91
  • 316
  • 560
65
votes
5 answers

How to count unique ID after groupBy in pyspark

I'm using the following code to aggregate students per year. The purpose is to know the total number of students for each year. from pyspark.sql.functions import col import pyspark.sql.functions as fn gr = Df2.groupby(['Year']) df_grouped =…
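
A hedged sketch using countDistinct; the student id column name is an assumption, since it is not shown in the excerpt.

import pyspark.sql.functions as fn

# number of distinct students per year
df_grouped = Df2.groupBy("Year").agg(fn.countDistinct("student_id").alias("n_students"))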
Lizou
  • 863
  • 1
  • 11
  • 16
65
votes
8 answers

get datatype of column using pyspark

We are reading data from a MongoDB collection. The collection column has two different values (e.g. (bson.Int64,int), (int,float)). I am trying to get the datatype using pyspark. My problem is that some columns have a different datatype. Assume quantity and…
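
A short sketch of the two usual ways to inspect column types; the column name quantity comes from the excerpt.

# list of (column name, type string) pairs for the whole DataFrame
print(df.dtypes)

# DataType object for a single column
print(df.schema["quantity"].dataType)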
Sreenuvasulu
  • 653
  • 1
  • 5
  • 9
65
votes
6 answers

How to melt Spark DataFrame?

Is there an equivalent of the Pandas melt function in Apache Spark, in PySpark or at least in Scala? Until now I was running a sample dataset in Python, and now I want to use Spark for the entire dataset.
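
There is no built-in melt in older Spark releases (DataFrame.unpivot/melt only arrived around Spark 3.4), but a common sketch builds an array of structs and explodes it; the id_vars/value_vars parameters below mirror the pandas signature and are otherwise hypothetical.

from pyspark.sql import functions as F

def melt(df, id_vars, value_vars, var_name="variable", value_name="value"):
    # one struct per value column, exploded into long format
    pairs = F.explode(F.array(*[
        F.struct(F.lit(c).alias(var_name), F.col(c).alias(value_name))
        for c in value_vars
    ])).alias("_pair")
    return (df.select(*id_vars, pairs)
              .select(*id_vars, f"_pair.{var_name}", f"_pair.{value_name}"))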
64
votes
2 answers

Total size of serialized results of 16 tasks (1048.5 MB) is bigger than spark.driver.maxResultSize (1024.0 MB)

I get the following error when I add --conf spark.driver.maxResultSize=2050 to my spark-submit command. 17/12/27 18:33:19 ERROR TransportResponseHandler: Still have 1 requests outstanding when connection from /XXX.XX.XXX.XX:36245 is closed 17/12/27…
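
For reference, a hedged sketch of setting the limit programmatically rather than on the command line; it has to be applied before the SparkSession (and its SparkContext) is created, and the 2g value simply mirrors the question.

from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("example")  # hypothetical app name
         .config("spark.driver.maxResultSize", "2g")
         .getOrCreate())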
Markus
  • 3,562
  • 12
  • 48
  • 85