Questions tagged [pyspark]

The Spark Python API (PySpark) exposes the Apache Spark programming model to Python.

39058 questions
83 votes, 4 answers

Spark functions vs UDF performance?

Spark now offers predefined functions that can be used in dataframes, and it seems they are highly optimized. My original question was going to be about which is faster, but I did some testing myself and found the Spark functions to be about 10 times…
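
A minimal sketch of the kind of comparison the question refers to, putting a built-in column function next to an equivalent Python UDF; the DataFrame and column names here are illustrative, not from the post:

    from pyspark.sql import SparkSession, functions as F
    from pyspark.sql.types import StringType

    spark = SparkSession.builder.getOrCreate()
    df = spark.range(1_000_000).withColumn("word", F.lit("hello"))

    # Built-in function: evaluated inside the JVM, no Python serialization.
    built_in = df.withColumn("upper", F.upper(F.col("word")))

    # Python UDF: every value is shipped to a Python worker and back,
    # which is the usual source of the slowdown mentioned above.
    upper_udf = F.udf(lambda s: s.upper(), StringType())
    with_udf = df.withColumn("upper", upper_udf(F.col("word")))
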
82 votes, 3 answers

How to convert column with string type to int form in pyspark data frame?

I have a dataframe in PySpark. Some of its numerical columns contain nan, so when I read the data and check the schema of the dataframe, those columns have string type. How can I change them to int type? I replaced the nan values with 0…
neha (1,858 reputation)
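
A short sketch of the usual fix, casting the string column with Column.cast; the column name "value" is assumed for illustration:

    from pyspark.sql import functions as F
    from pyspark.sql.types import IntegerType

    # Cast the string column to int; values that cannot be parsed become null.
    df = df.withColumn("value", F.col("value").cast(IntegerType()))
    # Equivalent shorthand using the SQL type name:
    df = df.withColumn("value", F.col("value").cast("int"))
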
81 votes, 2 answers

pyspark collect_set or collect_list with groupby

How can I use collect_set or collect_list on a dataframe after groupby? For example: df.groupby('key').collect_set('values'). I get an error: AttributeError: 'GroupedData' object has no attribute 'collect_set'
Hanan Shteingart (8,480 reputation)
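
The error comes from calling collect_set directly on GroupedData; a minimal sketch of the agg-based form the question is looking for:

    from pyspark.sql import functions as F

    # collect_set / collect_list are aggregate functions, so they belong inside agg():
    result = df.groupBy("key").agg(F.collect_set("values").alias("value_set"))
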
81 votes, 7 answers

How to loop through each row of dataFrame in pyspark

E.g. sqlContext = SQLContext(sc) sample=sqlContext.sql("select Name ,age ,city from user") sample.show() The above statement prints the entire table on the terminal. But I want to access each row in that table using for or while to perform further…
Arti Berde (1,182 reputation)
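
A hedged sketch of two common ways to iterate over rows on the driver; sample is the DataFrame from the question, and process is a hypothetical placeholder for the caller's own logic:

    # Materialize all rows on the driver and iterate (fine for small results only):
    for row in sample.collect():
        print(row["Name"], row["age"], row["city"])

    # For larger results, pull rows over one partition at a time:
    for row in sample.toLocalIterator():
        process(row)  # 'process' is a placeholder, not part of the original question
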
81 votes, 3 answers

How to use JDBC source to write and read data in (Py)Spark?

The goal of this question is to document: the steps required to read and write data using JDBC connections in PySpark; possible issues with JDBC sources and known solutions. With small changes these methods should work with other supported languages…
zero323 (322,348 reputation)
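
A minimal sketch of the JDBC read and write paths in PySpark; the URL, table, credentials and driver class below are placeholders, not values from the question:

    # Reading from a JDBC source:
    df = (spark.read.format("jdbc")
          .option("url", "jdbc:postgresql://host:5432/dbname")
          .option("dbtable", "schema.tablename")
          .option("user", "username")
          .option("password", "password")
          .option("driver", "org.postgresql.Driver")
          .load())

    # Writing back through JDBC:
    (df.write.format("jdbc")
       .option("url", "jdbc:postgresql://host:5432/dbname")
       .option("dbtable", "schema.other_table")
       .option("user", "username")
       .option("password", "password")
       .mode("append")
       .save())
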
80 votes, 8 answers

How to get name of dataframe column in PySpark?

In pandas, this can be done by column.name. But how can I do the same when it's a column of a Spark dataframe? E.g. the calling program has a Spark dataframe: spark_df >>> spark_df.columns ['admit', 'gre', 'gpa', 'rank'] This program calls my function:…
Kaushik Acharya (1,520 reputation)
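
A small sketch of the DataFrame-level equivalents of pandas column.name, using the spark_df shown in the excerpt:

    # All column names of the DataFrame:
    names = spark_df.columns        # ['admit', 'gre', 'gpa', 'rank']
    # The same list via the schema:
    names = spark_df.schema.names
    # A bare Column expression has no public .name attribute, so answers
    # typically work from the DataFrame's schema instead.
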
79 votes, 8 answers

Median / quantiles within PySpark groupBy

I would like to calculate group quantiles on a Spark dataframe (using PySpark). Either an approximate or exact result would be fine. I prefer a solution that I can use within the context of groupBy / agg, so that I can mix it with other PySpark…
abeboparebop (7,396 reputation)
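
A hedged sketch of a groupBy/agg-friendly approach using the approximate percentile function; "key" and "value" are assumed column names. F.percentile_approx is available as a DataFrame function from Spark 3.1, and the same SQL function can be reached through expr() on older versions:

    from pyspark.sql import functions as F

    # Approximate median (50th percentile) per group, Spark 3.1+:
    medians = df.groupBy("key").agg(F.percentile_approx("value", 0.5).alias("median"))

    # Same idea through a SQL expression on older Spark versions:
    medians = df.groupBy("key").agg(F.expr("percentile_approx(value, 0.5)").alias("median"))
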
79 votes, 4 answers

Filter df when values matches part of a string in pyspark

I have a large pyspark.sql.dataframe.DataFrame and I want to keep (so filter) all rows where the URL saved in the location column contains a pre-determined string, e.g. 'google.com'. I have tried: import pyspark.sql.functions as…
gaatjeniksaan (1,412 reputation)
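
A minimal sketch of the substring filter the question describes, using Column.contains (or LIKE); df and the location column are taken from the excerpt:

    from pyspark.sql import functions as F

    # Keep rows whose 'location' value contains the substring:
    filtered = df.filter(F.col("location").contains("google.com"))
    # Equivalent with SQL-style wildcards:
    filtered = df.filter(F.col("location").like("%google.com%"))
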
79 votes, 12 answers

Rename more than one column using withColumnRenamed

I want to change the names of two columns using the Spark withColumnRenamed function. Of course, I can write: data = sqlContext.createDataFrame([(1,2), (3,4)], ['x1', 'x2']) data = (data .withColumnRenamed('x1','x3') .withColumnRenamed('x2',…
user2280549 (1,204 reputation)
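
Chaining works, and a loop over a rename mapping is a common way to avoid repeating withColumnRenamed by hand; the mapping below mirrors the x1/x2 example in the excerpt:

    # Rename several columns one after another:
    mapping = {"x1": "x3", "x2": "x4"}
    for old_name, new_name in mapping.items():
        data = data.withColumnRenamed(old_name, new_name)

    # When every column is renamed and the order is known, toDF is shorter:
    data = data.toDF("x3", "x4")
    # Spark 3.4+ also offers data.withColumnsRenamed(mapping) for this in one call.
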
79 votes, 4 answers

PySpark: java.lang.OutofMemoryError: Java heap space

I have been using PySpark with IPython lately on my server with 24 CPUs and 32GB RAM. It's running on only one machine. In my process, I want to collect a huge amount of data, as given in the code below: train_dataRDD = (train.map(lambda…
pg2455 (5,039 reputation)
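
A heap-space error during collect usually points at driver memory; a minimal sketch of raising it when the session is created, with illustrative sizes (these settings only take effect if applied before the driver JVM starts, e.g. via spark-submit or spark-defaults):

    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             .config("spark.driver.memory", "16g")        # heap for the driver JVM
             .config("spark.driver.maxResultSize", "8g")  # cap on collected results
             .getOrCreate())
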
77 votes, 3 answers

How do I convert an array (i.e. list) column to Vector

Short version of the question! Consider the following snippet (assuming spark is already set to some SparkSession): from pyspark.sql import Row source_data = [ Row(city="Chicago", temperatures=[-1.0, -2.0, -3.0]), Row(city="New York",…
Arthur Tacca (8,833 reputation)
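
A hedged sketch of the usual conversion: there is no direct cast from an array column to an ML Vector, so a small UDF wraps Vectors.dense; source_df is assumed to hold the Rows from the excerpt:

    from pyspark.sql import functions as F
    from pyspark.ml.linalg import Vectors, VectorUDT

    to_vector = F.udf(lambda xs: Vectors.dense(xs), VectorUDT())
    df = source_df.withColumn("features", to_vector("temperatures"))
    # Spark 3.1+ also ships pyspark.ml.functions.array_to_vector for the same job.
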
76 votes, 2 answers

PySpark: How to fillna values in dataframe for specific columns?

I have the following sample DataFrame: a | b | c | 1 | 2 | 4 | 0 | null | null | null | 3 | 4 | And I want to replace null values only in the first 2 columns - Column "a" and "b": a | b | c | 1 | 2 | 4 …
Rakesh Adhikesavan (11,966 reputation)
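
A one-line sketch of the subset argument that fillna accepts, which restricts the replacement to the named columns:

    # Replace nulls with 0 only in columns 'a' and 'b', leaving 'c' untouched:
    df = df.fillna(0, subset=["a", "b"])
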
75 votes, 15 answers

How to flatten a struct in a Spark dataframe?

I have a dataframe with the following structure: |-- data: struct (nullable = true) | |-- id: long (nullable = true) | |-- keyNote: struct (nullable = true) | | |-- key: string (nullable = true) | | |-- note: string (nullable…
djWann (2,017 reputation)
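
A short sketch of flattening the struct from the excerpt, either wholesale with a star expansion or field by field with aliases:

    from pyspark.sql import functions as F

    # Expand every field of the 'data' struct into top-level columns:
    flat = df.select("data.*")

    # Or pick nested fields explicitly and rename them:
    flat = df.select(
        F.col("data.id").alias("id"),
        F.col("data.keyNote.key").alias("key"),
        F.col("data.keyNote.note").alias("note"),
    )
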
73 votes, 7 answers

Spark: subtract two DataFrames

In Spark version 1.2.0 one could use subtract with 2 SchemaRDDs to end up with only the different content from the first one: val onlyNewData = todaySchemaRDD.subtract(yesterdaySchemaRDD) onlyNewData contains the rows in todaySchemaRDD that do not…
Interfector (1,868 reputation)
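
The same operation is available on DataFrames in PySpark; a minimal sketch with assumed names today_df and yesterday_df:

    # Rows of today_df that do not appear in yesterday_df (duplicates removed):
    only_new_data = today_df.subtract(yesterday_df)
    # exceptAll (Spark 2.4+) keeps duplicate rows instead of de-duplicating:
    only_new_data = today_df.exceptAll(yesterday_df)
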
72 votes, 4 answers

Pyspark: Filter dataframe based on multiple conditions

I want to filter a dataframe according to the following conditions: firstly (d < 5), and secondly (the value of col2 is not equal to its counterpart in col4 if the value in col1 equals its counterpart in col3). If the original dataframe DF is as…
Sidhom (935 reputation)
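
A hedged sketch of combining conditions with & and |, each wrapped in parentheses; the second condition reflects one reading of the question (keep rows where col2 differs from col4 whenever col1 matches col3), which is an assumption:

    from pyspark.sql import functions as F

    filtered = df.filter(
        (F.col("d") < 5)
        & ((F.col("col1") != F.col("col3")) | (F.col("col2") != F.col("col4")))
    )
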