Questions tagged [pyspark]

The Spark Python API (PySpark) exposes the Apache Spark programming model to Python.

39058 questions
83 votes, 4 answers

Spark functions vs UDF performance?

Spark now offers predefined functions that can be used in dataframes, and it seems they are highly optimized. My original question was going to be about which is faster, but I did some testing myself and found the Spark functions to be about 10 times…
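
A minimal sketch of the kind of comparison the question refers to, putting a built-in column function next to an equivalent Python UDF; the DataFrame and column names here are illustrative, not from the post:

    from pyspark.sql import SparkSession, functions as F
    from pyspark.sql.types import StringType

    spark = SparkSession.builder.getOrCreate()
    df = spark.range(1_000_000).withColumn("word", F.lit("hello"))

    # Built-in function: evaluated inside the JVM, no Python serialization.
    built_in = df.withColumn("upper", F.upper(F.col("word")))

    # Python UDF: every value is shipped to a Python worker and back,
    # which is the usual source of the slowdown mentioned above.
    upper_udf = F.udf(lambda s: s.upper(), StringType())
    with_udf = df.withColumn("upper", upper_udf(F.col("word")))
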
82 votes, 3 answers

How to convert column with string type to int form in pyspark data frame?

I have a dataframe in PySpark. Some of its numerical columns contain nan, so when I read the data and check the schema of the dataframe, those columns have string type. How can I change them to int type? I replaced the nan values with 0…
neha (1,858 reputation)
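
A short sketch of the usual fix, casting the string column with Column.cast; the column name "value" is assumed for illustration:

    from pyspark.sql import functions as F
    from pyspark.sql.types import IntegerType

    # Cast the string column to int; values that cannot be parsed become null.
    df = df.withColumn("value", F.col("value").cast(IntegerType()))
    # Equivalent shorthand using the SQL type name:
    df = df.withColumn("value", F.col("value").cast("int"))
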
81 votes, 2 answers

pyspark collect_set or collect_list with groupby

How can I use collect_set or collect_list on a dataframe after groupby? For example: df.groupby('key').collect_set('values'). I get an error: AttributeError: 'GroupedData' object has no attribute 'collect_set'
Hanan Shteingart (8,480 reputation)
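
The error comes from calling collect_set directly on GroupedData; a minimal sketch of the agg-based form the question is looking for:

    from pyspark.sql import functions as F

    # collect_set / collect_list are aggregate functions, so they belong inside agg():
    result = df.groupBy("key").agg(F.collect_set("values").alias("value_set"))
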
81 votes, 7 answers

How to loop through each row of dataFrame in pyspark

E.g. sqlContext = SQLContext(sc) sample=sqlContext.sql("select Name ,age ,city from user") sample.show() The above statement prints the entire table on the terminal. But I want to access each row in that table using for or while to perform further…
Arti Berde (1,182 reputation)
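
A hedged sketch of two common ways to iterate over rows on the driver; sample is the DataFrame from the question, and process is a hypothetical placeholder for the caller's own logic:

    # Materialize all rows on the driver and iterate (fine for small results only):
    for row in sample.collect():
        print(row["Name"], row["age"], row["city"])

    # For larger results, pull rows over one partition at a time:
    for row in sample.toLocalIterator():
        process(row)  # 'process' is a placeholder, not part of the original question
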
81 votes, 3 answers

How to use JDBC source to write and read data in (Py)Spark?

The goal of this question is to document: the steps required to read and write data using JDBC connections in PySpark; possible issues with JDBC sources and known solutions. With small changes these methods should work with other supported languages…
zero323 (322,348 reputation)
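
A minimal sketch of the JDBC read and write paths in PySpark; the URL, table, credentials and driver class below are placeholders, not values from the question:

    # Reading from a JDBC source:
    df = (spark.read.format("jdbc")
          .option("url", "jdbc:postgresql://host:5432/dbname")
          .option("dbtable", "schema.tablename")
          .option("user", "username")
          .option("password", "password")
          .option("driver", "org.postgresql.Driver")
          .load())

    # Writing back through JDBC:
    (df.write.format("jdbc")
       .option("url", "jdbc:postgresql://host:5432/dbname")
       .option("dbtable", "schema.other_table")
       .option("user", "username")
       .option("password", "password")
       .mode("append")
       .save())
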
80 votes, 8 answers

How to get name of dataframe column in PySpark?

In pandas, this can be done by column.name. But how can I do the same when it's a column of a Spark dataframe? E.g. the calling program has a Spark dataframe: spark_df >>> spark_df.columns ['admit', 'gre', 'gpa', 'rank'] This program calls my function:…
Kaushik Acharya (1,520 reputation)
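
A small sketch of the DataFrame-level equivalents of pandas column.name, using the spark_df shown in the excerpt:

    # All column names of the DataFrame:
    names = spark_df.columns        # ['admit', 'gre', 'gpa', 'rank']
    # The same list via the schema:
    names = spark_df.schema.names
    # A bare Column expression has no public .name attribute, so answers
    # typically work from the DataFrame's schema instead.
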
79 votes, 8 answers

Median / quantiles within PySpark groupBy

I would like to calculate group quantiles on a Spark dataframe (using PySpark). Either an approximate or exact result would be fine. I prefer a solution that I can use within the context of groupBy / agg, so that I can mix it with other PySpark…
abeboparebop (7,396 reputation)
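
A hedged sketch of a groupBy/agg-friendly approach using the approximate percentile function; "key" and "value" are assumed column names. F.percentile_approx is available as a DataFrame function from Spark 3.1, and the same SQL function can be reached through expr() on older versions:

    from pyspark.sql import functions as F

    # Approximate median (50th percentile) per group, Spark 3.1+:
    medians = df.groupBy("key").agg(F.percentile_approx("value", 0.5).alias("median"))

    # Same idea through a SQL expression on older Spark versions:
    medians = df.groupBy("key").agg(F.expr("percentile_approx(value, 0.5)").alias("median"))
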
79 votes, 4 answers

Filter df when values matches part of a string in pyspark

I have a large pyspark.sql.dataframe.DataFrame and I want to keep (so filter) all rows where the URL saved in the location column contains a pre-determined string, e.g. 'google.com'. I have tried: import pyspark.sql.functions as…
gaatjeniksaan (1,412 reputation)
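
A minimal sketch of the substring filter the question describes, using Column.contains (or LIKE); df and the location column are taken from the excerpt:

    from pyspark.sql import functions as F

    # Keep rows whose 'location' value contains the substring:
    filtered = df.filter(F.col("location").contains("google.com"))
    # Equivalent with SQL-style wildcards:
    filtered = df.filter(F.col("location").like("%google.com%"))
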
79 votes, 12 answers

Rename more than one column using withColumnRenamed

I want to change the names of two columns using the Spark withColumnRenamed function. Of course, I can write: data = sqlContext.createDataFrame([(1,2), (3,4)], ['x1', 'x2']) data = (data .withColumnRenamed('x1','x3') .withColumnRenamed('x2',…
user2280549 (1,204 reputation)
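
Chaining works, and a loop over a rename mapping is a common way to avoid repeating withColumnRenamed by hand; the mapping below mirrors the x1/x2 example in the excerpt:

    # Rename several columns one after another:
    mapping = {"x1": "x3", "x2": "x4"}
    for old_name, new_name in mapping.items():
        data = data.withColumnRenamed(old_name, new_name)

    # When every column is renamed and the order is known, toDF is shorter:
    data = data.toDF("x3", "x4")
    # Spark 3.4+ also offers data.withColumnsRenamed(mapping) for this in one call.
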
79 votes, 4 answers

PySpark: java.lang.OutofMemoryError: Java heap space

I have been using PySpark with IPython lately on my server with 24 CPUs and 32GB RAM. It's running on only one machine. In my process, I want to collect a huge amount of data, as given in the code below: train_dataRDD = (train.map(lambda…
pg2455 (5,039 reputation)
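
A heap-space error during collect usually points at driver memory; a minimal sketch of raising it when the session is created, with illustrative sizes (these settings only take effect if applied before the driver JVM starts, e.g. via spark-submit or spark-defaults):

    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             .config("spark.driver.memory", "16g")        # heap for the driver JVM
             .config("spark.driver.maxResultSize", "8g")  # cap on collected results
             .getOrCreate())
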
77 votes, 3 answers

How do I convert an array (i.e. list) column to Vector

Short version of the question! Consider the following snippet (assuming spark is already set to some SparkSession): from pyspark.sql import Row source_data = [ Row(city="Chicago", temperatures=[-1.0, -2.0, -3.0]), Row(city="New York",…
Arthur Tacca (8,833 reputation)
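
A hedged sketch of the usual conversion: there is no direct cast from an array column to an ML Vector, so a small UDF wraps Vectors.dense; source_df is assumed to hold the Rows from the excerpt:

    from pyspark.sql import functions as F
    from pyspark.ml.linalg import Vectors, VectorUDT

    to_vector = F.udf(lambda xs: Vectors.dense(xs), VectorUDT())
    df = source_df.withColumn("features", to_vector("temperatures"))
    # Spark 3.1+ also ships pyspark.ml.functions.array_to_vector for the same job.
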
76 votes, 2 answers

PySpark: How to fillna values in dataframe for specific columns?

I have the following sample DataFrame: a | b | c | 1 | 2 | 4 | 0 | null | null | null | 3 | 4 | And I want to replace null values only in the first 2 columns - Column "a" and "b": a | b | c | 1 | 2 | 4 …
Rakesh Adhikesavan (11,966 reputation)
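
A one-line sketch of the subset argument that fillna accepts, which restricts the replacement to the named columns:

    # Replace nulls with 0 only in columns 'a' and 'b', leaving 'c' untouched:
    df = df.fillna(0, subset=["a", "b"])
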
75 votes, 15 answers

How to flatten a struct in a Spark dataframe?

I have a dataframe with the following structure: |-- data: struct (nullable = true) | |-- id: long (nullable = true) | |-- keyNote: struct (nullable = true) | | |-- key: string (nullable = true) | | |-- note: string (nullable…
djWann (2,017 reputation)
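
A short sketch of flattening the struct from the excerpt, either wholesale with a star expansion or field by field with aliases:

    from pyspark.sql import functions as F

    # Expand every field of the 'data' struct into top-level columns:
    flat = df.select("data.*")

    # Or pick nested fields explicitly and rename them:
    flat = df.select(
        F.col("data.id").alias("id"),
        F.col("data.keyNote.key").alias("key"),
        F.col("data.keyNote.note").alias("note"),
    )
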
73 votes, 7 answers

Spark: subtract two DataFrames

In Spark version 1.2.0 one could use subtract with 2 SchemaRDDs to end up with only the different content from the first one: val onlyNewData = todaySchemaRDD.subtract(yesterdaySchemaRDD) onlyNewData contains the rows in todaySchemaRDD that do not…
Interfector (1,868 reputation)
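
The same operation is available on DataFrames in PySpark; a minimal sketch with assumed names today_df and yesterday_df:

    # Rows of today_df that do not appear in yesterday_df (duplicates removed):
    only_new_data = today_df.subtract(yesterday_df)
    # exceptAll (Spark 2.4+) keeps duplicate rows instead of de-duplicating:
    only_new_data = today_df.exceptAll(yesterday_df)
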
72 votes, 4 answers

Pyspark: Filter dataframe based on multiple conditions

I want to filter a dataframe according to the following conditions: firstly (d < 5), and secondly (the value of col2 is not equal to its counterpart in col4 if the value in col1 equals its counterpart in col3). If the original dataframe DF is as…
Sidhom (935 reputation)
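
A hedged sketch of combining conditions with & and |, each wrapped in parentheses; the second condition reflects one reading of the question (keep rows where col2 differs from col4 whenever col1 matches col3), which is an assumption:

    from pyspark.sql import functions as F

    filtered = df.filter(
        (F.col("d") < 5)
        & ((F.col("col1") != F.col("col3")) | (F.col("col2") != F.col("col4")))
    )
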