Questions tagged [pyspark]

The Spark Python API (PySpark) exposes the Apache Spark programming model to Python.

39058 questions
97 votes, 7 answers

Cannot find col function in pyspark

In pyspark 1.6.2, I can import col function by from pyspark.sql.functions import col but when I try to look it up in the Github source code I find no col function in functions.py file, how can python import a function that doesn't exist?
Bamqf
  • 3,382
  • 8
  • 33
  • 47
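
A minimal sketch of how a module can export a function that never appears as a `def` in its source: the functions are generated from a table of names at import time and stored in the module namespace. To my understanding, PySpark releases of that era built `col` in a similar way, but the code below only mimics the pattern. `_FUNCTION_DOCS`, `_make_function`, and the tuple return value are hypothetical stand-ins, not PySpark's code.

```python
# Hypothetical name -> docstring table (assumption: stands in for a
# module-level registry of functions to generate).
_FUNCTION_DOCS = {
    "col": "Returns a column reference for the given name.",
    "lit": "Wraps a literal value.",
}


def _make_function(name, doc):
    """Build a function called `name` without writing `def name` anywhere."""
    def _(value):
        # A real library would delegate to its backend here; this sketch
        # just returns a tagged tuple.
        return (name, value)
    _.__name__ = name
    _.__doc__ = doc
    return _


# Register every generated function in the module namespace at import time.
for _name, _doc in _FUNCTION_DOCS.items():
    globals()[_name] = _make_function(_name, _doc)

print(col("age"))  # prints ('col', 'age')
```

Searching the source for `def col` finds nothing, yet the name is importable because it exists in the module's globals once the loop has run.
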
97 votes, 19 answers

How do I set the driver's python version in spark?

I'm using spark 1.4.0-rc2 so I can use python 3 with spark. If I add export PYSPARK_PYTHON=python3 to my .bashrc file, I can run spark interactively with python 3. However, if I want to run a standalone program in local mode, I get an…
Kevin
  • 3,391
  • 5
  • 30
  • 40
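
One commonly suggested way to keep worker interpreters aligned with the driver in a standalone script is to set `PYSPARK_PYTHON` from inside the program, before any Spark context exists. The sketch below assumes the script itself is started with the Python you want; `pin_worker_python` is a hypothetical helper name. `PYSPARK_DRIVER_PYTHON` is the launcher-side counterpart and is normally set in the shell rather than in the script.

```python
import os
import sys


def pin_worker_python(executable=sys.executable):
    """Point PYSPARK_PYTHON at the given interpreter (hypothetical helper).

    Defaults to the interpreter running this script, so workers are
    launched with the same Python as the driver.
    """
    os.environ["PYSPARK_PYTHON"] = executable
    return executable


# Call this before creating a SparkContext; setting the variable after the
# context is created is too late for this approach.
chosen = pin_worker_python()
print(chosen)
```
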
97 votes, 8 answers

Removing duplicates from rows based on specific columns in an RDD/Spark DataFrame

Let's say I have a rather large dataset in the following form: data = sc.parallelize([('Foo',41,'US',3), ('Foo',39,'UK',1), ('Bar',57,'CA',2), ('Bar',72,'CA',2), …
Jason
  • 2,834
  • 6
  • 31
  • 35
95 votes, 5 answers

Add an empty column to Spark DataFrame

As mentioned in many other locations on the web, adding a new column to an existing DataFrame is not straightforward. Unfortunately it is important to have this functionality (even though it is inefficient in a distributed environment) especially…
architectonic
  • 2,871
  • 2
  • 21
  • 35
95 votes, 5 answers

Updating a dataframe column in spark

Looking at the new spark DataFrame API, it is unclear whether it is possible to modify dataframe columns. How would I go about changing a value in row x column y of a dataframe? In pandas this would be: df.ix[x,y] = new_value Edit: Consolidating…
Luke
  • 6,699
  • 13
  • 50
  • 88
93 votes, 4 answers

Create Spark DataFrame. Can not infer schema for type

Could someone help me solve this problem I have with Spark DataFrame? When I do myFloatRDD.toDF() I get an error: TypeError: Can not infer schema for type: type 'float' I don't understand why... Example: myFloatRdd =…
Breach
  • 1,288
  • 1
  • 11
  • 25
91 votes, 7 answers

Pyspark: display a spark data frame in a table format

I am using pyspark to read a parquet file like below: my_df = sqlContext.read.parquet('hdfs://myPath/myDB.db/myTable/**') Then when I do my_df.take(5), it will show [Row(...)], instead of a table format like when we use the pandas data frame. Is…
Edamame
  • 23,718
  • 73
  • 186
  • 320
91 votes, 4 answers

How to join on multiple columns in Pyspark?

I am using Spark 1.3 and would like to join on multiple columns using python interface (SparkSQL) The following works: I first register them as temp tables. numeric.registerTempTable("numeric") Ref.registerTempTable("Ref") test = numeric.join(Ref,…
user3803714
  • 5,269
  • 10
  • 42
  • 61
90 votes, 10 answers

collect_list by preserving order based on another variable

I am trying to create a new column of lists in Pyspark using a groupby aggregation on existing set of columns. An example input data frame is provided below: ------------------------ id | date | value ------------------------ 1 |2014-01-03 …
Ravi
  • 3,223
  • 7
  • 37
  • 49
90 votes, 10 answers

How to pivot Spark DataFrame?

I am starting to use Spark DataFrames and I need to be able to pivot the data to create multiple columns out of 1 column with multiple rows. There is built in functionality for that in Scalding and I believe in Pandas in Python, but I can't find…
J Calbreath
  • 2,665
  • 4
  • 22
  • 31
89 votes, 17 answers

How to link PyCharm with PySpark?

I'm new with apache spark and apparently I installed apache-spark with homebrew in my macbook: Last login: Fri Jan 8 12:52:04 on console user@MacBook-Pro-de-User-2:~$ pyspark Python 2.7.10 (default, Jul 13 2015, 12:05:58) [GCC 4.2.1 Compatible…
tumbleweed
  • 4,624
  • 12
  • 50
  • 81
86 votes, 4 answers

Pyspark: Split multiple array columns into rows

I have a dataframe which has one row, and several columns. Some of the columns are single values, and others are lists. All list columns are the same length. I want to split each list column into a separate row, while keeping any non-list column as…
Steve
  • 2,401
  • 3
  • 24
  • 28
85 votes, 22 answers

How to perform union on two DataFrames with different amounts of columns in Spark?

I have 2 DataFrames: I need union like this: The unionAll function doesn't work because the number and the name of columns are different. How can I do this?
Allan Feliph
  • 862
  • 1
  • 8
  • 8
85 votes, 9 answers

How to find median and quantiles using Spark

How can I find median of an RDD of integers using a distributed method, IPython, and Spark? The RDD is approximately 700,000 elements and therefore too large to collect and find the median. This question is similar to this question: How can I…
pr338
  • 8,730
  • 19
  • 52
  • 71
83 votes, 4 answers

How to make good reproducible Apache Spark examples

I've been spending a fair amount of time reading through some questions with the pyspark and spark-dataframe tags and very often I find that posters don't provide enough information to truly understand their question. I usually comment asking them…
pault
  • 41,343
  • 15
  • 107
  • 149