Highest Voted 'pyspark' Questions

97

votes

7 answers

Cannot find col function in pyspark

In pyspark 1.6.2, I can import the col function with from pyspark.sql.functions import col, but when I look it up in the GitHub source code I find no col function in the functions.py file. How can Python import a function that doesn't exist?

python apache-spark pyspark apache-spark-sql

asked Oct 20 '16 at 19:38

Bamqf

3,382
8
33
47
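A note on the question above: one likely explanation (an assumption here, the excerpt itself doesn't confirm it) is that the function is never written out as a plain def but is generated at import time and placed into the module's namespace. The sketch below shows that general Python mechanism in isolation. It is not PySpark's source, and the names _function_docs and _create_function are hypothetical stand-ins.

```python
# Illustrative sketch only: a stand-in for the pattern, not PySpark's source.
# The names below (_function_docs, _create_function) are hypothetical.
_function_docs = {
    "col": "Returns a column based on the given column name.",
    "lit": "Creates a column of literal value.",
}


def _create_function(name, doc=""):
    """Build a named wrapper at runtime."""
    def _(value):
        # A real library would call into its backend here;
        # this sketch just returns a tuple so the result is visible.
        return (name, value)
    _.__name__ = name
    _.__doc__ = doc
    return _


# Inject each generated function into this module's namespace.
# No "def col" appears anywhere in the file, yet col exists after this loop.
for _name, _doc in _function_docs.items():
    globals()[_name] = _create_function(_name, _doc)

print(col("age"))  # ('col', 'age')
```

Searching such a file for "def col" finds nothing, and static tools may flag col as undefined, even though the import works at runtime.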

97

votes

19 answers

How do I set the driver's python version in spark?

I'm using spark 1.4.0-rc2 so I can use python 3 with spark. If I add export PYSPARK_PYTHON=python3 to my .bashrc file, I can run spark interactively with python 3. However, if I want to run a standalone program in local mode, I get an…

python apache-spark pyspark

asked May 28 '15 at 22:52

Kevin

3,391
5
30
40
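A note on the question above: the excerpt is cut off before the error, so this is only a sketch of one common approach, not a confirmed fix. It assumes the standalone script is started with the interpreter you want (for example python3 my_job.py, a hypothetical file name) and sets PYSPARK_PYTHON from inside the script before any SparkContext exists. Spark also documents a PYSPARK_DRIVER_PYTHON variable, but that one is read by the launcher scripts, so it is not set here.

```python
import os
import sys

# Assumption: this script was launched with the interpreter Spark should use,
# e.g. `python3 my_job.py`. Point PYSPARK_PYTHON at that same interpreter
# before any SparkContext is created, so workers do not fall back to a
# different default `python`.
os.environ["PYSPARK_PYTHON"] = sys.executable

# A SparkContext created after this point (not shown here) would pick the
# setting up from the process environment.
print(os.environ["PYSPARK_PYTHON"])
```

Setting the variable in the script keeps the choice of interpreter next to the code that depends on it, rather than in .bashrc, which non-interactive runs may not read.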

97

votes

8 answers

Removing duplicates from rows based on specific columns in an RDD/Spark DataFrame

Let's say I have a rather large dataset in the following form: data = sc.parallelize([('Foo',41,'US',3), ('Foo',39,'UK',1), ('Bar',57,'CA',2), ('Bar',72,'CA',2), …

apache-spark apache-spark-sql pyspark

asked May 14 '15 at 22:03

Jason

2,834
6
31
35

95

votes

5 answers

Add an empty column to Spark DataFrame

As mentioned in many other locations on the web, adding a new column to an existing DataFrame is not straightforward. Unfortunately it is important to have this functionality (even though it is inefficient in a distributed environment) especially…

python apache-spark dataframe pyspark apache-spark-sql

asked Oct 09 '15 at 12:45

architectonic

2,871
2
21
35

95

votes

5 answers

Updating a dataframe column in spark

Looking at the new spark DataFrame API, it is unclear whether it is possible to modify dataframe columns. How would I go about changing a value in row x column y of a dataframe? In pandas this would be: df.ix[x,y] = new_value Edit: Consolidating…

python dataframe apache-spark pyspark apache-spark-sql

asked Mar 17 '15 at 21:19

Luke

6,699
13
50
88

93

votes

4 answers

Create Spark DataFrame. Can not infer schema for type

Could someone help me solve this problem I have with Spark DataFrame? When I do myFloatRDD.toDF() I get an error: TypeError: Can not infer schema for type: type 'float' I don't understand why... Example: myFloatRdd =…

python apache-spark dataframe pyspark apache-spark-sql

asked Sep 23 '15 at 14:13

Breach

1,288
1
11
25

91

votes

7 answers

Pyspark: display a spark data frame in a table format

I am using pyspark to read a parquet file like below: my_df = sqlContext.read.parquet('hdfs://myPath/myDB.db/myTable/**') Then when I do my_df.take(5), it will show [Row(...)], instead of a table format like when we use the pandas data frame. Is…

python pandas pyspark apache-spark-sql

asked Aug 21 '16 at 18:24

Edamame

23,718
73
186
320

91

votes

4 answers

How to join on multiple columns in Pyspark?

I am using Spark 1.3 and would like to join on multiple columns using the Python interface (SparkSQL). The following works: I first register them as temp tables. numeric.registerTempTable("numeric") Ref.registerTempTable("Ref") test = numeric.join(Ref,…

python apache-spark join pyspark apache-spark-sql

asked Nov 16 '15 at 22:37

user3803714

5,269
10
42
61

90

votes

10 answers

collect_list by preserving order based on another variable

I am trying to create a new column of lists in Pyspark using a groupby aggregation on an existing set of columns. An example input data frame is provided below:
------------------------
id | date | value
------------------------
1 |2014-01-03 …

python apache-spark pyspark

asked Oct 05 '17 at 07:34

Ravi

3,223
7
37
49

90

votes

10 answers

How to pivot Spark DataFrame?

I am starting to use Spark DataFrames and I need to be able to pivot the data to create multiple columns out of one column with multiple rows. There is built-in functionality for that in Scalding and, I believe, in Pandas in Python, but I can't find…

dataframe apache-spark pyspark apache-spark-sql pivot

asked May 14 '15 at 18:42

J Calbreath

2,665
4
22
31

89

votes

17 answers

How to link PyCharm with PySpark?

I'm new to Apache Spark, and apparently I installed apache-spark with Homebrew on my MacBook:
Last login: Fri Jan 8 12:52:04 on console
user@MacBook-Pro-de-User-2:~$ pyspark
Python 2.7.10 (default, Jul 13 2015, 12:05:58) [GCC 4.2.1 Compatible…

python apache-spark pyspark pycharm homebrew

asked Jan 08 '16 at 20:55

tumbleweed

4,624
12
50
81

86

votes

4 answers

Pyspark: Split multiple array columns into rows

I have a dataframe which has one row, and several columns. Some of the columns are single values, and others are lists. All list columns are the same length. I want to split each list column into a separate row, while keeping any non-list column as…

python apache-spark dataframe pyspark apache-spark-sql

asked Dec 07 '16 at 21:02

Steve

2,401
3
24
28

85

votes

22 answers

How to perform union on two DataFrames with different amounts of columns in Spark?

I have two DataFrames and need to union them. The unionAll function doesn't work because the number and names of the columns are different. How can I do this?

python apache-spark pyspark apache-spark-sql union

asked Sep 28 '16 at 21:34

Allan Feliph

862
1
8
8

85

votes

9 answers

How to find median and quantiles using Spark

How can I find median of an RDD of integers using a distributed method, IPython, and Spark? The RDD is approximately 700,000 elements and therefore too large to collect and find the median. This question is similar to this question: How can I…

python apache-spark median rdd pyspark

asked Jul 15 '15 at 14:11

pr338

8,730
19
52
71

83

votes

4 answers

How to make good reproducible Apache Spark examples

I've been spending a fair amount of time reading through some questions with the pyspark and spark-dataframe tags and very often I find that posters don't provide enough information to truly understand their question. I usually comment asking them…

dataframe apache-spark pyspark apache-spark-sql

asked Jan 24 '18 at 16:24

pault

41,343
15
107
149
