Questions tagged [pyspark]

The Spark Python API (PySpark) exposes the Apache Spark programming model to Python.

39,058 questions
134 votes, 20 answers

importing pyspark in python shell

This is a copy of someone else's question on another forum that was never answered, so I thought I'd re-ask it here, as I have the same issue. (See http://geekple.com/blogs/feeds/Xgzu7/posts/351703064084736) I have Spark installed properly on my…
Glenn Strycker • 4,816 • 6 • 31 • 51
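A minimal sketch of one common workaround, assuming Spark lives at a placeholder path and the optional findspark helper is installed; it makes the pyspark package importable from an ordinary Python shell:

```python
import os

# Placeholder install location; adjust to the actual Spark directory.
os.environ.setdefault("SPARK_HOME", "/opt/spark")

import findspark          # optional helper: pip install findspark
findspark.init()          # puts $SPARK_HOME/python and py4j on sys.path

import pyspark
sc = pyspark.SparkContext(appName="import-check")
print(sc.version)
sc.stop()
```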
131 votes, 13 answers

Best way to get the max value in a Spark dataframe column

I'm trying to figure out the best way to get the largest value in a Spark dataframe column. Consider the following example: df = spark.createDataFrame([(1., 4.), (2., 5.), (3., 6.)], ["A", "B"]) df.show() Which creates: +---+---+ | A| …
xenocyon • 2,409 • 3 • 20 • 22
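A sketch of one common answer, using the aggregate functions from pyspark.sql.functions on the example frame above:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(1., 4.), (2., 5.), (3., 6.)], ["A", "B"])

# Aggregate, then pull the single resulting value back to the driver.
max_a = df.agg(F.max("A")).collect()[0][0]
print(max_a)  # 3.0
```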
130 votes, 6 answers

Convert pyspark string to date format

I have a pyspark dataframe with a string column in the format MM-dd-yyyy and I am attempting to convert it into a date column. I tried: df.select(to_date(df.STRING_COLUMN).alias('new_date')).show() and I get a column of nulls. Can anyone…
Jenks • 1,950 • 3 • 20 • 27
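The nulls usually mean to_date could not parse the default yyyy-MM-dd pattern; one likely fix (a sketch, assuming Spark 2.2+ where to_date accepts a format argument) is to pass the source pattern explicitly:

```python
from pyspark.sql import functions as F

# to_date defaults to yyyy-MM-dd, so MM-dd-yyyy strings come back as null.
df.select(
    F.to_date(F.col("STRING_COLUMN"), "MM-dd-yyyy").alias("new_date")
).show()
```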
129 votes, 14 answers

Concatenate two PySpark dataframes

I'm trying to concatenate two PySpark dataframes with some columns that are only on one of them: from pyspark.sql.functions import randn, rand df_1 = sqlContext.range(0, 10) +--+ |id| +--+ | 0| | 1| | 2| | 3| | 4| | 5| | 6| | 7| | 8| |…
Ivan • 19,560 • 31 • 97 • 141
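A sketch of one way to handle the mismatched columns: add the missing columns as typed nulls on each side, then union by name (unionByName needs Spark 2.3+; the helper name is illustrative):

```python
from pyspark.sql import functions as F

def union_mismatched(df_a, df_b):
    """Union two DataFrames, filling columns missing on either side with nulls."""
    for c in set(df_b.columns) - set(df_a.columns):
        df_a = df_a.withColumn(c, F.lit(None).cast(df_b.schema[c].dataType))
    for c in set(df_a.columns) - set(df_b.columns):
        df_b = df_b.withColumn(c, F.lit(None).cast(df_a.schema[c].dataType))
    return df_a.unionByName(df_b)
```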
126 votes, 42 answers

Pyspark: Exception: Java gateway process exited before sending the driver its port number

I'm trying to run pyspark on my MacBook Air. When I try starting it up, I get the error: Exception: Java gateway process exited before sending the driver its port number when sc = SparkContext() is called during startup. I have tried running the…
mt88 • 2,855 • 8 • 24 • 42
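This exception means the JVM never started; a missing or incompatible JAVA_HOME is a frequent cause. A hedged sketch of one check (the JDK path is a placeholder, and Spark 2.x generally expects Java 8):

```python
import os

# Placeholder path to a known-good JDK; PySpark's launcher inherits this
# environment when it starts the Java gateway.
os.environ["JAVA_HOME"] = "/Library/Java/JavaVirtualMachines/jdk1.8.0_202.jdk/Contents/Home"

from pyspark import SparkContext
sc = SparkContext(appName="gateway-check")
print(sc.version)
sc.stop()
```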
126 votes, 13 answers

Load CSV file with PySpark

I'm new to Spark and I'm trying to read CSV data from a file with Spark. Here's what I am doing: sc.textFile('file.csv') .map(lambda line: (line.split(',')[0], line.split(',')[1])) .collect() I would expect this call to give me a list of…
Kernael • 3,270 • 4 • 22 • 42
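Rather than splitting lines by hand, the DataFrame CSV reader (Spark 2.0+) handles quoting, headers, and types; a minimal sketch, where the header and inferSchema settings are assumptions about the file:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# For Spark 1.x the external spark-csv package fills the same role.
df = spark.read.csv("file.csv", header=True, inferSchema=True)
pairs = df.select(df.columns[0], df.columns[1]).collect()
```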
120 votes, 15 answers

Join two data frames, select all columns from one and some columns from the other

Let's say I have a spark data frame df1, with several columns (among which the column id) and data frame df2 with two columns, id and other. Is there a way to replicate the following command: sqlContext.sql("SELECT df1.*, df2.other FROM df1 JOIN df2…
Francesco Sambo • 1,213 • 2 • 9 • 6
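A sketch of the DataFrame-API equivalent, assuming both frames carry an id column:

```python
# Equivalent of: SELECT df1.*, df2.other FROM df1 JOIN df2 ON df1.id = df2.id
joined = df1.join(df2, df1["id"] == df2["id"]).select(df1["*"], df2["other"])

# Joining on the column name instead also avoids a duplicated id column:
joined_by_name = df1.join(df2, "id")
```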
119 votes, 8 answers

How to fix 'TypeError: an integer is required (got type bytes)' error when trying to run pyspark after installing spark 2.4.4

I've installed OpenJDK 13.0.1, Python 3.8, and Spark 2.4.4. The instructions for testing the install are to run .\bin\pyspark from the root of the Spark installation. I'm not sure if I missed a step in the Spark installation, like setting some…
Chris • 1,195 • 2 • 7 • 7
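Spark 2.4.x predates Python 3.8, and its bundled pickling code fails there with exactly this TypeError; the usual fixes are to run the shell under Python 3.7 or move to Spark 3.x. A small diagnostic sketch, standard library only:

```python
import sys

# Spark 2.4.x supports Python up to 3.7; under 3.8+ its bundled cloudpickle
# raises "an integer is required (got type bytes)".
if sys.version_info >= (3, 8):
    print("Running Python %d.%d: use Spark 3.x, or point PYSPARK_PYTHON and "
          "PYSPARK_DRIVER_PYTHON at a Python 3.7 interpreter." % sys.version_info[:2])
```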
112 votes, 3 answers

pyspark dataframe filter or include based on list

I am trying to filter a dataframe in pyspark using a list. I want to either filter based on the list or include only those records with a value in the list. My code below does not work: # define a dataframe rdd = sc.parallelize([(0,1), (0,1),…
user3133475 • 2,951 • 3 • 13 • 11
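A sketch using Column.isin; the column name and the list contents are illustrative:

```python
from pyspark.sql import functions as F

allowed = [1, 2, 3]                                   # illustrative values
kept = df.where(F.col("value").isin(allowed))         # keep rows in the list
excluded = df.where(~F.col("value").isin(allowed))    # or keep rows not in it
```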
110 votes, 9 answers

Renaming columns for PySpark DataFrame aggregates

I am analysing some data with PySpark DataFrames. Suppose I have a DataFrame df that I am aggregating: (df.groupBy("group") .agg({"money":"sum"}) .show(100) ) This will give me: group SUM(money#2L) A …
cantdutchthis • 31,949 • 17 • 74 • 114
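One way to control the output name is to switch from the dict form of agg to the functions API and alias the result; a sketch on the same grouping:

```python
from pyspark.sql import functions as F

(df.groupBy("group")
   .agg(F.sum("money").alias("money_sum"))   # instead of SUM(money#2L)
   .show(100))
```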
107 votes, 11 answers

Spark Error - Unsupported class file major version

I'm trying to install Spark on my Mac. I've used Homebrew to install Spark 2.4.0 and Scala. I've installed PySpark in my Anaconda environment and am using PyCharm for development. I've exported to my bash profile: export SPARK_VERSION=`ls…
shbfy • 2,075 • 3 • 16 • 37
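This error typically shows up when Spark 2.4 runs on Java 11 or newer; pointing it back at a Java 8 JDK is the common fix. A sketch, with a placeholder JDK path:

```python
import os

# Spark 2.4.x expects Java 8; "Unsupported class file major version 55"
# usually means it is running on Java 11+. Placeholder Java 8 location:
os.environ["JAVA_HOME"] = "/Library/Java/JavaVirtualMachines/jdk1.8.0_202.jdk/Contents/Home"

from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
```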
107 votes, 5 answers

Split Spark dataframe string column into multiple columns

I've seen various people suggesting that Dataframe.explode is a useful way to do this, but it results in more rows than the original dataframe, which isn't what I want at all. I simply want to do the Dataframe equivalent of the very…
Peter Gaultney • 3,269 • 4 • 16 • 20
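A sketch using split plus getItem, which keeps one output row per input row (unlike explode); the column name and delimiter are assumptions:

```python
from pyspark.sql import functions as F

# Split the "raw" column on whitespace and fan the pieces out into columns.
parts = F.split(F.col("raw"), r"\s+")
df_split = (df
            .withColumn("first", parts.getItem(0))
            .withColumn("second", parts.getItem(1)))
```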
103 votes, 12 answers

How to find count of Null and Nan values for each column in a PySpark dataframe efficiently?

import numpy as np data = [ (1, 1, None), (1, 2, float(5)), (1, 3, np.nan), (1, 4, None), (1, 5, float(10)), (1, 6, float("nan")), (1, 6, float("nan")), ] df = spark.createDataFrame(data, ("session", "timestamp1",…
GeorgeOfTheRF • 8,244 • 23 • 57 • 80
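One common approach builds a single select with one conditional count per column, so the data is scanned once; a sketch over the example frame (isnan only applies to numeric columns, which is all of them here):

```python
from pyspark.sql import functions as F

counts = df.select([
    F.count(F.when(F.isnan(c) | F.col(c).isNull(), c)).alias(c)
    for c in df.columns
])
counts.show()
```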
101 votes, 14 answers

Is it possible to get the current spark context settings in PySpark?

I'm trying to get the path to spark.worker.dir for the current SparkContext. If I explicitly set it as a config param, I can read it back out of SparkConf, but is there any way to access the complete config (including all defaults) using PySpark?
whisperstream • 1,897 • 3 • 20 • 25
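The configuration the context is actually using can be read back from it; a short sketch (sc and spark are the usual context and session handles, and getAll only lists explicitly-set keys, not every built-in default):

```python
# Explicitly-set configuration as (key, value) pairs:
for key, value in sc.getConf().getAll():
    print(key, "=", value)

# In Spark 2.x+, runtime settings are also reachable through the session,
# e.g. spark.conf.get("spark.worker.dir"), which raises if the key is unset.
```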
98 votes, 10 answers

Removing duplicate columns after a DF join in Spark

When you join two DFs with similar column names: df = df1.join(df2, df1['id'] == df2['id']) Join works fine but you can't call the id column because it is ambiguous and you would get the following exception: pyspark.sql.utils.AnalysisException:…
thecheech • 2,041 • 3 • 18 • 25
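Two common ways around the ambiguity, sketched below: join on the column name so only one id survives, or keep the expression join and drop one side's copy afterwards:

```python
# Option 1: joining on the name yields a single id column.
joined = df1.join(df2, on="id")

# Option 2: keep the expression join, then drop df2's copy of id.
joined = df1.join(df2, df1["id"] == df2["id"]).drop(df2["id"])
```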