Questions tagged [pyspark]

The Spark Python API (PySpark) exposes the Apache Spark programming model to Python.

39,058 questions
134 votes, 20 answers

importing pyspark in python shell

This is a copy of someone else's question on another forum that was never answered, so I thought I'd re-ask it here, as I have the same issue. (See http://geekple.com/blogs/feeds/Xgzu7/posts/351703064084736) I have Spark installed properly on my…
Glenn Strycker • 4,816 • 6 • 31 • 51
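A minimal sketch of one common workaround, assuming Spark lives at a placeholder path and the optional findspark helper is installed; it makes the pyspark package importable from an ordinary Python shell:

```python
import os

# Placeholder install location; adjust to the actual Spark directory.
os.environ.setdefault("SPARK_HOME", "/opt/spark")

import findspark          # optional helper: pip install findspark
findspark.init()          # puts $SPARK_HOME/python and py4j on sys.path

import pyspark
sc = pyspark.SparkContext(appName="import-check")
print(sc.version)
sc.stop()
```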
131 votes, 13 answers

Best way to get the max value in a Spark dataframe column

I'm trying to figure out the best way to get the largest value in a Spark dataframe column. Consider the following example: df = spark.createDataFrame([(1., 4.), (2., 5.), (3., 6.)], ["A", "B"]) df.show() Which creates: +---+---+ | A| …
xenocyon • 2,409 • 3 • 20 • 22
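A sketch of one common answer, using the aggregate functions from pyspark.sql.functions on the example frame above:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(1., 4.), (2., 5.), (3., 6.)], ["A", "B"])

# Aggregate, then pull the single resulting value back to the driver.
max_a = df.agg(F.max("A")).collect()[0][0]
print(max_a)  # 3.0
```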
130 votes, 6 answers

Convert pyspark string to date format

I have a pyspark dataframe with a string column in the format MM-dd-yyyy and I am attempting to convert it into a date column. I tried: df.select(to_date(df.STRING_COLUMN).alias('new_date')).show() and I get a column of nulls. Can anyone…
Jenks • 1,950 • 3 • 20 • 27
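The nulls usually mean to_date could not parse the default yyyy-MM-dd pattern; one likely fix (a sketch, assuming Spark 2.2+ where to_date accepts a format argument) is to pass the source pattern explicitly:

```python
from pyspark.sql import functions as F

# to_date defaults to yyyy-MM-dd, so MM-dd-yyyy strings come back as null.
df.select(
    F.to_date(F.col("STRING_COLUMN"), "MM-dd-yyyy").alias("new_date")
).show()
```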
129 votes, 14 answers

Concatenate two PySpark dataframes

I'm trying to concatenate two PySpark dataframes with some columns that are only on one of them: from pyspark.sql.functions import randn, rand df_1 = sqlContext.range(0, 10) +--+ |id| +--+ | 0| | 1| | 2| | 3| | 4| | 5| | 6| | 7| | 8| |…
Ivan • 19,560 • 31 • 97 • 141
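A sketch of one way to handle the mismatched columns: add the missing columns as typed nulls on each side, then union by name (unionByName needs Spark 2.3+; the helper name is illustrative):

```python
from pyspark.sql import functions as F

def union_mismatched(df_a, df_b):
    """Union two DataFrames, filling columns missing on either side with nulls."""
    for c in set(df_b.columns) - set(df_a.columns):
        df_a = df_a.withColumn(c, F.lit(None).cast(df_b.schema[c].dataType))
    for c in set(df_a.columns) - set(df_b.columns):
        df_b = df_b.withColumn(c, F.lit(None).cast(df_a.schema[c].dataType))
    return df_a.unionByName(df_b)
```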
126 votes, 42 answers

Pyspark: Exception: Java gateway process exited before sending the driver its port number

I'm trying to run pyspark on my MacBook Air. When I try starting it up, I get the error: Exception: Java gateway process exited before sending the driver its port number when sc = SparkContext() is called during startup. I have tried running the…
mt88 • 2,855 • 8 • 24 • 42
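This exception means the JVM never started; a missing or incompatible JAVA_HOME is a frequent cause. A hedged sketch of one check (the JDK path is a placeholder, and Spark 2.x generally expects Java 8):

```python
import os

# Placeholder path to a known-good JDK; PySpark's launcher inherits this
# environment when it starts the Java gateway.
os.environ["JAVA_HOME"] = "/Library/Java/JavaVirtualMachines/jdk1.8.0_202.jdk/Contents/Home"

from pyspark import SparkContext
sc = SparkContext(appName="gateway-check")
print(sc.version)
sc.stop()
```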
126 votes, 13 answers

Load CSV file with PySpark

I'm new to Spark and I'm trying to read CSV data from a file with Spark. Here's what I am doing: sc.textFile('file.csv') .map(lambda line: (line.split(',')[0], line.split(',')[1])) .collect() I would expect this call to give me a list of…
Kernael • 3,270 • 4 • 22 • 42
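Rather than splitting lines by hand, the DataFrame CSV reader (Spark 2.0+) handles quoting, headers, and types; a minimal sketch, where the header and inferSchema settings are assumptions about the file:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# For Spark 1.x the external spark-csv package fills the same role.
df = spark.read.csv("file.csv", header=True, inferSchema=True)
pairs = df.select(df.columns[0], df.columns[1]).collect()
```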
120 votes, 15 answers

Join two data frames, select all columns from one and some columns from the other

Let's say I have a spark data frame df1, with several columns (among which the column id) and data frame df2 with two columns, id and other. Is there a way to replicate the following command: sqlContext.sql("SELECT df1.*, df2.other FROM df1 JOIN df2…
Francesco Sambo • 1,213 • 2 • 9 • 6
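A sketch of the DataFrame-API equivalent, assuming both frames carry an id column:

```python
# Equivalent of: SELECT df1.*, df2.other FROM df1 JOIN df2 ON df1.id = df2.id
joined = df1.join(df2, df1["id"] == df2["id"]).select(df1["*"], df2["other"])

# Joining on the column name instead also avoids a duplicated id column:
joined_by_name = df1.join(df2, "id")
```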
119 votes, 8 answers

How to fix 'TypeError: an integer is required (got type bytes)' error when trying to run pyspark after installing spark 2.4.4

I've installed OpenJDK 13.0.1, Python 3.8, and Spark 2.4.4. The instructions for testing the install are to run .\bin\pyspark from the root of the Spark installation. I'm not sure if I missed a step in the Spark installation, like setting some…
Chris • 1,195 • 2 • 7 • 7
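Spark 2.4.x predates Python 3.8, and its bundled pickling code fails there with exactly this TypeError; the usual fixes are to run the shell under Python 3.7 or move to Spark 3.x. A small diagnostic sketch, standard library only:

```python
import sys

# Spark 2.4.x supports Python up to 3.7; under 3.8+ its bundled cloudpickle
# raises "an integer is required (got type bytes)".
if sys.version_info >= (3, 8):
    print("Running Python %d.%d: use Spark 3.x, or point PYSPARK_PYTHON and "
          "PYSPARK_DRIVER_PYTHON at a Python 3.7 interpreter." % sys.version_info[:2])
```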
112 votes, 3 answers

pyspark dataframe filter or include based on list

I am trying to filter a dataframe in pyspark using a list. I want to either filter based on the list or include only those records with a value in the list. My code below does not work: # define a dataframe rdd = sc.parallelize([(0,1), (0,1),…
user3133475 • 2,951 • 3 • 13 • 11
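A sketch using Column.isin; the column name and the list contents are illustrative:

```python
from pyspark.sql import functions as F

allowed = [1, 2, 3]                                   # illustrative values
kept = df.where(F.col("value").isin(allowed))         # keep rows in the list
excluded = df.where(~F.col("value").isin(allowed))    # or keep rows not in it
```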
110 votes, 9 answers

Renaming columns for PySpark DataFrame aggregates

I am analysing some data with PySpark DataFrames. Suppose I have a DataFrame df that I am aggregating: (df.groupBy("group") .agg({"money":"sum"}) .show(100) ) This will give me: group SUM(money#2L) A …
cantdutchthis • 31,949 • 17 • 74 • 114
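One way to control the output name is to switch from the dict form of agg to the functions API and alias the result; a sketch on the same grouping:

```python
from pyspark.sql import functions as F

(df.groupBy("group")
   .agg(F.sum("money").alias("money_sum"))   # instead of SUM(money#2L)
   .show(100))
```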
107 votes, 11 answers

Spark Error - Unsupported class file major version

I'm trying to install Spark on my Mac. I've used Homebrew to install Spark 2.4.0 and Scala. I've installed PySpark in my Anaconda environment and am using PyCharm for development. I've exported to my bash profile: export SPARK_VERSION=`ls…
shbfy • 2,075 • 3 • 16 • 37
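This error typically shows up when Spark 2.4 runs on Java 11 or newer; pointing it back at a Java 8 JDK is the common fix. A sketch, with a placeholder JDK path:

```python
import os

# Spark 2.4.x expects Java 8; "Unsupported class file major version 55"
# usually means it is running on Java 11+. Placeholder Java 8 location:
os.environ["JAVA_HOME"] = "/Library/Java/JavaVirtualMachines/jdk1.8.0_202.jdk/Contents/Home"

from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
```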
107 votes, 5 answers

Split Spark dataframe string column into multiple columns

I've seen various people suggesting that Dataframe.explode is a useful way to do this, but it results in more rows than the original dataframe, which isn't what I want at all. I simply want to do the Dataframe equivalent of the very…
Peter Gaultney • 3,269 • 4 • 16 • 20
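A sketch using split plus getItem, which keeps one output row per input row (unlike explode); the column name and delimiter are assumptions:

```python
from pyspark.sql import functions as F

# Split the "raw" column on whitespace and fan the pieces out into columns.
parts = F.split(F.col("raw"), r"\s+")
df_split = (df
            .withColumn("first", parts.getItem(0))
            .withColumn("second", parts.getItem(1)))
```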
103 votes, 12 answers

How to find count of Null and Nan values for each column in a PySpark dataframe efficiently?

import numpy as np data = [ (1, 1, None), (1, 2, float(5)), (1, 3, np.nan), (1, 4, None), (1, 5, float(10)), (1, 6, float("nan")), (1, 6, float("nan")), ] df = spark.createDataFrame(data, ("session", "timestamp1",…
GeorgeOfTheRF • 8,244 • 23 • 57 • 80
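One common approach builds a single select with one conditional count per column, so the data is scanned once; a sketch over the example frame (isnan only applies to numeric columns, which is all of them here):

```python
from pyspark.sql import functions as F

counts = df.select([
    F.count(F.when(F.isnan(c) | F.col(c).isNull(), c)).alias(c)
    for c in df.columns
])
counts.show()
```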
101 votes, 14 answers

Is it possible to get the current spark context settings in PySpark?

I'm trying to get the path to spark.worker.dir for the current SparkContext. If I explicitly set it as a config param, I can read it back out of SparkConf, but is there any way to access the complete config (including all defaults) using PySpark?
whisperstream • 1,897 • 3 • 20 • 25
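The configuration the context is actually using can be read back from it; a short sketch (sc and spark are the usual context and session handles, and getAll only lists explicitly-set keys, not every built-in default):

```python
# Explicitly-set configuration as (key, value) pairs:
for key, value in sc.getConf().getAll():
    print(key, "=", value)

# In Spark 2.x+, runtime settings are also reachable through the session,
# e.g. spark.conf.get("spark.worker.dir"), which raises if the key is unset.
```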
98 votes, 10 answers

Removing duplicate columns after a DF join in Spark

When you join two DFs with similar column names: df = df1.join(df2, df1['id'] == df2['id']) Join works fine but you can't call the id column because it is ambiguous and you would get the following exception: pyspark.sql.utils.AnalysisException:…
thecheech • 2,041 • 3 • 18 • 25
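Two common ways around the ambiguity, sketched below: join on the column name so only one id survives, or keep the expression join and drop one side's copy afterwards:

```python
# Option 1: joining on the name yields a single id column.
joined = df1.join(df2, on="id")

# Option 2: keep the expression join, then drop df2's copy of id.
joined = df1.join(df2, df1["id"] == df2["id"]).drop(df2["id"])
```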