Questions tagged [pyspark]

The Spark Python API (PySpark) exposes the Apache Spark programming model to Python.


39058 questions
58
votes
18 answers

Unable to infer schema when loading Parquet file

response = "mi_or_chd_5" outcome = sqlc.sql("""select eid,{response} as response from outcomes where {response} IS NOT NULL""".format(response=response)) outcome.write.parquet(response, mode="overwrite") # Success print…
user48956
  • 14,850
  • 19
  • 93
  • 154
58
votes
6 answers

_corrupt_record error when reading a JSON file into Spark

I've got this JSON file { "a": 1, "b": 2 } which was produced with Python's json.dump method. Now, I want to read this file into a DataFrame in Spark, using pyspark. Following the documentation, I'm doing this sc = SparkContext() sqlc =…
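When the JSON object is pretty-printed across several lines, Spark's default line-delimited reader flags it as _corrupt_record; the multiLine read option is the usual fix. A small sketch that recreates a multi-line file (the path and indentation are illustrative):

    import json

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("json-multiline-sketch").getOrCreate()

    # Recreate a pretty-printed file like the one in the question; indent makes
    # json.dump spread the object over several lines.
    with open("my_file.json", "w") as f:
        json.dump({"a": 1, "b": 2}, f, indent=2)

    # The default reader expects one JSON object per line and marks anything else
    # as _corrupt_record; multiLine parses the whole file as a single document.
    df = spark.read.option("multiLine", True).json("my_file.json")
    df.show()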
mar tin
  • 9,266
  • 23
  • 72
  • 97
58
votes
2 answers

'PipelinedRDD' object has no attribute 'toDF' in PySpark

I'm trying to load an SVM file and convert it to a DataFrame so I can use the ML module (Pipeline ML) from Spark. I've just installed a fresh Spark 1.5.0 on an Ubuntu 14.04 (no spark-env.sh configured). My my_script.py is: from pyspark.mllib.util…
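toDF is attached to RDDs only once a SQLContext or SparkSession exists, so creating one before calling toDF is the usual fix. A minimal sketch (the data and column names are made up):

    from pyspark.sql import SparkSession

    # toDF is patched onto RDDs only after a SparkSession (or, on Spark 1.x, a
    # SQLContext) has been created, so build it before converting the RDD.
    spark = SparkSession.builder.appName("todf-sketch").getOrCreate()
    sc = spark.sparkContext

    rdd = sc.parallelize([(1, "a"), (2, "b")])
    df = rdd.toDF(["id", "value"])
    df.show()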
Frederico Oliveira
  • 2,283
  • 3
  • 14
  • 10
57
votes
4 answers

Apply StringIndexer to several columns in a PySpark Dataframe

I have a PySpark dataframe +-------+--------------+----+----+ |address| date|name|food| +-------+--------------+----+----+ |1111111|20151122045510| Yin|gre | |1111111|20151122045501| Yin|gre | |1111111|20151122045500| Yln|gra…
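One common approach is to build one StringIndexer per column and chain them in a Pipeline; a sketch along those lines, using a cut-down version of the data above (the _indexed output suffix is arbitrary):

    from pyspark.ml import Pipeline
    from pyspark.ml.feature import StringIndexer
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("multi-indexer-sketch").getOrCreate()
    df = spark.createDataFrame(
        [("1111111", "20151122045510", "Yin", "gre"),
         ("1111111", "20151122045501", "Yln", "gra")],
        ["address", "date", "name", "food"],
    )

    # One StringIndexer per column, chained in a single Pipeline.
    cols_to_index = ["name", "food"]
    indexers = [StringIndexer(inputCol=c, outputCol=c + "_indexed") for c in cols_to_index]
    indexed = Pipeline(stages=indexers).fit(df).transform(df)
    indexed.show()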
Ivan
  • 19,560
  • 31
  • 97
  • 141
56
votes
13 answers

py4j.protocol.Py4JError: org.apache.spark.api.python.PythonUtils.getEncryptionEnabled does not exist in the JVM

I am currently on JRE 1.8.0_181, Python 3.6.4, and Spark 2.3.2. I am trying to execute the following code in Python: from pyspark.sql import SparkSession spark = SparkSession.builder.appName('Basics').getOrCreate() This fails with the following…
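This error typically means the pip-installed pyspark package and the Spark installation on SPARK_HOME/PYTHONPATH are different versions. Aside from aligning the two versions, a common workaround is to let the third-party findspark package wire the interpreter to a single Spark install; a sketch, assuming findspark is installed:

    # Workaround sketch: findspark points Python at one Spark installation so the
    # pyspark package and the JVM side come from the same version.
    import findspark
    findspark.init()  # or findspark.init("/path/to/spark") -- path is hypothetical

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName('Basics').getOrCreate()
    print(spark.version)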
bvkclear
  • 861
  • 1
  • 7
  • 6
56
votes
3 answers

Spark Window Functions - rangeBetween dates

I have a Spark SQL DataFrame with a date column, and I'm trying to get all the rows preceding the current row within a given date range. For example, I want all the rows from the 7 days preceding a given row. I figured out that I need to use a…
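rangeBetween needs a numeric ordering column, so the common pattern is to order by the date converted to epoch seconds and express the window bounds in seconds; a sketch with made-up data:

    from pyspark.sql import SparkSession, functions as F
    from pyspark.sql.window import Window

    spark = SparkSession.builder.appName("range-between-sketch").getOrCreate()
    df = spark.createDataFrame(
        [("a", "2015-01-01", 1.0), ("a", "2015-01-06", 2.0), ("a", "2015-01-10", 3.0)],
        ["id", "date", "value"],
    )

    # Order by the date cast to epoch seconds and express "7 days back" as
    # -7 * 86400 seconds.
    days = lambda n: n * 86400
    w = (
        Window.partitionBy("id")
        .orderBy(F.col("date").cast("timestamp").cast("long"))
        .rangeBetween(-days(7), 0)
    )
    df.withColumn("sum_last_7_days", F.sum("value").over(w)).show()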
Nhor
  • 3,860
  • 6
  • 28
  • 41
54
votes
6 answers

Filtering a Pyspark DataFrame with SQL-like IN clause

I want to filter a Pyspark DataFrame with a SQL-like IN clause, as in sc = SparkContext() sqlc = SQLContext(sc) df = sqlc.sql('SELECT * from my_df WHERE field1 IN a') where a is the tuple (1, 2, 3). I am getting this…
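Two usual ways around this: isin on the DataFrame API, or interpolating the tuple's literal values into the SQL string (the bare name a is a Python variable the SQL parser cannot see). A sketch with toy data:

    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.appName("in-clause-sketch").getOrCreate()
    df = spark.createDataFrame([(1,), (2,), (5,)], ["field1"])
    df.createOrReplaceTempView("my_df")

    a = (1, 2, 3)

    # DataFrame API: isin takes the values themselves.
    df.filter(F.col("field1").isin(list(a))).show()

    # SQL: interpolate the tuple's literal values into the query string.
    spark.sql("SELECT * FROM my_df WHERE field1 IN {0}".format(str(a))).show()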
mar tin
  • 9,266
  • 23
  • 72
  • 97
53
votes
3 answers

Custom delimiter csv reader spark

I would like to read a file with the following structure into Apache Spark. 628344092\t20070220\t200702\t2007\t2007.1370 The delimiter is \t. How can I implement this while using spark.read.csv()? The csv is much too big to use pandas because…
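spark.read.csv accepts a sep (delimiter) option, so a tab-separated file can be read directly; a sketch that first writes the sample record to a local file for the demo:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("tsv-sketch").getOrCreate()

    # Write the sample record from the question to a local file.
    with open("data.tsv", "w") as f:
        f.write("628344092\t20070220\t200702\t2007\t2007.1370\n")

    # sep tells the CSV reader to split on tabs instead of commas.
    df = spark.read.csv("data.tsv", sep="\t", inferSchema=True)
    df.show()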
inneb
  • 1,060
  • 1
  • 9
  • 20
53
votes
8 answers

How do I unit test PySpark programs?

My current Java/Spark Unit Test approach works (detailed here) by instantiating a SparkContext using "local" and running unit tests using JUnit. The code has to be organized to do I/O in one function and then call another with multiple RDDs. This…
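A common PySpark equivalent is a pytest fixture that builds a local SparkSession once and tests functions that take and return DataFrames; a sketch (the fixture and the function under test are made up):

    import pytest
    from pyspark.sql import SparkSession, functions as F


    @pytest.fixture(scope="session")
    def spark():
        # One local SparkSession shared by all tests in the session.
        session = SparkSession.builder.master("local[2]").appName("tests").getOrCreate()
        yield session
        session.stop()


    def double_column(df):
        # Function under test: adds a column that doubles "x".
        return df.withColumn("x2", F.col("x") * 2)


    def test_double_column(spark):
        df = spark.createDataFrame([(1,), (2,)], ["x"])
        assert [r.x2 for r in double_column(df).collect()] == [2, 4]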
pettinato
  • 1,472
  • 2
  • 19
  • 39
53
votes
3 answers

Filtering DataFrame using the length of a column

I want to filter a DataFrame using a condition on the length of a column. This question might be very easy, but I didn't find any related question on SO. More specifically, I have a DataFrame with only one column, which of…
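pyspark.sql.functions.length can be used directly inside filter; a sketch with a made-up single-column DataFrame:

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import col, length

    spark = SparkSession.builder.appName("length-filter-sketch").getOrCreate()
    df = spark.createDataFrame([("abc",), ("abcdef",)], ["col1"])

    # length() returns the string length of the column and works inside filter().
    df.filter(length(col("col1")) > 3).show()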
Alberto Bonsanto
  • 17,556
  • 10
  • 64
  • 93
52
votes
7 answers

AttributeError: Can't get attribute 'new_block' on

I was using pyspark on AWS EMR (4 r5.xlarge as 4 workers, each with one executor and 4 cores), and I got AttributeError: Can't get attribute 'new_block' on
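This error usually points at a pandas version mismatch between the driver and the executors: objects pickled by one pandas version fail to unpickle under another. A quick diagnostic sketch that compares the two sides:

    import pandas as pd
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("pandas-version-check").getOrCreate()
    sc = spark.sparkContext

    # Compare the pandas version seen by the driver with the one on the executors.
    print("driver pandas:", pd.__version__)

    def executor_pandas_version(_):
        import pandas
        return [pandas.__version__]

    print("executor pandas:",
          set(sc.parallelize(range(4), 4).mapPartitions(executor_pandas_version).collect()))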
Denielll
  • 721
  • 1
  • 5
  • 11
52
votes
3 answers

Un-persisting all dataframes in (py)spark

I have a Spark application with several points where I would like to persist the current state. This is usually after a large step, or when caching a state that I would like to use multiple times. It appears that when I call cache on my dataframe a second…
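Two common options: spark.catalog.clearCache() drops every cached table and DataFrame in one call, and the JVM-side map of persistent RDDs can be walked to unpersist them one by one (note this goes through the private _jsc handle). A sketch:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("unpersist-sketch").getOrCreate()

    df = spark.range(10).cache()
    df.count()  # materialise the cache

    # Option 1: drop every cached table/DataFrame in one call.
    spark.catalog.clearCache()

    # Option 2: walk the JVM-side map of persistent RDDs and unpersist each one.
    for rdd in spark.sparkContext._jsc.getPersistentRDDs().values():
        rdd.unpersist()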
bjack3
  • 991
  • 2
  • 7
  • 14
52
votes
4 answers

How to exclude multiple columns in Spark dataframe in Python

I found PySpark has a method called drop but it seems it can only drop one column at a time. Any ideas about how to drop multiple columns at the same time? df.drop(['col1','col2']) TypeError Traceback (most recent…
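drop takes the column names as separate arguments rather than a list, so unpacking the list (or, on old versions where drop accepted only one column, selecting the complement) is the usual fix; a sketch with toy columns:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("drop-columns-sketch").getOrCreate()
    df = spark.createDataFrame([(1, 2, 3)], ["col1", "col2", "col3"])

    # drop takes names as separate arguments, so unpack the list ...
    df.drop(*["col1", "col2"]).show()

    # ... or, where drop only accepts a single column, select the complement.
    df.select([c for c in df.columns if c not in {"col1", "col2"}]).show()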
MYjx
  • 4,157
  • 9
  • 38
  • 53
52
votes
4 answers

Column alias after groupBy in pyspark

I need the resulting data frame in the line below to have an alias name "maxDiff" for the max('diff') column after groupBy. However, the line below does not make any change, nor does it throw an error. grpdf =…
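The usual fix is to alias the aggregate expression itself inside agg(); a sketch with made-up data and the column names borrowed from the excerpt:

    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.appName("groupby-alias-sketch").getOrCreate()
    df = spark.createDataFrame([("a", 1), ("a", 3), ("b", 2)], ["colname", "diff"])

    # Alias the aggregate expression inside agg() instead of renaming afterwards.
    grpdf = df.groupBy("colname").agg(F.max("diff").alias("maxDiff"))
    grpdf.show()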
mhn
  • 2,660
  • 5
  • 31
  • 51
52
votes
10 answers

Reduce a key-value pair into a key-list pair with Apache Spark

I am writing a Spark application and want to combine a set of Key-Value pairs (K, V1), (K, V2), ..., (K, Vn) into one Key-Multivalue pair (K, [V1, V2, ..., Vn]). I feel like I should be able to do this using the reduceByKey function with something…
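groupByKey gives the key-to-list shape directly, and aggregateByKey builds the same lists with combiners, avoiding the many intermediate lists that reduceByKey with list concatenation would allocate. A sketch of both on toy pairs:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("key-list-sketch").getOrCreate()
    sc = spark.sparkContext

    pairs = sc.parallelize([("K", 1), ("K", 2), ("K", 3), ("J", 9)])

    # groupByKey collects all values for a key into one iterable ...
    print(pairs.groupByKey().mapValues(list).collect())

    # ... and aggregateByKey builds the same lists with per-partition combiners.
    print(pairs.aggregateByKey([], lambda acc, v: acc + [v],
                               lambda a, b: a + b).collect())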
TravisJ
  • 1,592
  • 1
  • 21
  • 37