Questions tagged [pyspark]

The Spark Python API (PySpark) exposes the Apache Spark programming model to Python.


39058 questions
58
votes
18 answers

Unable to infer schema when loading Parquet file

response = "mi_or_chd_5" outcome = sqlc.sql("""select eid,{response} as response from outcomes where {response} IS NOT NULL""".format(response=response)) outcome.write.parquet(response, mode="overwrite") # Success print…
user48956
  • 14,850
  • 19
  • 93
  • 154
58
votes
6 answers

_corrupt_record error when reading a JSON file into Spark

I've got this JSON file { "a": 1, "b": 2 } which was produced with Python's json.dump method. Now, I want to read this file into a DataFrame in Spark, using pyspark. Following the documentation, I'm doing this sc = SparkContext() sqlc =…
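When the JSON object is pretty-printed across several lines, Spark's default line-delimited reader flags it as _corrupt_record; the multiLine read option is the usual fix. A small sketch that recreates a multi-line file (the path and indentation are illustrative):

    import json

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("json-multiline-sketch").getOrCreate()

    # Recreate a pretty-printed file like the one in the question; indent makes
    # json.dump spread the object over several lines.
    with open("my_file.json", "w") as f:
        json.dump({"a": 1, "b": 2}, f, indent=2)

    # The default reader expects one JSON object per line and marks anything else
    # as _corrupt_record; multiLine parses the whole file as a single document.
    df = spark.read.option("multiLine", True).json("my_file.json")
    df.show()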
mar tin
  • 9,266
  • 23
  • 72
  • 97
58
votes
2 answers

'PipelinedRDD' object has no attribute 'toDF' in PySpark

I'm trying to load an SVM file and convert it to a DataFrame so I can use the ML module (Pipeline ML) from Spark. I've just installed a fresh Spark 1.5.0 on an Ubuntu 14.04 (no spark-env.sh configured). My my_script.py is: from pyspark.mllib.util…
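toDF is attached to RDDs only once a SQLContext or SparkSession exists, so creating one before calling toDF is the usual fix. A minimal sketch (the data and column names are made up):

    from pyspark.sql import SparkSession

    # toDF is patched onto RDDs only after a SparkSession (or, on Spark 1.x, a
    # SQLContext) has been created, so build it before converting the RDD.
    spark = SparkSession.builder.appName("todf-sketch").getOrCreate()
    sc = spark.sparkContext

    rdd = sc.parallelize([(1, "a"), (2, "b")])
    df = rdd.toDF(["id", "value"])
    df.show()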
Frederico Oliveira
  • 2,283
  • 3
  • 14
  • 10
57
votes
4 answers

Apply StringIndexer to several columns in a PySpark Dataframe

I have a PySpark dataframe +-------+--------------+----+----+ |address| date|name|food| +-------+--------------+----+----+ |1111111|20151122045510| Yin|gre | |1111111|20151122045501| Yin|gre | |1111111|20151122045500| Yln|gra…
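One common approach is to build one StringIndexer per column and chain them in a Pipeline; a sketch along those lines, using a cut-down version of the data above (the _indexed output suffix is arbitrary):

    from pyspark.ml import Pipeline
    from pyspark.ml.feature import StringIndexer
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("multi-indexer-sketch").getOrCreate()
    df = spark.createDataFrame(
        [("1111111", "20151122045510", "Yin", "gre"),
         ("1111111", "20151122045501", "Yln", "gra")],
        ["address", "date", "name", "food"],
    )

    # One StringIndexer per column, chained in a single Pipeline.
    cols_to_index = ["name", "food"]
    indexers = [StringIndexer(inputCol=c, outputCol=c + "_indexed") for c in cols_to_index]
    indexed = Pipeline(stages=indexers).fit(df).transform(df)
    indexed.show()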
Ivan
  • 19,560
  • 31
  • 97
  • 141
56
votes
13 answers

py4j.protocol.Py4JError: org.apache.spark.api.python.PythonUtils.getEncryptionEnabled does not exist in the JVM

I am currently on JRE 1.8.0_181, Python 3.6.4, and Spark 2.3.2. I am trying to execute the following code in Python: from pyspark.sql import SparkSession spark = SparkSession.builder.appName('Basics').getOrCreate() This fails with the following…
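This error typically means the pip-installed pyspark package and the Spark installation on SPARK_HOME/PYTHONPATH are different versions. Aside from aligning the two versions, a common workaround is to let the third-party findspark package wire the interpreter to a single Spark install; a sketch, assuming findspark is installed:

    # Workaround sketch: findspark points Python at one Spark installation so the
    # pyspark package and the JVM side come from the same version.
    import findspark
    findspark.init()  # or findspark.init("/path/to/spark") -- path is hypothetical

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName('Basics').getOrCreate()
    print(spark.version)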
bvkclear
  • 861
  • 1
  • 7
  • 6
56
votes
3 answers

Spark Window Functions - rangeBetween dates

I have a Spark SQL DataFrame with a date column, and I'm trying to get all the rows preceding the current row within a given date range. For example, I want all the rows from the 7 days preceding a given row. I figured out that I need to use a…
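rangeBetween needs a numeric ordering column, so the common pattern is to order by the date converted to epoch seconds and express the window bounds in seconds; a sketch with made-up data:

    from pyspark.sql import SparkSession, functions as F
    from pyspark.sql.window import Window

    spark = SparkSession.builder.appName("range-between-sketch").getOrCreate()
    df = spark.createDataFrame(
        [("a", "2015-01-01", 1.0), ("a", "2015-01-06", 2.0), ("a", "2015-01-10", 3.0)],
        ["id", "date", "value"],
    )

    # Order by the date cast to epoch seconds and express "7 days back" as
    # -7 * 86400 seconds.
    days = lambda n: n * 86400
    w = (
        Window.partitionBy("id")
        .orderBy(F.col("date").cast("timestamp").cast("long"))
        .rangeBetween(-days(7), 0)
    )
    df.withColumn("sum_last_7_days", F.sum("value").over(w)).show()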
Nhor
  • 3,860
  • 6
  • 28
  • 41
54
votes
6 answers

Filtering a Pyspark DataFrame with SQL-like IN clause

I want to filter a Pyspark DataFrame with a SQL-like IN clause, as in sc = SparkContext() sqlc = SQLContext(sc) df = sqlc.sql('SELECT * from my_df WHERE field1 IN a') where a is the tuple (1, 2, 3). I am getting this…
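Two usual ways around this: isin on the DataFrame API, or interpolating the tuple's literal values into the SQL string (the bare name a is a Python variable the SQL parser cannot see). A sketch with toy data:

    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.appName("in-clause-sketch").getOrCreate()
    df = spark.createDataFrame([(1,), (2,), (5,)], ["field1"])
    df.createOrReplaceTempView("my_df")

    a = (1, 2, 3)

    # DataFrame API: isin takes the values themselves.
    df.filter(F.col("field1").isin(list(a))).show()

    # SQL: interpolate the tuple's literal values into the query string.
    spark.sql("SELECT * FROM my_df WHERE field1 IN {0}".format(str(a))).show()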
mar tin
  • 9,266
  • 23
  • 72
  • 97
53
votes
3 answers

Custom delimiter csv reader spark

I would like to read a file with the following structure into Apache Spark. 628344092\t20070220\t200702\t2007\t2007.1370 The delimiter is \t. How can I implement this while using spark.read.csv()? The csv is much too big to use pandas because…
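spark.read.csv accepts a sep (delimiter) option, so a tab-separated file can be read directly; a sketch that first writes the sample record to a local file for the demo:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("tsv-sketch").getOrCreate()

    # Write the sample record from the question to a local file.
    with open("data.tsv", "w") as f:
        f.write("628344092\t20070220\t200702\t2007\t2007.1370\n")

    # sep tells the CSV reader to split on tabs instead of commas.
    df = spark.read.csv("data.tsv", sep="\t", inferSchema=True)
    df.show()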
inneb
  • 1,060
  • 1
  • 9
  • 20
53
votes
8 answers

How do I unit test PySpark programs?

My current Java/Spark Unit Test approach works (detailed here) by instantiating a SparkContext using "local" and running unit tests using JUnit. The code has to be organized to do I/O in one function and then call another with multiple RDDs. This…
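A common PySpark equivalent is a pytest fixture that builds a local SparkSession once and tests functions that take and return DataFrames; a sketch (the fixture and the function under test are made up):

    import pytest
    from pyspark.sql import SparkSession, functions as F


    @pytest.fixture(scope="session")
    def spark():
        # One local SparkSession shared by all tests in the session.
        session = SparkSession.builder.master("local[2]").appName("tests").getOrCreate()
        yield session
        session.stop()


    def double_column(df):
        # Function under test: adds a column that doubles "x".
        return df.withColumn("x2", F.col("x") * 2)


    def test_double_column(spark):
        df = spark.createDataFrame([(1,), (2,)], ["x"])
        assert [r.x2 for r in double_column(df).collect()] == [2, 4]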
pettinato
  • 1,472
  • 2
  • 19
  • 39
53
votes
3 answers

Filtering DataFrame using the length of a column

I want to filter a DataFrame using a condition on the length of a column. This question might be very easy, but I didn't find any related question on SO. More specifically, I have a DataFrame with only one column, which of…
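pyspark.sql.functions.length can be used directly inside filter; a sketch with a made-up single-column DataFrame:

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import col, length

    spark = SparkSession.builder.appName("length-filter-sketch").getOrCreate()
    df = spark.createDataFrame([("abc",), ("abcdef",)], ["col1"])

    # length() returns the string length of the column and works inside filter().
    df.filter(length(col("col1")) > 3).show()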
Alberto Bonsanto
  • 17,556
  • 10
  • 64
  • 93
52
votes
7 answers

AttributeError: Can't get attribute 'new_block' on

I was using pyspark on AWS EMR (4 r5.xlarge as 4 workers, each with one executor and 4 cores), and I got AttributeError: Can't get attribute 'new_block' on
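This error usually points at a pandas version mismatch between the driver and the executors: objects pickled by one pandas version fail to unpickle under another. A quick diagnostic sketch that compares the two sides:

    import pandas as pd
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("pandas-version-check").getOrCreate()
    sc = spark.sparkContext

    # Compare the pandas version seen by the driver with the one on the executors.
    print("driver pandas:", pd.__version__)

    def executor_pandas_version(_):
        import pandas
        return [pandas.__version__]

    print("executor pandas:",
          set(sc.parallelize(range(4), 4).mapPartitions(executor_pandas_version).collect()))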
Denielll
  • 721
  • 1
  • 5
  • 11
52
votes
3 answers

Un-persisting all dataframes in (py)spark

I have a Spark application with several points where I would like to persist the current state. This is usually after a large step, or when caching a state that I would like to use multiple times. It appears that when I call cache on my dataframe a second…
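Two common options: spark.catalog.clearCache() drops every cached table and DataFrame in one call, and the JVM-side map of persistent RDDs can be walked to unpersist them one by one (note this goes through the private _jsc handle). A sketch:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("unpersist-sketch").getOrCreate()

    df = spark.range(10).cache()
    df.count()  # materialise the cache

    # Option 1: drop every cached table/DataFrame in one call.
    spark.catalog.clearCache()

    # Option 2: walk the JVM-side map of persistent RDDs and unpersist each one.
    for rdd in spark.sparkContext._jsc.getPersistentRDDs().values():
        rdd.unpersist()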
bjack3
  • 991
  • 2
  • 7
  • 14
52
votes
4 answers

How to exclude multiple columns in Spark dataframe in Python

I found PySpark has a method called drop but it seems it can only drop one column at a time. Any ideas about how to drop multiple columns at the same time? df.drop(['col1','col2']) TypeError Traceback (most recent…
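drop takes the column names as separate arguments rather than a list, so unpacking the list (or, on old versions where drop accepted only one column, selecting the complement) is the usual fix; a sketch with toy columns:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("drop-columns-sketch").getOrCreate()
    df = spark.createDataFrame([(1, 2, 3)], ["col1", "col2", "col3"])

    # drop takes names as separate arguments, so unpack the list ...
    df.drop(*["col1", "col2"]).show()

    # ... or, where drop only accepts a single column, select the complement.
    df.select([c for c in df.columns if c not in {"col1", "col2"}]).show()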
MYjx
  • 4,157
  • 9
  • 38
  • 53
52
votes
4 answers

Column alias after groupBy in pyspark

I need the resulting data frame in the line below to have an alias name "maxDiff" for the max('diff') column after groupBy. However, the line below does not make any change, nor does it throw an error. grpdf =…
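The usual fix is to alias the aggregate expression itself inside agg(); a sketch with made-up data and the column names borrowed from the excerpt:

    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.appName("groupby-alias-sketch").getOrCreate()
    df = spark.createDataFrame([("a", 1), ("a", 3), ("b", 2)], ["colname", "diff"])

    # Alias the aggregate expression inside agg() instead of renaming afterwards.
    grpdf = df.groupBy("colname").agg(F.max("diff").alias("maxDiff"))
    grpdf.show()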
mhn
  • 2,660
  • 5
  • 31
  • 51
52
votes
10 answers

Reduce a key-value pair into a key-list pair with Apache Spark

I am writing a Spark application and want to combine a set of Key-Value pairs (K, V1), (K, V2), ..., (K, Vn) into one Key-Multivalue pair (K, [V1, V2, ..., Vn]). I feel like I should be able to do this using the reduceByKey function with something…
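groupByKey gives the key-to-list shape directly, and aggregateByKey builds the same lists with combiners, avoiding the many intermediate lists that reduceByKey with list concatenation would allocate. A sketch of both on toy pairs:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("key-list-sketch").getOrCreate()
    sc = spark.sparkContext

    pairs = sc.parallelize([("K", 1), ("K", 2), ("K", 3), ("J", 9)])

    # groupByKey collects all values for a key into one iterable ...
    print(pairs.groupByKey().mapValues(list).collect())

    # ... and aggregateByKey builds the same lists with per-partition combiners.
    print(pairs.aggregateByKey([], lambda acc, v: acc + [v],
                               lambda a, b: a + b).collect())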
TravisJ
  • 1,592
  • 1
  • 21
  • 37