Questions tagged [pyspark]

The Spark Python API (PySpark) exposes the Apache Spark programming model to Python.

39058 questions
7 votes • 2 answers

Use case of spark.executor.allowSparkContext

I'm looking into spark-core and found an undocumented config, spark.executor.allowSparkContext, available since 3.0.1. I wasn't able to find any detail in the official Spark documentation. In the code there is a short description for this config: If set…
Hyun • 566
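
A minimal sketch of how such a flag is set; based only on the config name and the excerpt above, it appears to gate whether a SparkContext may be created in executor-side code:

```python
from pyspark.sql import SparkSession

# Sketch: spark.executor.allowSparkContext (Spark 3.0.1+) is set like any
# other Spark conf; per its in-code description it controls whether a
# SparkContext may be created on executors (e.g. from inside a UDF).
spark = (
    SparkSession.builder
    .appName("allow-spark-context-demo")
    .config("spark.executor.allowSparkContext", "false")
    .getOrCreate()
)
```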
7 votes • 1 answer

Debug PySpark on EMR using PyCharm

Does anyone have experience with debugging PySpark running on AWS EMR using PyCharm? I couldn't find any good guides or existing threads regarding this. I know how to debug Scala Spark with IntelliJ against EMR, but I have no experience with…
Ron F • 370
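
One common approach (not from the question) is PyCharm's remote debug server; a sketch assuming the matching pydevd-pycharm package is installed where the driver runs, and that the host/port placeholders are reachable from EMR:

```python
# Sketch: start a "Python Debug Server" run configuration in PyCharm first,
# then have the driver script connect back to it. Host and port below are
# placeholders; the pydevd-pycharm version must match your PyCharm build.
import pydevd_pycharm

pydevd_pycharm.settrace(
    "my-dev-machine.example.com",  # hypothetical host reachable from EMR
    port=5678,
    stdoutToServer=True,
    stderrToServer=True,
)

# ... PySpark driver code to debug runs here ...
```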
7 votes • 3 answers

No module named 'delta.tables'

I am getting the following error for the code below; please help: from delta.tables import * ModuleNotFoundError: No module named 'delta.tables' INFO SparkContext: Invoking stop() from shutdown hook Here is the code: ''' from…
RLT • 141
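
This error usually means the delta-spark Python package (and the matching Delta jars) are not wired into the session. A sketch of the documented Delta Lake setup, assuming `pip install delta-spark` with a version matching your Spark:

```python
import pyspark
from delta import configure_spark_with_delta_pip

# Sketch: configure_spark_with_delta_pip adds the Delta jars to the session;
# the two configs enable Delta's SQL extensions and catalog.
builder = (
    pyspark.sql.SparkSession.builder
    .appName("delta-demo")
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config(
        "spark.sql.catalog.spark_catalog",
        "org.apache.spark.sql.delta.catalog.DeltaCatalog",
    )
)
spark = configure_spark_with_delta_pip(builder).getOrCreate()

from delta.tables import DeltaTable  # should now import cleanly
```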
7 votes • 2 answers

AttributeError: 'DataFrame' object has no attribute '_data'

Azure Databricks execution error while parallelizing on a pandas dataframe. The code is able to create the RDD but breaks when performing .collect(). Setup: import pandas as pd # initialize list of lists data = [['tom', 10], ['nick', 15],…
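
A hedged sketch of the usual workaround: hand the pandas dataframe to spark.createDataFrame instead of parallelizing it like a plain Python collection (column names below are illustrative):

```python
import pandas as pd
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Sketch: a pandas DataFrame is not a plain collection of rows, so rather
# than sc.parallelize(pdf), convert it directly.
data = [["tom", 10], ["nick", 15]]  # initialize list of lists, as in the question
pdf = pd.DataFrame(data, columns=["name", "age"])

sdf = spark.createDataFrame(pdf)  # pandas -> Spark DataFrame
sdf.show()
```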
7 votes • 0 answers

Is there an Enum type in PySpark?

I just wondered if there is an EnumType in PySpark/Spark. I want to add constraints on StringTypes (or other types as well) so that only certain values are allowed in my DataFrame's schema.
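
There is no enum type in Spark's type system; a common stand-in (a sketch, not a schema-level constraint) is validating a StringType column against an allowed set:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([("red",), ("blue",), ("purple",)], ["color"])

# Sketch: emulate an "enum" by rejecting rows outside the allowed set;
# Spark itself will not enforce this in the schema.
ALLOWED = ["red", "green", "blue"]
bad = df.filter(~F.col("color").isin(ALLOWED))
if bad.count() > 0:
    raise ValueError("column 'color' contains values outside the allowed set")
```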
7 votes • 2 answers

Spark pool taking time to start in Azure Synapse Analytics

I have created 3 different notebooks using PySpark code in Azure Synapse Analytics. The notebooks run using a Spark pool, and there is only one Spark pool for all 3 notebooks. When these 3 notebooks run individually, the Spark pool starts for all 3 notebooks by…
kshitiz sinha • 113
7 votes • 1 answer

How to use the PySpark equivalent of reset_index() in pandas

I'd like to know the PySpark equivalent of the reset_index() command used in pandas. When using the default command (reset_index), as follows: data.reset_index() I get the error: "'DataFrame' object has no attribute 'reset_index'"
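
Spark DataFrames have no index, so there is no direct reset_index(); a common substitute (a sketch, with an assumed ordering column) is materializing a row number as an ordinary column:

```python
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.window import Window

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([("a", 1), ("b", 2)], ["key", "value"])

# Sketch: row_number needs an explicit ordering; pick whichever column
# defines the order you want the "index" to follow.
w = Window.orderBy("key")
df_with_index = df.withColumn("index", F.row_number().over(w) - 1)
df_with_index.show()
```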
7 votes • 1 answer

'Can not create a Path from an empty string' error for 'CREATE TABLE AS' in Hive using an S3 path

I am trying to create a table in the Glue catalog with an S3 path location from Spark running in EMR using Hive. I have tried the following commands, but I get the error: pyspark.sql.utils.AnalysisException: u'java.lang.IllegalArgumentException: Can not…
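
One hedged workaround for this error is giving the table an explicit LOCATION, since it tends to appear when neither the table nor its database has a storage path; database, table, and bucket names below are placeholders:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.enableHiveSupport().getOrCreate()

# Sketch: CTAS with an explicit S3 location (all names are placeholders).
spark.sql("""
    CREATE TABLE my_db.my_table
    USING PARQUET
    LOCATION 's3://my-bucket/warehouse/my_table/'
    AS SELECT * FROM my_db.source_table
""")
```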
7 votes • 1 answer

PySpark to_timestamp with timezone

I am trying to convert datetime strings with a timezone to timestamps using to_timestamp. Sample dataframe: df = spark.createDataFrame([("a", '2020-09-08 14:00:00.917+02:00'), ("b", '2020-09-08 14:00:00.900+01:00')], …
Christian Sloper • 7,440
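
A sketch of one way this parses in Spark 3, where the datetime pattern XXX matches an ISO-8601 offset like +02:00 (the result is rendered in the session time zone):

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
    [("a", "2020-09-08 14:00:00.917+02:00"),
     ("b", "2020-09-08 14:00:00.900+01:00")],
    ["id", "ts_str"],
)

# Sketch: "XXX" consumes the "+02:00"-style offset; the parsed timestamp is
# then displayed in spark.sql.session.timeZone.
df = df.withColumn("ts", F.to_timestamp("ts_str", "yyyy-MM-dd HH:mm:ss.SSSXXX"))
df.show(truncate=False)
```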
7 votes • 3 answers

How to throw an exception in Databricks?

I want my Databricks notebook to fail if a certain condition is satisfied. Right now I am using dbutils.notebook.exit(), but it does not cause the notebook to fail, and I get an email saying the notebook run was successful. How can I make my notebook fail?
Shubham Sahay • 113
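
A sketch of the usual answer: dbutils.notebook.exit() ends the run as successful, whereas raising an ordinary Python exception marks it failed:

```python
# Sketch: raising any uncaught exception fails the notebook run, which is
# what downstream job alerting keys off; the condition below is a placeholder.
condition_met = True

if condition_met:
    raise Exception("Failing the notebook: condition was satisfied")
```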
7 votes • 5 answers

NLTK "punkt" not found error when called from PySpark on Databricks

I would like to call NLTK to do some NLP on Databricks via PySpark. I have installed NLTK from the Libraries tab in Databricks, so it should be accessible from all nodes. My Python 3 code: import pyspark.sql.functions as F from pyspark.sql.types import…
user3448011 • 1,469
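
A hedged sketch of one common fix: the punkt data must exist on every worker that executes the UDF, not just the driver, so download it lazily inside executor-side code:

```python
import nltk
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.types import ArrayType, StringType

spark = SparkSession.builder.getOrCreate()

def tokenize(text):
    # Runs on executors; make sure punkt is present on this node.
    try:
        nltk.data.find("tokenizers/punkt")
    except LookupError:
        nltk.download("punkt")
    return nltk.word_tokenize(text)

tokenize_udf = F.udf(tokenize, ArrayType(StringType()))

df = spark.createDataFrame([("Hello world. How are you?",)], ["text"])
df.withColumn("tokens", tokenize_udf("text")).show(truncate=False)
```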
7 votes • 2 answers

Implementing a recursive algorithm in PySpark to find pairings within a dataframe

I have a Spark dataframe (prof_student_df) that lists student/professor pairs for a timestamp. There are 4 professors and 4 students for each timestamp, and each professor-student pair has a "score" (so there are 16 rows per time frame). For each time…
Lauren Leder • 276
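
Choosing the best one-to-one pairing per timestamp is the classic assignment problem; a sketch using applyInPandas with SciPy's Hungarian solver (all column names are assumptions based on the excerpt):

```python
import pandas as pd
from scipy.optimize import linear_sum_assignment
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Tiny stand-in for the question's prof_student_df (columns assumed).
prof_student_df = spark.createDataFrame(
    [("t1", "p1", "s1", 0.9), ("t1", "p1", "s2", 0.1),
     ("t1", "p2", "s1", 0.4), ("t1", "p2", "s2", 0.8)],
    ["timestamp", "professor", "student", "score"],
)

def best_pairing(pdf: pd.DataFrame) -> pd.DataFrame:
    # Pivot to a professor x student score matrix, then maximize total score.
    matrix = pdf.pivot(index="professor", columns="student", values="score")
    rows, cols = linear_sum_assignment(matrix.values, maximize=True)
    return pd.DataFrame({
        "timestamp": pdf["timestamp"].iloc[0],
        "professor": matrix.index[rows],
        "student": matrix.columns[cols],
        "score": matrix.values[rows, cols],
    })

result = prof_student_df.groupBy("timestamp").applyInPandas(
    best_pairing,
    schema="timestamp string, professor string, student string, score double",
)
result.show()
```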
7 votes • 4 answers

Getting an 'Exception thrown in awaitResult:' error when trying to copy a table from Glue to Redshift

I have been trying to copy a table from Glue over to one in Redshift. I created a job with the following code: import sys from awsglue.transforms import * from awsglue.utils import getResolvedOptions from pyspark.context import SparkContext from…
lotad • 99
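
Glue writes to Redshift stage data in S3 first, so a missing or unreachable redshift_tmp_dir is one hedged guess at the cause; connection and catalog names below are placeholders:

```python
import sys
from awsglue.context import GlueContext
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext

# Sketch: Glue jobs receive --TempDir by default; reuse it as the Redshift
# staging directory. All names below are placeholders.
args = getResolvedOptions(sys.argv, ["JOB_NAME", "TempDir"])
glueContext = GlueContext(SparkContext.getOrCreate())

dyf = glueContext.create_dynamic_frame.from_catalog(
    database="my_glue_db", table_name="my_table"
)
glueContext.write_dynamic_frame.from_jdbc_conf(
    frame=dyf,
    catalog_connection="my-redshift-connection",
    connection_options={"dbtable": "public.my_table", "database": "dev"},
    redshift_tmp_dir=args["TempDir"],
)
```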
7 votes • 3 answers

Fetch week start date and week end date from a date

I need to fetch the week start date and week end date from a given date, taking into account that the week starts on Sunday and ends on Saturday. I referred to this post, but it takes Monday as the starting day of the week. Is there any inbuilt function in…
ben • 1,404
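
A sketch using dayofweek(), which returns 1 for Sunday through 7 for Saturday, so the Sunday-based week start is the date minus (dayofweek - 1) days:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([("2021-01-06",)], ["d"]).withColumn("d", F.to_date("d"))

# Sketch: dayofweek = 1 on Sunday, so subtracting (dayofweek - 1) days lands
# on the preceding (or same) Sunday; the week end is six days later.
df = (
    df.withColumn("week_start", F.expr("date_sub(d, dayofweek(d) - 1)"))
      .withColumn("week_end", F.expr("date_add(week_start, 6)"))
)
df.show()
```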
7 votes • 1 answer

What is the difference between spark.table("TABLE A") and spark.read.table("TABLE A")

Question as the title: I am learning Spark SQL, but I can't get a good understanding of the difference between them. Thanks.
Sean • 87
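
For catalog tables the two are effectively the same call; a short sketch (the table name is a placeholder):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Sketch: both resolve the name through the catalog and return the same
# DataFrame; spark.read additionally exposes file-based readers.
df1 = spark.table("table_a")
df2 = spark.read.table("table_a")

# By contrast, spark.read.format(...).load(...) reads files directly:
df3 = spark.read.format("parquet").load("/path/to/data")
```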