Questions tagged [pyspark]

The Spark Python API (PySpark) exposes the Apache Spark programming model to Python.

39058 questions
7 votes • 2 answers

Use case of spark.executor.allowSparkContext

I'm looking into spark-core and found an undocumented config, spark.executor.allowSparkContext, available since 3.0.1. I wasn't able to find any detail in the official Spark documentation. In the code there is a short description for this config: If set…
Hyun • 566
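
A minimal sketch of how such a flag is set; based only on the config name and the excerpt above, it appears to gate whether a SparkContext may be created in executor-side code:

```python
from pyspark.sql import SparkSession

# Sketch: spark.executor.allowSparkContext (Spark 3.0.1+) is set like any
# other Spark conf; per its in-code description it controls whether a
# SparkContext may be created on executors (e.g. from inside a UDF).
spark = (
    SparkSession.builder
    .appName("allow-spark-context-demo")
    .config("spark.executor.allowSparkContext", "false")
    .getOrCreate()
)
```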
7 votes • 1 answer

Debug PySpark on EMR using PyCharm

Does anyone have experience with debugging PySpark running on AWS EMR using PyCharm? I couldn't find any good guides or existing threads regarding this. I know how to debug Scala Spark with IntelliJ against EMR, but I have no experience with…
Ron F • 370
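
One common approach (not from the question) is PyCharm's remote debug server; a sketch assuming the matching pydevd-pycharm package is installed where the driver runs, and that the host/port placeholders are reachable from EMR:

```python
# Sketch: start a "Python Debug Server" run configuration in PyCharm first,
# then have the driver script connect back to it. Host and port below are
# placeholders; the pydevd-pycharm version must match your PyCharm build.
import pydevd_pycharm

pydevd_pycharm.settrace(
    "my-dev-machine.example.com",  # hypothetical host reachable from EMR
    port=5678,
    stdoutToServer=True,
    stderrToServer=True,
)

# ... PySpark driver code to debug runs here ...
```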
7 votes • 3 answers

No module named 'delta.tables'

I am getting the following error for the code below; please help: from delta.tables import * ModuleNotFoundError: No module named 'delta.tables' INFO SparkContext: Invoking stop() from shutdown hook Here is the code: ''' from…
RLT • 141
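
This error usually means the delta-spark Python package (and the matching Delta jars) are not wired into the session. A sketch of the documented Delta Lake setup, assuming `pip install delta-spark` with a version matching your Spark:

```python
import pyspark
from delta import configure_spark_with_delta_pip

# Sketch: configure_spark_with_delta_pip adds the Delta jars to the session;
# the two configs enable Delta's SQL extensions and catalog.
builder = (
    pyspark.sql.SparkSession.builder
    .appName("delta-demo")
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config(
        "spark.sql.catalog.spark_catalog",
        "org.apache.spark.sql.delta.catalog.DeltaCatalog",
    )
)
spark = configure_spark_with_delta_pip(builder).getOrCreate()

from delta.tables import DeltaTable  # should now import cleanly
```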
7 votes • 2 answers

AttributeError: 'DataFrame' object has no attribute '_data'

Azure Databricks execution error while parallelizing on a pandas dataframe. The code is able to create the RDD but breaks when performing .collect(). Setup: import pandas as pd # initialize list of lists data = [['tom', 10], ['nick', 15],…
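
A hedged sketch of the usual workaround: hand the pandas dataframe to spark.createDataFrame instead of parallelizing it like a plain Python collection (column names below are illustrative):

```python
import pandas as pd
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Sketch: a pandas DataFrame is not a plain collection of rows, so rather
# than sc.parallelize(pdf), convert it directly.
data = [["tom", 10], ["nick", 15]]  # initialize list of lists, as in the question
pdf = pd.DataFrame(data, columns=["name", "age"])

sdf = spark.createDataFrame(pdf)  # pandas -> Spark DataFrame
sdf.show()
```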
7 votes • 0 answers

Is there an Enum type in PySpark?

I just wondered if there is an EnumType in PySpark/Spark. I want to add constraints on StringTypes (or other types as well) so that only certain values are allowed in my DataFrame's schema.
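
There is no enum type in Spark's type system; a common stand-in (a sketch, not a schema-level constraint) is validating a StringType column against an allowed set:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([("red",), ("blue",), ("purple",)], ["color"])

# Sketch: emulate an "enum" by rejecting rows outside the allowed set;
# Spark itself will not enforce this in the schema.
ALLOWED = ["red", "green", "blue"]
bad = df.filter(~F.col("color").isin(ALLOWED))
if bad.count() > 0:
    raise ValueError("column 'color' contains values outside the allowed set")
```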
7 votes • 2 answers

Spark pool taking time to start in Azure Synapse Analytics

I have created 3 different notebooks using PySpark code in Azure Synapse Analytics. The notebooks run using a Spark pool, and there is only one Spark pool for all 3 notebooks. When these 3 notebooks run individually, the Spark pool starts for all 3 notebooks by…
kshitiz sinha • 113
7 votes • 1 answer

How to use the PySpark equivalent of reset_index() in pandas

I'd like to know the PySpark equivalent of the reset_index() command used in pandas. When using the default command (reset_index), as follows: data.reset_index() I get the error: "'DataFrame' object has no attribute 'reset_index'"
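
Spark DataFrames have no index, so there is no direct reset_index(); a common substitute (a sketch, with an assumed ordering column) is materializing a row number as an ordinary column:

```python
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.window import Window

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([("a", 1), ("b", 2)], ["key", "value"])

# Sketch: row_number needs an explicit ordering; pick whichever column
# defines the order you want the "index" to follow.
w = Window.orderBy("key")
df_with_index = df.withColumn("index", F.row_number().over(w) - 1)
df_with_index.show()
```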
7 votes • 1 answer

'Can not create a Path from an empty string' error for 'CREATE TABLE AS' in Hive using an S3 path

I am trying to create a table in the Glue catalog with an S3 path location from Spark running in EMR using Hive. I have tried the following commands, but I get the error: pyspark.sql.utils.AnalysisException: u'java.lang.IllegalArgumentException: Can not…
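
One hedged workaround for this error is giving the table an explicit LOCATION, since it tends to appear when neither the table nor its database has a storage path; database, table, and bucket names below are placeholders:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.enableHiveSupport().getOrCreate()

# Sketch: CTAS with an explicit S3 location (all names are placeholders).
spark.sql("""
    CREATE TABLE my_db.my_table
    USING PARQUET
    LOCATION 's3://my-bucket/warehouse/my_table/'
    AS SELECT * FROM my_db.source_table
""")
```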
7 votes • 1 answer

PySpark to_timestamp with timezone

I am trying to convert datetime strings with a timezone to timestamps using to_timestamp. Sample dataframe: df = spark.createDataFrame([("a", '2020-09-08 14:00:00.917+02:00'), ("b", '2020-09-08 14:00:00.900+01:00')], …
Christian Sloper • 7,440
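
A sketch of one way this parses in Spark 3, where the datetime pattern XXX matches an ISO-8601 offset like +02:00 (the result is rendered in the session time zone):

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
    [("a", "2020-09-08 14:00:00.917+02:00"),
     ("b", "2020-09-08 14:00:00.900+01:00")],
    ["id", "ts_str"],
)

# Sketch: "XXX" consumes the "+02:00"-style offset; the parsed timestamp is
# then displayed in spark.sql.session.timeZone.
df = df.withColumn("ts", F.to_timestamp("ts_str", "yyyy-MM-dd HH:mm:ss.SSSXXX"))
df.show(truncate=False)
```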
7 votes • 3 answers

How to throw an exception in Databricks?

I want my Databricks notebook to fail if a certain condition is satisfied. Right now I am using dbutils.notebook.exit(), but it does not cause the notebook to fail, and I get an email saying the notebook run was successful. How can I make my notebook fail?
Shubham Sahay • 113
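
A sketch of the usual answer: dbutils.notebook.exit() ends the run as successful, whereas raising an ordinary Python exception marks it failed:

```python
# Sketch: raising any uncaught exception fails the notebook run, which is
# what downstream job alerting keys off; the condition below is a placeholder.
condition_met = True

if condition_met:
    raise Exception("Failing the notebook: condition was satisfied")
```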
7 votes • 5 answers

NLTK "punkt" not found error when called from PySpark on Databricks

I would like to call NLTK to do some NLP on Databricks via PySpark. I have installed NLTK from the Libraries tab in Databricks, so it should be accessible from all nodes. My Python 3 code: import pyspark.sql.functions as F from pyspark.sql.types import…
user3448011 • 1,469
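
A hedged sketch of one common fix: the punkt data must exist on every worker that executes the UDF, not just the driver, so download it lazily inside executor-side code:

```python
import nltk
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.types import ArrayType, StringType

spark = SparkSession.builder.getOrCreate()

def tokenize(text):
    # Runs on executors; make sure punkt is present on this node.
    try:
        nltk.data.find("tokenizers/punkt")
    except LookupError:
        nltk.download("punkt")
    return nltk.word_tokenize(text)

tokenize_udf = F.udf(tokenize, ArrayType(StringType()))

df = spark.createDataFrame([("Hello world. How are you?",)], ["text"])
df.withColumn("tokens", tokenize_udf("text")).show(truncate=False)
```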
7 votes • 2 answers

Implementing a recursive algorithm in PySpark to find pairings within a dataframe

I have a Spark dataframe (prof_student_df) that lists student/professor pairs for a timestamp. There are 4 professors and 4 students for each timestamp, and each professor-student pair has a "score" (so there are 16 rows per time frame). For each time…
Lauren Leder • 276
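
Choosing the best one-to-one pairing per timestamp is the classic assignment problem; a sketch using applyInPandas with SciPy's Hungarian solver (all column names are assumptions based on the excerpt):

```python
import pandas as pd
from scipy.optimize import linear_sum_assignment
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Tiny stand-in for the question's prof_student_df (columns assumed).
prof_student_df = spark.createDataFrame(
    [("t1", "p1", "s1", 0.9), ("t1", "p1", "s2", 0.1),
     ("t1", "p2", "s1", 0.4), ("t1", "p2", "s2", 0.8)],
    ["timestamp", "professor", "student", "score"],
)

def best_pairing(pdf: pd.DataFrame) -> pd.DataFrame:
    # Pivot to a professor x student score matrix, then maximize total score.
    matrix = pdf.pivot(index="professor", columns="student", values="score")
    rows, cols = linear_sum_assignment(matrix.values, maximize=True)
    return pd.DataFrame({
        "timestamp": pdf["timestamp"].iloc[0],
        "professor": matrix.index[rows],
        "student": matrix.columns[cols],
        "score": matrix.values[rows, cols],
    })

result = prof_student_df.groupBy("timestamp").applyInPandas(
    best_pairing,
    schema="timestamp string, professor string, student string, score double",
)
result.show()
```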
7 votes • 4 answers

Getting an 'Exception thrown in awaitResult:' error when trying to copy a table from Glue to Redshift

I have been trying to copy a table from Glue over to one in Redshift. I created a job with the following code: import sys from awsglue.transforms import * from awsglue.utils import getResolvedOptions from pyspark.context import SparkContext from…
lotad • 99
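
Glue writes to Redshift stage data in S3 first, so a missing or unreachable redshift_tmp_dir is one hedged guess at the cause; connection and catalog names below are placeholders:

```python
import sys
from awsglue.context import GlueContext
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext

# Sketch: Glue jobs receive --TempDir by default; reuse it as the Redshift
# staging directory. All names below are placeholders.
args = getResolvedOptions(sys.argv, ["JOB_NAME", "TempDir"])
glueContext = GlueContext(SparkContext.getOrCreate())

dyf = glueContext.create_dynamic_frame.from_catalog(
    database="my_glue_db", table_name="my_table"
)
glueContext.write_dynamic_frame.from_jdbc_conf(
    frame=dyf,
    catalog_connection="my-redshift-connection",
    connection_options={"dbtable": "public.my_table", "database": "dev"},
    redshift_tmp_dir=args["TempDir"],
)
```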
7 votes • 3 answers

Fetch week start date and week end date from a date

I need to fetch the week start date and week end date from a given date, taking into account that the week starts on Sunday and ends on Saturday. I referred to this post, but it takes Monday as the starting day of the week. Is there any inbuilt function in…
ben • 1,404
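
A sketch using dayofweek(), which returns 1 for Sunday through 7 for Saturday, so the Sunday-based week start is the date minus (dayofweek - 1) days:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([("2021-01-06",)], ["d"]).withColumn("d", F.to_date("d"))

# Sketch: dayofweek = 1 on Sunday, so subtracting (dayofweek - 1) days lands
# on the preceding (or same) Sunday; the week end is six days later.
df = (
    df.withColumn("week_start", F.expr("date_sub(d, dayofweek(d) - 1)"))
      .withColumn("week_end", F.expr("date_add(week_start, 6)"))
)
df.show()
```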
7 votes • 1 answer

What is the difference between spark.table("TABLE A") and spark.read.table("TABLE A")

Question as the title: I am learning Spark SQL, but I can't get a good understanding of the difference between them. Thanks.
Sean • 87
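
For catalog tables the two are effectively the same call; a short sketch (the table name is a placeholder):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Sketch: both resolve the name through the catalog and return the same
# DataFrame; spark.read additionally exposes file-based readers.
df1 = spark.table("table_a")
df2 = spark.read.table("table_a")

# By contrast, spark.read.format(...).load(...) reads files directly:
df3 = spark.read.format("parquet").load("/path/to/data")
```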