I'm looking into spark-core and found an undocumented config, spark.executor.allowSparkContext, available since 3.0.1. I wasn't able to find any detail about it in the official Spark documentation.
In the code there is a short description for this config:
If set…
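For reference, a minimal sketch of how this conf would be set when building a session (assuming it behaves like any other boolean Spark property; judging by the name, it appears to control whether a SparkContext may be created inside executor code, which Spark 3.x otherwise blocks):

from pyspark.sql import SparkSession

# Sketch only: opt in to the behavior the config name suggests.
spark = (SparkSession.builder
         .appName("allow-sc-demo")
         .config("spark.executor.allowSparkContext", "true")
         .getOrCreate())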
Does anyone have experience with debugging Pyspark that runs on AWS EMR using Pycharm?
I couldn't find any good guides or existing threads regarding this one.
I know how to debug Scala Spark with IntelliJ against EMR, but I have no experience with…
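Not an EMR-specific recipe, but a sketch of the generic remote-debug attach that PyCharm Professional supports (assumes pydevd-pycharm is installed on the cluster, and <your-ip>:5678 is a placeholder for a "Python Debug Server" run configuration reachable from the EMR master):

# Start the debug server in PyCharm first, then add this at the
# top of the PySpark driver script submitted to EMR.
import pydevd_pycharm

pydevd_pycharm.settrace("<your-ip>", port=5678,
                        stdoutToServer=True, stderrToServer=True)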
I am getting the following error for the code below; please help:
from delta.tables import *
ModuleNotFoundError: No module named 'delta.tables'
INFO SparkContext: Invoking stop() from shutdown hook
Here is the code:
'''
from…
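That ModuleNotFoundError usually means the Delta Lake Python package isn't installed where the driver runs. A sketch of one common fix, assuming the delta-spark pip package (pip install delta-spark, with a version matching your Spark):

from pyspark.sql import SparkSession
from delta import configure_spark_with_delta_pip

builder = (SparkSession.builder
           .appName("delta-demo")
           .config("spark.sql.extensions",
                   "io.delta.sql.DeltaSparkSessionExtension")
           .config("spark.sql.catalog.spark_catalog",
                   "org.apache.spark.sql.delta.catalog.DeltaCatalog"))

# Pulls in the matching Delta jars and builds the session,
# after which `from delta.tables import *` resolves.
spark = configure_spark_with_delta_pip(builder).getOrCreate()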
Azure Databricks throws an execution error while parallelizing a pandas DataFrame. The code is able to create the RDD but breaks when performing .collect().
setup:
import pandas as pd
# initialize list of lists
data = [['tom', 10], ['nick', 15],…
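A sketch of the workaround that usually sidesteps this, assuming the goal is a Spark DataFrame: pass the pandas DataFrame to spark.createDataFrame() instead of sc.parallelize(), since iterating a pandas DataFrame yields its column labels rather than its rows:

import pandas as pd

data = [['tom', 10], ['nick', 15]]
pdf = pd.DataFrame(data, columns=['name', 'age'])

# `spark` is the SparkSession Databricks provides; Spark does
# the row conversion itself, no manual parallelize/collect.
sdf = spark.createDataFrame(pdf)
sdf.show()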
I just wondered if there is an EnumType in PySpark/Spark.
I want to add constraints on StringTypes (or other types as well) to have certain values only in my DataFrame's schema.
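As far as I know Spark has no EnumType, so a sketch of one workaround: keep the column a StringType and enforce the allowed values with a runtime check (the column name, helper, and values here are made up for illustration):

from pyspark.sql import functions as F

ALLOWED = ["red", "green", "blue"]  # hypothetical enum values

def assert_enum(df, column, allowed):
    # Fail fast if `column` holds any value outside `allowed`.
    bad = df.filter(~F.col(column).isin(allowed)).limit(1).collect()
    if bad:
        raise ValueError(f"{column} has non-enum value: {bad[0][column]}")
    return df

df = assert_enum(df, "color", ALLOWED)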
I have created 3 different notebooks using PySpark code in Azure Synapse Analytics. The notebooks run using a Spark pool.
There is only one Spark pool for all 3 notebooks. When these 3 notebooks run individually, the Spark pool starts for all 3 notebooks by…
I'd like to know the PySpark equivalent of the reset_index() command used in pandas. When I use the command as follows:
data.reset_index()
I get an error:
"DataFrame' object has no attribute 'reset_index' error"
I am trying to create a table in the Glue catalog with an S3 path location from Spark running on EMR, using Hive. I have tried the following commands, but I get this error:
pyspark.sql.utils.AnalysisException: u'java.lang.IllegalArgumentException: Can not…
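For context, a sketch of the kind of statement involved, assuming EMR is configured to use the Glue Data Catalog as the Hive metastore (database, table, and bucket names are placeholders):

spark.sql("""
    CREATE TABLE IF NOT EXISTS my_db.my_table (
        id   INT,
        name STRING
    )
    STORED AS PARQUET
    LOCATION 's3://my-bucket/my-prefix/'
""")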
I am trying to convert datetime strings with a timezone offset to timestamps using to_timestamp.
Sample dataframe:
df = spark.createDataFrame([("a", '2020-09-08 14:00:00.917+02:00'),
("b", '2020-09-08 14:00:00.900+01:00')],
…
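For strings in this shape, a sketch that parses the offset explicitly, assuming Spark 3.x's parser (XXX matches offsets like +02:00; the column name _2 is an assumption, being what createDataFrame assigns when no schema is given):

from pyspark.sql import functions as F

df2 = df.withColumn(
    "ts", F.to_timestamp("_2", "yyyy-MM-dd HH:mm:ss.SSSXXX"))

# The parsed instants are rendered in the session time zone.
df2.show(truncate=False)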
I want my Databricks notebook to fail if a certain condition is satisfied. Right now I am using dbutils.notebook.exit(), but it does not cause the notebook to fail, and I get an email saying the notebook run was successful. How can I make my notebook fail?
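A sketch of the usual approach: dbutils.notebook.exit() is a graceful exit that reports success, whereas any uncaught exception marks the run as failed:

should_fail = True  # hypothetical condition from your logic

if should_fail:
    # An uncaught exception fails the notebook run, so job
    # alerts/emails report a failure instead of success.
    raise Exception("Failing notebook: condition was satisfied")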
I would like to call NLTK to do some NLP on Databricks via PySpark.
I have installed NLTK from the Library tab of Databricks. It should be accessible from all nodes.
My Python 3 code:
import pyspark.sql.functions as F
from pyspark.sql.types import…
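A sketch of the pattern that usually works here: the library install makes import nltk succeed cluster-wide, but corpora like punkt still have to be downloaded where the workers run (the UDF and data below are illustrative):

import nltk
import pyspark.sql.functions as F
from pyspark.sql.types import ArrayType, StringType

@F.udf(returnType=ArrayType(StringType()))
def tokenize(text):
    # Fetch the tokenizer data on each worker on first use;
    # quiet=True makes repeated calls a cheap no-op.
    nltk.download("punkt", quiet=True)
    return nltk.word_tokenize(text) if text else []

df = spark.createDataFrame([("Hello Databricks world",)], ["text"])
df.withColumn("tokens", tokenize("text")).show(truncate=False)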
I have a Spark dataframe (prof_student_df) that lists student/professor pairs for each timestamp. There are 4 professors and 4 students per timestamp, and each professor-student pair has a "score" (so there are 16 rows per time frame). For each time…
I have been trying to copy a table from Glue over to one in Redshift. I created a job with the following code:
import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from…
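For comparison, a sketch of the write step such a job typically ends with, assuming a Glue catalog connection for Redshift and a temp S3 dir (all names are placeholders):

# `glueContext` comes from GlueContext(SparkContext), and `dyf` is
# the DynamicFrame read from the Glue catalog source table.
glueContext.write_dynamic_frame.from_jdbc_conf(
    frame=dyf,
    catalog_connection="my-redshift-conn",
    connection_options={"dbtable": "public.my_table",
                        "database": "my_redshift_db"},
    redshift_tmp_dir="s3://my-bucket/temp/")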
I need to fetch the week start date and week end date for a given date, taking into account that the week starts on Sunday and ends on Saturday.
I referred to this post, but it takes Monday as the starting day of the week. Is there any inbuilt function in…
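A sketch using dayofweek, which is built in and returns 1 for Sunday through 7 for Saturday, so the arithmetic below pins the week to Sunday..Saturday:

from pyspark.sql import functions as F

df = (spark.createDataFrame([("2020-09-09",)], ["d"])
      .withColumn("d", F.to_date("d")))

# expr() keeps the column-valued day count portable across versions.
df = (df.withColumn("week_start", F.expr("date_sub(d, dayofweek(d) - 1)"))
        .withColumn("week_end",
                    F.expr("date_add(date_sub(d, dayofweek(d) - 1), 6)")))
df.show()  # 2020-09-09 -> week_start 2020-09-06, week_end 2020-09-12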