First off, can I just say that I am learning Databricks at the time of writing this post, so I'd welcome simpler, cruder solutions as well as more sophisticated ones.
I am reading a CSV file like this:
df1 = spark.read.format("csv").option("header",…
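For reference, a complete version of that kind of read looks roughly like this (the path and option values are placeholders, since my actual call is cut off above):

# Sketch of a typical CSV read; the path and option values are
# placeholders rather than my real ones.
df1 = (spark.read.format("csv")
       .option("header", "true")       # first row holds column names
       .option("inferSchema", "true")  # let Spark guess column types
       .load("/path/to/file.csv"))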
I have been searching for links, documents, or articles that will help me understand when to go for Datasets over DataFrames and vice versa.
All I find on the internet are headlines about when to use a Dataset, but when opened, they just…
How do I avoid initializing a class within a PySpark user-defined function? Here is an example.
Creating a Spark session and a DataFrame representing four latitudes and longitudes.
import pandas as pd
from pyspark import SparkConf
from pyspark.sql…
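Since my real snippet is longer, here is a self-contained sketch of the pattern I mean; Geocoder is a hypothetical expensive-to-construct class standing in for my actual one:

import pandas as pd
from pyspark.sql import SparkSession
from pyspark.sql.functions import pandas_udf

spark = SparkSession.builder.getOrCreate()

class Geocoder:
    # Hypothetical stand-in for an expensive-to-initialize class.
    def lookup(self, lat, lon):
        return f"{lat:.2f},{lon:.2f}"

_geocoder = None  # cached once per Python worker process

def get_geocoder():
    # Build the expensive object lazily, once per worker rather than per row.
    global _geocoder
    if _geocoder is None:
        _geocoder = Geocoder()
    return _geocoder

@pandas_udf("string")
def reverse_geocode(lat: pd.Series, lon: pd.Series) -> pd.Series:
    gc = get_geocoder()
    return pd.Series([gc.lookup(a, b) for a, b in zip(lat, lon)])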
I have a dataset like the one below:
epoch_seconds    eq_time
1636663343887    2021-11-12 02:12:23
Now, I am trying to convert eq_time back to epoch seconds, which should match the value of the first column, but I am unable to do so. Below is my…
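Since my code is cut off, here is a minimal sketch of the conversion I am attempting (assuming the DataFrame is named df with the columns above, and noting that the first column appears to be epoch milliseconds):

from pyspark.sql import functions as F

# unix_timestamp() returns whole seconds in the session time zone,
# so multiply by 1000 to compare against the millisecond column;
# any sub-second part of the original value is lost.
df = df.withColumn(
    "derived_epoch",
    F.unix_timestamp(F.col("eq_time"), "yyyy-MM-dd HH:mm:ss") * 1000
)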
I have a PySpark job which runs without any issues locally, but when it runs on the AWS cluster, it gets stuck at the point when it reaches the code below. The job only processes 100 records. "some_function" posts data into a website…
Update: the root issue was a bug which was fixed in Spark 3.2.0.
Input df structures are identical in both runs, but the outputs are different. Only the second run returns the desired result (df6). I know I can use aliases for the DataFrames, which would return…
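For reference, the alias workaround I am referring to looks roughly like this (df1, df2, and the column names are placeholders):

from pyspark.sql import functions as F

# Alias each DataFrame so columns from the two sides of the join
# can be referenced unambiguously.
joined = (df1.alias("a")
          .join(df2.alias("b"), F.col("a.id") == F.col("b.id"))
          .select(F.col("a.id"), F.col("b.value")))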
I am using Spark Streaming (Spark 2.4.6) to read data files from an NFS mount point. However, sometimes the streaming job checkpoints files differently for different batches, and hence produces duplicates. Does anyone have a similar issue?
Here is…
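The general shape of my job is roughly this (the paths and schema are placeholders, not my real values):

from pyspark.sql.types import StructType, StructField, StringType

# File streams require an explicit schema; this one is a placeholder.
file_schema = StructType([StructField("line", StringType())])

query = (spark.readStream
         .schema(file_schema)
         .csv("/mnt/nfs/input")                            # NFS mount point
         .writeStream
         .format("parquet")
         .option("path", "/mnt/nfs/output")
         .option("checkpointLocation", "/mnt/nfs/checkpoint")
         .start())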
I notice that when I run the same code as my example over here, but with a union, unionByName, or unionAll instead of the join, query planning takes significantly longer and can result in a driver OOM.
Code included here for reference, with a slight…
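The slow variant has roughly this shape (df_1 through df_3 are placeholders for the real branches of my plan):

from functools import reduce
from pyspark.sql import DataFrame

# Unioning several derived DataFrames; in the fast case the same
# inputs are combined with a join instead.
dfs = [df_1, df_2, df_3]
unioned = reduce(DataFrame.unionByName, dfs)
unioned.explain()  # this planning step is what takes significantly longer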
Apache Airflow version: v2.1.1
Kubernetes version (if you are using kubernetes) (use kubectl version):
Client Version: version.Info{Major:"1", Minor:"21", GitVersion:"v1.21.2", GitCommit:"092fbfbf53427de67cac1e9fa54aaa09a28371d7",…
I am trying to use Spark Structured Streaming's Trigger Once feature to mimic a batch-like setup. However, I run into some trouble when running my initial batch, because I have a lot of historic data, and for this reason I am also using the…
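For context, my trigger setup looks roughly like this (the paths and schema are placeholders; in PySpark, Trigger.Once is spelled once=True):

from pyspark.sql.types import StructType, StructField, StringType

schema = StructType([StructField("value", StringType())])  # placeholder

# Process everything available once, then stop: a batch-like run.
query = (spark.readStream
         .schema(schema)
         .csv("/data/input")
         .writeStream
         .format("parquet")
         .option("path", "/data/output")
         .option("checkpointLocation", "/data/checkpoint")
         .trigger(once=True)
         .start())
query.awaitTermination()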
The documentation's explanation is given as:
The amount of memory to be allocated to PySpark in each executor, in MiB unless otherwise specified. If set, PySpark memory for an executor will be limited to this amount. If not set, Spark will not limit…
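Concretely, the property being described is spark.executor.pyspark.memory; a sketch of setting it (512m is an arbitrary example value, not a recommendation):

from pyspark.sql import SparkSession

# Cap the memory available to the Python worker in each executor.
spark = (SparkSession.builder
         .config("spark.executor.pyspark.memory", "512m")
         .getOrCreate())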
I read about checkpoint(), and it looks great for my needs, but I couldn't find a good example of how to use it.
My questions are:
Should I specify the checkpoint dir? Is it possible to do it like this:
df.checkpoint()
Are there any optional params…
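What I have pieced together so far looks like this (the directory path is a placeholder):

# Set a checkpoint directory on the context first, then checkpoint
# the DataFrame itself; checkpoint() is eager by default.
spark.sparkContext.setCheckpointDir("/tmp/checkpoints")
df = df.checkpoint()                   # truncates lineage immediately
df_lazy = df.checkpoint(eager=False)   # checkpointed on first materialization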
MAIN GOAL
Show or select columns from the Spark DataFrame read from a parquet file.
None of the solutions mentioned in the forum were successful in our case.
PROBLEM
The issue happens when the parquet file is read and queried with Spark, and is due…
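For completeness, the basic read-and-select we are attempting (the path and column names are placeholders for our actual file):

# Read the parquet file and project a couple of columns.
df = spark.read.parquet("/path/to/file.parquet")
df.printSchema()
df.select("col_a", "col_b").show(5)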
I'm trying to call a list of PySpark modules dynamically from one main.py Python script, using importlib.import_module and subprocess. The child modules I'm trying to call do not return anything; they just do their ETL operations. I want my main.py program to…
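The dynamic-dispatch part looks roughly like this (the module names are placeholders, and I assume each child module exposes a run() entry point):

import importlib
import subprocess

# Placeholder module names; each is assumed to expose run().
modules = ["etl.load_customers", "etl.load_orders"]

for name in modules:
    mod = importlib.import_module(name)  # import the child module by name
    mod.run()                            # run its ETL; nothing is returned

# Alternative: launch each module as a separate process via spark-submit.
subprocess.run(["spark-submit", "etl/load_customers.py"], check=True)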
I'm trying to submit my PySpark application to a Kubernetes cluster (Minikube) using spark-submit:
./bin/spark-submit \
--master k8s://https://192.168.64.4:8443 \
--deploy-mode cluster \
--packages…