Questions tagged [pyspark]

The Spark Python API (PySpark) exposes the Apache Spark programming model to Python.

39058 questions
7
votes
3 answers

Databricks: Ingesting CSV data to a Delta Live Table in Python triggers "invalid characters in table name" error - how to set column mapping mode?

First off, can I just say that I am learning Databricks at the time of writing this post, so I'd like simpler, cruder solutions as well as more sophisticated ones. I am reading a CSV file like this: df1 = spark.read.format("csv").option("header",…
Asfand Qazi
  • 6,586
  • 4
  • 32
  • 34
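For the Delta Live Tables question above, a minimal sketch of setting column mapping mode, assuming a Databricks DLT pipeline (the path and table comment are placeholders, not from the original question). The "name" mapping mode requires the protocol versions shown:

```python
import dlt  # available only inside a Databricks Delta Live Tables pipeline

@dlt.table(
    comment="Raw CSV ingest; column mapping by name tolerates special characters",
    table_properties={
        "delta.columnMapping.mode": "name",  # map columns by name, not position
        "delta.minReaderVersion": "2",       # column mapping needs these versions
        "delta.minWriterVersion": "5",
    },
)
def raw_csv():
    # "/path/to/landing" is a placeholder landing path
    return (
        spark.read.format("csv")
        .option("header", "true")
        .load("/path/to/landing")
    )
```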
7
votes
2 answers

In what situations are Datasets preferred to DataFrames and vice versa in Apache Spark?

I have been searching for any links, documents, or articles that will help me understand when we should go for Datasets over DataFrames and vice versa. All I find on the internet are headlines about when to use a Dataset, but when opened, they just…
Metadata
  • 2,127
  • 9
  • 56
  • 127
7
votes
1 answer

Access objects in a PySpark user-defined function from outer scope, avoid PicklingError: Could not serialize object

How do I avoid initializing a class within a PySpark user-defined function? Here is an example: I create a Spark session and a DataFrame representing four latitudes and longitudes. import pandas as pd from pyspark import SparkConf from pyspark.sql…
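A common way around the PicklingError is to construct the non-serializable object inside the function shipped to executors, so only picklable arguments are captured. A sketch, with Geocoder standing in as a hypothetical non-picklable helper:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
    [(40.0, -105.0), (40.1, -105.1), (40.2, -105.2), (40.3, -105.3)],
    ["lat", "lon"],
)

class Geocoder:  # hypothetical helper that cannot be pickled
    def lookup(self, lat, lon):
        return f"{lat:.1f},{lon:.1f}"

def process_partition(rows):
    geo = Geocoder()  # built on the executor, so nothing unpicklable is captured
    for row in rows:
        yield (row.lat, row.lon, geo.lookup(row.lat, row.lon))

result = df.rdd.mapPartitions(process_partition).toDF(["lat", "lon", "label"])
result.show()
```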
7
votes
3 answers

Converting timestamp to epoch milliseconds in pyspark

I have a dataset like the one below: epoch_seconds eq_time 1636663343887 2021-11-12 02:12:23 Now I am trying to convert the eq_time to epoch seconds, which should match the value of the first column, but am unable to do so. Below is my…
whatsinthename
  • 1,828
  • 20
  • 59
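For the epoch-milliseconds question above, one approach: casting a timestamp to double yields epoch seconds (including any fractional part), so scaling by 1000 gives milliseconds. A sketch using the column names from the question:

```python
from pyspark.sql import functions as F

df = spark.createDataFrame(
    [(1636663343887, "2021-11-12 02:12:23")],
    ["epoch_seconds", "eq_time"],
).withColumn("eq_time", F.to_timestamp("eq_time"))

# Cast the timestamp to double (epoch seconds), then scale to milliseconds.
df = df.withColumn(
    "eq_time_ms", (F.col("eq_time").cast("double") * 1000).cast("long")
)
df.show(truncate=False)
```

Note that the result depends on the session time zone, since the string timestamp carries no zone of its own.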
7
votes
2 answers

AWS EMR: PySpark: RDD: mapPartitions: Could not find valid SPARK_HOME while searching: Spark closures

I have a PySpark job which runs without any issues locally, but when it runs on the AWS cluster, it gets stuck when it reaches the code below. The job processes just 100 records. "some_function" posts data to a website…
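A frequent cause of this error is the closure passed to mapPartitions capturing driver-side state (such as the SparkSession) that executors then try to re-create, triggering the SPARK_HOME search. A hedged sketch, where df and the URL are placeholders:

```python
# Anti-pattern: referencing the driver-side SparkSession inside the function
# shipped to executors forces each worker to look up SPARK_HOME and fails.
def bad_partition_fn(rows):
    for row in rows:
        spark.sql("...")  # 'spark' lives on the driver; do not use it here
        yield row

# Sketch of a fix: the closure touches only plain Python objects.
import requests  # stand-in for whatever "some_function" uses to post data

def post_partition(rows):
    session = requests.Session()  # created per partition, on the executor
    for row in rows:
        session.post("https://example.com/api", json=row.asDict())  # placeholder URL
        yield row

results = df.rdd.mapPartitions(post_partition).collect()
```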
7
votes
2 answers

Why does joining structurally identical DataFrames give different results?

Update: the root issue was a bug which was fixed in Spark 3.2.0. The input df structures are identical in both runs, but the outputs are different. Only the second run returns the desired result (df6). I know I can use aliases for the dataframes, which would return…
ZygD
  • 22,092
  • 39
  • 79
  • 102
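The alias workaround mentioned in the question above might look like the following sketch, where df_left, df_right, and the column names are hypothetical. Aliasing disambiguates which input each column reference resolves to when both sides share an identical schema:

```python
from pyspark.sql import functions as F

# df_left and df_right are hypothetical DataFrames with the same schema.
joined = (
    df_left.alias("a")
    .join(df_right.alias("b"), F.col("a.id") == F.col("b.id"), "inner")
    .select(F.col("a.id"), F.col("b.value"))
)
```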
7
votes
1 answer

Spark streaming reads file twice from NFS

I am using Spark streaming (Spark 2.4.6) to read data files from an NFS mount point. However, the Spark streaming job sometimes checkpoints files differently for different batches, hence producing duplicates. Does anyone have a similar issue? Here is…
kevi
  • 71
  • 2
7
votes
2 answers

Why is my build hanging / taking a long time to generate my query plan with many unions?

I notice that when I run the same code as my example over here, but with a union, unionByName, or unionAll instead of the join, my query planning takes significantly longer and can result in a driver OOM. Code included here for reference, with a slight…
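For the union-planning question above, one commonly suggested mitigation is to truncate the logical plan so the optimizer does not re-analyze an ever-growing tree of unions. A sketch under that assumption, with dfs as a hypothetical list of same-schema DataFrames:

```python
from functools import reduce

# dfs is a hypothetical list of DataFrames with matching schemas.
dfs = [spark.range(10) for _ in range(50)]

unioned = reduce(lambda a, b: a.unionByName(b), dfs)

# A local checkpoint cuts the lineage, which can keep query planning
# from blowing up after many stacked unions.
unioned = unioned.localCheckpoint()
```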
7
votes
2 answers

Unable to create SparkApplications on Kubernetes cluster using SparkKubernetesOperator from Airflow DAG

Apache Airflow version: v2.1.1 Kubernetes version (if you are using kubernetes) (use kubectl version):- Client Version: version.Info{Major:"1", Minor:"21", GitVersion:"v1.21.2", GitCommit:"092fbfbf53427de67cac1e9fa54aaa09a28371d7",…
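For reference against the setup above, a minimal sketch of submitting a SparkApplication from an Airflow DAG, assuming the cncf.kubernetes provider is installed; the DAG id, namespace, and manifest file name are placeholders:

```python
from datetime import datetime

from airflow import DAG
from airflow.providers.cncf.kubernetes.operators.spark_kubernetes import (
    SparkKubernetesOperator,
)

with DAG("spark_pi", start_date=datetime(2021, 1, 1), schedule_interval=None) as dag:
    submit = SparkKubernetesOperator(
        task_id="submit_spark_app",
        namespace="spark-apps",             # placeholder namespace
        application_file="spark-app.yaml",  # SparkApplication manifest in the DAG folder
        kubernetes_conn_id="kubernetes_default",
    )
```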
7
votes
2 answers

How can I control the amount of files being processed for each trigger in Spark Structured Streaming using the "Trigger once" trigger?

I am trying to use Spark Structured Streaming's feature, Trigger once, to mimic a batch-like setup. However, I run into some trouble when running my initial batch, because I have a lot of historic data, and for this reason I am also using the…
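A note on the question above: Trigger once processes the entire backlog in a single batch and ignores maxFilesPerTrigger, whereas the availableNow trigger (Spark 3.3+) honors rate-limit options while still stopping when the backlog is drained. A sketch assuming Spark 3.3+, with placeholder paths and schema:

```python
from pyspark.sql.types import StringType, StructField, StructType

schema = StructType([StructField("value", StringType())])  # placeholder schema

stream = (
    spark.readStream.format("csv")
    .schema(schema)
    .option("maxFilesPerTrigger", 500)   # cap files per micro-batch
    .load("/data/landing")               # placeholder input path
)

query = (
    stream.writeStream.format("delta")
    .option("checkpointLocation", "/chk/ingest")  # placeholder checkpoint path
    .trigger(availableNow=True)          # instead of .trigger(once=True)
    .start("/data/bronze")               # placeholder output path
)
```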
7
votes
1 answer

In Spark, what is the meaning of the spark.executor.pyspark.memory configuration option?

The documentation's explanation is: The amount of memory to be allocated to PySpark in each executor, in MiB unless otherwise specified. If set, PySpark memory for an executor will be limited to this amount. If not set, Spark will not limit…
figs_and_nuts
  • 4,870
  • 2
  • 31
  • 56
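In practice, the option caps the memory of the Python worker processes on each executor, separately from the JVM heap. A sketch of setting it; the values are illustrative only:

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .config("spark.executor.memory", "4g")          # JVM heap per executor
    .config("spark.executor.pyspark.memory", "2g")  # Python workers per executor
    .getOrCreate()
)
```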
7
votes
1 answer

DataFrame Checkpoint Example in PySpark

I read about checkpointing and it looks great for my needs, but I couldn't find a good example of how to use it. My questions are: Should I specify the checkpoint dir? Is it possible to do it like this: df.checkpoint()? Are there any optional params…
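A minimal sketch answering the questions above: the checkpoint directory must be set once on the SparkContext before df.checkpoint() is called, and eager (default True) is the optional parameter controlling whether the checkpoint is materialized immediately. The path is a placeholder:

```python
# Required before any call to df.checkpoint().
spark.sparkContext.setCheckpointDir("/tmp/spark-checkpoints")  # placeholder path

df = spark.range(10)
df_checkpointed = df.checkpoint(eager=True)  # returns a new DataFrame with truncated lineage
```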
7
votes
1 answer

Use parquet file with special characters in column names in PySpark

MAIN GOAL Show or select columns from the Spark dataframe read from the parquet file. None of the solutions mentioned in the forum were successful in our case. PROBLEM The issue happens when the parquet file is read and queried with Spark and is due…
m v
  • 71
  • 1
  • 3
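For the special-characters question above, one workaround sketch: Spark rejects the characters " ,;{}()\n\t=" in column names when writing, so renaming every column up front and only then selecting avoids the problem. The path is a placeholder:

```python
import re

df = spark.read.parquet("/path/to/file.parquet")  # placeholder path

# Replace every run of forbidden characters with an underscore,
# then query the sanitized DataFrame.
sanitized = df.toDF(*(re.sub(r"[ ,;{}()\n\t=]+", "_", c) for c in df.columns))
sanitized.show()
```

For a one-off select, wrapping the original name in backticks (e.g. df.select("`col name`")) can also work.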
7
votes
1 answer

Error while importing a PySpark ETL module and running it as a child process using Python subprocess

I'm trying to call a list of PySpark modules dynamically from one main.py Python script, using module imports and subprocess. The child modules I'm trying to call do not return anything; they just do their ETL operations. I want my main.py program to…
user7343922
  • 316
  • 4
  • 17
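One way main.py can track children that return nothing is to rely on exit codes. A sketch under that assumption; the module names are hypothetical:

```python
import subprocess

# Hypothetical child ETL scripts; each does its work and exits.
modules = ["etl_orders.py", "etl_customers.py"]

for module in modules:
    proc = subprocess.run(
        ["spark-submit", module],
        capture_output=True,
        text=True,
    )
    # The return code tells main.py whether the child ETL succeeded,
    # even though the module itself returns nothing.
    if proc.returncode != 0:
        print(f"{module} failed:\n{proc.stderr}")
```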
7
votes
1 answer

spark-submit to Kubernetes: packages not pulled by executors

I'm trying to submit my PySpark application to a Kubernetes cluster (Minikube) using spark-submit: ./bin/spark-submit \ --master k8s://https://192.168.64.4:8443 \ --deploy-mode cluster \ --packages…
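For the question above, a hedged sketch of one commonly suggested route: passing the coordinates through the spark.jars.packages conf, with baking the jars into the container image as the frequently cited alternative for cluster mode on Kubernetes. The coordinate and application file are placeholders, not the values elided in the question:

```python
import subprocess

# Wraps the same spark-submit invocation from Python; check=True raises
# if the submission itself fails.
subprocess.run([
    "./bin/spark-submit",
    "--master", "k8s://https://192.168.64.4:8443",
    "--deploy-mode", "cluster",
    "--conf", "spark.jars.packages=org.apache.spark:spark-avro_2.12:3.1.2",  # placeholder coordinate
    "app.py",  # placeholder application
], check=True)
```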