First off, can I just say that I am learning Databricks at the time of writing this post, so I'd welcome simpler, cruder solutions as well as more sophisticated ones.
I am reading a CSV file like this:
df1 = spark.read.format("csv").option("header",…
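For reference, a complete version of that kind of read looks roughly like this (the path and option values are placeholders, since my actual call is cut off above):

# Sketch of a typical CSV read; the path and option values are
# placeholders rather than my real ones.
df1 = (spark.read.format("csv")
       .option("header", "true")       # first row holds column names
       .option("inferSchema", "true")  # let Spark guess column types
       .load("/path/to/file.csv"))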
I have been searching for links, documents, or articles that will help me understand when to go for Datasets over DataFrames and vice versa.
All I find on the internet are headlines about when to use a Dataset, but when opened, they just…
How do I avoid initializing a class within a PySpark user-defined function? Here is an example.
Creating a Spark session and a DataFrame representing four latitudes and longitudes.
import pandas as pd
from pyspark import SparkConf
from pyspark.sql…
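Since my real snippet is longer, here is a self-contained sketch of the pattern I mean; Geocoder is a hypothetical expensive-to-construct class standing in for my actual one:

import pandas as pd
from pyspark.sql import SparkSession
from pyspark.sql.functions import pandas_udf

spark = SparkSession.builder.getOrCreate()

class Geocoder:
    # Hypothetical stand-in for an expensive-to-initialize class.
    def lookup(self, lat, lon):
        return f"{lat:.2f},{lon:.2f}"

_geocoder = None  # cached once per Python worker process

def get_geocoder():
    # Build the expensive object lazily, once per worker rather than per row.
    global _geocoder
    if _geocoder is None:
        _geocoder = Geocoder()
    return _geocoder

@pandas_udf("string")
def reverse_geocode(lat: pd.Series, lon: pd.Series) -> pd.Series:
    gc = get_geocoder()
    return pd.Series([gc.lookup(a, b) for a, b in zip(lat, lon)])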
I have a dataset like the one below:
epoch_seconds    eq_time
1636663343887    2021-11-12 02:12:23
Now, I am trying to convert eq_time back to epoch seconds, which should match the value of the first column, but I am unable to do so. Below is my…
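Since my code is cut off, here is a minimal sketch of the conversion I am attempting (assuming the DataFrame is named df with the columns above, and noting that the first column appears to be epoch milliseconds):

from pyspark.sql import functions as F

# unix_timestamp() returns whole seconds in the session time zone,
# so multiply by 1000 to compare against the millisecond column;
# any sub-second part of the original value is lost.
df = df.withColumn(
    "derived_epoch",
    F.unix_timestamp(F.col("eq_time"), "yyyy-MM-dd HH:mm:ss") * 1000
)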
I have a PySpark job which runs without any issues locally, but when it runs on the AWS cluster, it gets stuck at the point when it reaches the code below. The job only processes 100 records. "some_function" posts data into a website…
Update: the root issue was a bug which was fixed in Spark 3.2.0.
Input df structures are identical in both runs, but the outputs are different. Only the second run returns the desired result (df6). I know I can use aliases for the DataFrames, which would return…
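For reference, the alias workaround I am referring to looks roughly like this (df1, df2, and the column names are placeholders):

from pyspark.sql import functions as F

# Alias each DataFrame so columns from the two sides of the join
# can be referenced unambiguously.
joined = (df1.alias("a")
          .join(df2.alias("b"), F.col("a.id") == F.col("b.id"))
          .select(F.col("a.id"), F.col("b.value")))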
I am using Spark Streaming (Spark 2.4.6) to read data files from an NFS mount point. However, sometimes the streaming job checkpoints files differently for different batches, and hence produces duplicates. Does anyone have a similar issue?
Here is…
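The general shape of my job is roughly this (the paths and schema are placeholders, not my real values):

from pyspark.sql.types import StructType, StructField, StringType

# File streams require an explicit schema; this one is a placeholder.
file_schema = StructType([StructField("line", StringType())])

query = (spark.readStream
         .schema(file_schema)
         .csv("/mnt/nfs/input")                            # NFS mount point
         .writeStream
         .format("parquet")
         .option("path", "/mnt/nfs/output")
         .option("checkpointLocation", "/mnt/nfs/checkpoint")
         .start())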
I notice that when I run the same code as my example over here, but with a union, unionByName, or unionAll instead of the join, query planning takes significantly longer and can result in a driver OOM.
Code included here for reference, with a slight…
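The slow variant has roughly this shape (df_1 through df_3 are placeholders for the real branches of my plan):

from functools import reduce
from pyspark.sql import DataFrame

# Unioning several derived DataFrames; in the fast case the same
# inputs are combined with a join instead.
dfs = [df_1, df_2, df_3]
unioned = reduce(DataFrame.unionByName, dfs)
unioned.explain()  # this planning step is what takes significantly longer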
Apache Airflow version: v2.1.1
Kubernetes version (if you are using kubernetes) (use kubectl version):
Client Version: version.Info{Major:"1", Minor:"21", GitVersion:"v1.21.2", GitCommit:"092fbfbf53427de67cac1e9fa54aaa09a28371d7",…
I am trying to use Spark Structured Streaming's Trigger Once feature to mimic a batch-like setup. However, I run into some trouble when running my initial batch, because I have a lot of historic data, and for this reason I am also using the…
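For context, my trigger setup looks roughly like this (the paths and schema are placeholders; in PySpark, Trigger.Once is spelled once=True):

from pyspark.sql.types import StructType, StructField, StringType

schema = StructType([StructField("value", StringType())])  # placeholder

# Process everything available once, then stop: a batch-like run.
query = (spark.readStream
         .schema(schema)
         .csv("/data/input")
         .writeStream
         .format("parquet")
         .option("path", "/data/output")
         .option("checkpointLocation", "/data/checkpoint")
         .trigger(once=True)
         .start())
query.awaitTermination()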
The documentation's explanation is given as:
The amount of memory to be allocated to PySpark in each executor, in MiB unless otherwise specified. If set, PySpark memory for an executor will be limited to this amount. If not set, Spark will not limit…
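Concretely, the property being described is spark.executor.pyspark.memory; a sketch of setting it (512m is an arbitrary example value, not a recommendation):

from pyspark.sql import SparkSession

# Cap the memory available to the Python worker in each executor.
spark = (SparkSession.builder
         .config("spark.executor.pyspark.memory", "512m")
         .getOrCreate())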
I read about checkpoint(), and it looks great for my needs, but I couldn't find a good example of how to use it.
My questions are:
Should I specify the checkpoint dir? Is it possible to do it like this:
df.checkpoint()
Are there any optional params…
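What I have pieced together so far looks like this (the directory path is a placeholder):

# Set a checkpoint directory on the context first, then checkpoint
# the DataFrame itself; checkpoint() is eager by default.
spark.sparkContext.setCheckpointDir("/tmp/checkpoints")
df = df.checkpoint()                   # truncates lineage immediately
df_lazy = df.checkpoint(eager=False)   # checkpointed on first materialization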
MAIN GOAL
Show or select columns from the Spark DataFrame read from a parquet file.
None of the solutions mentioned in the forum were successful in our case.
PROBLEM
The issue happens when the parquet file is read and queried with Spark, and is due…
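For completeness, the basic read-and-select we are attempting (the path and column names are placeholders for our actual file):

# Read the parquet file and project a couple of columns.
df = spark.read.parquet("/path/to/file.parquet")
df.printSchema()
df.select("col_a", "col_b").show(5)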
I'm trying to call a list of PySpark modules dynamically from one main.py Python script, using importlib.import_module and subprocess. The child modules I'm trying to call do not return anything; they just do their ETL operations. I want my main.py program to…
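The dynamic-dispatch part looks roughly like this (the module names are placeholders, and I assume each child module exposes a run() entry point):

import importlib
import subprocess

# Placeholder module names; each is assumed to expose run().
modules = ["etl.load_customers", "etl.load_orders"]

for name in modules:
    mod = importlib.import_module(name)  # import the child module by name
    mod.run()                            # run its ETL; nothing is returned

# Alternative: launch each module as a separate process via spark-submit.
subprocess.run(["spark-submit", "etl/load_customers.py"], check=True)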
I'm trying to submit my PySpark application to a Kubernetes cluster (Minikube) using spark-submit:
./bin/spark-submit \
--master k8s://https://192.168.64.4:8443 \
--deploy-mode cluster \
--packages…