Questions tagged [pyspark]

The Spark Python API (PySpark) exposes the Apache Spark programming model to Python.


39058 questions
7
votes
2 answers

PySpark DataFrame Floor division unsupported operand type(s)

I have a dataset like below. I am grouping by age and averaging the number of friends for each age. from pyspark.sql import SparkSession from pyspark.sql import Row import pyspark.sql.functions as F def parseInput(line): fields = line.split(',') …
Chelseajcole
  • 487
  • 1
  • 9
  • 16
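A minimal sketch of the usual fix, assuming the error comes from applying Python's // operator to Column objects, which PySpark columns do not support; flooring an ordinary division avoids it (the data below is made up):

```python
from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.appName("friends-by-age").getOrCreate()

# Made-up (age, friends) rows standing in for the parsed dataset.
df = spark.createDataFrame(
    [(33, 385), (33, 2), (55, 221), (40, 465)], ["age", "friends"])

# Column objects do not implement //, which raises
# "unsupported operand type(s)". Floor a regular division instead.
df.groupBy("age").agg(F.floor(F.avg("friends")).alias("avg_friends")).show()
```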
7
votes
1 answer

How to run parallel threads in AWS Glue PySpark?

I have a spark job that will just pull data from multiple tables with the same transforms. Basically a for loop that iterates across a list of tables, queries the catalog table, adds a timestamp, then shoves into Redshift (example below). This job…
sewardth
  • 347
  • 2
  • 13
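One hedged sketch of running the per-table work concurrently: Spark's driver accepts job submissions from multiple Python threads, so a ThreadPoolExecutor can overlap the independent reads and writes. The table list, the process_table body, and the existing spark session are all assumptions:

```python
from concurrent.futures import ThreadPoolExecutor, as_completed
import pyspark.sql.functions as F

tables = ["table_a", "table_b", "table_c"]   # hypothetical catalog tables

def process_table(name):
    # Same transform for every table: read, stamp, write.
    df = spark.table(name).withColumn("load_ts", F.current_timestamp())
    # ... write df to Redshift here (e.g. via Glue's JDBC connection) ...
    return name

with ThreadPoolExecutor(max_workers=4) as pool:
    futures = [pool.submit(process_table, t) for t in tables]
    for fut in as_completed(futures):
        print("finished", fut.result())
```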
7
votes
2 answers

PySpark: Optimize read/load from Delta using selected columns or partitions

I am trying to load data from Delta into a pyspark dataframe. path_to_data = 's3://mybucket/daily_data/' df = spark.read.format("delta").load(path_to_data) Now the underlying data is partitioned by date as s3://mybucket/daily_data/ …
Spandan Brahmbhatt
  • 3,774
  • 6
  • 24
  • 36
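A sketch of the usual optimization, assuming the table is partitioned by a date column matching the S3 layout: filtering on the partition column prunes partitions at load time, and selecting only the needed columns limits how much of each underlying Parquet file is read (the column names here are invented):

```python
import pyspark.sql.functions as F

path_to_data = "s3://mybucket/daily_data/"

df = (spark.read.format("delta")
          .load(path_to_data)
          .where(F.col("date") >= "2021-01-01")      # partition pruning
          .select("date", "user_id", "metric"))      # column pruning
```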
7
votes
2 answers

PySpark - pass a value from another column as the parameter of spark function

I have a spark dataframe which looks like this, where expr is a SQL/Hive filter expression:
+---------+-----+-----+
|expr     |var1 |var2 |
+---------+-----+-----+
|var1 > 7 |9…
UtkarshSahu
  • 93
  • 2
  • 10
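F.expr() only accepts a literal SQL string, not a Column, so the expression cannot be passed per row directly. One hedged workaround, reasonable when the number of distinct expressions is small: collect them and apply each one only to the rows that carry it (df and its columns are assumed from the question):

```python
import pyspark.sql.functions as F

# Distinct expression strings present in the `expr` column.
exprs = [r["expr"] for r in df.select("expr").distinct().collect()]

result = df.withColumn("expr_result", F.lit(None).cast("boolean"))
for e in exprs:
    result = result.withColumn(
        "expr_result",
        F.when(F.col("expr") == e, F.expr(e))
         .otherwise(F.col("expr_result")))
```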
7
votes
3 answers

Dependency issue with Pyspark running on Kubernetes using spark-on-k8s-operator

I have spent days now trying to figure out a dependency issue I'm experiencing with (Py)Spark running on Kubernetes. I'm using the spark-on-k8s-operator and Spark's Google Cloud connector. When I try to submit my spark job without a dependency using…
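Without the full manifest it is hard to be definitive, but a common pattern is to declare dependencies through Spark configuration rather than baking them into the image, so the operator-launched driver and executors fetch them at startup. A sketch; the package coordinate and zip path are placeholders:

```python
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("k8s-deps-sketch")
         # JVM dependencies resolved at startup; coordinate is a placeholder.
         .config("spark.jars.packages",
                 "com.google.cloud.bigdataoss:gcs-connector:hadoop3-2.2.0")
         .getOrCreate())

# Pure-Python helpers can also be shipped at runtime (hypothetical path).
spark.sparkContext.addPyFile("gs://my-bucket/deps.zip")
```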
7
votes
4 answers

PySpark remove special characters in all column names for all special characters

I am trying to remove all special characters from all the columns. I am using the following commands: import pyspark.sql.functions as F df_spark = spark_df.select([F.col(col).alias(col.replace(' ', '_')) for col in df.columns]) df_spark1 =…
user13766314
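A compact sketch of one way to do this, assuming "special characters" means anything outside letters, digits, and underscores (spark_df stands in for the asker's DataFrame):

```python
import re

# toDF(*names) renames every column in one pass; the regex replaces any
# character that is not a letter, digit, or underscore with "_".
clean = spark_df.toDF(
    *[re.sub(r"[^0-9a-zA-Z_]", "_", c) for c in spark_df.columns])
```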
7
votes
1 answer

mypy type checking shows error when a variable gets dynamically allocated

I have a class that takes a Spark DataFrame and does some processing on it. Here is the code:
for column in self.sdf.columns:
    if column not in self.__columns:
        row = [column]
        row += '--' * 9
…
ahrooran
  • 931
  • 1
  • 10
  • 25
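The likely culprit in the excerpt is row += '--' * 9, which extends the list with the 18 individual characters of one long string, and which mypy rejects as list += str. A minimal sketch of the probable fix (the column list is hypothetical):

```python
from typing import List

columns = ["age", "name"]          # hypothetical column list
for column in columns:
    row: List[str] = [column]      # explicit annotation keeps mypy precise
    row += ["--"] * 9              # nine "--" strings, not 18 characters
    print(row)
```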
7
votes
2 answers

Databricks notebook crashes on memory-heavy job

I am running a few operations to aggregate a large quantity of data (about 600 GB) on Azure Databricks. I noticed recently that the notebook crashes and Databricks returns the error below. The same code worked before on a smaller 6-node cluster.…
KLA
  • 31
  • 1
  • 8
7
votes
1 answer

Job 65 cancelled because SparkContext was shut down

I'm working on a shared Apache Zeppelin server. Almost every day, I try to run a command and get this error: "Job 65 cancelled because SparkContext was shut down". I would love to learn more about what causes the SparkContext to shut down. My…
Cauder
  • 2,157
  • 4
  • 30
  • 69
7
votes
1 answer

Read TSV file in pyspark

What is the best way to read a .tsv file with a header in PySpark and store it in a Spark DataFrame? I have tried the "spark.read.options" and "spark.read.csv" commands, but no luck. Thanks. Regards, Jit
Jitu
  • 91
  • 1
  • 1
  • 4
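A TSV is just a CSV with a tab separator, so the CSV reader handles it directly. A minimal sketch (the path is a placeholder):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("read-tsv").getOrCreate()

df = (spark.read
          .option("header", True)       # first line holds column names
          .option("sep", "\t")          # tab-separated values
          .option("inferSchema", True)
          .csv("/path/to/data.tsv"))    # placeholder path
df.printSchema()
```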
7
votes
2 answers

Unify schema across multiple rows of json strings in Spark Dataframe

I have a difficult issue regarding rows in a PySpark DataFrame which contain a series of JSON strings. The issue is that each row might have a different schema from the others, so when I want to transform said rows into a subscriptable…
Dammi
  • 1,268
  • 2
  • 13
  • 23
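One common sketch for this: let Spark infer a single schema across all rows by re-reading the JSON strings as their own dataset, then parse every row with that unified schema (df and its json_col column are assumptions):

```python
import pyspark.sql.functions as F

# Infer one schema that covers every row's JSON variant.
unified_schema = spark.read.json(
    df.rdd.map(lambda row: row["json_col"])).schema

# Parse each string with the unified schema; missing fields become null.
parsed = df.withColumn("parsed", F.from_json("json_col", unified_schema))
```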
7
votes
0 answers

Spark constant IndexOutOfBoundsException warnings

I am running a Spark job on a large EMR cluster (master.type=r5.4xlarge, core.count=150 and core.type=r5.4xlarge). Fortunately the job finishes, but it constantly throws these kinds of warnings: 20/04/30 14:30:58 INFO TaskSetManager: Finished task…
chemipot
  • 252
  • 2
  • 7
7
votes
3 answers

pyspark: arrays_zip equivalent in Spark 2.3

How to write the equivalent function of arrays_zip in Spark 2.3? Source code from Spark 2.4:
def arrays_zip(*cols):
    """
    Collection function: Returns a merged array of structs in which the N-th struct contains all N-th values of input…
bp2010
  • 2,342
  • 17
  • 34
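On 2.3, a Python UDF can stand in for the missing builtin. A hedged sketch, fixed here to two integer arrays (the real 2.4 arrays_zip is variadic and type-generic):

```python
import pyspark.sql.functions as F
from pyspark.sql.types import (ArrayType, IntegerType,
                               StructField, StructType)

zip_schema = ArrayType(StructType([
    StructField("a", IntegerType()),
    StructField("b", IntegerType()),
]))

@F.udf(returnType=zip_schema)
def arrays_zip_(xs, ys):
    # Each (x, y) tuple becomes one struct in the output array.
    return list(zip(xs, ys))

df = spark.createDataFrame([([1, 2, 3], [4, 5, 6])], ["xs", "ys"])
df.select(arrays_zip_("xs", "ys").alias("zipped")).show(truncate=False)
```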
7
votes
1 answer

Structured Streaming output is not showing on Jupyter Notebook

I have two notebooks. The first notebook reads tweets from Twitter using tweepy and writes them to a socket. The other notebook reads tweets from that socket using Spark Structured Streaming (Python) and writes its result to the console.…
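A plausible explanation: the console sink prints to the kernel/server's stdout, which a Jupyter cell never displays. One hedged workaround is the in-memory sink, which the notebook can query directly (tweets_df stands in for the streaming DataFrame):

```python
# Write the stream into an in-memory table instead of the console.
query = (tweets_df.writeStream
             .outputMode("append")
             .format("memory")
             .queryName("tweets")
             .start())

# Once a micro-batch has run, the results are visible from the notebook.
spark.sql("SELECT * FROM tweets").show()
```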
7
votes
1 answer

Connect to spark cluster from local jupyter notebook

I am trying to connect to a remote Spark master from a notebook on my local machine. When I try creating a SparkContext: sc = pyspark.SparkContext(master = "spark://remote-spark-master-hostname:7077", appName="jupyter…
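A minimal sketch of the session setup, assuming the usual pitfall that the workers must be able to route back to the local driver, so the driver host has to be an externally reachable address (the IP below is a placeholder):

```python
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .master("spark://remote-spark-master-hostname:7077")
         .appName("jupyter")
         # Assumption: an address the cluster can route back to.
         .config("spark.driver.host", "203.0.113.10")
         .getOrCreate())
```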