Questions tagged [pyspark]

The Spark Python API (PySpark) exposes the Apache Spark programming model to Python.


39058 questions
7
votes
2 answers

PySpark DataFrame Floor division unsupported operand type(s)

I have a dataset like below. I am grouping by age and averaging the number of friends for each age. from pyspark.sql import SparkSession from pyspark.sql import Row import pyspark.sql.functions as F def parseInput(line): fields = line.split(',') …
Chelseajcole
  • 487
  • 1
  • 9
  • 16
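A minimal sketch of the usual fix, assuming the error comes from applying Python's // operator to Column objects, which PySpark columns do not support; flooring an ordinary division avoids it (the data below is made up):

```python
from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.appName("friends-by-age").getOrCreate()

# Made-up (age, friends) rows standing in for the parsed dataset.
df = spark.createDataFrame(
    [(33, 385), (33, 2), (55, 221), (40, 465)], ["age", "friends"])

# Column objects do not implement //, which raises
# "unsupported operand type(s)". Floor a regular division instead.
df.groupBy("age").agg(F.floor(F.avg("friends")).alias("avg_friends")).show()
```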
7
votes
1 answer

How to run parallel threads in AWS Glue PySpark?

I have a spark job that will just pull data from multiple tables with the same transforms. Basically a for loop that iterates across a list of tables, queries the catalog table, adds a timestamp, then shoves into Redshift (example below). This job…
sewardth
  • 347
  • 2
  • 13
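One hedged sketch of running the per-table work concurrently: Spark's driver accepts job submissions from multiple Python threads, so a ThreadPoolExecutor can overlap the independent reads and writes. The table list, the process_table body, and the existing spark session are all assumptions:

```python
from concurrent.futures import ThreadPoolExecutor, as_completed
import pyspark.sql.functions as F

tables = ["table_a", "table_b", "table_c"]   # hypothetical catalog tables

def process_table(name):
    # Same transform for every table: read, stamp, write.
    df = spark.table(name).withColumn("load_ts", F.current_timestamp())
    # ... write df to Redshift here (e.g. via Glue's JDBC connection) ...
    return name

with ThreadPoolExecutor(max_workers=4) as pool:
    futures = [pool.submit(process_table, t) for t in tables]
    for fut in as_completed(futures):
        print("finished", fut.result())
```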
7
votes
2 answers

PySpark: Optimize read/load from Delta using selected columns or partitions

I am trying to load data from Delta into a pyspark dataframe. path_to_data = 's3://mybucket/daily_data/' df = spark.read.format("delta").load(path_to_data) Now the underlying data is partitioned by date as s3://mybucket/daily_data/ …
Spandan Brahmbhatt
  • 3,774
  • 6
  • 24
  • 36
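A sketch of the usual optimization, assuming the table is partitioned by a date column matching the S3 layout: filtering on the partition column prunes partitions at load time, and selecting only the needed columns limits how much of each underlying Parquet file is read (the column names here are invented):

```python
import pyspark.sql.functions as F

path_to_data = "s3://mybucket/daily_data/"

df = (spark.read.format("delta")
          .load(path_to_data)
          .where(F.col("date") >= "2021-01-01")      # partition pruning
          .select("date", "user_id", "metric"))      # column pruning
```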
7
votes
2 answers

PySpark - pass a value from another column as the parameter of spark function

I have a spark dataframe which looks like this, where expr is a SQL/Hive filter expression:
+---------+-----+-----+
|expr     |var1 |var2 |
+---------+-----+-----+
|var1 > 7 |9…
UtkarshSahu
  • 93
  • 2
  • 10
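F.expr() only accepts a literal SQL string, not a Column, so the expression cannot be passed per row directly. One hedged workaround, reasonable when the number of distinct expressions is small: collect them and apply each one only to the rows that carry it (df and its columns are assumed from the question):

```python
import pyspark.sql.functions as F

# Distinct expression strings present in the `expr` column.
exprs = [r["expr"] for r in df.select("expr").distinct().collect()]

result = df.withColumn("expr_result", F.lit(None).cast("boolean"))
for e in exprs:
    result = result.withColumn(
        "expr_result",
        F.when(F.col("expr") == e, F.expr(e))
         .otherwise(F.col("expr_result")))
```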
7
votes
3 answers

Dependency issue with Pyspark running on Kubernetes using spark-on-k8s-operator

I have spent days now trying to figure out a dependency issue I'm experiencing with (Py)Spark running on Kubernetes. I'm using the spark-on-k8s-operator and Spark's Google Cloud connector. When I try to submit my spark job without a dependency using…
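Without the full manifest it is hard to be definitive, but a common pattern is to declare dependencies through Spark configuration rather than baking them into the image, so the operator-launched driver and executors fetch them at startup. A sketch; the package coordinate and zip path are placeholders:

```python
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("k8s-deps-sketch")
         # JVM dependencies resolved at startup; coordinate is a placeholder.
         .config("spark.jars.packages",
                 "com.google.cloud.bigdataoss:gcs-connector:hadoop3-2.2.0")
         .getOrCreate())

# Pure-Python helpers can also be shipped at runtime (hypothetical path).
spark.sparkContext.addPyFile("gs://my-bucket/deps.zip")
```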
7
votes
4 answers

PySpark remove special characters in all column names for all special characters

I am trying to remove all special characters from all the columns. I am using the following commands: import pyspark.sql.functions as F df_spark = spark_df.select([F.col(col).alias(col.replace(' ', '_')) for col in df.columns]) df_spark1 =…
user13766314
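A compact sketch of one way to do this, assuming "special characters" means anything outside letters, digits, and underscores (spark_df stands in for the asker's DataFrame):

```python
import re

# toDF(*names) renames every column in one pass; the regex replaces any
# character that is not a letter, digit, or underscore with "_".
clean = spark_df.toDF(
    *[re.sub(r"[^0-9a-zA-Z_]", "_", c) for c in spark_df.columns])
```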
7
votes
1 answer

mypy type checking shows error when a variable gets dynamically allocated

I have a class that takes a Spark DataFrame and does some processing on it. Here is the code:
for column in self.sdf.columns:
    if column not in self.__columns:
        row = [column]
        row += '--' * 9
…
ahrooran
  • 931
  • 1
  • 10
  • 25
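The likely culprit in the excerpt is row += '--' * 9, which extends the list with the 18 individual characters of one long string, and which mypy rejects as list += str. A minimal sketch of the probable fix (the column list is hypothetical):

```python
from typing import List

columns = ["age", "name"]          # hypothetical column list
for column in columns:
    row: List[str] = [column]      # explicit annotation keeps mypy precise
    row += ["--"] * 9              # nine "--" strings, not 18 characters
    print(row)
```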
7
votes
2 answers

Databricks notebook crashes on memory-heavy job

I am running a few operations to aggregate a large quantity of data (about 600 GB) on Azure Databricks. I noticed recently that the notebook crashes and Databricks returns the error below. The same code worked before on a smaller 6-node cluster.…
KLA
  • 31
  • 1
  • 8
7
votes
1 answer

Job 65 cancelled because SparkContext was shut down

I'm working on a shared Apache Zeppelin server. Almost every day, I try to run a command and get this error: "Job 65 cancelled because SparkContext was shut down". I would love to learn more about what causes the SparkContext to shut down. My…
Cauder
  • 2,157
  • 4
  • 30
  • 69
7
votes
1 answer

Read TSV file in pyspark

What is the best way to read a .tsv file with a header in PySpark and store it in a Spark DataFrame? I have tried the "spark.read.options" and "spark.read.csv" commands, but no luck. Thanks. Regards, Jit
Jitu
  • 91
  • 1
  • 1
  • 4
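A TSV is just a CSV with a tab separator, so the CSV reader handles it directly. A minimal sketch (the path is a placeholder):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("read-tsv").getOrCreate()

df = (spark.read
          .option("header", True)       # first line holds column names
          .option("sep", "\t")          # tab-separated values
          .option("inferSchema", True)
          .csv("/path/to/data.tsv"))    # placeholder path
df.printSchema()
```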
7
votes
2 answers

Unify schema across multiple rows of json strings in Spark Dataframe

I have a difficult issue regarding rows in a PySpark DataFrame which contain a series of JSON strings. The issue is that each row might have a different schema from the others, so when I want to transform said rows into a subscriptable…
Dammi
  • 1,268
  • 2
  • 13
  • 23
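One common sketch for this: let Spark infer a single schema across all rows by re-reading the JSON strings as their own dataset, then parse every row with that unified schema (df and its json_col column are assumptions):

```python
import pyspark.sql.functions as F

# Infer one schema that covers every row's JSON variant.
unified_schema = spark.read.json(
    df.rdd.map(lambda row: row["json_col"])).schema

# Parse each string with the unified schema; missing fields become null.
parsed = df.withColumn("parsed", F.from_json("json_col", unified_schema))
```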
7
votes
0 answers

Spark constant IndexOutOfBoundsException warnings

I am running a Spark job on a large EMR cluster (master.type=r5.4xlarge, core.count=150 and core.type=r5.4xlarge). Fortunately the job finishes, but it constantly throws these kinds of warnings: 20/04/30 14:30:58 INFO TaskSetManager: Finished task…
chemipot
  • 252
  • 2
  • 7
7
votes
3 answers

pyspark: arrays_zip equivalent in Spark 2.3

How to write the equivalent function of arrays_zip in Spark 2.3? Source code from Spark 2.4:
def arrays_zip(*cols):
    """
    Collection function: Returns a merged array of structs in which the N-th struct contains all N-th values of input…
bp2010
  • 2,342
  • 17
  • 34
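On 2.3, a Python UDF can stand in for the missing builtin. A hedged sketch, fixed here to two integer arrays (the real 2.4 arrays_zip is variadic and type-generic):

```python
import pyspark.sql.functions as F
from pyspark.sql.types import (ArrayType, IntegerType,
                               StructField, StructType)

zip_schema = ArrayType(StructType([
    StructField("a", IntegerType()),
    StructField("b", IntegerType()),
]))

@F.udf(returnType=zip_schema)
def arrays_zip_(xs, ys):
    # Each (x, y) tuple becomes one struct in the output array.
    return list(zip(xs, ys))

df = spark.createDataFrame([([1, 2, 3], [4, 5, 6])], ["xs", "ys"])
df.select(arrays_zip_("xs", "ys").alias("zipped")).show(truncate=False)
```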
7
votes
1 answer

Structured Streaming output is not showing on Jupyter Notebook

I have two notebooks. The first notebook reads tweets from Twitter using tweepy and writes them to a socket. The other notebook reads tweets from that socket using Spark Structured Streaming (Python) and writes its result to the console.…
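A plausible explanation: the console sink prints to the kernel/server's stdout, which a Jupyter cell never displays. One hedged workaround is the in-memory sink, which the notebook can query directly (tweets_df stands in for the streaming DataFrame):

```python
# Write the stream into an in-memory table instead of the console.
query = (tweets_df.writeStream
             .outputMode("append")
             .format("memory")
             .queryName("tweets")
             .start())

# Once a micro-batch has run, the results are visible from the notebook.
spark.sql("SELECT * FROM tweets").show()
```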
7
votes
1 answer

Connect to spark cluster from local jupyter notebook

I am trying to connect to a remote Spark master from a notebook on my local machine. When I try creating a SparkContext: sc = pyspark.SparkContext(master = "spark://remote-spark-master-hostname:7077", appName="jupyter…
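A minimal sketch of the session setup, assuming the usual pitfall that the workers must be able to route back to the local driver, so the driver host has to be an externally reachable address (the IP below is a placeholder):

```python
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .master("spark://remote-spark-master-hostname:7077")
         .appName("jupyter")
         # Assumption: an address the cluster can route back to.
         .config("spark.driver.host", "203.0.113.10")
         .getOrCreate())
```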