I have a dataset like below:
I am grouping by age and averaging the number of friends for each age:
from pyspark.sql import SparkSession
from pyspark.sql import Row
import pyspark.sql.functions as F
def parseInput(line):
    fields = line.split(',')
    …
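A minimal sketch of the aggregation being described, assuming the parsed rows end up in a DataFrame df with columns age and friends (both names are assumptions):

import pyspark.sql.functions as F

# Group by age and take the mean number of friends per age group
averages = df.groupBy("age").agg(F.avg("friends").alias("avg_friends"))
averages.orderBy("age").show()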
I have a Spark job that just pulls data from multiple tables with the same transforms. Basically, it's a for loop that iterates across a list of tables, queries the catalog table, adds a timestamp, then shoves the result into Redshift (example below).
This job…
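The example itself is truncated above, but here is a rough sketch of the kind of loop being described; the table list, schema names, JDBC URL, and credentials are all hypothetical:

import pyspark.sql.functions as F

tables = ["orders", "customers", "events"]  # hypothetical table list

for table in tables:
    # Read from the catalog, stamp the load time, and append to Redshift
    df = (spark.table(f"my_catalog_db.{table}")
          .withColumn("load_ts", F.current_timestamp()))
    (df.write
       .format("jdbc")
       .option("url", "jdbc:redshift://example-host:5439/dev")
       .option("dbtable", f"staging.{table}")
       .option("user", "example_user")
       .option("password", "example_password")
       .mode("append")
       .save())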
I am trying to load data from Delta into a pyspark dataframe.
path_to_data = 's3://mybucket/daily_data/'
df = spark.read.format("delta").load(path_to_data)
Now the underlying data is partitioned by date as
s3://mybucket/daily_data/
…
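Assuming the layout is Hive-style (e.g. date=2021-01-01/ subdirectories), one sketch of loading only some partitions is to filter on the partition column so Delta can prune at read time; the column name date is an assumption:

import pyspark.sql.functions as F

path_to_data = 's3://mybucket/daily_data/'

# The filter on the partition column lets Delta skip unneeded partitions
df = (spark.read.format("delta")
      .load(path_to_data)
      .where(F.col("date") >= "2021-01-01"))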
I have a Spark DataFrame which looks like this, where expr is a SQL/Hive filter expression.
+--------+----+----+
|expr    |var1|var2|
+--------+----+----+
|var1 > 7|9…
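One common workaround, sketched under the assumption that the set of distinct expressions is small enough to collect to the driver: evaluate each expression with F.expr, applying it only to the rows that carry it.

import pyspark.sql.functions as F

# Collect the distinct filter expressions (assumed to be few)
exprs = [r["expr"] for r in df.select("expr").distinct().collect()]

# Evaluate each expression only on the rows whose expr column matches it
result = df.withColumn("matches", F.lit(None).cast("boolean"))
for e in exprs:
    result = result.withColumn(
        "matches",
        F.when(F.col("expr") == e, F.expr(e)).otherwise(F.col("matches")),
    )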
I have spent days now trying to figure out a dependency issue I'm experiencing with (Py)Spark running on Kubernetes. I'm using the spark-on-k8s-operator and Spark's Google Cloud connector.
When I try to submit my Spark job without a dependency using…
I am trying to remove all special characters from all the columns. I am using the following commands:
import pyspark.sql.functions as F
df_spark = spark_df.select([F.col(col).alias(col.replace(' ', '_')) for col in spark_df.columns])
df_spark1 =…
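The line above only renames columns; for stripping special characters from the values themselves, a hedged sketch using regexp_replace over every column (the character class to keep is a guess, and the cast covers non-string columns):

import pyspark.sql.functions as F

# Remove anything that is not a letter, digit, space, or underscore
# from every column's values
df_clean = spark_df.select(
    [F.regexp_replace(F.col(c).cast("string"), r"[^a-zA-Z0-9 _]", "").alias(c)
     for c in spark_df.columns]
)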
I have a class that takes a Spark DataFrame and does some processing to it. Here is the code:
for column in self.sdf.columns:
    if column not in self.__columns:
        row = [column]
        row += ['--'] * 9  # a list of placeholders; '--' * 9 would extend the list character by character
        …
I am running a few operations to aggregate a large quantity of data (about 600 GB) on Azure Databricks. I noticed recently that the notebook crashes and Databricks returns the error below. The same code worked before on a smaller 6-node cluster.…
I'm working on a shared Apache Zeppelin server. Almost every day, I try to run a command and get this error: Job 65 cancelled because SparkContext was shut down
I would love to learn more about what causes the SparkContext to shut down. My…
What is the best way to read a .tsv file with a header in PySpark and store it in a Spark DataFrame?
I have tried the "spark.read.options" and "spark.read.csv" commands, but with no luck.
Thanks.
Regards,
Jit
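For what it's worth, a minimal sketch using the DataFrame CSV reader with a tab separator; the file path is hypothetical:

# A .tsv is just a CSV with a tab delimiter and a header row
df = (spark.read
      .option("header", "true")
      .option("sep", "\t")
      .option("inferSchema", "true")
      .csv("s3://mybucket/data.tsv"))

df.printSchema()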
I have a difficult issue regarding rows in a PySpark DataFrame that contains a series of JSON strings.
The issue is that each row might have a different schema from the others, so when I want to transform said rows into a subscriptable…
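One common approach, sketched under the assumption that the column holding the JSON is named json_str: let Spark infer one merged schema across all rows, then parse every row with it (fields missing in a given row come back as null).

import pyspark.sql.functions as F

# Infer a single schema that covers every row's JSON
merged_schema = spark.read.json(df.rdd.map(lambda r: r.json_str)).schema

parsed = df.withColumn("parsed", F.from_json(F.col("json_str"), merged_schema))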
I am running a Spark job on a large EMR cluster (master.type=r5.4xlarge, core.count=150, core.type=r5.4xlarge). Fortunately the job finishes, but it constantly throws these kinds of warnings:
20/04/30 14:30:58 INFO TaskSetManager: Finished task…
How would one write the equivalent of the arrays_zip function in Spark 2.3?
Source code from Spark 2.4:
def arrays_zip(*cols):
    """
    Collection function: Returns a merged array of structs in which the N-th struct contains all
    N-th values of input…
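A hedged Spark 2.3 stand-in: a plain Python UDF that zips the arrays element-wise. It is slower than the 2.4 built-in, the element types here (long) are an assumption, and unlike arrays_zip it truncates to the shorter array rather than padding with nulls:

import pyspark.sql.functions as F
from pyspark.sql.types import ArrayType, LongType, StructField, StructType

# Zip two long-array columns into an array of structs
zip_udf = F.udf(
    lambda xs, ys: list(zip(xs, ys)) if xs is not None and ys is not None else None,
    ArrayType(StructType([
        StructField("x", LongType()),
        StructField("y", LongType()),
    ])),
)

df = spark.createDataFrame([([1, 2], [3, 4])], ["a", "b"])
df.withColumn("zipped", zip_udf("a", "b")).show(truncate=False)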
I have two notebooks. The first notebook reads tweets from Twitter using tweepy and writes them to a socket. The other notebook reads tweets from that socket using Spark Structured Streaming (Python) and writes its results to the console.…
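A minimal sketch of the reading side, assuming the writer notebook listens on localhost port 5555 (both are assumptions):

# Read raw tweet text from the socket as a streaming DataFrame
tweets = (spark.readStream
          .format("socket")
          .option("host", "127.0.0.1")
          .option("port", 5555)
          .load())

# Echo the stream to the console
query = tweets.writeStream.outputMode("append").format("console").start()
query.awaitTermination()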
I am trying to connect to a remote Spark master from a notebook on my local machine.
When I try creating a SparkContext:
sc = pyspark.SparkContext(master="spark://remote-spark-master-hostname:7077",
                          appName="jupyter…