I have a Python project whose folder has the structure

main_directory
├── lib
│   └── lib.py
└── run
    └── script.py
script.py is
from pyspark.sql import SparkSession
from lib.lib import add_two

spark = SparkSession \
    .builder \
    .master('yarn') \
    .appName('script') \
    …
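If the problem is that `from lib.lib import add_two` fails once the job runs on YARN, one common fix is to ship the lib package to the executors. A minimal sketch, assuming the script is launched from main_directory and lib/ contains an __init__.py:

import shutil
from pyspark.sql import SparkSession

spark = SparkSession.builder.master('yarn').appName('script').getOrCreate()

# package lib/ into lib.zip and ship it so `from lib.lib import add_two`
# also resolves on the executors
shutil.make_archive('lib', 'zip', '.', 'lib')
spark.sparkContext.addPyFile('lib.zip')

from lib.lib import add_two

Equivalently, pass the archive at submit time with spark-submit --py-files lib.zip run/script.py.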
Are there any recommended methods for implementing custom sort ordering for categorical data in pyspark? I'm ideally looking for the functionality the pandas categorical data type offers.
So, given a dataset with a Speed column, the possible…
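There is no ordered categorical dtype in Spark, but one can emulate it by sorting on the position of each value in an explicit category list. A sketch using the array_position SQL function (Spark 2.4+); the Speed values below are made up:

from pyspark.sql import functions as F

speed_order = ['Super Slow', 'Slow', 'Medium', 'Fast', 'Super Fast']  # hypothetical order

order_sql = "array_position(array({}), Speed)".format(
    ', '.join("'{}'".format(s) for s in speed_order))

df_sorted = df.orderBy(F.expr(order_sql))  # unknown values get position 0, i.e. sort first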
I am able to execute a simple SQL statement using PySpark in Azure Databricks but I want to execute a stored procedure instead. Below is the PySpark code I tried.
#initialize pyspark
import…
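Spark's JDBC data source only runs queries, so it cannot invoke a stored procedure directly. A common workaround is to open a plain JDBC connection through the JVM via py4j. A sketch; the URL, credentials, and procedure name are placeholders, and it assumes the JDBC driver is available on the cluster (Databricks ships the SQL Server driver):

# open a raw JDBC connection through the JVM and call the procedure
jdbc_url = 'jdbc:sqlserver://myserver.database.windows.net:1433;database=mydb'

driver_manager = spark.sparkContext._gateway.jvm.java.sql.DriverManager
conn = driver_manager.getConnection(jdbc_url, 'user', 'password')
try:
    stmt = conn.prepareCall('{call dbo.my_stored_procedure}')
    stmt.execute()
finally:
    conn.close()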
I am trying to read a stream from Kafka using PySpark. I am using Spark version 3.0.0-preview2 and spark-streaming-kafka-0-10_2.12.
Before this I start ZooKeeper and Kafka and create a new topic:
/usr/local/kafka/bin/zookeeper-server-start.sh…
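Note that Structured Streaming uses the spark-sql-kafka-0-10 package, not spark-streaming-kafka-0-10 (the latter is for the old DStream API). A minimal read sketch, assuming a local broker and a hypothetical topic named test:

# spark-submit --packages org.apache.spark:spark-sql-kafka-0-10_2.12:3.0.0-preview2 app.py
df = (spark.readStream
      .format('kafka')
      .option('kafka.bootstrap.servers', 'localhost:9092')
      .option('subscribe', 'test')
      .option('startingOffsets', 'earliest')
      .load())

# key and value arrive as binary; cast before use
query = (df.selectExpr('CAST(key AS STRING)', 'CAST(value AS STRING)')
         .writeStream
         .format('console')
         .start())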
I am trying to make sure that a particular column in a dataframe does not contain any illegal values (non-numerical data). For this purpose I am trying to use regex matching with rlike to collect illegal values in the data:
I need to collect…
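One thing to watch: rlike matches if the pattern is found anywhere in the string, so the regex needs ^ and $ anchors to assert that the whole value is numeric. A sketch, with the column name as a placeholder:

from pyspark.sql import functions as F

# rows whose 'value' column contains anything other than digits
illegal = df.filter(~F.col('value').rlike(r'^[0-9]+$')).collect()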
I would like to perform an operation similar to pandas.io.json.json_normalize on a pyspark dataframe. Is there an equivalent function in spark?
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.io.json.json_normalize.html
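There is no direct json_normalize in Spark, but the usual equivalent is selecting nested struct fields (and exploding arrays of records first). A sketch, assuming a struct column address with fields city and zip:

from pyspark.sql import functions as F

flat = df.select(
    'id',
    F.col('address.city').alias('address_city'),
    F.col('address.zip').alias('address_zip'),
)

# for arrays of records, explode first, then flatten:
# df.select(F.explode('items').alias('item')).select('item.*')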
I want to check if the column values are within some boundaries. If they are not I will append some value to the array column "F". This is the code I have so far:
df = spark.createDataFrame(
    [
        (1, 56),
        (2, 32),
        (3, 99)
        …
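One way to conditionally append to an array column is when plus concat (concat supports arrays since Spark 2.4). A sketch; the bounds, the value column name, and the appended flag are assumptions:

from pyspark.sql import functions as F

df2 = df.withColumn(
    'F',
    F.when(
        (F.col('value') < 0) | (F.col('value') > 50),          # out of bounds
        F.concat(F.col('F'), F.array(F.lit('out_of_range')))   # append a flag
    ).otherwise(F.col('F'))
)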
On an AWS EMR cluster, I'm trying to write a query result to parquet using Pyspark but face the following error:
Caused by: java.lang.RuntimeException: Parquet record is malformed: empty fields are illegal, the field should be ommited completely…
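This error comes from Hive's parquet writer, which refuses empty arrays and maps. A common workaround is to null out empty collections before writing; a sketch with a hypothetical array column tags:

from pyspark.sql import functions as F

# empty arrays become NULL, which the writer accepts; non-empty ones pass through
clean = df.withColumn('tags', F.when(F.size('tags') > 0, F.col('tags')))
clean.write.parquet('s3://bucket/path/')  # placeholder output path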
I'm currently working on a project and I am having a hard time understanding how Pandas UDFs in PySpark work.
I have a Spark Cluster with one Master node with 8 cores and 64GB, along with two workers of 16 cores each and 112GB. My dataset…
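The key point is that a scalar pandas UDF is not called row by row: Spark slices each partition into Arrow batches and hands the function whole pandas Series, so the cores and memory above determine how many batches run in parallel. A minimal sketch in the Spark 3.x type-hint style:

import pandas as pd
from pyspark.sql.functions import pandas_udf
from pyspark.sql.types import DoubleType

@pandas_udf(DoubleType())
def plus_one(v: pd.Series) -> pd.Series:
    # v is one Arrow batch of the column as a pandas Series
    return v + 1

df.withColumn('x_plus_one', plus_one('x'))  # 'x' is a hypothetical column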
I'm facing a weird issue that I cannot understand.
I have source data with a column "Impressions" that is sometimes a bigint / sometimes a string (when I manually explore the data).
The Hive schema registered for this column is Long.
Thus, when…
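A common way to cope with a column that mixes bigint and string data is to read it as string and cast explicitly, which turns unparseable values into NULL instead of failing the job. A sketch:

from pyspark.sql import functions as F

df = df.withColumn('Impressions_long', F.col('Impressions').cast('long'))
# rows where the cast failed, i.e. the raw value was not a valid long
bad = df.filter(F.col('Impressions_long').isNull() & F.col('Impressions').isNotNull())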
I am getting an error with pandas_udf in the following code. The code creates a column whose data type is based on another column. The same code works fine with the normal, slower udf (commented out).
Basically anything more sophisticated than…
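Without the full traceback it is hard to be specific, but a frequent cause is the returned Series not matching the declared returnType. A sketch of a conditional column with a pandas UDF; the column name and threshold are made up:

import pandas as pd
from pyspark.sql.functions import pandas_udf
from pyspark.sql.types import StringType

@pandas_udf(StringType())
def size_label(v: pd.Series) -> pd.Series:
    # the returned Series must be convertible to the declared StringType
    return pd.Series(['big' if x > 50 else 'small' for x in v])

df.withColumn('label', size_label('value'))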
I have a dataframe with some columns:
+------------+--------+----------+----------+
|country_name| ID_user|birth_date| psdt|
+------------+--------+----------+----------+
| Россия|16460783| 486|1970-01-01|
| Россия|16467391| …
I'm trying to refactor a trained Spark tree-based model (RandomForest or GBT classifier) so that it can be exported to environments without Spark. The toDebugString method is a good starting point. However, in the case of…
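For reference, both the full text dump and the individual trees are reachable from the fitted model; a sketch:

from pyspark.ml.classification import RandomForestClassifier

model = RandomForestClassifier(labelCol='label', featuresCol='features').fit(train_df)

print(model.toDebugString)  # if/else text representation of every tree
print(model.trees)          # list of DecisionTreeClassificationModel objects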
I am working on a PySpark dataframe that looks like the one below:
+---+--------+
| id|category|
+---+--------+
|  1|       A|
|  1|       A|
|  1|       B|
|  2|       B|
|  2|       A|
|  3|       B|
|  3|       B|
|  3|       B|
+---+--------+
I want to unstack the category column and count their occurrences. So, the result I want is shown…
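The usual way to unstack a column like this is groupBy plus pivot plus count; a sketch (output shown for illustration):

counts = df.groupBy('id').pivot('category').count().na.fill(0)
counts.show()
# +---+---+---+
# | id|  A|  B|
# +---+---+---+
# |  1|  2|  1|
# |  2|  1|  1|
# |  3|  0|  3|
# +---+---+---+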
I want to make all values in an array column in my pyspark data frame negative without exploding (!). I tried this udf but it didn't work:
negative = func.udf(lambda x: x * -1, T.ArrayType(T.FloatType()))
cast_contracts = cast_contracts \
…
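The udf returns empty arrays because multiplying a Python list by -1 yields [], not element-wise negation. Higher-order functions avoid both the udf and the explode; a sketch assuming the array column is called contracts (transform is available as a SQL function from Spark 2.4):

from pyspark.sql import functions as F

cast_contracts = cast_contracts.withColumn(
    'contracts',
    F.expr('transform(contracts, x -> -x)')  # negate each element in place
)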