Questions tagged [pyspark]

The Spark Python API (PySpark) exposes the Apache Spark programming model to Python.

39058 questions
7
votes
3 answers

Submit a Python project to Dataproc job

I have a Python project whose folder has the structure main_directory - lib - lib.py - run - script.py. script.py is: from lib.lib import add_two spark = SparkSession \ .builder \ .master('yarn') \ .appName('script') \ …
Galuoises
  • 2,630
  • 24
  • 30
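
A common way to make a package like lib importable on Dataproc is to zip it and ship it with the job. A minimal sketch, assuming hypothetical cluster, bucket, and path names:

```python
# Sketch: package lib/ and submit it alongside the entry script, so that
# `from lib.lib import add_two` resolves on the driver and the executors.
# Shell side (cluster/bucket names are hypothetical):
#   zip -r lib.zip lib/
#   gcloud dataproc jobs submit pyspark run/script.py \
#       --cluster=my-cluster --py-files=lib.zip
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .master("yarn")
         .appName("script")
         .getOrCreate())

# Alternative: distribute the archive at runtime (gs:// path is hypothetical).
spark.sparkContext.addPyFile("gs://my-bucket/lib.zip")
from lib.lib import add_two  # import after the archive has been shipped
```
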
7
votes
1 answer

Custom sorting in pyspark dataframes

Are there any recommended methods for implementing custom sort ordering for categorical data in pyspark? I'm ideally looking for the functionality the pandas categorical data type offers. So, given a dataset with a Speed column, the possible…
Daveed
  • 149
  • 2
  • 8
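
PySpark has no categorical dtype, but one workable pattern is to map each category to an explicit rank and sort on that rank. A minimal sketch, where the Speed categories and their ordering are assumptions:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([("Fast",), ("Slow",), ("Medium",)], ["Speed"])

speed_order = ["Slow", "Medium", "Fast"]  # assumed custom ordering

# Build a rank column: when(Speed == cat, i) chained over the categories.
rank = F.when(F.col("Speed") == speed_order[0], 0)
for i, cat in enumerate(speed_order[1:], start=1):
    rank = rank.when(F.col("Speed") == cat, i)

# Unknown categories sort last.
df.orderBy(rank.otherwise(len(speed_order))).show()
```
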
7
votes
2 answers

How to execute a stored procedure in Azure Databricks PySpark?

I am able to execute a simple SQL statement using PySpark in Azure Databricks but I want to execute a stored procedure instead. Below is the PySpark code I tried. #initialize pyspark import…
Ajay
  • 247
  • 1
  • 5
  • 15
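
Spark's JDBC reader wraps every query in a subselect, so it cannot run EXEC directly. A commonly cited workaround is to open a plain JDBC connection through the driver JVM; a sketch, with a hypothetical server, database, credentials, and procedure name:

```python
# On Databricks, `spark` is predefined. All connection details below are
# placeholders.
jdbc_url = ("jdbc:sqlserver://myserver.database.windows.net:1433;"
            "database=mydb;user=myuser;password=mypassword")

driver_manager = spark.sparkContext._gateway.jvm.java.sql.DriverManager
conn = driver_manager.getConnection(jdbc_url)
try:
    stmt = conn.createStatement()
    stmt.execute("EXEC dbo.my_stored_procedure")  # hypothetical procedure
finally:
    conn.close()
```
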
7
votes
3 answers

pyspark.sql.utils.AnalysisException: Failed to find data source: kafka

I am trying to read a stream from kafka using pyspark. I am using spark version 3.0.0-preview2 and spark-streaming-kafka-0-10_2.12. Before this I just start zookeeper, kafka and create a new topic: /usr/local/kafka/bin/zookeeper-server-start.sh…
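
This AnalysisException usually means the Structured Streaming Kafka connector is missing from the classpath: spark-streaming-kafka-0-10 is the DStreams artifact, while readStream needs spark-sql-kafka-0-10. A sketch of one fix, matching the Spark/Scala versions from the question (the topic name is hypothetical):

```python
# Either pass the package on the command line:
#   spark-submit --packages \
#     org.apache.spark:spark-sql-kafka-0-10_2.12:3.0.0-preview2 app.py
# or configure it before the session is created:
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("kafka-stream")
         .config("spark.jars.packages",
                 "org.apache.spark:spark-sql-kafka-0-10_2.12:3.0.0-preview2")
         .getOrCreate())

stream = (spark.readStream
          .format("kafka")
          .option("kafka.bootstrap.servers", "localhost:9092")
          .option("subscribe", "my-topic")  # hypothetical topic name
          .load())
```
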
7
votes
1 answer

Spark SQL - Regex for matching only numbers

I am trying to make sure that a particular column in a dataframe does not contain any illegal values (non-numerical data). For this purpose I am trying to use regex matching with rlike to collect illegal values in the data: I need to collect…
Hemanth
  • 705
  • 2
  • 16
  • 32
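
One point that trips people up here: rlike performs an unanchored search, so the pattern must be anchored to reject partial matches. A minimal sketch, with a hypothetical value column:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([("123",), ("12a",), ("4.5",)], ["value"])

# Anchored pattern: optional sign, digits, optional decimal part.
illegal = df.filter(~F.col("value").rlike(r"^-?[0-9]+(\.[0-9]+)?$"))
illegal.show()  # rows containing non-numerical data
```
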
7
votes
2 answers

Is there a function in pyspark dataframe that is similar to pandas.io.json.json_normalize

I would like to perform an operation similar to pandas.io.json.json_normalize in a pyspark dataframe. Is there an equivalent function in spark? https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.io.json.json_normalize.html
Kevin Tianyu Xu
  • 646
  • 2
  • 8
  • 15
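
Spark has no direct json_normalize, but the usual equivalent is schema inference plus star-expansion of structs and explode for arrays. A minimal sketch over hypothetical nested data:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.read.json(spark.sparkContext.parallelize(
    ['{"id": 1, "info": {"name": "a", "tags": ["x", "y"]}}']
))

flat = (df
        .select("id", "info.*")                # flatten the nested struct
        .withColumn("tag", F.explode("tags"))  # one row per array element
        .drop("tags"))
flat.show()
```
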
7
votes
2 answers

Append to PySpark array column

I want to check if the column values are within some boundaries. If they are not, I will append some value to the array column "F". This is the code I have so far: df = spark.createDataFrame( [ (1, 56), (2, 32), (3, 99) …
Lossa
  • 341
  • 2
  • 3
  • 9
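
On Spark 2.4+ this can be done without a udf by concatenating a one-element array under a condition. A minimal sketch, where the bounds and the appended value are assumptions:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
    [(1, 56, ["a"]), (2, 32, ["b"]), (3, 99, ["c"])],
    ["id", "value", "F"],
)

out_of_bounds = (F.col("value") < 40) | (F.col("value") > 90)  # assumed bounds
df = df.withColumn(
    "F",
    F.when(out_of_bounds, F.concat(F.col("F"), F.array(F.lit("flag"))))
     .otherwise(F.col("F")),
)
df.show()
```
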
7
votes
3 answers

"Parquet record is malformed" while column count is not 0

On an AWS EMR cluster, I'm trying to write a query result to parquet using Pyspark but face the following error: Caused by: java.lang.RuntimeException: Parquet record is malformed: empty fields are illegal, the field should be ommited completely…
Shinagan
  • 435
  • 3
  • 15
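
The Hive Parquet writer rejects empty (as opposed to null) maps and arrays, which raises this error even when the column count looks fine. One commonly suggested workaround is to null out empty collections before writing; a sketch with a hypothetical tags column and output path:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(1, ["a"]), (2, [])], ["id", "tags"])

# when() without otherwise() yields NULL, turning empty arrays into nulls.
df = df.withColumn("tags", F.when(F.size("tags") > 0, F.col("tags")))
df.write.mode("overwrite").parquet("/tmp/out")  # hypothetical output path
```
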
7
votes
2 answers

Problem running a Pandas UDF on a large dataset

I'm currently working on a project and I am having a hard time understanding how the Pandas UDF in PySpark works. I have a Spark cluster with one master node with 8 cores and 64GB, along with two workers of 16 cores each and 112GB. My dataset…
naifmeh
  • 408
  • 5
  • 15
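
For context, a scalar pandas UDF receives each partition as a series of Arrow batches and must return a batch of the same length; on large datasets the batch size is often the knob to tune. A minimal sketch (Spark 3 type-hint style; the batch size value is an assumption):

```python
import pandas as pd
from pyspark.sql import SparkSession
from pyspark.sql.functions import pandas_udf
from pyspark.sql.types import DoubleType

spark = (SparkSession.builder
         .config("spark.sql.execution.arrow.maxRecordsPerBatch", "10000")
         .getOrCreate())

@pandas_udf(DoubleType())
def times_two(v: pd.Series) -> pd.Series:
    return v * 2.0  # output length must equal the input batch length

df = spark.range(1_000_000).withColumn("doubled", times_two("id"))
df.show(3)
```
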
7
votes
2 answers

spark 2.4 Parquet column cannot be converted in file, Column: [Impressions], Expected: bigint, Found: BINARY

I'm facing a weird issue that I cannot understand. I have source data with a column "Impressions" that is sometimes a bigint / sometimes a string (when I manually explore the data). The Hive schema registered for this column is Long. Thus, when…
Jay Cee
  • 1,855
  • 5
  • 28
  • 48
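
This error typically means some Parquet files on disk were written with Impressions as a string (BINARY) while the Hive metastore declares bigint. One workaround sketch: read the files with their own schema rather than the metastore's and cast explicitly (the path is hypothetical; files whose footers disagree may need to be read in separate batches):

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

df = spark.read.parquet("/data/impressions/")  # bypasses the Hive schema
df = df.withColumn("Impressions", F.col("Impressions").cast("bigint"))
```
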
7
votes
1 answer

pandas_udf error RuntimeError: Result vector from pandas_udf was not the required length: expected 12, got 35

I am getting an error with pandas_udf with the following code. The code is to create a column with a data type based on another column. The same code works fine for the normal, slower udf (commented out). Basically anything more sophisticated than…
Kevin Tianyu Xu
  • 646
  • 2
  • 8
  • 15
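
That RuntimeError means the scalar pandas_udf returned a Series of a different length than the Arrow batch it was handed (12 expected, 35 returned): scalar UDFs must stay strictly elementwise. A sketch of the contract, with a hypothetical dtype-mapping function in the spirit of the question:

```python
import pandas as pd
from pyspark.sql.functions import pandas_udf
from pyspark.sql.types import StringType

@pandas_udf(StringType())
def dtype_of(v: pd.Series) -> pd.Series:
    # One output value per input row -- never a grouped/aggregated result.
    return v.map(lambda x: type(x).__name__)
```
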
7
votes
1 answer

How to use date_add with two columns in pyspark?

I have a dataframe with some columns: +------------+--------+----------+----------+ |country_name| ID_user|birth_date| psdt| +------------+--------+----------+----------+ | Россия|16460783| 486|1970-01-01| | Россия|16467391| …
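
In Spark 2.x, pyspark's date_add only accepts a literal int for the day count, so with two columns the usual trick is to drop into a SQL expression. A minimal sketch using the question's column names:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
    [("Россия", 16460783, 486, "1970-01-01")],
    ["country_name", "ID_user", "birth_date", "psdt"],
)

# expr() allows a column-valued day count, unlike F.date_add in Spark 2.x.
df = df.withColumn("real_date", F.expr("date_add(psdt, birth_date)"))
df.show()
```
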
7
votes
1 answer

Getting the leaf probabilities of a tree model in spark

I'm trying to refactor a trained spark tree-based model (RandomForest or GBT classifiers) in such a way that it can be exported to environments without spark. The toDebugString method is a good starting point. However, in the case of…
nicola
  • 24,005
  • 3
  • 35
  • 56
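
The public Python API does not expose per-leaf class counts, but they can be reached through the internal, version-dependent Java objects behind each tree. A heavily hedged sketch (the method names follow Spark's Scala ml.tree internals and may change between releases):

```python
def leaf_probabilities(java_node, path=()):
    """Yield (path, class-probability list) for each leaf of a JVM tree node."""
    if java_node.getClass().getSimpleName() == "LeafNode":
        stats = list(java_node.impurityStats().stats())  # per-class counts
        total = sum(stats)
        yield path, [c / total for c in stats]
    else:  # InternalNode: recurse into both children
        yield from leaf_probabilities(java_node.leftChild(), path + ("L",))
        yield from leaf_probabilities(java_node.rightChild(), path + ("R",))

# Usage, assuming rf_model is a trained RandomForestClassificationModel:
#   root = rf_model.trees[0]._java_obj.rootNode()
#   for path, probs in leaf_probabilities(root):
#       print(path, probs)
```
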
7
votes
2 answers

Pyspark Dataframe pivot and groupby count

I am working on a pyspark dataframe which looks like the one below: id category 1 A 1 A 1 B 2 B 2 A 3 B 3 B 3 B. I want to unstack the category column and count the occurrences. So, the result I want is shown…
Sayed Shazeb
  • 75
  • 1
  • 5
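
The pandas-unstack equivalent here is groupBy plus pivot plus count. A minimal sketch over the data from the question:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
    [(1, "A"), (1, "A"), (1, "B"), (2, "B"), (2, "A"),
     (3, "B"), (3, "B"), (3, "B")],
    ["id", "category"],
)

# One column per category, counts as values; fill missing cells with 0.
df.groupBy("id").pivot("category").count().na.fill(0).show()
```
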
7
votes
2 answers

apply function to all values in array column pyspark

I want to make all values in an array column in my pyspark dataframe negative without exploding (!). I tried this udf but it didn't work: negative = func.udf(lambda x: x * -1, T.ArrayType(T.FloatType())) cast_contracts = cast_contracts \ …
LN_P
  • 1,448
  • 4
  • 21
  • 37
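
The udf in the question fails because x * -1 on a Python list does not negate its elements. Two sketches follow: a corrected udf, and Spark 2.4's higher-order transform, which avoids Python serialization entirely (the column name contracts is hypothetical):

```python
from pyspark.sql import SparkSession, functions as F, types as T

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([([1.0, 2.0, 3.0],)], ["contracts"])

# Corrected udf: negate element by element, passing nulls through.
negate = F.udf(lambda xs: [-x for x in xs] if xs is not None else None,
               T.ArrayType(T.FloatType()))
df.withColumn("neg_udf", negate("contracts")).show()

# Preferred on Spark 2.4+: SQL higher-order function, no udf needed.
df.withColumn("neg_hof", F.expr("transform(contracts, x -> -x)")).show()
```
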