Questions tagged [pyspark]

The Spark Python API (PySpark) exposes the Apache Spark programming model to Python.

39058 questions
7
votes
3 answers

Submit a Python project to Dataproc job

I have a Python project whose folder has the structure main_directory - lib - lib.py - run - script.py. script.py is: from lib.lib import add_two spark = SparkSession \ .builder \ .master('yarn') \ .appName('script') \ …
Galuoises
  • 2,630
  • 24
  • 30
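
A common way to make a package like lib importable on Dataproc is to zip it and ship it with the job. A minimal sketch, assuming hypothetical cluster, bucket, and path names:

```python
# Sketch: package lib/ and submit it alongside the entry script, so that
# `from lib.lib import add_two` resolves on the driver and the executors.
# Shell side (cluster/bucket names are hypothetical):
#   zip -r lib.zip lib/
#   gcloud dataproc jobs submit pyspark run/script.py \
#       --cluster=my-cluster --py-files=lib.zip
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .master("yarn")
         .appName("script")
         .getOrCreate())

# Alternative: distribute the archive at runtime (gs:// path is hypothetical).
spark.sparkContext.addPyFile("gs://my-bucket/lib.zip")
from lib.lib import add_two  # import after the archive has been shipped
```
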
7
votes
1 answer

Custom sorting in pyspark dataframes

Are there any recommended methods for implementing custom sort ordering for categorical data in pyspark? I'm ideally looking for the functionality the pandas categorical data type offers. So, given a dataset with a Speed column, the possible…
Daveed
  • 149
  • 2
  • 8
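
PySpark has no categorical dtype, but one workable pattern is to map each category to an explicit rank and sort on that rank. A minimal sketch, where the Speed categories and their ordering are assumptions:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([("Fast",), ("Slow",), ("Medium",)], ["Speed"])

speed_order = ["Slow", "Medium", "Fast"]  # assumed custom ordering

# Build a rank column: when(Speed == cat, i) chained over the categories.
rank = F.when(F.col("Speed") == speed_order[0], 0)
for i, cat in enumerate(speed_order[1:], start=1):
    rank = rank.when(F.col("Speed") == cat, i)

# Unknown categories sort last.
df.orderBy(rank.otherwise(len(speed_order))).show()
```
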
7
votes
2 answers

How to execute a stored procedure in Azure Databricks PySpark?

I am able to execute a simple SQL statement using PySpark in Azure Databricks but I want to execute a stored procedure instead. Below is the PySpark code I tried. #initialize pyspark import…
Ajay
  • 247
  • 1
  • 5
  • 15
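
Spark's JDBC reader wraps every query in a subselect, so it cannot run EXEC directly. A commonly cited workaround is to open a plain JDBC connection through the driver JVM; a sketch, with a hypothetical server, database, credentials, and procedure name:

```python
# On Databricks, `spark` is predefined. All connection details below are
# placeholders.
jdbc_url = ("jdbc:sqlserver://myserver.database.windows.net:1433;"
            "database=mydb;user=myuser;password=mypassword")

driver_manager = spark.sparkContext._gateway.jvm.java.sql.DriverManager
conn = driver_manager.getConnection(jdbc_url)
try:
    stmt = conn.createStatement()
    stmt.execute("EXEC dbo.my_stored_procedure")  # hypothetical procedure
finally:
    conn.close()
```
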
7
votes
3 answers

pyspark.sql.utils.AnalysisException: Failed to find data source: kafka

I am trying to read a stream from kafka using pyspark. I am using spark version 3.0.0-preview2 and spark-streaming-kafka-0-10_2.12. Before this I just start zookeeper, kafka and create a new topic: /usr/local/kafka/bin/zookeeper-server-start.sh…
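
This AnalysisException usually means the Structured Streaming Kafka connector is missing from the classpath: spark-streaming-kafka-0-10 is the DStreams artifact, while readStream needs spark-sql-kafka-0-10. A sketch of one fix, matching the Spark/Scala versions from the question (the topic name is hypothetical):

```python
# Either pass the package on the command line:
#   spark-submit --packages \
#     org.apache.spark:spark-sql-kafka-0-10_2.12:3.0.0-preview2 app.py
# or configure it before the session is created:
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("kafka-stream")
         .config("spark.jars.packages",
                 "org.apache.spark:spark-sql-kafka-0-10_2.12:3.0.0-preview2")
         .getOrCreate())

stream = (spark.readStream
          .format("kafka")
          .option("kafka.bootstrap.servers", "localhost:9092")
          .option("subscribe", "my-topic")  # hypothetical topic name
          .load())
```
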
7
votes
1 answer

Spark SQL - Regex for matching only numbers

I am trying to make sure that a particular column in a dataframe does not contain any illegal values (non-numerical data). For this purpose I am trying to use regex matching with rlike to collect illegal values in the data: I need to collect…
Hemanth
  • 705
  • 2
  • 16
  • 32
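
One point that trips people up here: rlike performs an unanchored search, so the pattern must be anchored to reject partial matches. A minimal sketch, with a hypothetical value column:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([("123",), ("12a",), ("4.5",)], ["value"])

# Anchored pattern: optional sign, digits, optional decimal part.
illegal = df.filter(~F.col("value").rlike(r"^-?[0-9]+(\.[0-9]+)?$"))
illegal.show()  # rows containing non-numerical data
```
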
7
votes
2 answers

Is there a function in pyspark dataframe that is similar to pandas.io.json.json_normalize

I would like to perform an operation similar to pandas.io.json.json_normalize in a pyspark dataframe. Is there an equivalent function in spark? https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.io.json.json_normalize.html
Kevin Tianyu Xu
  • 646
  • 2
  • 8
  • 15
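
Spark has no direct json_normalize, but the usual equivalent is schema inference plus star-expansion of structs and explode for arrays. A minimal sketch over hypothetical nested data:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.read.json(spark.sparkContext.parallelize(
    ['{"id": 1, "info": {"name": "a", "tags": ["x", "y"]}}']
))

flat = (df
        .select("id", "info.*")                # flatten the nested struct
        .withColumn("tag", F.explode("tags"))  # one row per array element
        .drop("tags"))
flat.show()
```
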
7
votes
2 answers

Append to PySpark array column

I want to check if the column values are within some boundaries. If they are not, I will append some value to the array column "F". This is the code I have so far: df = spark.createDataFrame( [ (1, 56), (2, 32), (3, 99) …
Lossa
  • 341
  • 2
  • 3
  • 9
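
On Spark 2.4+ this can be done without a udf by concatenating a one-element array under a condition. A minimal sketch, where the bounds and the appended value are assumptions:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
    [(1, 56, ["a"]), (2, 32, ["b"]), (3, 99, ["c"])],
    ["id", "value", "F"],
)

out_of_bounds = (F.col("value") < 40) | (F.col("value") > 90)  # assumed bounds
df = df.withColumn(
    "F",
    F.when(out_of_bounds, F.concat(F.col("F"), F.array(F.lit("flag"))))
     .otherwise(F.col("F")),
)
df.show()
```
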
7
votes
3 answers

"Parquet record is malformed" while column count is not 0

On an AWS EMR cluster, I'm trying to write a query result to parquet using Pyspark but face the following error: Caused by: java.lang.RuntimeException: Parquet record is malformed: empty fields are illegal, the field should be ommited completely…
Shinagan
  • 435
  • 3
  • 15
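
The Hive Parquet writer rejects empty (as opposed to null) maps and arrays, which raises this error even when the column count looks fine. One commonly suggested workaround is to null out empty collections before writing; a sketch with a hypothetical tags column and output path:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(1, ["a"]), (2, [])], ["id", "tags"])

# when() without otherwise() yields NULL, turning empty arrays into nulls.
df = df.withColumn("tags", F.when(F.size("tags") > 0, F.col("tags")))
df.write.mode("overwrite").parquet("/tmp/out")  # hypothetical output path
```
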
7
votes
2 answers

Problem running a Pandas UDF on a large dataset

I'm currently working on a project and I am having a hard time understanding how the Pandas UDF in PySpark works. I have a Spark cluster with one master node with 8 cores and 64GB, along with two workers of 16 cores each and 112GB. My dataset…
naifmeh
  • 408
  • 5
  • 15
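
For context, a scalar pandas UDF receives each partition as a series of Arrow batches and must return a batch of the same length; on large datasets the batch size is often the knob to tune. A minimal sketch (Spark 3 type-hint style; the batch size value is an assumption):

```python
import pandas as pd
from pyspark.sql import SparkSession
from pyspark.sql.functions import pandas_udf
from pyspark.sql.types import DoubleType

spark = (SparkSession.builder
         .config("spark.sql.execution.arrow.maxRecordsPerBatch", "10000")
         .getOrCreate())

@pandas_udf(DoubleType())
def times_two(v: pd.Series) -> pd.Series:
    return v * 2.0  # output length must equal the input batch length

df = spark.range(1_000_000).withColumn("doubled", times_two("id"))
df.show(3)
```
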
7
votes
2 answers

spark 2.4 Parquet column cannot be converted in file, Column: [Impressions], Expected: bigint, Found: BINARY

I'm facing a weird issue that I cannot understand. I have source data with a column "Impressions" that is sometimes a bigint / sometimes a string (when I manually explore the data). The Hive schema registered for this column is Long. Thus, when…
Jay Cee
  • 1,855
  • 5
  • 28
  • 48
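
This error typically means some Parquet files on disk were written with Impressions as a string (BINARY) while the Hive metastore declares bigint. One workaround sketch: read the files with their own schema rather than the metastore's and cast explicitly (the path is hypothetical; files whose footers disagree may need to be read in separate batches):

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

df = spark.read.parquet("/data/impressions/")  # bypasses the Hive schema
df = df.withColumn("Impressions", F.col("Impressions").cast("bigint"))
```
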
7
votes
1 answer

pandas_udf error RuntimeError: Result vector from pandas_udf was not the required length: expected 12, got 35

I am getting an error with pandas_udf with the following code. The code is to create a column with a data type based on another column. The same code works fine for the normal, slower udf (commented out). Basically anything more sophisticated than…
Kevin Tianyu Xu
  • 646
  • 2
  • 8
  • 15
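
That RuntimeError means the scalar pandas_udf returned a Series of a different length than the Arrow batch it was handed (12 expected, 35 returned): scalar UDFs must stay strictly elementwise. A sketch of the contract, with a hypothetical dtype-mapping function in the spirit of the question:

```python
import pandas as pd
from pyspark.sql.functions import pandas_udf
from pyspark.sql.types import StringType

@pandas_udf(StringType())
def dtype_of(v: pd.Series) -> pd.Series:
    # One output value per input row -- never a grouped/aggregated result.
    return v.map(lambda x: type(x).__name__)
```
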
7
votes
1 answer

How to use date_add with two columns in pyspark?

I have a dataframe with some columns: +------------+--------+----------+----------+ |country_name| ID_user|birth_date| psdt| +------------+--------+----------+----------+ | Россия|16460783| 486|1970-01-01| | Россия|16467391| …
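
In Spark 2.x, pyspark's date_add only accepts a literal int for the day count, so with two columns the usual trick is to drop into a SQL expression. A minimal sketch using the question's column names:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
    [("Россия", 16460783, 486, "1970-01-01")],
    ["country_name", "ID_user", "birth_date", "psdt"],
)

# expr() allows a column-valued day count, unlike F.date_add in Spark 2.x.
df = df.withColumn("real_date", F.expr("date_add(psdt, birth_date)"))
df.show()
```
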
7
votes
1 answer

Getting the leaf probabilities of a tree model in spark

I'm trying to refactor a trained spark tree-based model (RandomForest or GBT classifiers) in such a way that it can be exported to environments without spark. The toDebugString method is a good starting point. However, in the case of…
nicola
  • 24,005
  • 3
  • 35
  • 56
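
The public Python API does not expose per-leaf class counts, but they can be reached through the internal, version-dependent Java objects behind each tree. A heavily hedged sketch (the method names follow Spark's Scala ml.tree internals and may change between releases):

```python
def leaf_probabilities(java_node, path=()):
    """Yield (path, class-probability list) for each leaf of a JVM tree node."""
    if java_node.getClass().getSimpleName() == "LeafNode":
        stats = list(java_node.impurityStats().stats())  # per-class counts
        total = sum(stats)
        yield path, [c / total for c in stats]
    else:  # InternalNode: recurse into both children
        yield from leaf_probabilities(java_node.leftChild(), path + ("L",))
        yield from leaf_probabilities(java_node.rightChild(), path + ("R",))

# Usage, assuming rf_model is a trained RandomForestClassificationModel:
#   root = rf_model.trees[0]._java_obj.rootNode()
#   for path, probs in leaf_probabilities(root):
#       print(path, probs)
```
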
7
votes
2 answers

Pyspark Dataframe pivot and groupby count

I am working on a pyspark dataframe which looks like the one below: id category 1 A 1 A 1 B 2 B 2 A 3 B 3 B 3 B. I want to unstack the category column and count the occurrences. So, the result I want is shown…
Sayed Shazeb
  • 75
  • 1
  • 5
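
The pandas-unstack equivalent here is groupBy plus pivot plus count. A minimal sketch over the data from the question:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
    [(1, "A"), (1, "A"), (1, "B"), (2, "B"), (2, "A"),
     (3, "B"), (3, "B"), (3, "B")],
    ["id", "category"],
)

# One column per category, counts as values; fill missing cells with 0.
df.groupBy("id").pivot("category").count().na.fill(0).show()
```
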
7
votes
2 answers

apply function to all values in array column pyspark

I want to make all values in an array column in my pyspark dataframe negative without exploding (!). I tried this udf but it didn't work: negative = func.udf(lambda x: x * -1, T.ArrayType(T.FloatType())) cast_contracts = cast_contracts \ …
LN_P
  • 1,448
  • 4
  • 21
  • 37
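
The udf in the question fails because x * -1 on a Python list does not negate its elements. Two sketches follow: a corrected udf, and Spark 2.4's higher-order transform, which avoids Python serialization entirely (the column name contracts is hypothetical):

```python
from pyspark.sql import SparkSession, functions as F, types as T

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([([1.0, 2.0, 3.0],)], ["contracts"])

# Corrected udf: negate element by element, passing nulls through.
negate = F.udf(lambda xs: [-x for x in xs] if xs is not None else None,
               T.ArrayType(T.FloatType()))
df.withColumn("neg_udf", negate("contracts")).show()

# Preferred on Spark 2.4+: SQL higher-order function, no udf needed.
df.withColumn("neg_hof", F.expr("transform(contracts, x -> -x)")).show()
```
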