I am trying to compare two pandas DataFrames, but I get the error 'DataFrame' object has no attribute 'withColumn'. What could be the issue?
import pandas as pd
import pyspark.sql.functions as…
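For reference, withColumn() is a method on PySpark DataFrames, not pandas ones, so this error usually means a plain pandas DataFrame is being passed through PySpark-style code. A minimal sketch of the usual fix, assuming an active SparkSession named spark:

import pandas as pd
import pyspark.sql.functions as F

pdf = pd.DataFrame({'a': [1, 2, 3]})        # plain pandas DataFrame: no withColumn()
sdf = spark.createDataFrame(pdf)            # convert it to a Spark DataFrame
sdf = sdf.withColumn('b', F.col('a') * 2)   # withColumn() now exists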
I am new to Apache Spark and I would like to write some code in Python using PySpark to read a stream and find the IP addresses.
I have a Java class that generates some fake IP addresses so that I can process them afterwards. This class will be listed…
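A hedged sketch of what the streaming side might look like, assuming the generator writes lines to a local socket on port 9999; the host, port, and IPv4 regex are illustrative assumptions, not taken from the question:

from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.appName('ip-stream').getOrCreate()

# Read raw text lines from the (assumed) socket source.
lines = (spark.readStream
         .format('socket')
         .option('host', 'localhost')
         .option('port', 9999)
         .load())

# Extract the first IPv4-looking token from each line.
ips = lines.select(
    F.regexp_extract('value', r'(\d{1,3}(?:\.\d{1,3}){3})', 1).alias('ip')
).where(F.col('ip') != '')

query = ips.writeStream.format('console').start()
query.awaitTermination()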
I have:
A large DataFrame (Parquet format, 100,000,000 rows, 4.5 TB) that contains some data (features)
Several huge ML models (each one takes 5-15 GB of RAM)
A Spark cluster (AWS EMR); the typical node configuration is 8 CPUs and 32 GB RAM, and it can be changed if…
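One common pattern for this kind of setup is to load each model once per partition and stream rows through it, so the multi-gigabyte model is never serialized into the task closure. A sketch under those assumptions; load_model(), the model path, and the scikit-learn-style predict() call are all hypothetical:

def score_partition(rows):
    model = load_model('/mnt/models/model.bin')   # hypothetical loader and path
    for row in rows:
        yield (row['id'], float(model.predict([row['features']])[0]))

df = spark.read.parquet('s3://bucket/features/')  # illustrative path
scores = df.rdd.mapPartitions(score_partition).toDF(['id', 'score'])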
I'm having some issues with reading items from Cosmos DB in Databricks: it seems to read the JSON as a string value, and I'm having trouble getting the data out of it into columns.
I have a column called ProductRanges with the following values in a…
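The usual route here is from_json() with an explicit schema, which turns the string column into a struct whose fields can be selected as ordinary columns. A minimal sketch, assuming ProductRanges holds a JSON object with integer min and max fields (the schema is an illustrative guess, not taken from the question):

from pyspark.sql.types import StructType, StructField, IntegerType
import pyspark.sql.functions as F

schema = StructType([
    StructField('min', IntegerType()),
    StructField('max', IntegerType()),
])

parsed = (df
          .withColumn('pr', F.from_json(F.col('ProductRanges'), schema))
          .select('*', 'pr.min', 'pr.max')   # pull struct fields out as columns
          .drop('pr'))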
I'm using Spark to load JSON files from Amazon S3. I would like to remove duplicates based on two columns of the data frame, retaining the newest (I have a timestamp column). What would be the best way to do it? Please note that the duplicates may be…
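A hedged sketch of the usual window-function approach: rank rows within each key pair by timestamp, descending, and keep only the first. key1, key2, and ts are placeholders for the two key columns and the timestamp column:

from pyspark.sql import Window
import pyspark.sql.functions as F

w = Window.partitionBy('key1', 'key2').orderBy(F.col('ts').desc())

deduped = (df
           .withColumn('rn', F.row_number().over(w))
           .where(F.col('rn') == 1)   # keep only the newest row per key pair
           .drop('rn'))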
I'm using PySpark version 2.3 (I cannot update to 2.4 on my current dev system) and have the following questions concerning foreachPartition.
First, a little context: as far as I understand, PySpark UDFs force the Python code to be executed…
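For context, a small sketch of foreachPartition itself; the handler receives an iterator over the partition's rows and runs on the executors, which is why it is the usual place to open one expensive resource (e.g. a DB connection) per partition rather than per row:

def handle_partition(rows):
    # Runs on an executor; output goes to the executor logs, not the driver.
    count = sum(1 for _ in rows)
    print(f'processed {count} rows in this partition')

df.foreachPartition(handle_partition)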
As the subject describes, I have a PySpark DataFrame in which I need to melt three columns into rows. Each column essentially represents a single fact in a category. The ultimate goal is to aggregate the data into a single total per category.
There are…
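A hedged melt sketch using the stack() SQL function; id, colA, colB, and colC are placeholder names for an identifier column and the three fact columns:

import pyspark.sql.functions as F

melted = df.selectExpr(
    'id',
    "stack(3, 'colA', colA, 'colB', colB, 'colC', colC) as (category, value)"
)

# The stated end goal: one total per category.
totals = melted.groupBy('category').agg(F.sum('value').alias('total'))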
Is this expected behaviour? I thought about raising an issue with Spark, but this seems like such basic functionality that it's hard to imagine there's a bug here. What am I missing?
Python
import numpy as np
>>> np.nan < 0.0
False
>>> np.nan >…
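For contrast, a sketch of Spark's documented NaN semantics: in Spark SQL, NaN is treated as larger than any other numeric value when sorting, whereas in the Python session above every comparison with NaN is simply False:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(1.0,), (float('nan'),), (-3.0,)], ['x'])
df.orderBy('x').show()   # ascending order ends with NaN: -3.0, 1.0, NaN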
I have tried the code in (this_post) and cannot get the date difference in seconds. I just take the datediff() between the columns 'Attributes_Timestamp_fix' and 'lagged_date' below.
Any hints?
My code and output are below.
eg =…
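A likely fix, hedged: datediff() returns whole days, so for a difference in seconds the usual trick is to cast both timestamps to long (Unix epoch seconds) and subtract:

import pyspark.sql.functions as F

eg = eg.withColumn(
    'diff_seconds',
    F.col('Attributes_Timestamp_fix').cast('long')
    - F.col('lagged_date').cast('long')
)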
I am new to PySpark.
I am trying to get the latest (date) partition of a Hive table using PySpark DataFrames, and I have done it as below.
But I am sure there is a better way to do it using DataFrame functions (not by writing SQL). Could you…
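One DataFrame-only alternative, hedged: read the partition column through the DataFrame API and take its max. part_date and the table name are placeholders, and note this route may scan data rather than just partition metadata:

import pyspark.sql.functions as F

latest = (spark.table('my_db.my_table')   # illustrative table name
          .select(F.max('part_date').alias('latest_partition'))
          .first()['latest_partition'])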
I'm trying to learn Spark by following some hello-world-level examples such as the one below, using PySpark. I got a "Method isBarrier([]) does not exist" error; the full error is included below the code.
from pyspark import SparkContext
if __name__ == '__main__':
…
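For what it's worth, a hedged guess at the shape of such a hello-world snippet; the "Method isBarrier([]) does not exist" error typically points to a version mismatch between the pip-installed pyspark package and the Spark installation it talks to, not to the code itself:

from pyspark import SparkContext

if __name__ == '__main__':
    sc = SparkContext('local', 'hello')
    rdd = sc.parallelize([1, 2, 3, 4])
    print(rdd.map(lambda x: x * 2).collect())   # [2, 4, 6, 8]
    sc.stop()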
Suppose I have a list of columns, for example:
col_list = ['col1','col2']
df = spark.read.json(path_to_file)
print(df.columns)
# ['col1','col2','col3']
I need to create a new column by concatenating col1 and col2. I don't want to hard code the…
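A hedged sketch: build the concat() arguments from the list itself so no column name is hard-coded:

import pyspark.sql.functions as F

df = df.withColumn('combined', F.concat(*[F.col(c) for c in col_list]))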
I have my T-SQL code below, which I've converted to PySpark, but it's giving me an error.
CASE
WHEN time_on_site.eventaction = 'IN' AND time_on_site.next_action = 'OUT' AND time_on_site.timespent_sec < 72000 THEN 1 -- 20 hours
WHEN…
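A hedged PySpark translation of the one CASE branch visible above; the remaining WHEN branches are elided in the question, so only the first is shown, and 'flag' is a placeholder output column name:

import pyspark.sql.functions as F

result = time_on_site.withColumn(
    'flag',
    F.when(
        (F.col('eventaction') == 'IN')
        & (F.col('next_action') == 'OUT')
        & (F.col('timespent_sec') < 72000),   # 20 hours
        1
    )   # further .when(...) branches would mirror the elided T-SQL
)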
I'm trying to display a PySpark DataFrame as an HTML table in a Jupyter Notebook, but all methods seem to be failing.
Using this method displays a text-formatted table:
import pandas
df.toPandas()
Using this method displays the HTML table as a…
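One workaround that usually renders properly in Jupyter, hedged: convert to pandas and hand the markup to IPython's display machinery explicitly:

from IPython.display import HTML, display

display(HTML(df.toPandas().to_html()))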