Questions tagged [pyspark]

The Spark Python API (PySpark) exposes the Apache Spark programming model to Python.

39,058 questions
7 votes, 3 answers

'DataFrame' object has no attribute 'withColumn'

I am trying to compare two pandas dataframes, but I get the error 'DataFrame' object has no attribute 'withColumn'. What could be the issue? import pandas as pd import pyspark.sql.functions as…
asked by jakrm
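
That error almost always means withColumn was called on a pandas DataFrame rather than a Spark one. A minimal sketch of the usual fix, assuming hypothetical columns a and b: convert to a Spark DataFrame first.

    import pandas as pd
    from pyspark.sql import SparkSession
    import pyspark.sql.functions as F

    spark = SparkSession.builder.getOrCreate()

    pdf = pd.DataFrame({"a": [1, 2], "b": [3, 4]})  # hypothetical pandas data

    # withColumn exists only on Spark DataFrames, so convert first.
    sdf = spark.createDataFrame(pdf)
    sdf = sdf.withColumn("a_plus_b", F.col("a") + F.col("b"))
    sdf.show()
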
7 votes, 2 answers

How to use Spark Streaming to read a stream and find the IP over a time Window?

I am new to Apache Spark and I would like to write some code in Python using PySpark to read a stream and find the IP addresses. I have a Java class that generates some fake IP addresses so that I can process them afterwards. This class will be listed…
asked by dadadima
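
For questions like this, Structured Streaming is one common route. A hedged sketch, assuming comma-separated "timestamp,ip" lines arriving on a socket; the host, port, and line format are invented for illustration.

    from pyspark.sql import SparkSession
    import pyspark.sql.functions as F

    spark = SparkSession.builder.getOrCreate()

    # Hypothetical source: "2019-01-01 00:00:00,10.0.0.1" lines on a local socket.
    lines = (spark.readStream.format("socket")
             .option("host", "localhost").option("port", 9999).load())

    ips = lines.select(
        F.split("value", ",")[0].cast("timestamp").alias("ts"),
        F.split("value", ",")[1].alias("ip"),
    )

    # Count each IP over a sliding 1-minute window, advancing every 30 seconds.
    counts = ips.groupBy(F.window("ts", "1 minute", "30 seconds"), "ip").count()

    query = counts.writeStream.outputMode("complete").format("console").start()
    query.awaitTermination()
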
7 votes, 1 answer

Pyspark - Cumulative sum with reset condition

I have this dataframe:

    +---+----+---+
    |  A|   B|  C|
    +---+----+---+
    |  0|null|  1|
    |  1| 3.0|  0|
    |  2| 7.0|  0|
    |  3|null|  1|
    |  4| 4.0|  0|
    |  5| 3.0|  0|
    |  6|null|  1|
    |  7|null|  1|
    |  8|null|  1|
    |  9| 5.0|  0|
    | 10| 2.0|  0|
    | 11|null| …
asked by Kafels
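
One common approach: a running sum of the reset flag yields a group id, and the cumulative sum is then computed per group. A minimal sketch, assuming C == 1 marks a reset and A gives the row order, as in the excerpt.

    from pyspark.sql import SparkSession, Window
    import pyspark.sql.functions as F

    spark = SparkSession.builder.getOrCreate()
    df = spark.createDataFrame(
        [(0, None, 1), (1, 3.0, 0), (2, 7.0, 0), (3, None, 1), (4, 4.0, 0)],
        ["A", "B", "C"],
    )

    # Running total of the reset flag C assigns a group id that bumps at each reset.
    df = df.withColumn("grp", F.sum("C").over(Window.orderBy("A")))
    # The cumulative sum of B then restarts within each group.
    df = df.withColumn("cumsum", F.sum("B").over(Window.partitionBy("grp").orderBy("A")))
    df.show()
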
7 votes, 4 answers

How to apply large python model to pyspark-dataframe?

I have: a large dataframe (parquet format, 100,000,000 rows, 4.5 TB) that contains some data (features); several huge ML models (each one takes 5-15 GB of RAM); a Spark cluster (AWS EMR), typical node configuration is 8 CPU, 32 GB RAM, can be changed if…
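
A frequent pattern for this is mapPartitions, so the multi-gigabyte model is loaded once per partition instead of once per row. A hedged sketch; the model path, joblib usage, and feature columns are all assumptions.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    df = spark.createDataFrame([(1.0, 2.0), (3.0, 4.0)], ["f1", "f2"])

    def predict_partition(rows):
        import joblib  # assumed available on the workers
        # Hypothetical path; the model is loaded once per partition, not per row.
        model = joblib.load("/mnt/models/big_model.pkl")
        for row in rows:
            yield (row.f1, row.f2,
                   float(model.predict([[row.f1, row.f2]])[0]))

    result = df.rdd.mapPartitions(predict_partition).toDF(["f1", "f2", "prediction"])
    result.show()
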
7 votes, 1 answer

Using Pyspark to read JSON items from an array?

I'm having some issues reading items from Cosmos DB in Databricks: it seems to read the JSON as a string value, and I'm having trouble getting the data out of it into columns. I have a column called ProductRanges with the following values in a…
asked by Jon
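
The usual route is from_json plus explode. A minimal sketch, assuming ProductRanges holds a JSON array string; the schema below is invented for illustration.

    from pyspark.sql import SparkSession
    from pyspark.sql.types import ArrayType, StructType, StructField, StringType
    import pyspark.sql.functions as F

    spark = SparkSession.builder.getOrCreate()
    df = spark.createDataFrame(
        [('[{"name": "a", "min": "1"}, {"name": "b", "min": "2"}]',)],
        ["ProductRanges"],
    )

    schema = ArrayType(StructType([
        StructField("name", StringType()),
        StructField("min", StringType()),
    ]))

    # Parse the string into an array of structs, explode it, then flatten to columns.
    parsed = df.withColumn("items", F.from_json("ProductRanges", schema))
    exploded = parsed.select(F.explode("items").alias("item"))
    exploded.select("item.name", "item.min").show()
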
7 votes, 1 answer

How to remove duplicates from a spark data frame while retaining the latest?

I'm using Spark to load JSON files from Amazon S3. I would like to remove duplicates based on two columns of the data frame, retaining the newest (I have a timestamp column). What would be the best way to do it? Please note that the duplicates may be…
asked by lalatnayak
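
A window function handles this cleanly: rank rows per key by the timestamp, newest first, and keep rank 1. A sketch with hypothetical key columns col1, col2 and timestamp ts.

    from pyspark.sql import SparkSession, Window
    import pyspark.sql.functions as F

    spark = SparkSession.builder.getOrCreate()
    df = spark.createDataFrame([("a", "x", 1), ("a", "x", 2), ("b", "y", 1)],
                               ["col1", "col2", "ts"])

    # Rank rows within each key by timestamp, newest first, and keep rank 1.
    w = Window.partitionBy("col1", "col2").orderBy(F.col("ts").desc())
    latest = (df.withColumn("rn", F.row_number().over(w))
                .filter(F.col("rn") == 1)
                .drop("rn"))
    latest.show()
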
7 votes, 2 answers

pySpark foreachPartition - Where is code executed

I'm using PySpark 2.3 (I cannot update to 2.4 in my current dev system) and have the following questions concerning foreachPartition. First, a little context: as far as I understand, PySpark UDFs force the Python code to be executed…
asked by Markus
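
The short answer is that the function passed to foreachPartition runs on the executors, not the driver. A small sketch illustrating this; on a cluster, the prints land in the executor logs rather than the driver console.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    df = spark.range(8).repartition(2)

    def handle_partition(rows):
        # Runs once per partition on an executor; a connection opened here
        # would be opened per partition rather than per row.
        for row in rows:
            print(row.id)  # on a cluster this lands in executor stdout

    df.foreachPartition(handle_partition)
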
7 votes, 5 answers

PySpark Dataframe melt columns into rows

As the subject describes, I have a PySpark DataFrame from which I need to melt three columns into rows. Each column essentially represents a single fact in a category. The ultimate goal is to aggregate the data into a single total per category. There are…
asked by Gary C
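
stack() in selectExpr is one idiomatic way to melt in Spark. A hedged sketch, assuming three hypothetical fact columns cat_a, cat_b, cat_c.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    df = spark.createDataFrame([(1, 10, 20, 30)],
                               ["id", "cat_a", "cat_b", "cat_c"])

    # stack(n, label1, col1, ...) emits one (label, value) row per pair.
    melted = df.selectExpr(
        "id",
        "stack(3, 'cat_a', cat_a, 'cat_b', cat_b, 'cat_c', cat_c) as (category, value)",
    )
    melted.groupBy("category").sum("value").show()
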
7 votes, 1 answer

Comparison of a `float` to `np.nan` in Spark Dataframe

Is this expected behaviour? I thought about raising an issue with Spark, but this is such basic functionality that it's hard to imagine there's a bug here. What am I missing? Python:

    import numpy as np
    >>> np.nan < 0.0
    False
    >>> np.nan >…
asked by avloss
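
Spark defines a total ordering on doubles in which NaN is larger than any other value (and equal to itself), unlike plain Python's IEEE-754 comparisons. A short sketch contrasting the two.

    import numpy as np
    from pyspark.sql import SparkSession
    import pyspark.sql.functions as F

    spark = SparkSession.builder.getOrCreate()
    df = spark.createDataFrame([(float(np.nan),)], ["x"])

    df.select(
        (F.col("x") < 0.0).alias("nan_lt_zero"),  # False, as in plain Python
        (F.col("x") > 0.0).alias("nan_gt_zero"),  # True in Spark, False in Python
    ).show()
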
7 votes, 1 answer

How to get datediff() in seconds in pyspark?

I have tried the code as in (this_post) and cannot get the date difference in seconds. I simply take the datediff() between the columns 'Attributes_Timestamp_fix' and 'lagged_date' below. Any hints? Below are my code and output. eg =…
asked by a_geo
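
datediff() only counts whole days. A common workaround is to cast both timestamps to long (epoch seconds) and subtract. A minimal sketch reusing the column names from the excerpt; the sample values are invented.

    from pyspark.sql import SparkSession
    import pyspark.sql.functions as F

    spark = SparkSession.builder.getOrCreate()
    df = (spark.createDataFrame(
              [("2019-01-01 00:00:00", "2019-01-01 00:00:42")],
              ["lagged_date", "Attributes_Timestamp_fix"])
          .select(F.col("lagged_date").cast("timestamp").alias("lagged_date"),
                  F.col("Attributes_Timestamp_fix").cast("timestamp")
                   .alias("Attributes_Timestamp_fix")))

    # Casting a timestamp to long yields epoch seconds, so subtracting
    # the two columns gives the difference in seconds.
    diff = (F.col("Attributes_Timestamp_fix").cast("long")
            - F.col("lagged_date").cast("long"))
    df.withColumn("diff_seconds", diff).show()
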
7 votes, 2 answers

pyspark - getting Latest partition from Hive partitioned column logic

I am new to PySpark. I am trying to get the latest partition (date partition) of a Hive table using PySpark dataframes, and I did it as below. But I am sure there is a better way to do it using dataframe functions (not by writing SQL). Could you…
asked by vinu.m.19
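
One dataframe-only option is to take max() over the partition column. A hedged sketch; the table and partition column names are hypothetical, and depending on the Spark version and configuration this may be answered from partition metadata without scanning data files.

    from pyspark.sql import SparkSession
    import pyspark.sql.functions as F

    spark = SparkSession.builder.enableHiveSupport().getOrCreate()

    # Hypothetical table and partition column; selecting only the
    # partition column keeps the scan cheap.
    latest = (spark.table("mydb.events")
                   .select(F.max("date_partition").alias("latest"))
                   .first()["latest"])
    print(latest)
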
7 votes, 3 answers

pyspark: Method isBarrier([]) does not exist

I'm trying to learn Spark by following some hello-world level examples such as the one below, using pyspark. I got a "Method isBarrier([]) does not exist" error; the full error is included below the code.

    from pyspark import SparkContext
    if __name__ == '__main__': …
asked by Indominus
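
The usual cause of this error is a version mismatch: a pip-installed pyspark 2.4+ (which introduced barrier execution) talking to Spark 2.3 jars. A short check comparing the two versions.

    import pyspark
    from pyspark import SparkContext

    sc = SparkContext.getOrCreate()
    print("python package:", pyspark.__version__)  # the pip-installed pyspark
    print("spark runtime: ", sc.version)           # the JVM-side Spark
    # If these differ, align them, e.g. pip install pyspark==<runtime version>.
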
7 votes, 1 answer

Concat multiple columns of a dataframe using pyspark

Suppose I have a list of columns, for example:

    col_list = ['col1','col2']
    df = spark.read.json(path_to_file)
    print(df.columns) # ['col1','col2','col3']

I need to create a new column by concatenating col1 and col2. I don't want to hard-code the…
asked by Amita Rawat
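
Unpacking the list into concat() or concat_ws() avoids hard-coding anything. A minimal sketch mirroring the excerpt's col_list; the separator is an assumption.

    from pyspark.sql import SparkSession
    import pyspark.sql.functions as F

    spark = SparkSession.builder.getOrCreate()
    df = spark.createDataFrame([("a", "b", "c")], ["col1", "col2", "col3"])

    col_list = ['col1', 'col2']
    # Unpack the list so the columns are never spelled out by hand.
    df = df.withColumn("concatenated",
                       F.concat_ws("_", *[F.col(c) for c in col_list]))
    df.show()
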
7 votes, 2 answers

Multiple WHEN condition implementation in Pyspark

I have my T-SQL code below, which I've converted to PySpark, but it is giving me an error.

    CASE WHEN time_on_site.eventaction = 'IN'
              AND time_on_site.next_action = 'OUT'
              AND time_on_site.timespent_sec < 72000 THEN 1 -- 20 hours
         WHEN…
asked by Katelyn Raphael
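
A multi-branch T-SQL CASE usually translates to chained when() calls ending in otherwise(). A hedged sketch following the column names in the excerpt; the 72000-second threshold is the 20 hours mentioned there, and the sample row is invented.

    from pyspark.sql import SparkSession
    import pyspark.sql.functions as F

    spark = SparkSession.builder.getOrCreate()
    df = spark.createDataFrame([("IN", "OUT", 3600)],
                               ["eventaction", "next_action", "timespent_sec"])

    flag = (F.when((F.col("eventaction") == "IN")
                   & (F.col("next_action") == "OUT")
                   & (F.col("timespent_sec") < 72000), 1)  # 20 hours
             .otherwise(0))
    df.withColumn("flag", flag).show()
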
7 votes, 3 answers

Display PySpark Dataframe as HTML Table in Jupyter Notebook

I'm trying to display a PySpark dataframe as an HTML table in a Jupyter Notebook, but all methods seem to be failing. Using this method displays a text-formatted table:

    import pandas
    df.toPandas()

Using this method displays the HTML table as a…
asked by nxl4
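
One approach that generally works in Jupyter is to convert a small slice to pandas and hand the HTML to IPython explicitly. A minimal sketch, assuming a Jupyter environment.

    from IPython.display import HTML, display
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    df = spark.range(5).toDF("n")

    # limit() first: toPandas() collects the whole result to the driver.
    display(HTML(df.limit(20).toPandas().to_html()))
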