I am trying to compare two pandas DataFrames, but I get the error 'DataFrame' object has no attribute 'withColumn'. What could be the issue?
import pandas as pd
import pyspark.sql.functions as…
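For reference, withColumn() is a method on PySpark DataFrames, not pandas ones, so this error usually means a plain pandas DataFrame is being passed through PySpark-style code. A minimal sketch of the usual fix, assuming an active SparkSession named spark:

import pandas as pd
import pyspark.sql.functions as F

pdf = pd.DataFrame({'a': [1, 2, 3]})        # plain pandas DataFrame: no withColumn()
sdf = spark.createDataFrame(pdf)            # convert it to a Spark DataFrame
sdf = sdf.withColumn('b', F.col('a') * 2)   # withColumn() now exists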
I am new to Apache Spark and I would like to write some code in Python using PySpark to read a stream and find the IP addresses.
I have a Java class that generates some fake IP addresses so that I can process them afterwards. This class will be listed…
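A hedged sketch of what the streaming side might look like, assuming the generator writes lines to a local socket on port 9999; the host, port, and IPv4 regex are illustrative assumptions, not taken from the question:

from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.appName('ip-stream').getOrCreate()

# Read raw text lines from the (assumed) socket source.
lines = (spark.readStream
         .format('socket')
         .option('host', 'localhost')
         .option('port', 9999)
         .load())

# Extract the first IPv4-looking token from each line.
ips = lines.select(
    F.regexp_extract('value', r'(\d{1,3}(?:\.\d{1,3}){3})', 1).alias('ip')
).where(F.col('ip') != '')

query = ips.writeStream.format('console').start()
query.awaitTermination()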
I have:
A large DataFrame (Parquet format, 100,000,000 rows, 4.5 TB) that contains some data (features)
Several huge ML models (each one takes 5-15 GB of RAM)
A Spark cluster (AWS EMR); the typical node configuration is 8 CPUs and 32 GB RAM, and it can be changed if…
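One common pattern for this kind of setup is to load each model once per partition and stream rows through it, so the multi-gigabyte model is never serialized into the task closure. A sketch under those assumptions; load_model(), the model path, and the scikit-learn-style predict() call are all hypothetical:

def score_partition(rows):
    model = load_model('/mnt/models/model.bin')   # hypothetical loader and path
    for row in rows:
        yield (row['id'], float(model.predict([row['features']])[0]))

df = spark.read.parquet('s3://bucket/features/')  # illustrative path
scores = df.rdd.mapPartitions(score_partition).toDF(['id', 'score'])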
I'm having some issues with reading items from Cosmos DB in Databricks: it seems to read the JSON as a string value, and I'm having trouble getting the data out of it into columns.
I have a column called ProductRanges with the following values in a…
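The usual route here is from_json() with an explicit schema, which turns the string column into a struct whose fields can be selected as ordinary columns. A minimal sketch, assuming ProductRanges holds a JSON object with integer min and max fields (the schema is an illustrative guess, not taken from the question):

from pyspark.sql.types import StructType, StructField, IntegerType
import pyspark.sql.functions as F

schema = StructType([
    StructField('min', IntegerType()),
    StructField('max', IntegerType()),
])

parsed = (df
          .withColumn('pr', F.from_json(F.col('ProductRanges'), schema))
          .select('*', 'pr.min', 'pr.max')   # pull struct fields out as columns
          .drop('pr'))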
I'm using Spark to load JSON files from Amazon S3. I would like to remove duplicates based on two columns of the data frame, retaining the newest (I have a timestamp column). What would be the best way to do it? Please note that the duplicates may be…
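A hedged sketch of the usual window-function approach: rank rows within each key pair by timestamp, descending, and keep only the first. key1, key2, and ts are placeholders for the two key columns and the timestamp column:

from pyspark.sql import Window
import pyspark.sql.functions as F

w = Window.partitionBy('key1', 'key2').orderBy(F.col('ts').desc())

deduped = (df
           .withColumn('rn', F.row_number().over(w))
           .where(F.col('rn') == 1)   # keep only the newest row per key pair
           .drop('rn'))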
I'm using PySpark version 2.3 (I cannot update to 2.4 on my current dev system) and have the following questions concerning foreachPartition.
First, a little context: as far as I understand, PySpark UDFs force the Python code to be executed…
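For context, a small sketch of foreachPartition itself; the handler receives an iterator over the partition's rows and runs on the executors, which is why it is the usual place to open one expensive resource (e.g. a DB connection) per partition rather than per row:

def handle_partition(rows):
    # Runs on an executor; output goes to the executor logs, not the driver.
    count = sum(1 for _ in rows)
    print(f'processed {count} rows in this partition')

df.foreachPartition(handle_partition)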
As the subject describes, I have a PySpark DataFrame in which I need to melt three columns into rows. Each column essentially represents a single fact in a category. The ultimate goal is to aggregate the data into a single total per category.
There are…
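A hedged melt sketch using the stack() SQL function; id, colA, colB, and colC are placeholder names for an identifier column and the three fact columns:

import pyspark.sql.functions as F

melted = df.selectExpr(
    'id',
    "stack(3, 'colA', colA, 'colB', colB, 'colC', colC) as (category, value)"
)

# The stated end goal: one total per category.
totals = melted.groupBy('category').agg(F.sum('value').alias('total'))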
Is this expected behaviour? I thought about raising an issue with Spark, but this seems like such basic functionality that it's hard to imagine there's a bug here. What am I missing?
Python
import numpy as np
>>> np.nan < 0.0
False
>>> np.nan >…
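For contrast, a sketch of Spark's documented NaN semantics: in Spark SQL, NaN is treated as larger than any other numeric value when sorting, whereas in the Python session above every comparison with NaN is simply False:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(1.0,), (float('nan'),), (-3.0,)], ['x'])
df.orderBy('x').show()   # ascending order ends with NaN: -3.0, 1.0, NaN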
I have tried the code in (this_post) and cannot get the date difference in seconds. I just take the datediff() between the columns 'Attributes_Timestamp_fix' and 'lagged_date' below.
Any hints?
My code and output are below.
eg =…
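A likely fix, hedged: datediff() returns whole days, so for a difference in seconds the usual trick is to cast both timestamps to long (Unix epoch seconds) and subtract:

import pyspark.sql.functions as F

eg = eg.withColumn(
    'diff_seconds',
    F.col('Attributes_Timestamp_fix').cast('long')
    - F.col('lagged_date').cast('long')
)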
I am new to PySpark.
I am trying to get the latest (date) partition of a Hive table using PySpark DataFrames, and I have done it as below.
But I am sure there is a better way to do it using DataFrame functions (not by writing SQL). Could you…
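One DataFrame-only alternative, hedged: read the partition column through the DataFrame API and take its max. part_date and the table name are placeholders, and note this route may scan data rather than just partition metadata:

import pyspark.sql.functions as F

latest = (spark.table('my_db.my_table')   # illustrative table name
          .select(F.max('part_date').alias('latest_partition'))
          .first()['latest_partition'])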
I'm trying to learn Spark by following some hello-world-level examples such as the one below, using PySpark. I got a "Method isBarrier([]) does not exist" error; the full error is included below the code.
from pyspark import SparkContext
if __name__ == '__main__':
…
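For what it's worth, a hedged guess at the shape of such a hello-world snippet; the "Method isBarrier([]) does not exist" error typically points to a version mismatch between the pip-installed pyspark package and the Spark installation it talks to, not to the code itself:

from pyspark import SparkContext

if __name__ == '__main__':
    sc = SparkContext('local', 'hello')
    rdd = sc.parallelize([1, 2, 3, 4])
    print(rdd.map(lambda x: x * 2).collect())   # [2, 4, 6, 8]
    sc.stop()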
Suppose I have a list of columns, for example:
col_list = ['col1','col2']
df = spark.read.json(path_to_file)
print(df.columns)
# ['col1','col2','col3']
I need to create a new column by concatenating col1 and col2. I don't want to hard code the…
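A hedged sketch: build the concat() arguments from the list itself so no column name is hard-coded:

import pyspark.sql.functions as F

df = df.withColumn('combined', F.concat(*[F.col(c) for c in col_list]))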
I have my T-SQL code below, which I've converted to PySpark, but it's giving me an error.
CASE
WHEN time_on_site.eventaction = 'IN' AND time_on_site.next_action = 'OUT' AND time_on_site.timespent_sec < 72000 THEN 1 -- 20 hours
WHEN…
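A hedged PySpark translation of the one CASE branch visible above; the remaining WHEN branches are elided in the question, so only the first is shown, and 'flag' is a placeholder output column name:

import pyspark.sql.functions as F

result = time_on_site.withColumn(
    'flag',
    F.when(
        (F.col('eventaction') == 'IN')
        & (F.col('next_action') == 'OUT')
        & (F.col('timespent_sec') < 72000),   # 20 hours
        1
    )   # further .when(...) branches would mirror the elided T-SQL
)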
I'm trying to display a PySpark DataFrame as an HTML table in a Jupyter Notebook, but all methods seem to be failing.
Using this method displays a text-formatted table:
import pandas
df.toPandas()
Using this method displays the HTML table as a…
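One workaround that usually renders properly in Jupyter, hedged: convert to pandas and hand the markup to IPython's display machinery explicitly:

from IPython.display import HTML, display

display(HTML(df.toPandas().to_html()))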