Questions tagged [pyspark-pandas]
131 questions
0
votes
1 answer
How to filter out rows with lots of conditions in pyspark?
Let's say that these are my data:
Product_Number | Condition | Type     | Country
1              | New       | Chainsaw | USA
1              | Old       | Chainsaw | USA
1              | Null      | Chainsaw | USA
2              | Old …

enas dyo
- 35
- 6
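Combining many conditions in PySpark comes down to chaining boolean column expressions. A minimal sketch against the columns in the excerpt, with an invented filtering rule since the full question is truncated:

```python
from pyspark.sql import functions as F

# Hypothetical rule for illustration: keep USA chainsaws whose
# Condition is present and not "Old". Each comparison must be wrapped
# in parentheses, because & and | bind tighter than == in Python.
result = df.filter(
    (F.col("Country") == "USA")
    & (F.col("Type") == "Chainsaw")
    & F.col("Condition").isNotNull()
    & (F.col("Condition") != "Old")
)
```

When the list of allowed values grows, `F.col("Condition").isin("New", "Refurbished")` scales better than a long chain of equality checks.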
0
votes
0 answers
pyspark.sql.utils.AnalysisException: nondeterministic expressions are only allowed in Project, Filter, Aggregate or Window, found: exists()
Spark Version : 3.3.0
pyspark Version: 3.1.1
python Version: 3.7.9
I am trying to work with the functionality of pyspark.pandas.
I created a pyspark.pandas dataframe and converted it into a Spark DataFrame using the df.to_spark() function. After that I…

Puneet Jain
- 1
- 1
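Two things are worth checking here. First, pyspark.pandas only ships with pyspark 3.2 and later, so the pyspark 3.1.1 listed above would not expose it at all; the client and cluster versions should match. Second, errors about non-deterministic expressions often trace back to the implicit default index that pandas-on-Spark attaches. A sketch of a commonly suggested workaround (whether it fixes this exact exists() case is an assumption):

```python
import pyspark.pandas as ps

# Make the index an explicit column when converting back to a plain
# Spark DataFrame, instead of relying on the implicit default index.
psdf = ps.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})
sdf = psdf.to_spark(index_col="idx")
sdf.show()
```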
0
votes
1 answer
I want to add a new column to a dataframe with values that I get from a for loop
I've written the below code:
def func_sql(table, query):
    q = spark.sql(eval(f"f'{query}'")).collect()[0][0]
    print(q)

lst = []
for i in range(int_df.count()):
    lst.append(int_df.collect()[i][3])
# print(lst)
for x in lst:
    …

Aishani Singh
- 29
- 5
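Two issues compound in this snippet: .collect() inside a loop re-runs the whole job once per row, and a driver-side loop cannot add a column by itself. A sketch of the usual fix, with hypothetical names for anything not in the excerpt:

```python
# Collect the needed column once; each .collect() call re-executes the
# entire plan, so the original loop cost O(rows) full Spark jobs.
lst = [row[3] for row in int_df.collect()]

# To attach per-value query results as a new column, join against the
# query's result set instead of looping on the driver. 'lookup_df' and
# its 'key' column are placeholders for whatever the SQL returns.
# int_df = int_df.join(lookup_df,
#                      int_df[int_df.columns[3]] == lookup_df["key"],
#                      "left")
```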
0
votes
0 answers
Missing original values when converting csv file to dataframe
I have a csv file which looks like this:
I've read it and stored it in a dataframe as follows:
query_df=spark.read.csv('./rules/rules.csv',header=True)
query_df.show(truncate=False)
However, when I view this dataframe using the .show() method…

Aishani Singh
- 29
- 5
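Values that appear altered after spark.read.csv are usually a quoting, escaping, or type-inference effect rather than lost data. A hedged sketch of the standard options to try; whether they fix this particular file is an assumption:

```python
query_df = (
    spark.read
    .option("header", True)
    .option("inferSchema", False)   # keep everything as strings
    .option("quote", '"')
    .option("escape", '"')          # for embedded double quotes
    .option("multiLine", True)      # for fields containing newlines
    .csv("./rules/rules.csv")
)
query_df.show(truncate=False)
```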
0
votes
0 answers
Pandas index operations in Pyspark
In our Data Science project we are working with pandas DataFrames and the numpy and scipy libraries, and we want to port the code to PySpark. We are facing issues like:
wst = cur_buck[:, [0]]
cur_buck[:, :-1] = cur_buck[:, 1:] - wst
cur_buck[:, -1] =…

Shubham
- 1
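pandas-on-Spark has no 2-D positional assignment, so NumPy slice writes like cur_buck[:, :-1] = … need a column-at-a-time rewrite. A minimal sketch, assuming cur_buck is a small numeric frame; the last assignment is left out because it is truncated in the question:

```python
import pyspark.pandas as ps

ps.set_option("compute.ops_on_diff_frames", True)  # allow cross-frame arithmetic

cur_buck = ps.DataFrame([[1, 2, 3], [4, 5, 6]], columns=["c0", "c1", "c2"])
cols = list(cur_buck.columns)

wst = cur_buck[cols[0]]                    # wst = cur_buck[:, [0]]
out = cur_buck.copy()                      # mimic NumPy's simultaneous update
for dst, src in zip(cols[:-1], cols[1:]):  # cur_buck[:, :-1] = cur_buck[:, 1:] - wst
    out[dst] = cur_buck[src] - wst
```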
0
votes
1 answer
convert nanosecond value into datetime using pyspark in databricks
I'm trying to recreate in Databricks some work I have already done in Python. I have a dataframe, and within it is a column called 'time' containing data in nanoseconds. In Python, I use the following code to convert the field into the appropriate…

JGW
- 314
- 4
- 18
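Assuming 'time' holds epoch nanoseconds, the usual Spark-side conversion is to scale to seconds and cast; precision below microseconds is dropped by the timestamp type:

```python
from pyspark.sql import functions as F

# epoch nanoseconds -> seconds -> timestamp ('time' is from the question)
df = df.withColumn(
    "time_ts",
    (F.col("time") / 1_000_000_000).cast("timestamp"),
)
```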
0
votes
1 answer
How to do a column wise lead and lag in pyspark?
I have a data frame in pyspark with columns like Quantity1, Quantity2, … QuantityN. I just want to sum up the previous 5 quantity fields' values into these Quantity fields. So in this case I have to do a column-wise lead or lag, but I haven't…

Asif Khan
- 143
- 12
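lead/lag are row-wise window functions; summing neighbouring columns is plain column arithmetic over a sliding window of column names. A sketch that assumes the columns are literally named Quantity1…QuantityN and come out of df.columns in positional order:

```python
from pyspark.sql import functions as F

qty_cols = [c for c in df.columns if c.startswith("Quantity")]
for i, c in enumerate(qty_cols):
    prev5 = qty_cols[max(0, i - 5):i]      # up to 5 columns to the left
    if prev5:
        # write the rolling sum to a new column rather than overwriting
        df = df.withColumn(
            f"{c}_prev5_sum",
            sum((F.col(p) for p in prev5), F.lit(0)),
        )
```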
0
votes
1 answer
Reshape the dataframe using PySpark/Pandas according to custom logic
I have a dataframe with structure similar as shown…

sinawa2195
- 5
- 3
0
votes
1 answer
PySpark error when getting shape of Dataframe using Pandas on spark API
My folder structure is currently this
|- logger
|--- __init__.py
|--- logger.py
|- another_package
|--- __init__.py
|--- module1.py
|- models
|--- model1
|------ main.py
|------ model1_utilities.py
The spark context and session are started in…

Nitinram Velraaj
- 11
- 2
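pandas-on-Spark needs an active SparkSession before any of its calls, and .shape actually triggers a count job. A sketch of the plain working pattern; whether session ordering across these packages is the real culprit here is an assumption:

```python
import pyspark.pandas as ps
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()  # must exist before ps calls

psdf = ps.range(10)   # pandas-on-Spark frame (requires pyspark >= 3.2)
print(psdf.shape)     # (10, 1) — runs a Spark job under the hood
```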
0
votes
1 answer
Repeated values in pyspark
I have a dataframe in pyspark where I have three columns:
df1 = spark.createDataFrame([
    ('a', 3, 4.2),
    ('a', 7, 4.2),
    ('b', 7, 2.6),
    ('c', 7, 7.21),
    ('c', 11, 7.21),
    ('c', 18, 7.21),
    ('d', 15, 9.0),
], ['model', 'number',…

sunny
- 11
- 5
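The title is ambiguous, but given that the third column repeats per model, a plausible reading is "find the duplicated values". A hedged sketch; the third column's name is cut off in the excerpt, so 'value' here is hypothetical:

```python
from pyspark.sql import functions as F

dupes = (
    df1.groupBy("model", "value")
       .agg(F.count("*").alias("cnt"))
       .filter(F.col("cnt") > 1)      # keep only repeated combinations
)
dupes.show()
```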
0
votes
1 answer
DataFrame Manipulation using pyspark-pandas
1998-02-10  1998-02-11  1998-02-12  1998-02-13  1998-02-14  1998-02-15  1998-02-16
        19          20          10          65          12           5          46
        10          17          15          45          10          20          45
        12          12 …

flash speedster
- 33
- 4
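The question text is truncated, but a frame with one column per date usually wants a wide-to-long reshape, which pandas-on-Spark supports directly. Purely a guessed example, assuming psdf is the frame pictured above:

```python
# melt the date columns into (date, value) pairs
long_df = psdf.melt(var_name="date", value_name="value")
```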
0
votes
1 answer
Compare two pairs of columns from two different pyspark dataframes to display the rows that differ
I've got this dataframe with four columns:
df1 = spark.createDataFrame([
    ('c', 'd', 3.0, 4),
    ('c', 'd', 7.3, 8),
    ('c', 'd', 7.3, 2),
    ('c', 'd', 7.3, 8),
    ('e', 'f', 6.0, 3),
    ('e', 'f', 6.0, 8),
    ('e', 'f', 6.0, 3),
    ('c',…

sunny
- 11
- 5
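For same-schema comparisons, exceptAll keeps the rows of one frame that have no matching row (with multiplicity) in the other. A sketch, assuming the two columns to compare are the first two; the column names c1 and c2 are placeholders since the excerpt leaves them unnamed:

```python
pair1 = df1.select("c1", "c2")
pair2 = df2.select("c1", "c2")

in_1_not_2 = pair1.exceptAll(pair2)   # rows of df1 missing from df2
in_2_not_1 = pair2.exceptAll(pair1)   # rows of df2 missing from df1
```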
0
votes
0 answers
Show() brings error after applying pandas udf to dataframe
I am having problems making this trial code work. The final line df.select(plus_one(col("x"))).show() doesn't work; I also tried saving it to a variable (vardf = df.select(plus_one(col("x"))), followed by vardf.show()) and that fails too.
import…

Paul Villagra
- 1
- 1
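For reference, a known-good minimal pandas UDF in the style of the Spark docs; if even this fails on .show(), the usual suspects are a pyarrow or pandas version incompatible with the installed pyspark rather than the UDF itself:

```python
import pandas as pd
from pyspark.sql.functions import col, pandas_udf

@pandas_udf("long")
def plus_one(s: pd.Series) -> pd.Series:
    return s + 1

df = spark.range(3).withColumnRenamed("id", "x")
df.select(plus_one(col("x"))).show()
```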
0
votes
1 answer
upload a sample pyspark dataframe to Azure blob, after converting it to excel format
I'm trying to upload a sample pyspark dataframe to Azure blob after converting it to excel format, and I'm getting the error below. A snippet of my sample code is also below.
If there is another way to do the same, please let me know.
from…

kanishk kashyap
- 63
- 1
- 10
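Spark has no native Excel writer, so one route is Spark DataFrame → pandas → in-memory .xlsx → blob upload. A sketch, not a verified recipe; the connection string, container, and blob names are placeholders, and it assumes the frame is small enough to collect:

```python
import io
from azure.storage.blob import BlobClient

pdf = sdf.toPandas()             # driver-side pandas copy of the frame

buf = io.BytesIO()
pdf.to_excel(buf, index=False)   # requires openpyxl to be installed
buf.seek(0)

blob = BlobClient.from_connection_string(
    conn_str="<connection-string>",
    container_name="<container>",
    blob_name="sample.xlsx",
)
blob.upload_blob(buf, overwrite=True)
```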
0
votes
2 answers
I want to obtain the max value of a column depending on two other columns, and for the fourth column the value of the most repeated number
I've got this dataframe:
df1 = spark.createDataFrame([
    ('c', 'd', 3.0, 4),
    ('c', 'd', 7.3, 8),
    ('c', 'd', 7.3, 2),
    ('c', 'd', 7.3, 8),
    ('e', 'f', 6.0, 3),
    ('e', 'f', 6.0, 8),
    ('e', 'f', 6.0, 3),
    ('c', 'j', 4.2, 3),
    …

sunny
- 11
- 5
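One way to read this: per (first, second) column pair, take the max of the third column and the most frequent value of the fourth. A sketch with placeholder names c1…c4, since the excerpt leaves the columns unnamed:

```python
from pyspark.sql import functions as F
from pyspark.sql.window import Window

# max of the third column per group
max_df = df1.groupBy("c1", "c2").agg(F.max("c3").alias("max_c3"))

# mode of the fourth column per group: count occurrences, keep the top one
w = Window.partitionBy("c1", "c2").orderBy(F.desc("cnt"))
mode_df = (
    df1.groupBy("c1", "c2", "c4").agg(F.count("*").alias("cnt"))
       .withColumn("rn", F.row_number().over(w))
       .filter(F.col("rn") == 1)
       .select("c1", "c2", F.col("c4").alias("mode_c4"))
)

result = max_df.join(mode_df, ["c1", "c2"])
```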