Questions tagged [pyspark-pandas]

131 questions
0
votes
1 answer

How to filter out rows with lots of conditions in pyspark?

Let's say that these are my data:

Product_Number | Condition | Type     | Country
1              | New       | Chainsaw | USA
1              | Old       | Chainsaw | USA
1              | Null      | Chainsaw | USA
2              | Old       | …
enas dyo
  • 35
  • 6
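A common way to express this kind of multi-condition filter is to combine Column expressions with & and |; a minimal sketch, assuming the column names from the excerpt:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame(
    [(1, "New", "Chainsaw", "USA"),
     (1, "Old", "Chainsaw", "USA"),
     (1, None, "Chainsaw", "USA")],
    ["Product_Number", "Condition", "Type", "Country"],
)

# Combine conditions with & (and) / | (or); the parentheses are
# required because & binds tighter than comparisons in Python.
filtered = df.filter(
    (F.col("Country") == "USA")
    & F.col("Condition").isNotNull()
    & (F.col("Condition") != "Old")
)
filtered.show()
```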
0
votes
0 answers

pyspark.sql.utils.AnalysisException: nondeterministic expressions are only allowed in Project, Filter, Aggregate or Window, found: exists()

Spark version: 3.3.0, pyspark version: 3.1.1, python version: 3.7.9. I am trying to work with the functionality of pyspark.pandas. I created a pyspark.pandas dataframe and converted it into a Spark dataframe using the df.to_spark() function. After that I…
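For context, the conversion the excerpt describes looks like the sketch below; the exists() expression that actually triggers the error is cut off in the excerpt, so only the pandas-on-Spark round trip is shown:

```python
import pyspark.pandas as ps

# Build a pandas-on-Spark dataframe, then convert it to a plain
# Spark dataframe; operations on the result are ordinary Spark SQL.
psdf = ps.DataFrame({"x": [1, 2, 3], "y": ["a", "b", "c"]})
sdf = psdf.to_spark()
sdf.printSchema()
```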
0
votes
1 answer

I want to add a new column to a dataframe with values that I get from a for loop

I've written the below code:

def func_sql(table, query):
    q = spark.sql(eval(f"f'{query}'")).collect()[0][0]
    print(q)

lst = []
for i in range(int_df.count()):
    lst.append(int_df.collect()[i][3])
# print(lst)
for x in lst: …
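As a general pattern, collecting a dataframe row by row on the driver and looping over it is slow; deriving the new column with an expression keeps the work distributed. A minimal sketch with a hypothetical int_df and hypothetical column names:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

# Hypothetical stand-in for the int_df in the question.
int_df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "val"])

# Instead of collect()-ing and looping on the driver, compute the
# new column with a Column expression so it stays distributed.
result = int_df.withColumn("new_col", F.upper(F.col("val")))
result.show()
```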
0
votes
0 answers

Missing original values when converting csv file to dataframe

I have a csv file which looks like this. I've read it and stored it in a dataframe as follows:

query_df = spark.read.csv('./rules/rules.csv', header=True)
query_df.show(truncate=False)

However when I view this dataframe using the .show() method…
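One frequent reason that .show() displays values that differ from the raw file is Spark's default quote/escape handling. The sketch below makes those options explicit; the option values are illustrative assumptions, not a verified fix for the asker's file:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Being explicit about quoting, escaping, and multi-line records
# often preserves the original field text.
query_df = (
    spark.read
    .option("header", True)
    .option("quote", '"')
    .option("escape", '"')
    .option("multiLine", True)
    .csv("./rules/rules.csv")
)
query_df.show(truncate=False)
```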
0
votes
0 answers

Pandas index operations in Pyspark

In our data science project we are working with pandas dataframes and the numpy and scipy libraries, and we want to port the code to PySpark. We are facing issues with code like:

wst = cur_buck[:, [0]]
cur_buck[:, :-1] = cur_buck[:, 1:] - wst
cur_buck[:, -1] = …
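pandas-on-Spark has no NumPy-style positional assignment, but same-frame column arithmetic translates directly; a sketch of the first two lines of the snippet, with hypothetical column names:

```python
import pyspark.pandas as ps

# Hypothetical numeric frame standing in for cur_buck.
cur_buck = ps.DataFrame({"c0": [1.0, 2.0], "c1": [3.0, 4.0], "c2": [5.0, 6.0]})

# Subtract the first column from each remaining column, mirroring
# cur_buck[:, 1:] - cur_buck[:, [0]] in NumPy.
wst = cur_buck["c0"]
for c in ["c1", "c2"]:
    cur_buck[c] = cur_buck[c] - wst

print(cur_buck)
```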
0
votes
1 answer

Convert a nanosecond value into a datetime using PySpark in Databricks

I'm trying to recreate some work I have already done in Python using Databricks. I have a dataframe containing a column called 'time', with values in nanoseconds. In Python, I use the following code to convert the field into the appropriate…
JGW
  • 314
  • 4
  • 18
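Spark timestamps carry microsecond precision, so a common approach is to scale the nanosecond value down to seconds and cast; a minimal sketch with a hypothetical epoch-nanosecond value:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

# Hypothetical data: epoch time in nanoseconds.
df = spark.createDataFrame([(1672531200123456789,)], ["time"])

# Divide to seconds and cast; anything below microsecond precision
# is necessarily lost, since that is the limit of Spark timestamps.
df = df.withColumn("ts", (F.col("time") / 1e9).cast("timestamp"))
df.show(truncate=False)
```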
0
votes
1 answer

How to do a column-wise lead and lag in pyspark?

I have a dataframe in pyspark with columns like Quantity1, Quantity2, ……Quantity. I just want to sum up the previous 5 quantity fields' values into these Quantity fields. So in this case I have to do a column-wise lead or lag, but I haven't…
Asif Khan
  • 143
  • 12
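Since lead and lag window functions operate across rows, a column-wise version is usually just arithmetic over a sliding window of column names; a sketch assuming sequentially numbered Quantity columns:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

# Hypothetical frame with sequentially named quantity columns.
cols = [f"Quantity{i}" for i in range(1, 8)]
df = spark.createDataFrame([tuple(range(1, 8))], cols)

# For each QuantityN, sum the up-to-5 columns that precede it in
# the column list; this is the column-wise analogue of lag.
for i, c in enumerate(cols):
    prev = cols[max(0, i - 5):i]
    if prev:
        df = df.withColumn(f"{c}_prev5_sum", sum(F.col(p) for p in prev))
df.show()
```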
0
votes
1 answer

Reshape the dataframe using PySpark/Pandas according to custom logic

I have a dataframe with a structure similar to that shown…
0
votes
1 answer

PySpark error when getting the shape of a DataFrame using the pandas-on-Spark API

My folder structure is currently this:

|- logger
|--- __init__.py
|--- logger.py
|- another_package
|--- __init__.py
|--- module1.py
|- models
|--- model1
|------ main.py
|------ model1_utilities.py

The spark context and session are started in…
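The excerpt cuts off before the error itself, so only the basic call is sketched here: wrapping a Spark dataframe with pandas_api() and reading .shape (which triggers a count for the row dimension):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Hypothetical dataframe; any Spark dataframe works the same way.
sdf = spark.range(10).withColumnRenamed("id", "x")

# pandas_api() (Spark 3.2+) exposes the pandas-on-Spark API;
# .shape runs a count() under the hood for the row dimension.
psdf = sdf.pandas_api()
print(psdf.shape)  # (10, 1)
```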
0
votes
1 answer

Repeated values in pyspark

I have a dataframe in pyspark with three columns:

df1 = spark.createDataFrame([
    ('a', 3, 4.2),
    ('a', 7, 4.2),
    ('b', 7, 2.6),
    ('c', 7, 7.21),
    ('c', 11, 7.21),
    ('c', 18, 7.21),
    ('d', 15, 9.0),
], ['model', 'number', …
sunny
  • 11
  • 5
0
votes
1 answer

DataFrame Manipulation using pyspark-pandas

1998-02-10  1998-02-11  1998-02-12  1998-02-13  1998-02-14  1998-02-15  1998-02-16
19          20          10          65          12          5           46
10          17          15          45          10          20          45
12          12          …
0
votes
1 answer

Compare two pairs of columns from two different PySpark dataframes to display the rows that differ

I've got this dataframe with four columns:

df1 = spark.createDataFrame([
    ('c', 'd', 3.0, 4),
    ('c', 'd', 7.3, 8),
    ('c', 'd', 7.3, 2),
    ('c', 'd', 7.3, 8),
    ('e', 'f', 6.0, 3),
    ('e', 'f', 6.0, 8),
    ('e', 'f', 6.0, 3),
    ('c', …
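One way to surface pairs present in one dataframe but not the other is a left_anti join on the two columns; a minimal sketch with hypothetical data:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

df1 = spark.createDataFrame([('c', 'd'), ('e', 'f')], ['a', 'b'])
df2 = spark.createDataFrame([('c', 'd'), ('g', 'h')], ['a', 'b'])

# A left_anti join keeps only rows from the left side whose
# (a, b) pair has no match on the right side.
diff = df1.join(df2, on=['a', 'b'], how='left_anti')
diff.show()  # rows in df1 whose pair is missing from df2
```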
0
votes
0 answers

show() raises an error after applying a pandas UDF to a dataframe

I am having problems making this trial code work. The final line df.select(plus_one(col("x"))).show() doesn't work. I also tried saving the result in a variable (vardf = df.select(plus_one(col("x"))), followed by vardf.show()) and it fails too. import…
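For comparison, a pandas UDF that evaluates cleanly under show() looks like the sketch below; note that pandas UDFs need pyarrow installed on both the driver and the executors, a frequent cause of failures that only appear when show() actually evaluates the plan:

```python
import pandas as pd
from pyspark.sql import SparkSession
from pyspark.sql.functions import pandas_udf, col

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(1,), (2,), (3,)], ["x"])

# A Series-to-Series pandas UDF; the string argument declares the
# return type of the column it produces.
@pandas_udf("long")
def plus_one(s: pd.Series) -> pd.Series:
    return s + 1

df.select(plus_one(col("x")).alias("x_plus_one")).show()
```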
0
votes
1 answer

Upload a sample PySpark dataframe to Azure Blob Storage after converting it to Excel format

I'm trying to upload a sample pyspark dataframe to Azure blob after converting it to excel format, and I'm getting the below error. Below is a snippet of my sample code. If there is another way to do the same, please let me know. from…
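One possible route, sketched under assumptions (azure-storage-blob v12 and openpyxl installed, a small dataframe that fits on the driver, and placeholder container/connection values): write the Excel file to an in-memory buffer and upload it as a blob.

```python
import io
import pandas as pd
from azure.storage.blob import BlobServiceClient

# Placeholder connection details; replace with your own.
CONN_STR = "<azure-storage-connection-string>"

# sdf is assumed to be the sample Spark dataframe from the question;
# toPandas() brings it to the driver, so it must be small.
pdf = sdf.toPandas()

# Write the Excel file into memory rather than to local disk.
buf = io.BytesIO()
pdf.to_excel(buf, index=False)  # requires openpyxl
buf.seek(0)

service = BlobServiceClient.from_connection_string(CONN_STR)
blob = service.get_blob_client(container="my-container", blob="sample.xlsx")
blob.upload_blob(buf, overwrite=True)
```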
0
votes
2 answers

I want to obtain the max value of a column depending on two other columns, and for the fourth column the most repeated value

I've got this dataframe:

df1 = spark.createDataFrame([
    ('c', 'd', 3.0, 4),
    ('c', 'd', 7.3, 8),
    ('c', 'd', 7.3, 2),
    ('c', 'd', 7.3, 8),
    ('e', 'f', 6.0, 3),
    ('e', 'f', 6.0, 8),
    ('e', 'f', 6.0, 3),
    ('c', 'j', 4.2, 3), …
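A sketch of one approach, with assumed column names: take the max per group with an aggregation, and get the most repeated value of the fourth column by counting occurrences and keeping the top row per group with a window:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.window import Window

spark = SparkSession.builder.getOrCreate()

df1 = spark.createDataFrame([
    ('c', 'd', 3.0, 4), ('c', 'd', 7.3, 8), ('c', 'd', 7.3, 2),
    ('c', 'd', 7.3, 8), ('e', 'f', 6.0, 3), ('e', 'f', 6.0, 8),
    ('e', 'f', 6.0, 3), ('c', 'j', 4.2, 3),
], ['a', 'b', 'value', 'num'])  # column names are assumptions

# Max of `value` per (a, b) pair.
max_df = df1.groupBy('a', 'b').agg(F.max('value').alias('max_value'))

# Most repeated `num` per (a, b): count occurrences, then keep the
# most frequent row per group via a window ordered by frequency.
freq = df1.groupBy('a', 'b', 'num').count()
w = Window.partitionBy('a', 'b').orderBy(F.desc('count'))
mode_df = (freq.withColumn('rn', F.row_number().over(w))
               .filter('rn = 1')
               .select('a', 'b', 'num'))

max_df.join(mode_df, on=['a', 'b']).show()
```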