Questions tagged [pyspark-pandas]

131 questions
0
votes
1 answer

How to filter out rows with lots of conditions in pyspark?

Let's say that these are my data:

Product_Number | Condition | Type     | Country
1              | New       | Chainsaw | USA
1              | Old       | Chainsaw | USA
1              | Null      | Chainsaw | USA
2              | Old       | …
enas dyo
  • 35
  • 6
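A common way to express this kind of multi-condition filter is to combine Column expressions with & and |; a minimal sketch, assuming the column names from the excerpt:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame(
    [(1, "New", "Chainsaw", "USA"),
     (1, "Old", "Chainsaw", "USA"),
     (1, None, "Chainsaw", "USA")],
    ["Product_Number", "Condition", "Type", "Country"],
)

# Combine conditions with & (and) / | (or); the parentheses are
# required because & binds tighter than comparisons in Python.
filtered = df.filter(
    (F.col("Country") == "USA")
    & F.col("Condition").isNotNull()
    & (F.col("Condition") != "Old")
)
filtered.show()
```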
0
votes
0 answers

pyspark.sql.utils.AnalysisException: nondeterministic expressions are only allowed in Project, Filter, Aggregate or Window, found: exists()

Spark version: 3.3.0, pyspark version: 3.1.1, python version: 3.7.9. I am trying to work with the functionality of pyspark.pandas. I created a pyspark.pandas dataframe and converted it into a Spark dataframe using the df.to_spark() function. After that I…
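For context, the conversion the excerpt describes looks like the sketch below; the exists() expression that actually triggers the error is cut off in the excerpt, so only the pandas-on-Spark round trip is shown:

```python
import pyspark.pandas as ps

# Build a pandas-on-Spark dataframe, then convert it to a plain
# Spark dataframe; operations on the result are ordinary Spark SQL.
psdf = ps.DataFrame({"x": [1, 2, 3], "y": ["a", "b", "c"]})
sdf = psdf.to_spark()
sdf.printSchema()
```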
0
votes
1 answer

I want to add a new column to a dataframe with values that I get from a for loop

I've written the below code:

def func_sql(table, query):
    q = spark.sql(eval(f"f'{query}'")).collect()[0][0]
    print(q)

lst = []
for i in range(int_df.count()):
    lst.append(int_df.collect()[i][3])
# print(lst)
for x in lst: …
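As a general pattern, collecting a dataframe row by row on the driver and looping over it is slow; deriving the new column with an expression keeps the work distributed. A minimal sketch with a hypothetical int_df and hypothetical column names:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

# Hypothetical stand-in for the int_df in the question.
int_df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "val"])

# Instead of collect()-ing and looping on the driver, compute the
# new column with a Column expression so it stays distributed.
result = int_df.withColumn("new_col", F.upper(F.col("val")))
result.show()
```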
0
votes
0 answers

Missing original values when converting csv file to dataframe

I have a csv file which looks like this. I've read it and stored it in a dataframe as follows:

query_df = spark.read.csv('./rules/rules.csv', header=True)
query_df.show(truncate=False)

However when I view this dataframe using the .show() method…
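One frequent reason that .show() displays values that differ from the raw file is Spark's default quote/escape handling. The sketch below makes those options explicit; the option values are illustrative assumptions, not a verified fix for the asker's file:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Being explicit about quoting, escaping, and multi-line records
# often preserves the original field text.
query_df = (
    spark.read
    .option("header", True)
    .option("quote", '"')
    .option("escape", '"')
    .option("multiLine", True)
    .csv("./rules/rules.csv")
)
query_df.show(truncate=False)
```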
0
votes
0 answers

Pandas index operations in Pyspark

In our data science project we are working with pandas dataframes and the numpy and scipy libraries, and we want to port the code to PySpark. We are facing issues with code like:

wst = cur_buck[:, [0]]
cur_buck[:, :-1] = cur_buck[:, 1:] - wst
cur_buck[:, -1] = …
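pandas-on-Spark has no NumPy-style positional assignment, but same-frame column arithmetic translates directly; a sketch of the first two lines of the snippet, with hypothetical column names:

```python
import pyspark.pandas as ps

# Hypothetical numeric frame standing in for cur_buck.
cur_buck = ps.DataFrame({"c0": [1.0, 2.0], "c1": [3.0, 4.0], "c2": [5.0, 6.0]})

# Subtract the first column from each remaining column, mirroring
# cur_buck[:, 1:] - cur_buck[:, [0]] in NumPy.
wst = cur_buck["c0"]
for c in ["c1", "c2"]:
    cur_buck[c] = cur_buck[c] - wst

print(cur_buck)
```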
0
votes
1 answer

Convert a nanosecond value into a datetime using PySpark in Databricks

I'm trying to recreate some work I have already done in Python using Databricks. I have a dataframe containing a column called 'time', with values in nanoseconds. In Python, I use the following code to convert the field into the appropriate…
JGW
  • 314
  • 4
  • 18
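Spark timestamps carry microsecond precision, so a common approach is to scale the nanosecond value down to seconds and cast; a minimal sketch with a hypothetical epoch-nanosecond value:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

# Hypothetical data: epoch time in nanoseconds.
df = spark.createDataFrame([(1672531200123456789,)], ["time"])

# Divide to seconds and cast; anything below microsecond precision
# is necessarily lost, since that is the limit of Spark timestamps.
df = df.withColumn("ts", (F.col("time") / 1e9).cast("timestamp"))
df.show(truncate=False)
```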
0
votes
1 answer

How to do a column-wise lead and lag in pyspark?

I have a dataframe in pyspark with columns like Quantity1, Quantity2, ……Quantity. I just want to sum up the previous 5 quantity fields' values into these Quantity fields. So in this case I have to do a column-wise lead or lag, but I haven't…
Asif Khan
  • 143
  • 12
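Since lead and lag window functions operate across rows, a column-wise version is usually just arithmetic over a sliding window of column names; a sketch assuming sequentially numbered Quantity columns:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

# Hypothetical frame with sequentially named quantity columns.
cols = [f"Quantity{i}" for i in range(1, 8)]
df = spark.createDataFrame([tuple(range(1, 8))], cols)

# For each QuantityN, sum the up-to-5 columns that precede it in
# the column list; this is the column-wise analogue of lag.
for i, c in enumerate(cols):
    prev = cols[max(0, i - 5):i]
    if prev:
        df = df.withColumn(f"{c}_prev5_sum", sum(F.col(p) for p in prev))
df.show()
```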
0
votes
1 answer

Reshape the dataframe using PySpark/Pandas according to custom logic

I have a dataframe with a structure similar to that shown…
0
votes
1 answer

PySpark error when getting the shape of a DataFrame using the pandas-on-Spark API

My folder structure is currently this:

|- logger
|--- __init__.py
|--- logger.py
|- another_package
|--- __init__.py
|--- module1.py
|- models
|--- model1
|------ main.py
|------ model1_utilities.py

The spark context and session are started in…
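The excerpt cuts off before the error itself, so only the basic call is sketched here: wrapping a Spark dataframe with pandas_api() and reading .shape (which triggers a count for the row dimension):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Hypothetical dataframe; any Spark dataframe works the same way.
sdf = spark.range(10).withColumnRenamed("id", "x")

# pandas_api() (Spark 3.2+) exposes the pandas-on-Spark API;
# .shape runs a count() under the hood for the row dimension.
psdf = sdf.pandas_api()
print(psdf.shape)  # (10, 1)
```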
0
votes
1 answer

Repeated values in pyspark

I have a dataframe in pyspark with three columns:

df1 = spark.createDataFrame([
    ('a', 3, 4.2),
    ('a', 7, 4.2),
    ('b', 7, 2.6),
    ('c', 7, 7.21),
    ('c', 11, 7.21),
    ('c', 18, 7.21),
    ('d', 15, 9.0),
], ['model', 'number', …
sunny
  • 11
  • 5
0
votes
1 answer

DataFrame Manipulation using pyspark-pandas

1998-02-10  1998-02-11  1998-02-12  1998-02-13  1998-02-14  1998-02-15  1998-02-16
19          20          10          65          12          5           46
10          17          15          45          10          20          45
12          12          …
0
votes
1 answer

Compare two pairs of columns from two different PySpark dataframes to display the rows that differ

I've got this dataframe with four columns:

df1 = spark.createDataFrame([
    ('c', 'd', 3.0, 4),
    ('c', 'd', 7.3, 8),
    ('c', 'd', 7.3, 2),
    ('c', 'd', 7.3, 8),
    ('e', 'f', 6.0, 3),
    ('e', 'f', 6.0, 8),
    ('e', 'f', 6.0, 3),
    ('c', …
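One way to surface pairs present in one dataframe but not the other is a left_anti join on the two columns; a minimal sketch with hypothetical data:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

df1 = spark.createDataFrame([('c', 'd'), ('e', 'f')], ['a', 'b'])
df2 = spark.createDataFrame([('c', 'd'), ('g', 'h')], ['a', 'b'])

# A left_anti join keeps only rows from the left side whose
# (a, b) pair has no match on the right side.
diff = df1.join(df2, on=['a', 'b'], how='left_anti')
diff.show()  # rows in df1 whose pair is missing from df2
```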
0
votes
0 answers

show() raises an error after applying a pandas UDF to a dataframe

I am having problems making this trial code work. The final line df.select(plus_one(col("x"))).show() doesn't work. I also tried saving the result in a variable (vardf = df.select(plus_one(col("x"))), followed by vardf.show()) and it fails too. import…
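For comparison, a pandas UDF that evaluates cleanly under show() looks like the sketch below; note that pandas UDFs need pyarrow installed on both the driver and the executors, a frequent cause of failures that only appear when show() actually evaluates the plan:

```python
import pandas as pd
from pyspark.sql import SparkSession
from pyspark.sql.functions import pandas_udf, col

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(1,), (2,), (3,)], ["x"])

# A Series-to-Series pandas UDF; the string argument declares the
# return type of the column it produces.
@pandas_udf("long")
def plus_one(s: pd.Series) -> pd.Series:
    return s + 1

df.select(plus_one(col("x")).alias("x_plus_one")).show()
```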
0
votes
1 answer

Upload a sample PySpark dataframe to Azure Blob Storage after converting it to Excel format

I'm trying to upload a sample pyspark dataframe to Azure blob after converting it to excel format, and I'm getting the below error. Below is a snippet of my sample code. If there is another way to do the same, please let me know. from…
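One possible route, sketched under assumptions (azure-storage-blob v12 and openpyxl installed, a small dataframe that fits on the driver, and placeholder container/connection values): write the Excel file to an in-memory buffer and upload it as a blob.

```python
import io
import pandas as pd
from azure.storage.blob import BlobServiceClient

# Placeholder connection details; replace with your own.
CONN_STR = "<azure-storage-connection-string>"

# sdf is assumed to be the sample Spark dataframe from the question;
# toPandas() brings it to the driver, so it must be small.
pdf = sdf.toPandas()

# Write the Excel file into memory rather than to local disk.
buf = io.BytesIO()
pdf.to_excel(buf, index=False)  # requires openpyxl
buf.seek(0)

service = BlobServiceClient.from_connection_string(CONN_STR)
blob = service.get_blob_client(container="my-container", blob="sample.xlsx")
blob.upload_blob(buf, overwrite=True)
```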
0
votes
2 answers

I want to obtain the max value of a column depending on two other columns, and for the fourth column the most repeated value

I've got this dataframe:

df1 = spark.createDataFrame([
    ('c', 'd', 3.0, 4),
    ('c', 'd', 7.3, 8),
    ('c', 'd', 7.3, 2),
    ('c', 'd', 7.3, 8),
    ('e', 'f', 6.0, 3),
    ('e', 'f', 6.0, 8),
    ('e', 'f', 6.0, 3),
    ('c', 'j', 4.2, 3), …
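A sketch of one approach, with assumed column names: take the max per group with an aggregation, and get the most repeated value of the fourth column by counting occurrences and keeping the top row per group with a window:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.window import Window

spark = SparkSession.builder.getOrCreate()

df1 = spark.createDataFrame([
    ('c', 'd', 3.0, 4), ('c', 'd', 7.3, 8), ('c', 'd', 7.3, 2),
    ('c', 'd', 7.3, 8), ('e', 'f', 6.0, 3), ('e', 'f', 6.0, 8),
    ('e', 'f', 6.0, 3), ('c', 'j', 4.2, 3),
], ['a', 'b', 'value', 'num'])  # column names are assumptions

# Max of `value` per (a, b) pair.
max_df = df1.groupBy('a', 'b').agg(F.max('value').alias('max_value'))

# Most repeated `num` per (a, b): count occurrences, then keep the
# most frequent row per group via a window ordered by frequency.
freq = df1.groupBy('a', 'b', 'num').count()
w = Window.partitionBy('a', 'b').orderBy(F.desc('count'))
mode_df = (freq.withColumn('rn', F.row_number().over(w))
               .filter('rn = 1')
               .select('a', 'b', 'num'))

max_df.join(mode_df, on=['a', 'b']).show()
```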