I was learning PySpark (which uses HiveQL) and found it interesting that percent_rank gives two different answers in pyspark-sql and pandas.
Question source with SQL code: https://www.windowfunctions.com/questions/ranking/3
How can I get the same result as the SQL in pandas?
Two questions:
- What is the pandas code that gives the same result as the SQL?
- What is the SQL code that gives the same result as pandas?
pyspark-sql
q = """
select name, weight,
percent_rank() over (order by weight) as percent_rank_wt
from cats
order by weight
"""
spark.sql(q).show()
The SQL gives this table; I would like the same table using pandas.
+-------+------+-------------------+
| name|weight| percent_rank_wt|
+-------+------+-------------------+
| Tigger| 3.8| 0.0|
| Molly| 4.2|0.09090909090909091|
| Ashes| 4.5|0.18181818181818182|
|Charlie| 4.8| 0.2727272727272727|
| Smudge| 4.9|0.36363636363636365|
| Felix| 5.0|0.45454545454545453|
| Puss| 5.1| 0.5454545454545454|
| Millie| 5.4| 0.6363636363636364|
| Alfie| 5.5| 0.7272727272727273|
| Misty| 5.7| 0.8181818181818182|
| Oscar| 6.1| 0.9090909090909091|
| Smokey| 6.1| 0.9090909090909091|
+-------+------+-------------------+
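For reference, SQL's percent_rank() is defined as (rank - 1) / (rows - 1), with ties sharing the lowest rank. A sketch of reproducing that formula in pandas, using the same names and weights as the cats table in the setup below:

```python
import pandas as pd

# same names and weights as the cats table in the setup
df = pd.DataFrame({
    'name': ['Molly', 'Ashes', 'Felix', 'Smudge', 'Tigger', 'Alfie', 'Oscar',
             'Millie', 'Misty', 'Puss', 'Smokey', 'Charlie'],
    'weight': [4.2, 4.5, 5.0, 4.9, 3.8, 5.5, 6.1, 5.4, 5.7, 5.1, 6.1, 4.8],
})

# percent_rank() = (rank - 1) / (n - 1); method='min' makes ties share the lowest rank
df['percent_rank_wt'] = (df['weight'].rank(method='min') - 1) / (len(df) - 1)
print(df.sort_values('weight'))
```

With 12 rows this gives 0.0 for Tigger and 10/11 ≈ 0.909 for both 6.1 kg cats, matching the Spark output above.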
pandas
# pandas rank() methods: 'average', 'min', 'max', 'first', 'dense'
df[['name', 'weight']].assign(
    pct_avg=df['weight'].rank(pct=True, method='average'),
    pct_min=df['weight'].rank(pct=True, method='min'),
    pct_max=df['weight'].rank(pct=True, method='max'),
    pct_first=df['weight'].rank(pct=True, method='first'),
    pct_dense=df['weight'].rank(pct=True, method='dense')
).sort_values('weight')
name weight pct_avg pct_min pct_max pct_first pct_dense
4 Tigger 3.8 0.083333 0.083333 0.083333 0.083333 0.090909
0 Molly 4.2 0.166667 0.166667 0.166667 0.166667 0.181818
1 Ashes 4.5 0.250000 0.250000 0.250000 0.250000 0.272727
11 Charlie 4.8 0.333333 0.333333 0.333333 0.333333 0.363636
3 Smudge 4.9 0.416667 0.416667 0.416667 0.416667 0.454545
2 Felix 5.0 0.500000 0.500000 0.500000 0.500000 0.545455
9 Puss 5.1 0.583333 0.583333 0.583333 0.583333 0.636364
7 Millie 5.4 0.666667 0.666667 0.666667 0.666667 0.727273
5 Alfie 5.5 0.750000 0.750000 0.750000 0.750000 0.818182
8 Misty 5.7 0.833333 0.833333 0.833333 0.833333 0.909091
6 Oscar 6.1 0.958333 0.916667 1.000000 0.916667 1.000000
10 Smokey 6.1 0.958333 0.916667 1.000000 1.000000 1.000000
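For the second question: pandas' rank(pct=True, method='max') computes the fraction of rows with a value less than or equal to the current row's value, which corresponds to SQL's cume_dist() window function (e.g. `cume_dist() over (order by weight)` in Spark SQL) rather than percent_rank(). A small check of that equivalence in pure pandas, assuming the same weights:

```python
import pandas as pd

weights = pd.Series([4.2, 4.5, 5.0, 4.9, 3.8, 5.5, 6.1, 5.4, 5.7, 5.1, 6.1, 4.8])

# cume_dist(): fraction of rows with value <= the current row's value
cume_dist = weights.apply(lambda w: (weights <= w).mean())

# pandas pct rank with method='max' gives ties their highest rank, divided by n
pct_max = weights.rank(pct=True, method='max')

print(cume_dist.equals(pct_max))  # → True
```

The other pandas methods ('average', 'min', 'first', 'dense') have no single built-in SQL window-function counterpart, which is why none of the columns in the table above matches percent_rank() exactly.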
setup
import numpy as np
import pandas as pd
import pyspark
from pyspark.sql.types import *
from pyspark.sql import functions as F
from pyspark.sql.window import Window
from pyspark import SparkConf, SparkContext, SQLContext
spark = pyspark.sql.SparkSession.builder.appName('app').getOrCreate()
sc = spark.sparkContext
sqlContext = SQLContext(sc)
df = pd.DataFrame({
'name': [
'Molly', 'Ashes', 'Felix', 'Smudge', 'Tigger', 'Alfie', 'Oscar',
'Millie', 'Misty', 'Puss', 'Smokey', 'Charlie'
],
'breed': [
'Persian', 'Persian', 'Persian', 'British Shorthair',
'British Shorthair', 'Siamese', 'Siamese', 'Maine Coon', 'Maine Coon',
'Maine Coon', 'Maine Coon', 'British Shorthair'
],
'weight': [4.2, 4.5, 5.0, 4.9, 3.8, 5.5, 6.1, 5.4, 5.7, 5.1, 6.1, 4.8],
'color': [
'Black', 'Black', 'Tortoiseshell', 'Black', 'Tortoiseshell', 'Brown',
'Black', 'Tortoiseshell', 'Brown', 'Tortoiseshell', 'Brown', 'Black'
],
'age': [1, 5, 2, 4, 2, 5, 1, 5, 2, 2, 4, 4]
})
schema = StructType([
StructField('name', StringType(), True),
StructField('breed', StringType(), True),
StructField('weight', DoubleType(), True),
StructField('color', StringType(), True),
StructField('age', IntegerType(), True),
])
sdf = spark.createDataFrame(df, schema)
sdf.createOrReplaceTempView("cats")