
How can I change the value of a column depending on a validation between cells in different rows? I need to compare the kilometraje values of each customer's (id) records to check whether each following record has a kilometraje that is higher than the previous one.

fecha      id  estado  id_cliente  error_code  kilometraje  error_km
1/1/2019   1   A       1                       10
2/1/2019   2   A                   ERROR       20
3/1/2019   1   D       1           ERROR       30
4/1/2019   2   O                                            ERROR

The ERROR in the error_km column is there because for customer (id) 2 the kilometraje value is lower than in the same customer's record from 2/1/2019. (As time passes the car is used, so the kilometraje increases; for there to be no error, the mileage has to be higher than or equal to the previous one.)

I know that with withColumn I can overwrite or create a column that doesn't exist, and that using when I can set conditions. For example, this is the code I use to validate the estado and id_cliente columns and overwrite error_code with ERROR where applicable, but I don't understand how to validate between different rows for the same client.

from pyspark import StorageLevel
from pyspark.sql import functions as F
from pyspark.sql.functions import lit

file_path = 'archive.txt'

error = 'ERROR'

df = spark.read.parquet(file_path)
df = df.persist(StorageLevel.MEMORY_AND_DISK)
df = df.select('estado', 'id_cliente')
df = df.withColumn("error_code", lit(''))

# Append ':[ERROR]' to error_code when estado and id_cliente are inconsistent.
df = df.withColumn('error_code',
                   F.when((F.col('estado') == 'O') & (F.col('id_cliente') != '') |
                          (F.col('estado') == 'D') & (F.col('id_cliente') != '') |
                          (F.col('estado') == 'A') & (F.col('id_cliente') == ''),
                          F.concat(F.col('error_code'), F.lit(':[{}]'.format(error))))
                    .otherwise(F.col('error_code')))
Juan
  • Please don't post screenshots. They just lead to more work for people who answer your [questions](https://meta.stackoverflow.com/questions/285551/why-not-upload-images-of-code-on-so-when-asking-a-question/285557#285557). – cronoik Jul 03 '19 at 23:38
  • Yes, I know, I'm not at the PC yet; I'll fix it in a while – Juan Jul 04 '19 at 01:05

2 Answers

2

You can achieve that with the lag window function. lag gives you the value of the previous row within the window, and with that you can easily compare the kilometraje values. Have a look at the code below:

import pyspark.sql.functions as F
from pyspark.sql import Window

l = [('1/1/2019', 1, 10),
     ('2/1/2019', 2, 20),
     ('3/1/2019', 1, 30),
     ('4/1/2019', 1, 10),
     ('5/1/2019', 1, 30),
     ('7/1/2019', 3, 30),
     ('4/1/2019', 2, 5)]

columns = ['fecha', 'id', 'kilometraje']

df=spark.createDataFrame(l, columns)
df = df.withColumn('fecha',F.to_date(df.fecha,  'dd/MM/yyyy'))

w = Window.partitionBy('id').orderBy('fecha')

df = df.withColumn('error_km', F.when(F.lag('kilometraje').over(w) > df.kilometraje, F.lit('ERROR') ).otherwise(F.lit('')))

df.show()

Output:

+----------+---+-----------+--------+ 
|     fecha| id|kilometraje|error_km| 
+----------+---+-----------+--------+ 
|2019-01-01|  1|         10|        | 
|2019-01-03|  1|         30|        | 
|2019-01-04|  1|         10|   ERROR| 
|2019-01-05|  1|         30|        | 
|2019-01-07|  3|         30|        | 
|2019-01-02|  2|         20|        | 
|2019-01-04|  2|          5|   ERROR| 
+----------+---+-----------+--------+

The fourth row doesn't get labeled with 'ERROR' because the previous value had a smaller kilometraje value (10 < 30). If you want to label every id that contains at least one corrupted row with 'ERROR', perform a left join:

df.drop('error_km') \
  .join(df.filter(df.error_km == 'ERROR')
          .groupby('id')
          .agg(F.first(df.error_km).alias('error_km')),
        'id', 'left') \
  .show()
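
A variant (my sketch, not part of the original answer): the same per-id flag can be computed with a window aggregate instead of the join, reusing the df that already contains the error_km column from above; df_flagged is just an illustrative name.

from pyspark.sql import Window

# Sketch: mark every row of an id that has at least one 'ERROR' row,
# using an unordered window over the whole id partition.
w_id = Window.partitionBy('id')

df_flagged = df.withColumn(
    'error_km',
    F.when(
        F.max(F.when(F.col('error_km') == 'ERROR', 1).otherwise(0)).over(w_id) == 1,
        F.lit('ERROR')
    ).otherwise(F.lit('')))

df_flagged.show()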
cronoik
  • Amazing dude! Thanks! – Juan Jul 04 '19 at 01:04
  • @cronoik nice answer. But it will not work in case the latest kilometraje is greater than the lag value but less than the max mileage for that customer id. For example, if I add one more row `('4/1/2019', 2, 6)`: here 6 is greater than 5 but less than 20 for `Id` 2, so it should also give an error. But as per the OP's current scenario it's perfect. – PIG Jul 04 '19 at 07:15
  • I was about to raise that, because I'm using it and that's where I got the error: 10, 15, 8, 9, 16, 8 and 9 are wrong because they are less than the 15 – Juan Jul 04 '19 at 16:07
  • I've already found the solution, this is the magic https://stackoverflow.com/questions/45946349/python-spark-cumulative-sum-by-group-using-dataframe – Juan Jul 04 '19 at 17:51
  • Well, that perfectly shows that people who try to answer questions can only work with the information that was provided. Just because there was a higher value before doesn't automatically mean that all following smaller values are incorrect. It could also mean the recorded higher value is incorrect. It really depends on how you (Juan) want to treat such scenarios in your data, and it is therefore really important to provide such information in your question. – cronoik Jul 04 '19 at 18:35
0

I use .rangeBetween(Window.unboundedPreceding, 0).

This makes the window run from the first row of the partition up to the current row, so the running maximum of the mileage can be compared against the current value.

import pyspark
from pyspark.sql.functions import lit
from pyspark.sql import functions as F
from pyspark.sql.functions import col
from pyspark.sql import Window
from pyspark.sql import SparkSession

spark = SparkSession \
    .builder \
    .appName("Python Spark SQL basic example") \
    .getOrCreate()

error = 'This is error'

l = [('1/1/2019', 1, 10),
     ('2/1/2019', 2, 20),
     ('3/1/2019', 1, 30),
     ('4/1/2019', 1, 10),
     ('5/1/2019', 1, 22),
     ('7/1/2019', 1, 23),
     ('22/1/2019', 2, 5),
     ('11/1/2019', 2, 24),
     ('13/2/2019', 1, 16),
     ('14/2/2019', 2, 18),
     ('5/2/2019', 1, 19),
     ('6/2/2019', 2, 23),
     ('7/2/2019', 1, 14),
     ('8/3/2019', 1, 50),
     ('8/3/2019', 2, 50)]

columns = ['date', 'vin', 'mileage']

df=spark.createDataFrame(l, columns)
df = df.withColumn('date',F.to_date(df.date,  'dd/MM/yyyy'))
df = df.withColumn("max", lit(0))
df = df.withColumn("error_code", lit(''))

w = Window.partitionBy('vin').orderBy('date').rangeBetween(Window.unboundedPreceding,0)

df = df.withColumn('max',F.max('mileage').over(w))
df = df.withColumn('error_code', F.when(F.col('mileage') < F.col('max'), F.lit('ERROR')).otherwise(F.lit('')))

df.show()

The output now contains the running maximum per vin in the max column, and 'ERROR' in error_code wherever the mileage falls below that maximum.

Finally, all that remains is to drop the helper column that holds the running maximum:

df = df.drop('max')
df.show()

The result is the same table without the max column.
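
A side note (my sketch, not from the original answer): the helper max column can be skipped entirely by comparing against the running maximum directly inside the when expression, reusing the window w defined above.

# Sketch: same logic without the intermediate 'max' column.
df = df.withColumn(
    'error_code',
    F.when(F.col('mileage') < F.max('mileage').over(w), F.lit('ERROR'))
     .otherwise(F.lit('')))

df.show()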

Juan