2

I just started with Great Expectations library and I want to know if it is possible to use it to remove invalidated data from Pandas DataFrame. And how I can do that if is possible ? Also I want to insert invalid data to PostgreSQL database.

I didn't find anything about this in the documentation and on searching the Web.

Later Edit : To clarify: I need that in the case great expectation for example find 5 rows in a DataFrame that are invalid (for example df.expect_column_values_to_not_be_null('age') has 5 rows with null) to remove them from original DataFrame and insert them in a PostgreSQL errors table

Florin P.
  • 45
  • 7

1 Answers1

1

Great Expectations is a powerful tool to validate data.
Like all powerful tools, it's not that straightforward.

You can start from here:

import great_expectations as ge
import numpy as np
import pandas as pd
    
# get some random numbers and create a pandas df
df_raw = pd.DataFrame(np.random.randint(0,100,size=(100, 4)), columns=list('ABCD'))

# initialize a "great_expectations" df 
df = ge.from_pandas(df_raw)

# search for invalidate data on column 'A'. 
# In this case, i'm looking for any null value from column 'A'.
df.expect_column_values_to_not_be_null('A')

Results:

{
  "exception_info": null,
  "expectation_config": {
    "expectation_type": "expect_column_values_to_not_be_null",
    "kwargs": {
      "column": "A",
      "result_format": "BASIC"
    },
    "meta": {}
  },
  "meta": {},
  "success": true,
  "result": {
    "element_count": 100,
    "unexpected_count": 0,
    "unexpected_percent": 0.0,
    "partial_unexpected_list": []
  }
}

Look at the response : good news !!!
There aren't null values in my df. "unexpected_count" is equal to 0

API Reference : https://legacy.docs.greatexpectations.io/en/latest/autoapi/great_expectations/index.html


EDIT: If you need simply to find some invalid values and split your df into:

  1. Clean Dataframe
  2. Dirty Dataframe

maybe you dont need "great_expectations". you can use a function like this:

import pandas as pd

my_df = pd.DataFrame({'A': [1,2,1,2,3,0,1,1,5,2]})

def check_data_quality(dataframe):
    df = dataframe
    clean_df = df[df['A'].isin([1, 2])]
    dirty_df = df[df["A"].isin([1, 2]) == False]
    return {'clean': clean_df, 
            'dirty': dirty_df}

my_df_clean = check_data_quality(my_df)['clean']
my_df_dirty = check_data_quality(my_df)['dirty']
BlackMath
  • 1,708
  • 1
  • 11
  • 14
  • First thank you for the answer. I don't understand the example : Were no null values in the original df or great expectations erase nul lvalues form df ? What I need is like this : great expectation finds 5 rows in df that were invalid and I need a possibility to erase these rows from original df and insert them in a errors table – Florin P. Sep 20 '21 at 10:46
  • In my example "great expectations" is used to find if there are null values in the original df. No replacements, no eliminations. "great expectations" is great to determine your data quality. – BlackMath Sep 20 '21 at 10:51
  • ok. I understand but is not exactly what I need. I need (from your example) to collect all rows that has null in any column , remove them form original df and insert collected rows in an errors table – Florin P. Sep 20 '21 at 10:54
  • have a look at this : https://legacy.docs.greatexpectations.io/en/latest/autoapi/great_expectations/expectations/core/expect_table_row_count_to_equal/index.html – BlackMath Sep 20 '21 at 10:55
  • Sorry to bother you again but I don't understand : How I can use this API ? I don't know before how many rows have an expected problem. For example I have an expectation for column ('A' ) to hava only values "one" and "two". I need to collect all rows that invalidate this rule, collect them in a separate (maybe) dataframe and delete them from original dataframe – Florin P. Sep 20 '21 at 11:05
  • To do this, you don't need a complex tool like "Great Expectations". I edited my answer to your specific case. – BlackMath Sep 20 '21 at 12:02