Questions tagged [data-quality]

Data quality refers to the condition of data and to the techniques used to evaluate or improve that condition.

124 questions
1
vote
0 answers

How can I specify a different database and schema to create temporary tables in Great Expectations?

Great Expectations creates temporary tables. I tried profiling data in my Snowflake lab. It worked because the role I was using could create tables in the schema that contained the tables I was profiling. I tried to profile a table in a Snowflake…
Alex Woolford
  • 4,433
  • 11
  • 47
  • 80
1
vote
1 answer

Is the Python Great Expectations library compatible with PySpark?

I am implementing data quality checks using the Great Expectations library. Is this library compatible with PySpark? Does it run on multiple cores?
code_bug
  • 355
  • 1
  • 12
1
vote
0 answers

Display whole rows in great_expectations dashboard

When an expectation fails, I cannot view on the dashboard (the data docs) the entire row (and not just the column value) which caused the failure. For example, if I have a failure because the maximum value of a numerical column is over a threshold,…
aprospero
  • 529
  • 3
  • 14
1
vote
0 answers

Great Expectations: How to add a partition (column partition) in an Athena External Table in a checkpoint reference in GE?

The setup is GE v3 and I am using AWS Athena as a Data Source. However, I couldn't find a way to tell the "expectation" that the table is actually partitioned with a relative path in S3 like…
nandevers
  • 191
  • 8
1
vote
0 answers

Data Quality Framework in AWS

I am trying to implement a data quality framework for an application which ingests data from various systems (batch, near real time, real time). A few items that I want to highlight here are: The data pipelines vary widely and ingest very high volumes…
1
vote
1 answer

PySpark: how can I identify unmatched row values between two data frames?

I have the two data frames below, and I am trying to identify the unmatched row values in data frame two. This is part of a migration in which I want to see the differences after the source data has been migrated/moved to a different…
cloud_hari
  • 147
  • 1
  • 8
1
vote
1 answer

When will the data quality function be included in a release version?

When will the DolphinScheduler data quality function be included in a release version? Is there a timeline for this?
JK.LEE
  • 11
  • 3
1
vote
1 answer

Data Quality Process - defining rules

I am working on a Data Quality Monitoring project, which is new to me. I started with data profiling to analyse my data and get a global view of it. Next, I thought about defining some data quality rules, but I'm a little bit confused about how to…
biihu
  • 69
  • 6
1
vote
1 answer

Data quality - Missing values (Pandas)

I'm working on a data quality project. I'm trying to generate a data quality report using pandas-profiling's ProfileReport, but when I check the report it says that I have no missing values, although I do have empty cells. Or do you have any other…
biihu
  • 69
  • 6
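One possible cause of the symptom described in the missing-values question above is that empty cells are stored as empty or whitespace-only strings, which pandas does not treat as missing. The following is a minimal sketch under that assumption; `count_missing` and the sample frame are hypothetical and not taken from the original question.

```python
import numpy as np
import pandas as pd


def count_missing(df: pd.DataFrame) -> pd.Series:
    """Count missing values per column.

    Assumption: empty strings and whitespace-only strings should be
    treated as missing, in addition to NaN/None.
    """
    # Turn blank strings into NaN so that isna() can see them.
    cleaned = df.replace(r"^\s*$", np.nan, regex=True)
    return cleaned.isna().sum()


# Hypothetical sample data: two blank names and one missing age.
df = pd.DataFrame({"name": ["Ann", "", "  "], "age": [31, None, 40]})
missing = count_missing(df)
```

Applying the same replacement before building the profiling report would let the blanks show up as missing values, if this assumption matches the data.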
1
vote
1 answer

Data quality - check if all values in a character column are numbers in R

I am looking to perform data quality checks on numerous system-generated tables. One of the checks is to see if all values in a character column are only numbers. I am looking to know the number of columns where this check is true. Using the following table…
Ryan Garnett
  • 231
  • 2
  • 8
1
vote
1 answer

How to use hasUniqueness check in PyDeequ?

I'm using PyDeequ for data quality and I want to check the uniqueness of a set of columns. There is a Check method hasUniqueness but I can't figure out how to use it. I'm trying: check.hasUniqueness([col1, col2], ????) But what should we use here for…
ruy
  • 23
  • 3
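To illustrate the idea behind the uniqueness question above, here is a pure-Python sketch, not PyDeequ code. It assumes that the second argument is an assertion function applied to a uniqueness ratio, and that the ratio is the share of rows whose column combination occurs exactly once. Both `uniqueness` and `has_uniqueness` are hypothetical helpers written for this example.

```python
from collections import Counter
from typing import Callable, Sequence


def uniqueness(rows: Sequence[dict], columns: Sequence[str]) -> float:
    """Fraction of rows whose combination of `columns` occurs exactly once."""
    keys = [tuple(row[c] for c in columns) for row in rows]
    if not keys:
        return 0.0
    counts = Counter(keys)
    unique_rows = sum(1 for n in counts.values() if n == 1)
    return unique_rows / len(keys)


def has_uniqueness(
    rows: Sequence[dict],
    columns: Sequence[str],
    assertion: Callable[[float], bool],
) -> bool:
    """Apply a caller-supplied assertion to the uniqueness ratio."""
    return assertion(uniqueness(rows, columns))


# Hypothetical data: the pair ("b", 1) appears twice.
rows = [
    {"col1": "a", "col2": 1},
    {"col1": "a", "col2": 2},
    {"col1": "b", "col2": 1},
    {"col1": "b", "col2": 1},
]
ratio = uniqueness(rows, ["col1", "col2"])
fully_unique = has_uniqueness(rows, ["col1", "col2"], lambda r: r == 1.0)
mostly_unique = has_uniqueness(rows, ["col1", "col2"], lambda r: r >= 0.5)
```

If PyDeequ follows the same pattern, the placeholder in the question would be a function such as `lambda x: x == 1`, but that should be checked against the library documentation.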
1
vote
0 answers

Standardized format for specification for data quality tests

I'm working on automating data quality tests. I found the Great Expectations framework, which seems wonderful for running the tests. But using it requires "manual" coding of the checks and keeping them up-to-date with changes in the spec (requiring a coder…
Mira Renda
  • 11
  • 1
1
vote
1 answer

Define Data Quality Rules for Big Data

Is there any way to define data quality rules that can be applied over DataFrames? The template for defining a rule should be easy enough for any layperson to use, and then we can take these rules, convert them to PySpark code, and run them over…
Snehasish Das
  • 280
  • 2
  • 12
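As a rough illustration of the rule-template idea in the question above, the sketch below keeps rules as plain dictionaries and evaluates them over rows in pure Python. The rule vocabulary (`not_null`, `between`, `in_set`), the `failing_rows` helper, and the sample data are all hypothetical; translating such rules into PySpark expressions would be a separate step that is not shown here.

```python
from typing import Any, Callable, Dict, List

# Hypothetical rule vocabulary: each rule type builds a predicate on one value.
RULE_TYPES: Dict[str, Callable[[dict], Callable[[Any], bool]]] = {
    "not_null": lambda rule: lambda v: v is not None,
    "between": lambda rule: lambda v: v is not None
    and rule["min"] <= v <= rule["max"],
    "in_set": lambda rule: lambda v: v in rule["values"],
}


def failing_rows(rows: List[dict], rules: List[dict]) -> Dict[str, List[int]]:
    """Return, for each rule name, the indexes of the rows that violate it."""
    report: Dict[str, List[int]] = {}
    for rule in rules:
        check = RULE_TYPES[rule["type"]](rule)
        report[rule["name"]] = [
            i for i, row in enumerate(rows) if not check(row.get(rule["column"]))
        ]
    return report


# Rules written as data, so a non-programmer could edit them (e.g. as JSON).
rules = [
    {"name": "age_present", "column": "age", "type": "not_null"},
    {"name": "age_range", "column": "age", "type": "between", "min": 0, "max": 120},
    {"name": "country_known", "column": "country", "type": "in_set",
     "values": ["DE", "FR"]},
]
rows = [
    {"age": 34, "country": "DE"},
    {"age": None, "country": "FR"},
    {"age": 150, "country": "US"},
]
report = failing_rows(rows, rules)
```

Keeping rules as data rather than code means the same definitions could later be fed to a different execution engine without rewriting them.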
1
vote
1 answer

Histogram in anomaly detection in the Deequ library

Can we use the histogram analyzer in anomaly detection? Let's say I want to check for a change in the ratio of values in a specified column. For example, histogram analysis for a column with Male and Female as values is something like (Male - 0.6)…
1
vote
0 answers

Can ICP4D (IBM Cloud Pak for Data) also be used as a Data Quality tool?

Can ICP4D (IBM Cloud Pak for Data) also be used as a DQ (Data Quality) tool? I know it is not primarily meant for DQ, but does it have the capability to address a few DQ dimensions/areas?
NKS
  • 21
  • 1
  • 6