Questions tagged [data-quality]

Data quality refers to the condition of a dataset and to the techniques used to evaluate or improve that condition.

124 questions
0 votes, 1 answer
Dataplex Data Quality Rules
I am looking for a Google-native option for data quality and looked into Dataplex in the GCP world. However, there are two ways to define data quality rules in Dataplex: i) via Process and ii) via Govern.
What is the difference between Dataplex Data…

Gora Bhattacharya
0 votes, 0 answers
How to write regular expressions in the Rules Grid in Abinitio
I have a regular expression that works perfectly fine in the Sheet view in Abinitio ExpressIT, but I am trying to do the same in the Rules Grid / Grid view.
I am not sure which function I can use in the Rules Grid. Tried with re_get_match…

JKC
0 votes, 0 answers
Data Quality Flag
Let's talk about data quality flags. As I have only limited knowledge of this topic, I hope some of you are more experienced.
Let's say we have two tables inside the first layer of a data warehouse. The first one contains a person's ID and its…

Ai4l2s
0 votes, 0 answers
XGBoost separation of weekday and marketing campaign
Since 2020 and up until today, we have conducted marketing campaigns almost every Sunday, and I'm trying to estimate their impact using an XGBoost model and to model Sundays without this campaign, which is basically a high discount on most of the…

Thomas Dorloff
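A minimal sketch of one way to frame the question above (not taken from the post itself): encode the Sunday and campaign indicators as explicit features, fit an XGBoost regressor, and score the same Sundays with the campaign switched off. The file name and the columns `sales`, `campaign_active`, and `avg_discount` are assumptions.

```python
# Hypothetical sketch: estimate campaign impact by predicting a
# counterfactual "no campaign" Sunday. Columns are assumptions.
import pandas as pd
import xgboost as xgb

df = pd.read_csv("daily_sales.csv", parse_dates=["date"])    # assumed file and schema
df["is_sunday"] = (df["date"].dt.dayofweek == 6).astype(int)
df["campaign_active"] = df["campaign_active"].astype(int)    # 1 if the discount ran that day
df["month"] = df["date"].dt.month

features = ["is_sunday", "campaign_active", "avg_discount", "month"]

model = xgb.XGBRegressor(n_estimators=300, max_depth=4, learning_rate=0.05)
model.fit(df[features], df["sales"])

# Counterfactual: the same Sundays, but with the campaign switched off.
sundays = df[df["is_sunday"] == 1]
counterfactual = sundays.copy()
counterfactual["campaign_active"] = 0
counterfactual["avg_discount"] = 0.0

uplift = model.predict(sundays[features]) - model.predict(counterfactual[features])
print("estimated mean campaign uplift:", uplift.mean())
```

Note that if the campaign ran on almost every Sunday, `is_sunday` and `campaign_active` are nearly collinear, so the few campaign-free Sundays carry essentially all of the signal that lets the model separate the two effects.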
0 votes, 0 answers
How to measure data quality?
I have a question regarding data quality. I am aware of the data quality dimensions, but I'd like to be able to measure data quality in numbers.
For example, how many NULLs are acceptable in a column, e.g. 2%, 5%, 10%, etc.?
I know every data set is…

ninelondon
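One simple way to turn a dimension such as completeness into a number is to compute the NULL ratio per column and compare it against a threshold chosen per dataset. A small pandas sketch; the file name and the 5% threshold are only illustrative.

```python
import pandas as pd

df = pd.read_csv("customers.csv")    # assumed input
threshold = 0.05                     # e.g. at most 5% missing values allowed

null_ratio = df.isna().mean()        # fraction of NULLs per column
report = pd.DataFrame({
    "null_ratio": null_ratio,
    "passes": null_ratio <= threshold,
})
print(report)

# Overall completeness score: share of non-null cells in the whole table.
completeness = 1 - df.isna().to_numpy().mean()
print(f"completeness: {completeness:.2%}")
```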
0 votes, 2 answers
Check the data quality in Google Sheets (asking for suggestions)
I'm trying to create a sheet to check the data quality of a survey in Google Sheets; the document has this format:
So basically I was using this formula =COUNTIF(B2:F2,"Don't know") to count "Don't know", empty cells, 0's, and numbers greater than 9, if…
user16239103
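The question is about a Sheets formula, but as a point of comparison, here is a rough pandas equivalent of counting the problem values per row; the value rules ("Don't know", blanks, zeros, numbers above 9) come from the excerpt, while the file name and column positions are assumptions.

```python
import pandas as pd

df = pd.read_csv("survey.csv")      # assumed export of the sheet
answer_cols = df.columns[1:6]       # rough analogue of the B:F range

def flagged_answers(row):
    """Count "Don't know", blanks, zeros, and values greater than 9 in one row."""
    count = 0
    for value in row:
        if pd.isna(value) or value == "" or value == "Don't know":
            count += 1
            continue
        num = pd.to_numeric(value, errors="coerce")
        if not pd.isna(num) and (num == 0 or num > 9):
            count += 1
    return count

df["flagged"] = df[answer_cols].apply(flagged_answers, axis=1)
print(df["flagged"].head())
```

In Sheets itself, the usual pattern is to sum several terms over the same range, e.g. one COUNTIF per rule plus COUNTBLANK for the empty cells.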
0 votes, 0 answers
How to include agg function without skewing my other data flags (that would need one flag per row)
I am doing some data quality checks. Flag 2 needs COUNT(). How could I structure this query so that I can keep the other flags and get my aggregate-function flag?
WITH
  OrderTable AS (
    SELECT
      order_id, product_id, country, bought_date, price
    FROM…

Jade
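The usual way to keep per-row flags while also needing COUNT() is a window function, so the aggregate is computed per group without collapsing the rows. A sketch in Spark SQL run from Python; the table name and the flag definitions are invented for illustration.

```python
# Sketch: per-row flags plus an aggregate-based flag via a window function.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

checked = spark.sql("""
    WITH OrderTable AS (
        SELECT order_id, product_id, country, bought_date, price
        FROM orders                               -- assumed source table
    )
    SELECT
        *,
        CASE WHEN price IS NULL THEN 1 ELSE 0 END AS flag_price_missing,
        CASE WHEN COUNT(*) OVER (PARTITION BY order_id) > 1
             THEN 1 ELSE 0 END                    AS flag_duplicate_order
    FROM OrderTable
""")
checked.show()
```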
0 votes, 0 answers
How to Null check multiple columns, with casting reporting elements
Looking for the most efficient way to check for nulls and have a desired output for a report. This is done in a Hadoop environment.
For example,
Database…

Supernova
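A common single-pass pattern in a Spark-on-Hadoop environment is one aggregation that counts NULLs for every column and then reshapes the result for the report. A sketch with an assumed source table:

```python
# Sketch: count NULLs for every column in one pass, then report one row per column.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.table("warehouse.customers")        # assumed source table

null_counts = df.select([
    F.sum(F.when(F.col(c).isNull(), 1).otherwise(0)).cast("long").alias(c)
    for c in df.columns
])

# Reshape the single wide row into (column_name, null_count) rows.
report = null_counts.selectExpr(
    f"stack({len(df.columns)}, "
    + ", ".join(f"'{c}', `{c}`" for c in df.columns)
    + ") as (column_name, null_count)"
)
report.show(truncate=False)
```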
0 votes, 0 answers
Model Monitor Capture data - EndpointOutput Encoding is BASE64
https://docs.aws.amazon.com/sagemaker/latest/dg/model-monitor-data-capture-endpoint.html
I have followed the steps mentioned in this link, and it appears I cannot change the encoding for EndpointOutput in the data capture file. It comes out as BASE64 for…

Karki
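Whether the encoding can be changed I cannot say, but the BASE64 payload in the capture file is easy to decode offline. A sketch that assumes the usual JSON-lines layout of a data capture file; the field names follow the linked documentation, so treat them as assumptions if your file differs.

```python
# Sketch: decode the BASE64-encoded endpointOutput payload from a
# SageMaker Model Monitor data capture file (JSON lines).
import base64
import json

with open("capture.jsonl") as f:               # assumed local copy of the capture file
    for line in f:
        record = json.loads(line)
        output = record["captureData"]["endpointOutput"]
        payload = output["data"]
        if output.get("encoding") == "BASE64":
            payload = base64.b64decode(payload).decode("utf-8")
        print(payload)
```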
0 votes, 1 answer
great expectation with delta table
I am trying to run a Great Expectations suite on a Delta table in Databricks, but I want to run it on part of the table via a query. Though the validation runs fine, it runs on the full table data.
I know that I can load a Dataframe…

S.Dasgupta
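One workaround, sketched here with the legacy in-memory validation path rather than a full checkpoint, is to load only the queried slice as a Spark DataFrame and validate that; the catalog, table, filter, and expectation are assumptions.

```python
# Sketch: validate a filtered slice of a Delta table with the legacy
# Great Expectations SparkDFDataset wrapper.
from pyspark.sql import SparkSession
from great_expectations.dataset import SparkDFDataset

spark = SparkSession.builder.getOrCreate()

subset = spark.sql("""
    SELECT * FROM my_catalog.sales            -- assumed Delta table
    WHERE sale_date >= '2023-01-01'           -- assumed filter
""")

ge_df = SparkDFDataset(subset)
result = ge_df.expect_column_values_to_not_be_null("customer_id")
print(result)
```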
0 votes, 0 answers
Integration of Alation, Manta, and BigID over cloud
I need to integrate and deploy Alation, Manta, and BigID in the cloud [AWS/Azure] for data governance. Can anyone please suggest source material for the below:
How to deploy these three in the cloud
How these two work when integrated; I could not find…

Abhi Soni
0 votes, 0 answers
An error has been thrown from the AWS client
I got this error when running a Collibra DQ job via AWS Athena:
An error has been thrown from the AWS Athena client. Query exhausted resources at this scale factor at com.simba.athena.athena.api.AJClient.executeQuery at…
0 votes, 2 answers
Detecting similar columns across multiple files based on statistical profile
I'm attempting to clean up a set of old files that contain sensor data measurements. Many of the files don't have headers, and the format (column ordering, etc.) is inconsistent. I'm thinking the best that I can do in these cases is to match…

Ryan Gross
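A rough sketch of the profile-matching idea, assuming numeric sensor columns: compute a small statistical fingerprint per column, then match columns between files by the nearest fingerprint. The file names and the choice of statistics are illustrative only.

```python
# Sketch: match unlabeled columns to known columns by comparing per-column
# statistical profiles. Everything here is illustrative.
import numpy as np
import pandas as pd

def column_profile(series: pd.Series) -> np.ndarray:
    """A small numeric fingerprint for one column."""
    s = pd.to_numeric(series, errors="coerce").dropna()
    return np.array([s.mean(), s.std(), s.skew(), *s.quantile([0.1, 0.5, 0.9])])

reference = pd.read_csv("file_with_headers.csv")
unknown = pd.read_csv("file_without_headers.csv", header=None)

ref_profiles = {name: column_profile(reference[name]) for name in reference.columns}

for col in unknown.columns:
    profile = column_profile(unknown[col])
    # Nearest reference column by Euclidean distance between fingerprints.
    best = min(ref_profiles, key=lambda name: np.linalg.norm(ref_profiles[name] - profile))
    print(f"column {col} looks most like '{best}'")
```

Scale matters here; if the sensors have very different ranges, normalizing each fingerprint (or comparing rank statistics instead) tends to give more stable matches.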
0 votes, 0 answers
python Great Expectations memory error for unique check
I am implementing data quality checks using the Great Expectations library. The dataset is 80 GB and has 513,749,893 rows.
Following is the code I am implementing to run a uniqueness check on one of the columns:
import great_expectations as…

code_bug
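For a column of roughly 514 million rows, the in-memory pandas path is a likely limit; one alternative, plainly a different technique rather than a Great Expectations fix, is to run the uniqueness check in Spark, where comparing the row count against the distinct count stays out of driver memory. The path and column name are assumptions.

```python
# Sketch: a Spark-side uniqueness check as an alternative to an in-memory run.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.read.parquet("s3://bucket/large_dataset/")    # assumed source
total = df.count()
distinct = df.select("customer_id").distinct().count()   # assumed column
print(f"rows={total}, distinct={distinct}, unique={total == distinct}")
```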
0 votes, 1 answer
How to filter rows that violates constraints deequ
In order to do some unit tests on my data I am using PyDeequ. Is there a way to filter out the rows that violate the defined constraints? I was not able to find anything online. Here is my code:
df1 = (spark
.read
.format("csv")
…

leop
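As far as I know, PyDeequ's VerificationSuite reports pass/fail results per constraint rather than the offending rows, so a common workaround is to re-express each constraint as a DataFrame filter and pull the violating rows yourself. A sketch that continues from the question's spark session and df1; the column names and rules are assumptions.

```python
# Sketch: run a PyDeequ check, then separately filter the rows that would
# violate the same constraints.
from pyspark.sql import functions as F
from pydeequ.checks import Check, CheckLevel
from pydeequ.verification import VerificationSuite, VerificationResult

check = (Check(spark, CheckLevel.Error, "basic checks")
         .isComplete("customer_id")      # assumed column
         .isNonNegative("price"))        # assumed column

result = VerificationSuite(spark).onData(df1).addCheck(check).run()
VerificationResult.checkResultsAsDataFrame(spark, result).show(truncate=False)

# Row-level view: the same rules expressed directly as filters.
violations = df1.filter(F.col("customer_id").isNull() | (F.col("price") < 0))
violations.show()
```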