Data quality refers to the condition of data and to the techniques used to evaluate or improve that condition.
Questions tagged [data-quality]
124 questions
1
vote
1 answer
In a DAG, how can I find the primary key column of a table and test whether it contains any NULL values?
I am writing a DataQualityOperator in a DAG.
It should check if there's data in a Redshift table. To do this, I would like to check whether the primary key column contains null values. With SQL, I found the name of the primary key column. How do I check if it…

anthelix
- 25
- 4
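One way to approach the null check described in the question above is a plain `COUNT(*) ... IS NULL` query. The following is only a minimal sketch: it uses an in-memory SQLite database as a stand-in for Redshift, and the table name `songplays`, the column names, and the helper `count_nulls` are all hypothetical.

```python
import sqlite3


def count_nulls(conn, table, column):
    """Return the number of rows in `table` where `column` is NULL."""
    # Identifiers cannot be bound as query parameters, so they are quoted
    # inline; only pass trusted table and column names.
    query = f'SELECT COUNT(*) FROM "{table}" WHERE "{column}" IS NULL'
    return conn.execute(query).fetchone()[0]


# Hypothetical sample table standing in for the Redshift table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE songplays (songplay_id INTEGER, user_id INTEGER)")
conn.executemany(
    "INSERT INTO songplays VALUES (?, ?)",
    [(1, 10), (None, 11), (3, None)],
)

null_count = count_nulls(conn, "songplays", "songplay_id")
if null_count > 0:
    print(f"Data quality check failed: {null_count} NULL key value(s)")
```

Inside an Airflow operator, the same query could presumably be run through whatever database hook the DAG already uses, with the check raising an error when the count is non-zero.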
1
vote
1 answer
SQL - find all examples of values in all columns with given characteristic
I have a dataset (8.5 million rows) where all values in all columns must be enclosed in quotation marks (" "). I have discovered a problem: a few records hold values in some columns with the closing quotation mark missing. Now I…

KNN
- 13
- 3
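The check described in the question above can be sketched outside SQL as well. This is a minimal illustration, assuming the rows have already been loaded as lists of raw strings; the function names and the sample rows are hypothetical.

```python
def missing_closing_quote(value):
    """True if the value opens with a double quote but does not close with one."""
    # A lone '"' both starts and ends with a quote, so it needs the length check.
    return value.startswith('"') and (len(value) < 2 or not value.endswith('"'))


def find_bad_cells(rows):
    """Yield (row_index, column_index, value) for each offending cell."""
    for r, row in enumerate(rows):
        for c, value in enumerate(row):
            if missing_closing_quote(value):
                yield r, c, value


# Hypothetical sample data: two cells lack the closing quotation mark.
rows = [
    ['"a"', '"b"'],
    ['"c', '"d"'],
    ['"e"', '"f'],
]
bad = list(find_bad_cells(rows))
for r, c, value in bad:
    print(f"row {r}, column {c}: {value!r}")
```

Because `find_bad_cells` is a generator, it can scan a large file row by row without holding all 8.5 million rows in memory.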
1
vote
0 answers
How to automate the execution process of data quality rules?
One of our clients has a requirement to build/develop data quality rules using hiveQL.
E.g., replace NULL values, change the date format to YYYY-MM-DD, standardize amount column values in US & EU format, etc.
Problem Statement:
I have the set of data…

Manoj Dhake
- 227
- 4
- 16
1
vote
1 answer
Data quality database model
I need an example of a database model to be attached to a database for data quality. The best form of answer would at the very least be DDL that's executable in MySQL; DDL for other RDBMSs is okay, I'll just post another question asking for a porting of…

blunders
- 3,619
- 10
- 43
- 65
1
vote
2 answers
How to handle bad data quality in a SQL query
The code below is a sample of grouped data containing Temperature (bear in mind it's the temperature of a human patient in hospital) from our source system.
Obviously the data is horrible, but I wondered whether it was possible to somehow turn this data…

Simon
- 391
- 4
- 16
1
vote
0 answers
Hive - most efficient way to check for duplicates on one partition against large table
I'm creating a query to run on a very large Hive table (millions of rows inserted every day).
I need to check (after the rows have been added, not before) for duplicates. I was wondering whether the below is the most efficient way of doing it, or…

L. Howes
- 65
- 2
- 7
1
vote
1 answer
Working with inaccurate (incorrect) dataset
This is my problem description:
"According to the Survey on Household Income and Wealth, we need to find out the top 10% households with the most income and expenditures. However, we know that this collected data is not reliable due to many…

Ardeshir
- 11
- 3
1
vote
1 answer
Not getting IDQ log while running using infacmd command
We are running a shell script that runs a deployed IDQ mapping. I looked in the Unix directories to see whether it created a mapping log file, but I cannot find one anywhere.
I checked in various directories under "" " folder but I could not trace the log…

Shankar Panda
- 736
- 3
- 11
- 27
1
vote
1 answer
Repetitions in field in Firebird without regex
I'm trying to craft a query which rejects a row when some field consists entirely of one repeated character, i.e. I want to select people named Smith but not people named aaaaaa or bbbb.
I can't use regexes, as Firebird's SIMILAR TO doesn't have backreferences.
How…

BenoitParis
- 3,166
- 4
- 29
- 56
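One regex-free idea for the question above is to strip every occurrence of the field's first character and see whether anything is left. The sketch below demonstrates it with SQLite's `replace` and `substr`, not Firebird itself; whether the target dialect offers equivalent functions is an assumption, and the `people` table and its rows are hypothetical.

```python
import sqlite3

# Hypothetical sample table; SQLite stands in for Firebird here.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE people (name TEXT)")
conn.executemany(
    "INSERT INTO people VALUES (?)",
    [("Smith",), ("aaaaaa",), ("bbbb",), ("Anna",)],
)

# Remove every occurrence of the first character; a name made of one
# repeated character collapses to the empty string and is filtered out.
query = """
    SELECT name
    FROM people
    WHERE replace(name, substr(name, 1, 1), '') <> ''
    ORDER BY name
"""
kept = [row[0] for row in conn.execute(query)]
print(kept)
```

Note that this comparison is case-sensitive in SQLite, and that a single-character name is also rejected, which may or may not be the desired behaviour.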
1
vote
1 answer
Is it possible to change workspace in Talend Open Studio for Data Quality?
Unlike Talend Open Studio (TOS) for Data Integration, TOS for Data Quality neither starts with a splash screen offering project and workspace choices nor allows changing the working project in the Studio. :(
I would like at least to change the…

JM.D
- 139
- 12
1
vote
1 answer
Is there an algorithm or pattern to merge several rows of the same record into one row?
Due to some unknown fault, every time I have sync'ed my Nokia's Contacts with my Outlook Contacts, via Nokia Suite, each contact on the phone gets added to Outlook again. I now have up to four copies of some contacts in Outlook. Some have different…

ProfK
- 49,207
- 121
- 399
- 775
0
votes
1 answer
pyspark - compare two String col and show the diff in new col
I am doing some data quality checking.
How do I compare two StringType columns ('old_unmatch' and 'new_unmatch') and create new columns for the results ('new_unmatch' and…

TingL
- 35
- 4
0
votes
0 answers
Constraint constraints/compute.requireOsLogin violated for project (project id)
While creating a Data Quality task in Dataplex, I am facing the error "Constraint constraints/compute.requireOsLogin violated for project".
I have checked all the task configuration, but I am not able to find anything related to this error.
0
votes
0 answers
Completeness (and other data profiling) check(s) in snowpark python worksheet
I am trying to do some data profiling/quality checks on data in Snowflake. I've already tried to implement some using SQL but saw there is also an option for a Python worksheet.
My current code:
import snowflake.snowpark as snowpark
from…

Leonie
- 47
- 4
0
votes
0 answers
Is there a way to set up an alert when the Azure SQL data source tables are not updated with the latest date?
Recently my report users notified me that the report data had not been updated. Only when I checked the data source did I find out that its data hadn't been synced yet.
Because there are many tables, I cannot manually check one table…