Questions tagged [data-quality]

Data quality refers to the condition of data and to the techniques used to evaluate or improve that condition.

124 questions
1
vote
1 answer

In a DAG, how can I find a table's primary key column and test whether it contains any NULL values?

I am writing a DataQualityOperator in a DAG. It should check if there's data in a Redshift table. To do this, I would like to check if the primary key column contains null values. With SQL, I found the name of the primary key column. How do I check if it…
anthelix
  • 25
  • 4
1
vote
1 answer

SQL - find all examples of values in all columns with given characteristic

I have a dataset (8.5 million rows) where all values in all columns must be enclosed in quotation marks (" "). I have discovered a problem: a few records hold values in some columns with the closing quotation mark missing. Now I…
KNN
  • 13
  • 3
1
vote
0 answers

How to automate the execution process of data quality rules?

One of our clients has a requirement to build/develop data quality rules using HiveQL, e.g., replace NULL values, change date format to YYYY-MM-DD, standardize amount column values in US & EU format, etc. Problem Statement: I have the set of data…
Manoj Dhake
  • 227
  • 4
  • 16
1
vote
1 answer

Data quality database model

I need an example of a database model to be attached to a database for data quality. The best form of answer would at the very least be DDL that's executable in MySQL; DDL for other RDBMSs is okay, I'll just post another question asking for a porting of…
blunders
  • 3,619
  • 10
  • 43
  • 65
1
vote
2 answers

How to handle bad data quality in a SQL query

The code below is a sample of grouped data containing Temperature (bear in mind it's the temperature taken of a human being in hospital) from our source system. Obviously the data is horrible, but I wondered if it was possible to somehow turn this data…
Simon
  • 391
  • 4
  • 16
1
vote
0 answers

Hive - most efficient way to check for duplicates on one partition against large table

I'm creating a query to run on a very large Hive table (millions of rows inserted every day). I need to check (after the rows have been added, not before) for duplicates. I was wondering whether the below is the most efficient way of doing it, or…
L. Howes
  • 65
  • 2
  • 7
1
vote
1 answer

Working with inaccurate (incorrect) dataset

This is my problem description: "According to the Survey on Household Income and Wealth, we need to find out the top 10% households with the most income and expenditures. However, we know that the collected data are not reliable due to many…
Ardeshir
  • 11
  • 3
1
vote
1 answer

Not getting IDQ log while running using infacmd command

We are running a shell script that runs a deployed IDQ mapping. I looked through the Unix directories to see whether it created a mapping log file, but I cannot find one anywhere. I checked in various directories under "" " folder but I could not trace the log…
Shankar Panda
  • 736
  • 3
  • 11
  • 27
1
vote
1 answer

Repetitions in field in Firebird without regex

I'm trying to craft a query which rejects a row when some field consists of a single repeated character. I.e., I want to select people named Smith but not people named aaaaaa or bbbb. I can't use regexes, as Firebird's SIMILAR TO doesn't have backreferences. How…
BenoitParis
  • 3,166
  • 4
  • 29
  • 56
1
vote
1 answer

Is it possible to change workspace in Talend Open Studio for Data Quality?

Unlike Talend Open Studio (TOS) for Data Integration, TOS for Data Quality neither starts with a splash screen offering project and workspace choices nor permits changing the working project in the Studio. :( I would like at least to change the…
JM.D
  • 139
  • 12
1
vote
1 answer

Is there an algorithm or pattern to merge several rows of the same record into one row?

Due to some unknown fault, every time I have sync'ed my Nokia's Contacts with my Outlook Contacts, via Nokia Suite, each contact on the phone gets added to Outlook again. I now have up to four copies of some contacts in Outlook. Some have different…
ProfK
  • 49,207
  • 121
  • 399
  • 775
0
votes
1 answer

PySpark - compare two string columns and show the diff in a new column

I am doing some data quality checking. How do I compare two StringType columns ('old_unmatch' and 'new_unmatch') and create new columns for the results ('new_unmatch' and…
TingL
  • 35
  • 4
0
votes
0 answers

Constraint constraints/compute.requireOsLogin violated for project (project id)

While creating a Data Quality task in Dataplex, I am facing the error "Constraint constraints/compute.requireOsLogin violated for project". I have checked all the task configuration, but I am not able to find anything related to this error.
0
votes
0 answers

Completeness (and other data profiling) check(s) in snowpark python worksheet

I am trying to do some data profiling/quality checks on data in Snowflake. I've already tried to implement some using SQL but saw there is also an option for a Python worksheet. My current code: import snowflake.snowpark as snowpark from…
0
votes
0 answers

Is there a way I can set up an alert when Azure SQL data source tables are not updated with the latest date?

Recently my report users notified me that the report data was not up to date. Only when I checked the data source did I find out that its data hadn't been synced yet. Because there are many tables, I cannot manually check one table…