Data quality refers to the condition of data and to the techniques used to evaluate or improve that condition.
Questions tagged [data-quality]
124 questions
20
votes
4 answers
List of Unicode characters that should be filtered in output?
Recently I hit a bug due to data quality with browser support, and I am looking for a safe rule for escaping strings without doubling their size unless required.
A UTF-8 byte sequence "E2-80-A8" (U+2028, LINE SEPARATOR), a perfectly valid character in…

Dennis C
- 24,511
- 12
- 71
- 99
5
votes
2 answers
How to find rows where a particular column has decimal numbers using pandas?
I am writing a data quality script using pandas, where the script checks certain conditions on each column.
At the moment I need to find the rows that don't have a decimal or an actual number in a particular column. I am able to find…

stormfield
- 1,696
- 1
- 14
- 26
4
votes
1 answer
Great Expectations: base_directory must be an absolute path if root_directory is not provided
This is about the Great Expectations module in Python, primarily used for data quality checks (I found its documentation to be inadequate). I've been trying to set up the data context in my notebook (using a local datasource) - as mentioned…

Debapratim Chakraborty
- 375
- 3
- 15
4
votes
2 answers
R - estimating missing values
Let's assume I have a table as such:
Date Sales
09/01/2017 9000
09/02/2017 12000
09/03/2017 0
09/04/2017 11000
09/05/2017 14400
09/06/2017 0
09/07/2017 0
09/08/2017 21000
09/09/2017 15000
09/10/2017 23100
09/11/2017 0
09/12/2017 …

Craig
- 1,929
- 5
- 30
- 51
3
votes
1 answer
Matching Oracle duplicate column values using Soundex, Jaro Winkler and Edit Distance (UTL_MATCH)
I am trying to find a reliable method for matching duplicate person records within the database. The data has some serious data quality issues which I am also trying to fix but until I have the go-ahead to do so I am stuck with the data I have…

Ollie
- 17,058
- 7
- 48
- 59
3
votes
1 answer
How to view specific changes in data at particular version in Delta Lake
Right now I have a test dataset with 1 partition, and that partition contains 2 Parquet files.
If I read data as:
val df = spark.read.format("delta").load("./test1510/table@v1")
Then I get the latest data with 10,000 rows, and if I read:
val df…

Shashank Sharma
- 385
- 1
- 4
- 14
3
votes
3 answers
Best usability practice for accepting long-ish account numbers
A user recently inquired (OK, complained) as to why a 19-digit account number on our web site was broken up into 4 individual text boxes of length [5,5,5,4]. Not being the original designer, I couldn't answer the question, but I'd always it…

LesterDove
- 3,014
- 1
- 23
- 24
3
votes
1 answer
Handling Duplicates in Data Warehouse
I was going through the link below on handling data quality issues in a data warehouse.
http://www.kimballgroup.com/2007/10/an-architecture-for-data-quality/
"
Responding to Quality Events
I have already remarked that each quality screen has to…

Anand Kannan
- 141
- 3
- 8
2
votes
3 answers
What are techniques and practices on measuring data quality?
If I have a large set of data that describes physical 'things', how could I go about measuring how well that data fits the 'things' that it is supposed to represent?
An example would be if I have a crate holding 12 widgets, and I know each widget…

MStodd
- 4,716
- 3
- 30
- 50
2
votes
2 answers
Is there a difference between the terms "data integrity" and "data quality"?
I was asked this question in an interview today, and didn't know how to answer.
Can anyone provide an insight as to the differences?

Chris
- 4,594
- 4
- 29
- 39
2
votes
1 answer
Using PyDeequ on Jupyter Notebook and having this "An error occurred while calling o70.run."
I'm trying to use PyDeequ in a Jupyter Notebook. When I try to use ConstraintSuggestionRunner, it shows this error:
Py4JJavaError: An error occurred while calling o70.run.
: java.lang.NoSuchMethodError:…

LuisRicardo
- 21
- 1
2
votes
1 answer
Test Great Expectations YAML with BigQuery
I am having trouble testing the Great Expectations YAML for BigQuery.
I followed the official documentation and arrived at this code:
import os
import great_expectations as ge
datasource_yaml = """
name: my_bigquery_datasource
class_name:…

elvainch
- 1,369
- 3
- 15
- 32
2
votes
1 answer
On Athena AWS, last update on table?
I am trying to monitor data quality on AWS Athena.
I would like to know how I can find out when data was loaded into a table.
The table isn't partitioned, and I can't partition it.
Thanks for your help!

Karo
- 21
- 2
2
votes
2 answers
Data sets for which neural networks are not advised
My question basically is: in a learning problem, are there data sets for which neural networks are not advised? What are some common characteristics of such data sets?
The reason why I am asking is :
In some articles it is proven that…

Already
- 21
- 2
2
votes
3 answers
Are there free, low cost, or open source tools for matching name/address data?
This question is related to Tools for matching name/address data. There are a number of commercial tools provided by SAS, Oracle, Microsoft, etc., that allow de-duplicating or merging names of individuals or companies coming from multiple…

luiscolorado
- 1,525
- 2
- 16
- 23