Questions tagged [data-quality]

Data quality refers to the condition of data and to the techniques used to evaluate or improve that condition.

124 questions
20
votes
4 answers

List of Unicode characters that should be filtered in output?

Recently I hit a bug due to data quality with browser support, and I am looking for a safe rule for escaping strings without doubling their size unless required. A UTF-8 byte sequence "E2-80-A8" (U+2028, LINE SEPARATOR), a perfectly valid character in…
Dennis C
  • 24,511
  • 12
  • 71
  • 99
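The excerpt above is truncated, so the following is only a sketch under one assumption: that the string ends up inside a JavaScript string literal, where U+2028 and U+2029 were treated as line terminators before ES2019. The helper name `escape_for_script` is hypothetical. It escapes only those two characters on top of the normal JSON escapes, so other non-ASCII text keeps its size.

```python
import json

# Valid in JSON, but historically line terminators inside JavaScript string literals.
JS_LINE_TERMINATORS = {"\u2028": "\\u2028", "\u2029": "\\u2029"}


def escape_for_script(value):
    """Serialize value as JSON, additionally escaping U+2028 and U+2029.

    Hypothetical helper: ensure_ascii=False leaves other non-ASCII
    characters unescaped, so the output does not grow unless required.
    """
    text = json.dumps(value, ensure_ascii=False)
    for char, escaped in JS_LINE_TERMINATORS.items():
        text = text.replace(char, escaped)
    return text
```

The result is still valid JSON, so `json.loads(escape_for_script(s))` returns the original string.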
5
votes
2 answers

How to find rows where a particular column has decimal numbers using pandas?

I am writing a data quality script using pandas, where the script checks certain conditions on each column. At the moment I need to find the rows that don't have a decimal or an actual number in a particular column. I am able to find…
stormfield
  • 1,696
  • 1
  • 14
  • 26
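One common way to approach the check described above is to coerce the column with `pd.to_numeric` and keep the rows that fail to parse. This is a minimal sketch, not the asker's method: the function name `non_numeric_rows` and the sample column `amount` are hypothetical, and rows that were already missing are flagged too.

```python
import pandas as pd


def non_numeric_rows(df, column):
    """Return the rows of df whose value in `column` is not a number.

    Hypothetical helper: errors="coerce" turns unparseable values into NaN,
    so missing values are reported together with malformed ones.
    """
    parsed = pd.to_numeric(df[column], errors="coerce")
    return df[parsed.isna()]


# Illustrative data only.
sample = pd.DataFrame({"amount": ["1.5", "abc", "3", None, "4.0x"]})
bad_rows = non_numeric_rows(sample, "amount")
```

Here `bad_rows` holds the rows with `"abc"`, the missing value, and `"4.0x"`.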
4
votes
1 answer

Great Expectations: base_directory must be an absolute path if root_directory is not provided

This is about the Great Expectations module in Python, primarily used for data quality checks (I found their documentation to be inadequate). So I've been trying to set up the data context in my notebook (using a local datasource) - as mentioned…
4
votes
2 answers

R - estimating missing values

Let's assume I have a table as such:

Date        Sales
09/01/2017  9000
09/02/2017  12000
09/03/2017  0
09/04/2017  11000
09/05/2017  14400
09/06/2017  0
09/07/2017  0
09/08/2017  21000
09/09/2017  15000
09/10/2017  23100
09/11/2017  0
09/12/2017  …
Craig
  • 1,929
  • 5
  • 30
  • 51
3
votes
1 answer

Matching Oracle duplicate column values using Soundex, Jaro Winkler and Edit Distance (UTL_MATCH)

I am trying to find a reliable method for matching duplicate person records within the database. The data has some serious data quality issues which I am also trying to fix but until I have the go-ahead to do so I am stuck with the data I have…
Ollie
  • 17,058
  • 7
  • 48
  • 59
3
votes
1 answer

How to view specific changes in data at particular version in Delta Lake

Right now I have one test data set with 1 partition, and inside that partition there are 2 Parquet files. If I read the data as: val df = spark.read.format("delta").load("./test1510/table@v1") then I get the latest data with 10,000 rows, and if I read: val df…
Shashank Sharma
  • 385
  • 1
  • 4
  • 14
3
votes
3 answers

Best usability practice for accepting long-ish account numbers

A user recently inquired (OK, complained) as to why a 19-digit account number on our web site was broken up into 4 individual text boxes of length [5,5,5,4]. Not being the original designer, I couldn't answer the question, but I'd always it…
LesterDove
  • 3,014
  • 1
  • 23
  • 24
3
votes
1 answer

Handling Duplicates in Data Warehouse

I was going through the link below on handling data quality issues in a data warehouse. http://www.kimballgroup.com/2007/10/an-architecture-for-data-quality/ "Responding to Quality Events I have already remarked that each quality screen has to…
2
votes
3 answers

What are techniques and practices on measuring data quality?

If I have a large set of data that describes physical 'things', how could I go about measuring how well that data fits the 'things' that it is supposed to represent? An example would be if I have a crate holding 12 widgets, and I know each widget…
MStodd
  • 4,716
  • 3
  • 30
  • 50
2
votes
2 answers

Is there a difference between the terms "data integrity" and "data quality"?

I was asked this question in an interview today and didn't know how to answer. Can anyone provide insight into the differences?
Chris
  • 4,594
  • 4
  • 29
  • 39
2
votes
1 answer

Using PyDeequ in a Jupyter Notebook and getting "An error occurred while calling o70.run."

I'm trying to use PyDeequ in a Jupyter Notebook. When I try to use ConstraintSuggestionRunner, it shows this error: Py4JJavaError: An error occurred while calling o70.run. : java.lang.NoSuchMethodError:…
2
votes
1 answer

Testing Great Expectations YAML with BigQuery

I am having trouble testing the Great Expectations YAML against BigQuery. I followed the official documentation and got to this code: import os import great_expectations as ge datasource_yaml = """ name: my_bigquery_datasource class_name:…
elvainch
  • 1,369
  • 3
  • 15
  • 32
2
votes
1 answer

On AWS Athena, how to find the last update on a table?

I am trying to monitor data quality on AWS Athena. I would like to know how I can find out when data was loaded into a table. The table has no partitions, and I can't partition it. Thanks for your help!
Karo
  • 21
  • 2
2
votes
2 answers

Data sets for which neural networks are not advised

My question is basically: in a learning problem, are there data sets for which neural networks are not advised? What are some common characteristics of such data sets? The reason I am asking is: in some articles it is proven that…
2
votes
3 answers

Are there free, low cost, or open source tools for matching name/address data?

This question is related to Tools for matching name/address data. There are a number of commercial tools provided by SAS, Oracle, Microsoft, etc., that allow de-duplicating or merging names of individuals or companies coming from multiple…
luiscolorado
  • 1,525
  • 2
  • 16
  • 23