Questions tagged [data-quality]

Data quality refers to the condition of data and to the techniques used to evaluate or improve that condition.

124 questions
0
votes
1 answer

How to UNPIVOT all columns in a table and aggregate into Data Quality/Validation Metrics? SQL SNOWFLAKE

I have a table with 60+ columns in it that I would like to UNPIVOT so that each column becomes a row and then find the fill rate, min value and max value of each entry. For…
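For illustration only, the target shape described in this question (one output row per column, carrying fill rate, minimum and maximum) can be sketched in plain Python. This is not Snowflake SQL and is not taken from the question or its answer. The function name `column_metrics` and the sample rows are hypothetical, `None` is assumed to mark a missing value, and the values within a column are assumed to be mutually comparable.

```python
from typing import Any


def column_metrics(rows: list[dict[str, Any]]) -> list[dict[str, Any]]:
    """Return one metrics row per column: fill rate, min and max."""
    if not rows:
        return []
    # Collect column names in first-seen order.
    columns: list[str] = []
    for row in rows:
        for name in row:
            if name not in columns:
                columns.append(name)
    metrics = []
    for name in columns:
        # Keep only the non-missing values of this column.
        values = [row.get(name) for row in rows if row.get(name) is not None]
        metrics.append({
            "column": name,
            "fill_rate": len(values) / len(rows),
            "min": min(values) if values else None,
            "max": max(values) if values else None,
        })
    return metrics


# Hypothetical sample data, not from the question.
sample = [
    {"id": 1, "price": 9.5, "city": "Oslo"},
    {"id": 2, "price": None, "city": "Bergen"},
    {"id": 3, "price": 4.0, "city": None},
    {"id": 4, "price": 7.25, "city": "Oslo"},
]
for metric in column_metrics(sample):
    print(metric)
```

Each input column becomes one output row, which is the same long format an UNPIVOT followed by a GROUP BY on the column name would be expected to give.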
0
votes
1 answer

Provide AWS credentials to Airflow GreatExpectationsOperator

I would like to use GreatExpectationsOperator to perform data quality validations. The validation results data should be stored in S3. I don't see an option to send an Airflow connection name to the GE operator, and the AWS credentials in my…
Itai Sevitt
  • 140
  • 1
  • 7
0
votes
1 answer

How to change the way Talend formulates SQL queries in a JDBC connection?

In Talend Data Quality, I have configured a JDBC connection to an OpenEdge database and it's working fine. I can pull the list of tables and select columns to analyse, but when executing the analysis, I get this: Table "DBGSS.SGSSGSS" cannot be…
Sergei K.
  • 1
  • 1
0
votes
0 answers

Repairing data in a Pandas dataframe when duplicate data exists

I've not had to do any heavy lifting with Pandas until now, and now I've got a bit of a situation and can use some guidance. I've got some code that generates the following dataframe: ID_x HOST_NM IP_ADDRESS_x SERIAL_x ID_y IP_ADDRESS_y …
0
votes
2 answers

Great Expectations list total unique values

I ran the Great Expectations check expect_column_values_to_be_unique on one of the columns. It produced the result below. In total there are 62 duplicates, but the output list returns only 20 elements. How to retrieve all…
0
votes
1 answer

Spark Compatible Data Quality Framework for Narrow Data

I'm trying to find an appropriate data quality framework for very large amounts of time series data in a narrow format. Imagine billions of rows of data that look kinda like…
0
votes
1 answer

How to get sum of multiple rows in a table dynamically

I am trying to get the total sum from columns of a specific data type(money) for multiple tables in a database. Currently I am able to get the list of columns from specific tables but I am unable to get the sums from those columns. This is what I…
2tone_tony
  • 47
  • 1
  • 10
0
votes
1 answer

Data Quality check with Python Dask

I'm currently trying to write code to check the data quality of a 7 GB data file. I tried googling for exactly this, but to no avail. Initially, the purpose of the code is to check how many values are nulls/NaNs, and later on to join it with another data file and compare…
doubleD
  • 269
  • 1
  • 12
0
votes
0 answers

Manipulate the data from two not linked servers

At the moment I have two Microsoft SQL Servers with schema_1 and schema_2 respectively. Previously, these two schemas were on one server, and I wrote queries in which I could access both schemas at once. Now, for some reason, these two schemas…
Aleksandra
  • 305
  • 3
  • 10
0
votes
1 answer

Recursive method to calculate the percentage of repeated values for each column in my df with R

I need to use lapply/sapply or other recursive methods on my real df to calculate how many repeated values there are in each column/variable. Here I used a small example to reproduce my case: library(dplyr) df <- data.frame( var1 =…
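As a rough sketch of the metric this question asks about, written in Python rather than R and not taken from the question: the function below reports, per column, the percentage of entries whose value occurs more than once in that column. That definition of "repeated" is an assumption, and the name `repeated_share`, the column names and the data are hypothetical.

```python
from collections import Counter


def repeated_share(columns: dict[str, list]) -> dict[str, float]:
    """Percentage of entries per column whose value occurs more than once."""
    result = {}
    for name, values in columns.items():
        counts = Counter(values)
        # Count every entry that belongs to a value seen at least twice.
        repeated = sum(n for n in counts.values() if n > 1)
        result[name] = 100.0 * repeated / len(values) if values else 0.0
    return result


# Hypothetical data, not from the question.
data = {
    "col_a": ["a", "b", "a", "c"],  # "a" appears twice: 2 of 4 entries
    "col_b": [1, 1, 1, 2],          # 1 appears three times: 3 of 4 entries
    "col_c": [10, 20, 30, 40],      # no repeats
}
print(repeated_share(data))
# prints {'col_a': 50.0, 'col_b': 75.0, 'col_c': 0.0}
```

The loop over columns plays the role that lapply/sapply would play in R: the same per-column function applied to every column in turn.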
0
votes
0 answers

How to populate a function inside a for loop in R dataframe?

I need some help creating a dataframe that is generated inside a function that uses a for loop over each row of a given dataframe in R. In summary, my role seeks to facilitate a data quality process that I'm doing as an initial…
0
votes
1 answer

Dynamic SQL table validation for data quality dimension

I have the following code to test for nulls in a whole table using dynamic SQL: /*Completeness*/ --Housekeeping: drop table if exists tmp_completitud; --Declare variables for the loop: declare @custom_sql VARCHAR(max) declare @tablename as…
metarodri
  • 1
  • 1
0
votes
2 answers

Python : Changing the original data using a for loop

I have some really big txt files (> 2 GB) where the quality of the data is not good. In some columns (that should be integer), for values below 1000.00, '.' is used as the decimal point (e.g. 473.71886), but for values above 1000.00 the form is…
Foivos
  • 25
  • 4
0
votes
0 answers

Creating a snowflake schema in talend

I'm discovering Talend Data Quality Dashboards through a tutorial, and I want to create a schema as shown below, but I can't find how:
0
votes
0 answers

Pentaho Data Validator Error : Unable to find the specified fieldname for validation

I'm working on a transformation in PDI using data validation, but I get this error when I run my transformation. Can anyone help me fix it? Edit: I do have that field, but it says it is unable to find it. 2021/04/23 21:59:10 - DataValidation - Dispatching…