Questions tagged [data-profiling]

Data profiling is the process of examining the data available in an existing data source (e.g. a database or a file) and collecting statistics and information about that data.

Data profiling is an analysis of the candidate data sources for a data warehouse to clarify the structure, content, relationships and derivation rules of the data. Profiling helps to understand anomalies and to assess data quality, but also to discover, register, and assess enterprise metadata.

36 questions
1
vote
3 answers

Data profiling in Power BI

I want to profile every single data table I have in my Power BI report. By data profile I mean something like this: Are there ways to make a data profile view in Power BI? DAX measure or calculated columns? Alternatively, you can also recommend…
Reza Azimi
  • 11
  • 7
0
votes
1 answer

Suggestion on Customer Profiling System: Books, Articles, etc

I'm going to work on a Customer Profiling project (similar to, but not the same as, Google Analytics) for our own e-commerce website using C#. I'm pretty new to this kind of project, and the Customer Profiling project is also brand new. Could you…
0
votes
0 answers

How can we create alerts for data drift by giving a threshold?

How can we create alerts for data drift and data quality for a dataset by giving a threshold, using Python? Which package is better for capturing data quality and data drift: evidently, deepchecks or Y-data profiling? How can we convert RDD for large…
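
One way to read "alerts by giving a threshold" is to compare a current batch against a reference batch and flag columns whose mean moved too far. The sketch below is a minimal standard-library illustration of that idea, not the API of evidently, deepchecks or Y-data profiling; the function name `drift_alerts`, the score (mean shift measured in reference standard deviations) and the sample data are all assumptions made for the example.

```python
from statistics import mean, stdev


def drift_alerts(reference, current, threshold=0.5):
    """Return {column: score} for columns whose mean shifted by more than
    `threshold` reference standard deviations (hypothetical drift score)."""
    alerts = {}
    for column, ref_values in reference.items():
        cur_values = current.get(column)
        if not cur_values or len(ref_values) < 2:
            continue  # nothing to compare against
        spread = stdev(ref_values)
        if spread == 0:
            continue  # constant reference column: this score is undefined
        score = abs(mean(cur_values) - mean(ref_values)) / spread
        if score > threshold:
            alerts[column] = round(score, 3)
    return alerts


# Made-up sample batches, keyed by column name.
reference = {"price": [10.0, 11.0, 9.0, 10.0], "qty": [1, 2, 3, 4]}
current = {"price": [15.0, 16.0, 14.0, 15.0], "qty": [2, 2, 3, 3]}
print(drift_alerts(reference, current))  # only "price" is flagged
```

A mean-shift score is deliberately simple; it will miss changes in shape or variance, which is where a dedicated drift library would normally take over.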
0
votes
1 answer

How to customize alerts + other metrics in pandas_profiling / y_data_profiling

pandas_profiling, or as it is now called, y_data_profiling provides a detailed breakdown of data quality. How can we customize alerts + other metrics included in their default report? I see options to change color scheme, and to hide existing…
0
votes
0 answers

Is Data Scan in Dataplex available for Americas São Paulo?

I tried to create a profile in Data Scan, a PRE-OFFERING in Dataplex, but even with Admin permissions, an error occurred that made it impossible to test Data Scan. I submitted feedback but I still have no answer about it. Can anyone help? I…
0
votes
1 answer

Databricks : Export data profiling report

Databricks can create a data profiling report after calling display(dataframe_name). I have created a data profiling report using Azure Databricks, but I do not know how to export it. Can you please suggest how to export/download this report to…
venus
  • 1,188
  • 9
  • 18
0
votes
2 answers

Detecting similar columns across multiple files based on statistical profile

I'm attempting to clean up a set of old files that contain sensor data measurements. Many of the files don't have headers, and the format (column ordering, etc.) is inconsistent. I'm thinking the best that I can do in these cases is to match…
Ryan Gross
  • 6,423
  • 2
  • 32
  • 44
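
The matching idea described in this question (identify headerless columns by their statistics) could be sketched as follows. This is only an illustration under assumptions: the profile (mean, population standard deviation, min, max), the relative-difference distance, and the column names and readings are all invented for the example and are not taken from the question.

```python
from statistics import mean, pstdev


def profile(values):
    """Summarize a numeric column as (mean, std, min, max)."""
    return (mean(values), pstdev(values), min(values), max(values))


def distance(p, q):
    """Sum of relative differences, so large-valued statistics do not dominate."""
    return sum(abs(a - b) / (abs(a) + abs(b) + 1e-9) for a, b in zip(p, q))


def match_columns(known, unknown):
    """Map each unlabeled column index to the closest labeled column name."""
    known_profiles = {name: profile(vals) for name, vals in known.items()}
    matches = {}
    for index, values in enumerate(unknown):
        p = profile(values)
        matches[index] = min(
            known_profiles, key=lambda name: distance(p, known_profiles[name])
        )
    return matches


# Hypothetical labeled columns from a file that does have headers.
known = {
    "temperature_c": [20.1, 21.3, 19.8, 22.0],
    "pressure_hpa": [1012.0, 1009.5, 1011.2, 1013.1],
}
# Columns from a headerless file, in unknown order.
unknown = [[1010.4, 1012.8, 1011.9], [21.0, 20.4, 19.5]]
print(match_columns(known, unknown))
```

Nearest-profile matching always returns some label, so in practice a maximum acceptable distance would be needed to leave truly unfamiliar columns unmatched.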
0
votes
0 answers

How can I connect a local Delta Lake with Talend for data profiling purposes?

As I am new to Talend, I am trying to connect my local Delta Lake with Talend to do some data profiling on it.
khÜs h
  • 51
  • 6
0
votes
0 answers

How to create multiple pandas profiling reports for multiple csv files in a directory? The report name should match the file name

I tried this, import glob import os import pandas as pd import pandas_profiling from pandas_profiling import ProfileReport files = glob.glob("D:\home_health_services_current_data\*.csv") df = pd.DataFrame() for f in files: csv =…
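
The naming part of this question (one report per CSV, named after the file) can be sketched with `pathlib` alone. The helper `report_paths` and the demo file names are hypothetical; the snippet only computes output paths and does not call pandas_profiling, so the report-writing step is left as a comment.

```python
import tempfile
from pathlib import Path


def report_paths(csv_dir, out_dir):
    """Map each CSV in csv_dir to an HTML path with the same base name."""
    csv_dir, out_dir = Path(csv_dir), Path(out_dir)
    return {
        csv_path: out_dir / f"{csv_path.stem}.html"
        for csv_path in sorted(csv_dir.glob("*.csv"))
    }


# Demo on a throwaway directory with made-up file names.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("agencies.csv", "visits.csv", "notes.txt"):
        (Path(tmp) / name).write_text("a,b\n1,2\n")
    demo = report_paths(tmp, Path(tmp) / "reports")
    # In a real run, each CSV would be loaded here (e.g. pd.read_csv(src))
    # and its profile report written to the matching dst path.
    names = {src.name: dst.name for src, dst in demo.items()}

print(names)
```

Building a fresh DataFrame per file inside the loop, rather than accumulating into one shared DataFrame, is what keeps each report tied to a single file.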
0
votes
1 answer

Data profiling of columns for big table (SQL Server)

I have a table with over 40 million records. I need to profile the data, including Nulls count, Distinct Values, Zeros and Blanks, %Numeric, %Date, Needs to be Trimmed, etc. The examples that I was able to find always include implementation…
Yana
  • 785
  • 8
  • 23
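
To make the listed metrics concrete, here is a small sketch of what each one could mean for a single column. It is written in Python rather than T-SQL and says nothing about performance on 40 million rows; the function name `profile_column`, the metric definitions and the sample values are assumptions for illustration, and values are assumed to be strings or `None`.

```python
def profile_column(values):
    """Compute a few per-column profiling metrics for string-or-None values."""
    total = len(values)
    non_null = [v for v in values if v is not None]

    def is_numeric(text):
        # Anything float() accepts counts as numeric (this includes "nan"/"inf").
        try:
            float(text)
            return True
        except ValueError:
            return False

    numeric = [v for v in non_null if is_numeric(v)]
    return {
        "rows": total,
        "nulls": total - len(non_null),
        "distinct": len(set(non_null)),
        "blanks": sum(1 for v in non_null if v.strip() == ""),
        "zeros": sum(1 for v in numeric if float(v) == 0),
        "pct_numeric": round(100 * len(numeric) / total, 1) if total else 0.0,
        "needs_trim": sum(1 for v in non_null if v != v.strip()),
    }


# Made-up sample column.
sample = ["10", " 7", "0", "", None, "abc", "10"]
print(profile_column(sample))
```

All metrics here come out of one pass over the column's values, which is the same property one would want from a single aggregate query per table.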
0
votes
2 answers

Validation for columns works very slowly (SQL Server)

I want to perform data profiling on the columns of a table. In this particular case: what percentage of the data is date/integer/numeric/bit? The query that I am using: SELECT CAST(SUM(CASE WHEN TRY_CAST([column1] AS date) IS NOT NULL AND…
Yana
  • 785
  • 8
  • 23
0
votes
1 answer

When I execute the pandas-profiling package it won't return min, max and mean values

When I profile the following data using pandas-profiling==2.8.0 it won't return min, max and mean values. CSV data a,b,c 12,2.5,0 12,4.7,5 33,5,4 44,44.21,67 python code import json import pandas as pd from pandas_profiling import…
0
votes
1 answer

Db2 tables - finding all blank columns in a table that has 100+ columns

I have a table with 78 columns and 100k rows. Is there a way to find all the blank columns in the table without querying on each column to find their counts? Running a not null query is time consuming and not feasible for whatever I am trying to do…
Vinney_143
  • 23
  • 11
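
The goal in this question, finding every blank column without one query per column, amounts to a single pass that eliminates a column as soon as it shows a non-blank value. The sketch below illustrates that logic in Python over rows already fetched; it is not Db2 SQL, and the function name, the definition of "blank" (NULL or whitespace-only) and the sample rows are assumptions.

```python
def all_blank_columns(header, rows):
    """Return the columns that are NULL or whitespace-only in every row."""
    candidates = set(header)
    for row in rows:
        for name, value in zip(header, row):
            if name in candidates and value is not None and str(value).strip() != "":
                candidates.discard(name)  # one real value rules the column out
        if not candidates:
            break  # every column has data; no need to read further
    return [name for name in header if name in candidates]


# Made-up sample table.
header = ["id", "fax", "notes", "region"]
rows = [
    (1, None, "", "EU"),
    (2, "", "  ", None),
]
print(all_blank_columns(header, rows))
```

The early exit matters on wide tables: once every column has shown at least one value, the remaining rows never need to be read.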
0
votes
3 answers

Data profiling on a BigQuery table covering min, max, unique and null count statistics

I am looking for a solution to perform data profiling on a BigQuery table covering the statistics below for each column in the table. Some of the columns are ARRAY and STRUCT as given below. I tried multiple ways to generate a dynamic query to cover below…
0
votes
2 answers

Find Multi-Column Primary key

I have about 30 tables from an old ERP which have multi-column primary keys. Unfortunately I don't know what those keys are. I've used the SSIS profiling task to determine primary key candidates for up to 5 columns, but it runs so slow as to be…
Jeremiah
  • 43
  • 1
  • 1
  • 7
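
The search described here, testing column combinations for uniqueness, can be sketched as a brute-force loop over combinations of increasing size. This is an illustration only, not what the SSIS profiling task does internally; the function name `candidate_keys`, the `max_columns` limit and the sample rows are assumptions. Uniqueness in the available data only suggests a key and does not prove one.

```python
from itertools import combinations


def candidate_keys(header, rows, max_columns=3):
    """Return minimal column combinations whose values are unique across rows."""
    found = []
    for size in range(1, max_columns + 1):
        for combo in combinations(range(len(header)), size):
            # Skip supersets of a key that was already found (not minimal).
            if any(set(key) <= set(combo) for key in found):
                continue
            seen = {tuple(row[i] for i in combo) for row in rows}
            if len(seen) == len(rows):
                found.append(combo)
    return [tuple(header[i] for i in key) for key in found]


# Made-up order-line data.
header = ["company", "order_no", "line_no", "sku"]
rows = [
    ("A", 1, 1, "X"),
    ("A", 1, 2, "Y"),
    ("A", 2, 1, "X"),
    ("B", 1, 1, "Z"),
    ("B", 1, 2, "X"),
    ("A", 1, 3, "X"),
    ("B", 2, 1, "X"),
]
print(candidate_keys(header, rows))
```

The number of combinations grows quickly with column count, which is consistent with the slowness the question reports; pruning columns that are mostly NULL or obviously non-key before the search is one way to keep it tractable.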