Currently, I'm working on a data mining project that processes data stored on HDFS. The first step in this project is to integrate data from different databases or tables into uniform tables stored on HDFS.
By reading *Data Preprocessing in Data Mining* by Salvador García, Julián Luengo, and Francisco Herrera, I learned that many challenges exist in data integration, such as the following:
- Attribute Redundancy:
  - Example: `Table A` has an attribute `index`, and `Table B` has an attribute `identifier`, but the two attributes describe the same property of the same object. So if we just join the two tables naively, redundant attributes may end up in the unified table.
  - Solution suggested by the book: Compare the meta-data of those fields to eliminate redundant ones (see the sketch after this list).
- Attribute Correlation:
  - Example: `Table A` has an attribute `salary_per_month`, and `Table B` has an attribute `salary_per_year`. These two attributes are correlated, since `salary_per_month` can be used to infer `salary_per_year`. Then, similar to the case above, redundant attributes are created.
  - Solution suggested by the book: Apply a correlation test or Chi-Square test to determine the relationship between different fields (also covered in the sketch after this list).
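To make this more concrete, here is a minimal PySpark sketch of what I imagine these two checks could look like; I am not sure this is the right approach. The HDFS paths, the join key `id`, and the column names are placeholders taken from the examples above, and for categorical attributes I assume `pyspark.ml.stat.ChiSquareTest` would play the role of the Chi-Square test mentioned in the book.

```python
# Sketch only: table/column names and paths are made up for illustration.
from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.stat import Correlation

spark = SparkSession.builder.appName("integration-checks").getOrCreate()

# Assume both tables already sit on HDFS as Parquet (hypothetical paths).
table_a = spark.read.parquet("hdfs:///data/raw/table_a")
table_b = spark.read.parquet("hdfs:///data/raw/table_b")

# 1) Meta-data comparison: columns with different names but the same type are
#    flagged as candidates for "same attribute under a different name".
def column_types(df):
    return {f.name: f.dataType.simpleString() for f in df.schema.fields}

types_a, types_b = column_types(table_a), column_types(table_b)
candidates = [(a, b)
              for a, ta in types_a.items()
              for b, tb in types_b.items()
              if a != b and ta == tb]
print("possible redundant attribute pairs:", candidates)

# 2) Correlation test on the joined data: a (near-)perfect correlation between
#    two numeric columns means one of them can be dropped after integration.
joined = table_a.join(table_b, on="id")  # "id" is an assumed join key
print("Pearson r:", joined.stat.corr("salary_per_month", "salary_per_year"))

# The same test via pyspark.ml.stat on an assembled feature vector.
vec = VectorAssembler(inputCols=["salary_per_month", "salary_per_year"],
                      outputCol="features").transform(joined)
print(Correlation.corr(vec, "features").head()[0])
```

My concern is that this only works because I hard-code which columns to compare, which is exactly what I ask about below.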
Along with the above challenges, several integration scenarios are possible (brief descriptions only, without specific examples):
- Case 1:
  - Description: Integrate `table A` from MongoDB and `table B` from MySQL into one table stored on HDFS through some kind of join operation (see the sketch after this list).
  - Notice: This case doesn't occur frequently, but it is still possible.
- Case 2:
  - Description: Integrate `table A` and `table B`, both from MongoDB (or both from MySQL), into one table stored on HDFS through some kind of join operation.
  - Notice: Only one type of database is involved in this case.
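Again just to illustrate what I have in mind, below is a rough PySpark sketch of Case 1 (Case 2 would be the same except that both tables come through a single connector). The connection URIs, database/collection/table names, and the join key `user_id` are placeholders, and I am assuming the MongoDB Spark connector and a MySQL JDBC driver are available on the cluster; I don't know whether this is how such integration is normally done in practice.

```python
# Sketch only: all connection details and names below are placeholders.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("cross-db-integration").getOrCreate()

# Table A from MongoDB (the format name depends on the connector version:
# "mongodb" for connector 10.x, "mongo" for older releases).
table_a = (spark.read.format("mongodb")
           .option("connection.uri", "mongodb://mongo-host:27017")
           .option("database", "appdb")
           .option("collection", "table_a")
           .load())

# Table B from MySQL over JDBC.
table_b = (spark.read.format("jdbc")
           .option("url", "jdbc:mysql://mysql-host:3306/appdb")
           .option("dbtable", "table_b")
           .option("user", "reader")
           .option("password", "***")
           .load())

# Join on an assumed common key and write the unified table to HDFS as Parquet.
unified = table_a.join(table_b, on="user_id", how="inner")
unified.write.mode("overwrite").parquet("hdfs:///data/integrated/unified_table")
```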
Above are all the problems and possible situations. I understand the basic concepts behind the problems described above, but I don't know how to solve them in a real project, specifically one based on HDFS. It seems that problems such as attribute redundancy and correlation can only be solved when I already know how the tables were designed, i.e. in a hard-coded way. Can I solve them automatically with some kind of API provided by the Hadoop ecosystem?
Again, since many cases are possible, I want to know: in data integration, what are the general steps to follow, and what are the common tools to use in a real Big Data project where data preprocessing is very important? Any guidance would be helpful.