
Currently, I'm working on a data mining project that processes data stored on HDFS. The first step of this project is to integrate data from different databases and tables into uniform tables stored on HDFS.

From reading *Data Preprocessing in Data Mining* by Salvador García, Julián Luengo, and Francisco Herrera, I have learned that data integration involves many challenges, such as the following:

  1. Attribute Redundancy:
    • Example: Table A has an attribute index and Table B has an attribute identifier, but the two attributes carry the same meaning for the same object. If we combine the two tables with a naive join, redundant attributes end up in the unified table.
    • Solution suggested by the book: Compare the metadata of the fields to detect and eliminate redundant ones.
  2. Attribute Correlation:
    • Example: Table A has an attribute salary_per_month and Table B has an attribute salary_per_year. The two attributes are correlated: salary_per_year can be derived from salary_per_month. As in the case above, this introduces redundant attributes.
    • Solution suggested by the book: Apply a correlation test (for numeric attributes) or a Chi-Square test (for categorical attributes) to measure the relationship between fields; a rough Spark sketch of how I imagine this is given right after this list.
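
For concreteness, here is roughly how I imagine applying the book's suggestion with Spark's built-in statistics. This is only a sketch: the input path, the column names, and the 0.95 threshold are all made up, and I have not verified it against real data.

```python
# Rough sketch only: flagging possibly redundant / correlated attributes with Spark.
# The input path, the column names, and the 0.95 threshold are invented for illustration.
from pyspark.sql import SparkSession
from pyspark.ml.feature import StringIndexer, VectorAssembler
from pyspark.ml.stat import ChiSquareTest, Correlation

spark = SparkSession.builder.appName("redundancy-check").getOrCreate()
df = spark.read.parquet("hdfs:///staging/unified_table")   # hypothetical staging table

# 1) Correlation test for numeric attributes (e.g. salary_per_month vs salary_per_year).
numeric_cols = ["salary_per_month", "salary_per_year"]
vec = VectorAssembler(inputCols=numeric_cols, outputCol="features").transform(df)
corr = Correlation.corr(vec, "features", "pearson").head()[0].toArray()
for i in range(len(numeric_cols)):
    for j in range(i + 1, len(numeric_cols)):
        if abs(corr[i][j]) > 0.95:                          # arbitrary cut-off
            print("possibly redundant:", numeric_cols[i], numeric_cols[j])

# 2) Chi-Square test for categorical attributes (hypothetical columns department / job_title).
indexed = StringIndexer(inputCol="job_title", outputCol="label").fit(df).transform(df)
indexed = StringIndexer(inputCol="department", outputCol="dept_idx").fit(indexed).transform(indexed)
indexed = VectorAssembler(inputCols=["dept_idx"], outputCol="cat_features").transform(indexed)
ChiSquareTest.test(indexed, "cat_features", "label").select("pValues").show()
```

What I cannot tell is whether this kind of check is supposed to run automatically over every pair of columns, or only over pairs I already suspect are related, which is really the heart of my question below.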

Beyond the above challenges, several integration scenarios are possible (brief descriptions only, without concrete examples):

  1. Case 1:
    • Description: Integrate table A from MongoDB and table B from MySQL into one table stored on HDFS through some kind of join operation (a rough sketch of what I have in mind follows this list).
    • Notice: This case doesn't occur frequently, but is still possible.
  2. Case 2:
    • Description: Integrate tables A and B, both from MongoDB (or both from MySQL), into one table stored on HDFS through some kind of join operation.
    • Notice: Only one type of database is involved in this case.
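
To make Case 1 a bit more concrete, the pipeline I have in mind looks roughly like the PySpark sketch below. It is purely hypothetical: the connector options, connection URIs, database/table names, credentials, and the join key `user_id` are all made up, and it assumes the MongoDB Spark connector and a MySQL JDBC driver are available on the classpath.

```python
# Hypothetical sketch of Case 1: join a MongoDB collection with a MySQL table in Spark.
# URIs, database/table names, credentials, and the join key are placeholders.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("case1-integration")
         .config("spark.mongodb.input.uri", "mongodb://mongo-host:27017/db.table_a")
         .getOrCreate())

table_a = spark.read.format("mongo").load()            # Table A from MongoDB

table_b = (spark.read.format("jdbc")                   # Table B from MySQL
           .option("url", "jdbc:mysql://mysql-host:3306/db")
           .option("dbtable", "table_b")
           .option("user", "user")
           .option("password", "password")
           .load())

# "Kind of join operation": here simply an inner join on a shared key.
unified = table_a.join(table_b, on="user_id", how="inner")
```

Case 2 would look the same, just with both inputs read from the same kind of source.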

Those are the problems and the possible situations. I understand the basic concepts behind them, but I don't know how to solve them in a real project, specifically one based on HDFS. It seems that problems such as attribute redundancy and correlation can only be solved when I already know how the tables are designed, i.e. in a hard-coded way. Can I solve them automatically with some API provided by the Hadoop ecosystem?

Again, since many cases are possible, I want to know: in data integration, what general steps should I follow, and what tools are commonly used in a real Big Data project where data preprocessing is important? Any guidance would be helpful.

Walden Lian
  • Please clarify *"table stored in hdfs"*... HDFS only stores files, not tables/databases. – OneCricketeer Mar 13 '17 at 22:52
  • @cricket_007 Sorry for the misleading words, it just indicates a uniform way for data from different sources to be stored in `hdfs`. – Walden Lian Mar 14 '17 at 01:53
  • Sure, but HDFS itself has no concept of "related data"... If you are trying to combine MongoDB with a relational database, I don't think Hadoop is the right way to do that – OneCricketeer Mar 14 '17 at 13:27
  • @cricket_007 OK, I get it. So how can one apply the suggested solutions to the attribute redundancy problem in practice? I feel the theory is separated from the practice: the solutions above seem to evaluate table fields without knowing the table structure, but in reality we know the meaning of each field when we manipulate the datasets, so we can manually tell which two fields are redundant. What do you think of this point? – Walden Lian Mar 15 '17 at 01:59

1 Answer


For polyglot querying (fetching data from multiple data sources), I prefer Spark or Drill.

Using these tools, you can perform joins and other aggregations in memory (if the data is not too large).

You can then easily write the output to HDFS in the desired file format.

Challenges like transaction management remain, but these query engines solve many of the other problems easily.
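For example, with Spark, once the joined data is in a DataFrame, persisting it to HDFS is a one-liner. The variable name, the output path, and the Parquet/ORC choice below are just placeholders:

```python
# `joined_df` is assumed to be the joined DataFrame; path and format are placeholders.
joined_df.write.mode("overwrite").parquet("hdfs:///warehouse/unified_table")
# or, if you prefer ORC:
# joined_df.write.mode("overwrite").orc("hdfs:///warehouse/unified_table")
```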

Dev
  • For Spark, I can find certain APIs such as the Chi-Square test, but when should I use them? My point is in the last comment on my question: _"how can one apply the suggested solutions to the attribute redundancy problem in practice? I feel the theory is separated from the practice: the solutions above seem to evaluate table fields without knowing the table structure, but in reality we know the meaning of each field when we manipulate the datasets, so we can manually tell which two fields are redundant. What do you think of this point?"_ – Walden Lian Mar 16 '17 at 02:26