
This is my problem description:

"According to the Survey on Household Income and Wealth, we need to find out the top 10% households with the most income and expenditures. However, we know that these collected data is not reliable due to many misstatements. Despite these misstatements, we have some features in the dataset which are certainly reliable. But these certain features are just a little part of information for each household wealth."

Unreliable data means that households lie to the government. These households misstate their income and wealth in order to unfairly obtain more government services. Therefore, these fraudulent statements in the original data will lead to incorrect results and patterns.

Now, I have below questions:

  • How should we deal with unreliable data in data science?
  • Is there any way to figure out these misstatements and then report the top 10% richest households with better accuracy using machine learning algorithms?
  • How can we evaluate our errors in this study? Since we have an unlabeled dataset, should I look for labeling techniques, use unsupervised methods, or work with semi-supervised learning methods?
  • Is there any idea or application in machine learning that tries to improve the quality of collected data?

Please point me to any ideas or references that can help with this issue.

Thanks in advance.

Ardeshir
  • Could you add some more specifics to this problem? Could you post a tiny sample of your data? Better yet, also post a large sample of your data on Dropbox. You are getting pushback from the SO and DS communities because you are speaking in generalities, so the answers you get will be generalities. No one is helped by: Q: "Can I improve data through imputation?" A: "Yes, you can, through standard imputation techniques." – AN6U5 Jul 07 '15 at 19:41
  • I will provide you with a large sample in a few days. Hope this will help... Thanks – Ardeshir Jul 09 '15 at 05:22

1 Answer


Q: How should we deal with unreliable data in data science?

A: Use feature engineering to fix unreliable data (apply transformations that make it more reliable), or drop those features completely; bad features can significantly decrease the quality of the model. See the sketch below.
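Here is a minimal sketch of both options, assuming a pandas DataFrame with hypothetical column names (self_reported_income as the suspect feature, utility_bills_paid as a reliable one); adjust it to your actual survey schema:

```python
import numpy as np
import pandas as pd

# Toy stand-in for the survey data; 9,000,000 is an implausible outlier.
df = pd.DataFrame({
    "self_reported_income": [25_000, 31_000, 9_000_000, 28_500],  # suspect
    "utility_bills_paid":   [1_200, 1_450, 1_100, 1_300],         # reliable
})

# Option 1: transform the unreliable feature to dampen misstatements,
# e.g. clip extreme values to the 1st/99th percentiles and use a log scale.
income = df["self_reported_income"].clip(
    lower=df["self_reported_income"].quantile(0.01),
    upper=df["self_reported_income"].quantile(0.99),
)
df["log_income"] = np.log1p(income)

# Option 2: drop the unreliable feature entirely and keep only
# the trustworthy columns.
df_reliable = df.drop(columns=["self_reported_income"])
```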

Q: Is there any way to figure out these misstatements and then report the top 10% richest households with better accuracy using machine learning algorithms?

A: ML algorithms are not magic wands; they can't figure out anything unless you tell them what you are looking for. Can you describe what 'unreliable' means? If yes, you can, as I mentioned, use feature engineering or write code that fixes the data. Otherwise, no ML algorithm will be able to help you without a description of what exactly you want to achieve.

Q: Is there any idea or application in Machine Learning which tries to improve the quality of collected data?

A: I don't think so, simply because the question itself is too open-ended. What does 'the quality of the data' mean?

Generally, here are a couple of things for you to consider:

1) Spend some time googling feature engineering guides. They cover how to prepare your data for your ML algorithms, refine it, and fix it. Good data with good features dramatically improve the results.

2) You don't need to use all of the features from the original data. Some features of the original dataset are meaningless, and you don't need to use them. Try running a gradient boosting machine or a random forest from scikit-learn on your dataset to perform classification (or regression, if regression is your task). These algorithms also evaluate the importance of each feature in the original dataset. Some of your features will have extremely low importance for the task, so you may wish to drop them completely, or try to combine unimportant features to produce something more informative; see the sketch below.
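A minimal sketch of this feature-importance approach, using scikit-learn's RandomForestRegressor on synthetic data; X, y, and the 0.01 importance cutoff are placeholders to replace with your own survey features, target, and a tuned threshold:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

# Synthetic data: 10 features, only 4 of which actually carry signal.
X, y = make_regression(n_samples=500, n_features=10,
                       n_informative=4, random_state=0)

model = RandomForestRegressor(n_estimators=200, random_state=0)
model.fit(X, y)

# Rank features by importance; very low scores are candidates to drop
# or to combine into a composite feature.
for idx in np.argsort(model.feature_importances_)[::-1]:
    print(f"feature {idx}: importance {model.feature_importances_[idx]:.3f}")

low_importance = np.where(model.feature_importances_ < 0.01)[0]
X_reduced = np.delete(X, low_importance, axis=1)
```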

Maksim Khaitovich
  • Thanks for your time and answer, Maxim. I understand that ML does exactly what we want it to do. Thus, I am looking for an idea that helps me frame my problem in a way that is solvable with ML. – Ardeshir Jun 24 '15 at 16:23
  • Unreliable data means that households lie to the government. These households misstate their income and wealth in order to unfairly obtain more government services. Therefore, these fraudulent statements in the original data will lead to incorrect results and patterns. – Ardeshir Jun 24 '15 at 16:32
  • Hmmm... Well, if the fraud is not really massive, then it won't cause problems for the ML algorithm. Most ML algorithms don't require your data to be 100% clean; a small percentage of anomalies won't break them. So you could probably just ignore the bad data. On the other hand, you might want to use statistical methods to verify your data and maybe identify anomalies. – Maksim Khaitovich Jun 24 '15 at 18:30
  • BTW, another interesting idea: train a regressor on your data which predicts a person's answer, say, the person's income, based on the other parameters. Then run the prediction on every person in your training dataset. If for some specific person your model outputs an EXTREMELY different result from the one reported, that is probably a fraudulent statement (see the sketch after these comments). – Maksim Khaitovich Jun 24 '15 at 18:33
  • Many thanks for your reply and idea. Do you have any references or related works? – Ardeshir Jun 25 '15 at 20:37
  • Uh, only in Russian, sorry :) But these are rather basic things; you could find the same points in almost any article on feature engineering. – Maksim Khaitovich Jun 28 '15 at 15:47
  • @Ardeshir I thought about this a little more. You may also want to google 'fraud detection'; that's more of a statistics thing than an ML thing, but it's worth trying out. Maybe this approach will help you find the fraudulent data entries. I studied it on Coursera: https://www.coursera.org/course/datasci – Maksim Khaitovich Jun 29 '15 at 09:01
  • Thanks for your consideration, Maxim. – Ardeshir Jun 30 '15 at 10:04
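A minimal sketch of the residual idea from the comments above: fit a regressor that predicts each household's reported income from its other features, then flag households whose report differs extremely from the prediction. X, reported_income, and the 3-standard-deviation cutoff are placeholder assumptions to replace with your own data and a tuned threshold:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import cross_val_predict

# Synthetic stand-in: features X and the income each household reported.
X, reported_income = make_regression(n_samples=500, n_features=8,
                                     noise=10.0, random_state=0)

model = GradientBoostingRegressor(random_state=0)

# Out-of-sample predictions via 5-fold cross-validation, so each
# household is scored by a model that never saw its own row.
predicted = cross_val_predict(model, X, reported_income, cv=5)

# Residual = reported value minus what the model expects given the
# other (more reliable) features.
residuals = reported_income - predicted

# Flag reports more than 3 standard deviations from expectation as
# possible misstatements.
threshold = 3 * residuals.std()
suspects = np.where(np.abs(residuals) > threshold)[0]
print(f"{len(suspects)} suspicious households:", suspects[:10])
```

Using cross_val_predict here is a deliberate choice: in-sample residuals from a boosted model can be tiny even for fraudulent rows the model has memorized, whereas out-of-sample predictions keep the misstatements visible as large residuals.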