Apologies in advance if I word this badly. I have a large dataset that I am trying to analyze, but most of the data is not correct, and I need some help figuring out how to select the correct data.

Here's some more information to make this clearer. For example, I have the following:

color  value  quantity
red    20     2
blue   5      8
green  10     2

total  100

If only the values and the total are given, I find there are 36 possible answers:

#1 Found : 20.0*0.0 red + 5.0*0.0 blue + 10.0*10.0 green = 100.0
#2 Found : 20.0*0.0 red + 5.0*2.0 blue + 10.0*9.0 green = 100.0
#3 Found : 20.0*0.0 red + 5.0*4.0 blue + 10.0*8.0 green = 100.0
#4 Found : 20.0*0.0 red + 5.0*6.0 blue + 10.0*7.0 green = 100.0
#5 Found : 20.0*0.0 red + 5.0*8.0 blue + 10.0*6.0 green = 100.0
#6 Found : 20.0*0.0 red + 5.0*10.0 blue + 10.0*5.0 green = 100.0
#7 Found : 20.0*0.0 red + 5.0*12.0 blue + 10.0*4.0 green = 100.0
#8 Found : 20.0*0.0 red + 5.0*14.0 blue + 10.0*3.0 green = 100.0
#9 Found : 20.0*0.0 red + 5.0*16.0 blue + 10.0*2.0 green = 100.0
#10 Found : 20.0*0.0 red + 5.0*18.0 blue + 10.0*1.0 green = 100.0
#11 Found : 20.0*0.0 red + 5.0*20.0 blue + 10.0*0.0 green = 100.0
#12 Found : 20.0*1.0 red + 5.0*0.0 blue + 10.0*8.0 green = 100.0
#13 Found : 20.0*1.0 red + 5.0*2.0 blue + 10.0*7.0 green = 100.0
#14 Found : 20.0*1.0 red + 5.0*4.0 blue + 10.0*6.0 green = 100.0
#15 Found : 20.0*1.0 red + 5.0*6.0 blue + 10.0*5.0 green = 100.0
#16 Found : 20.0*1.0 red + 5.0*8.0 blue + 10.0*4.0 green = 100.0
#17 Found : 20.0*1.0 red + 5.0*10.0 blue + 10.0*3.0 green = 100.0
#18 Found : 20.0*1.0 red + 5.0*12.0 blue + 10.0*2.0 green = 100.0
#19 Found : 20.0*1.0 red + 5.0*14.0 blue + 10.0*1.0 green = 100.0
#20 Found : 20.0*1.0 red + 5.0*16.0 blue + 10.0*0.0 green = 100.0
#21 Found : 20.0*2.0 red + 5.0*0.0 blue + 10.0*6.0 green = 100.0
#22 Found : 20.0*2.0 red + 5.0*2.0 blue + 10.0*5.0 green = 100.0
#23 Found : 20.0*2.0 red + 5.0*4.0 blue + 10.0*4.0 green = 100.0
#24 Found : 20.0*2.0 red + 5.0*6.0 blue + 10.0*3.0 green = 100.0
#25 Found : 20.0*2.0 red + 5.0*8.0 blue + 10.0*2.0 green = 100.0
#26 Found : 20.0*2.0 red + 5.0*10.0 blue + 10.0*1.0 green = 100.0
#27 Found : 20.0*2.0 red + 5.0*12.0 blue + 10.0*0.0 green = 100.0
#28 Found : 20.0*3.0 red + 5.0*0.0 blue + 10.0*4.0 green = 100.0
#29 Found : 20.0*3.0 red + 5.0*2.0 blue + 10.0*3.0 green = 100.0
#30 Found : 20.0*3.0 red + 5.0*4.0 blue + 10.0*2.0 green = 100.0
#31 Found : 20.0*3.0 red + 5.0*6.0 blue + 10.0*1.0 green = 100.0
#32 Found : 20.0*3.0 red + 5.0*8.0 blue + 10.0*0.0 green = 100.0
#33 Found : 20.0*4.0 red + 5.0*0.0 blue + 10.0*2.0 green = 100.0
#34 Found : 20.0*4.0 red + 5.0*2.0 blue + 10.0*1.0 green = 100.0
#35 Found : 20.0*4.0 red + 5.0*4.0 blue + 10.0*0.0 green = 100.0
#36 Found : 20.0*5.0 red + 5.0*0.0 blue + 10.0*0.0 green = 100.0

As you can see, the possibilities include the correct answer but also many other answers. Now say I add one more red (so the total red is 3): I then get 49 results, but some of the results in the second set are unlikely once you factor in the relationship with the first result set. I assume that as I collect more data, I can more accurately eliminate the results that don't work.
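For reference, the counts above can be reproduced with a brute-force enumeration. This is only a minimal sketch: it assumes the values are positive integers, quantities are non-negative integers, and that adding one red raises the total from 100 to 120 (which is consistent with the 49 results). The function name `enumerate_solutions` is just an illustrative choice.

```python
def enumerate_solutions(values, total):
    """Return every tuple of non-negative integer quantities
    whose value-weighted sum equals total."""
    solutions = []

    def search(index, remaining, chosen):
        if index == len(values) - 1:
            # The last item has to absorb the remainder exactly.
            if remaining % values[index] == 0:
                solutions.append(tuple(chosen + [remaining // values[index]]))
            return
        # Try every quantity of the current item that still fits.
        for qty in range(remaining // values[index] + 1):
            search(index + 1, remaining - qty * values[index], chosen + [qty])

    search(0, total, [])
    return solutions


values = [20, 5, 10]  # red, blue, green
first = enumerate_solutions(values, 100)
second = enumerate_solutions(values, 120)
print(len(first), len(second))  # prints: 36 49
```

The real quantities (2 red, 8 blue, 2 green) appear in `first` as the tuple `(2, 8, 2)`, alongside 35 other candidates that fit the total equally well.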

I'm trying to figure out whether there's any research or a standard approach for narrowing the results down to something more meaningful. I am not 100% sure, but I thought Google might be an example of this, since each query is run not only against the data but also against your history. (I have a website that is ranked very low; after I clicked on it and then searched for it again, it always came up on top, but when I search on my friend's computer the same site shows up at the bottom.) I thought that, in the same way Google builds a relationship between our multiple search queries, I could use a similar approach to remove the results from my data above that aren't correct.

Sorry for any confusion. I'm a bit new to algorithms and I am having trouble explaining this. If it doesn't make sense, please let me know.

Thanks in advance!

Lostsoul
  • I can't formulate a complete answer yet, but it sounds like a linear algebra problem. Let me get this straight, you want a,o,p so that `20a + 5o + 10p = 100`? Where a is number of apples, o is number of oranges and p is number of pears? Are you wanting to determine what a reasonable solution is, or how many reasonable solutions? Sorry if I'm completely misunderstanding. – Chance Jun 07 '11 at 01:17
  • What do you mean by "There's possible 36 combinations of this"? – Bohemian Jun 07 '11 at 01:18
  • Do you mean "total calories" or "total food objects"? – sarnold Jun 07 '11 at 01:23
  • Hey guys, sorry for the misunderstanding..the question is clear to me but I'm really lost on how to formulate it. I added a sample above of the result and the answer from one interval(the results will be different if the values of the variable and sum is changed) and what I'm trying to do is figure out how I can somehow use a relationship between the intervals to predict what is most likely the correct quantities. Sorry again – Lostsoul Jun 07 '11 at 01:56
  • The problem is this: Find all combinations of quantities and calories that satisfy the equation such that quantities and calories are both integers and at least one calorie value is non-zero. – Bohemian Jun 07 '11 at 02:57
  • What do you mean by the "correct one"? They look equally "correct". – Ian Mercer Jun 07 '11 at 03:00
  • @Bohemian Well, I think I have solved that part already (see my edited answer). What I'm having problems with is that each time I solve it, the results are independent of each other. I am trying to figure out how to connect the two. I thought an example would be how every time you search for something on Google, it's not independent; it's built upon your last searches – Lostsoul Jun 07 '11 at 03:00
  • @hightechrider Yes, they are all correct, but say I ask you to pick a combination of the above fruits and the sum equals 100; then your answer will surely be one of the 36. Now if I tell you to change the quantity by only one and you do, that will give me a bunch of results as well. I am trying to figure out how to tie your first selection to the second selection to more accurately figure out what you really chose. – Lostsoul Jun 07 '11 at 03:03
  • Eh? You want to take the calorie sum of the first selection and the calorie sum of the first selection plus one item, and deduce what the item was? Just subtract. *What is the question here?* – Beta Jun 07 '11 at 04:26
  • @beta I don't think that works when you have a very large set of possible results. Above there are only 36 matches, but some of the result sets I get are huge (over 500,000). I suspect only 10% are valid if the next data results are considered. – Lostsoul Jun 07 '11 at 04:42
  • Everyone I'm so sorry for the confusion..I don't know how to explain this question and its clear from all the questions..I rewrote the question and hopefully its more understandable..if its not then I'll delete it and work on explaining it better. My problem is I don't even know what I want, I'm not sure if I need an algo solution or some kind of stats(correlation) or probability model..I'm not sure of the correct approach. – Lostsoul Jun 07 '11 at 04:44

2 Answers

If I understand correctly, you are solving equations like this one:

R*r + G*g + B*b = 100

for given integer values of R, G, B, with the constraint that r, g, b are also integers.

Since you have only one equation and three variables, you get a solution space instead of a single solution, and you now want to apply some algorithm to pick the correct or best one.

You also seem to have values r0, g0, b0 that are likely values for r, g and b?

What you need to come up with is a fitness function which tells you how good or bad your candidate solution is.

One example could be (lower values meaning better solution)

(r-r0)^2 +(g-g0)^2 +(b-b0)^2 

Which basically says a solution is better when it is closer to the likely values.
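As a small illustration of that expression, here is a sketch in Python. The likely values (r0, g0, b0) = (2, 2, 8) are taken from the quantities in the question's table and are an assumption for the example, as is the function name `squared_distance`. Both candidates satisfy 20*r + 10*g + 5*b = 100.

```python
def squared_distance(candidate, likely):
    """Sum of squared differences between a candidate (r, g, b)
    and the likely values (r0, g0, b0); lower means closer."""
    return sum((x - x0) ** 2 for x, x0 in zip(candidate, likely))


likely = (2, 2, 8)  # assumed r0, g0, b0

print(squared_distance((2, 3, 6), likely))   # prints: 5
print(squared_distance((0, 10, 0), likely))  # prints: 132
```

The candidate (2, 3, 6) stays close to the likely values and scores 5, while (0, 10, 0) is far away and scores 132.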

A variant could be

(r-r0)^2 +(g-g0)^2 +(b-b0)^2 + c*C

Where C is a constant to be chosen by you and c is the number of values that differ from your likely solution. This would give a better (lower) score to a candidate that changes only one value than to one that changes two or three values.

Once you have a fitness function, pick the solution with the lowest fitness value.
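Putting the pieces together, a minimal sketch could look like the following. Several things here are assumptions for illustration only: the previous quantities (r0, g0, b0) = (2, 2, 8) are used as the likely values, the new total is 120, the penalty constant C is arbitrarily set to 10, and candidates are found by plain brute force, which will not scale to very large solution sets.

```python
from itertools import product

R, G, B = 20, 10, 5   # values of red, green, blue
TOTAL = 120           # assumed new total after one more item was added
likely = (2, 2, 8)    # assumed r0, g0, b0 from the previous observation
C = 10                # penalty per changed value, chosen arbitrarily


def cost(candidate, likely, penalty):
    """Squared distance to the likely values plus a penalty
    for every value that differs from them."""
    distance = sum((x - x0) ** 2 for x, x0 in zip(candidate, likely))
    changed = sum(1 for x, x0 in zip(candidate, likely) if x != x0)
    return distance + changed * penalty


# Brute-force all non-negative integer solutions of R*r + G*g + B*b = TOTAL.
candidates = [
    (r, g, b)
    for r, g, b in product(
        range(TOTAL // R + 1), range(TOTAL // G + 1), range(TOTAL // B + 1)
    )
    if R * r + G * g + B * b == TOTAL
]

best = min(candidates, key=lambda c: cost(c, likely, C))
print(best, cost(best, likely, C))  # prints: (3, 2, 8) 11
```

Under these assumptions the lowest-scoring candidate is the one that adds a single red and leaves the other quantities untouched.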

Jens Schauder
  • Usually, a fitness function indicates a better solution with a higher value. When the lower values are more desirable, you might want to refer to it as a cost function. Generally useful advice, otherwise, so I'm upvoting. – Michael J. Barber Jun 07 '11 at 06:43
  • Thank you Jens..This is the kind of solution I was looking for(I was just explaining it horribly). I did a bit of introductory reading of fitness functions and it seems really good. Are there other approaches to these types of problems? – Lostsoul Jun 07 '11 at 13:03
  • The number of possible fitness/cost functions is unlimited. Since you seem to be working with color, there might be something like a perceived distance between two colors which might be useful for you. When your data volume and the number of possible solutions become huge, you might want to look into optimization algorithms that are smarter than just sorting the options and picking the first, possibly using algorithms that find a good but not always the best solution. – Jens Schauder Jun 07 '11 at 13:42
  • Thanks Jens! Do you know what this category of optimization algorithms is called? After your post I read an intro to genetic algorithms, and I think it's in the realm of what I need. Is there anything else I could use to help me work on this project? – Lostsoul Jun 08 '11 at 19:49
  • Awesome thank you so much Jens. You have pointed me towards a field that I didn't even know exists and given me a new reference point as which to think of solving my problems. Thanks a million! – Lostsoul Jun 09 '11 at 13:08

The problem is called a linear Diophantine equation. You can find further information here.
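One standard fact about such equations is that integer solutions exist exactly when the greatest common divisor of the coefficients divides the total. A quick sketch of that check in Python, where the function name `has_integer_solution` is just illustrative and the coefficients are assumed to be non-zero integers:

```python
from functools import reduce
from math import gcd


def has_integer_solution(coefficients, total):
    """True if a1*x1 + ... + an*xn = total has a solution in integers,
    i.e. if gcd(a1, ..., an) divides total."""
    return total % reduce(gcd, coefficients) == 0


print(has_integer_solution([20, 5, 10], 100))  # prints: True
print(has_integer_solution([20, 5, 10], 103))  # prints: False
```

Note that this test allows negative quantities, so it is only a necessary condition when the quantities must also be non-negative.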

Klas Lindbäck
  • Thank you Klas..I understand that, the issue is more in how to select the more realistic results based on a history of data. – Lostsoul Jun 07 '11 at 13:04