
I've been using the Tagging API to tag my items in order to allow Item-Item 'similarity' scores to be calculated, so: Item 1 gets tagged with {UK, MALE, 50}, Item 2 with {FRANCE, MALE, 22}, that kind of thing. That's been working fine.

What I'd like to do is represent item-item 'relationships', so if my application says that 1 is a parent of 2 (and just to make things a little more complex, this is multi-level), I'd like to be able to tell Myrrix to pull those two items a little closer together.

My first solution was to give each item a 'PARENT_[name]' tag for its own name and, for each ancestor it has, a 'PARENT_[parentname]' tag with a lower weight the further up the hierarchy that ancestor sits. That did succeed in pulling parents and children closer together.
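Roughly, the tag generation looks like this (a quick Python sketch; the function name and the 0.5 decay factor are illustrative rather than tuned values):

    def parent_tags(item_name, ancestors, base_weight=1.0, decay=0.5):
        """Build (tag, weight) pairs for an item: a PARENT_ tag for its own
        name at full weight, then one PARENT_ tag per ancestor, weighted
        lower for each level up the hierarchy."""
        tags = [("PARENT_" + item_name, base_weight)]
        weight = base_weight
        for ancestor in ancestors:  # ordered from immediate parent up to the root
            weight *= decay
            tags.append(("PARENT_" + ancestor, weight))
        return tags

    # e.g. Item 2, whose parent is Item 1, which in turn sits under a root item
    print(parent_tags("2", ["1", "ROOT"]))
    # [('PARENT_2', 1.0), ('PARENT_1', 0.5), ('PARENT_ROOT', 0.25)]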

Unfortunately the overall quality of suggestions seemed to fall a little, and the results became increasingly variable: re-running the import gives what look like completely random results. Is this something that can be fixed at the features / lambda level?

I'm still not really clear what 'features' represents, but my suspicion is that, having massively increased the number of possible tags, I need to configure the model very differently...

Andrew Regan

1 Answer


That's the right way to think about it. It's overloading the API a fair bit, but still principled.

It may or may not actually help the results. It kind of depends on whether users who like A will also like B because they have a common product family. Maybe for music; unlikely for things you buy once like a toaster.

Variability comes from the random starting point. You will get different models each time. If the difference is significant when you start from scratch, then you are likely getting into over-fitting. It may be that your # of features is too high or lambda too low for the data set.
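
One way to quantify that variability is to compare the top-N most-similar lists produced by two models trained from different random starting points; if the lists barely overlap from run to run, the model isn't converging to anything stable. A rough Python sketch, with random matrices standing in for the item-feature matrices of two separate builds:

    import numpy as np

    def top_n_similar(item_features, n=10):
        """For each item (row), indices of its n most cosine-similar items."""
        unit = item_features / np.linalg.norm(item_features, axis=1, keepdims=True)
        sims = unit @ unit.T
        np.fill_diagonal(sims, -np.inf)  # exclude the item itself
        return np.argsort(-sims, axis=1)[:, :n]

    def mean_overlap(run_a, run_b, n=10):
        """Average Jaccard overlap of the top-n lists from two training runs."""
        top_a, top_b = top_n_similar(run_a, n), top_n_similar(run_b, n)
        overlaps = [len(set(a) & set(b)) / len(set(a) | set(b))
                    for a, b in zip(top_a, top_b)]
        return sum(overlaps) / len(overlaps)

    # Stand-ins for 2000 items x 20 features from two independent builds.
    run_a = np.random.rand(2000, 20)
    run_b = np.random.rand(2000, 20)
    print(mean_overlap(run_a, run_b))  # near 0 here; a stable model scores much higher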

You should also run an eval to see whether the scores are good at all. If it's scoring poorly, then it's likely a case of parameters that are well off their best values.
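
An eval can be as simple as a hold-out test: withhold some of the known associations before building the model, then measure how many of them show up near the top of the recommendations or most-similar lists. A minimal sketch of the measurement step, with made-up item IDs:

    def precision_at_k(recommended, held_out, k=10):
        """Fraction of the top-k results that appear in the held-out set."""
        hits = sum(1 for item in recommended[:k] if item in held_out)
        return hits / k

    # Hypothetical top-10 results for one item vs. the associations held back.
    recommended = [55, 901, 12, 53, 77, 480, 3, 61, 250, 8]
    held_out = {53, 55, 61, 99}
    print(precision_at_k(recommended, held_out))  # 0.3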

The idea is that you need not build a new model from scratch every time, though.

Sean Owen
  • I've experimented with quite a few sets of params to ParameterOptimizer, and results generally hover around l=0.3-0.4, f=20-23. While the sim scores between two very close items are high and relatively stable, they get overwhelmed by a lot of completely bogus items which have no tags in common whatsoever but have scores of 0.99999994, 0.99998987, etc. Even with l=0.85, f=5 I still get bogus ones very high up. Is there a point at which I should just accept this as inevitable with only 2K items? – Andrew Regan Sep 23 '13 at 21:16
  • What kinds of scores come out of the optimizer? What ranges are you letting it try? I somehow think this is still off. But those symptoms sound like the opposite, under-fitting, where everything ends up about the same. You might back up and see what happens without the artificial data too. – Sean Owen Sep 23 '13 at 21:40
  • I've just run the PO with: 15 0.8 model.features=2:25 model.als.lambda=0.0001:10, and it's given me the rather familiar l=0.459, f=19. If I remove the parent-handling tags, I'm back to somewhere between l=0.25-0.45, f=20ish, but the sims are much better: for Item 53, closely related Item 55 jumps way up the order: http://pastebin.com/sqas0hSP - which is good, but then I wouldn't be able to do the parental matching in future. – Andrew Regan Sep 23 '13 at 22:00
  • Cast the net wider. 2:25 is probably better as 10:100. – Sean Owen Sep 24 '13 at 06:50
  • I've tried a couple of times with: 30 0.8 model.features=10:100 model.als.lambda=0.0001:2, and keep getting similar values: l=0.47 f=26, l=0.4 f=23, which I don't think will make that much difference. I'll try widening much further... – Andrew Regan Sep 24 '13 at 18:11
  • I tried letting it pick 50 feature values from 10 to 1000, but again ended up with l=0.4ish f=25 – Andrew Regan Sep 24 '13 at 21:33