
Implementing a Kohonen map (SOM) for purely real-valued vectors is relatively straightforward. What I can't work out is how to implement the algorithm for non-numeric attributes such as text strings, because the "weight update" phase has no obvious counterpart for them.

Suppose there is a data set of words that differ in length, class of meaning and, say, degree of being romantic, such as rose (very romantic), flower (romantic), plant (romantic depending on context), factory (romantic only for steampunkers). I'm making that up, so please ignore the details. (Edit: Yes, romantic-ness can be expressed as a scalar value; my question is really not about that part.)

One could shuffle words or even letters to create the prototypes on the map and then use the Levenshtein distance to find the best matching unit (BMU); I see that. But how would one update the BMU and its neighborhood towards the selected target vector?
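The BMU search described above can be sketched in a few lines of Python. This is only an illustration under assumptions: the function names `levenshtein` and `best_matching_unit` are hypothetical, the prototypes are held in a flat list rather than a 2-D grid, and every edit operation costs 1.

```python
def levenshtein(a, b):
    """Edit distance between strings a and b (insert, delete, substitute; cost 1 each)."""
    # prev[j] is the distance between the already processed prefix of a and b[:j].
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # delete ca
                curr[j - 1] + 1,           # insert cb
                prev[j - 1] + (ca != cb),  # substitute (free if the characters match)
            ))
        prev = curr
    return prev[-1]


def best_matching_unit(prototypes, sample):
    """Index of the prototype string closest to sample; ties go to the lowest index."""
    return min(range(len(prototypes)),
               key=lambda k: levenshtein(prototypes[k], sample))
```

This covers only the matching step. It says nothing yet about how to move a string prototype towards the sample, which is the open part of the question.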

Other examples might be paintings (e.g. by color, theme, epoch, ...) or perceived shapes (e.g. triangle, sawtooth, ...) embedded in one-dimensional (scalar) data streams.

sunside
  • Coming back to this question now, I assume the solution would be to obtain vector space embeddings of the qualitative predictors in question (for words, `word2vec` comes to mind) and then cluster based on those. – sunside Jan 15 '17 at 14:00

1 Answer


Wouldn't those degrees of romantic-ness just be a number? "This rose is 0.9 romantic". Then find the right spot for your 0.9 in the SOM, and that is where your rose should sit. If you have multiple dimensions, you get a vector, but it is still a vector of numbers, not a string, and thus much easier to update.
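For the numeric case, the update is the usual "move each weight a fraction of the way towards the sample" rule. Below is a minimal sketch for a 1-D map of scalar weights; the function name `som_step`, the Gaussian neighbourhood and the default parameter values are assumptions made for illustration, not something prescribed by the question.

```python
import math


def som_step(weights, x, learning_rate=0.5, sigma=1.0):
    """One SOM update on a 1-D map of scalar weights; returns the new weight list."""
    # Best matching unit: the node whose weight is closest to the sample x.
    bmu = min(range(len(weights)), key=lambda i: abs(weights[i] - x))
    updated = []
    for i, w in enumerate(weights):
        # Gaussian neighbourhood: nodes close to the BMU on the map move more.
        h = math.exp(-((i - bmu) ** 2) / (2 * sigma ** 2))
        # Move the weight a fraction of the way towards the sample.
        updated.append(w + learning_rate * h * (x - w))
    return updated
```

With weights `[0.0, 0.5, 1.0]` and the sample `0.9`, the last node is the BMU and moves from 1.0 to about 0.95, while its neighbours are pulled towards 0.9 by smaller fractions of their distance to it. The step `w + learning_rate * h * (x - w)` is exactly the part that has no direct equivalent for strings.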

Nicolas78
  • That's obviously true, but my question is about mapping those non-scalar parts, i.e. mapping the words themselves (not just named samples, that is). – sunside Apr 20 '14 at 23:15
  • So basically you're looking for a way to move a "weight" that says "BLA" towards "BLUB" with an update rate of, say, 0.05, so that the result would be 0.95*BLA + 0.05*BLUB, and you're asking what such a string might look like? – Nicolas78 Apr 21 '14 at 11:24
  • In essence, yes. I know this is a strange question, but I couldn't find much (if any) information on how to handle such cases. Are there standard transformations that can be used... you know, that kind of solution. – sunside Apr 21 '14 at 11:28
  • Yeah, I see. Nothing comes immediately to mind, but the general topic you're interested in is called "machine learning on structured data". In particular, you might look for "string kernels", even though they're a similarity measure and don't necessarily allow for interpolation. – Nicolas78 Apr 21 '14 at 11:57
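One conceivable reading of "0.95*BLA + 0.05*BLUB" is to apply only a fraction of the edit operations that turn the weight string into the target. The sketch below is an assumption for illustration, not a standard method named in this thread: `align` and `move_towards` are hypothetical helpers, edits are applied left to right, and the number of edits is simply rounded.

```python
def align(a, b):
    """One optimal Levenshtein alignment as (char_from_a, char_from_b) pairs.

    None marks a gap: (c, None) is a deletion, (None, c) an insertion.
    """
    n, m = len(a), len(b)
    # dist[i][j] is the edit distance between a[:i] and b[:j].
    dist = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        dist[i][0] = i
    for j in range(m + 1):
        dist[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            dist[i][j] = min(
                dist[i - 1][j] + 1,
                dist[i][j - 1] + 1,
                dist[i - 1][j - 1] + (a[i - 1] != b[j - 1]),
            )
    # Backtrace from the bottom-right corner to recover the alignment.
    pairs = []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and dist[i][j] == dist[i - 1][j - 1] + (a[i - 1] != b[j - 1]):
            pairs.append((a[i - 1], b[j - 1]))  # match or substitution
            i, j = i - 1, j - 1
        elif i > 0 and dist[i][j] == dist[i - 1][j] + 1:
            pairs.append((a[i - 1], None))      # deletion
            i -= 1
        else:
            pairs.append((None, b[j - 1]))      # insertion
            j -= 1
    pairs.reverse()
    return pairs


def move_towards(weight, target, rate):
    """Apply roughly `rate` of the edits that turn `weight` into `target`."""
    pairs = align(weight, target)
    n_edits = sum(1 for x, y in pairs if x != y)
    budget = round(rate * n_edits)  # how many edits to apply, left to right
    out = []
    for x, y in pairs:
        if x != y and budget > 0:
            budget -= 1
            chosen = y  # take the target's side of this edit
        else:
            chosen = x  # keep the weight's side (or the shared character)
        if chosen is not None:
            out.append(chosen)
    return "".join(out)
```

Here `move_towards("BLA", "BLUB", 0.5)` yields `"BLUA"` and a rate of 1.0 yields `"BLUB"`, but a rate of 0.05 leaves `"BLA"` unchanged because the two available edits round down to zero. Strings only change in whole steps, so small learning rates would need some other treatment, for example applying each edit with probability equal to the rate.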