There is an example of creating Mahout Vector objects from text. It says:
Before creating the vectors, you need to convert the documents to SequenceFile format. SequenceFile is a Hadoop class which allows us to write arbitrary key/value pairs into it. The DocumentVectorizer requires the key to be a Text with a unique document id, and value to be the Text content in UTF-8 format.
That's somewhat clear, since I know what a SequenceFile is. However, for all the Mahout algorithms (clustering, classification, ...) the content is actually a bag of words (or n-grams). Is the value treated as a string of space-separated tokens?
More importantly, I actually want to cluster something that is not text. Suppose, for example, I had movie ratings from users, one per line, in a space-separated format (user, movie, rating):
user1 movie_11 5
user1 movie_12 4
..
user2 movie_21 1
user2 movie_22 5
..
Suppose I want to cluster movies. I could treat a user like a "document" (a grouping of movies) and a movie like a "word." How would I get these ratings into a vector file? I could convert the data to ARFF (I'm not sure exactly how yet) and use Mahout's arff.vector. Is there a simpler utility that just takes document-to-word associations (or counts) and makes vectors?
It would be convenient not to have to write, say, 100 million ratings to disk as ARFF only to convert them to sequence files and then to vectors.
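To make the shape I'm after concrete, here is a minimal sketch in plain Python. It makes no Mahout or Hadoop calls, and the function name `ratings_to_sparse_vectors` and the dictionary layout are my own assumptions, not anything from Mahout. It groups the rating lines by user (the "document") and assigns each movie (the "word") a column index, which is roughly what I'd expect a document-to-word vectorizer to do:

```python
from collections import defaultdict

def ratings_to_sparse_vectors(lines):
    """Group 'user movie rating' lines into one sparse vector per user.

    Hypothetical helper for illustration only; not part of Mahout.
    """
    movie_index = {}             # movie id -> column index (the "dictionary")
    vectors = defaultdict(dict)  # user id -> {column index: rating}
    for line in lines:
        parts = line.split()
        if len(parts) != 3:
            continue             # skip blank or malformed lines
        user, movie, rating = parts
        # Assign the next free column the first time a movie is seen.
        col = movie_index.setdefault(movie, len(movie_index))
        vectors[user][col] = float(rating)
    return movie_index, dict(vectors)

sample = [
    "user1 movie_11 5",
    "user1 movie_12 4",
    "user2 movie_21 1",
    "user2 movie_22 5",
]
movie_index, vectors = ratings_to_sparse_vectors(sample)
for user, vec in vectors.items():
    print(user, vec)
```

What I'm missing is the step that turns each of these per-user sparse rows into a Mahout vector keyed by the user id, without going through ARFF first.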