Can anybody suggest exactly what type of input Mahout expects for running LDA? It would be helpful if someone could describe in detail how to run it and get the output.

Subhradip Bose

1 Answer

The documentation on the newest form of LDA in Mahout is a little sparse. It's called 'cvb' now. The input can be a directory of text files or anything else (a Lucene index, for instance) that you can convert into Mahout's vector format. The output is the number of topics you asked for, each represented by its keywords in vector form (see the example below).

I worked through an example yesterday, so I'm pasting the commands below in an effort to be useful. The example uses the Reuters dataset, which can be loaded using the commands (possibly outdated) found in http://svn.apache.org/repos/asf/mahout/tags/mahout-0.4/examples/bin/build-reuters.sh

(For example, the input text files go in $basedir/work/reuters-out/ in the commands below.)
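As a rough sketch of what such an input directory could look like, assuming plain UTF-8 text files with one document per file: the snippet below builds a tiny toy corpus in a temporary directory. The file names and contents are made up for illustration and are not part of the Reuters dataset.

```shell
# Hypothetical toy corpus: one plain-text document per file.
# A temporary directory is used here; in the walkthrough the files
# would live in $basedir/work/reuters-out/ instead.
corpus_dir=$(mktemp -d)/reuters-out
mkdir -p "$corpus_dir"

# Made-up sample documents (illustration only).
printf 'Oil prices rose as the bank cut rates.\n' > "$corpus_dir/doc1.txt"
printf 'The company reported net profit of 5 mln dlrs.\n' > "$corpus_dir/doc2.txt"

# Count the files in the input directory.
doc_count=$(find "$corpus_dir" -type f | wc -l | tr -d ' ')
echo "documents: $doc_count"
```

Pointing the -i option of seqdirectory at a directory like this is the idea; the real dataset just has many more files.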

# set up some directories (run this from the directory that contains work/)

basedir=$(pwd)
workdir=$basedir/work

# convert text to SequenceFile format

mahout seqdirectory \
  -i $workdir/reuters-out/ \
  -o $workdir/reuters-out-seqdir -c UTF-8 -chunk 5

# make sparse vectors

mahout seq2sparse \
  -i $workdir/reuters-out-seqdir/ \
  -o $workdir/reuters-out-seqdir-sparse-lda -ow --maxDFPercent 85 --namedVector

# use rowid to convert the sparse vectors to the form cvb needs (i.e., change the Text key to an Integer key)

mahout rowid \
  -i $workdir/reuters-out-seqdir-sparse-lda/tfidf-vectors \
  -o $workdir/reuters-out-matrix

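# run (or rerun) LDA locally with cvb0_local, following how cvb is run in Mahout 0.8
# -a: alpha (smoothing), -top: number of topics, -do / -to: document and topic output paths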

rm -rf $workdir/reuters-ldalocal $workdir/reuters-ldalocal-topics
mahout cvb0_local \
  -i $workdir/reuters-out-matrix/matrix \
  -d $workdir/reuters-out-seqdir-sparse-lda/dictionary.file-* \
  -a 0.5 \
  -top 4 \
  -do $workdir/reuters-ldalocal \
  -to $workdir/reuters-ldalocal-topics

# Inspect the output by showing the top 10 words of each topic:

mahout vectordump \
    -i $workdir/reuters-ldalocal-topics \
    --dictionary $workdir/reuters-out-seqdir-sparse-lda/dictionary.file-* \
    --dictionaryType sequencefile \
    --vectorSize 10 \
    -sort $workdir/reuters-ldalocal-topics

The output looks like this:

{said:12099.546951505947,its:10566.985916212521,year:8333.832279174481,dlrs:6810.206141819796,would:6721.746234281428,been:5329.6753421933945,pct:5313.369659313288,billion:5248.896294419074,from:5158.844069513761,he:4764.16474083869}
{mln:11816.704457054004,cts:7169.159831834528,mar:7081.733955520149,vs:6891.4237560938955,new:6560.720833985039,has:6543.337854529879,1986:6043.850306111383,company:5720.025984843189,pct:5711.399291651732,last:5683.42288907518}
{inc:9704.372248376018,mln:9278.314888220315,said:8562.15377124544,net:7827.149394593728,vs:7736.055883103908,dlrs:7057.160090724306,cts:6177.1590584797605,market:5936.459595191674,exchange:5371.911394611647,co:5314.4250562522}
{said:12514.11646492775,u.s:9207.239974183465,from:7679.363044582878,mar:6588.0987950965,bank:6491.528794438723,pct:6100.417335098452,has:5352.990453581582,dlrs:5091.309618540722,about:4886.923813272583,13:4695.692587191373}
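To make the topics easier to read, the weights can be stripped so that only the keywords remain. This is a minimal sketch using standard tools (tr, sed, paste), assuming each topic is printed on one line in the {word:weight,...} form shown above. The sample line is a shortened copy of the first topic, and the variable names are only for illustration.

```shell
# Shortened copy of the first topic line from the output above.
topic_line='{said:12099.546951505947,its:10566.985916212521,year:8333.832279174481}'

# Drop the braces, put one word:weight pair per line,
# remove the ":weight" suffix, then join the words with spaces.
words=$(printf '%s\n' "$topic_line" \
  | tr -d '{}' \
  | tr ',' '\n' \
  | sed 's/:[^:]*$//' \
  | paste -sd' ' -)

echo "$words"
```

Since vectordump was run with -sort, the words stay in descending order of weight, so the first few words of each line give a quick label for the topic.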

Disclaimer: this was paraphrased from my notes, so there could be small bugs in it. Good luck!

Ziggy Eunicien