
I am trying to learn the Torch library for machine learning.

I know that the focus of Torch is neural networks, but just for the sake of it I was trying to run k-means on it. If nothing else, Torch implements fast contiguous storage, which should be analogous to numpy arrays, and the Torch cheatsheet cites the unsup library for unsupervised learning, so why not?

I already have a benchmark that I use for K-means implementations. Even though all the implementations there intentionally use an unoptimized algorithm (the README explains why), LuaJIT is able to cluster 100000 points in 611 ms. An optimized (or shall I say, not intentionally slowed down) implementation in Nim (not in the repository) runs in 68 ms, so I was expecting something in between.

Unfortunately, things are much worse, so I suspect I am doing something awfully wrong. What I have written is

```lua
require 'io'
cjson = require 'cjson'
require 'torch'
require 'unsup'

content = io.open("points.json"):read("*a")
data = cjson.decode(content)
points = torch.Tensor(data)
timer = torch.Timer()
centroids, counts = unsup.kmeans(points, 10, 15)

print(string.format('Time required: %f s', timer:time().real))
```

and the running time is around 6 seconds!

Can anyone check if I have done something wrong in using Torch/unsup?

If anyone wants to try it, the file points.json is in the above repository.

Andrea

1 Answer


Can anyone check if I have done something wrong in using Torch/unsup?

Everything sounds correct (note: using local variables is recommended):

  • data is a 2-dimensional table, and you use the corresponding Torch constructor,
  • points is a 2-dimensional tensor with number of rows = number of points and number of columns = point dimension (2 here). This is what unsup.kmeans expects as input.
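As a quick sanity check of that input layout, here is a minimal sketch with hand-built data standing in for points.json:

```lua
require 'torch'

-- a 2-dimensional Lua table becomes an n x d tensor
local data = {{0.5, 1.0}, {1.5, 2.0}, {2.5, 3.0}}
local points = torch.Tensor(data)

print(points:size(1))  -- number of points: 3
print(points:size(2))  -- point dimension: 2
```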

What you can do is change the batch size (4th argument). It may impact the performance. You can also use the verbose mode that will output the average time per iteration:

```lua
-- batch size = 5000, no callback, verbose mode
centroids, counts = unsup.kmeans(points, 10, 15, 5000, nil, true)
```
deltheil
  • If this is the case, it sounds worrying. An [unoptimized implementation](https://github.com/andreaferretti/kmeans/blob/master/lua/kmeans.lua) using pure Lua tables runs in 1/10th the time. I will try changing the batch size tomorrow at work, but it seems unsup is doing something wrong – Andrea Mar 22 '15 at 13:12
  • That's indeed a huge difference. I believe speed was not the initial focus. Some loops might be optimized, like [this one](https://github.com/koraykv/unsup/blob/605f777/kmeans.lua#L70-L73) (by replacing it with a single matrix-matrix multiplication) or [this one](https://github.com/koraykv/unsup/blob/605f777/kmeans.lua#L79-L81) (by directly manipulating the [raw data](https://github.com/torch/torch7/blob/master/doc/tensor.md#result-datatensor-asnumber)). – deltheil Mar 22 '15 at 18:24
  • @Andrea any update on whether unsup is still/overall slower than an implementation in Lua? – Razi Shaban Dec 18 '15 at 17:22
  • 2
    @RaziShaban It seems it is doing a different thing. See the discussion in this issue: https://github.com/koraykv/unsup/issues/27 – Andrea Dec 20 '15 at 18:55
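For reference, the single matrix-multiplication optimization mentioned in the comments can be sketched as follows. This is only a hedged sketch, assuming Torch7's tensor API, with small hand-built `points` (n x d) and `centroids` (k x d) tensors; it uses the identity ||x - c||^2 = ||x||^2 - 2 x·c + ||c||^2 to compute all point-to-centroid distances at once instead of looping over centroids:

```lua
require 'torch'

-- sample data: 3 points and 2 centroids in 2 dimensions
local points    = torch.Tensor{{0, 0}, {1, 1}, {2, 2}}  -- n x d
local centroids = torch.Tensor{{0, 0}, {2, 2}}          -- k x d

local x2 = torch.sum(torch.pow(points, 2), 2)           -- n x 1: ||x||^2
local c2 = torch.sum(torch.pow(centroids, 2), 2)        -- k x 1: ||c||^2

-- n x k matrix of squared distances via one matrix multiplication
local dist = torch.mm(points, centroids:t()):mul(-2)    -- -2 x·c
dist:add(x2:expandAs(dist)):add(c2:t():expandAs(dist))

-- nearest centroid per point
local mins, labels = torch.min(dist, 2)
```

The same trick is the standard way to vectorize the assignment step of k-means on top of a BLAS-backed matrix multiply.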