Clustering and distance calculation in Julia

Question

I have a collection of n coordinate points of the form (x,y,z). These are stored in an n x 3 matrix M.

Is there a built-in function in Julia to calculate the distance between each point and every other point? I'm working with a small number of points, so calculation time isn't too important.

My overall goal is to run a clustering algorithm, so if there is a clustering algorithm that I can look at that doesn't require me to first calculate these distances please suggest that too. An example of the data I would like to perform clustering on is below. Obviously I'd only need to do this for the z coordinate.

There are several different clustering algorithms. What kind of clustering do you want to run? — niczky12, Apr 12 '16 at 07:39
I have a data set giving the (x,y,z) coordinates of two separate hanging electricity cables. They differ along the z axis (height) only, so I'd like to cluster based on the z coordinates. However, clustering that cuts the clusters with a straight line doesn't work, since the lowest point of the upper catenary can be lower than the highest point of the lower catenary. I am currently splitting the catenary into little pieces for which straight-line clustering works, but this is not a very neat solution. — lara, Apr 14 '16 at 03:47

niczky12 · Accepted Answer · 2016-04-12T08:45:17.673

To calculate distances, use the Distances package.

Given a matrix X, you can calculate pairwise distances between its columns. This means that your input points (your n objects) must be the columns of the matrix. (Your question mentions an n x 3 matrix, so you would have to transpose it first with the transpose() function.)

Here is an example of how to use it:

>using Distances  # install with Pkg.add("Distances")

>x = rand(3,2)

3x2 Array{Float64,2}:
 0.27436   0.589142
 0.234363  0.728687
 0.265896  0.455243

>pairwise(Euclidean(), x, x)

2x2 Array{Float64,2}:
 0.0       0.615871
 0.615871  0.0

As you can see, the above returns the distance matrix between the columns of x. You can use other distance metrics if you need to; just check the docs for the package.
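
If you would rather keep the n x 3 layout from the question and avoid a package altogether, the same distance matrix can be written by hand. The following is only a minimal sketch using the LinearAlgebra standard library; row_distances is a hypothetical helper name, not something Distances provides, and the three sample points are made up for illustration. (As far as I know, newer Distances releases also accept a dims keyword on pairwise to choose between rows and columns, but check the docs for your installed version.)

```julia
using LinearAlgebra  # standard library, provides norm

# Hypothetical helper (not part of Distances): Euclidean distance
# between every pair of rows of an n x 3 matrix M.
function row_distances(M::AbstractMatrix{<:Real})
    n = size(M, 1)
    D = zeros(n, n)                 # diagonal stays 0: distance of a point to itself
    for i in 1:n, j in (i + 1):n
        d = norm(M[i, :] .- M[j, :])
        D[i, j] = d
        D[j, i] = d                 # the distance matrix is symmetric
    end
    return D
end

# Made-up sample points, one (x, y, z) point per row.
M = [0.0 0.0  0.0;
     3.0 4.0  0.0;
     0.0 0.0 12.0]

D = row_distances(M)
```

Here D[i, j] holds the distance between points i and j. The result is an n x n matrix, so memory grows quadratically with the number of points.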

Thanks. Now when I try this on another problem with more data, I get an out-of-memory error. Any idea how a distance matrix can be calculated on a huge data set? — lara, Apr 13 '16 at 22:06

Imanol Luengo · Answer 2 · 2016-04-12T15:15:50.193

To complement @niczky12's answer: there is a Julia package called Clustering which, as the name says, lets you perform clustering.

A sample kmeans run:

>>> using Clustering         # Pkg.add("Clustering") if not installed

>>> X = rand(3, 100)         # data, each column is a sample
>>> k = 10                   # number of clusters

>>> r = kmeans(X, k)
>>> fieldnames(r)
8-element Array{Symbol,1}:
:centers    
:assignments
:costs      
:counts     
:cweights   
:totalcost  
:iterations 
:converged

The value returned by kmeans (r) holds the result in the fields listed above. The two that are probably most interesting: r.centers contains the centers found by the kmeans algorithm, and r.assignments contains the cluster to which each of the 100 samples belongs.
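
To make the meaning of those two fields concrete, here is a bare-bones sketch of the usual k-means (Lloyd's) iteration in plain Julia. It is not the Clustering package's implementation: simple_kmeans, its init argument (the column indices used as starting centers) and the sample data are all hypothetical, chosen only to keep the example deterministic.

```julia
using Statistics  # standard library, provides mean

# Hypothetical helper, NOT the Clustering package's kmeans.
# X is d x n (one sample per column); init lists the columns used as starting centers.
function simple_kmeans(X::AbstractMatrix{<:Real}, init::AbstractVector{<:Integer};
                       maxiter::Integer = 100)
    n = size(X, 2)
    centers = float(X[:, init])     # d x k matrix, one center per column
    k = size(centers, 2)
    assignments = zeros(Int, n)     # cluster index of each sample
    for _ in 1:maxiter
        # Assignment step: each sample goes to its nearest center.
        new_assignments = [argmin([sum(abs2, X[:, j] .- centers[:, c]) for c in 1:k])
                           for j in 1:n]
        new_assignments == assignments && break   # nothing changed: converged
        assignments = new_assignments
        # Update step: each center becomes the mean of the samples assigned to it.
        for c in 1:k
            members = findall(==(c), assignments)
            isempty(members) || (centers[:, c] = vec(mean(X[:, members], dims = 2)))
        end
    end
    return centers, assignments
end

# Made-up data: four 3-dimensional samples forming two obvious groups.
X = [0.0 0.1 5.0 5.1;
     0.0 0.0 0.0 0.0;
     1.0 1.2 9.0 9.2]

centers, assignments = simple_kmeans(X, [1, 3])
```

In this sketch, centers plays the role of r.centers (one column per cluster) and assignments the role of r.assignments (one entry per sample).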

There are several other clustering methods in the same package. Feel free to dive into the documentation and apply the one that best suits your needs.

In your case, since your data is an N x 3 matrix, you only need to transpose it:

M = rand(100, 3)
kmeans(M', k)
