
I am using the ELKI MiniGUI to cluster about 1300 GPS data points with DBSCAN and OPTICS. As the input file for dbc.in I am using a CSV file with only two columns (X, Y). The problem is that my X, Y values (in projected coordinates) are precise up to 6 decimal places, but after running the clustering algorithm the output has lower precision (up to 3 decimal places). How can I increase the precision of the output points?

Also, when it generates the clusters, it automatically assigns internal IDs that do not correspond to my actual point IDs (ID, X, Y). The ID is not part of the input CSV, which comprises only the two columns (X, Y).

Has QUIT--Anony-Mousse
user26161
  • Can you share an example input and output lines? ELKI assigns internal IDs, but you can just discard them if you don't need them. – Has QUIT--Anony-Mousse Feb 22 '14 at 15:41
  • Here is a detailed explanation of my problem. Input file format (X Y): `3456.124357 5673.4567`, `3456.109453 5673.4451`, ... Output file (with an internal ID, and X, Y truncated): `651 3456.1244 5673.46`, `652 3456.1095 5673.45`. Since the values are truncated and the output file does not contain the actual IDs of the points (say, starting from 0), I cannot identify which points are clustered or which point belongs to which cluster. – user26161 Feb 23 '14 at 01:19
  • Can you edit the question to make that more readable? Avoid censoring the data unless absolutely necessary. Use the `FixedDBIDsFilter` to get `DBIDs` that correspond to line number of your input file when using the MiniGUI. Have you considered writing a custom output module for your use case? – Has QUIT--Anony-Mousse Feb 23 '14 at 10:42
  • Thank you, Anony-Mousse. The ID problem is solved by using `FixedDBIDsFilter`. Can you tell me how to increase the number of decimal places (precision) of the output clustered/noise points? I want them to match the input exactly. – user26161 Feb 23 '14 at 15:17
  • Floating point is lossy, and exact formatting of these numbers varies from language to language. AFAICT, ELKI just uses Java formatting. There is no option to say "write with exactly as many digits as the input was". This would require storing the original data as strings. – Has QUIT--Anony-Mousse Feb 23 '14 at 16:05
  • Thanks. Can anyone tell me how to label the X-axis and Y-axis in the visualization plots? By default they show column 0 and column 1; I want them to be X and Y. – user26161 Feb 24 '14 at 02:15

1 Answer


ELKI relies on double for representing numbers. If you need a higher precision, you will have to implement your own parser and output modules (it's easy though, as we have a highly modular architecture).

Default output serialization to text is handled by Java. Precision is therefore what you get from Java by default. This should be 15-16 digits of precision, if you are using DoubleVector, and 7-8 digits if you are using FloatVector.

A quick check with groovysh:

new DoubleVector([12345.678901234567890, 3456.109453] as double[]);
===> 12345.678901234567 3456.109453
new FloatVector([12345.678901234567890, 3456.109453] as float[]);
===> 12345.679 3456.1094

yields only the loss to be expected from double and float precision.
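The same behaviour can be reproduced without ELKI. The following is a minimal sketch using only the JDK; the class name `PrecisionCheck` and its helper methods are made up for illustration and are not part of ELKI. It parses the sample values from the question and prints them back both as `double` and as `float`.

```java
// Sketch: how Java's default number formatting treats double vs. float.
// Uses only the JDK; no ELKI classes are involved.
final class PrecisionCheck {

    /** Parses the text as a double and prints it with Java's default formatting. */
    static String asDouble(String text) {
        return Double.toString(Double.parseDouble(text));
    }

    /** Parses the text as a float (fewer significant digits) and prints it. */
    static String asFloat(String text) {
        return Float.toString(Float.parseFloat(text));
    }

    /** True if printing the parsed double and parsing it again gives the same number. */
    static boolean roundTripsAsDouble(String text) {
        double parsed = Double.parseDouble(text);
        return Double.parseDouble(Double.toString(parsed)) == parsed;
    }

    public static void main(String[] args) {
        // Sample coordinates taken from the question's comment.
        String[] samples = {"3456.124357", "5673.4567", "3456.109453"};
        for (String s : samples) {
            System.out.println(s
                + " -> double: " + asDouble(s)
                + ", float: " + asFloat(s)
                + ", double round trip: " + roundTripsAsDouble(s));
        }
    }
}
```

For coordinates with six decimal places, the double column should reproduce the input, while the float column already loses digits.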

The best way to get row labels is to... add row labels to your data.

Regarding your add-on question in the comments: the default parser will treat a text row at the beginning of your file as column labels. So just put "X Y" into the first line of your file.

A reasonable input format will therefore be:

X Y Label
1 2 Point7
3 4 "Point 8"

The following are not-so-good ideas:

5 6 123shouldwork
7 8 don't do this: 3 parser will retain the 3

The label should be non-numeric, so that the parser treats it as a label automatically. Otherwise, you have to set the appropriate parameter.
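If your existing file has only the two coordinate columns, a small preprocessing step can add the header row and a label column. This is only a sketch under assumptions: the class `LabeledInput`, the method `withLabels`, and the label prefix `P` are hypothetical names, and the input is assumed to be comma- or whitespace-separated with exactly two columns per row. The coordinates are copied as text, so their digits are not touched.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Sketch: turn plain "X,Y" rows into "X Y Label" rows with a header line.
final class LabeledInput {

    /**
     * Builds the lines of a labeled input file.
     * The prefix should start with a letter so that labels are clearly non-numeric.
     */
    static List<String> withLabels(List<String> xyRows, String prefix) {
        List<String> out = new ArrayList<>();
        out.add("X Y Label"); // header row, used as column labels
        for (int i = 0; i < xyRows.size(); i++) {
            // Accept commas or whitespace as column separators.
            String[] cols = xyRows.get(i).trim().split("[,\\s]+");
            if (cols.length != 2) {
                throw new IllegalArgumentException("Expected 2 columns in row " + i);
            }
            // Label encodes the 0-based row number, e.g. P0, P1, ...
            out.add(cols[0] + " " + cols[1] + " " + prefix + i);
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> rows = Arrays.asList("3456.124357,5673.4567", "3456.109453,5673.4451");
        for (String line : withLabels(rows, "P")) {
            System.out.println(line);
        }
    }
}
```

With labels like these in the data, each output point can be traced back to its input row by its label rather than by an internal ID.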

DBIDs are meant for internal handling. Maybe we should not write them to the output at all. `FixedDBIDsFilter` is a hackish work-around; it is meant to be used to get reproducible hashing when using algorithms that need ID-based hashing and doing multiple runs in the MiniGUI, because on multiple runs DBIDs will be continuously enumerated.
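If you do rely on the written IDs anyway, you can use them to look up the original input row, which also gives you the coordinates exactly as you typed them. The sketch below is not ELKI code, and it rests on assumptions you should check against your own output: that the first token of each output line is the integer ID (optionally written as `ID=<number>`), and that IDs were assigned consecutively in input order starting at a known value. `OriginalRowLookup`, `originalRow`, and `firstId` are hypothetical names.

```java
import java.util.Arrays;
import java.util.List;

// Sketch: map an output line back to the input row it came from.
final class OriginalRowLookup {

    /**
     * Returns the original input row for an output line.
     * firstId is the ID given to the first input row (an assumption to verify).
     */
    static String originalRow(String outputLine, List<String> inputRows, int firstId) {
        String first = outputLine.trim().split("\\s+")[0];
        // Tolerate an "ID=" prefix in front of the number.
        if (first.startsWith("ID=")) {
            first = first.substring(3);
        }
        int index = Integer.parseInt(first) - firstId;
        if (index < 0 || index >= inputRows.size()) {
            throw new IllegalArgumentException("ID out of range: " + first);
        }
        return inputRows.get(index);
    }

    public static void main(String[] args) {
        // Sample rows and IDs taken from the question's comment.
        List<String> input = Arrays.asList("3456.124357 5673.4567", "3456.109453 5673.4451");
        System.out.println(originalRow("651 3456.1244 5673.46", input, 651));
        System.out.println(originalRow("652 3456.1095 5673.45", input, 651));
    }
}
```

Adding a label column to the data remains the more robust option, since it does not depend on how the IDs happen to be numbered.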

Erich Schubert
  • Thank you for the detailed explanation. I have one more question. How can I save the visualizations (plots) in JPEG or another image format? I tried the export option, but the resolution of the image is very poor. – user26161 Feb 26 '14 at 02:23
  • The best export format is SVG. Then you can edit it with Inkscape and, e.g., change fonts, colors, or label placement. But if you choose a pixel format, you can also set the image resolution. (For PDF export, also add the Batik PDF export jars.) Nevertheless, there are better visualization tools around. The visualizations in ELKI are a convenience functionality. – Erich Schubert Feb 26 '14 at 09:44