Using the test cases, I was able to see how ELKI can be used directly from Java, but now I want to read my data from MongoDB and then use ELKI to cluster geographic (longitude, latitude) data.

I can only cluster data from a CSV file using ELKI. Is it possible to connect de.lmu.ifi.dbs.elki.database.Database with MongoDB? I can see from the java debugger that there is a databaseconnection field in de.lmu.ifi.dbs.elki.database.Database.

I query MongoDB, creating a POJO for each row, and now I want to cluster these objects using ELKI.

It is possible to read the data from MongoDB, write it to a CSV file, and then have ELKI read that CSV file, but I would like to know if there is a simpler solution.

---------FINDINGS_1:

From "ELKI - Use List<String> of objects to populate the Database" I found that I need to implement `de.lmu.ifi.dbs.elki.datasource.DatabaseConnection` and specifically override the `loadData()` method, which returns an instance of `MultipleObjectsBundle`.

So I think I should wrap a list of POJOs in a `MultipleObjectsBundle`. Looking at `MultipleObjectsBundle`, it seems the data should be held in columns. Why is the columns datatype `List<List<?>>`? Shouldn't it be `List<?>`, just a list of the items you want to cluster?

I'm a little confused. How is ELKI going to know that it should look at the longitude and latitude of each POJO? Where do I tell ELKI to do this? Using `de.lmu.ifi.dbs.elki.data.type.SimpleTypeInformation`?
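To illustrate the column layout asked about here, below is a minimal sketch of a column-oriented bundle. It is a hypothetical stand-in written for this illustration (`ColumnBundleSketch`, its methods, and the `station-1`/`station-2` IDs are all made up), not ELKI's `MultipleObjectsBundle`. The idea it tries to show: each column is one list with one entry per object, so a bundle holding several columns is a list of lists, and row `i` of every column describes the same object.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for a column-oriented bundle; NOT ELKI's MultipleObjectsBundle.
final class ColumnBundleSketch {
    private final List<String> columnNames = new ArrayList<>();
    // One inner list per column; every inner list has one entry per object.
    private final List<List<?>> columns = new ArrayList<>();

    void appendColumn(String name, List<?> data) {
        if (!columns.isEmpty() && columns.get(0).size() != data.size()) {
            throw new IllegalArgumentException("every column needs exactly one entry per object");
        }
        columnNames.add(name);
        columns.add(data);
    }

    int objectCount() {
        return columns.isEmpty() ? 0 : columns.get(0).size();
    }

    int columnCount() {
        return columns.size();
    }

    // Row = object index, col = column index.
    Object get(int row, int col) {
        return columns.get(col).get(row);
    }

    // Builds a two-object bundle: a coordinate column and an ID column.
    static ColumnBundleSketch demo() {
        List<double[]> coords = new ArrayList<>();
        List<String> ids = new ArrayList<>();
        coords.add(new double[] {-0.197574246, 51.49960695}); // {longitude, latitude}
        ids.add("station-1"); // made-up ID
        coords.add(new double[] {-0.084605692, 51.52128377});
        ids.add("station-2"); // made-up ID
        ColumnBundleSketch bundle = new ColumnBundleSketch();
        bundle.appendColumn("lonlat", coords);
        bundle.appendColumn("externalId", ids);
        return bundle;
    }
}
```

In this sketch the coordinate pair stays together as a single entry of one column, which mirrors the "one vector per object" idea; the second column only carries the identifier for the same row.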

---------FINDINGS_2:

I have tried using `ArrayAdapterDatabaseConnection` and I have tried implementing `DatabaseConnection`. Sorry, I need things explained in very simple terms to understand them.

This is my code for clustering:

    int minPts = 3;
    // With EuclideanDistanceFunction on raw (long, lat) values, eps is measured in degrees.
    double eps = 0.08;
    double[][] data1 = {{-0.197574246, 51.49960695}, {-0.084605692, 51.52128377}, {-0.120973687, 51.53005939}, {-0.156876, 51.49313}, 
            {-0.144228881, 51.51811784}, {-0.1680743, 51.53430039}, {-0.170134484,51.52834133}, { -0.096440751, 51.5073853}, 
            {-0.092754157, 51.50597426}, {-0.122502346, 51.52395143}, {-0.136039674, 51.51991453}, {-0.123616824, 51.52994371}, 
            {-0.127854211, 51.51772703}, {-0.125979294, 51.52635795}, {-0.109006325, 51.5216612}, {-0.12221963, 51.51477076}, {-0.131161087, 51.52505093} };


    //      ArrayAdapterDatabaseConnection dbcon = new ArrayAdapterDatabaseConnection(data1);
    DatabaseConnection dbcon = new MyDBConnection();

    ListParameterization params = new ListParameterization();
    params.addParameter(de.lmu.ifi.dbs.elki.algorithm.clustering.DBSCAN.Parameterizer.MINPTS_ID, minPts);
    params.addParameter(de.lmu.ifi.dbs.elki.algorithm.clustering.DBSCAN.Parameterizer.EPSILON_ID, eps);
    params.addParameter(DBSCAN.DISTANCE_FUNCTION_ID, EuclideanDistanceFunction.class);
    params.addParameter(AbstractDatabase.Parameterizer.DATABASE_CONNECTION_ID, dbcon);
    params.addParameter(AbstractDatabase.Parameterizer.INDEX_ID,
            RStarTreeFactory.class);
    params.addParameter(RStarTreeFactory.Parameterizer.BULK_SPLIT_ID, 
            SortTileRecursiveBulkSplit.class);
    params.addParameter(AbstractPageFileFactory.Parameterizer.PAGE_SIZE_ID, 1000);

    Database db = ClassGenericsUtil.parameterizeOrAbort(StaticArrayDatabase.class, params);
    db.initialize();

    GeneralizedDBSCAN dbscan = ClassGenericsUtil.parameterizeOrAbort(GeneralizedDBSCAN.class, params);

    Relation<DoubleVector> rel = db.getRelation(TypeUtil.DOUBLE_VECTOR_FIELD);
    Relation<ExternalID> relID = db.getRelation(TypeUtil.EXTERNALID);

    DBIDRange ids = (DBIDRange) rel.getDBIDs();
    Clustering<Model> result = dbscan.run(db);  

    int i =0;
    for(Cluster<Model> clu : result.getAllClusters()) {
        System.out.println("#" + i + ": " + clu.getNameAutomatic());
        System.out.println("Size: " + clu.size());

        System.out.print("Objects: ");
        for(DBIDIter it = clu.getIDs().iter(); it.valid(); it.advance()) {
           DoubleVector v = rel.get(it);
           ExternalID exID = relID.get(it);
           System.out.print("DoubleVec: ["+v+"]");
           System.out.print("ExID: ["+exID+"]");

           final int offset = ids.getOffset(it);
           System.out.print(" " + offset);
        }
        System.out.println();
        ++i;
    } 

The `ArrayAdapterDatabaseConnection` produces two clusters. I just had to play around with the value of epsilon: when I set epsilon=0.008, DBSCAN started creating clusters; when I set epsilon=0.04, all the items were in one cluster.
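The sensitivity to epsilon follows from DBSCAN's core-point condition: a point is a core point when at least minPts points (counting itself) lie within distance eps. The following is a plain-Java sketch of only that condition, using Euclidean distance on 2-d points. `CorePointSketch` is a name invented for this illustration and has nothing to do with ELKI's implementation, which also handles cluster expansion and indexing.

```java
// Illustrative only: counts DBSCAN core points with a brute-force neighbor scan.
final class CorePointSketch {
    static int countCorePoints(double[][] points, double eps, int minPts) {
        int corePoints = 0;
        for (double[] p : points) {
            int neighbors = 0;
            for (double[] q : points) {
                double dx = p[0] - q[0];
                double dy = p[1] - q[1];
                // The point itself is at distance 0, so it counts as its own neighbor.
                if (Math.sqrt(dx * dx + dy * dy) <= eps) {
                    neighbors++;
                }
            }
            if (neighbors >= minPts) {
                corePoints++;
            }
        }
        return corePoints;
    }
}
```

If eps is too small, no point reaches minPts neighbors and everything is noise; if it is too large, every point is a core point and the clusters merge into one.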

I have also tried to implement DatabaseConnection:

    @Override
    public MultipleObjectsBundle loadData() {
        MultipleObjectsBundle bundle = new MultipleObjectsBundle();

        List<Station> stations = getStations();
        List<DoubleVector> vecs = new ArrayList<DoubleVector>();
        List<ExternalID> ids = new ArrayList<ExternalID>();

        for (Station s : stations){

            String strID = Integer.toString(s.getId());
            ExternalID i = new ExternalID(strID);
            ids.add(i);

            double[] st = {s.getLongitude(), s.getLatitude()};
            DoubleVector dv = new DoubleVector(st);
            vecs.add(dv);
        }

        SimpleTypeInformation<DoubleVector> type = new VectorFieldTypeInformation<>(DoubleVector.FACTORY, 2, 2, DoubleVector.FACTORY.getDefaultSerializer());

        bundle.appendColumn(type, vecs);
        bundle.appendColumn(TypeUtil.EXTERNALID, ids);
        return bundle;
    }

These long/lat values are associated with an ID, and I need to link each clustered point back to its ID. Is using the ID offset (as in the code above) the only way to do that? I have tried adding an ExternalID column, but I don't know how to retrieve the ExternalID for a particular NumberVector.
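For the offset-based route, one possible approach is to build the coordinate array and an ID array in the same order, so that a row offset indexes both. The sketch below is plain Java and makes several assumptions: `StationSketch` is a hypothetical POJO standing in for the `Station` class (whose real definition is not shown), and `StationTable` is an invented helper. The resulting `double[][]` has the same shape as the array passed to `ArrayAdapterDatabaseConnection` in the clustering code above.

```java
import java.util.List;

// Hypothetical POJO standing in for the question's Station class.
final class StationSketch {
    final int id;
    final double longitude;
    final double latitude;

    StationSketch(int id, double longitude, double latitude) {
        this.id = id;
        this.longitude = longitude;
        this.latitude = latitude;
    }
}

// Invented helper: keeps coordinates and IDs in parallel arrays, in load order.
final class StationTable {
    final double[][] coords;  // row i = {longitude, latitude}
    final int[] stationIds;   // row i = ID of the station stored in coords[i]

    StationTable(List<StationSketch> stations) {
        coords = new double[stations.size()][];
        stationIds = new int[stations.size()];
        for (int i = 0; i < stations.size(); i++) {
            StationSketch s = stations.get(i);
            coords[i] = new double[] {s.longitude, s.latitude};
            stationIds[i] = s.id;
        }
    }

    // Maps a row offset (position in load order) back to the station ID.
    int idAtOffset(int offset) {
        return stationIds[offset];
    }
}
```

This only works as long as the offsets reported by the clustering refer to the same order in which the rows were loaded; an ID column stored alongside the vectors avoids that dependency.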

Also, after seeing "Using ELKI's Distance Function", I tried to use ELKI's longitude/latitude distance function, but it doesn't work, and I could not find any examples of how to use it.
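As background on what a longitude/latitude distance computes, here is a plain-Java sketch of the haversine great-circle distance. It is not ELKI's implementation: the class name `HaversineSketch`, the spherical Earth model, and the mean radius of 6,371,000 m are assumptions made for this illustration. The argument order is (longitude, latitude), matching the vectors in the code above.

```java
// Illustrative haversine distance on a spherical Earth; not ELKI's distance function.
final class HaversineSketch {
    // Mean Earth radius in meters (assumption: simple spherical model).
    static final double EARTH_RADIUS_M = 6_371_000.0;

    // Arguments are in degrees, in (longitude, latitude) order for each point.
    static double distanceMeters(double lon1, double lat1, double lon2, double lat2) {
        double phi1 = Math.toRadians(lat1);
        double phi2 = Math.toRadians(lat2);
        double sinHalfDPhi = Math.sin((phi2 - phi1) / 2);
        double sinHalfDLambda = Math.sin(Math.toRadians(lon2 - lon1) / 2);
        double a = sinHalfDPhi * sinHalfDPhi
                + Math.cos(phi1) * Math.cos(phi2) * sinHalfDLambda * sinHalfDLambda;
        // Clamp to 1.0 to guard against rounding slightly above the asin domain.
        return 2 * EARTH_RADIUS_M * Math.asin(Math.min(1.0, Math.sqrt(a)));
    }
}
```

With a distance like this, epsilon is no longer a number of degrees but a length, so a value such as 0.008 that was tuned for Euclidean distance would need to be chosen again in the units the distance function returns; which units those are should be checked against the documentation of the ELKI version in use.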

MTA
  • I get the feeling that I should be using relations, as in this example http://elki.dbs.ifi.lmu.de/browser/elki/elki/src/main/java/tutorial/javaapi/PassingDataToELKI.java but I don't know how to adapt it to geo points – MTA Oct 28 '15 at 17:59
  • Actually, I think this is what I need: http://elki.dbs.ifi.lmu.de/releases/release0.4.0/doc/de/lmu/ifi/dbs/elki/datasource/DatabaseConnection.html I just need to find out how to use it now. – MTA Oct 29 '15 at 10:54
  • So, to implement DatabaseConnection, I need it to somehow take in long and lat – MTA Oct 29 '15 at 11:42
  • Have you looked at the source of e.g. `ArrayAdapterDatabaseConnection`? – Has QUIT--Anony-Mousse Oct 29 '15 at 21:46
  • @Anony-Mousse I have tried to implement ArrayAdapterDatabaseConnection; please see my edited question. Thank you. – MTA Oct 30 '15 at 21:08
  • I think your column type is incorrect. It doesn't specify 2d. You could add an ID column, if you want. Don't initialize the database twice, and maybe add an index for performance. – Has QUIT--Anony-Mousse Oct 31 '15 at 10:16
  • @Anony-Mousse I modified the column type and added externalID column (modified in question above) but when I iterate through the clusters how do I get the externalID for a specific DoubleVector? Also from here http://stackoverflow.com/questions/19338627/how-can-i-use-the-index-structures-in-elki I just thought I would add the R-tree index, I don't see a difference in the results though. – MTA Oct 31 '15 at 17:45
  • R-trees work best with SortTileRecursive bulk loading, and you need to experiment with the page size (plus, it only pays off for large data). You have not yet fixed the vector data type to say 2d! Compare to ArrayAdapter. For the external IDs, get the external ID relation with `db.getRelation` – Has QUIT--Anony-Mousse Oct 31 '15 at 17:54
  • @Anony-Mousse I modified the R-tree code, making it similar to http://stackoverflow.com/questions/23869212/elki-dbscan-r-tree-index but I still need to experiment with the page size. The dataset is large, but I was just trying to get it to work with a small dataset to begin with. I modified the vector data type in the same way as is done in the ArrayAdapterDatabaseConnection class, but I did not know what to set the min and max dimensions to, so I just set both to 2. I have also retrieved the externalID. The solution has been edited into the question. Is that OK? Is there something wrong with the 2d vector data type? – MTA Nov 01 '15 at 04:00
  • Latitude, longitude. That makes 2 dimensions. So yes, min and max should be set to two. R-trees and similar indexes can *only* be used on data with a fixed min=max dimensionality (one of the most common restrictions, data must be in R^d). – Has QUIT--Anony-Mousse Nov 01 '15 at 06:52

1 Answer

The interface for data sources is called DatabaseConnection.

JavaDoc of DatabaseConnection

You can implement a MongoDB-based interface to get the data.

It is not a complicated interface; it has a single method.

Erich Schubert
  • Please see edit in my question. I'm a little confused on how to implement DatabaseConnection. – MTA Oct 29 '15 at 12:58
  • You want a column of `DoubleVector`, because that is the data type the distance functions require. You want the type of a 2-dimensional numeric vector field. – Erich Schubert Oct 29 '15 at 13:06
  • So in the `MultipleObjectsBundle` `List<List<?>>` columns, for each entry, you have 2 DoubleVectors, one for long and the other for lat? – MTA Oct 29 '15 at 13:15
  • Where does SimpleTypeInformation come into play? I don't see how I can use that. – MTA Oct 29 '15 at 13:17
  • You want **one vector per object**, obviously, and a `VectorFieldTypeInformation` that indicates this column to be a 2-dimensional vector field. (With vectors having *exactly* 2 dimensions, not 0-1000 dimensions like in text vectors). Vector as in the *mathematical object*, not as in Java misnomer `Vector` (which is a generic list). – Erich Schubert Oct 30 '15 at 11:18
  • I have tried to implement DatabaseConnection but it's not returning anything. Please see my question edit. Thank you. – MTA Oct 30 '15 at 21:09
  • I modified the code now from @Anony-Mousse comment but still unsure about how to associate the ExternalID with the DoubleVector and how to modify the distancefunction (so that it can make sense for long and lat values) – MTA Oct 31 '15 at 17:49
  • "Doesn't work" is not a very precise statement. The `LngLatDistanceFunction` works fine for me. The code you have in your latest edit otherwise looks okay, you should be able to get both your vector and your ID. – Erich Schubert Nov 02 '15 at 08:26