
I have gone through this question, but the solution doesn't help: ELKI Kmeans clustering Task failed error for high dimensional data

This is my first time with ELKI, so please bear with me. I have 45,000 2D data points (obtained via doc2vec) that contain negative values and are not normalized. The dataset looks something like this:

-4.688612   32.793335
-42.990147  -20.499323
-24.948868  -10.822767
-45.502155  -40.917801
27.979715   -40.012688
1.867812    -9.838544
56.284512   6.756072

I am using the K-means algorithm to get 2 clusters. However, I get the following error:

Task failed
de.lmu.ifi.dbs.elki.data.type.NoSupportedDataTypeException: No data type found satisfying: NumberVector,field AND NumberVector,variable
Available types: DBID DoubleVector,variable,mindim=0,maxdim=1 LabelList
    at de.lmu.ifi.dbs.elki.database.AbstractDatabase.getRelation(AbstractDatabase.java:126)
    at de.lmu.ifi.dbs.elki.algorithm.AbstractAlgorithm.run(AbstractAlgorithm.java:81)
    at de.lmu.ifi.dbs.elki.workflow.AlgorithmStep.runAlgorithms(AlgorithmStep.java:105)
    at de.lmu.ifi.dbs.elki.KDDTask.run(KDDTask.java:112)
    at de.lmu.ifi.dbs.elki.application.KDDCLIApplication.run(KDDCLIApplication.java:61)
    at [...]

So my question is: does ELKI require the data to be in the range [0, 1]? All the examples I came across had their data within that range.

Or is it that ELKI does not accept negative values?

If something else, can someone please guide me through this?

Thank you!

Sascha

1 Answer


ELKI can handle negative values just fine.

Your input data is not correctly formatted. This is the same problem as in ELKI Kmeans clustering Task failed error for high dimensional data

Apparently your lines parse to vectors with either 0 or 1 values (note the mindim=0,maxdim=1 in the error message). ELKI itself is fine with that, but k-means requires the data to lie in an R^d vector space, hence ELKI cannot run k-means on your data set. The root cause is a bad input file. You may want to double-check your file - there probably is at least one line that is not properly formatted.

Erich Schubert
  • I did try the solution given for that question. It didn't work for me. I even normalized my data and tried. And I tried for DBSCAN and OPTICS too. With DBSCAN at least the output window opens but no cluster or anything is shown. And in the main window, I get a whole lot of **machine language in red** error. – Sascha Apr 24 '19 at 18:19
  • Also, I don't know the labels of the data points unlike the mouse.csv in the ELKI tutorials. So not knowing the labels, is the NumberVectorLabelParser correct? (...because I don't have labels ) – Sascha Apr 24 '19 at 18:20
  • That is not "machine language". That is called a Java stack trace, and its aimed at human developers... basic knowledge for any developer. As it can easily be seen from the examples, labels are not necessary. But some part of your files fail to parse, and end up being interpreted as labels. But there are more ways how your files can be broken. Are you 100% sure they are all proper numbers? Double check. Because **there is some value in your file that the parser could not parse**. But its supposedly your file that is defect. – Erich Schubert Apr 24 '19 at 20:42
  • Oh, I see. Thank you for your help! I did check the data and all values are numbers. I finally got it to work by inputting a txt file instead of csv. It doesn't work with csv for me. But with txt, K-means and DBSCAN are giving some output! – Sascha Apr 25 '19 at 06:28
  • Maybe the separator is not set up right? There is a configuration option for the separators, because CSV is not a uniquely defined format (nor is "txt file"). Or the numbers are formatted in some locale? Either way, it would be good to know what went wrong. How did you produce the CSV? Because if that is a common problem, we could adjust the defaults (at least if that does not end up breaking it for others). – Erich Schubert Apr 25 '19 at 09:29
  • The numbers are basically vector representations of tweets that I obtain using the doc2vec model of Gensim. I am working on Colab. I export the dataframe containing the vectors to txt using **normal = df.to_csv (r'normalized.txt',header=None, index = None)** and to csv using **np.savetxt(r'norm.csv',df,delimiter=',',fmt=('%1.22e'))** ..... The txt file works but the csv doesn't. – Sascha Apr 25 '19 at 13:30
  • Which one is the version that did not work? the `.txt` generated by `to_csv`, or the `.csv` generated by `savetxt`? But the format choice `%1.22e` *can* be responsible, because not every "22 digit" number will be accepted by ELKI, because double precision only has 16-17 digits of precision. Since your input data does not appear to have that many digits, did you consider using something like "%10f"? See also https://github.com/elki-project/elki/issues/39 – Erich Schubert Apr 25 '19 at 15:53
  • The current version *should* have given you a warning "Too many digits in what looked like a double number - treating as string" though. Maybe that was in the red text? – Erich Schubert Apr 25 '19 at 15:53
  • The .csv generated by savetxt did not work (the csv generated by to_csv didn't work either). Following your suggestion, I also tried without the %1.22e. The txt file still works. The error that I got in red looks like this: **Invalid quoted line in input: no closing quote found in:Kv°Ø½çí°d‹Ý{ÞÑKv¨X:,G;,ÙÁbçmý=,ÙÁb÷yÞÑKv°X’×Kv°X’×Ë�Ë!–ɯ_›ø=,‡tX&¿‹‘¼Ÿ“">;ÚaÉ** And I can't make head or tail of it. The data in my csv is in exactly the same format as the sample mouse.csv dataset. – Sascha Apr 25 '19 at 16:26
  • So Java supposedly doesn't allow uninitialized memory. Somewhere these "garbage characters" must come from. Can you upload the broken file somewhere for inspection? – Erich Schubert Apr 27 '19 at 13:21
  • https://github.com/nikitasalkar1997/Twitter-Dataset.git Sorry for the late reply... The entire dataset is uploaded here. – Sascha Apr 30 '19 at 17:11
  • But that file is text, not numbers. So that isn't the broken file, is it? – Erich Schubert May 01 '19 at 13:36
  • So sorry about that! I have uploaded the correct file and also the one with the normalized data. – Sascha May 01 '19 at 17:40
  • Both files load fine in ELKI for me, I cannot reproduce your error. Which is the *broken* file? – Erich Schubert May 01 '19 at 17:50
  • I was getting the red error with the csv files... Not with txt file... I am very new to elki and this is my 1st project. I don't know what is wrong. I am very sorry for bothering you. – Sascha May 01 '19 at 18:35
  • Well the files you uploaded work for me, so they don't help me debug this. That is why I am asking specifically for the not working file. – Erich Schubert May 01 '19 at 21:04