
Good day.

I am three months into R and RStudio but am getting the hang of things. I am implementing a SOM solution on 38k records/observations using kohonen's supersom(), following the tutorial "Self-Organising Maps for Customer Segmentation using R".

  • My data have no missing values but almost 60 columns, many of them dummyVars (I received the data in this format)
  • I have removed the one char column (URL)
  • My Y column (as I understand it) is "shares" (how many times the article was shared)
  • My data consist only of numerical values (the dummyVars are of course 0 or 1)
  • I have centered and scaled my data (the entire data frame)
  • As per the example I followed, I did convert the entire data frame to a matrix (a rough sketch of these steps follows below)
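
For concreteness, roughly what those steps look like on my side (the file and object names here are placeholders, not my real ones):

    library(kohonen)

    # placeholder names; my real file/objects are called something else
    news_df <- read.csv("news.csv")   # ~38k rows, ~60 numeric columns incl. dummyVars
    news_df$url <- NULL               # drop the single char column (URL)

    # centre and scale every column; scale() already returns a matrix, so
    # as.matrix() is only there to be explicit, as in the example I followed
    news_mat <- as.matrix(scale(news_df, center = TRUE, scale = TRUE))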

My problem is that the SOM takes ages to train, even with multi-core processing, and the training progress graph does not reach a nice flat-ish plateau; it does come down nicely but stays very erratic. All my other plots show extremely high node populations and there is no nice clustering. I have even tried 500 iterations with a 100x100 grid ;-(
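
To be concrete, the kind of call I am making looks roughly like this (the grid size and rlen here are only illustrative, not my exact settings):

    som_grid <- somgrid(xdim = 20, ydim = 20, topo = "hexagonal")

    som_model <- supersom(news_mat,          # the scaled matrix from above
                          grid      = som_grid,
                          rlen      = 500,   # training iterations
                          keep.data = TRUE)

    plot(som_model, type = "changes")   # the progress graph that stays erratic
    plot(som_model, type = "counts")    # node populations, all extremely high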

I think/guess it is because of the huge number of columns, most of them dummyVars, e.g. dayOfWeek.Monday, dayOfWeek.Tuesday, category.LifeStile, category.Computers, etc.

What am I to do?

Should I convert the dummyVars back into another format? How, and why?

Please do not just give me a section of code, as I would like to understand why I need to do what.

Thanx

  • If I were you I would start easy: use, for example, only two variables. Choose the ones you think are most meaningful for your study and try not to include dummies in this first test. The goal is a fast learning process so you start seeing some results. – Seymour Mar 22 '18 at 17:53
  • Also, I suggest you consider a very important aspect that has been extensively discussed in several SO questions: you are using a Euclidean distance! So I would think twice before (1) including non-numerical variables and (2) including too many variables. Regarding point (2), computing the Euclidean distance over hundreds of variables is often not the best idea. – Seymour Mar 22 '18 at 17:56
  • Yeah, in some runs I finally used a random selection of only 20% of my data to see if anything changed, but no... So I'll start with just 5-odd columns to see if it makes a difference. However, @Seymour, you do not suggest converting or changing the dummyVars to something else? Note, my data only include numerical values. – Cornelius Mostert Mar 22 '18 at 18:02
  • First, I would try starting simple: only 2 columns that are very straightforward, like age and salary. Concerning transformation/conversion, I have no idea what your variables and dataset are. I just suggested investigating further before using a very high number of variables, as well as non-numeric variables, when similarity is measured with a Euclidean distance. – Seymour Mar 22 '18 at 18:05
  • Also, you said you have multiple cores, but have you actually parallelized the function or not? – Seymour Mar 22 '18 at 18:11
  • I have seen the "other" method, not Euclidean... but forgot what it was; do you think I should rather switch or use both? I have seen an example where a vector was passed to the SOM training to say which distance should be used for each column (see the sketch after these comments). For the parallelization I have followed https://www.kaggle.com/elimiller/the-caret-package-and-the-titanic?scriptVersionId=2171589: library(parallel); library(doParallel); cluster <- makeCluster(detectCores() - 1); registerDoParallel(cluster); # some R code; stopCluster(cluster); registerDoSEQ() – Cornelius Mostert Mar 22 '18 at 18:13
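
For what it is worth, a rough, untested sketch of that "one distance per layer" idea, assuming kohonen >= 3.0 where supersom() accepts a list of data layers and one dist.fcts entry per layer (the dummy-column detection and layer split are placeholders for my own columns):

    # split the data into a continuous layer and a 0/1 dummy layer (placeholder logic)
    is_dummy  <- apply(news_df, 2, function(x) all(x %in% c(0, 1)))
    num_layer <- scale(as.matrix(news_df[, !is_dummy]))   # continuous variables, scaled
    bin_layer <- as.matrix(news_df[, is_dummy])           # dummyVars, left as 0/1

    som_model <- supersom(list(numeric = num_layer, dummies = bin_layer),
                          grid      = somgrid(20, 20, "hexagonal"),
                          rlen      = 100,
                          dist.fcts = c("euclidean", "tanimoto"),  # one distance per layer
                          mode      = "pbatch",                    # parallel batch mode, if available
                          cores     = parallel::detectCores() - 1)

    plot(som_model, type = "changes")

As far as I can tell, the doParallel/registerDoParallel registration from the caret kernel only helps functions that use foreach internally; supersom() seems to do its own parallelism through the mode/cores arguments.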

0 Answers