
I am trying to use Spark/MLlib to obtain the correlation coefficient between multiple columns in a set of data. I am having no problem with the numeric columns, where I have been successfully able to calculate the Pearson correlation. However, I cannot figure out how to correlate string and other non-numeric data. The documentation notes that the Spearman correlation is generally used for this purpose, but all of the examples I have seen seem to use numeric data, even in the Spearman case.
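Roughly, the numeric case that already works looks like the following sketch (column names and data are made up, not my real schema):

```python
# Minimal sketch of the working numeric case; column names and data are invented.
from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.stat import Correlation

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame(
    [(1.0, 10.0), (2.0, 19.5), (3.0, 31.0), (4.0, 39.0)],
    ["x", "y"],  # hypothetical numeric columns
)

# MLlib's Correlation works on a single vector column, so assemble the columns first.
features = VectorAssembler(inputCols=["x", "y"], outputCol="features").transform(df)

print(Correlation.corr(features, "features", "pearson").head()[0])
print(Correlation.corr(features, "features", "spearman").head()[0])
```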

Andy

1 Answer


You will have to "encode" the data into a numeric format in a structure-preserving way in order to compute a correlation. In the general case, you decide on the set of features you want to correlate on and then preprocess the data so that it is represented solely by those features.
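As a minimal sketch of that idea (the column names, data, and the choice of StringIndexer are assumptions, not taken from the question): replace the string column with a numeric representation, then correlate as in the numeric case.

```python
# Minimal sketch with assumed column names/data: encode the string column
# numerically (StringIndexer here, but any encoding of your chosen features
# works), then correlate as usual.
from pyspark.sql import SparkSession
from pyspark.ml.feature import StringIndexer, VectorAssembler
from pyspark.ml.stat import Correlation

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame(
    [("red", 1.0), ("blue", 2.0), ("red", 3.0), ("green", 4.0)],
    ["colour", "value"],  # hypothetical string + numeric columns
)

# Replace the string column by a frequency-ordered numeric index.
indexed = StringIndexer(inputCol="colour", outputCol="colour_idx").fit(df).transform(df)

features = VectorAssembler(
    inputCols=["colour_idx", "value"], outputCol="features"
).transform(indexed)

# Spearman is rank-based; whether the index ordering is meaningful for your
# data is the caveat discussed in the comments below.
print(Correlation.corr(features, "features", "spearman").head()[0])
```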

If you are working with text data (articles, or even words correlated by their symbols) you can use the tokenizer/vectorizer/MinHashLSH approach. It is well described in this example, and here is a good example of how to preprocess data with RegexTokenizer. After obtaining the hashed features you can reduce their dimensionality to some degree (it is quite hard to interpret a correlation in a 100-dimensional feature space) and then run the usual correlation computation.
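A hedged sketch of that pipeline; the column names, sample text, and parameters are assumptions, and PCA stands in here for the "reduce them to some degree" step:

```python
# Tokenize -> hash -> (optionally) MinHashLSH -> reduce -> correlate.
from pyspark.sql import SparkSession
from pyspark.ml.feature import RegexTokenizer, HashingTF, MinHashLSH, PCA
from pyspark.ml.stat import Correlation

spark = SparkSession.builder.getOrCreate()

docs = spark.createDataFrame(
    [(0, "spark mllib correlation"),
     (1, "spearman versus pearson correlation"),
     (2, "string columns need a numeric encoding")],
    ["id", "text"],  # hypothetical text column
)

# Split the raw text into tokens on non-word characters.
tokens = RegexTokenizer(inputCol="text", outputCol="tokens", pattern="\\W").transform(docs)

# Hash each token list into a fixed-size term-frequency vector.
hashed = HashingTF(inputCol="tokens", outputCol="tf", numFeatures=1 << 8).transform(tokens)

# MinHashLSH adds locality-sensitive hashes, useful for approximate similarity joins.
lsh = MinHashLSH(inputCol="tf", outputCol="hashes", numHashTables=5).fit(hashed)
with_hashes = lsh.transform(hashed)

# Reduce the high-dimensional hashed features before correlating them.
reduced = PCA(k=2, inputCol="tf", outputCol="reduced").fit(hashed).transform(hashed)
print(Correlation.corr(reduced, "reduced", "spearman").head()[0])
```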

For some data types the answer might be hashing, but a hash only represents the distinct values, not their features.
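A tiny illustration of that caveat (the values are made up): equal inputs always collide, but near-identical strings land on unrelated digests, so nothing feature-like survives.

```python
# A plain digest preserves only the identity of a value.
import hashlib

def digest(value: str, n: int = 8) -> str:
    # Hypothetical helper: first n hex characters of an MD5 digest.
    return hashlib.md5(value.encode("utf-8")).hexdigest()[:n]

print(digest("blue"), digest("blues"), digest("blue"))
# Equal inputs always collide (useful for grouping distinct values),
# but "blue" and "blues" share no structure in hash space.
```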

If you can provide some sample data, we could think of a less generic solution.

Dennis Tsoi
  • OK, so if there is only a limited (potentially fairly large) number of distinct values, then could I use something like the [StringIndexer](https://spark.apache.org/docs/latest/ml-features.html#stringindexer)? Alternatively, if the number of distinct strings is prohibitively large, what about something like a truncated MD5 hash digest, keeping only the first 'n' characters (cf. Git)? – Andy Jan 27 '18 at 17:01
  • @Andy StringIndexer would only retain the frequency ranking of each value, and if that is enough to represent the correlation, then yes. When it's possible I try to go with manual weighting and reduce the set of features to 2- or even 1-dimensional values; it usually yields the best results. – Dennis Tsoi Jan 27 '18 at 17:13
  • @Andy Hashing (truncated or not) should only be used to operate on the distinctness of values, not their features. For example, you could extract the features, reduce them, and then hash the value so it would be easier to group the original values by the **exact** feature set. – Dennis Tsoi Jan 27 '18 at 17:18
  • Looks like I'd be better off abstracting out the hash function and providing concrete implementations on a case-by-case basis (roughly as in the sketch below). Thanks for all your help - it's made things a lot clearer. – Andy Jan 27 '18 at 18:39
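A hypothetical sketch of that abstraction; the interface and both encoders are assumptions rather than anything in Spark's API, and the truncated-MD5 variant only captures value identity (see the comment above), so it is only useful for grouping, not for feature-level correlation.

```python
# Hypothetical "abstract out the encoder" sketch; none of these classes exist in Spark.
import hashlib
from abc import ABC, abstractmethod

from pyspark.sql import DataFrame, functions as F
from pyspark.ml.feature import StringIndexer


class ColumnEncoder(ABC):
    """Turns one non-numeric column into a numeric column."""

    @abstractmethod
    def encode(self, df: DataFrame, input_col: str, output_col: str) -> DataFrame:
        ...


class StringIndexEncoder(ColumnEncoder):
    """Frequency-ordered index; fine when the set of distinct values is limited."""

    def encode(self, df, input_col, output_col):
        return StringIndexer(inputCol=input_col, outputCol=output_col).fit(df).transform(df)


class TruncatedMd5Encoder(ColumnEncoder):
    """Git-style truncated digest; captures only value identity, not features."""

    def __init__(self, n_hex_chars: int = 8):
        self.n = n_hex_chars

    def encode(self, df, input_col, output_col):
        digest = F.udf(
            lambda v: int(hashlib.md5(v.encode("utf-8")).hexdigest()[: self.n], 16)
            if v is not None else None,
            "long",
        )
        return df.withColumn(output_col, digest(F.col(input_col)))
```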