
I have a specific application in which I am trying to verify a strong positive relationship between many of the time series I am reading. I should elaborate:

  • I have many distributed actors, each generating a large number of time series data streams. The number of actors × time-series streams is so large that using all of them for my regression analysis is very costly, so I chose sampling, and I am getting robust estimates.

  • The problem is that I need to validate this idea. To do so, I want to compute the correlation coefficient between random samples of these time series, fit a Gaussian distribution to the resulting correlations, and assign the mean and standard deviation of those correlations to each actor, in order to show which actors produce the most related time series in the application domain.
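
The sampling idea above can be sketched as follows. This is a minimal illustration, not your actual pipeline: the data here is simulated (random walks standing in for your actors' streams), and the actor names, array shapes, and number of sampled pairs are all hypothetical. It draws random pairs of series per actor, computes the Pearson correlation coefficient for each pair, and summarizes the resulting distribution by its mean and standard deviation:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: each actor produces several time series (one per row).
# Simulate 5 actors, each with 20 random-walk series of length 200.
n_actors, n_series, n_points = 5, 20, 200
actors = {
    f"actor_{i}": rng.standard_normal((n_series, n_points)).cumsum(axis=1)
    for i in range(n_actors)
}

def sampled_correlation_stats(series, n_pairs=100, rng=rng):
    """Pearson r for a random sample of series pairs; return (mean, stdev)."""
    n = len(series)
    corrs = []
    for _ in range(n_pairs):
        # Draw a random pair of distinct series from this actor.
        i, j = rng.choice(n, size=2, replace=False)
        r = np.corrcoef(series[i], series[j])[0, 1]
        corrs.append(r)
    corrs = np.asarray(corrs)
    return corrs.mean(), corrs.std(ddof=1)

# Assign the mean and stdev of sampled correlations to each actor,
# then rank actors by how correlated their series are on average.
stats = {name: sampled_correlation_stats(s) for name, s in actors.items()}
for name, (mu, sigma) in sorted(stats.items(), key=lambda kv: -kv[1][0]):
    print(f"{name}: mean r = {mu:+.3f}, stdev = {sigma:.3f}")
```

Whether a Gaussian fit (mean and stdev) is an adequate summary depends on the shape of the empirical distribution of correlations; since Pearson's r is bounded in [-1, 1], the distribution can be skewed when correlations are strong, and a Fisher z-transform of the coefficients before averaging is a common refinement.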

Questions:

  1. Is this a valid way to verify that the correlation exists, so that we can justify sampling to reduce the number of readings of the actual data?
  2. Or are there other ways of doing this kind of collective correlation analysis?
M.Rez
