I have an array of 50000 strings. I am trying to create a similarity matrix of dimension 50000 * 50000. To build it, I tried forming the list of all pairs as tuples with the following code:
terms = [element for element in itertools.product(array1,array1)]
But this raises a memory error or crashes the kernel, so I cannot get any further.
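A rough back-of-the-envelope check of why the full list does not fit in memory. This is only a sketch: it assumes CPython on a 64-bit build, and it counts only the tuple objects and the list slots, not the strings themselves.

```python
import sys

n = 50_000
n_pairs = n * n                 # full Cartesian product
n_upper = n * (n + 1) // 2      # upper triangle, diagonal included

# Assumption: CPython, 64-bit. Sizes differ across interpreters and versions.
tuple_bytes = sys.getsizeof(("a", "b"))  # the 2-tuple object only, not its strings
pointer_bytes = 8                        # one list slot per tuple

approx_gib = n_pairs * (tuple_bytes + pointer_bytes) / 2**30
print(f"{n_pairs:,} pairs, roughly {approx_gib:.0f} GiB for the list of tuples")
print(f"{n_upper:,} pairs if only the upper triangle is kept")
```

Even the upper triangle alone is over a billion pairs, so materializing it as a Python list is unlikely to work either.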
I have also looked at this Stack Overflow question: Spark Unique pair in cartesian product. It is very similar to what I need for computing the distances (because of symmetry, I only need the upper or lower triangle of the matrix). Is there a way to do this with Spark, or some other approach that works on partitions? Any ideas would be appreciated.
Toy example for a small array:
import itertools
import numpy as np
array1 = np.array(['hello', 'world', 'thankyou'])
terms = [element for element in itertools.product(array1,array1)]
Output of terms:
[('hello', 'hello'),
('hello', 'world'),
('hello', 'thankyou'),
('world', 'hello'),
('world', 'world'),
('world', 'thankyou'),
('thankyou', 'hello'),
('thankyou', 'world'),
('thankyou', 'thankyou')]
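To illustrate the symmetry idea on the same toy data, here is a minimal sketch that generates only the upper triangle (diagonal included) lazily and hands it out in fixed-size chunks, so the full list is never held in memory. The helper name pair_chunks and the chunk size are made up for illustration, and this is plain Python rather than Spark.

```python
import itertools

def pair_chunks(items, chunk_size):
    """Yield lists of at most chunk_size unordered pairs (upper triangle, diagonal included)."""
    # combinations_with_replacement is lazy: pairs are produced one at a time.
    pairs = itertools.combinations_with_replacement(items, 2)
    while True:
        chunk = list(itertools.islice(pairs, chunk_size))
        if not chunk:
            return
        yield chunk

array1 = ['hello', 'world', 'thankyou']
# Collected into a list here only because the toy input is tiny.
chunks = list(pair_chunks(array1, chunk_size=4))
for chunk in chunks:
    print(chunk)
```

For the three strings above this yields 6 pairs instead of 9. With the real array, each chunk could be scored and written out before the next one is requested, instead of collecting the chunks into a list.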