
I have an array of strings whose length is 50000. I am trying to create a similarity matrix of dimension 50000 * 50000. To build it, I tried forming the list of tuples using the following code:

import itertools

terms = [element for element in itertools.product(array1, array1)]

But I am getting a MemoryError, or the kernel crashes; it is not able to move forward.

I have also followed this Stack Overflow question: Spark Unique pair in cartesian product. It is very similar to my implementation of calculating the distances (due to symmetry, I can make use of just the upper or lower triangle of the matrix). Is there any way to get this done with Spark, by working with partitions, or by some other means? Any idea would be appreciated.
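The pattern from that question, applied to my array, would look roughly like this (just a sketch; `sc` is an existing SparkContext, and I have not managed to run this at full scale):

# Assumes an existing SparkContext `sc` and the 50000-string array `array1`.
rdd = sc.parallelize(array1.tolist()).zipWithIndex()    # (string, row_index) pairs
pairs = (rdd.cartesian(rdd)
            .filter(lambda ab: ab[0][1] <= ab[1][1])    # keep only the upper triangle by index
            .map(lambda ab: (ab[0][0], ab[1][0])))      # back to (string, string) pairs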

Toy example with a small array:

import itertools
import numpy as np

array1 = np.array(['hello', 'world', 'thankyou'])
terms = [element for element in itertools.product(array1, array1)]

Output of terms:

[('hello', 'hello'),
 ('hello', 'world'),
 ('hello', 'thankyou'),
 ('world', 'hello'),
 ('world', 'world'),
 ('world', 'thankyou'),
 ('thankyou', 'hello'),
 ('thankyou', 'world'),
 ('thankyou', 'thankyou')]
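
For comparison, itertools.combinations_with_replacement gives only the upper triangle of these pairs (6 instead of 9), which is exactly the symmetry I would like to exploit; at n = 50000 that still leaves n*(n+1)/2 pairs, so materializing them as a list does not help:

upper = list(itertools.combinations_with_replacement(array1, 2))
# [('hello', 'hello'), ('hello', 'world'), ('hello', 'thankyou'),
#  ('world', 'world'), ('world', 'thankyou'), ('thankyou', 'thankyou')]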
Vas

1 Answer

50000 * 50000 is 2.5 billion elements in the list. Each list element takes 4 bytes, plus about 36 bytes of overhead per tuple. Multiply that by the average string length (6 in your example) plus 21 bytes of overhead per string, and you'll need 216+ GB of RAM just for this single statement (on top of memory for your OS, other programs, etc.). I think you're hitting real-world limitations and need to find a better algorithm.
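
Exact per-object sizes vary with the Python build, but you can sanity-check the order of magnitude yourself with sys.getsizeof, and you can avoid materializing the list entirely by streaming the pairs from a generator. A minimal sketch (the similarity() call is a hypothetical placeholder for whatever distance function you use):

import sys
import itertools

array1 = ['hello', 'world', 'thankyou']   # stands in for the real 50000-string array

# Measure the rough cost of one (string, string) tuple on this interpreter.
# Exact sizes differ between Python versions and platforms, and shared string
# objects make the true total somewhat smaller, but the order of magnitude
# stays hopeless for a 50000 x 50000 product.
pair = ('hello', 'world')
per_pair = (sys.getsizeof(pair)        # the tuple object itself
            + sys.getsizeof(pair[0])   # its first string
            + sys.getsizeof(pair[1])   # its second string
            + 8)                       # one 64-bit pointer slot in the list
n = 50000
print('roughly %d GB for the full list' % (n * n * per_pair // 10**9))

# Streaming alternative: never build the list at all.
# combinations_with_replacement yields only the upper triangle, which the
# symmetry of a similarity matrix allows.
for a, b in itertools.combinations_with_replacement(array1, 2):
    pass  # sim = similarity(a, b)  (hypothetical distance function)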

thebjorn