0

I have two tables. one includes 1 million records, the other includes 20 million records.

table 1 scheme is:
CREATE TABLE IF NOT EXISTS keyword_vector
(keyword string, vec array<double>)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
STORED AS TextFile;

table 1
keyword,     vec
apple       (1, 1)
orange      (2, 2)
banana      (3, 3)
cat         (4, 4)
dog         (5, 4)
....
table 2 schema:
CREATE TABLE IF NOT EXISTS dict_keyword_vector
(keyword string, vec array<double>)
 ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
 STORED AS TextFile;

table 2
keyword         vec
fruit          (55, 11)
animal         (33, 22)
pear           (44, 66)
kiwi           (22, 11)
peach          (11, 33)
....

I need to use the formula cos_simlairy = ab/(|a||b|), this formula is from https://en.wikipedia.org/wiki/Cosine_similarity
then get the rank of the result, and get top 5 in the rank order by similarity. their result would be like:

value from table 1, top 5 for each value in table 1
(1, 1), (1*44 + 1*66)/sqrt(1 + 1)*sqrt(44*44 + 66*66)
(1, 1), (1*55 + 1*11)/sqrt(1 + 1)*sqrt(55*55 + 11*11)
(1, 1), (1*33 + 1*22)/sqrt(1 + 1)*sqrt(33*33 + 22*22) 
(1, 1), (1*11 + 1*33)/sqrt(1 + 1)*sqrt(11*11 + 33*33)
(1, 1), (1*22 + 1* 11)/sqrt(1 + 1)*sqrt(22*22 + 11*11)
.....

I tried to use cross join in hive. but I always get a failure due to the table is too large.

vito yan
  • 135
  • 1
  • 9
  • value is two separate columns or what? please explain table schema. – leftjoin Nov 07 '18 at 11:11
  • How vectors are represented in the tables – leftjoin Nov 07 '18 at 11:32
  • Well, it seems now it is not possible to reduce one of the tables before join like it was done in this answer https://stackoverflow.com/a/53186249/2700344. If you have no other join keys and it is impossible to reduce datasets before cross join and you really need to cross join first then calculate rank and filter by rank, then let's try to fix cross join itself. Could you please provide your query and container error log? – leftjoin Nov 07 '18 at 12:49
  • @leftjoin sorry for late. I will add the schema. – vito yan Nov 07 '18 at 21:10
  • @leftjoin, i am trying to get error log. once i got it, i will put it on here – vito yan Nov 07 '18 at 21:21
  • have found article on similar problem http://sciencesql.blogspot.com/2016/03/calculating-cosine-similarity-between.html – leftjoin Nov 09 '18 at 13:28

0 Answers0