
I am having a few issues scaling a text-matching program. I am using text2vec, which provides very good and fast results.

The main problem I am having is manipulating a large matrix which is returned by the text2vec::sim2() function.

First, some details of my hardware / OS setup: Windows 7 with 12 cores at about 3.5 GHz and 128 GB of memory. It's a pretty good machine.

Second, some basic details of what my R program is trying to achieve.

We have a database of 10 million unique canonical addresses covering every house / business in the country. Each of these reference addresses also has latitude and longitude information.

I am trying to match these reference addresses to customer addresses in our database. We have about 600,000 customer addresses. The quality of these customer addresses is not good. Not good at all! They are stored as a single string field with absolutely zero checks on input.

The technical strategy to match these addresses is quite simple. Create two document-term matrices (DTMs) of the customer addresses and reference addresses, and use cosine similarity to find the reference address that is most similar to a specific customer address. Some customer addresses are so poor that they result in a very low cosine similarity -- so, for these addresses, a "no match" is assigned.

Despite the simplicity of the solution, the results obtained are very encouraging.

But I am having problems scaling things, and I am wondering if anyone has any suggestions.

There is a copy of my code below. It's pretty simple. Obviously, I cannot include real data, but it should give readers a clear idea of what I am trying to do.

SECTION A - Works very well even on the full 600,000 * 10 million input data set.

SECTION B - the text2vec::sim2() function causes RStudio to shut down when the vocabulary exceeds about 140,000 tokens (i.e. columns). To avoid this, I process the customer addresses in chunks of about 200 (a sketch of this chunked loop is shown after the code listing below).

SECTION C - This is the most expensive section. When processing addresses in chunks of 200, SECTION A and SECTION B take about 2 minutes. But SECTION C, using what I would have thought to be super-quick functions, takes about 5 minutes to process a 10 million row * 200 column matrix.

Combined, SECTIONS A to C take about 7 minutes to process 200 addresses. As there are 600,000 addresses, the full run would take about 14 days.

Are there any ideas to make this code run faster?

rm(list = ls())
library(text2vec)
library(dplyr)


# Create some test data

# example is 10 entries.  
# but in reality we have 10 million addresses
vct_ref_address <- c(
  "15 smith street beaconsfield 2506 NSW",
  "107 orange grove linfield 2659 NSW",
  "88 melon drive calton 3922 VIC",
  "949 eyre street sunnybank 4053 QLD",
  "12 black avenue kingston 2605 ACT",
  "5 sweet lane 2004 wynyard NSW",
  "32 mugga way 2688 manuka ACT",
  "4 black swan avenue freemantle 5943 WA",
  "832 big street narrabeet 2543 NSW",
  "5 dust road 5040 NT")


# example is 4 entries
# but in reality, we have 1.5 million addresses
vct_test_address <- c(
  "949 eyre street sunnybank 4053 QLD",
  "1113 completely invalid suburb with no post code QLD",
  "12 black road kingston 2605 ACT",
  "949 eyre roaod sunnybank 4053 QLD")

# ==========================
# SECTION A ===== prepare data
# A.1 create vocabulary 
t2v_token <- text2vec::itoken(c(vct_test_address, vct_ref_address),  progressbar = FALSE)
t2v_vocab <- text2vec::create_vocabulary(t2v_token)
t2v_vectorizer <- text2vec::vocab_vectorizer(t2v_vocab)
# A.2 create document term matrices dtm
t2v_dtm_test <- text2vec::create_dtm(itoken(vct_test_address, progressbar = FALSE), t2v_vectorizer)
t2v_dtm_reference <- text2vec::create_dtm(itoken(vct_ref_address, progressbar = FALSE), t2v_vectorizer)

# ===========================
# SECTION B ===== similarity matrix
mat_sim <- text2vec::sim2(t2v_dtm_reference, t2v_dtm_test,  method = 'cosine', norm = 'l2')

# ===========================
# SECTION C ===== process matrix
vct_which_reference <- apply(mat_sim, 2, which.max)
vct_sim_score <- apply(mat_sim, 2, max)

# ============================
# SECTION D ===== apply results
# D.1 assemble results
df_results <- data.frame(
  test_addr = vct_test_address,
  matched_addr = vct_ref_address[vct_which_reference],
  similarity = vct_sim_score)

# D.2 print results
df_results %>% arrange(desc(similarity))
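
For context, the chunked loop referred to in SECTION B looks roughly like the sketch below. The chunk size, the object names (chunks, lst_results) and the rbind-style collection are illustrative assumptions rather than the exact production code, and the final "no match" cut-off value is also an assumption.

# ============================
# SKETCH ===== chunked driver loop (illustrative only)
chunk_size <- 200
chunks <- split(seq_along(vct_test_address),
                ceiling(seq_along(vct_test_address) / chunk_size))

lst_results <- lapply(chunks, function(idx) {
  # SECTION B equivalent: similarity of one chunk of customer addresses
  dtm_test_chunk <- text2vec::create_dtm(
    itoken(vct_test_address[idx], progressbar = FALSE), t2v_vectorizer)
  mat_sim <- text2vec::sim2(t2v_dtm_reference, dtm_test_chunk,
                            method = 'cosine', norm = 'l2')

  # SECTION C equivalent: best reference address per customer address
  data.frame(test_addr = vct_test_address[idx],
             matched_addr = vct_ref_address[apply(mat_sim, 2, which.max)],
             similarity = apply(mat_sim, 2, max),
             stringsAsFactors = FALSE)
})

df_results <- do.call(rbind, lst_results)

# assign "no match" below an assumed similarity cut-off (0.5 is a placeholder)
df_results$matched_addr[df_results$similarity < 0.5] <- NA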
markthekoala
  • This is quite verbose. Can you distill your question down a bit? – Roman Luštrik Feb 16 '18 at 08:06
  • Yes, it is a bit verbose @RomanLuštrik. The function sim2() produces a very large matrix which takes a long time to process. Is there any way I can avoid returning such a large matrix? Or are there quicker ways to process a large matrix than using apply()? – markthekoala Feb 16 '18 at 09:22

1 Answer


The issue in SECTION C is that mat_sim is sparse, and the apply() calls perform column/row subsetting, which is super slow (and converts sparse vectors to dense ones).

There could be several solutions:

  1. If mat_sim is not very huge, convert it to a dense matrix with as.matrix() and then use apply().
  2. Better: convert mat_sim to a sparse matrix in triplet format with as(mat_sim, "TsparseMatrix") and then use data.table to get the indices of the max elements. Here is an example:

    library(text2vec)
    library(Matrix)
    data("movie_review")
    it = itoken(movie_review$review, tolower, word_tokenizer)
    dtm = create_dtm(it, hash_vectorizer(2**14))
    
    
    mat_sim = sim2(dtm[1:100, ], dtm[101:5000, ])
    mat_sim = as(mat_sim, "TsparseMatrix")
    
    library(data.table)
    
    # add 1L because indices in sparse matrices in the Matrix package are zero-based
    mat_sim_dt = data.table(row_index = mat_sim@i + 1L, col_index = mat_sim@j + 1L, value = mat_sim@x)
    
    res = mat_sim_dt[, 
            { k = which.max(value); list(max_sim = value[[k]], row_index = row_index[[k]]) }, 
            keyby = col_index]  
    res
    
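Mapped back to the question's setup (a sketch assuming mat_sim there is built as sim2(t2v_dtm_reference, t2v_dtm_test), so rows are reference addresses and columns are customer addresses), the result would be used roughly like this. Note that columns whose similarities are all zero never appear in res and therefore fall out as "no match":

    # sketch: join the data.table result back to the addresses from the question
    # (columns with only zero similarities are absent from res, i.e. "no match")
    df_results = data.frame(
      test_addr    = vct_test_address[res$col_index],
      matched_addr = vct_ref_address[res$row_index],
      similarity   = res$max_sim)
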

Also, as a side suggestion, I recommend trying char_tokenizer() with n-grams (for example of size c(3, 3)) to "fuzzy"-match different spellings and abbreviations of addresses.
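
For instance, a character-trigram vocabulary over the question's addresses could be built roughly as follows (a sketch; the c(3, 3) size is just the example above, and the same vectorizer must be reused for both the reference and customer DTMs):

    # sketch: character trigrams for fuzzy matching of address spellings
    it_char = itoken(c(vct_test_address, vct_ref_address), tolower, char_tokenizer,
                     progressbar = FALSE)
    v_char = create_vocabulary(it_char, ngram = c(3L, 3L))
    vectorizer_char = vocab_vectorizer(v_char)
    dtm_ref_char = create_dtm(itoken(vct_ref_address, tolower, char_tokenizer,
                                     progressbar = FALSE), vectorizer_char)
    dtm_test_char = create_dtm(itoken(vct_test_address, tolower, char_tokenizer,
                                      progressbar = FALSE), vectorizer_char)
    mat_sim_char = sim2(dtm_ref_char, dtm_test_char, method = 'cosine', norm = 'l2')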

Dmitriy Selivanov
  • Thanks very much for such a comprehensive response. `mat_sim` is about 10 million rows * 200 columns. It's huge! So the second solution is probably more appropriate. The solution you proposed is indeed much faster than mine: it took 90 seconds to process compared to 250 seconds. The matrix conversion took 47 seconds, the conversion to data.table took 33 seconds, and the data.table group-by took 11 seconds. `mat_sim` had 40% zero values. – markthekoala Feb 19 '18 at 22:41
  • Thanks also for the `char_tokenizer()` suggestion. I have not (yet) tried this out. One concern is that --- although the results may be more accurate --- the speed of execution may deteriorate. At the moment, we are splitting the problem up into 3,000 chunks and each of these takes about 210 seconds to process. Thanks again, enjoying using `text2vec` – markthekoala Feb 19 '18 at 22:44