I am currently working with a large corpus of around 205 thousand articles, which requires the construction of a term-document matrix.
I have looked around and it seems that sklearn offers an efficient way to construct it. However, when I applied the proposed code to a small list of documents (as a test), I found that words containing hyphens are split, with the hyphens acting as delimiters. This is not desirable: I am working with documents in Portuguese, where hyphens are very common due to the large number of compound nouns. I would like to generate a term-document matrix whose columns are all the tokens of my corpus, using only whitespace as the delimiter between tokens, so that a word containing a hyphen is kept as a single token.
Here is the code:
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

index = ['doc 1', 'doc 2', 'doc 3', 'doc 4']
docs = ['como você está', 'guarda-chuva!', 'covid-19 piorou', 'teto-de-gastos do tesouro']
df = pd.DataFrame(list(zip(index, docs)))
df.columns = ['index', 'docs']

vect = CountVectorizer()
vects = vect.fit_transform(df.docs)
td = pd.DataFrame(vects.todense()).iloc[:len(df)]
td.columns = vect.get_feature_names_out()  # get_feature_names() on sklearn < 1.0
term_document_matrix = td.T
term_document_matrix.columns = ['Doc '+str(i) for i in range(1, len(df)+1)]
term_document_matrix['total_count'] = term_document_matrix.sum(axis=1)
When printing the matrix, I find that "teto-de-gastos" was split into "teto", "de", "gastos", which I do not want. Any suggestions on how to fix this hyphen issue?