I am currently working with a large corpus of around 205 thousand articles, which requires the construction of a term-document matrix.
I have looked around and it seems that sklearn offers an efficient way to construct it. However, when I applied the proposed code to a small list of documents (as a test), I found that words containing hyphens are split, with the hyphens acting as delimiters. This is not desirable: I am working with documents in Portuguese, where hyphens are very common due to the large number of compound nouns. I would like to generate a term-document matrix whose columns are all the tokens of my corpus, using only whitespace as the delimiter between tokens, so that a word containing a hyphen is kept as a single token.
Here is the code:
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

index = ['doc 1', 'doc 2', 'doc 3', 'doc 4']
docs = ['como você está', 'guarda-chuva!', 'covid-19 piorou', 'teto-de-gastos do tesouro']
df = pd.DataFrame(list(zip(index, docs)))
df.columns = ['index', 'docs']

vect = CountVectorizer()
vects = vect.fit_transform(df.docs)
td = pd.DataFrame(vects.todense()).iloc[:len(df)]
td.columns = vect.get_feature_names_out()  # get_feature_names() on sklearn < 1.0
term_document_matrix = td.T
term_document_matrix.columns = ['Doc '+str(i) for i in range(1, len(df)+1)]
term_document_matrix['total_count'] = term_document_matrix.sum(axis=1)
When printing the matrix, I find that "teto-de-gastos" was split into "teto", "de", "gastos", which I do not want. Any suggestions on how to fix this hyphen issue?