
I have a dataframe text column (in French) and I want to split each text into sentences by their meaning (i.e. break the text down into units of sense). Any idea how to do this with Python libraries and NLP techniques?

P.S. I tried NLTK's sent_tokenize and word_tokenize, but the split does not respect the meaning.

For example, for a text discussing sports, then the economy, then school systems, I want to break the text down into sentences like this:

  • sports-related text
  • economy-related text
  • school-system-related text

Or at least extract tags out of the whole text, so for this example I would get the following tags: sports/economy/school.

Achieving either of these two cases would be great.


1 Answer


Unfortunately, as far as I know, there is no straightforward answer to that.

However, there are some workarounds. What I suggest is to apply a sentence-transformer model to the list of phrases that you have, something like this:

from sentence_transformers import SentenceTransformer

model = SentenceTransformer('paraphrase-MiniLM-L6-v2')

#Sentences we want to encode. Example:
sentences = [
    'This framework generates embeddings for each input sentence',
    'This is another phrase',
]

#Sentences are encoded by calling model.encode()
embeddings = model.encode(sentences)

The result is a list of dense vectors, one for each input sentence.

There is a list of different transformer models at https://huggingface.co/sentence-transformers; they differ in the languages they cover, their number of parameters and their performance.
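
Since your text is in French, a multilingual checkpoint is likely a better fit than an English-only one. A minimal sketch, using paraphrase-multilingual-MiniLM-L12-v2 as an example model name (check the hub for alternatives); the sample sentences are illustrative:

from sentence_transformers import SentenceTransformer

# Multilingual model covering French among many languages
# (example checkpoint; other multilingual models are listed on the hub)
multilingual_model = SentenceTransformer('paraphrase-multilingual-MiniLM-L12-v2')

french_sentences = [
    "Le championnat reprend ce week-end.",       # sports
    "L'inflation pèse sur le pouvoir d'achat.",  # economy
]
french_embeddings = multilingual_model.encode(french_sentences)
print(french_embeddings.shape)   # (number of sentences, embedding dimension)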

After that, I would set a list of words that may be used as tags. These may be chosen arbitrarily, or taken from the corpus itself based on the top N most frequent words (some pre-processing of the text might be necessary in this case).

With the NLTK library it would look like this:

import nltk

words = nltk.tokenize.word_tokenize(corpus)

stopwords = nltk.corpus.stopwords.words('english')   #remove stopwords (use 'french' for a French corpus)

#frequency distribution of the words, stopwords excluded
common_words = nltk.FreqDist(w.lower() for w in words if w.lower() not in stopwords)

#top 10 most common words, that you can use as tags
mostCommon = [word for word, count in common_words.most_common(10)]

Then, I would also convert those tags into vectors and run a for loop that identifies, for each phrase, the top N tags that are closest to it.
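
A minimal sketch of that loop, using Euclidean distance (one of the options discussed just below); the tags, phrases and variable names here are illustrative:

import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer('paraphrase-MiniLM-L6-v2')

tags = ['sports', 'economy', 'school']               # candidate tags
phrases = ['The team won the final yesterday',
           'Tuition fees will rise next year']

tag_embeddings = model.encode(tags)                  # one vector per tag
phrase_embeddings = model.encode(phrases)            # one vector per phrase

top_n = 2
for phrase, vec in zip(phrases, phrase_embeddings):
    # distance between this phrase and every tag
    distances = np.linalg.norm(tag_embeddings - vec, axis=1)
    closest_tags = [tags[i] for i in np.argsort(distances)[:top_n]]
    print(phrase, '->', closest_tags)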

As for the distance itself, there are different ways to compare two dense vectors, such as Euclidean distance, cosine similarity, or other norms; it's a matter of choice.

For example, with Euclidean distance:

import numpy as np

# Euclidean distance between the two sentence embeddings, using linalg.norm()
dist = np.linalg.norm(embeddings[0] - embeddings[1])
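
If you prefer cosine similarity, recent versions of sentence-transformers ship a helper for it; a small sketch reusing the embeddings from the first snippet:

from sentence_transformers import util

# Cosine similarity between the two sentence embeddings
# (returns a 1x1 similarity matrix; higher means more similar)
cos_score = util.cos_sim(embeddings[0], embeddings[1])
print(float(cos_score))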