
Normally, a paragraph contains several sub-paragraphs, and each sub-paragraph has its own meaning.
In NLP, how can I split a paragraph into meaningful sub-paragraphs? In other words, I would like to detect the boundaries between sub-paragraphs.

Sherry
  • Why doesn't splitting just into paragraphs work for you? Why doesn't splitting into single sentences work for you? By which criteria do you decide whether a particular split of your paragraph is good or bad? – David Dale Jun 03 '20 at 08:21

1 Answer


The problem you describe is interesting but poorly defined: "meaning" itself is hard to define, and we don't actually know how to tell a good partition of a paragraph from a bad one.

However, we can simplify the problem as follows: we want to group adjacent sentences together if their topics are similar, i.e. if they are about the same or similar objects, or otherwise contain similar words or phrases. The algorithm can then be described formally:

  1. Split the paragraph into sentences.
  2. Represent each sentence as some formal object (e.g. a bag of words, or a bag of word embeddings from word2vec, fastText, ELMo or BERT, or a sentence embedding from some neural network, such as the Universal Sentence Encoder (USE)).
  3. Compute the distances between each pair of sentences (e.g. cosine distance between sentence embeddings or word counts, or Word Mover's Distance between word embeddings).
  4. Run an agglomerative clustering algorithm on this distance matrix, with one additional restriction: only adjacent clusters can be merged together.
  5. Try the clustering on different paragraphs with different stopping criteria (usually thresholds) and choose the threshold which produces the most meaningful partitions.

If this algorithm looks like what you want, I could provide its baseline implementation in Python.

Update: Please take a look at this gist with my basic implementation: spaCy sentencizer + cosine similarity of spaCy sentence vectors + naive clustering based only on neighbouring sentences. https://gist.github.com/avidale/e4450da902d36bb14c595987943120dc
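
For a self-contained illustration, the five steps can also be sketched with only the Python standard library. This is not the code from the gist: it assumes a naive regex sentence splitter and bag-of-words vectors instead of spaCy embeddings, and the function names (`split_sentences`, `bag_of_words`, `cosine_similarity`, `segment`) are hypothetical. It also recomputes similarity between merged neighbours on each pass rather than building a full distance matrix up front.

```python
import math
import re
from collections import Counter


def split_sentences(paragraph):
    # Step 1 (assumption): naive split after ., ! or ? followed by whitespace.
    parts = re.split(r"(?<=[.!?])\s+", paragraph.strip())
    return [p for p in parts if p]


def bag_of_words(text):
    # Step 2 (assumption): lowercased word counts stand in for any
    # sentence representation, e.g. an embedding.
    return Counter(re.findall(r"[a-z0-9']+", text.lower()))


def cosine_similarity(a, b):
    # Step 3: cosine similarity between two word-count vectors.
    dot = sum(count * b[word] for word, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def segment(paragraph, threshold=0.2):
    sentences = split_sentences(paragraph)
    # Each cluster holds its sentences and their summed word counts.
    clusters = [([s], bag_of_words(s)) for s in sentences]
    while len(clusters) > 1:
        # Step 4: only neighbouring clusters are candidates for merging.
        sims = [
            cosine_similarity(clusters[i][1], clusters[i + 1][1])
            for i in range(len(clusters) - 1)
        ]
        best = max(range(len(sims)), key=sims.__getitem__)
        if sims[best] < threshold:
            break  # Step 5: stop when the closest neighbours are too dissimilar.
        left, right = clusters[best], clusters[best + 1]
        clusters[best:best + 2] = [(left[0] + right[0], left[1] + right[1])]
    return [" ".join(group) for group, _ in clusters]


if __name__ == "__main__":
    text = (
        "The cat sat on the mat. The cat chased the mouse. "
        "Stocks fell sharply today. Investors sold stocks quickly."
    )
    for part in segment(text, threshold=0.2):
        print(part)
```

A lower `threshold` merges more aggressively and a higher one leaves more sentences on their own, so it needs tuning on real paragraphs as described in step 5. Swapping `bag_of_words` and `cosine_similarity` for embedding-based versions should leave the merging loop unchanged.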

David Dale