Questions tagged [text-segmentation]

Text segmentation is the process of dividing written text into meaningful units, such as words, sentences, or topics.

197 questions
3
votes
0 answers

Is there a way in Python to do paragraph segmentation based on topic of short texts that were created by speech-to-text?

I have multiple transcripts of short videos that were created by a speech-to-text algorithm. I want to segment these transcripts into paragraphs based on their content. I tried to use TextTiling in Python, but on every attempt I got the "No…
user3017075
  • 351
  • 3
  • 16
3
votes
1 answer

Segment text from badly lit images using Python

I have three types of images and want to segment the text from them, so that I get a clean binarized image like the first image below. The three types of images are below. I've tried various techniques, but each one fails in some cases. I tried first to…
3
votes
5 answers

Split text file at sentence boundary

I have to process a text file (an e-book). I'd like to process it so that there is one sentence per line (a "newline-separated file", yes?). How would I do this using sed, the UNIX utility? Does it have a symbol for "sentence boundary" like a…
Isaac G.
  • 93
  • 1
  • 6
3
votes
2 answers

How to split concatenated strings of this kind: "howdoIsplitthis?"

Suppose I have a string such as this: "IgotthistextfromapdfIscraped.HowdoIsplitthis?" And I want to produce: "I got this text from a pdf I scraped. How do I split this?" How can I do it?
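One way to approach this kind of splitting is dictionary-based dynamic programming. The sketch below is only an illustration under stated assumptions: `VOCAB` is a hypothetical toy word list, `segment` is a made-up helper name, and the input is assumed to contain letters only (punctuation would have to be split off beforehand). It prefers the split with the fewest words.

```python
# Hypothetical toy vocabulary; a real system would load a large word list
# or use word frequencies instead.
VOCAB = {"i", "got", "this", "text", "from", "a", "pdf",
         "scraped", "how", "do", "split"}


def segment(text, vocab=VOCAB):
    """Split `text` into vocabulary words, or return None if no split exists.

    best[i] holds the fewest-words split of text[:i] found so far.
    """
    n = len(text)
    max_len = max(map(len, vocab))
    best = [None] * (n + 1)
    best[0] = []
    for end in range(1, n + 1):
        # Only look back as far as the longest vocabulary word.
        for start in range(max(0, end - max_len), end):
            piece = text[start:end]
            if best[start] is None or piece.lower() not in vocab:
                continue
            candidate = best[start] + [piece]
            if best[end] is None or len(candidate) < len(best[end]):
                best[end] = candidate
    return best[n]


print(segment("howdoIsplitthis"))
```

With a realistic vocabulary many splits become possible, so scoring candidates by word frequency rather than by word count is usually the more robust choice.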
3
votes
3 answers

extract a sentence using python

I would like to extract the exact sentence if a particular word is present in that sentence. Could anyone let me know how to do it with Python? I used concordance(), but it only prints lines where the word matches.
sudh
  • 1,085
  • 4
  • 12
  • 13
3
votes
3 answers

regular expression that extracts words from a string

I want to extract all words from a Java String. A word can be written in any European language and does not contain spaces, only alphabetic characters. It can contain hyphens, though.
EugeneP
  • 11,783
  • 32
  • 96
  • 142
3
votes
1 answer

Structure of a trie for a word with subwords

What will be the structure of a trie for words which have subwords, like "icecream" (contains 'i', 'ice', 'cream', 'icecream') or "businessman" (contains 'bus', 'is', 'business', 'man', 'businessman')? I know how it will be for those which do not have…
divyum
  • 1,286
  • 13
  • 20
3
votes
1 answer

python.NLTK (WindowDiff and PK) vs python.Segeval (WindowDiff and PK)

The Python NLTK implementations of Beeferman's Pk and WindowDiff give completely different results from the Python segeval implementations of both, using the same parameters. hyp: 0100100000 ref: 0101000000 k=2 PK's SegEval:0.2222222 PK's…
Matheus Araujo
  • 5,551
  • 2
  • 22
  • 23
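For readers unfamiliar with the metric, here is a minimal sketch of one window-based formulation of Pk over boundary strings like the ones quoted in the question. It is an illustration only: the function name `pk` and its parameters are assumptions, and libraries differ in how they place windows and count boundaries, so this is not claimed to reproduce the behavior of either NLTK or segeval.

```python
def pk(ref, hyp, k, boundary="1"):
    """Window-based Pk sketch: fraction of length-k windows on which the
    reference and hypothesis disagree about containing any boundary.

    `ref` and `hyp` are equal-length strings where `boundary` marks a
    segment boundary and any other character marks none.
    """
    if len(ref) != len(hyp):
        raise ValueError("ref and hyp must have the same length")
    windows = len(ref) - k + 1
    if windows <= 0:
        raise ValueError("k must not exceed the sequence length")
    errors = 0
    for i in range(windows):
        ref_has_boundary = boundary in ref[i:i + k]
        hyp_has_boundary = boundary in hyp[i:i + k]
        if ref_has_boundary != hyp_has_boundary:
            errors += 1
    return errors / windows


print(pk("0101000000", "0100100000", k=2))
```

Identical segmentations score 0.0, and the score grows as more windows disagree, so lower is better.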
3
votes
1 answer

How do I change a paragraph into an array in PHP including the spaces and punctuation

I have a string like this: Hello? My name is Ben! @ My age is 32. I want to change it into an array with all words, spaces and punctuation as separate entities in the array. For example, if I did var_dump($sentence), the array should look like…
Ben Paton
  • 1,432
  • 9
  • 35
  • 59
3
votes
1 answer

Sentence matching with regex

I have a text that is split across many lines, in no particular format. So I decided to call line.strip('\n') on each line. Then I want to split the text into sentences using the sentence end marker . considering: period . that is followed by a \s…
Iykeln
  • 149
  • 2
  • 2
  • 8
3
votes
1 answer

Sentence segmentation and alignment in noisy text corpus

I have a parallel corpus which contains about 100,000 aligned paragraphs in Arabic and Persian. My corpus is noisy: its paragraphs are incomplete translations of each other (i.e., parts of the Arabic paragraphs are not translated to…
htaghizadeh
  • 571
  • 6
  • 20
3
votes
1 answer

Sentence boundary detection in HTML

I need to detect sentence boundaries in HTML. There is lots of sentence boundary detection software out there (java.text.BreakIterator is the one I'm using), but all of it assumes plain text. HTML is richer than that, and includes some clues as to…
ccleve
  • 15,239
  • 27
  • 91
  • 157
2
votes
2 answers

Get a whole unicode sentence

I'm trying to parse a sentence like Base: Lote Numero 1, Marcelo T de Alvear 500. Demanda: otras palabras. I want to: first, split the text by periods, then, use whatever is before the colon as a label for the sentence after the colon. Right now I…
tutuca
  • 3,444
  • 6
  • 32
  • 54
2
votes
1 answer

Custom segmentation and override segmentation rules in spacy

I want to split a large corpus (.txt) into sentences with a custom rule, i.e. {SENT}, using spaCy 3.1. My main issue is that I want to "disable" the segmentation from the pretrained spaCy models, such as en_core_web_lg, but keep all the other…
Artemis
  • 145
  • 7
2
votes
5 answers

Sentence segmentation using Regex

I have a few text (SMS) messages and I want to segment them using a period ('.') as the delimiter. I am unable to handle the following types of messages. How can I segment these messages using regex in Python? Before segmentation: 'hyper count 16.8mmol/l.plz…
Maggie
  • 5,923
  • 8
  • 41
  • 56