Questions tagged [text-segmentation]

Text segmentation is the process of dividing written text into meaningful units, such as words, sentences, or topics.

197 questions
2
votes
3 answers

Remove timestamps in parentheses from text in Python

I'd like to remove all the timestamps in parentheses from the sample text below. Input: Agent: Can I help you? ( 3s ) Customer: Thank you( 40s ) Customer: I have a question about X. ( 8m 1s ) Agent: I can help here. Log in this website…
LY1
  • 35
  • 5
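
A minimal sketch of one way to approach the question above, assuming every timestamp is a parenthesized run of number-plus-unit pairs such as `( 3s )` or `( 8m 1s )`. The pattern, the `h`/`m`/`s` unit set, and the name `strip_timestamps` are assumptions made for illustration, not part of the original question.

```python
import re

# Assumed timestamp shape: optional leading whitespace, "(", one or more
# "<digits><unit>" pairs with optional spaces, then ")". Units are h, m or s.
TIMESTAMP = re.compile(r"\s*\(\s*(?:\d+\s*[hms]\s*)+\)")


def strip_timestamps(text: str) -> str:
    """Remove parenthesized timestamps and the whitespace just before them."""
    return TIMESTAMP.sub("", text)


sample = (
    "Agent: Can I help you? ( 3s ) Customer: Thank you( 40s ) "
    "Customer: I have a question about X. ( 8m 1s ) Agent: I can help here."
)
cleaned = strip_timestamps(sample)
```

Parenthesized text that does not look like a duration, for example `(see above)`, is left untouched by this pattern.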
2
votes
0 answers

How to remove noise in the background of an old document image

How can I remove the background of an image that contains a lot of noise, lines, etc.? [sample image][1] import cv2 from PIL import Image image = cv2.imread("1.jpg") #input image image = cv2.fastNlMeansDenoisingColored(image,None,10,10,7,21) gray =…
Milan KD
  • 21
  • 1
2
votes
2 answers

Perform line segmentation (cropping) serially with OpenCV

I am performing full-page offline handwriting recognition with deep learning. The main idea is to build a model that takes an image of one line of text and gives its corresponding text. For this, the main task is to do line segmentation of every line in a…
susan097
  • 3,500
  • 1
  • 23
  • 30
2
votes
2 answers

R: consider punctuation to do word segmentation

I use NGramTokenizer() to do 1- to 3-gram segmentation, but it seems to ignore punctuation and removes it, so the resulting segments aren't ideal for me. (like the result: oxidant amino, oxidant amino acid, pellet oxidant and so…
Eva
  • 483
  • 1
  • 4
  • 13
2
votes
1 answer

Python - How to extract sentences that contain a citation mark?

text = "Trondheim is a small city with a university and 140000 inhabitants. Its central bus systems has 42 bus lines, serving 590 stations, with 1900 (departures per) day in average. T h a t gives approximately 60000 scheduled bus station passings…
gameon67
  • 3,981
  • 5
  • 35
  • 61
2
votes
1 answer

NLP: Within Sentence Segmentation / Boundary Detection

I would like to know whether there are libraries that break a sentence into smaller pieces based on content. E.g. input: sentence: "During our stay at the hotel we had a clean room, very nice bathroom, breathtaking view out the window and a delicious …
Uther Pendragon
  • 302
  • 2
  • 14
2
votes
1 answer

Using Tesseract OCR for Character Segmentation Only

I want to do text segmentation on a printed document. I have already segmented the document down to the character level, but the segmentation fails when I encounter touching characters. I want to use Tesseract OCR only to segment the words. I know Tesseract can do…
2
votes
1 answer

How to count occurrences of substrings in a string from a text file - Python

I want to count the number of lines in a .txt file where a string contains two substrings. I tried the following: with open(filename, 'r') as file: for line in file: wordsList = line.split() if any("leads" and "show" in s for s…
ignasibm
  • 25
  • 6
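
A hedged sketch related to the question above. In the quoted attempt, `"leads" and "show" in s` evaluates to just `"show" in s`, because the non-empty string `"leads"` is always truthy. The version below tests each substring separately with `all(...)`. It assumes matching at the level of the whole line rather than a single word, and the function name `count_lines_with_all` is hypothetical.

```python
from typing import Iterable, Sequence


def count_lines_with_all(lines: Iterable[str], substrings: Sequence[str]) -> int:
    """Count the lines that contain every one of the given substrings."""
    return sum(1 for line in lines if all(sub in line for sub in substrings))


# In-memory stand-in for the lines of a text file.
example_lines = [
    "this show leads the ratings",
    "a show",
    "leads only",
    "leads to a show",
]
matches = count_lines_with_all(example_lines, ["leads", "show"])
```

Because an open file object iterates over its lines, it can be passed directly as `lines`.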
2
votes
0 answers

Chinese Segmentation : ICTCLAS Training Corpora

I am using the ICTCLAS segmentation tool for Chinese. We can read in "Automatic Recognition of Chinese Unknown Words Based on Roles Tagging" (Zhang, Liu, 2002) that it has been trained on the Peking University Corpus (PKU) : "The training corpus…
Starckman
  • 145
  • 6
2
votes
2 answers

Getting the least amount of sub words

Solution by Dávid Horváth adapted to return the biggest smallest word: import java.util.*; public class SubWordsFinder { private Set words; public SubWordsFinder(Set words) { this.words = words; } …
BullyWiiPlaza
  • 17,329
  • 10
  • 113
  • 185
2
votes
3 answers

Parsing data from a file

I have been provided with a file containing data on recorded sightings of species, which is laid out in the format: "Species", "\t", "Latitude", "\t", "Longitude". I need to define a function that will load the data from the file into a list, whilst…
NKing
  • 23
  • 4
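
A minimal sketch of loading the tab-separated layout described above into a list, using the standard `csv` module. It assumes each row is exactly species, latitude, longitude, that there is no header row, and that coordinates parse as floats. The name `load_sightings` and the tuple layout are illustrative choices; the rest of the original requirements are truncated in the excerpt and are not addressed here.

```python
import csv
from typing import Iterable, List, Tuple


def load_sightings(lines: Iterable[str]) -> List[Tuple[str, float, float]]:
    """Parse tab-separated 'species, latitude, longitude' rows into tuples."""
    sightings = []
    for row in csv.reader(lines, delimiter="\t"):
        if len(row) != 3:
            continue  # skip blank or malformed rows
        species, latitude, longitude = row
        sightings.append((species.strip(), float(latitude), float(longitude)))
    return sightings


# In-memory stand-in for the lines of a data file.
rows = ["Fox\t51.5\t-0.12\n", "\n", "Badger\t53.4\t-2.9\n"]
parsed = load_sightings(rows)
```

An open file object can be passed in place of the list, since `csv.reader` accepts any iterable of strings.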
2
votes
2 answers

Non-reducible grapheme clusters in Unicode

I'm of the opinion that a "user-perceived character" (henceforth UPC) iterator would be very useful in a Unicode library. By UPC I mean the sense discussed in Unicode Standard Annex #29, which is what a user perceives as a character, but might be…
Spacemoose
  • 3,856
  • 1
  • 27
  • 48
2
votes
1 answer

How can I fix this memory issue in my maximum matching algorithm with RealmSwift?

I wrote my own maximum matching function in Swift to divide Chinese sentences into words. It works fine, except that with abnormally long sentences the memory usage goes above 1 GB. I need help figuring out how to modify my code so that there isn't…
webmagnets
  • 2,266
  • 3
  • 33
  • 60
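
For readers unfamiliar with the technique named in the question above, here is a small sketch of forward maximum matching. It is written in Python rather than Swift, uses an in-memory set instead of a Realm database, and is not the asker's code; the function name and the toy dictionary are assumptions. It illustrates the algorithm only and does not address the memory issue.

```python
from typing import List, Set


def forward_max_match(sentence: str, dictionary: Set[str]) -> List[str]:
    """Greedily take the longest dictionary word at each position.

    Characters that start no dictionary word are emitted on their own.
    """
    max_len = max((len(word) for word in dictionary), default=1)
    words = []
    i = 0
    while i < len(sentence):
        # Try the longest candidate first, shrinking down to one character.
        for size in range(min(max_len, len(sentence) - i), 0, -1):
            candidate = sentence[i:i + size]
            if size == 1 or candidate in dictionary:
                words.append(candidate)
                i += size
                break
    return words


toy_dictionary = {"研究", "研究生", "生命", "命", "的", "起源"}
segments = forward_max_match("研究生命的起源", toy_dictionary)
```

The example also shows the known weakness of greedy matching: it picks 研究生 first, so the sentence is not split as 研究 / 生命 / 的 / 起源.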
2
votes
2 answers

Extract a Sentence Containing a Word Using Python... As well as the sentences around it?

There are a bunch of questions that get at extracting a particular sentence that contains a word (like extract a sentence using python and Python extract sentence containing word), and I have enough beginner experience with NLTK and SciPy to be able…
alxlvt
  • 675
  • 2
  • 10
  • 18
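
A rough sketch of the idea in the question above: find each sentence containing a word and return it together with its neighbours. To stay self-contained it uses a naive regex splitter instead of NLTK, so abbreviations such as "Dr." will be split incorrectly. The function name and the `window` parameter are hypothetical.

```python
import re
from typing import List


def sentences_around(text: str, word: str, window: int = 1) -> List[List[str]]:
    """Return each sentence containing `word` plus `window` sentences either side."""
    # Naive split: break after '.', '!' or '?' when followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    pattern = re.compile(rf"\b{re.escape(word)}\b", re.IGNORECASE)
    results = []
    for i, sentence in enumerate(sentences):
        if pattern.search(sentence):
            start = max(0, i - window)
            results.append(sentences[start:i + window + 1])
    return results


context = sentences_around(
    "It rained. The cat sat down. Then it left. Nobody cared.", "cat"
)
```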
2
votes
1 answer

Segment a paragraph into sentences

I'm trying to segment a paragraph into sentences. I selected '.', '?' and '!' as the segmentation symbols. I tried: format = r'((! )|(. )|(? ))' delimiter = re.compile(format) s = delimiter.split(line) but it gives me sre_constants.error: unexpected…
ChuNan
  • 1,131
  • 2
  • 11
  • 27
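
A hedged sketch related to the question above. In a regular expression, `.` and `?` are metacharacters, so they need escaping or a character class when meant literally. The version below puts the three symbols in a character class inside a lookbehind, so each delimiter stays attached to its sentence. This is one possible approach, not the accepted answer, and `split_sentences` is a hypothetical name.

```python
import re
from typing import List

# Split on whitespace that follows '.', '!' or '?'. Inside [...] these
# characters are literal, so no escaping is needed.
SENTENCE_BOUNDARY = re.compile(r"(?<=[.!?])\s+")


def split_sentences(paragraph: str) -> List[str]:
    """Split a paragraph into sentences at '.', '!' or '?' followed by whitespace."""
    return [s for s in SENTENCE_BOUNDARY.split(paragraph.strip()) if s]


parts = split_sentences("Hello there. How are you? Fine!")
```

This simple rule also splits after abbreviations and initials, so it suits only text where those three symbols reliably end a sentence.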