Questions tagged [text-segmentation]

Text segmentation is the process of dividing written text into meaningful units, such as words, sentences, or topics.

197 questions
16
votes
6 answers

PHP sentence boundaries detection

I would like to divide a text into sentences in PHP. I'm currently using a regex, which gives ~95% accuracy, and I would like to improve on that by using a better approach. I've seen NLP tools that do that in Perl, Java, and C but didn't see anything that…
Noam
  • 3,341
  • 4
  • 35
  • 64
15
votes
6 answers

Extract the last word in sentence/string?

I have an array of strings of different lengths and contents. Now I'm looking for an easy way to extract the last word from each string, without knowing how long that word is or how long the string is. Something like: array.each{|string| puts…
BSG
  • 1,382
  • 4
  • 14
  • 25
15
votes
3 answers

A Viable Solution for Word Splitting Khmer?

I am working on a solution to split long lines of Khmer (the Cambodian language) into individual words (in UTF-8). Khmer does not use spaces between words. There are a few solutions out there, but they are far from adequate (here and here), and…
15
votes
1 answer

opencv - cropping handwritten lines (line segmentation)

I'm trying to build a handwriting recognition system using Python and OpenCV. The recognition of the characters is not the problem, but the segmentation is. I have successfully: segmented a word into single characters; segmented a single sentence into…
15
votes
3 answers

How to use NLP to separate an unstructured text content into distinct paragraphs?

The following unstructured text has three distinct themes -- Stallone, Philadelphia and the American Revolution. But which algorithm or technique would you use to separate this content into distinct paragraphs? Classifiers won't work in this…
user193116
  • 3,498
  • 6
  • 39
  • 58
15
votes
8 answers

Explode a paragraph into sentences in PHP

I have been using explode(".",$mystring) to split a paragraph into sentences. However, this doesn't cover sentences that have been concluded with different punctuation such as ! ? : ; Is there a way of using an array as a delimiter instead of a…
Chris Headleand
  • 6,003
  • 16
  • 51
  • 69
14
votes
11 answers

Split a sentence into separate words

I need to split a Chinese sentence into separate words. The problem with Chinese is that there are no spaces. For example, the sentence may look like: 主楼怎么走 (with spaces it would be: 主楼 怎么 走). At the moment I can think of one solution. I have a…
Peterim
  • 1,029
  • 4
  • 16
  • 25
13
votes
2 answers

How to break up a paragraph by sentences in Python

I need to parse sentences from a paragraph in Python. Is there an existing package to do this, or should I be trying to use regex here?
David542
  • 104,438
  • 178
  • 489
  • 842
13
votes
5 answers

Regex to match first word in sentence

I am looking for a regex that matches first word in a sentence excluding punctuation and white space. For example: "This" in "This is a sentence." and "First" in "First, I would like to say \"Hello!\"" This doesn't…
princess of persia
  • 2,222
  • 4
  • 26
  • 43
12
votes
2 answers

Splitting paragraphs into sentences with regexp and PHP

I'm a regexp noob trying to split paragraphs into sentences. In my language we use quite a few abbreviations (like: bl.a.) in the middle of sentences, so I have come to the conclusion that what I need to do is to look for punctuations, that…
acrmuui
  • 2,040
  • 1
  • 22
  • 33
11
votes
6 answers

How to separate words in a "sentence" with spaces?

Background: Looking to automate creating Domains in JasperServer. Domains are a "view" of data for creating ad hoc reports. The names of the columns must be presented to the user in a human readable fashion. Problem: There are over 2,000 possible…
Dave Jarvis
  • 30,436
  • 41
  • 178
  • 315
10
votes
7 answers

How to capitalize first letter of first word in a sentence?

I am trying to write a function to clean up user input. I am not trying to make it perfect. I would rather have a few names and acronyms in lowercase than a full paragraph in uppercase. I think the function should use regular expressions but I'm…
Enkay
  • 1,898
  • 6
  • 24
  • 35
10
votes
5 answers

Parsing HTML into sentences - how to handle tables/lists/headings/etc?

How do you go about parsing an HTML page with free text, lists, tables, headings, etc., into sentences? Take this wikipedia page for example. There is/are: free text: http://en.wikipedia.org/wiki/Neurotransmitter#Discovery lists:…
Lance
  • 75,200
  • 93
  • 289
  • 503
9
votes
1 answer

How to improve NLTK sentence segmentation?

Given the paragraph from Wikipedia: An ambitious campus expansion plan was proposed by Fr. Vernon F. Gallagher in 1952. Assumption Hall, the first student dormitory, was opened in 1954, and Rockwell Hall was dedicated in November 1958, housing the…
Abdulrahman Bres
  • 2,603
  • 1
  • 20
  • 39
8
votes
5 answers

Sentence detection using NLP

I am trying to parse out sentences from a huge amount of text. Using Java, I started off with NLP tools like OpenNLP and Stanford's Parser. But here is where I get stuck. Though both these parsers are pretty great, they fail when it comes to a non…