Questions tagged [text-normalization]
26 questions
51
votes
2 answers
Programmatic Accent Reduction in JavaScript (aka text normalization or unaccenting)
I need to compare 2 strings as equal such as these:
Lubeck == Lübeck
In JavaScript.
Why? Well, I have an auto-completion field that's going out to a Java service using Lucene, where place names are stored naturally (as Lübeck), but also indexed as…

dlamblin
- 43,965
- 20
- 101
- 140
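For readers skimming this listing, the comparison asked about above (Lubeck == Lübeck) is commonly approached by decomposing the text and dropping the combining marks. The following is a minimal sketch in Python rather than JavaScript, not an answer from the thread; the helper names `strip_accents` and `accent_insensitive_equal` are hypothetical.

```python
import unicodedata


def strip_accents(text: str) -> str:
    """Decompose to NFD, then drop combining marks (assumed approach)."""
    decomposed = unicodedata.normalize("NFD", text)
    # unicodedata.combining() returns a non-zero class for combining marks
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def accent_insensitive_equal(a: str, b: str) -> bool:
    """Compare two strings while ignoring diacritics."""
    return strip_accents(a) == strip_accents(b)


if __name__ == "__main__":
    print(accent_insensitive_equal("Lubeck", "L\u00fcbeck"))
```

This only handles characters that have a canonical decomposition into a base letter plus combining marks; letters such as ø or ß would need an explicit mapping.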
8
votes
2 answers
How do I properly implement Unicode passwords?
Adding support for Unicode passwords is an important feature that should not be ignored by developers.
Still, adding support for Unicode in passwords is a tricky job because the same text can be encoded in different ways in Unicode and you don't…

sorin
- 161,544
- 178
- 535
- 806
6
votes
1 answer
Which form of unicode normalization is appropriate for text mining?
I've been reading a lot on the subject of Unicode, but I remain very confused about normalization and its different forms. In short, I am working on a project that involves extracting text from PDF files and performing some semantic text…

Louis Thibault
- 20,240
- 25
- 83
- 152
5
votes
1 answer
tackle different types of utf hyphens in ruby 1.8.7
We have different types of hyphens/dashes (in some text) stored in the DB. Before comparing them with user input, I have to normalize any type of dash/hyphen to a simple hyphen-minus (ASCII 45).
The possible dashes we have to convert are:…

intellidiot
- 11,108
- 4
- 34
- 41
5
votes
0 answers
Unicode normalization in GWT
Possible Duplicate:
Replace éàçè… with equivalent “eace” In GWT
Is there a library I can use to perform Unicode normalization in GWT? (To guarantee, depending on context, that the Latin O is treated as equal to the Cyrillic O, for instance.)

M. F.
- 73
- 4
3
votes
0 answers
Why does NFKC normalization lose superscript & subscript info?
I notice that when normalizing a Unicode string to NFKC form, superscript characters like ¹ (U+00B9), ² (U+00B2), ³ (U+00B3), etc. are converted to the corresponding ASCII digits (e.g. 1, 2, 3).
Does anyone know the rationale for this behavior? …

codesniffer
- 1,033
- 9
- 22
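The behavior described in the question above can be reproduced with Python's standard `unicodedata` module. This is a small illustrative sketch, not taken from the thread; the function name `compare_forms` is hypothetical.

```python
import unicodedata


def compare_forms(text: str) -> dict:
    """Return the NFC and NFKC forms of text side by side."""
    return {
        "NFC": unicodedata.normalize("NFC", text),    # canonical composition only
        "NFKC": unicodedata.normalize("NFKC", text),  # also applies compatibility mappings
    }


if __name__ == "__main__":
    # U+00B9, U+00B2, U+00B3 are the superscript digits one, two, three
    forms = compare_forms("\u00b9\u00b2\u00b3")
    print(forms["NFC"], forms["NFKC"])
```

NFC leaves the superscripts untouched, while NFKC replaces them with plain digits because these characters carry compatibility decompositions.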
3
votes
2 answers
How do I capture items from StringScanner?
I am using Ruby's StringScanner to normalize some English text.
def normalize text
s = ''
ss = StringScanner.new text
while ! ss.eos? do
s += ' ' if ss.scan(/\s+/) # multiple whitespace => single space
s += 'mice' if…

zhon
- 1,610
- 1
- 22
- 31
2
votes
0 answers
QWebView::findText doesn't work with Unicode’s Combining Diacritical Marks
I’m using QtWebKit (QWebView) to display text, and I want to implement search functionality in it via QWebView::findText.
The problem is that the text to be displayed contains Unicode combining diacritical marks, and both…

Linas Valiukas
- 1,316
- 1
- 13
- 23
2
votes
2 answers
Normalizing text file from abnormal newlines?
I have several text files with lots of newlines between the texts that I would like to normalize, but there is no pattern to the number of newlines between the texts. For example:
Text
Some text
More text
More
more
So what I wanted to…

Guapo
- 3,446
- 9
- 36
- 63
1
vote
1 answer
Expanding abbreviations using regex
I have a dictionary of abbreviations that I would like to use to go through a text and expand all abbreviations.
The defined dictionary is as follows:
contractions_dict = {
"kl\.": "klokken",
}
The text I…

Kiri
- 55
- 4
1
vote
1 answer
How to normalize text with regex?
How can I normalize text with regex and some if statements?
If we have string like this
One T933 two, three35.4. four 9,3 8.5 five M2x13 M4.3x2.1
And I want to normalize it like this
one t 933 two three 35.4 four 9,3 8.5 five m2x13 m4.3x2.1
Remove all…

Dmiich
- 325
- 2
- 16
1
vote
1 answer
What is the best way to search for an exact match using Postgres full-text search?
I have a Postgres database with around 1.5 million records. In my Ruby on Rails app, I need to search the statement_text field (which can contain anywhere from 1 to hundreds of words).
My problem: I know I can use the pgSearch gem to create scopes…

jayp
- 192
- 2
- 13
1
vote
1 answer
String normalization in Neo4j Cypher - how to?
Problem background: Chinese words consist of characters which are words themselves. I have 3 nodes representing Chinese words, each with the attribute word holding one of these string values:
node (1): "a"
node (2): "b"
node (3): "ab"
Question 1: Using…

Mika
- 11
- 1
0
votes
0 answers
Text Normalization for abbreviations, acronyms and any other shortcuts written in English
I want to predict some typo shortcuts.
For example:
8 in. micrometer has to be predicted as 8 inch micrometer
9 lbs Bag - 9 pounds bag
10" scale - 10 inch scale
10 no. - 10 numbers
77 mm length - 77 millimeter length and so on. I already created a…

SRI PRIYA
- 21
- 1
0
votes
2 answers
Normalize vector such that sum equals 1, while satisfying a lower bound
Given a lower bound of 0.025, I want a vector of weights that sum to 1 and satisfy this lower bound. I start from a vector of arbitrary length whose values range from 0.025 (the lower bound) to 1.
For example,
[0.025, 0.8,…

Jaques duBalzac
- 3
- 2