Questions tagged [text-normalization]

26 questions
51
votes
2 answers

Programmatic Accent Reduction in JavaScript (aka text normalization or unaccenting)

I need to compare two strings as equal in JavaScript, such as these: Lubeck == Lübeck. Why? Well, I have an auto-completion field that's going out to a Java service using Lucene, where place names are stored naturally (as Lübeck), but also indexed as…
dlamblin
  • 43,965
  • 20
  • 101
  • 140
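
A minimal sketch of the unaccenting idea raised in this question, written in Python rather than JavaScript: decompose the string with NFD, then drop the combining marks. The function name `strip_accents` is hypothetical and not taken from the question. JavaScript offers the same decomposition step through `String.prototype.normalize('NFD')`.

```python
import unicodedata


def strip_accents(text: str) -> str:
    """Return text with combining marks removed (hypothetical helper)."""
    # NFD splits a precomposed letter such as "ü" into "u" + U+0308.
    decomposed = unicodedata.normalize("NFD", text)
    # unicodedata.combining() is non-zero for combining marks; drop those.
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def same_ignoring_accents(a: str, b: str) -> bool:
    """Compare two strings after unaccenting both sides."""
    return strip_accents(a) == strip_accents(b)
```

Letters that have no canonical decomposition, such as "ø" or "ß", pass through unchanged, so this approach likely needs an extra mapping table for them.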
8
votes
2 answers

How do I properly implement Unicode passwords?

Adding support for Unicode passwords is an important feature that should not be ignored by developers. Still, adding support for Unicode in passwords is a tricky job because the same text can be encoded in different ways in Unicode and you don't…
sorin
  • 161,544
  • 178
  • 535
  • 806
6
votes
1 answer

Which form of unicode normalization is appropriate for text mining?

I've been reading a lot on the subject of Unicode, but I remain very confused about normalization and its different forms. In short, I am working on a project that involves extracting text from PDF files and performing some semantic text…
5
votes
1 answer

Tackle different types of UTF hyphens in Ruby 1.8.7

We have different types of hyphens/dashes (in some text) stored in the DB. Before comparing them with some user input text, I have to normalize any type of dash/hyphen to a simple hyphen/minus (ASCII 45). The possible dashes we have to convert are:…
intellidiot
  • 11,108
  • 4
  • 34
  • 41
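
A rough sketch of dash normalization, in Python rather than Ruby 1.8.7. The question's own list of dashes is cut off above, so this sketch assumes instead that "dash" means any character in the Unicode general category Pd (dash punctuation). The name `normalize_dashes` is hypothetical.

```python
import unicodedata


def normalize_dashes(text: str) -> str:
    """Replace every Pd-category character with ASCII hyphen-minus (45)."""
    # Assumption: the Pd category stands in for the question's elided list.
    return "".join(
        "-" if unicodedata.category(ch) == "Pd" else ch
        for ch in text
    )
```

Note that U+2212 MINUS SIGN is a math symbol (category Sm), not Pd, so it would need to be handled separately if it appears in the data.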
5
votes
0 answers

Unicode normalization in GWT

Possible Duplicate: Replace éàçè… with equivalent “eace” In GWT Is there some library I can use to perform Unicode normalization in GWT? (to contextually guarantee that the Latin O is equal to the Cyrillic O, for instance)
3
votes
0 answers

Why does NFKC normalization lose superscript & subscript info?

I notice that when normalizing a Unicode string to NFKC form, superscript characters like ¹ (U+00B9), ² (U+00B2), ³ (U+00B3), etc. are converted to the corresponding ASCII digits (e.g. 1, 2, 3). Does anyone know the rationale for this behavior? …
codesniffer
  • 1,033
  • 9
  • 22
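
The behavior described in this question can be reproduced with Python's standard `unicodedata` module. Superscript and subscript digits carry compatibility decompositions, which NFKC and NFKD apply and NFC and NFD do not. The helper name `fold_compat` is hypothetical.

```python
import unicodedata


def fold_compat(text: str) -> str:
    """Apply NFKC, which folds compatibility characters to their base form."""
    return unicodedata.normalize("NFKC", text)


def keep_compat(text: str) -> str:
    """Apply NFC, which leaves compatibility characters untouched."""
    return unicodedata.normalize("NFC", text)
```

If the superscript distinction matters to the application, NFC is probably the safer choice, since it only composes canonically equivalent sequences.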
3
votes
2 answers

How do I capture items from StringScanner?

I am using Ruby's StringScanner to normalize some English text. def normalize text s = '' ss = StringScanner.new text while ! ss.eos? do s += ' ' if ss.scan(/\s+/) # multiple whitespace => single space s += 'mice' if…
zhon
  • 1,610
  • 1
  • 22
  • 31
2
votes
0 answers

QWebView::findText doesn't work with Unicode’s Combining Diacritical Marks

I’m using QtWebKit (QWebView) to display text, and I want to implement a search functionality in it via QWebView::findText. Problem is that the text that has to be displayed contains so-called Unicode’s Combining Diacritical Marks, and both…
2
votes
2 answers

Normalizing text file from abnormal newlines?

I have several text files that have lots of newlines between texts that I would like to normalize, but there is no pattern to the number of newlines between the texts. For example: Text Some text More text More more So what I wanted to…
Guapo
  • 3,446
  • 9
  • 36
  • 63
1
vote
1 answer

Expanding abbreviations using regex

I have a dictionary of abbreviations that I would like to expand. I would like to use it to go through a text and expand all abbreviations. The defined dictionary is as follows: contractions_dict = { "kl\.": "klokken", } The text I…
Kiri
  • 55
  • 4
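
One possible way to apply such a dictionary, sketched under the assumption that its keys are regex patterns, as the escaped dot in `"kl\."` suggests. The `\b` prefix and the function name `expand_abbreviations` are additions for illustration, not part of the question.

```python
import re

# Keys are regex patterns; raw strings keep the backslash escape intact.
contractions_dict = {
    r"kl\.": "klokken",
}


def expand_abbreviations(text: str, mapping: dict) -> str:
    """Replace each abbreviation pattern with its expansion."""
    for pattern, expansion in mapping.items():
        # Assumption: \b stops the pattern from matching inside a longer word.
        text = re.sub(r"\b" + pattern, expansion, text)
    return text
```

Without the word boundary, the pattern would also fire on the tail of any word that happens to end in "kl.".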
1
vote
1 answer

How to normalize text with regex?

How to normalize text with regex with some if statements? If we have a string like this One T933 two, three35.4. four 9,3 8.5 five M2x13 M4.3x2.1 And I want to normalize it like this one t 933 two three 35.4 four 9,3 8.5 five m2x13 m4.3x2.1 Remove all…
Dmiich
  • 325
  • 2
  • 16
1
vote
1 answer

What is the best way to search for an exact match using Postgres full-text search?

I have a Postgres database with around 1.5 million records. In my Ruby on Rails app, I need to search the statement_text field (which can contain anywhere from 1 to hundreds of words). My problem: I know I can use the pgSearch gem to create scopes…
1
vote
1 answer

String normalization in Neo4j Cypher - how to?

Problem background: Chinese words consist of characters that are words themselves. I have 3 nodes representing Chinese words, each with the attribute word having the string values: node (1): "a" node (2): "b" node (3): "ab" Question 1: Using…
Mika
  • 11
  • 1
0
votes
0 answers

Text Normalization for abbreviations, acronyms and any other shortcuts written in English

I want to predict some typo shortcuts. For example: 8 in. micrometer has to be predicted as 8 inch micrometer 9 lbs Bag - 9 pounds bag 10" scale - 10 inch scale 10 no. - 10 numbers 77 mm length - 77 millimeter length and so on. I already created a…
0
votes
2 answers

Normalize vector such that sum equals 1, while satisfying a lower bound

Given a lower bound of 0.025, I want a vector consisting of weights that sum up to 1 and satisfy this lower bound. Starting from a vector with an arbitrary length and the values ranging from 0.025 (lower bound) to 1. For example, [0.025, 0.8,…
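
One way to meet both constraints, sketched as an assumption rather than as the asker's method: give every weight the lower bound first, then split the remaining mass in proportion to the input values. This changes the ratios between weights. The function name and the sample input in the usage note are hypothetical.

```python
def normalize_with_floor(values, lower=0.025):
    """Return weights that sum to 1 with every weight >= lower."""
    n = len(values)
    if n == 0 or n * lower > 1.0:
        raise ValueError("lower bound is infeasible for this vector length")
    if any(v < 0 for v in values):
        raise ValueError("values must be non-negative")
    total = sum(values)
    if total <= 0:
        raise ValueError("values must have a positive sum")
    # Mass left over once each element has received its floor.
    free = 1.0 - n * lower
    return [lower + free * v / total for v in values]
```

For example, `normalize_with_floor([1.0, 3.0], lower=0.1)` reserves 0.1 for each element and splits the remaining 0.8 in a 1:3 ratio.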