Search for data that is not english text

Question

in a nutshell: I need to be able to search within Oracle DB inside a certain column, for all occurrences that are not English text, or contain signs like -^ etc'. (capital and non-capital are ok) in general, I'm looking to find all occurrences of other languages, Korean Spanish etc'.

ID    NAME      DATE
1     TEST      2018-12-02 11:09:05
2     TE-ST     2018-12-02 11:09:05
3     测试       2018-12-02 11:09:05

i expect the query to find only row #3.

https://stackoverflow.com/a/21140501/10815568 Have you tried this solutions? — Shiba Tatsuya, Dec 26 '18 at 08:10
I can't think of a pure database solution to this. Example: the word `autobus` or even `bus` are valid words in both English and Spanish, and probably a few other languages. — Tim Biegeleisen, Dec 26 '18 at 08:14
In fact, [`test` is a commonly used word in Spanish](http://lema.rae.es/dpd/srv/search?key=test) :) — Álvaro González, Dec 26 '18 at 08:35
maybe the example of Spanish is indeed wrong, since as you mentioned test is also usable. differentiating between Korean/Chinese or any other type of language where no works are in English is good enough. — Dmoy, Dec 26 '18 at 08:43
@Shiba Tatsuya i saw some answers there that didnt work for me, i noticed some said it was tested on mySql which is not what i'm using - Oracle db — Dmoy, Dec 26 '18 at 09:03
To be clear, are you looking for text which contains **characters** which do not appear in English (accents, Cyrillic, etc)? Or do you want to identify **words** which are not written English? — APC, Dec 26 '18 at 09:31

Moudiz · Accepted Answer · 2018-12-26T09:03:38.653

4

with test as
(
select 'hello good morning' txt from dual 
union select 'Bad weather' from dual
union select '测试 ' from dual
    union select 'L''Inhêrit ' from dual
    union select 'هلا' from dual
)
select *
from test
where txt != asciistr( txt )

edited Dec 26 '18 at 09:03

answered Dec 26 '18 at 08:14

Moudiz

7,211
22
78
156

I don't think so. Did you try "lasjdvje" instead of the first string and "aljfei kszzz" for the second one? None of these are *English". True, that's garbage, but - use "dobro jutro" and "gadno vrijeme" which are Croatian translations for your examples; these aren't English for sure either, but your query will return only the last ... huh, no idea what it really is, I don't know that language. – Littlefoot Dec 26 '18 at 08:36
@Littlefoot well the above code i used for foreign language. and his requirement was `I'm looking to find all occurrences of other languages, Korean Spanish` – Moudiz Dec 26 '18 at 08:44
@Littlefoot my select will get the characters are not english, for example `ê' or turkish/arabic/russian/korean characters, and in one of his example he mentioned that. if he wants other then that he can specify so i provide anotehr query – Moudiz Dec 26 '18 at 09:04
OK. We agree that we disagree. In my opinion, your query doesn't do what's being asked. – Littlefoot Dec 26 '18 at 09:06
@Littlefoot - the question is unclear. So this may be the solution the OP is looking for. Equally it may not. – APC Dec 26 '18 at 09:33
1

The OP said: "search (...) for all occurrences that are not English text". String "dobro jutro" (*good morning* in Croatian) isn't *English*, and that `SELECT` won't return it. As we've already found out, we (Moudiz and me) have different thoughts about it, which is perfectly OK. If the question was "that are not written in English alphabet" (you know, [a-zA-Z]), that would still be very broad as "aaieffnx" is written like that, but certainly isn't a valid English word. Oh well, we'll see what the OP says (if anything). Thank you for the comment, anyway, @APC. – Littlefoot Dec 26 '18 at 09:43
1

@Moudiz perfect, thanks :) it is what i was looking for. i apologize if I wasn't clear enough that i don't really care if the word is valid, i just want to know if in English or not, if some1 misspelled that and wrote thet, i'm ok with that. – Dmoy Dec 26 '18 at 11:11
I wonder why this answer is accepted as solution. Strings like `"()@;:` will be considered as "valid English" despite the question says "not English text, or contain signs like -^ etc'" – Wernfried Domscheit Dec 26 '18 at 13:47
Note `asciistr('\')` gives `\005C`, it may fail for certain strings. Use `ASCIISTR(REPLACE(txt, '\', '/')) <> REPLACE(txt, '\', '/')` or see https://stackoverflow.com/questions/50914930/find-out-if-a-string-contains-only-ascii-characters – Wernfried Domscheit Dec 26 '18 at 13:50
@WernfriedDomscheit thanks for your point. he is question is not clear but i guess what he meant he wants to get the foreign lanaguages. – Moudiz Dec 26 '18 at 15:03

APC · Answer 2 · 2018-12-26T10:35:33.120

1

Finding non-English characters is pretty straightforward. @moudiz 's solution will solve that problem. But identifying whether a body of text is written in English or some other language requires some form of AI / ML capability which does not come as standard in Oracle RDBMS.

One possibility might be Oracle Text. The World Lexer has auto-detection support for a number of languages. It may be possible to wrangle its capability to tell whether a piece of text is in English. Find out more. (Caveat: blue sky thinking here, never tried anything like this.)

Another solution would be to build a PL/SQL package which calls Google Translate API. The detect() can identify the language of the passed text. Find out more.

There are a couple of obvious snags:

A lot of organisations would object to passing text from a database to an external site like Google.
If you have a lot of data to test the licensing would get expensive.

edited Dec 26 '18 at 10:35

answered Dec 26 '18 at 10:02

APC

144,005
19
170
281

thank you for your comment, i didnt know about the Google translate API, i will keep it in mind for future notice in case i want to actually validate a word or use some kind of a translation. – Dmoy Dec 26 '18 at 11:13
2

Note that 1. it is naïve to assume non-7-bit-ascii isn't English (even without considering common characters like `£`); 2. text can easily contain words from another language without changing the base language. This is a hard problem. – Richard Jan 17 '19 at 09:48
Yeah, when seeing people assume English text can only contain ASCII I always get a sense of déjà vu. English dictionaries have a lot of odd stuff in them. – rghome Jan 17 '19 at 11:10

score 0 · Answer 3 · edited Jan 17 '19 at 08:46

0

thanks to @moudiz I was able to find a perfect solution for this. I'm using:

select * from table 
where not  REGEXP_LIKE (field_name, '^[^0-9a-z]+$', 'i');

edited Jan 17 '19 at 08:46

APC

144,005
19
170
281

answered Dec 26 '18 at 11:34

Dmoy

57
7

Search for data that is not english text

3 Answers3

Linked