1

In our database we have Latin and Cyrillic alphabets and all is fine with those characters. But, sometimes people enter characters that appear as ? or squares. Is there a way to find those specific symbols, because looking manually through a couple of tables with >400 000 records for 1 mistake is insane.

I have looked for online solutions that I can load my data as text, I have tried notepad++, but either I get all Cyrillic marked or I don't get anything.

EDIT: L..D. is an example.

jarlh
  • 42,561
  • 8
  • 45
  • 63
DarkBlade
  • 47
  • 6
  • 2
    Which dbms are you using? – jarlh Jul 18 '22 at 12:50
  • I use oracle. But i can export all the data into a txt or excel or whatever. – DarkBlade Jul 18 '22 at 13:15
  • If your database character set is UTF-8 (i.e. `AL32UTF8`) then **every** character is UTF-8, others cannot be stored. "Looking for non utf8 symbols" does not make any sense. What is your real requirement? Invalid symbols are automatically replace by placeholder, typically `¿`. Maybe you are looking for a certain range of characters. – Wernfried Domscheit Jul 18 '22 at 13:57
  • DB is set to utf8. Yes, I would say I would be looking outside a certain range, because i do get those placeholders. The DB accepts symbols that are out of a certain range, but don't keep the original character. In the edit I give an example data of what is stored. – DarkBlade Jul 19 '22 at 06:30

1 Answers1

0

Based on the ASCII value of the character you can filter the text.Here i am assuming all characters greater than 65534 as outside the normal range.You can modify the range according to your requirement.The db fiddle here

 with test (col) as (
  select 
    'L.A.D' 
  from 
    dual 
  union all 
  select 
    'L..D.' 
  from 
    dual
) 
select 
  col, 
  case when max(ascii_of_one_character) >= 65535 then 'NOT OK' else 'OK' end result 
from 
  (
    select 
      col, 
      substr(col, column_value, 1) one_character, 
      ascii(
        substr(col, column_value, 1)
      ) ascii_of_one_character 
    from 
      test cross 
      join table(
        cast(
          multiset(
            select 
              level 
            from 
              dual connect by level <= length(col)
          ) as sys.odcinumberlist
        )
      )
  ) 
group by 
  col
  having max(ascii_of_one_character) >= 65535;
psaraj12
  • 4,772
  • 2
  • 21
  • 30
  • I think this will not work if the data has Cyrillic characters. – Wernfried Domscheit Jul 18 '22 at 15:49
  • this is just an example if user provides the ascii range we can modify according to it – psaraj12 Jul 18 '22 at 15:54
  • @DarkBlade you can let us know which all languages needs to be supported from this list https://codepoints.net/basic_multilingual_plane – psaraj12 Jul 18 '22 at 16:58
  • Basic Latin (U+0000 to U+007F) and Cyrillic (U+0400 to U+04FF) should be enough to have. So if I get flagged all the rest maybe it should be enough. The error I get also is: _Error 1366: Incorrect string value: '\xED\xAF\x88\xED\xB0\x93...'_ – DarkBlade Jul 19 '22 at 07:12
  • I modified the script in the first part `with test (col) as (select table.colum from table)` But I got back all records, not just those that are flagged as problematic (19 rows). I got all 300 000+ rows. – DarkBlade Jul 19 '22 at 07:38
  • have added the having clause to filter only the required records – psaraj12 Jul 19 '22 at 07:51
  • Thank you for the addition. Before it i got 21 000 records 'NOT OK'. With the addition I get 5500 records 'NOT OK'. It is a big improvement. The unrecognized characters are only 19 though. Is there a way to find characters that are in specific ranges? For example Special (U+FFF0 to U+FFFF) or another range from the link you shared above. – DarkBlade Jul 19 '22 at 09:19
  • find the min(ascii()) for the 19 records and let me know – psaraj12 Jul 19 '22 at 09:25
  • I was able to find the ascii(absinfo1) of only the character that gives me problems. The result of ascii(column) is `4102194338`. Does this help? If I can look for this specific character in some way `'%ascii(absinfo1)%'` it would be quite helpful. – DarkBlade Jul 19 '22 at 13:12
  • i have updated the answer ,please check if it helps or you can use ascii(absinfo1)=4102194338 – psaraj12 Jul 19 '22 at 13:14
  • This finds me only 1 record with this character (the one I have with this character and nothing else). The script is looking for this character anywhere in a string right? If that is the case, then each special square I see has a different ascii which means it will be hell to find them one by one... – DarkBlade Jul 19 '22 at 13:31
  • as per the multilingual support anything greater than or equal to 65535 you can filter out ascii(absinfo1)>=65535 – psaraj12 Jul 19 '22 at 13:39
  • Thank you man! I finally was able to find those squares. I changed the HAVING clause to be >= 4000000000 which is where most (if not all) characters are. You saved me hours and hours of manual searching! – DarkBlade Jul 19 '22 at 14:17
  • glad to be of help :) – psaraj12 Jul 19 '22 at 14:19