2

For last couple of days I've been trying to sort a list of alphanumeric text in a natural order.I found that using the NLS_SORT option can order the list correctly (see this answer). But when trying out that solution I found that it made no difference. The list was still displayed as with a normal ORDER BY query. Please not that a solution involving regex is not an option for me.

For testing purposes I made a table and filled it with some data. When running SELECT name FROM test ORDER BY name ASC I get the following result:

enter image description here

As you see the the ordering is unnatural. It should be more like 1, 2, 3, 4, 5, 6, 7, 8, 9, 10.

The solutions I tried involved setting the nls_sort option.

ALTER SESSION SET nls_sort='BINARY'; -- or BINARY_AI
SELECT name FROM test ORDER BY NLSSORT(name,'NLS_SORT=BINARY') -- or BINARY_AI

It should order the text in the list based on the decimal code of each character as stated in the ASCII table. So I expected it to turn out the right way (as the order in that table is 'space', 'dot', numbers, letters), but it did not change anything. The order is still the same as in the image.

If it is BINARY then the sort order is based on the numeric value of each character, so it's dependant on the database character set

It might have something to do with the character set I'm using, but I'm not sure what wrong with it. Running SELECT value$ FROM sys.props$ WHERE name = 'NLS_CHARACTERSET'; gives me the value AL32UTF8. Which seems like a slightly extended version of UTF8 (correct me if I'm wrong). I'm running on Oracle database version 11.2.0.4.0.

So can anyone tell me what I'm doing wrong or what I'm missing?

Thanks in advance.

Community
  • 1
  • 1
Robin Hermans
  • 1,579
  • 1
  • 24
  • 52
  • Why is it natural for the order to be 1,2,3,... rather than 1,10,100? It looks at the string character by character (essentially), so it *is* correctly putting `1.` before `10`. The value of the second character doesn't affect how the first character is sorted. Look at the string of that `nlssort(name)` returns. Why isn't regex an option? – Alex Poole Sep 13 '16 at 09:49
  • In terms of software it normal. But I meant 'normal' sorting in a human readable way (sorry, should have declared that). Take for example the windows explorer. When you sort files that contain a number which acts as a version of that file, I expect the highest version to be at the bottom of the list when I sort in a ascending manner (or top when using descending order). – Robin Hermans Sep 13 '16 at 09:57

2 Answers2

3

You seem to expect a binary sort to look at multiple characters at once. It doesn't. It effectively sorts by the first character (so everything starting with 1 comes before anything starting with 2); then by the second character (so a period comes before a 0) - which means it's correct that 1. comes before 10, but also that 10 (or 100000) comes before 2. You can't change that aspect of the sorting behaviour. In the earlier question you linked to, it looks like only the first character was numeric, which is a slightly different situation.

From the documentation:

When character values are compared linguistically for the ORDER BY clause, they are first transformed to collation keys and then compared like RAW values. The collation keys are generated either explicitly as specified in NLSSORT or implicitly using the same method that NLSSORT uses.

You can see the byte order used for sorting:

with t (name) as (
  select level - 1 || '. test' from dual connect by level < 13
  union all select '20. test' from dual
  union all select '100. test' from dual
)
select name, nlssort(name, 'NLS_SORT=BINARY') as sort_bytes
from t
order by name;

NAME       SORT_BYTES         
---------- --------------------
0. test    302E207465737400    
1. test    312E207465737400    
10. test   31302E207465737400  
100. test  3130302E207465737400
11. test   31312E207465737400  
2. test    322E207465737400    
20. test   32302E207465737400  
3. test    332E207465737400    
4. test    342E207465737400    
5. test    352E207465737400    
6. test    362E207465737400    
7. test    372E207465737400    
8. test    382E207465737400    
9. test    392E207465737400  

You can see that the raw NLSRORT results (collation keys) are in a logical order.

If you don't want to use a regular expression you can use substr() and instr() to get the part before the period/space and convert that to a number; though that assumes the format is fixed:

with t (name) as (
  select level - 1 || '. test' from dual connect by level < 13
  union all select '20. test' from dual
  union all select '100. test' from dual
)
select name
from t
order by to_number(substr(name, 1, instr(name, '. ') - 1)),
  substr(name, instr(name, '. '));

NAME     
----------
0. test   
1. test   
2. test   
3. test   
4. test   
5. test   
6. test   
7. test   
8. test   
9. test   
10. test  
11. test  
20. test  
100. test 

If there might not be a period/space you could check for that:

select name
from t
order by case when instr(name, '. ') > 0 then to_number(substr(name, 1, instr(name, '. ') - 1)) else 0 end,
  case when instr(name, '. ') > 0 then substr(name, instr(name, '. ')) else name end;

... but you still have a problem if you had, say, two sentences in the name but the first can't be converted to a number. You could implement a 'safe' to_number() function that squashes an ORA-01722 if that happens.

It would be simpler and safer to use a regular expression, e.g.:

select name
from t
order by to_number(regexp_substr(name, '^\d+', 1)), name;
Alex Poole
  • 183,384
  • 11
  • 179
  • 318
2

Adding to Alex Poole's excellent post, here's a simple trick I learned from a Tom Kyte post (here). It works in this situation anyway:

-- padding with spaces ala Tom Kyte approach
with t (name) as (
  select level - 1 || '. test' from dual connect by level < 13
  union all select '20. test' from dual
  union all select '100. test' from dual
)
select name
from t
order by lpad(name, 20);

Output:

0. test
1. test
2. test
3. test
4. test
5. test
6. test
7. test
8. test
9. test
10. test
11. test
20. test
100. test

Hope that helps

EDIT:

This approach is more complex, but covers the situation Alex Poole brings up (again, credit to Tom Kyte):

with t (name) as (
  select level - 1 || '. test' from dual connect by level < 13
  union all select '20. hello' from dual
  union all select '100. test' from dual
)
select 
    --substr(name,1,length(name)-nvl(length(replace(translate(name,'0123456789','0000000000'),'0','')),0)),
    --substr(name,1+length(name)-nvl(length(replace(translate(name,'0123456789','0000000000'),'0','')),0)) ,
    name
from t
order by
    to_number(substr(name,1,length(name)-nvl(length(replace(translate(name,'0123456789','0000000000'),'0','')),0))),
              substr(name,1+length(name)-nvl(length(replace(translate(name,'0123456789','0000000000'),'0','')),0)) NULLS FIRST;

Output:

0. test
1. test
2. test
3. test
4. test
5. test
6. test
7. test
8. test
9. test
10. test
11. test
20. hello
100. test
tbone
  • 15,107
  • 3
  • 33
  • 40
  • Mmm, that works with the sample data, but only because the non-numeric part is the same length for each row. If you change the 20 text from 'test' to 'hello', for instance, that then sorts after 100. It's a good trick for purely numeric values (which shouldn't be string at all, of course!), but not for a mixture of numbers and text? – Alex Poole Sep 13 '16 at 18:27
  • Its definitely not a magic bullet, but it works in some situations depending on the data involved. – tbone Sep 13 '16 at 18:30
  • @AlexPoole Also, if you look at the Tom Kyte post, it discusses the case you bring up further down the post. If I get a chance, I'll update this post using that approach as well. – tbone Sep 13 '16 at 18:38
  • Anything is going to be a bit of a fudge - but yes, the shale of the actual data is going to be a big factor. The translate approach would work if the numbers are only at the start, for instance. These are all useful things for the OP to consider to see if they're suitable. – Alex Poole Sep 13 '16 at 18:44