1

I am trying to extract units of measure from strings in an Oracle database that houses a pharmaceutical catalog. I have been using regexp_substr to get the concentration of a medication from a string

ie.

Name col in schema:
CYCLOSPORINE 100MG 30 CAPS
TERBUTALINE 2.5MG 100 TABS 

Output of query:

col 1: CYCLOSPORINE 100MG 30 CAPS, Col 2: 100MG
col 1: TERBUTALINE 2.5MG 100 TABS, Col 2: 2.5MG     



select name, 
regexp_substr(upper(name), 
'(\d*\.*\d+\s*ML|\d*\.*\d+\s*MG|\d*\.*\d+\s*OZ|\d*\.*\d+\s*LB)') 
CONCENTRATION 
from schema.table t 
where t.discontinuedflag=0
and t.restrictioncode <> 0
and t.distributor_id =19

Anyone know how I could pull out 200MG/mL using regexp_substr() in Oracle from the string below?

'TESTOSTERONE CYP 200MG/mL 10ML VIAL C3' 
Gary_W
  • 9,933
  • 1
  • 22
  • 40
  • `how I could pull out 200MG/mL from the string below?` This is easy, but question is: string structure is always same, or may be it changed? – Oto Shavadze Apr 06 '17 at 18:45
  • This will change by item, for example you might see 200 MG/ML or 200MG/ML and there are thousands of medications with different units of measure in this table ie.GEMCITABINE 1 GM-26.3 ML VL NOREPINEPHRINE 1MG/mL 4mL 10 AMP – John Flicker Apr 06 '17 at 19:03
  • Which values do you expect to get out of these last 2 examples? Please update the original post as the comment section does not allow for formatting. – Gary_W Apr 06 '17 at 20:11
  • As an aside, your regex could be simplified: `'(\d*\.*\d+\s*)(ML|MG|OZ|LB)'` – Gary_W Apr 06 '17 at 20:21
  • You first need to know without a doubt if a number followed by a measure is part of the name or indeed the concentration (subpart of the name). If both are present, which comes first? In our NDC table I see descriptions like "DARVOCET-N 100 TABLET". be careful, that is a count, not a concentration. I guess if your rule is if the number is not followed by (ML|MG|OZ|LB) then ignore? You need to make sure you get all measures listed! I hope no one's life depends on this data you are extracting, just in case of error! :-) – Gary_W Apr 06 '17 at 20:46
  • Yes I would want a number followed by a measure, should be in any of the pharma strings. Problem I am seeing is that sometimes you get "(200 MG)" and other times you get "200 MG" – John Flicker Apr 06 '17 at 20:50
  • Hey I was wrong in my previous post, "DARVOCET-N 100 TABLET" is just the name, the 100 part of that is NOT a tablet count. Just illustrates all the more how this is a tough problem. Is the 100 then just a name, or short for mg maybe? If so, then do you assume any number NOT followed by a measure is mg? It's getting fugly. – Gary_W Apr 06 '17 at 20:58
  • So how would you want to handle the DARVOCET example? 100 would be returned but the concentration is implied really. Could get ugly if you have to look at the next word i.e. if TABLET the it's MG, etc. – Gary_W Apr 07 '17 at 13:54

2 Answers2

0

It looks like you want the first "token" in the string that starts with digits. If so:

select regexp_substr(name || ' ', ' [0-9.]+[^ ]+ ') as concentration

This concatenates a space to the end of name, so the pattern can end in a space, even if it is at the end of name.

Gordon Linoff
  • 1,242,037
  • 58
  • 646
  • 786
  • Nicely done. But it runs into trouble with specific strings ALBON ORAL SUSP 5% 16OZ – John Flicker Apr 06 '17 at 20:03
  • Sorry coworker came by and had to switch screens. EDIT: Nicely done. But it runs into trouble with specific strings ALBON ORAL SUSP 5% 16OZ outputs 5% (Which might actually work for me) AMOXI-DROP (50MG) outputs NULL – John Flicker Apr 06 '17 at 20:10
0

Well this seems to be working so far for these specific examples, but like I said in comments above you need to be sure about the data so don't trust this too much without more extensive testing. I have an NDC table and did some checking and it seems that the concentration is first in the description but I didn't check each and every code so test very carefully!

The regex puts parens around groups to be remembered, which are read from left to right and the first and second remembered groups are returned. It can be read as: Starting at the beginning of the line, look for one or more characters that are not a digit, followed by one or more digits, then an optional decimal point and zero or more digits, followed by zero or more spaces, then one of the optional measures (pipe is a logical OR), then an optional "/ML", then the rest of the string.

SQL> with tbl(drug_name) as (
     select 'CYCLOSPORINE 100MG 30 CAPS' from dual union
     select 'TERBUTALINE 2.5MG 100 TABS' from dual union
     select 'TESTOSTERONE CYP 200MG/mL 10ML VIAL C3' from dual union
     select 'GEMCITABINE 1 GM-26.3 ML VL' from dual union
     select 'NOREPINEPHRINE 1MG/mL 4mL 10 AMP' from dual union
     select 'AMOXI-DROP (50MG)' from dual union
     select 'DARVOCET-N 100 TABLET' from dual union
     select 'ALBON ORAL SUSP 5% 16OZ' from dual
   )
   select drug_name,
   regexp_replace(upper(drug_name), '^\D+(\d+\.?\d*) *((GM|ML|MG|OZ|LB|%)?(/ML)?).*$', '\1\2') CONCENTRATION
   from tbl;

DRUG_NAME                              CONCENTRATION
-------------------------------------- ------------------------------
ALBON ORAL SUSP 5% 16OZ                5%
AMOXI-DROP (50MG)                      50MG
CYCLOSPORINE 100MG 30 CAPS             100MG
DARVOCET-N 100 TABLET                  100
GEMCITABINE 1 GM-26.3 ML VL            1GM
NOREPINEPHRINE 1MG/mL 4mL 10 AMP       1MG/ML
TERBUTALINE 2.5MG 100 TABS             2.5MG
TESTOSTERONE CYP 200MG/mL 10ML VIAL C3 200MG/ML

8 rows selected.

SQL>

Notes:- If the regex does not find a match, the DRUG_NAME column will be returned.
      - Since you upshift the drugname, the original 'mL' spelling becomes 'ML'.  
        Technically it's the same thing but you are altering data which may matter to the 
        consumers of the data.
      - Some drug names like the DARVOCET example don't seem to have a measure in the 
        description.  You need to decide if that's ok.
      - The space between the number and measure is removed.

Oh and I used REGEXP_REPLACE as it allows referring to multiple saved groups with the '\1' shorthand where REGEXP_SUBSTR does not allow that (only 1 subgroup).

Gary_W
  • 9,933
  • 1
  • 22
  • 40
  • I think we have a winner! Definitely performs as expected on pharmacy items. I used upper() to tackle the problem of ML v. mL or mg v. MG, seems to be ok in this case and I do not think it will pose a problem. Good stuff!!! Thank you thank you thank you! – John Flicker Apr 07 '17 at 16:26
  • This was interesting, especially since I deal with this type of data as well so it's a good one for the bag of tricks. However, It needs way more testing. – Gary_W Apr 07 '17 at 16:52
  • select name, regexp_replace(upper(name), '^\D+(\d+\.?\d*) *((GM|MCG|MG|%)?(/ML|\S+\d*)?).*$', '\1\2') CONCENTRATION, from schema.table; This versions seems to capture things like 300MG/3ML – John Flicker Apr 07 '17 at 17:24
  • Here's the risk, you'll have to constantly tweak as new combinations are found. Retest thoroughly after tweaking the regex and please post back. – Gary_W Apr 07 '17 at 17:31