0

I have this redshift SQL query. I extracted a number with decimal from the comment using the "REGEXP_SUBSTR" function. I also need to convert it from string to number/decimal. Then, I need to subtract that number from the total.

This is my query

SELECT sc.comment, 
       sm.subtotal, 
       to_number(REGEXP_SUBSTR(sc.comment, '[0.00-9]+..[0.00-9]+', 1),'9999999D99')

FROM "sales_memo_comments" sc INNER JOIN "sales_memo" sm ON sc.foreign_id = sm.parent_id

I tried using the "to_number" function on Redshift SQL, but its giving me the following: ERROR: invalid input syntax for type numeric: " "

This is the current output Before extracting the number refund amount from the comment column:

comment
"SAR719.00 Refund transaction executed successfully, Refund Request ID:504081288877953603 \n    , Authorization Code:095542 "
"AUD52.07 Refund transaction executed successfully, Refund Request ID:6J45695858A90833"
Canceled by : ron.johnd@company.co.us
refund amount is [MYR197.41]
"Please Ignore Order refunded by Refund Request ID:5002758809696048 , Authorization Code:2587759"
OMR37.83($98.23) Refund transaction executed successfully

This is it after using the above SQL query with REGEXP. I still get some anomalies.

comment
719
52.07
.co.
197.41
5.0027621
37.83($98.23

Two questions

  1. How do I edit the REGEXP to take account for the anomalies seen above
  2. How do I convert my string REGEXP to a numeric value to do a subtraction with another numeric column?

Any help would be appreciated.

Almo Ubuy
  • 1
  • 2
  • 5
  • Are you sure you're able to extract the number from the string using Regex for all the rows ? Seems like you might be running into the conversion error for a particular row or rows, which might not be in the desired form to be converted to a number. – Shailesh Nov 28 '17 at 08:26
  • Yes you are right, there are some empty rows after the using Regex. The comment column is where the refund is recorded and some comments dont have a number typed. – Almo Ubuy Nov 28 '17 at 08:47
  • The comment section has refund amounts with number that have "$" signs next to them and other having more than decimals (human errors)... Thats about 20% of the data is like that... – Almo Ubuy Nov 28 '17 at 09:05
  • That means you'll have to makes changes to your regex to accommodate those rows. – Shailesh Nov 28 '17 at 10:01
  • Thank you for the tip. You are right. Do you have any ideas how I can accommodate to rows where I do not want to take any signs of '$' '()' or no numbers available? – Almo Ubuy Nov 28 '17 at 10:03
  • I'm sorry, you'll find a regex expression for this on your own, that'll be able to extract the number from columns having or not having the '$' or '()'. Alternatively, you can run `sed` or `awk` text processing commands to remove '$' or '()' from the csv itself, but I think this should only be used if the first one doesn't work, and there's no reason the 1st won't work. – Shailesh Nov 28 '17 at 10:09
  • Yes, I want to avoid the latter. I have been trying to figure out for the past couple of hours on how to edit the Regex to filter those out, so reached a frustrating point a little bit haha... – Almo Ubuy Nov 28 '17 at 10:15
  • Start by fixing the [bracket expression](http://docs.aws.amazon.com/redshift/latest/dg/pattern-matching-conditions-posix.html) `[0.00-9]+`. At the moment it matches one or more 0, or a dot, or 0, or a character in the range 0 to 9. Then look up what role a dot has in a regexp pattern and fix `..`. And in the end having to parse refund amounts from comments sounds like a lost battle. The amount should be a separate field in a proper message. – Ilja Everilä Dec 02 '17 at 05:41

2 Answers2

0

Here is a way - you need to be able to test whether a string is numeric and for that you need a UDF - so just run this once to define that function

create or replace function isnumeric (aval VARCHAR(20000))
  returns bool
IMMUTABLE
as $$
    try:
       x = int(aval);
    except:
       return (1==2);
    else:
       return (1==1);
$$ language plpythonu;

Then, you could change your code as follows

SELECT sc.comment, 
       sm.subtotal, 
       to_number(
case when isnumeric(REGEXP_SUBSTR(sc.comment, '[0.00-9]+..[0.00-9]+', 1))
then REGEXP_SUBSTR(sc.comment, '[0.00-9]+..[0.00-9]+', 1)
else 0 end
,'9999999D99')

FROM "sales_memo_comments" sc INNER JOIN "sales_memo" sm ON sc.foriegn_id = sm.parent_id
Jon Scott
  • 4,144
  • 17
  • 29
  • Hey Thanks for sharing your inputs. I forgot to mention that I am on RJMetrics - Magento BI Software and dont have access to the DB directly to create functions. – Almo Ubuy Nov 28 '17 at 12:14
  • so you cannot run adhoc sql? – Jon Scott Nov 28 '17 at 12:14
  • I can definitely, but cant create new functions – Almo Ubuy Nov 28 '17 at 12:35
  • either get someone to give you access or get them to run that sql so you can then use the function. http://docs.aws.amazon.com/redshift/latest/dg/udf-security-and-privileges.html – Jon Scott Nov 28 '17 at 13:23
  • 1
    The database admin told me in my dreams – Almo Ubuy Nov 29 '17 at 07:32
  • :) - Yes I remember working in those kind of companies who have DBAs there to put brakes on innovation. Maybe the DBA could work you an alternative? – Jon Scott Nov 29 '17 at 07:46
  • There is no hope at all :( – Almo Ubuy Nov 29 '17 at 10:02
  • Can you help me with fixing the REGEXP_SUBSTR where it takes out the $ or any thing that is not a number .... I also found another problem in the comment, the customer service agents are reporting refunds in the comment like "INR12,999+INR2000" ...OMFG – Almo Ubuy Nov 29 '17 at 10:04
  • almo - i hate to say this - but the UDF's are also great at doing things that cannot be done with regex. – Jon Scott Nov 29 '17 at 10:30
0

My approach would be to first add two columns: one for string length and the other counting allowed characters. From this table, you could filter for only rows where the two matched (ie no non-allowed characters) and then just cast the remaining values to floats or decimals or whatever.

with temp as (
SELECT '719' as comment
UNION SELECT '52.07'
UNION SELECT '.co.'
UNION SELECT '197.41'
UNION SELECT '5.0027621'
UNION SELECT '37.83($98.23'
),
temp2 as (
  SELECT *
    ,regexp_count(comment, '[0-9.]') as good_char_length
    ,len(comment) as str_length
  FROM
    temp
)

SELECT *
  ,comment::float as comment_as_float
FROM
  temp2
WHERE
  good_char_length = str_length
ScottieB
  • 3,958
  • 6
  • 42
  • 60