Split SQL statements on function name but keep delimiter in Python

Question

Assuming that I have the following string that contains SQL statements extracted from a SELECT clause (in reality this is a huge SQL statement with hundreds of such statements);

  SUM(case when(A.money-B.money>1000
                and A.unixtime-B.unixtime<=890769
                and B.col10 = "A"
                and B.col11 = "12"
                and B.col12 = "V") then 10
      end) as finalCond0,
  MAX(case when(A.money-B.money<0
                and A.unixtime-B.unixtime<=6786000
                and B.cond1 = "A"
                and B.cond2 = "4321"
                and B.cond3 in ("E", "F", "G")) then A.col10
        end) as finalCond1,
  SUM(case when(A.money-B.money>0
                and A.unixtime-B.unixtime<=6786000
                and B.cond1 = "A"
                and B.cond2 = "1234"
                and B.cond3 in ("A", "B", "C")) then 2
      end) as finalCond2

how can I split this query on function (i.e. SUM, MAX, MIN, MEAN etc.) such that I can extract the last query but without removing the delimiter (which in this case is SUM)?

So the desired output would be a string like the one below:

  SUM(case when(A.money-B.money>0
                and A.unixtime-B.unixtime<=6786000
                and B.cond1 = "A"
                and B.cond2 = "1234"
                and B.cond3 in ("A", "B", "C")) then 2
      end) as finalCond2

PS: For presentation purposes I have provided some sort of indentation but in reality these statements are separated by a comma meaning that no whitespaces or new lines appear in the original form.

@Ralf It won't work in this scenario. A split on `,` (`sql.split(',').pop()`) would give `"C")) then 2 end) as finalCond2` — Tokyo, Mar 07 '19 at 11:07
@Ralf Won't work either. For presentation purposes I have provided some sort of indentation but in reality these statements are just separated by a comma meaning that no whitespaces or new lines appear in the original form. — Tokyo, Mar 07 '19 at 11:11

Martijn Pieters · Accepted Answer · 2019-03-07T11:57:28.073

You can't use a regular expression here, because SQL syntax does not form regular patterns you could match with the Python re engine. You'd have to actually parse the string into a token stream or syntax tree; your SUM(...) can contain a wide array of syntax, including sub-selects, after all.

The sqlparse library can do this, even though it is a bit underdocumented and not that friendly to external uses.

Re-using the walk_tokens function I defined in the other post I linked to:

from collections import deque
from sqlparse.sql import TokenList

def walk_tokens(token):
    queue = deque([token])
    while queue:
        token = queue.popleft()
        if isinstance(token, TokenList):
            queue.extend(token)
        yield token

extracting the last element from the SELECT identifier list then is:

import sqlparse
from sqlparse.sql import IdentifierList

tokens = sqlparse.parse(sql)[0]
for tok in walk_tokens(tokens):
    if isinstance(tok, IdentifierList):
        # iterate to leave the last assigned to `identifier`
        for identifier in tok.get_identifiers():
            pass
        break

print(identifier)

Demo:

>>> sql = '''\
...   SUM(case when(A.money-B.money>1000
...                 and A.unixtime-B.unixtime<=890769
...                 and B.col10 = "A"
...                 and B.col11 = "12"
...                 and B.col12 = "V") then 10
...       end) as finalCond0,
...   MAX(case when(A.money-B.money<0
...                 and A.unixtime-B.unixtime<=6786000
...                 and B.cond1 = "A"
...                 and B.cond2 = "4321"
...                 and B.cond3 in ("E", "F", "G")) then A.col10
...         end) as finalCond1,
...   SUM(case when(A.money-B.money>0
...                 and A.unixtime-B.unixtime<=6786000
...                 and B.cond1 = "A"
...                 and B.cond2 = "1234"
...                 and B.cond3 in ("A", "B", "C")) then 2
...       end) as finalCond2
... '''
>>> tokens = sqlparse.parse(sql)[0]
>>> for tok in walk_tokens(tokens):
...     if isinstance(tok, IdentifierList):
...         # iterate to leave the last assigned to `identifier`
...         for identifier in tok.get_identifiers():
...             pass
...         break
...
>>> print(identifier)
SUM(case when(A.money-B.money>0
                and A.unixtime-B.unixtime<=6786000
                and B.cond1 = "A"
                and B.cond2 = "1234"
                and B.cond3 in ("A", "B", "C")) then 2
      end) as finalCond2

identifier is a sqlparse.sql.Identifier instance, but converting it to a string again (which print() does, or you can just use str()) gives you the input SQL string again for that section.

@Old-School: no, because Python's regular expressions can't be used to parse nested structures. — Martijn Pieters, Mar 07 '19 at 11:36
@Old-School: SQL is not 'regular', you can't predict the number of parentheses, or where commas are going to used, etc. — Martijn Pieters, Mar 07 '19 at 11:37
Would this solution work even if I don't have complete SQL statements? Meaning that they are just identifiers from a `SELECT` statement and not the actual statement per se. — Tokyo, Mar 07 '19 at 11:40
@Old-School: I used your literal input, it's in the demonstration. No `SELECT` or `FROM` or `JOIN` in sight. — Martijn Pieters, Mar 07 '19 at 11:55

Ralf · Answer 2 · 2019-03-07T11:31:52.530

I have a solution, but it is a bit much code. This is without using regex, just many iterations of splitting on keywords.

s = """
SUM(case when(A.money-B.money>1000
                and A.unixtime-B.unixtime<=890769
                and B.col10 = "A"
                and B.col11 = "12"
                and B.col12 = "V") then 10
      end) as finalCond0,
MAX(case when(A.money-B.money<0
                and A.unixtime-B.unixtime<=6786000
                and B.cond1 = "A"
                and B.cond2 = "4321"
                and B.cond3 in ("E", "F", "G")) then A.col10
      end) as finalCond1,
SUM(case when(A.money-B.money>0
                and A.unixtime-B.unixtime<=6786000
                and B.cond1 = "A"
                and B.cond2 = "1234"
                and B.cond3 in ("A", "B", "C")) then 2
      end) as finalCond2 
"""

# remove newlines and doble spaces
s = s.replace('\n', ' ')
while '  ' in s:
    s = s.replace('  ', ' ')
s = s.strip()

# split on keywords, starting with the original string
current_parts = [s, ]
for kw in ['SUM', 'MAX', 'MIN']:
    new_parts = []
    for part in current_parts:
        for i, new_part in enumerate(part.split(kw)):
            if i > 0:
                # add keyword to the start of this substring
                new_part = '{}{}'.format(kw, new_part)

            new_part = new_part.strip()
            if len(new_part) > 0:
                new_parts.append(new_part.strip())

    current_parts = new_parts

print()
print('current_parts:')
for s in current_parts:
    print(s)

The output I get is:

current_parts:
SUM(case when(A.money-B.money>1000 and A.unixtime-B.unixtime<=890769 and B.col10 = "A" and B.col11 = "12" and B.col12 = "V") then 10 end) as finalCond0,
MAX(case when(A.money-B.money<0 and A.unixtime-B.unixtime<=6786000 and B.cond1 = "A" and B.cond2 = "4321" and B.cond3 in ("E", "F", "G")) then A.col10 end) as finalCond1,
SUM(case when(A.money-B.money>0 and A.unixtime-B.unixtime<=6786000 and B.cond1 = "A" and B.cond2 = "1234" and B.cond3 in ("A", "B", "C")) then 2 end) as finalCond2

Does it work for you? It seems to work for the example string you put in the question.

But I have already told you that there are not new lines (`\n`) in the string. — Tokyo, Mar 07 '19 at 11:28
The code I posted will work, wheter there are newlines in the string or not. I just make sure that there are no newlines present at the start, but that wont affect the outcome if the string does not have any newlines in it. — Ralf, Mar 07 '19 at 11:29

ALFA · Answer 3 · 2019-03-07T11:57:53.430

You could use something like:

import re

str = 'SUM(case when(A.money-B.money>1000 and A.unixtime-B.unixtime<=890769 and B.col10 = "A" and B.col11 = "12" and B.col12 = "V") then 10 end) as finalCond0, MAX(case when(A.money-B.money<0 and A.unixtime-B.unixtime<=6786000 and B.cond1 = "A" and B.cond2 = "4321" and B.cond3 in ("E", "F", "G")) then A.col10 end) as finalCond1, SUM(case when(A.money-B.money>0 and A.unixtime-B.unixtime<=6786000 and B.cond1 = "A" and B.cond2 = "1234" and B.cond3 in ("A", "B", "C")) then 2 end) as finalCond2'

result = re.finditer('as\s+[a-zA-Z0-9]+', str);

commas = []
parts = []

for reg in result:
    end = reg.end()
    if(len(str) > end and str[end] == ','):
        commas.append(end)

idx = 0
for comma in commas:
    parts.append(str[idx:comma])
    idx = comma + 1
parts.append(str[idx:])

print(parts)

In commas array you will have the commas that need to be splitted. Output will be:

[151, 322]

In parts you'll have the final array with the parts (Not sure if this implementation is the best way):

[
    'SUM(case when(A.money-B.money>1000 and A.unixtime-B.unixtime<=890769 and B.col10 = "A" and B.col11 = "12" and B.col12 = "V") then 10 end) as finalCond0',
    ' MAX(case when(A.money-B.money<0 and A.unixtime-B.unixtime<=6786000 and B.cond1 = "A" and B.cond2 = "4321" and B.cond3 in ("E", "F", "G")) then A.col10 end) as finalCond1',
    ' SUM(case when(A.money-B.money>0 and A.unixtime-B.unixtime<=6786000 and B.cond1 = "A" and B.cond2 = "1234" and B.cond3 in ("A", "B", "C")) then 2 end) as finalCond2'
]

Is the output of this code even close to the desired output I've posted in my question? — Tokyo, Mar 07 '19 at 11:47

Split SQL statements on function name but keep delimiter in Python

3 Answers3