Parse CASE WHEN statements with sqlparse

Question

I have the following SQL query and would like to parse it using sqlparse

import sqlparse

query =  """
select SUM(case when(A.dt_unix<=86400
                     and B.flag="V") then 1
           end) as TEST_COLUMN_1,
       SUM(case when(A.Amt - B.Amt > 0
                     and B.Cat1 = "A"
                     and (B.Cat2 = "M"
                          or B.Cat3 = "C"
                          or B.Cat4 = "B")
                     and B.Cat5 is NULL) then 1
           end) as TEST_COLUMN_2
from test_table A
left join test_table_2 as B on A.ID=B.ID
where A.DT >B.DT
group by A.ID
"""

query_tokens = sqlparse.parse(query)[0].tokens
print(query_tokens)

would give all the tokens included in the SQL statement:

[<Newline ' ' at 0x7FAA62BD9F48>, <DML 'select' at 0x7FAA62BE7288>, <Whitespace ' ' at 0x7FAA62BE72E8>, <IdentifierList 'SUM(ca...' at 0x7FAA62BF7CF0>, <Newline ' ' at 0x7FAA62BF6288>, <Keyword 'from' at 0x7FAA62BF62E8>, <Whitespace ' ' at 0x7FAA62BF6348>, <Identifier 'test_t...' at 0x7FAA62BF7570>, <Newline ' ' at 0x7FAA62BF64C8>, <Keyword 'left j...' at 0x7FAA62BF6528>, <Whitespace ' ' at 0x7FAA62BF6588>, <Identifier 'test_t...' at 0x7FAA62BF7660>, <Whitespace ' ' at 0x7FAA62BF67C8>, <Keyword 'on' at 0x7FAA62BF6828>, <Whitespace ' ' at 0x7FAA62BF6888>, <Comparison 'A.ID=B...' at 0x7FAA62BF7B10>, <Newline ' ' at 0x7FAA62BF6B88>, <Where 'where ...' at 0x7FAA62BF28B8>, <Keyword 'group' at 0x7FAA62BD9E88>, <Whitespace ' ' at 0x7FAA62BD93A8>, <Keyword 'by' at 0x7FAA62BD9EE8>, <Whitespace ' ' at 0x7FAA62C1CEE8>, <Identifier 'A.ID' at 0x7FAA62BF2F48>, <Newline ' ' at 0x7FAA62BF6C48>]

How can I parse these tokens in order to process CASE WHEN statements in a way that I can extract all the conditions and maintain their precedence as defined by the use of parentheses. I was not able to find any relevant examples in the documentation.

Any thoughts on this?

Having written an answer, I now am looking at [this project instead](https://github.com/mozilla/moz-sql-parser), which remarks that *sqlparse does not provide a tree, rather a list of tokens.*. While that statement is not entirely accurate, `sqlparse` doesn't make it easy to use the token tree. — Martijn Pieters, Mar 04 '19 at 15:36
@MartijnPieters I've had a look at mozzila's project but it raises recursion exceptions when more complex SQL statements are parsed. — Giorgos Myrianthous, Mar 04 '19 at 15:38
Crumbs, `moz-sql-parser` actually raises a recursion exception when asked to parse your sample statement. — Martijn Pieters, Mar 04 '19 at 15:38
That's [this bug report](https://github.com/mozilla/moz-sql-parser/issues/41). — Martijn Pieters, Mar 04 '19 at 15:39
Raising the recursion limit does give me the parse tree now. — Martijn Pieters, Mar 04 '19 at 15:42
@MartijnPieters Right. It seems good but still has some issues; for example `IS (NOT) NULL` is parsed as `missing`. — Giorgos Myrianthous, Mar 04 '19 at 15:57
Yup, it's not ideal either, and I don't see any other libraries filling this void either. — Martijn Pieters, Mar 04 '19 at 15:59
@MartijnPieters Do you have any recursive parser in mind that would be useful for parsing the JSON output of `moz-sql-parser`? — Giorgos Myrianthous, Mar 04 '19 at 17:03
Its already structured data, just traverse the dictionaries and lists, with a stack perhaps. — Martijn Pieters, Mar 04 '19 at 17:20

Martijn Pieters · Accepted Answer · 2019-03-04T15:47:01.677

The project is indeed a little underdocumented. I looked at the examples and scanned the source code a little. The documentation unfortunately doesn't include all methods on the Token and TokenList classes that are useful for this task.

For example, an important but omitted method is the TokenList.get_sublists() method, which lets you traverse over nested token lists more easily than other methods do; the TokenList.flatten() method only yields ungrouped tokens in the tree, whereas CASE is a grouped token, so going purely by the documentation you might find it hard to do something useful with the parsed token tree.

Another handy method that I noticed in the codebase is the TokenList._pprint_tree() method, which dumps out the current token tree to stdout. This is very helpful when trying to write code that analyses the tree.

All in all my overall impression of sqlparse is that it is less of a parsing library than a tool to re-format SQL. It includes a good parser but doesn't include the tools necessary to make general use of the tree it produces.

What is really missing in the library is a base node visitor class such as that provided by the Python ast module, or a tree node walker, again like the ast module provides. Either is easy enough to build yourself, luckily:

from collections import deque
from sqlparse.sql import TokenList

class SQLTokenVisitor:
    def visit(self, token):
        """Visit a token."""
        method = 'visit_' + type(token).__name__
        visitor = getattr(self, method, self.generic_visit)
        return visitor(token)

    def generic_visit(self, token):
        """Called if no explicit visitor function exists for a node."""
        if not isinstance(token, TokenList):
            return
        for tok in token:
            self.visit(tok)

def walk_tokens(token):
    queue = deque([token])
    while queue:
        token = queue.popleft()
        if isinstance(token, TokenList):
            queue.extend(token)
        yield token

Now you can use either to access the Case nodes:

statement, = sqlparse.parse(query)

class CaseVisitor(SQLTokenVisitor):
    """Build a list of SQL Case nodes

      The .cases list is a list of (condition, value) tuples per CASE statement

    """
    def __init__(self):
        self.cases = []

    def visit_Case(self, token):
        branches = []
        for when, then_ in token.get_cases():
            branches
        self.cases.append(token.get_cases())

visitor = CaseVisitor()
visitor.visit(statement)
cases = visitor.cases

or

statement, = sqlparse.parse(query)

cases = []
for token in walk_tokens(statement):
    if isinstance(token, sqlparse.sql.Case):
        cases.append(token.get_cases())

The difference between the walk_tokens() and NodeVisitor patterns is negligible in this example, but we are simply extracting the separated tokens for each of the CASE statements, with no processing of the WHEN ... THEN ... tokens. In the NodeVisitor pattern you'd set more attributes on the current visitor instance to 'switch gears' and capture further information about those subtree tokens in more visit_.... methods, which may be easier to follow than a nested for loop over a generator.

On the other hand, with the walk_tokens() generator, if you create a separate variable to reference the generator, you can hand over iteration to helper functions:

all_tokens = walk_tokens(stamement)
for token in walk_tokens(statement):
    if isinstance(token, sqlparse.sql.Case):
        branches = extract_branches(all_tokens)

where extract_branches would further iterate until it came to the end of the case statement.

score 0 · Answer 2 · answered Sep 20 '22 at 02:56

To build on Martin's [fantastic] answer, I think that Newline and Whitespace should be ignored when visiting the nodes.

    ...
    ...
    def visit(self, token):
        """Visit a token."""
        if not token.is_whitespace:
            visitor = getattr(self, method, self.generic_visit)
            return visitor(token)

Parse CASE WHEN statements with sqlparse

2 Answers2

Linked