11

I am not that hot at regular expressions and it has made my little mind melt some what.

I am trying to find all the tables names in a query. So say I have the query:

SELECT one, two, three FROM table1, table2 WHERE X=Y

I would like to pull out "table1, table2" or "table1" and "table2"

But what if there is no where statement. It could be the end of the file, or there could be a group by or an order by etc. I know "most" of the time this will not be an issue but I don't like the idea of coding for "most" situations and knowing I have left a hole that could cause things to go wrong later.

Is this a doable Regex expression? Am I being a Regex pleb?

(P.S. this will be done in C# but presume that doesn't matter much).

wonea
  • 4,783
  • 17
  • 86
  • 139
Jon
  • 15,110
  • 28
  • 92
  • 132
  • Regular expressions are the least of your problems.Just enumerating all the ways that a table can appear in a SQL statement is a complex problem. BTW. You never mentioned which flavor of SQL you are trying to parse. – JohnFx Feb 12 '10 at 16:58
  • Nor what the underlying problem he's trying to solve is. – Lasse V. Karlsen Feb 12 '10 at 17:10
  • I don't think regular expression is the right solution, you need a SQL Parser instead, check this article: http://www.dpriver.com/blog/list-of-demos-illustrate-how-to-use-general-sql-parser/get-columns-and-tables-in-sql-script-net-version/ –  Dec 21 '10 at 14:51

12 Answers12

13

RegEx isn't very good at this, as it's a lot more complicated than it appears:

  • What if they use LEFT/RIGHT INNER/OUTER/CROSS/MERGE/NATURAL joins instead of the a,b syntax? The a,b syntax should be avoided anyway.
  • What about nested queries?
  • What if there is no table (selecting a constant)
  • What about line breaks and other whitespace formatting?
  • Alias names?

I could go on.

What you can do is look for an sql parser, and run your query through that.

Joel Coehoorn
  • 399,467
  • 113
  • 570
  • 794
  • I think the real deal killer is going to be views. There is going to be no practical way to parse the underlying table names of any views included in the query. – JohnFx Feb 12 '10 at 16:59
5

Everything said about the usefulness of such a regex in the SQL context. If you insist on a regex and your SQL statements always look like the one you showed (that means no subqueries, joins, and so on), you could use

FROM\s+([^ ,]+)(?:\s*,\s*([^ ,]+))*\s+ 
Stefan Gehrig
  • 82,642
  • 24
  • 155
  • 189
4

I found this site that has a GREAT parser!

http://www.sqlparser.com/

well worth it. Works a treat.

Jon
  • 15,110
  • 28
  • 92
  • 132
3

I'm pretty late to the party however I thought I would share a regex I am currently using to analyse all our database objects and I disagree with the sentiment that it is not possible to do this using one.

The regex has a few assumptions

1) You are not using the A,B join syntax style

2) Whatever regex parser you are using supports ignore case.

3) You're analyzing, selects, joins, updates, deletes and truncates. It doesn't support the aforementioned MERGE/NATURAL because we don't use them, however I'm sure further support wouldn't be difficult to add.

I am keen to know what type of transaction the table is part of so I have included Named Capture groups to tell me.

Now I've not used regex for a long time so there are probably improvements that can be made however so far in all my testing this is accurate.

\bjoin\s+(?<Retrieve>[a-zA-Z\._\d]+)\b|\bfrom\s+(?<Retrieve>[a-zA-Z\._\d]+)\b|\bupdate\s+(?<Update>[a-zA-Z\._\d]+)\b|\binsert\s+(?:\binto\b)?\s+(?<Insert>[a-zA-Z\._\d]+)\b|\btruncate\s+table\s+(?<Delete>[a-zA-Z\._\d]+)\b|\bdelete\s+(?:\bfrom\b)?\s+(?<Delete>[a-zA-Z\._\d]+)\b
MrEdmundo
  • 5,105
  • 13
  • 46
  • 58
2

One workaround is to implement a naming convention on tables and views. Then the SQL statement can be parsed on the naming prefix.

For example:

SELECT tbltable1.one, tbltable1.two, tbltable2.three
FROM tbltable1
    INNER JOIN  tbltable2
        ON tbltable1.one = tbltable2.three

Split whitespace to array:

("SELECT","tbltable1.one,","tbltable1.two,","tbltable2.three","FROM","tbltable1","INNER","JOIN","tbltable2","ON","tbltable1.one","=","tbltable2.three")

Get left of elements to period:

("SELECT","tbltable1","tbltable1","tbltable2","FROM","tbltable1","INNER","JOIN","tbltable2","ON","tbltable1","=","tbltable2")

Remove elements with symbols:

("SELECT","tbltable1","tbltable1","tbltable2","FROM","tbltable1","INNER","JOIN","tbltable2","ON","tbltable1","tbltable2")

Reduce to unique values:

("SELECT","tbltable1","tbltable2","FROM","INNER","JOIN","ON")

Filter on Left 3 characters = "tbl"

("tbltable1","tbltable2")

KPavezC
  • 305
  • 2
  • 5
  • 20
Will
  • 1,048
  • 9
  • 10
1

Constructing a regular expression is going to be the least of your problems. Depending on the flavor of SQL you expect to support with this code, the number of ways you can reference a table in a SQL statement is staggering.

PLUS, if the query includes a reference to a view or UDF, the information about what underlying tables won't even be in the string at all making it completely impractical to get that information by parsing it. Also, you'd need to be smart about detecting temporary tables and excluding them from your results.

If you must do this, a better approach would be to make use of the APIs to the particular database engine that the SQL was intended for. For example you could create a view based on the query and then use the DB Server api to detect dependencies for that view. The DB engine is going to be able to parse it much more reliably than you ever will without an enormous effort to reverse engineer the query engine.

If, by chance, you are working with SQL Server, here is an article about detecting dependencies on that platform: Finding Dependencies in SQL Server 2005

Community
  • 1
  • 1
JohnFx
  • 34,542
  • 18
  • 104
  • 162
1

It's definitely not easy.

Consider subqueries.

select
  *
from
  A
  join (
    select
       top 5 *
    from
      B)
    on B.ID = A.ID
where
  A.ID in (
    select
      ID
    from
      C
    where C.DOB = A.DOB)

There are three tables used in this query.

LeppyR64
  • 5,251
  • 2
  • 30
  • 35
1

I think it would be easier to tokenize the string and look for SQL keywords that could bound the table names. You know the names will follow FROM, but they could be followed by WHERE, GROUP BY, HAVING, or no keyword at all if they're at the end of the query.

Bill the Lizard
  • 398,270
  • 210
  • 566
  • 880
0

I used this code as an Excel macro to parse the select and extract table names.

My parsing assumes that the syntax select from a , b , c is not used.

Just run it against your SQL query and if you are not satisfied with the result you should be only a few lines of codes away from the result you expect. Just debug and modify the code accordingly.

Sub get_tables()
    sql_query = Cells(5, 1).Value
    tables = ""

    'get all tables after from
    sql_from = sql_query

    While InStr(1, UCase(sql_from), UCase("from")) > 0

        i = InStr(1, UCase(sql_from), UCase("from"))
        sql_from = Mid(sql_from, i + 5, Len(sql_from) - i - 5)
        i = InStr(1, UCase(sql_from), UCase(" "))

        While i = 1

            sql_from = Mid(sql_from, 2, Len(sql_from) - 1)
            i = InStr(1, UCase(sql_from), UCase(" "))

        end

        i = InStr(1, sql_join, Chr(9))

        While i = 1

            sql_join = Mid(sql_join, 2, Len(sql_join) - 1)
            i = InStr(1, sql_join, Chr(9))

        end

        a = InStr(1, UCase(sql_from), UCase(" "))
        b = InStr(1, sql_from, Chr(10))
        c = InStr(1, sql_from, Chr(13))
        d = InStr(1, sql_from, Chr(9))

        MinC = a

        If MinC > b And b > 0 Then MinC = b
        If MinC > c And c > 0 Then MinC = c
        If MinC > d And d > 0 Then MinC = d

        tables = tables + "[" + Mid(sql_from, 1, MinC - 1) + "]"

    end

    'get all tables after join
    sql_join = sql_query

    While InStr(1, UCase(sql_join), UCase("join")) > 0

        i = InStr(1, UCase(sql_join), UCase("join"))
        sql_join = Mid(sql_join, i + 5, Len(sql_join) - i - 5)
        i = InStr(1, UCase(sql_join), UCase(" "))

        While i = 1

            sql_join = Mid(sql_join, 2, Len(sql_join) - 1)
            i = InStr(1, UCase(sql_join), UCase(" "))

        end

        i = InStr(1, sql_join, Chr(9))

        While i = 1

            sql_join = Mid(sql_join, 2, Len(sql_join) - 1)
            i = InStr(1, sql_join, Chr(9))

        end

        a = InStr(1, UCase(sql_join), UCase(" "))
        b = InStr(1, sql_join, Chr(10))
        c = InStr(1, sql_join, Chr(13))
        d = InStr(1, sql_join, Chr(9))

        MinC = a

        If MinC > b And b > 0 Then MinC = b
        If MinC > c And c > 0 Then MinC = c
        If MinC > d And d > 0 Then MinC = d

        tables = tables + "[" + Mid(sql_join, 1, MinC - 1) + "]"

    end

    tables = Replace(tables, ")", "")
    tables = Replace(tables, "(", "")
    tables = Replace(tables, " ", "")
    tables = Replace(tables, Chr(10), "")
    tables = Replace(tables, Chr(13), "")
    tables = Replace(tables, Chr(9), "")
    tables = Replace(tables, "[]", "")

End Sub
KPavezC
  • 305
  • 2
  • 5
  • 20
0

This will pull out a table name on an insert Into query:

(?<=(INTO)\s)[^\s]*(?=\(())

The Following will do the same but with a select including joins

(?<=(from|join)\s)[^\s]*(?=\s(on|join|where))

Finally going back to an insert if you want to return just the values that are held in an insert query use the following Regex

(?i)(?<=VALUES[ ]*\().*(?=\))

I know this is an old thread but it may assist someone else looking around

Enjoy

Psymon25
  • 336
  • 2
  • 18
0

I tried all the above but none worked since I use a wide variety of queries. I'm working with PHP though and used a PEAR library called SQL_Parser, but hope my solution helps. Also, I was having trouble with apostrophes and MySQL reserved sencences so I decided to strip off all the fields section from the query before parsing it.

function getQueryTable ($query) {
    require_once "SQL/Parser.php";
    $parser = new SQL_Parser();
    $parser->setDialect('MySQL');

    // Stripping fields section
    $queryType = substr(strtoupper($query),0,6);            
    if($queryType == 'SELECT') { $query  = "SELECT * ".stristr($query, "FROM"); }
    if ($havingPos = stripos($query, 'HAVING')) { $query = substr($query, 0, $havingPos); }


    $struct = $parser->parse($query);

    $tableReferences = $struct[0]['from']['table_references']['table_factors'];

    foreach ((Array) $tableReferences as $ref) {
        $tables[] = ($ref['database'] ? $ref['database'].'.' : $ref['database']).$ref['table'];
    }

    return $tables;

}
Mauro
  • 3,946
  • 2
  • 27
  • 41
0

In PHP, I use this function, it returns an array with the table names used in a sql statement:

function sql_query_get_tables($statement){
    preg_match_all("/(from|into|update|join) [\\'\\´]?([a-zA-Z0-9_-]+)[\\'\\´]?/i",
            $statement, $matches);
    if(!empty($matches)){
        return array_unique($matches[2]);
    }else return array();
}

Notice that it does not work with a,b joins or schema.tablename naming

I hope it works for you

itsjavi
  • 2,574
  • 2
  • 21
  • 24