0

I have a large SQL database containing 'curves'. Each curve has an ID (curveid). I'm trying to determine main users of each curve and if it is used at all. To enable this, the DBAs are providing logs of all statements executed against the database.

These statements can very in complexity. All I want to do is extract what curveids are being queried for.

Example statement follows:

WITH G AS ( SELECT [Timevalue] FROM [mc].[GranularityLookup] 
WHERE [TimeValue] BETWEEN '19-Jul-2017 00:00' AND '30-Sep-2017 00:00' 
AND [1 Hr] = 1), 
D AS ( SELECT [CurveID], [DeliveryDate], [PublishDate], AVG([Value]) Value, MAX([PeriodNumber]) PeriodNumber 
FROM mc.CURVEID_6657_1_LATEST data 
JOIN 
(SELECT CurveID ID, DeliveryDate dDate, MAX(PublishDate) pDate 
FROM mc.CURVEID_6657_1_LATEST
WHERE CurveID = 90564
    AND DeliveryDate >= '19-Jul-2017 00:00' AND DeliveryDate <= '30-Sep-2017 00:00'
GROUP BY DeliveryDate,  CurveID ) Dates 
ON data.DeliveryDate = dates.dDate AND data.PublishDate = dates.pDate 
WHERE data.CurveID = 90564
AND data.DeliveryDate >= '19-Jul-2017 00:00' AND data.DeliveryDate <= '30-Sep-2017 00:00'
GROUP BY [CurveID], [PublishDate], [DeliveryDate] )
SELECT 
G.[TimeValue] [Deliver
yDate] , D.[PublishDate], D.[Value], D.[PeriodNumber]
FROM 
G
LEFT JOIN 
D
ON 
G.[TimeValue] = D.[DeliveryDate]
ORDER BY DeliveryDate ASC, PeriodNumber ASC, publishDate DESC

From this statement, all I'm interested in is extracting that the user queried for curveid 90564.

The statement may also resemble either of the following:

SELECT * FROM anytable WHERE curveid = 123 AND deliverydate BETWEEN '2017-01-01' AND 2017-02-01'

or

SELECT * FROM mc.anytable WHERE curveid IN (1,2,3,4,5,6,7)

Again, all I want to know is the curve IDs. I don't care about any of the other clauses.

I'm using the Microsoft.SqlServer.TransactSql.ScriptDom namespace to parse the SQL and have got to the point where I can identify all WHERE statements using code similar to below (pieced together from some other samples):

string sql = @"WITH 
            G AS ( SELECT [Timevalue] FROM [mc].[GranularityLookup] 
            WHERE [TimeValue] BETWEEN '19-Jul-2017 00:00' AND '30-Sep-2017 00:00' 
            AND [1 Hr] = 1), 
            D AS ( SELECT [CurveID], [DeliveryDate], [PublishDate], AVG([Value]) Value, MAX([PeriodNumber]) PeriodNumber 
            FROM mc.CURVEID_6657_1_LATEST data 
            JOIN 
            (SELECT CurveID ID, DeliveryDate dDate, MAX(PublishDate) pDate 
            FROM mc.CURVEID_6657_1_LATEST
            WHERE CurveID = 90564
                AND DeliveryDate >= '19-Jul-2017 00:00' AND DeliveryDate <= '30-Sep-2017 00:00'
            GROUP BY DeliveryDate,  CurveID ) Dates 
            ON data.DeliveryDate = dates.dDate AND data.PublishDate = dates.pDate 
            WHERE data.CurveID = 90564
            AND data.DeliveryDate >= '19-Jul-2017 00:00' AND data.DeliveryDate <= '30-Sep-2017 00:00'
            GROUP BY [CurveID], [PublishDate], [DeliveryDate] )
            SELECT 
            G.[TimeValue] [Deliver
            yDate] , D.[PublishDate], D.[Value], D.[PeriodNumber]
            FROM 
            G
            LEFT JOIN 
            D
            ON 
            G.[TimeValue] = D.[DeliveryDate]
            ORDER BY DeliveryDate ASC, PeriodNumber ASC, publishDate DESC";
            var parser = new TSql120Parser(false);

            IList<ParseError> errors;
            var fragment = parser.Parse(new StringReader(sql), out errors);

            var whereVisitor = new WhereVisitor();
            fragment.Accept(whereVisitor);

            //  I now have all WHERE clauses in whereVisitor.WhereStatements

class WhereVisitor : TSqlConcreteFragmentVisitor
{
    public readonly List<WhereClause> WhereStatements = new List<WhereClause>();

    public override void Visit(WhereClause node)
    {
        WhereStatements.Add(node);
    }

}

Each of the clauses in whereVisitor.WhereStatements (3 in this example) expose a property called SearchCondition. Unfortunately, this is where I run out of ideas. What I want to achieve is logic as per below:

foreach (var clause in whereVisitor.WhereStatements)
{
    //  IF any part of the clause filters based on curveid THEN

    //        Capture curveIDs

    //  END IF
}

Other details:

  • Using C# (.net 4.0)
  • SQL Server 2008
  • DLL for this is Microsoft.SqlServer.TransactSql.ScriptDom (located in my case at 'c:\Program Files (x86)\Microsoft SQL Server\130\Tools\PowerShell\Modules\SQLPS\Microsoft.SqlServer.TransactSql.ScriptDom.dll')

Edit 1

Some additional info:

  • CurveID is a key to another table. It would not make sense in this case to operate on it (e.g. curveId+1 or curveId <= 10).

Edit 2 (Partial Solution)

Having the following visitor helps with the case where clause resembles curveid = 123:

class CurveIdVisitor : TSqlConcreteFragmentVisitor
{
    public readonly List<int> CurveIds = new List<int>();

    public override void Visit(BooleanComparisonExpression exp)
    {
        if (exp.FirstExpression is ColumnReferenceExpression && exp.SecondExpression is IntegerLiteral )
        {
            //  there is a possibility that this is of the ilk 'curveid = 123'
            //  we will look for the 'identifier'
            //  we take the last if there are multiple.  Example:
            //      alias.curveid
            //  goives two identifiers: alias and curveid
            if (
                ((ColumnReferenceExpression) exp.FirstExpression).MultiPartIdentifier.Identifiers.Last().Value.ToLower() ==
                "curveid")
            {
                //  this is definitely a curveid filter
                //  Now to find the curve id
                int curveid = int.Parse(((IntegerLiteral) exp.SecondExpression).Value);
                CurveIds.Add(curveid);
            }
        }
GinjaNinja
  • 776
  • 6
  • 18
  • The `SearchCondition` returns a [boolean expression](https://msdn.microsoft.com/en-us/library/microsoft.sqlserver.transactsql.scriptdom.booleanexpression.aspx) of some type, which you then can walk through further until you find a term for `curveID` (or an alias). But for anything except very simple terms it's hard to get which curves it hits. `where curveID+1=12-6` might be possible, `where curveID=12 or (1=1)` will be harder to understand and `where curveID > (select max(id) from table2)` will not tell you anything. Or try `with g as (select id * 2 as curveid ...) ... where g.curveid = 2`. – Solarflare Jul 25 '17 at 14:21
  • @Solarflare I have edited question above to provide some further clarification regarding the other cases you mention. Looking at the SearchCondition there seems to be a horrendously complex object model and I'm hoping that someone can provide a definitive answer on how to use it. Thanks – GinjaNinja Jul 25 '17 at 14:31
  • The object will just help you interpret the code. It will e.g. disassemble a binary expression to the 3 parts "term1 operator term2". If you know your sql text can only look like `curveid = something` or `curveid IN something`, then you can just extract the the parts by walking through the string. But then the object would be simple too - and still easier. To e.g. know where the 2nd term in `curveid = 1 + 2 AND somethingelse` ends, you would need some logic (that the boolean expression already uses). But maybe you are right, and someone has a different idea. – Solarflare Jul 25 '17 at 14:50
  • Sorry - I'm not entirely following what you are suggesting. "The object will just help you interpret the code".... that's exactly what I want it to do :) I could just comb the string and forget about the parser altogether but the issue there is the options are endless. [curveid] = 5 or curveid =5 etc. – GinjaNinja Jul 25 '17 at 14:52
  • Exactly what I mean. Since the options are endless, you will have to apply some logic (which, apart from very basic situations, will more or less come down to creating your own state machine/parser again). So while it might be a "horrendously complex object model", the boolean expression will do what you need and already do most of this "endless option"-work for you by reducing the query to terms (which, as I mentioned earlier, might still not tell you which curves are used in almost all but simple cases). I don't see much middle ground, but then again, maybe someone has a different idea. – Solarflare Jul 25 '17 at 15:08
  • I think I now understand you. So my question becomes, are you able to help me using the BooleanExpression object? – GinjaNinja Jul 25 '17 at 15:14

1 Answers1

0

Finally solved this and hope this benefits someone else in the future. Perhaps someone else may read ni time and provide a better solution.

public class SqlParser
{
    public List<int> GetQueriedCurveIds(string sql)
    {
        var parser = new TSql120Parser(false);

        IList<ParseError> errors;
        var fragment = parser.Parse(new StringReader(sql), out errors);

        List<int> curveIds = new List<int>();
        CurveIdVisitor cidv = new CurveIdVisitor();
        InPredicateVisitor inpv = new InPredicateVisitor();
        fragment.AcceptChildren(cidv);
        fragment.AcceptChildren(inpv);

        curveIds.AddRange(cidv.CurveIds);
        curveIds.AddRange(inpv.CurveIds);
        return curveIds.Distinct().ToList();
    }
}



class CurveIdVisitor : TSqlConcreteFragmentVisitor
{
    public readonly List<int> CurveIds = new List<int>();

    public override void Visit(BooleanComparisonExpression exp)
    {
        if (exp.FirstExpression is ColumnReferenceExpression && exp.SecondExpression is IntegerLiteral )
        {
            //  there is a possibility that this is of the ilk 'curveid = 123'
            //  we will look for the 'identifier'
            //  we take the last if there are multiple.  Example:
            //      alias.curveid
            //  goives two identifiers: alias and curveid
            if (
                ((ColumnReferenceExpression) exp.FirstExpression).MultiPartIdentifier.Identifiers.Last().Value.ToLower() ==
                "curveid")
            {
                //  this is definitely a curveid filter
                //  Now to find the curve id
                int curveid = int.Parse(((IntegerLiteral) exp.SecondExpression).Value);
                CurveIds.Add(curveid);
            }
        }
    }
}

class InPredicateVisitor : TSqlConcreteFragmentVisitor
{
    public readonly List<int> CurveIds = new List<int>();

    public override void Visit(InPredicate exp)
    {
        if (exp.Expression is ColumnReferenceExpression)
        {
            if (
                ((ColumnReferenceExpression) exp.Expression).MultiPartIdentifier.Identifiers.Last().Value.ToLower() ==
                "curveid")
            {
                foreach (var value in exp.Values)
                {
                    if (value is IntegerLiteral)
                    {
                        CurveIds.Add(int.Parse(((IntegerLiteral)value).Value));
                    }
                }
            }
        }
    }
}

This is cut down code to demonstrate answer. In real life, you would want to check the ParseError collection and add some error handling!

GinjaNinja
  • 776
  • 6
  • 18