
I need a pattern-interpretation and rule-generation system. It should parse through text, interpret patterns in it, and, based on those interpretations, output a set of rules. Here is an example. Let's say I have an HTTP header which looks like this:

GET https://website.com/api/1.0/download/8hqcdzt9oaq8llapjai1bpp2q27p14ah/2139379149 HTTP/1.1
Host: website.com
User-Agent: net.me.me/2.7.1;OS/iOS-5.0.1;Apple/iPad 2 (GSM)
Accept: */*
Accept-Language: en-us
Accept-Encoding: gzip, deflate

The parser would run through this and output

req-hdr-pattern: "^GET[ ].*/api/1\\.0/download/{STRING:auth_token}/{STRING:id}[].*website\\.com"

The above rule contains a modified version of regex. Each variable, e.g. STRING:auth_token or STRING:id, is to be extracted.

To parse the text (the header in this case), I will have to tell the parser that it needs to extract whatever comes after "download". So there is a set of parsing rules that the parser uses to work through the text and eventually output the final rule.

Now the question: is there any Python module for pattern matching, detection, and generation that can help me with this? This is somewhat like the parser part of a compiler. I wanted to ask before going deep into trying to make one myself. Any help?

auny

3 Answers


I think this has already been answered in:

Parser generation

Python parser Module tutorial

I can assure you that what you want is easy with the pyparsing module.

Zaka Elab

You'd best do this yourself. It is not much work.

As you say, you'd have to define regular expressions as rules. Your program would then find the matching regular expression and transform the match into an output rule.

** EDIT ** I do not think there is a library to do this. If I understand you correctly, you want to specify a set of rules like this one:

EXTRACT AFTER download

And this will output a text like this:

req-hdr-pattern: "^GET[ ].*/api/1\\.0/download/{STRING:auth_token}/{STRING:id}[].*website\\.com"

For this you'd have to create a parser that would parse your rules. Depending on the complexity of the rule syntax, you could use pyparsing, use regular expressions, or do it by hand. My rule of thumb: if your syntax is recursive (like HTML), it makes sense to use pyparsing; otherwise it is not worth it.

From these parsed rules your program would have to create new regular expressions to match the input text. Basically, your program would translate rules into regular expressions.

Using these regular expressions, you'd match and extract the data from your input text.
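
As a rough illustration of those steps, here is a minimal Python sketch using only the standard library. It assumes a hypothetical rule syntax, `EXTRACT AFTER download AS auth_token, id`; the `AS ...` part is an addition of this example so the program knows what to name the variables. The helper names `parse_rule`, `escape` and `generate_pattern` are also made up. The output only loosely follows the format in the question: dots are escaped once, and the separator before the host is written as `[ ].*`, which is an assumption.

```python
import re

# Hypothetical rule syntax: EXTRACT AFTER <keyword> [AS <name>, <name>, ...]
RULE_RE = re.compile(
    r'^EXTRACT\s+AFTER\s+"?(?P<keyword>\w+)"?'
    r'(?:\s+AS\s+(?P<names>\w+(?:\s*,\s*\w+)*))?$',
    re.IGNORECASE,
)


def parse_rule(rule):
    """Return (keyword, [variable names]) for one parsing rule."""
    m = RULE_RE.match(rule.strip())
    if m is None:
        raise ValueError("unrecognised rule: %r" % rule)
    names = m.group("names")
    return m.group("keyword"), [n.strip() for n in names.split(",")] if names else []


def escape(text):
    """Backslash-escape regex metacharacters (assumed escaping convention)."""
    return re.sub(r"([.^$*+?()\[\]{}|\\])", r"\\\1", text)


def generate_pattern(header_text, rule):
    """Turn an HTTP request header plus one parsing rule into an output pattern."""
    keyword, names = parse_rule(rule)
    lines = header_text.strip().splitlines()
    method, url, _version = lines[0].split()
    host = next((line.split(":", 1)[1].strip() for line in lines[1:]
                 if line.lower().startswith("host:")), None)
    if host is None:
        raise ValueError("no Host header found")

    path = re.sub(r"^https?://[^/]+", "", url)      # drop scheme and host
    segments = path.split("/")
    idx = [s.lower() for s in segments].index(keyword.lower())
    fixed = "/".join(segments[:idx + 1])            # literal part, up to the keyword
    variable = segments[idx + 1:]                   # segments to be extracted

    if not names:                                   # fall back to generic names
        names = ["var%d" % i for i in range(1, len(variable) + 1)]
    if len(names) != len(variable):
        raise ValueError("rule names %d variables, URL has %d"
                         % (len(names), len(variable)))

    placeholders = "".join("/{STRING:%s}" % n for n in names)
    return "^%s[ ].*%s%s[ ].*%s" % (method, escape(fixed), placeholders, escape(host))


header = """GET https://website.com/api/1.0/download/8hqcdzt9oaq8llapjai1bpp2q27p14ah/2139379149 HTTP/1.1
Host: website.com
Accept: */*
"""
print(generate_pattern(header, "EXTRACT AFTER download AS auth_token, id"))
# ^GET[ ].*/api/1\.0/download/{STRING:auth_token}/{STRING:id}[ ].*website\.com
```

A rule language this small is handled by a single regular expression; a richer or recursive syntax is where pyparsing would start to pay off.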

Hans Then
  • No, the point is that I should not have to write regex expressions. There would be no advantage to my solution then; I could just write the rule manually. – auny Sep 14 '12 at 13:39
  • Do I understand you correctly like this? You want a program to parse through some text, extract patterns from it and then output those patterns as rules? – Hans Then Sep 14 '12 at 13:44
  • Yes, but in order to extract those patterns, I should give only the exact pattern and its context. No regex. Call this a parsing rule. The parsing rule will look like EXTRACT AFTER "DOWNLOAD". Do you get what I mean? – auny Sep 14 '12 at 13:48
  • No, the rule you mentioned above is to be the OUTPUT of the parser. The text will not match this. The parser should parse text and then output this rule. – auny Sep 14 '12 at 17:43
  • I am afraid I don't understand what you want. You say you should "give only the exact pattern and its context". What do you mean by that? Give implies some input. Can you please give an example as to the input pattern? – Hans Then Sep 14 '12 at 17:48
  • OK, so we have the HTTP header. I want to make a program that will run over this HTTP header text and output this: 'req-hdr-pattern: "^GET[ ].*/api/1\\.0/download/{STRING:auth_token}/{STRING:id}[].*website\\.com"'. To produce the output, the program will need to know where to put the extraction variables. For that I will have to tell it using some parsing rule (don't call the output a rule, just think of it as a string). That parsing rule will be EXTRACT AFTER "Download". So the parser will interpret that and output the required string. – auny Sep 14 '12 at 17:52
  • So "EXTRACT AFTER Download" will be your parsing rule? I think there is still some information missing. How will the program know that what follows after download will be an authentication token and an id? Or are the URLs that come in always the same? – Hans Then Sep 14 '12 at 17:57
  • That's right. I want to write simple language-like parsing rules which would enable the program to skim through all the text and output all such strings (which in turn are regex expressions, but no worries about that, I have that working). – auny Sep 14 '12 at 18:01
  • Well, that is easy to handle. The URL can't just be ignored; we can append another modifier to the rule: "EXTRACT AFTER DOWNLOAD HOST IS website.com". Extraction can be handled in the same way. I might have to think through the exact solution, but that is the level of intelligence I want to build into it. – auny Sep 14 '12 at 18:03
  • The eventual goal is to extract and make patterns out of HTTP traffic without human intervention :) – auny Sep 14 '12 at 18:06
  • let us [continue this discussion in chat](http://chat.stackoverflow.com/rooms/16706/discussion-between-hans-then-and-auny) – Hans Then Sep 14 '12 at 18:11

Sorry if this is not quite what you're looking for, but I'm a little rushed for time. The re module documentation for Python contains a section on writing a tokenizer.
It's under-documented, but might help you in making something workable.
It is certainly easier than tokenizing things yourself, though it may not provide the flexibility you seem to be after.
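
For a rough idea of what that looks like, here is a small sketch in the style of that tokenizer section: one named group per token type, joined with `|`, and `lastgroup` to tell which one matched. The token names and the rule syntax (`EXTRACT AFTER "download" HOST IS website.com`, borrowed from the comments on the other answer) are assumptions for illustration, not an existing grammar.

```python
import re

# Assumed token types for a tiny rule language; order matters because
# the first alternative that matches at a position wins.
TOKEN_SPEC = [
    ("KEYWORD", r"\b(?:EXTRACT|AFTER|HOST|IS)\b"),
    ("STRING", r'"[^"]*"'),        # double-quoted literal
    ("WORD", r"[\w.\-]+"),         # bare word such as a host name
    ("SKIP", r"\s+"),              # whitespace, discarded
    ("MISMATCH", r"."),            # anything else is an error
]
TOKEN_RE = re.compile("|".join("(?P<%s>%s)" % pair for pair in TOKEN_SPEC))


def tokenize(rule):
    """Yield (kind, value) pairs for one parsing rule."""
    for mo in TOKEN_RE.finditer(rule):
        kind = mo.lastgroup
        value = mo.group()
        if kind == "SKIP":
            continue
        if kind == "MISMATCH":
            raise ValueError("unexpected character %r at position %d"
                             % (value, mo.start()))
        if kind == "STRING":
            value = value[1:-1]    # strip the surrounding quotes
        yield kind, value


for token in tokenize('EXTRACT AFTER "download" HOST IS website.com'):
    print(token)
```

The resulting token stream still has to be interpreted by your own code, so this only covers the first stage of the problem.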