Pythonic way to ID a mystery file, then call a filetype-specific parser for it? Class creation q's

Question

(note) I would appreciate help generalizing the title. I am sure that this is a class of problems in OO land, and probably has a reasonable pattern, I just don't know a better way to describe it.

I'm considering the following -- Our server script will be called by an outside program, and have a bunch of text dumped at it, (usually XML).

There are multiple possible types of data we could be getting, and multiple versions of the data representation we could be getting, e.g. "Report Type A, version 1.2" vs. "Report Type A, version 2.0"

We will generally want to do the same thing action with all the data -- namely, determine what sort and version it is, then parse it with a custom parser, then call a synchronize-to-database function on it.

We will definitely be adding types and versions as time goes on.

So, what's a good design pattern here? I can come up with two, both seem like they may have som problems.

Option 1

Write a monolithic ID script which determines the type, and then imports and calls the properly named class functions.
Benefits
- Probably pretty easy to debug,
- Only one file that does the parsing.
Downsides
- Seems hack-ish.
- It would be nice to not have to create knowledge of dataformats in two places, once for ID, once for actual merging.

Option 2

Write an "ID" function for each class; returns Yes / No / Maybe when given identifying text.
the ID script now imports a bunch of classes, instantiates them on the text and asks if the text and class type match.
Upsides:
- Cleaner in that everything lives in one module?
Downsides:
- Slower? Depends on logic of running through the classes.

Put abstractly, should Python instantiate a bunch of Classes, and consume an ID function, or should Python instantiate one (or many) ID classes which have a paired item class, or some other way?

Could you make this clearer? You don't want to detect unknown filetypes in general (e.g. JPG vs MP3). You already know it's XML, you just want to know which format and version. I don't see a compelling reason for writing multiple custom parsers, why not just avoid that by using (say) the excellent [ElementTree](http://effbot.org/zone/element-index.htm), or else [SAX](http://docs.python.org/library/xml.sax.html). For a three-way comparison, see [this](http://stackoverflow.com/questions/192907/xml-parsing-sax-vs-dom-vs-elementtree). This sidesteps what sounds to me like unnecessary class design — smci, Aug 15 '11 at 21:47
An XML file comes in. It could represent an Account, or a Transaction, or an Investment request. It could be sourced from a new or old version of our sync tool. It could be sourced from a third-party's sync tool. I am not concerned with ID'ing the files, I am concerned with an architecture for doing the ID and processing. Hope this clarifies! — Peter V, Aug 15 '11 at 21:49
I don't see any compelling reason you can't use one parser and one architecture (I guess it's the methods for handling it that differ). Can you post some data file snippets to show why it can't all be unified? I don't understand what you mean by 'sourced from a sync tool' inasmuch as how does that change the format of the XML (post an example)? — smci, Aug 15 '11 at 21:53
Devin has the same perspective. In brief, we might see a file like , or , or.. We essentially politely take whatever is thrown at us by thousands of remote data sync tools (ten or so unique ones) and try and parse what they send. — Peter V, Aug 16 '11 at 18:56

score 2 · Accepted Answer · answered Aug 15 '11 at 21:53

You could use the Strategy pattern which would allow you to separate the logic for the different formats which need to be parsed into concrete strategies. Your code would typically parse a portion of the file in the interface and then decide on a concrete strategy.

The structure of the strategy pattern.

As far as defining the grammar for your files I would find a fast way to identify the file without implementing the full definition, perhaps a header or other unique feature at the beginning of the document. Then once you know how to handle the file you can pick the best concrete strategy for that file handling the parsing and writes to the database.

Thanks, this pretty much answers my question; that is, I'll generate a context with one set of functionality and match to a strategy. — Peter V, Aug 16 '11 at 18:54

Pythonic way to ID a mystery file, then call a filetype-specific parser for it? Class creation q's

1 Answers1