OpenText DokuStar Capture Center Extraction Enhancement

Question

There is virtually no documentation and there are no code snippets for programming inside OpenText Capture Center, so I need some input from someone with experience.

Here is the crux of what I need: in the Scripting Manager, I need to be able to access all of the Phrase objects that the OCR identified in the document, regardless of which Fields were matched or identified during extraction.

As long as I have access to the OCR phrases, I can do two things that will greatly increase our matching percentage on any field.

1. Sanitize and transform the invoice phrases as a pre-processing step before matching occurs (e.g. turn Corporation into CORP, remove apostrophes, etc.).
2. Write a custom matching function that understands our data better than the native Generic SnapMatch.
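
To make the two ideas above concrete, here is a minimal Python sketch of what I have in mind. It runs outside Capture Center and uses no OpenText API; `ABBREVIATIONS`, `normalize`, and `best_match` are hypothetical names, and the standard-library `difflib` similarity ratio only stands in for whatever scoring a real custom matcher would use.

```python
import difflib
import re

# Hypothetical abbreviation table; extend it to fit your own vendor data.
ABBREVIATIONS = {
    "CORPORATION": "CORP",
    "INCORPORATED": "INC",
    "COMPANY": "CO",
    "LIMITED": "LTD",
}


def normalize(phrase: str) -> str:
    """Upper-case, drop apostrophes, strip punctuation, abbreviate known words."""
    text = re.sub(r"['\u2019]", "", phrase.upper())   # remove straight and curly apostrophes
    text = re.sub(r"[^A-Z0-9 ]+", " ", text)          # other punctuation becomes a space
    words = [ABBREVIATIONS.get(word, word) for word in text.split()]
    return " ".join(words)


def best_match(phrase, candidates, cutoff=0.8):
    """Return the candidate most similar to phrase after normalization,
    or None if no candidate reaches the cutoff (an assumed threshold)."""
    target = normalize(phrase)
    best_name, best_score = None, 0.0
    for name in candidates:
        score = difflib.SequenceMatcher(None, target, normalize(name)).ratio()
        if score > best_score:
            best_name, best_score = name, score
    return best_name if best_score >= cutoff else None
```

Both the OCR phrase and the database name go through the same `normalize` step, so differences in punctuation or abbreviation do not count against the match.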

Thanks!

This is an extraordinarily narrow question. You may find you get better results from talking to OpenText directly. Good luck though! — tomfanning, Dec 14 '12 at 16:35
Thanks for wishing me luck. I acknowledge your observation: yes, it is a very specific question, which should make it very simple to answer. Why is it that every OCC question on Stack Overflow gets directed to OpenText, who charge for support? Should every C# question be directed to the Microsoft help desk? — David C, Dec 14 '12 at 18:34
Thing is, literally millions of people use C# every day. The number of developers building solutions against OpenText products is, I would hazard, tiny compared to C#. Hence me wishing you good luck here, and suggesting you might get better results from your vendor. Sorry I don't know the answer to your specific issue. — tomfanning, Dec 14 '12 at 23:37
Yeah I was just hoping there was at least 1 human on StackOverflow that would be familiar with OpenText. ** sigh ** — David C, Dec 17 '12 at 15:15

score 0 · Accepted Answer · answered Feb 05 '13 at 21:33

OK, ultimately there is no way to do this via the Scripting Manager entry points. The reason is that all the image data is parsed and extracted before the Scripting Manager is entered. By the time you get to the extraction phase of the manager, you have an XML runtime document that represents the meta structure of the output document, populated with the data that the extraction "thought might be useful" beforehand. Every other extracted "phrase" or data item that did not fit a field, either directly or as an alternative, is discarded. This means that a Vendor Name or similar value that DokuStar did not find interesting cannot be searched by any code mechanism.

The problem I needed to solve was very specific to my particular domain, and was caused indirectly by a policy of the Oracle group: the names of vendors were stripped of special characters and concatenated. Basically, they just did not match what was on the invoice, and therefore SnapMatch was virtually useless.

I created an intermediate solution whereby users can update the local SnapMatch database directly, a "Rename Vendor" function so to speak. As we make corrections, our local SnapMatch database comes to match what is on the invoices, even if the Oracle database doesn't. All in all, it is not a solution on the coding side, but it turned out to be an effective solution to the domain issue.
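
For anyone wanting to picture the "Rename Vendor" idea, here is a rough Python sketch of a local override table. It is not the SnapMatch database schema or any OpenText API; `LocalVendorTable`, `rename_vendor`, and `find` are hypothetical names, and the vendor ID and names in the usage lines are made up for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class LocalVendorTable:
    """Master vendor names plus user-supplied corrections (hypothetical structure)."""
    master: Dict[str, str]                                  # vendor id -> name from the master system
    renames: Dict[str, str] = field(default_factory=dict)   # vendor id -> name as printed on invoices

    def rename_vendor(self, vendor_id: str, invoice_name: str) -> None:
        """Record the name a user saw on the invoice; the master data is left untouched."""
        if vendor_id not in self.master:
            raise KeyError(vendor_id)
        self.renames[vendor_id] = invoice_name

    def match_name(self, vendor_id: str) -> str:
        """Name used for matching: the local correction if one exists, else the master name."""
        return self.renames.get(vendor_id, self.master[vendor_id])

    def find(self, invoice_name: str) -> Optional[str]:
        """Return the vendor id whose matching name equals invoice_name, ignoring case."""
        wanted = invoice_name.strip().upper()
        for vendor_id in self.master:
            if self.match_name(vendor_id).strip().upper() == wanted:
                return vendor_id
        return None


table = LocalVendorTable(master={"V1": "OBRIENSONSINC"})
before = table.find("O'Brien & Sons, Inc.")    # no match against the stripped master name
table.rename_vendor("V1", "O'Brien & Sons, Inc.")
after = table.find("O'Brien & Sons, Inc.")     # now resolves to "V1"
```

The point of the design is that corrections live only in the local table, so the master system's naming policy does not have to change.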
