Methods to identify possible barcode symbology from stored text string?

Question:

Is there a method [1] or best practice(s) to identify what barcode symbologies are valid for a specific piece of data, after the scanning occurs?

~ (without having access to images or hardware; only a single string or database field is available with no relevant metadata)

Discussion:

Since the same data or number could be used for different purposes not under our control (e.g. 1 11111 11111 1 for a product UPC and 111-11111111-1 for an invoice or order number, both printed by third-parties), identifying the symbology has several useful purposes.

Ideally, the scanner is programmed to report the symbology to the application (as a one- to three-character data prefix, using one of the Symbol or AIM standards), but that ability is not guaranteed (e.g. a scanner breaks after-hours and another is borrowed that is not configured in the same manner, or the data is hand-keyed by a human).

Immediate use case: identify possible (i.e. valid) symbologies from text string stored in a database field.

Example:

Stored barcode value: “049000042566”

Found attributes:

numbers only; no lower-case, upper-case, symbols, or control codes
length: 12 digits
no symbology identifier prefix (neither Symbol nor AIM-style)

Result (internal; not necessarily displayed):

Likely: UPC-A (exact length match), EAN 13 (trimmed initial digit or missing check digit), trimmed GTIN 14
Probable alternatives: any variable-length symbology
Possible: any variable-length symbology, numeric or alpha-numeric
Unlikely, but possible: stacked or large symbologies unsuited for short numerical values
Rejected / impossible symbologies: UPC-E, EAN 8
Rejected data contents: UPS tracking number, URI/URL, etc.
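
As a rough illustration of why UPC-A ends up in the "likely" bucket for this value, here is a minimal sketch of the GS1 mod-10 check digit calculation used by the UPC/EAN family. The function names are made up for this post. A passing check digit only supports a candidate; it proves nothing, since roughly one in ten random digit strings will also pass.

```python
def gs1_check_digit(data_digits: str) -> int:
    """Return the GS1 mod-10 check digit for the digits that precede it."""
    if not (data_digits.isascii() and data_digits.isdigit()):
        raise ValueError("data_digits must contain only the digits 0-9")
    total = 0
    # Walking right to left, the weights alternate 3, 1, 3, 1, ...
    for i, ch in enumerate(reversed(data_digits)):
        total += int(ch) * (3 if i % 2 == 0 else 1)
    return (10 - total % 10) % 10


def has_valid_gs1_check_digit(value: str) -> bool:
    """True if the last digit of value equals the computed check digit."""
    if len(value) < 2 or not (value.isascii() and value.isdigit()):
        return False
    return gs1_check_digit(value[:-1]) == int(value[-1])


stored = "049000042566"
print(has_valid_gs1_check_digit(stored))        # True
print(has_valid_gs1_check_digit("0" + stored))  # True (same value zero-padded to 13 digits)
```

Because the weights are aligned from the right, zero-padding on the left does not change the result, which is why the same stored value also passes as a 13-digit number.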

Purpose / scenario:

Why is this wanted? Imagine a kiosk scanning station. Someone scans a barcode, and depending on the symbology (to minimize domain collisions), the system either provides information from the correct database or prompts the user for data entry. (In the future, a voice response in lieu of a display may be used.)

In a warehouse, a clerk scans a barcode. The system determines it is likely a UPC code, finds it in the inventory database, then displays inventory status and shelf location for re-shelving the item.

The clerk scans a second barcode. The system determines it is also likely a UPC code, but does not find it in the inventory database. The clerk is prompted to enter product data from the scanned object.

Next, the clerk scans an invoice from a vendor. The scanned value doesn’t match known data patterns and is not found as an exception in the database. The system asks the clerk to classify the scanned object (e.g. invoice, packing slip, other trade item, other / unknown). The clerk selects “invoice” and is prompted to enter data or file the invoice in a particular mail slot.

Finally, the clerk scans another invoice. The system identifies the likely symbology as ‘code 39’, and since we only have one vendor who has ever used that symbology, the clerk is instructed to put that invoice in a particular mail slot.

The system is designed to require almost no training (in the system operation, that is), as consistent training is often lacking or impractical when operating with volunteers or personnel from other departments in a disaster response situation.

Notes:

Future possible use case from the same effort: given a specific string, identify possible valid symbology choices for that data from which the operator can choose (e.g. from an ordered list of preferred symbology types).

Tertiary use case: analyze log of scanned barcodes to determine which symbologies are in use, to inform choices on future barcodes to use in the organization or to inform new/replacement equipment purchases (e.g. perhaps we’re buying expensive 2D scanners when 99.95% of barcodes scanned would only require an inexpensive laser scanner).

—

Possible hiccups:

Data was manually keyed, and either contains an error or extraneous symbols (e.g. spaces in a UPC code; special characters, such as tab/cr/non-breaking space, were inadvertently entered from a copy/paste operation)
Data was automatically entered by a barcode scanner, but with a differently-configured scanner than expected (included symbology prefix or carriage return when none was expected, or excluded symbology prefix when one was expected)
Data was automatically entered by a barcode scanner, but the barcode check digit was not stored when entered
Data is imported later; example: the system was down, so workers scanned barcodes using an app on their phone for future processing
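
For the first two hiccups, a cleanup pass before any analysis might look like the sketch below. It assumes that stray whitespace and invisible characters are noise, which is not always true (a space is valid data in Code 39, for example), so the raw value should probably be stored alongside the cleaned one. The function name and the `keep` parameter are hypothetical; by default the ASCII group separator (0x1D) is kept, since scanners commonly emit it for FNC1 in GS1 data.

```python
import unicodedata

# Unicode categories treated as noise: separators (including the non-breaking
# space), control characters (tab, CR, LF) and invisible format characters.
NOISE_CATEGORIES = {"Zs", "Zl", "Zp", "Cc", "Cf"}


def normalize_scan(raw: str, keep: str = "\x1d") -> str:
    """Drop whitespace and control/format characters, except those in keep."""
    return "".join(
        ch for ch in raw
        if ch in keep or unicodedata.category(ch) not in NOISE_CATEGORIES
    )


print(normalize_scan("0 49000 04256 6\r"))       # 049000042566
print(normalize_scan("\u00a0049000042566\t"))    # 049000042566
```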

Clearly, having the scanner provide a symbology prefix would make this entire question trivial, but there may be cases where that cannot be enabled by policy (e.g. another app on the same computer, such as a web form, requires barcode input without a prefix) or is limited by the system used (i.e. the data is entered into a website field or via a mobile app that does not create those prefixes for scanned codes).

Comments and footnotes:

To be clear, I’m sure I can brute force this [2], but would hate to reinvent the wheel for something it seems must already exist somewhere [3]. Since it can be done to an extent offline, I’d hate to have to rely on a network API (and thus make internet connectivity a dependency).

Creating reusable, modular code is of course the preferred solution.

—

[1] e.g. an existing library, sample function, or some clever algorithm - in any language or pseudo-code

[2] My initial newbie thought was to populate a boolean array with “true” values for every possible symbology, then rule out those invalid for the data in question by setting to “false” any such symbology (e.g. does length match fixed length, does value contain invalid symbols for a particular symbology, are there control codes such as FNC1 or ASCII control characters that implicate a certain symbology).
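
A minimal sketch of that elimination idea, with a deliberately tiny rule table. The table is an illustrative subset rather than a complete or authoritative list: it models only the allowed character set and the fixed length (check digit included) of each entry, and ignores finer rules such as Interleaved 2 of 5 needing an even digit count. All names here are made up for this post.

```python
import string

DIGITS = set(string.digits)
CODE39 = set(string.digits + string.ascii_uppercase + " -.$/+%")
ASCII = {chr(i) for i in range(128)}

# name -> (allowed characters, fixed lengths incl. check digit, or None if variable)
SYMBOLOGIES = {
    "UPC-A": (DIGITS, {12}),
    "UPC-E": (DIGITS, {8}),
    "EAN-8": (DIGITS, {8}),
    "EAN-13": (DIGITS, {13}),
    "Interleaved 2 of 5": (DIGITS, None),
    "Code 39": (CODE39, None),
    "Code 128": (ASCII, None),
}


def possible_symbologies(value: str, allow_missing_check_digit: bool = True) -> dict:
    """Start with every symbology 'true', then set the impossible ones to False."""
    value = value.strip()
    candidates = {name: bool(value) for name in SYMBOLOGIES}  # empty input: nothing fits
    for name, (alphabet, lengths) in SYMBOLOGIES.items():
        if not set(value) <= alphabet:
            candidates[name] = False      # contains a character the symbology cannot encode
            continue
        if lengths is not None:
            fits = len(value) in lengths
            if allow_missing_check_digit:
                fits = fits or (len(value) + 1) in lengths
            if not fits:
                candidates[name] = False  # wrong length for a fixed-length symbology
    return candidates


result = possible_symbologies("049000042566")
print([name for name, ok in result.items() if ok])
```

For the 12-digit example value this leaves UPC-A, EAN-13 (on the assumption that a check digit may have been dropped) and the three variable-length entries, and rules out UPC-E and EAN-8.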

Reminder: depending on the amount of processing done, some substrings always found in machine scans of a particular symbology are not always (or never) printed in a human-readable form under the barcode.

(For example, if the operator manually enters a tracking number from a USPS IMpb, the result will be a 22-digit number starting with a number such as “92”. When machine-scanned, especially if not parsed as GS1 data, the data will be a 34-digit number: starting with the AI “420” followed by a nine-digit postal code, neither of which is printed as a human-readable number.)

This affects how we must parse the data.
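
To make the IMpb case concrete, here is a small sketch that maps the 34-digit machine-scanned form onto the 22-digit form a human would key in, using only the layout described above ("420", then a nine-digit postal code, then the printed number). Other layouts, such as a shorter postal code or a separator character between the fields, are not handled. The function name is hypothetical and the sample value is made up, not a real tracking number.

```python
def impb_human_readable(scanned: str) -> str:
    """Return the 22-digit printed part of a 34-digit IMpb scan, else the input unchanged."""
    s = scanned.strip()
    if len(s) == 34 and s.isascii() and s.isdigit() and s.startswith("420"):
        return s[12:]  # skip the "420" AI (3 digits) and the postal code (9 digits)
    return s


# Made-up sample: AI + postal code + a 22-digit number starting with "92"
sample = "420" + "123456789" + "92" + "0" * 20
print(impb_human_readable(sample))
```

Comparing both forms against the database (scanned and stripped) would let a keyed entry and a scanned entry for the same label match each other.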

# 1. Initialize data structures and load values / parameters
# 1.1 (optional) if date of scan is available, 
#     rule out any symbologies not used in the organization at that time
    
# 2. check if symbology prefix present; 
#       if so, compare against valid prefixes 
#       and verify data is valid for the identified symbology (to rule out false-positives)
#    Example: AIM prefix    “]L2” —> ‘PDF417 with no tx protocol enabled’
#             Symbol prefix “X”   —> ‘PDF417’ or ‘ISSN EAN’
    
# 3.  check input length (trimmed of leading/trailing whitespace)
# 3.1 Rule out fixed-length codes not matching length of input
#      (or length+1, for codes allowing check digits, in case they were omitted at scan time)
# 3.2 Rule out variable-length codes where input length is greater than their maximum length

# 4. If input string contains:
#     - any lower-case letters, rule out numeric-only and ‘upper-case’-only symbologies
#     - any letters, rule out numeric-only symbologies
#     - any punctuation, keyboard symbols, or control codes, rule out symbologies as appropriate
#     - any control codes that only belong to specific symbologies (such as GS1), rule out all others
#     
#     [example: a valid URI found in the data may rule out all but a very few symbologies,
#               based on symbols (“:”, “/”), length, and data format]

# 4.1 Is the input string missing any elements used by a remaining symbology in expected usage?
#     (e.g. if string does not contain substring “1Z”, it’s not a UPS tracking barcode,
#           and thus highly unlikely to have been encoded using Maxicode)


# 5. If input length matches length of fixed-length symbology that hasn’t already been ruled out,
#    mark that symbology as a possible preferred match

# 6. (optional) For each symbology remaining, calculate data checksum per symbology specification
#    and compare to recorded checksum - matching checksums may help to weight likely symbologies
#    not already ruled out

# 7. Check database(s) for matches based on symbologies remaining 
#    (e.g. if UPC-A still a possibility, query UPC/GTIN databases)

# 8. Identify data structure of data for each remaining candidate, and break out components
#    (e.g. UPC country code, GS1 application identifiers, 
#          postal service level, company-specific codes)

# 9. Present operator with list of possible / impossible symbologies


# 10. Display found data 
#     - Highlight if notable data found, such as a GTIN, GS1 AIs, or a shipping tracking number

# 11. If ambiguity on which business process to execute remains,
#        ask user to choose process (e.g. log invoice, re-stock item) 
#                 or to identify item scanned (e.g. invoice, trade product)

The values remaining “true” represent the valid possibilities; after this point further data analysis, database lookups, and/or business logic can occur. (steps 5-11 in example above)
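
Step 2 of the outline (checking for a symbology prefix) could be sketched as follows for AIM-style identifiers, which take the form "]" plus a code character plus a modifier character. The table is a partial list written from memory and should be checked against the AIM / ISO/IEC 15424 identifier tables before being relied on. The accepted modifier range in the regular expression is an assumption, and the function name is made up.

```python
import re

# Partial list of AIM code characters (assumption: verify against the standard).
AIM_CODE_CHARS = {
    "A": "Code 39",
    "C": "Code 128",
    "E": "UPC/EAN",
    "F": "Codabar",
    "G": "Code 93",
    "I": "Interleaved 2 of 5",
    "L": "PDF417",
    "Q": "QR Code",
    "d": "Data Matrix",
    "z": "Aztec",
}

# "]" + code character + one modifier character, anchored at the start.
AIM_PREFIX = re.compile(r"^\]([A-Za-z])([0-9A-Za-z])")


def split_aim_prefix(value: str):
    """Return (symbology, modifier, data); symbology is None if no known prefix."""
    match = AIM_PREFIX.match(value)
    if match is None or match.group(1) not in AIM_CODE_CHARS:
        return None, None, value
    return AIM_CODE_CHARS[match.group(1)], match.group(2), value[match.end():]


print(split_aim_prefix("]L2hello"))      # ('PDF417', '2', 'hello')
print(split_aim_prefix("049000042566"))  # (None, None, '049000042566')
```

Since ordinary data can also begin with "]", a hit here is only a hint; as step 2 says, the remaining data should still be validated against the identified symbology to rule out false positives.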

[3] If I missed something seemingly obvious, please consider sharing your suggested search terms with your comment or answer, so all can learn to search smarter.

I didn't read the entire thing in detail. But I work professionally with barcodes. The library we use can do around 120 different symbologies, most of them can do simple strings like "0123456789". So it's very difficult to estimate the exact symbology used just from the string returned. The more complex the string, the more symbologies you can rule out. But overall, it's a near-impossible task to say for sure. — MyICQ, Mar 24 '21 at 08:27
I agree on all counts. A 10-digit numerical string wouldn't mean much (but it would automatically exclude things such as the UPC and EAN symbologies, i.e. any fixed-length code with a length other than 10). Strings with alpha or special characters, or those of longer lengths, would exclude many more possibilities. — Jim Grisham, Mar 24 '21 at 21:34
There is also more to it, since the data may be valid/invalid in the structure itself, not just the content. Example: EAN13 has a country code + manufacturer. UPC has a manufacturer. These may be invalid. So the answer is not just simple. If you have time in your application, you could add a lookup for company part on what may look like UPC / EAN13 ? If valid, it must exist. I agree with your list of points, it makes sense. Lots of work. Good luck finding the right symbology ;) Only 120 or so to choose from. — MyICQ, Mar 24 '21 at 22:41
Absolutely. You’d need more logic to determine/validate that. My intention was more to eliminate definite negatives rather than identify definite positives, hopefully to narrow down your list of 120 a bit. ... in a generic, reusable, application-independent way (because you’re right, it will be a lot of work!). Think of this as an initial filter leading to further processing / data analysis. (e.g. if all you tell me is that a string contains one or more Cyrillic characters, I can’t say it is Russian text but I can tell you with confidence that it _isn’t_ pure English, Portuguese, or Tagalog) — Jim Grisham, Mar 28 '21 at 01:08