Use case: OCR PDFs, index the text, and make it searchable

Say I have a class like:

public class Scan
{
    public int Id { get; set; }
    public string Name { get; set; }
    public int PageNumber { get; set; }
    public string[] Names { get; set; }
    public string[] OCRText { get; set; }
}

When I scan a PDF, I want to store the results one page per document, so that, say, Scanned.PDF is stored in Name on every page record:

ID: 1, Name: 'Scanned.PDF', PageNumber: 1, ...
ID: 2, Name: 'Scanned.PDF', PageNumber: 2, ...
ID: 3, etc.

I will then attach metadata (e.g. Names) and the resulting OCR text to each page record.

My question:

What's the best way to make the OCRText "searchable", à la Google/Elasticsearch?

I want to be able to search for "John" and find all pages that contain the name John, including longer words that start with it (e.g. Johnny).
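To make the behavior I'm after concrete, here is a rough in-memory sketch. It does not use LiteDB at all: PageTextIndex, Tokenize, Build and SearchPrefix are hypothetical names I made up for illustration, and it takes plain (Id, OCRText) pairs rather than the Scan class so that it runs on its own. It is only meant to show the kind of lookup I want, not a proposed implementation.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;

// Hypothetical helper (not a LiteDB API): a tiny in-memory inverted index
// that maps each lowercase token to the Ids of the pages containing it.
public static class PageTextIndex
{
    // Split text into lowercase tokens; any non-letter/digit character is a separator.
    public static IEnumerable<string> Tokenize(string text)
    {
        var token = new StringBuilder();
        foreach (char c in text)
        {
            if (char.IsLetterOrDigit(c))
            {
                token.Append(char.ToLowerInvariant(c));
            }
            else if (token.Length > 0)
            {
                yield return token.ToString();
                token.Clear();
            }
        }
        if (token.Length > 0) yield return token.ToString();
    }

    // Build token -> set of page Ids from (Id, OCRText blocks) pairs.
    public static Dictionary<string, HashSet<int>> Build(
        IEnumerable<(int Id, string[] OcrText)> pages)
    {
        var index = new Dictionary<string, HashSet<int>>(StringComparer.Ordinal);
        foreach (var (id, blocks) in pages)
        {
            foreach (var block in blocks ?? Array.Empty<string>())
            {
                foreach (var tok in Tokenize(block))
                {
                    if (!index.TryGetValue(tok, out var ids))
                    {
                        ids = new HashSet<int>();
                        index[tok] = ids;
                    }
                    ids.Add(id);
                }
            }
        }
        return index;
    }

    // Return the sorted Ids of pages that contain any token starting with the term.
    // This scans every distinct token, so it is only suitable for small data sets.
    public static List<int> SearchPrefix(
        Dictionary<string, HashSet<int>> index, string term)
    {
        var prefix = term.ToLowerInvariant();
        return index
            .Where(kv => kv.Key.StartsWith(prefix, StringComparison.Ordinal))
            .SelectMany(kv => kv.Value)
            .Distinct()
            .OrderBy(id => id)
            .ToList();
    }
}
```

With this, searching for "John" would return a page whose text contains "Johnny" as well as one containing "JOHN", because tokens are lowercased and matched by prefix. Substring matches in the middle of a word are deliberately not covered.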

I'm afraid an index on the OCRText blocks could be unwieldy.

WernerCD
Your best bet would be to use a database with full-text search indexing. One option is something like Elasticsearch (https://www.elastic.co/products/elasticsearch), with the index storing your document IDs. Or, if you want to keep it within LiteDB, the author has a suggestion here: https://github.com/mbdavid/LiteDB/issues/910 – willaien Feb 12 '19 at 15:08
