Use Case: OCR PDFs, index the text and make the text searchable
Say I have a class like:
public class Scan
{
    public int Id { get; set; }
    public string Name { get; set; }
    public int PageNumber { get; set; }
    public string[] Names { get; set; }
    public string[] OCRText { get; set; }
}
When I scan a PDF, I want to store the document as individual page results, so that, say, Scanned.PDF gets stored in Name:
ID: 1, Name: 'Scanned.PDF', PageNumber: 1, ...
ID: 2, Name: 'Scanned.PDF', PageNumber: 2, ...
ID: 3, etc.
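To make the per-page layout concrete, here is a minimal sketch of how one Scan record per page might be built. The helper `PageRecords.FromPages`, its `firstId` parameter, and the sequential Id assignment are hypothetical and only for illustration; it assumes the OCR step has already produced an array of text blocks for each page.

```csharp
using System;
using System.Collections.Generic;

public class Scan
{
    public int Id { get; set; }
    public string Name { get; set; }
    public int PageNumber { get; set; }
    public string[] Names { get; set; }
    public string[] OCRText { get; set; }
}

public static class PageRecords
{
    // Hypothetical helper: turns per-page OCR output into one Scan per page.
    // pages[i] holds the OCR text blocks of page i + 1.
    public static List<Scan> FromPages(
        string fileName, IReadOnlyList<string[]> pages, int firstId = 1)
    {
        var scans = new List<Scan>();
        for (int i = 0; i < pages.Count; i++)
        {
            scans.Add(new Scan
            {
                Id = firstId + i,              // assumption: Ids are sequential
                Name = fileName,               // same file name on every page
                PageNumber = i + 1,            // page numbers are 1-based
                Names = Array.Empty<string>(), // metadata is attached later
                OCRText = pages[i]
            });
        }
        return scans;
    }
}
```

A two-page Scanned.PDF would then yield two records with the same Name and PageNumber 1 and 2.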
I will then attach metadata (e.g. Names) and the resulting OCR text.
My question:
What's the best way to make the OCRText "searchable" à la Google/Elasticsearch?
I want to be able to search for "John" and find all pages that contain the name John (including partial matches such as Johnny).
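To pin down the matching behaviour, here is a naive in-memory sketch. `NaiveSearch` and `PageMatches` are made-up names, not part of any library, and the separator list is an arbitrary assumption. It walks every text block on every query, so it only illustrates the desired result (a search for "John" also hits "Johnny"), not a scalable approach.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class NaiveSearch
{
    // Assumption: words are separated by whitespace or common punctuation.
    private static readonly char[] Separators =
        { ' ', '\t', '\r', '\n', '.', ',', ';', ':', '!', '?', '(', ')', '"' };

    // True if any word in any text block starts with the search term,
    // ignoring case. This is a full scan, not an index lookup.
    public static bool PageMatches(IEnumerable<string> textBlocks, string term)
    {
        return textBlocks
            .SelectMany(block => block.Split(Separators, StringSplitOptions.RemoveEmptyEntries))
            .Any(word => word.StartsWith(term, StringComparison.OrdinalIgnoreCase));
    }
}
```

For example, `NaiveSearch.PageMatches(scan.OCRText, "John")` would be true for a page whose text contains "Johnny" and false for a page that only mentions "Jon".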
I'm afraid an index on the OCRText blocks could be unwieldy.