Q1. Yes, there is a way to make Alexa look up these PDFs and read the definition. Amazon Alexa skills can be backed by AWS Lambda functions, Lambda supports .NET Core, and Foxit PDF SDK 6.4 both works in .NET Core and supports searching PDFs for keywords. You can use Foxit PDF SDK to search for the keyword and then attempt to parse the text data in the PDF for the definition.
This solution requires Foxit PDF SDK 6.4 for .NET. You can request the evaluation package at this link: https://developers.foxitsoftware.com/pdf-sdk/free-trial
To start, add fsdk_dotnet.dll as a reference in a Visual Studio "AWS Lambda Project (.NET Core - C#)" project. The fsdk_dotnet.dll file is located in the lib directory of the evaluation package. Once you have done that, you can add the following using statements.
using System;
using foxit;
using foxit.common;
using foxit.common.fxcrt;
using foxit.pdf;
Your function will look like this.
public string SearchPDF(string inputPDF, string searchTerm) //inputPDF is the full path to the PDF, including its .pdf extension. searchTerm is the term you want to search for.
{
string sn = "SNValue"; //the SN value provided in the evaluation package at lib\gsdk_sn.txt
string key = "SignKeyValue"; //the Sign value provided in evaluation package at lib\gsdk_key.txt
ErrorCode error_code;
try
{
error_code = Library.Initialize(sn, key); //Unlocks the library to be used. Make sure you update the sn and key file accordingly.
if (error_code != ErrorCode.e_ErrSuccess)
{
return error_code.ToString();
}
PDFDoc doc = new PDFDoc(inputPDF);
error_code = doc.Load(null); //Loads the PDF into the Foxit PDF SDK
if (error_code != ErrorCode.e_ErrSuccess)
{
return error_code.ToString(); //Returns an error code if loading the document fails
}
using (TextSearch search = new TextSearch(doc, null))
{
int start_index = 0; //First page to search
int end_index = doc.GetPageCount() - 1; //Last page to search
search.SetStartPage(start_index);
search.SetEndPage(end_index);
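search.SetPattern(searchTerm); //Sets the term to search for in the PDF
Int32 flags = (int)TextSearch.SearchFlags.e_SearchNormal;
// If you want to specify other flags, you can combine them as follows:
// flags |= (int)TextSearch.SearchFlags.e_SearchMatchCase;
// flags |= (int)TextSearch.SearchFlags.e_SearchMatchWholeWord;
// flags |= (int)TextSearch.SearchFlags.e_SearchConsecutive;
search.SetSearchFlags(flags); //Applies the flags. Assumes TextSearch.SetSearchFlags as used in the SDK's search sample; check your SDK version.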
int match_count = 0;
while (search.FindNext())
{
RectFArray rect_array = search.GetMatchRects(); //Gets the rectangles that bound the match on the page
string sentenceWithSearchTerm = search.GetMatchSentence(); //Gets the sentence with the search term
match_count++;
}
}
doc.Dispose();
Library.Release();
}
catch (foxit.PDFException e)
{
return e.Message;
}
catch (Exception e)
{
return e.Message;
}
return error_code.ToString().ToUpper(); //If successful, this returns "E_ERRSUCCESS". Please check the SDK headers for the other error codes.
}
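The loop above finds each sentence that contains the search term but does not keep it. Below is a minimal sketch, assuming you collect the GetMatchSentence() results into a list, of how you might pick the sentence that reads most like a definition for Alexa to speak. DefinitionPicker, PickDefinition, and the cue phrases are hypothetical names chosen for illustration and are not part of the Foxit SDK.

```csharp
using System;
using System.Collections.Generic;

public static class DefinitionPicker
{
    // Hypothetical cue phrases that often follow a term in a definition.
    private static readonly string[] Cues = { " is ", " means ", " refers to " };

    // Returns the first sentence where the term is directly followed by a cue phrase,
    // otherwise the first sentence that contains the term, otherwise null.
    public static string PickDefinition(IEnumerable<string> sentences, string term)
    {
        string fallback = null;
        foreach (string sentence in sentences)
        {
            if (string.IsNullOrWhiteSpace(sentence)) continue;

            int pos = sentence.IndexOf(term, StringComparison.OrdinalIgnoreCase);
            if (pos < 0) continue;

            // Remember the first plain match in case no sentence looks like a definition.
            if (fallback == null) fallback = sentence.Trim();

            // Text that immediately follows the term in this sentence.
            string rest = sentence.Substring(pos + term.Length);
            foreach (string cue in Cues)
            {
                if (rest.StartsWith(cue, StringComparison.OrdinalIgnoreCase))
                    return sentence.Trim();
            }
        }
        return fallback;
    }
}
```

You would add each sentenceWithSearchTerm to a List<string> inside the while loop and pass that list to PickDefinition after the search finishes. The cue-phrase check is only a heuristic, so expect to tune it for your documents.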
Q2: The solution above uses AWS Lambda, which requires an internet connection. It does not use a database; however, if you wish, you can extract the text of the PDF pages into a database. The code above shows how to get the sentence that contains the search term. If you want to get all of the text in a PDF, please see the code below.
using (var doc = new PDFDoc(inputPDF)){
error_code = doc.Load(null);
if (error_code != ErrorCode.e_ErrSuccess)
{
return error_code.ToString();
}
// Get page count
int pageCount = doc.GetPageCount();
for (int i = 0; i < pageCount; i++) //A loop that goes through each page
{
using (var page = doc.GetPage(i))
{
// Parse page
page.StartParse((int)PDFPage.ParseFlags.e_ParsePageNormal, null, false);
// Get the text select object.
using (var text_select = new TextPage(page, (int)TextPage.TextParseFlags.e_ParseTextNormal))
{
int count = text_select.GetCharCount();
if (count > 0)
{
String chars = text_select.GetChars(0, count); //gets the text on the PDF page.
}
}
}
}
}
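If you do decide to store the extracted text, here is a minimal sketch of the idea, using an in-memory dictionary as a stand-in for a real database table keyed by page index. PageTextStore, AddPage, and FindPages are hypothetical names for illustration; the strings passed to AddPage would be the chars values produced in the loop above.

```csharp
using System;
using System.Collections.Generic;

public class PageTextStore
{
    // Stand-in for a database table: page index -> extracted page text.
    private readonly SortedDictionary<int, string> pages = new SortedDictionary<int, string>();

    // Stores (or overwrites) the text for one page.
    public void AddPage(int pageIndex, string text)
    {
        pages[pageIndex] = text ?? string.Empty;
    }

    // Returns the indexes of all pages whose text contains the term, ignoring case.
    public List<int> FindPages(string term)
    {
        var hits = new List<int>();
        foreach (KeyValuePair<int, string> entry in pages)
        {
            if (entry.Value.IndexOf(term, StringComparison.OrdinalIgnoreCase) >= 0)
                hits.Add(entry.Key);
        }
        return hits;
    }
}
```

With a real database you would replace the dictionary with insert and query statements, which lets the skill answer from stored text instead of reopening the PDF on every request.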
Q3. I am not sure what you mean by a machine-readable format, but the Foxit PDF SDK can provide the text as a string.
Q4. The best way to crawl the PDF text data is with the solution I have provided above.