
I'm looking for a way to use Alexa as a notification and dictionary system for newly released papers and announcements from a site.

For that, I would use a Node.js instance on a Raspberry Pi to crawl the site for new PDFs at regular intervals.

I'm quite new to the Alexa environment and am looking for some direction.

Q: Is there a way to make Alexa look up these PDFs and read definitions of requested keywords, like the Wikipedia query skill does?

Q: Would it be better to keep the Raspberry Pi off the public internet and instead push the data at intervals to a cloud database that Alexa queries?

Q: Do I have to parse the PDFs into a machine-readable format?

Q: Is there a better way to crawl the data?

Thank you for any advice.

Simon D

2 Answers


I believe you are asking how to make an Alexa skill that can make queries such as "are there new papers about ?"

You are correct that a good design is for your scraper to be separate and to publish to a database. You can then create a skill that uses an intent with an AMAZON.SearchQuery slot to capture the user's query. Your skill code can perform the database lookup and decide how to respond.
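As a rough illustration of that flow, here is a minimal sketch that works directly on the raw Alexa request JSON rather than on any SDK. The intent name `SearchPapersIntent`, the slot name `query`, and the in-memory `PAPERS` array (standing in for the real database) are all assumptions made for the example.

```javascript
// Hypothetical in-memory stand-in for the database the scraper publishes to.
// A real skill would query a cloud database here instead.
const PAPERS = [
  { title: "Annual Report 2019", keywords: ["report", "finance"] },
  { title: "Sensor Calibration Notes", keywords: ["sensor", "calibration"] },
];

// Return the papers that have at least one keyword occurring in the query.
function findPapers(query, papers) {
  const q = query.trim().toLowerCase();
  if (q === "") return [];
  return papers.filter((p) => p.keywords.some((k) => q.includes(k)));
}

// Build a minimal Alexa response envelope with plain-text speech.
function speak(text) {
  return {
    version: "1.0",
    response: {
      outputSpeech: { type: "PlainText", text },
      shouldEndSession: true,
    },
  };
}

// Handle one Alexa request. "SearchPapersIntent" and the slot name "query"
// are assumed names that must match the skill's interaction model.
function handleRequest(event) {
  const request = event.request || {};
  const intent = request.intent;
  if (request.type !== "IntentRequest" || !intent ||
      intent.name !== "SearchPapersIntent") {
    return speak("You can ask me whether there are new papers about a topic.");
  }
  const slot = (intent.slots || {}).query;
  const query = slot && slot.value ? slot.value : "";
  const hits = findPapers(query, PAPERS);
  if (hits.length === 0) {
    return speak(`I found no new papers about ${query || "that"}.`);
  }
  const titles = hits.map((p) => p.title).join(", ");
  const noun = hits.length === 1 ? "paper" : "papers";
  return speak(`I found ${hits.length} new ${noun} about ${query}: ${titles}.`);
}
```

In a Lambda deployment this would typically be exposed as `exports.handler = async (event) => handleRequest(event);`, with `findPapers` replaced by the actual database query.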

You may find the following helpful: https://forums.developer.amazon.com/questions/128538/sample-skill-using-amazonsearchquery.html.
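For orientation, the matching interaction model might look roughly like the sketch below, written as a JavaScript object that mirrors the JSON you would paste into the developer console. The invocation name, intent name, slot name, and sample utterances are assumptions. AMAZON.SearchQuery slots need a carrier phrase in every sample utterance, so the small helper checks for that.

```javascript
// Sketch of an interaction model (all names are assumed, not prescribed).
const interactionModel = {
  interactionModel: {
    languageModel: {
      invocationName: "paper finder",
      intents: [
        {
          name: "SearchPapersIntent",
          slots: [{ name: "query", type: "AMAZON.SearchQuery" }],
          samples: [
            "are there new papers about {query}",
            "search papers for {query}",
            "tell me about {query}",
          ],
        },
        { name: "AMAZON.HelpIntent", samples: [] },
        { name: "AMAZON.CancelIntent", samples: [] },
        { name: "AMAZON.StopIntent", samples: [] },
      ],
    },
  },
};

// A sample is usable only if it contains the slot plus some carrier words.
function hasCarrierPhrase(sample) {
  return sample.includes("{query}") &&
    sample.replace("{query}", "").trim().length > 0;
}

// Collect the sample utterances of the search intent for checking.
function searchSamples(model) {
  const intents = model.interactionModel.languageModel.intents;
  return intents.find((i) => i.name === "SearchPapersIntent").samples;
}
```

Calling `searchSamples(interactionModel).every(hasCarrierPhrase)` is a quick sanity check before uploading the model.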

Mike Liddell

Q1. Yes, there is a way to make Alexa look up these PDFs and read the definition. Alexa skills can be backed by AWS Lambda functions, Lambda supports .NET Core, and Foxit PDF SDK 6.4 works in .NET Core and supports searching PDFs for keywords. You can use Foxit PDF SDK to search for the keywords and attempt to parse the text data in the PDF for the definition.

This solution requires Foxit PDF SDK 6.4 for .NET. You can request the evaluation package at this link: https://developers.foxitsoftware.com/pdf-sdk/free-trial

To start, add fsdk_dotnet.dll as a reference to the Visual Studio "AWS Lambda Project (.NET Core - C#)" project. The fsdk_dotnet.dll file is located in the lib directory of the evaluation package. Once you have done that, you can add the following using statements.

using System; // needed for Exception, Int32 and String used below
using foxit;
using foxit.common;
using foxit.common.fxcrt;
using foxit.pdf;

Your function will look like this.

// inputPDF is the full path to the PDF, including its .pdf extension.
// searchTerm is the term you want to search for.
public string SearchPDF(string inputPDF, string searchTerm)
{
    string sn = "SNValue"; //the SN value provided in the evaluation package at lib\gsdk_sn.txt
    string key = "SignKeyValue"; //the Sign value provided in evaluation package at lib\gsdk_key.txt
    ErrorCode error_code;
    try
    {
        error_code = Library.Initialize(sn, key);  //Unlocks the library to be used.  Make sure you update the sn and key file accordingly.
        if (error_code != ErrorCode.e_ErrSuccess)
        {
            return error_code.ToString();
        }
        PDFDoc doc = new PDFDoc(inputPDF); 
        error_code = doc.Load(null); //Loads the PDF into the Foxit PDF SDK
        if (error_code != ErrorCode.e_ErrSuccess)
        {
            return error_code.ToString(); //Returns an error code if loading the document fails
        }

        using (TextSearch search = new TextSearch(doc, null))
        {
            int start_index = 0;
            int end_index = doc.GetPageCount() - 1;
            search.SetStartPage(0);
            search.SetEndPage(doc.GetPageCount() - 1);
            search.SetPattern(searchTerm); //Sets the term to search for in the PDF

            Int32 flags = (int)TextSearch.SearchFlags.e_SearchNormal;
            // If you want to specify flags, you can do so as follows:
            // flags |= (int)TextSearch.SearchFlags.e_SearchMatchCase;
            // flags |= (int)TextSearch.SearchFlags.e_SearchMatchWholeWord;
            // flags |= (int)TextSearch.SearchFlags.e_SearchConsecutive;

            int match_count = 0;
            while (search.FindNext())
            {
                RectFArray rect_array = search.GetMatchRects(); // Rectangles covering the match on the page
                string sentenceWithSearchTerm = search.GetMatchSentence(); // Gets the sentence containing the search term
                match_count++;
            }
        }

        doc.Dispose();
        Library.Release();
    }
    catch (foxit.PDFException e)
    {
        return e.Message;
    }
    catch (Exception e)
    {
        return e.Message;
    }
    return error_code.ToString().ToUpper(); //On success this returns "E_ERRSUCCESS". See the SDK headers for the other error codes.
}

Q2. The solution above uses AWS Lambda, which requires internet access. It does not use a database; however, if you wish, you can extract the text in the PDF pages to a database. The code above shows how to get the sentence that contains the search term. If you want to get all of the text in a PDF, please see the code below.

 using (var doc = new PDFDoc(inputPDF)){
    error_code = doc.Load(null);
    if (error_code != ErrorCode.e_ErrSuccess)
    {
        return error_code.ToString();
    }

    // Get page count
    int pageCount = doc.GetPageCount();
    for (int i = 0; i < pageCount; i++) //A loop that goes through each page
    {
        using (var page = doc.GetPage(i))
        {
            // Parse page
            page.StartParse((int)PDFPage.ParseFlags.e_ParsePageNormal, null, false);
            // Get the text select object.
            using (var text_select = new TextPage(page, (int)TextPage.TextParseFlags.e_ParseTextNormal))
            {
                int count = text_select.GetCharCount();
                if (count > 0)
                {
                    String chars = text_select.GetChars(0, count); //gets the text on the PDF page.
                }
            }
        }
    }   
}

Q3. I am not sure what you mean by machine-readable format, but the Foxit PDF SDK can provide the text as a string.
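If "machine-readable" means records a database can store and a skill can query, the extracted page strings could be turned into JSON-serialisable objects. The sketch below is one possible shape, assuming the page text has already been extracted into an array of strings. The function names, the record fields, and the naive sentence splitting are illustrative assumptions and not part of any SDK.

```javascript
// Split extracted page text into sentences (naive: breaks after ., ! or ?).
function splitSentences(text) {
  return text
    .replace(/\s+/g, " ")
    .split(/(?<=[.!?])\s+/)
    .map((s) => s.trim())
    .filter((s) => s.length > 0);
}

// Build one record per sentence that mentions the keyword (case-insensitive).
// `pages` is an array of page texts; page numbers in the records start at 1.
function buildRecords(source, pages, keyword) {
  const needle = keyword.toLowerCase();
  const records = [];
  pages.forEach((pageText, index) => {
    for (const sentence of splitSentences(pageText)) {
      if (sentence.toLowerCase().includes(needle)) {
        records.push({ source, page: index + 1, keyword, sentence });
      }
    }
  });
  return records;
}
```

Each record can then be written to the database with `JSON.stringify` or the database client's own insert call, and the skill only has to read the `sentence` field aloud.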

Q4. The best way to crawl the PDF text data is with the solution I have provided above.
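Finding the new PDFs in the first place is a separate step from reading their text. For the Node.js crawler the question describes, a minimal sketch of detecting newly linked PDFs on a page might look like this. It assumes the page HTML has already been downloaded, and the function names are hypothetical.

```javascript
// Extract absolute URLs of linked PDFs from an HTML string.
// A regex keeps the sketch dependency-free; a real crawler would more
// likely use a proper HTML parser.
function extractPdfLinks(html, baseUrl) {
  const links = new Set();
  const hrefPattern = /href\s*=\s*["']([^"']+)["']/gi;
  let match;
  while ((match = hrefPattern.exec(html)) !== null) {
    try {
      const url = new URL(match[1], baseUrl);
      if (url.pathname.toLowerCase().endsWith(".pdf")) {
        links.add(url.href);
      }
    } catch (err) {
      // Ignore href values that cannot be parsed as URLs.
    }
  }
  return [...links];
}

// Return only the links that were not seen in earlier crawls.
// `seen` is a Set of URLs recorded by previous runs.
function findNewLinks(found, seen) {
  return found.filter((link) => !seen.has(link));
}
```

Run on a timer, the crawler would fetch the page, call `extractPdfLinks`, pass the result through `findNewLinks`, and then download and process only those files before adding them to the seen set.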

marc_s
Huy Tran
  • Please disclose any [affiliations](https://stackoverflow.com/help/promotion) and do not use the site as a way to promote your site through posting. See [How do I write a good answer?](https://stackoverflow.com/help/how-to-answer). –  Aug 06 '19 at 01:06