How to recognize text in a PDF order?

Question

I'm trying to recognize the text in a PDF order using Ghostscript and Tesseract 3.0.2.

I cannot use iTextSharp because the PDF doesn't contain any text, just an image.

First, I convert each PDF page into an image, and then I try to extract the text from it.

In a first test I tried to get all the text with the "preserve_interword_spaces" variable set to true, but the information in the "Articolo" column of the table is missing. Then I tried to read just one column, "Consegna", but some "/" symbols are missing.
I used this code:

string sDLLPath = @".\gsdll64.dll";
GhostscriptVersionInfo gvi = new GhostscriptVersionInfo(sDLLPath);
using (GhostscriptRasterizer rasterizer = new GhostscriptRasterizer())
{
    rasterizer.Open(path_file_pdf, gvi, false);
    int dpi_x = 600;
    int dpi_y = 600;
    for (int i = 1; i <= rasterizer.PageCount; i++)
    {
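        // Render the page and save it to disk so Tesseract can load it; dispose the bitmap afterwards.
        using (Image imgg = rasterizer.GetPage(dpi_x, dpi_y, i))
        {
            imgg.Save(".\\Temp2.png", System.Drawing.Imaging.ImageFormat.Png);
        }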

        using (var tEngine = new TesseractEngine(@".\tessdata", "ita", EngineMode.Default))
        {
            tEngine.SetVariable("tessedit_char_whitelist", "/0123456789");
            using (var img = Pix.LoadFromFile(".\\Temp2.png")) 
            {
                Tesseract.Rect region = new Tesseract.Rect(4120, 3215, 550, 840);
                using (var page = tEngine.Process(img, region, PageSegMode.SingleBlock))
                {
                    var text = page.GetText(); 
                    Console.WriteLine(text); 
                    Console.WriteLine(page.GetMeanConfidence()); 
                    Console.ReadKey();
                }
            }
        }
    }
}

Could someone help me get all the text from the image? Thanks in advance.

This is the image link (Temp2.png).

Loïc Sombart · Answer 1 · 2017-02-24T10:53:31.583

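Try setting EngineMode to TesseractAndCube. In my experience it detects more words than the other modes. This mode needs the Cube data files for the language (the ita.cube.* files) in the tessdata folder.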

using (var engine = new TesseractEngine(@".\tessdata", "ita", EngineMode.TesseractAndCube))
{
    using (var img = Pix.LoadFromFile(sourceFilePath))
    {
        using (var page = engine.Process(img))
        {
            var text = page.GetText();
        }
    }
}
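
If you keep cropping to a region as in the question, keep in mind that Tesseract.Rect is expressed in pixels of the rasterized image, so the values (4120, 3215, 550, 840) only match a 600 dpi rendering. Here is a minimal sketch of rescaling them if you experiment with another resolution. RegionScaler and Scale are hypothetical names of mine, not part of Tesseract or Ghostscript.NET, and 300 dpi is only an example value.

```csharp
using System;

// Hypothetical helper, not part of Tesseract or Ghostscript.NET.
public static class RegionScaler
{
    // Converts one pixel coordinate measured at fromDpi to the equivalent at toDpi.
    public static int Scale(int value, int fromDpi, int toDpi)
    {
        if (fromDpi <= 0)
            throw new ArgumentOutOfRangeException(nameof(fromDpi), "DPI must be positive.");
        if (toDpi <= 0)
            throw new ArgumentOutOfRangeException(nameof(toDpi), "DPI must be positive.");

        return (int)Math.Round(value * (double)toDpi / fromDpi);
    }

    public static void Main()
    {
        // Region from the question, measured on the 600 dpi rendering.
        int x = 4120, y = 3215, width = 550, height = 840;

        // The same area on a 300 dpi rendering of the same page (example value).
        Console.WriteLine("{0}, {1}, {2}, {3}",
            Scale(x, 600, 300), Scale(y, 600, 300),
            Scale(width, 600, 300), Scale(height, 600, 300));
    }
}
```

The four scaled values can then be passed to new Tesseract.Rect(x, y, width, height), in the same order as in the question.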

Alternatively, try converting your PDF to an XPS file. You can then extract the words from the XPS file, like this:

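// Requires the namespaces System.IO, System.Xml, System.Windows.Documents and System.Windows.Xps.Packaging.
XpsDocument xpsDocument = new XpsDocument(pSourceDocPath, FileAccess.Read);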
IXpsFixedDocumentSequenceReader fixedDocSeqReader = xpsDocument.FixedDocumentSequenceReader;
if (fixedDocSeqReader == null) return null;

const string uniStr = "UnicodeString";
const string glyphs = "Glyphs";
IXpsFixedDocumentReader document = fixedDocSeqReader.FixedDocuments[0];
FixedDocumentSequence sequence = xpsDocument.GetFixedDocumentSequence();

for (int pageCount = 0; pageCount < sequence.DocumentPaginator.PageCount; ++pageCount)
{
    IXpsFixedPageReader page = document.FixedPages[pageCount];
    XmlReader pageContentReader = page.XmlReader;

    if (pageContentReader == null) return null;
    while (pageContentReader.Read())
    {
        if (pageContentReader.Name != glyphs) continue;
        if (!pageContentReader.HasAttributes) continue;
        // The UnicodeString attribute holds the text of this Glyphs element.
        string words = pageContentReader.GetAttribute(uniStr);
        if (words != null)
        {
            Console.WriteLine(words);
        }
    }
}

I hope that helps.

@francesco, If this answer has helped you, please validate it. — Loïc Sombart, Jan 30 '19 at 13:30
