I am trying to read a PDF document using iTextSharp. The document is read, but I notice that names in the extracted text are abbreviated. For example, if the name is "procurement Define document", the extracted text contains "Proc def doc". I am not sure what I am doing wrong, but I don't want the names to be shortened.

Below is my code:

Imports System
Imports System.Collections.Generic
Imports System.Text
Imports iTextSharp.text
Imports iTextSharp.text.pdf

Public Class _Default
    Inherits System.Web.UI.Page

    Protected Sub Page_Load(ByVal sender As Object, ByVal e As System.EventArgs) Handles Me.Load
        Dim oReader As New iTextSharp.text.pdf.PdfReader("C:\4012014.pdf")
        Dim sOut As StringBuilder = New StringBuilder()

        For i = 1 To oReader.NumberOfPages
            Dim its As New iTextSharp.text.pdf.parser.SimpleTextExtractionStrategy
            Dim strLineText As String = iTextSharp.text.pdf.parser.PdfTextExtractor.GetTextFromPage(oReader, i, its)

            strLineText = Encoding.UTF8.GetString(ASCIIEncoding.Convert(Encoding.Default, Encoding.UTF8, Encoding.Default.GetBytes(strLineText)))
            sOut.Append(strLineText)
        Next

        oReader.Close()
        sOut.Append("<br/>")
        txtTest1.Text = sOut.ToString()

    End Sub

End Class
Code Maverick
user3862503
  • Can you share a sample document for which that is happening? (iTextSharp does not abbreviate anything during text extraction; it may happen, though, that information contained in a document tells a text extractor that the textual content of some glyphs is different from what you see in a viewer.) – mkl Jul 22 '14 at 07:06
  • To illustrate what mkl says, watch this video: https://www.youtube.com/watch?v=wxGEEv7ibHE Sometimes a PDF is constructed in such a way that extracting text is made impossible (on purpose, so that you wouldn't be able to mine for data in the document). – Bruno Lowagie Jul 22 '14 at 07:16
  • Unfortunately, I cannot share the PDF document, but iTextSharp is not converting the PDF document line by line. The output is very different from the original PDF document. – user3862503 Jul 22 '14 at 15:48
  • This won't fix your problem but please see this post explaining why you should **never** use Encoding.UTF8.GetString(ASCIIEncoding.Convert(Encoding.Default, Encoding.UTF8, Encoding.Default.GetBytes(strLineText))) http://stackoverflow.com/a/10191879/231316 – Chris Haas Jul 22 '14 at 20:12
  • We understand confidentiality problems, but without the PDF we really can't help you. As mkl said, iTextSharp doesn't abbreviate things. You say that "iTextSharp is not converting the PDF document line by line", but it is very important to understand that those lines might not actually exist within the PDF. Even if a PDF wasn't intentionally obfuscated, there are very legitimate reasons for this. – Chris Haas Jul 22 '14 at 20:15
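
To illustrate the encoding point raised in the comments, here is a minimal sketch of why the round trip through a legacy encoding cannot improve the extracted string. It is not taken from the linked post. The names `EncodingRoundTripDemo` and `RoundTrip` are hypothetical, and Latin-1 (ISO-8859-1) is used as an assumed stand-in for `Encoding.Default`, whose actual code page depends on the machine and runtime.

```vbnet
Imports System
Imports System.Text

Module EncodingRoundTripDemo
    ' Hypothetical helper: repeats the round trip from the question, with an
    ' explicit legacy encoding in place of Encoding.Default (an assumption).
    Public Function RoundTrip(ByVal input As String, ByVal legacy As Encoding) As String
        ' Step 1: squeeze the string into the legacy code page. Any character
        ' the code page cannot represent is lost at this point.
        Dim legacyBytes As Byte() = legacy.GetBytes(input)
        ' Step 2: re-encode those bytes as UTF-8.
        Dim utf8Bytes As Byte() = Encoding.Convert(legacy, Encoding.UTF8, legacyBytes)
        ' Step 3: decode the UTF-8 bytes back into a .NET string.
        Return Encoding.UTF8.GetString(utf8Bytes)
    End Function

    Sub Main()
        Dim latin1 As Encoding = Encoding.GetEncoding("ISO-8859-1")
        Dim plain As String = "Procurement"
        ' ChrW(&H2192) is a right arrow, which Latin-1 cannot represent.
        Dim mixed As String = "Procurement " & ChrW(&H2192) & " Define"

        ' ASCII-only text survives unchanged, so the round trip did nothing.
        Console.WriteLine(String.Equals(RoundTrip(plain, latin1), plain, StringComparison.Ordinal)) ' True
        ' Text outside the legacy code page comes back altered.
        Console.WriteLine(String.Equals(RoundTrip(mixed, latin1), mixed, StringComparison.Ordinal)) ' False
    End Sub
End Module
```

In both cases the round trip either leaves the string untouched or damages it, so the string returned by `GetTextFromPage` can be appended directly.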

0 Answers