I'm trying to build an application that can read PDF files. I'm following this guide:

http://www.codeproject.com/Articles/14170/Extract-Text-from-PDF-in-C-100-NET

but I do not understand what it means when it says that "file" is the full path on your computer. When I try it as described, I am told it is in the wrong format.

String file = "C:/project/test2.pdf";
// create an instance of the pdfparser class
PDFParser pdfParser = new PDFParser();

// extract the text
String result = pdfParser.ExtractText(file);

Error message:

Error 1 No overload for method 'ExtractText' takes 1 arguments

  • You missed a `:` after `C` in your path. Still, that's unrelated to your error message. – Danny Beckett Apr 17 '13 at 11:12
  • That doesn't help, but thanks for the feedback. – Max Torstensson Apr 17 '13 at 11:12
  • Well, the message is telling you the `ExtractText` method takes more than 1 argument. What parameters does Intellisense say you need to supply? – Daniel Kelley Apr 17 '13 at 11:15
  • `public bool ExtractText(string inFileName, string outFileName)` – DarkBee Apr 17 '13 at 11:17
  • You might want to follow Ria's answer and use the PDF text parsing functionality already built into iTextSharp (if you are using a current version of it) because the codeproject solution is very naive and ignores much of the PDF specification, cf. [this answer](http://stackoverflow.com/a/13982550/1729265) discussing "Method 1" from the question which is your `PDFParser`. – mkl Apr 17 '13 at 11:34

3 Answers

If you want to extract the PDF text into a string, try iTextSharp's PdfTextExtractor.GetTextFromPage. Sample code:

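// Namespaces needed: System.IO, System.Text,
// iTextSharp.text.pdf, iTextSharp.text.pdf.parser
public string ReadPdfFile(string fileName)
{
    var text = new StringBuilder();

    if (File.Exists(fileName))
    {
        var pdfReader = new PdfReader(fileName);
        try
        {
            // PDF page numbers are 1-based.
            for (int page = 1; page <= pdfReader.NumberOfPages; page++)
            {
                // Use a fresh strategy per page so text does not accumulate across pages.
                var strategy = new SimpleTextExtractionStrategy();

                // GetTextFromPage already returns a .NET string. Round-tripping it
                // through Encoding.Default is unnecessary and loses any character
                // outside the default code page, so no re-encoding is done here.
                text.Append(PdfTextExtractor.GetTextFromPage(pdfReader, page, strategy));
            }
        }
        finally
        {
            // Close the reader even if extraction throws.
            pdfReader.Close();
        }
    }
    return text.ToString();
}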
Ria

I think ExtractText takes two arguments: the first is the source PDF file and the second is the destination text file.

So try the following and your error should be resolved:

pdfParser.ExtractText(file,Path.GetFileNameWithoutExtension(file)+".txt");
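
One caveat about the line above: `Path.GetFileNameWithoutExtension` drops the directory, so the `.txt` file ends up relative to the process's current working directory rather than next to the PDF. If you would rather have it beside the source file, `Path.ChangeExtension` is one option. A minimal sketch; `OutputPathDemo` and `TextPathFor` are made-up names for illustration, and the path from the question is used purely as a sample string:

```csharp
using System;
using System.IO;

public static class OutputPathDemo
{
    // Hypothetical helper: builds the output path next to the source PDF
    // by swapping only the extension, so the directory part is kept.
    public static string TextPathFor(string pdfPath)
    {
        return Path.ChangeExtension(pdfPath, ".txt");
    }

    // For comparison: the expression used in the answer, which keeps only
    // the bare file name and therefore loses the directory.
    public static string BareTextName(string pdfPath)
    {
        return Path.GetFileNameWithoutExtension(pdfPath) + ".txt";
    }
}
```

For `"C:/project/test2.pdf"`, `BareTextName` gives `test2.txt` while `TextPathFor` gives `C:/project/test2.txt`, so the call could become `pdfParser.ExtractText(file, OutputPathDemo.TextPathFor(file));`.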
Pandian
  • Error message: Error 1 Cannot implicitly convert type 'bool' to 'string' – Max Torstensson Apr 17 '13 at 11:23
  • And I added the class from the example. – Max Torstensson Apr 17 '13 at 11:32
  • Can you try it like this: `Console.WriteLine(pdfParser.ExtractText(file,Path.GetFileNameWithoutExtension(file)+".txt").ToString());` and tell us what message you get? – Pandian Apr 17 '13 at 11:38
  • @MaxTorstensson DarkBee already posted the method signature `public bool ExtractText(string inFileName, string outFileName)` --- so why do you try to assign `String result = pdfParser.ExtractText(...)`?? – mkl Apr 17 '13 at 11:48

First of all, you should specify the path correctly. You can download the test project from the codeproject link you posted.

You should use it like this:

string sourceFile = "C:\\Folder\\File.pdf";
string outputFile = "C:\\Folder\\File2.txt";

PDFParser pdfParser = new PDFParser();
pdfParser.ExtractText(sourceFile, outputFile);

UPD: You are using it WRONG (which is why you get the error "Cannot implicitly convert type 'bool' to 'string'"):

string result = pdfParser.ExtractText(sourceFile, outputFile);

THE RIGHT WAY IS:

pdfParser.ExtractText(sourceFile, outputFile);
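
If what you ultimately want is the text in a string, as in the question, one option is to let ExtractText write to a temporary file and then read that file back. This is only a sketch under stated assumptions: `PdfTextHelper` and `ExtractToString` are hypothetical names, the extractor is passed in as a delegate so the snippet compiles without the codeproject library, and the only thing taken from PDFParser is the `bool ExtractText(string inFileName, string outFileName)` signature quoted in the comments on the question. It also assumes that `true` means success and that the output file can be read with the default decoding of `File.ReadAllText`, neither of which is confirmed here.

```csharp
using System;
using System.IO;

public static class PdfTextHelper
{
    // Runs a two-argument, bool-returning extractor (for example
    // PDFParser.ExtractText) against a temporary output file and returns
    // the text it wrote, or null if the extractor returns false.
    public static string ExtractToString(Func<string, string, bool> extract, string pdfPath)
    {
        string tempFile = Path.GetTempFileName();
        try
        {
            // Assumption: a false return value signals that extraction failed.
            if (!extract(pdfPath, tempFile))
            {
                return null;
            }

            // Assumption: the extractor's output decodes correctly with
            // File.ReadAllText's default (UTF-8 with BOM detection).
            return File.ReadAllText(tempFile);
        }
        finally
        {
            // Always remove the temporary file, even on failure.
            File.Delete(tempFile);
        }
    }
}
```

With the variables from this answer, usage would look like `string result = PdfTextHelper.ExtractToString(pdfParser.ExtractText, sourceFile);`.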
Fedor
  • WRONG: error message: Error 1 Cannot implicitly convert type 'bool' to 'string' – Max Torstensson Apr 17 '13 at 11:27
  • 1. Max, as far as I can see, this version of the library has no ExtractText method that takes only one argument, so you should pass an output file. – Fedor Apr 17 '13 at 11:33
  • Thanks for posting on pastebin. You cannot use it like `string result = pdfParser.ExtractText(file, output);`. Delete everything to the left of `pdfParser`. – Fedor Apr 17 '13 at 11:35
  • I tried it but I got this error message: Error 1 Cannot implicitly convert type 'bool' to 'string' – Max Torstensson Apr 17 '13 at 11:37
  • You get this error because you call `pdfParser.ExtractText(file, output)` and try to assign its result to `string result`. Delete `string result =`. – Fedor Apr 17 '13 at 11:39
  • Thank you. It works now, but unfortunately the output comes out in a very strange encoding. – Max Torstensson Apr 17 '13 at 11:47
  • That's strange. Try changing the encoding of the file to UTF-8, for example. – Fedor Apr 17 '13 at 11:52
  • @MaxTorstensson As mentioned above, PDFParser is a very naive implementation of a PDF text parser. – mkl Apr 17 '13 at 11:52
  • @MaxTorstensson I completely agree with mkl, that codeproject lib is rubbish; you'd better follow his advice. – Fedor Apr 17 '13 at 11:56
  • @Fyodor Not that strange if you look at the `PDFParser` code --- it makes many assumptions which happen to be (mostly) true only in simply built PDFs for Western languages, foremost English. – mkl Apr 17 '13 at 11:56
  • @mkl, thanks a lot! I think the author of this project should add [your useful comment](http://stackoverflow.com/questions/13977738/which-is-the-right-method-to-text-extraction-strategy/13982550#13982550) to the description at codeproject. :) – Fedor Apr 17 '13 at 12:11