Can't read pdf file

Question

I'm trying to build an application that can read PDF files. I use this guide:

http://www.codeproject.com/Articles/14170/Extract-Text-from-PDF-in-C-100-NET

but I do not understand what it means when it says that "file" is the entire URL from your computer, because when I try it as shown, it says that it is in the wrong format.

String file = "C:/project/test2.pdf";
// create an instance of the pdfparser class
PDFParser pdfParser = new PDFParser();

// extract the text
String result = pdfParser.ExtractText(file);

Error message:

Error 1 No overload for method 'ExtractText' takes 1 arguments

You missed a `:` after `C` in your path. Still, that's unrelated to your error message. — Danny Beckett, Apr 17 '13 at 11:12
Well, the message is telling you the `ExtractText` method takes more than 1 argument. What parameters does Intellisense say you need to supply? — Daniel Kelley, Apr 17 '13 at 11:15
public bool ExtractText(string inFileName, string outFileName) — DarkBee, Apr 17 '13 at 11:17
You might want to follow Ria's answer and use the PDF text parsing functionality already built into iTextSharp (if you are using a current version of it) because the codeproject solution is very naive and ignores much of the PDF specification, cf. [this answer](http://stackoverflow.com/a/13982550/1729265) discussing "Method 1" from the question which is your `PDFParser`. — mkl, Apr 17 '13 at 11:34

Ria · Accepted Answer · 2013-04-17T11:32:50.433

1

If you want to extract the PDF text into a string, try using `PdfTextExtractor.GetTextFromPage`. Sample code:

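using System.IO;
using System.Text;
using iTextSharp.text.pdf;
using iTextSharp.text.pdf.parser;

// Reads every page of the PDF and returns the concatenated text.
// Returns an empty string if the file does not exist.
public string ReadPdfFile(string fileName)
{
    var text = new StringBuilder();

    if (File.Exists(fileName))
    {
        var pdfReader = new PdfReader(fileName);

        // PDF page numbers are 1-based.
        for (int page = 1; page <= pdfReader.NumberOfPages; page++)
        {
            var strategy = new SimpleTextExtractionStrategy();
            string currentText = PdfTextExtractor.GetTextFromPage(pdfReader, page, strategy);

            // Round-trips the text through the system default encoding;
            // characters outside that code page may be replaced.
            currentText = Encoding.UTF8.GetString(ASCIIEncoding.Convert(Encoding.Default, Encoding.UTF8, Encoding.Default.GetBytes(currentText)));
            text.Append(currentText);
        }
        pdfReader.Close();
    }
    return text.ToString();
}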

edited Apr 17 '13 at 11:32

answered Apr 17 '13 at 11:17

Ria


Should I call this function instead? – Max Torstensson Apr 17 '13 at 11:19
You should read the documentation of the class you downloaded; the function you are trying to use requires 2 arguments, not 1. – DarkBee Apr 17 '13 at 11:21
In his example it takes 1 argument, but the function takes two. – Max Torstensson Apr 17 '13 at 11:24
The `ExtractText` method extracts the PDF to a **text file**, but this method extracts it to a string. – Ria Apr 17 '13 at 11:26
Is "filename" the full path? – Max Torstensson Apr 17 '13 at 11:27
BTW, you should try this without the `String result = pdfParser.ExtractText...` line --- no one could yet explain what it is good for; on the other hand it may break some strings. – mkl Apr 17 '13 at 12:00

score 0 · Answer 2 · answered Apr 17 '13 at 11:20

0

I think `ExtractText` takes two arguments: the first is the PDF source file and the second is the text destination file.

So try it like below and your error should be resolved:

pdfParser.ExtractText(file,Path.GetFileNameWithoutExtension(file)+".txt");
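One detail worth keeping in mind here: `Path.GetFileNameWithoutExtension` drops the directory part of the path, so the call above writes the `.txt` file relative to the current working directory rather than next to the PDF. A minimal sketch, assuming you would rather keep the output beside the source file, uses `Path.ChangeExtension` instead. The class name `OutputPathDemo` and its helper methods are hypothetical and exist only for illustration.

```csharp
using System;
using System.IO;

// Hypothetical helper class for illustration; not part of the CodeProject sample.
static class OutputPathDemo
{
    // Mirrors the expression used in the answer: the directory information is dropped.
    public static string NameOnly(string pdfPath)
    {
        return Path.GetFileNameWithoutExtension(pdfPath) + ".txt";
    }

    // Keeps the directory and swaps only the extension.
    public static string NextToSource(string pdfPath)
    {
        return Path.ChangeExtension(pdfPath, ".txt");
    }

    static void Main()
    {
        string file = "C:/project/test2.pdf";
        Console.WriteLine(NameOnly(file));      // test2.txt
        Console.WriteLine(NextToSource(file));  // C:/project/test2.txt
    }
}
```

Either string can be passed as the second argument of `ExtractText`; the difference is only where the text file ends up.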

Pandian


Error message: Error 1 Cannot implicitly convert type 'bool' to 'string' – Max Torstensson Apr 17 '13 at 11:23
And I added the example's class. – Max Torstensson Apr 17 '13 at 11:32
Can you try it like this `Console.WriteLine(pdfParser.ExtractText(file,Path.GetFileNameWithoutExtension(file)+".txt").ToString());` and tell us what message you get? – Pandian Apr 17 '13 at 11:38
@MaxTorstensson DarkBee already posted the method signature `public bool ExtractText(string inFileName, string outFileName)` --- so why do you try to assign `String result = pdfParser.ExtractText(...)`?? – mkl Apr 17 '13 at 11:48

Fedor · Answer 3 · 2013-04-17T11:41:09.397

0

First of all, you should specify the path correctly. You can download the test project from the CodeProject link you posted.

And you should use it like this:

string sourceFile = "C:\\Folder\\File.pdf";
string outputFile = "C:\\Folder\\File2.txt";

PDFParser pdfParser = new PDFParser();
pdfParser.ExtractText(sourceFile, outputFile);

UPD: This usage is WRONG (and it is what produces your error: cannot implicitly convert bool to string):

string result = pdfParser.ExtractText(sourceFile, outputFile);

The RIGHT way is:

pdfParser.ExtractText(sourceFile, outputFile);
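Since `ExtractText` returns a `bool` and writes the text to a file, one way to still end up with a string is to check the return value and then read the output file back. A minimal sketch, assuming the signature quoted in the comments (`public bool ExtractText(string inFileName, string outFileName)`) and assuming that `false` signals a failed extraction. The `PDFParser` class below is only a stand-in stub so the example compiles on its own; its body is not the CodeProject implementation. `ExtractDemo` and `ReadExtractedText` are hypothetical names.

```csharp
using System;
using System.IO;

// Stand-in stub with the same signature as the CodeProject class.
// ASSUMPTION: the real class parses the PDF; this stub only writes placeholder text.
class PDFParser
{
    public bool ExtractText(string inFileName, string outFileName)
    {
        if (!File.Exists(inFileName)) return false;
        File.WriteAllText(outFileName, "placeholder text");
        return true;
    }
}

static class ExtractDemo
{
    // Hypothetical helper: returns the extracted text, or null if extraction failed.
    public static string ReadExtractedText(PDFParser parser, string sourceFile, string outputFile)
    {
        // The bool result is checked instead of being assigned to a string.
        bool ok = parser.ExtractText(sourceFile, outputFile);
        return ok ? File.ReadAllText(outputFile) : null;
    }

    static void Main()
    {
        string sourceFile = "C:\\Folder\\File.pdf";
        string outputFile = "C:\\Folder\\File2.txt";

        string text = ReadExtractedText(new PDFParser(), sourceFile, outputFile);
        Console.WriteLine(text ?? "Extraction failed.");
    }
}
```

With the real class in place of the stub, only the helper and the call in `Main` are needed.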

edited Apr 17 '13 at 11:41

answered Apr 17 '13 at 11:24

Fedor


WRONG: error message: Error 1 Cannot implicitly convert type 'bool' to 'string' – Max Torstensson Apr 17 '13 at 11:27
Max, as far as I can see, in this version of the library there is no `ExtractText` method that takes only one argument, so you should supply an output file. – Fedor Apr 17 '13 at 11:33
Thanks for posting on pastebin. You cannot use it like `string result = pdfParser(file, output);`. Delete everything to the left of `pdfParser`. – Fedor Apr 17 '13 at 11:35
I tried it but I got this error message: Error 1 Cannot implicitly convert type 'bool' to 'string' – Max Torstensson Apr 17 '13 at 11:37
You get this error because you call pdfParser(file, output) and try to assign the result to string result. Delete "string result =". – Fedor Apr 17 '13 at 11:39
Thank you. It went through, but unfortunately the output comes out in a very strange encoding. – Max Torstensson Apr 17 '13 at 11:47
That's strange. Try to change encoding of the file to UTF-8, for example. – Fedor Apr 17 '13 at 11:52
@MaxTorstensson As mentioned above, PDFParser is a very naive implementation of a PDF text parser. – mkl Apr 17 '13 at 11:52
@MaxTorstensson I completely agree with mkl: that codeproject lib is rubbish, so you had better follow his advice. – Fedor Apr 17 '13 at 11:56
@Fyodor Not that strange if you look at the `PDFParser` code --- it makes many assumptions which happen to be (mostly) true only in simply built PDFs for Western languages, foremost English. – mkl Apr 17 '13 at 11:56
@mkl, thanks a lot! I think author of this project should add [your useful comment](http://stackoverflow.com/questions/13977738/which-is-the-right-method-to-text-extraction-strategy/13982550#13982550) to the description at codeproject. :) – Fedor Apr 17 '13 at 12:11
