9

I need to convert a .pdf file to a .txt file

How can I do this in C#?

Boppity Bop
aharon

6 Answers

5

I had the same need myself and used this article to get started: http://www.codeproject.com/KB/string/pdf2text.aspx

Don
4

Ghostscript could do what you need. Below is a command for extracting text from a PDF file into a text file (you can run it from a command line to test whether it works for you):

gswin32c.exe -q -dNODISPLAY -dSAFER -dDELAYBIND -dWRITESYSTEMDICT -dSIMPLE -c save -f ps2ascii.ps "test.pdf" -c quit >"test.txt"

See the CodeProject article "Convert PDF to Image Using Ghostscript API" (http://www.codeproject.com/KB/cs/GhostScriptUseWithCSharp.aspx) for details on how to use Ghostscript with C#.
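
As a rough sketch of how the command above could be launched from C#: the snippet below starts gswin32c.exe through System.Diagnostics.Process, captures its standard output (the job the `>` redirection does on the command line), and writes it to the text file. The names GhostscriptText, BuildArguments and Convert are made up for illustration, and the code assumes gswin32c.exe is on the PATH; only the switches come from the command above.

```csharp
using System;
using System.Diagnostics;
using System.IO;

public static class GhostscriptText
{
    // Assumption: gswin32c.exe (the console build) is on the PATH;
    // otherwise put its full path here.
    private const string GsExe = "gswin32c.exe";

    // Builds the same switches as the command line above for one input file.
    public static string BuildArguments(string pdfPath)
    {
        return "-q -dNODISPLAY -dSAFER -dDELAYBIND -dWRITESYSTEMDICT -dSIMPLE " +
               "-c save -f ps2ascii.ps \"" + pdfPath + "\" -c quit";
    }

    // Runs Ghostscript and writes whatever it prints on stdout to txtPath.
    public static void Convert(string pdfPath, string txtPath)
    {
        var startInfo = new ProcessStartInfo
        {
            FileName = GsExe,
            Arguments = BuildArguments(pdfPath),
            UseShellExecute = false,        // required for redirecting stdout
            RedirectStandardOutput = true,  // replaces the shell's > redirection
            CreateNoWindow = true
        };

        using (Process process = Process.Start(startInfo))
        {
            // Read before WaitForExit so a full output pipe cannot block the child.
            string text = process.StandardOutput.ReadToEnd();
            process.WaitForExit();
            File.WriteAllText(txtPath, text);
        }
    }
}
```

A call such as GhostscriptText.Convert(@"c:\test.pdf", @"c:\test.txt") would then mirror the command line. Error handling (a missing executable, a non-zero exit code) is left out to keep the sketch short.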

serge_gubenko
  • Thanks!!! It's working, but there is a problem: it isn't saving to the txt file. It just creates the file, and it remains empty. Why doesn't it work? I ran it like this: C:\>C:\gswin32.exe -q -dNODISPLAY -dSAFER -dDELAYBIND -dWRITESYSTEMDICT -d -c save -f ps2ascii.ps "C:\New Folder\2\test.pdf" -c quit >"c:\test.txt" – aharon Dec 23 '09 at 13:24
  • If you run it like this: gswin32.exe "C:\New Folder\2\test.pdf", does it show you the file? You might also want to try running it from the gs bin folder, something like this: C:\Program Files\gs\gs8.64\bin>gswin32c.exe .... In any case, gs should give you an error if it can't find or parse your file; please post it here if you still have no luck converting your file. – serge_gubenko Dec 23 '09 at 14:31
  • I tried C:\Program Files\gs\gs8.64\bin>gswin32.exe "C:\New Folder\2\test.pdf" and the program told me that it can't parse the file (but it did show me the PDF file), which is weird, because when I ran gswin32.exe -q -dNODISPLAY -dSAFER -dDELAYBIND -dWRITESYSTEMDICT -dSIMPLE -c save -f ps2ascii.ps "c:\test.pdf" > "c:\test.txt" it did convert it. The only problem is that it creates the file but doesn't write into it. Is this supposed to work on Windows? – aharon Dec 23 '09 at 16:14
  • It should work on Windows, and it works fine for me. There can be problems with parsing PDF files, but usually you get an error message from gs explaining what is missing or broken. Can you post your PDF file on a file-sharing service so I can try converting it? – serge_gubenko Dec 23 '09 at 16:38
  • http://www.megafileupload.com/en/file/170875/test-pdf.html Here is the link to the file I want to convert. I don't think you will have a problem converting it. I succeeded in converting it, but the problem is that it isn't saving to the txt file. Here is the command again: gswin32.exe -q -dNODISPLAY -dSAFER -dDELAYBIND -dWRITESYSTEMDICT -dSIMPLE -c save -f ps2ascii.ps "c:\test.pdf" > "c:\test.txt" – aharon Dec 23 '09 at 16:57
  • I tested your file and it worked fine. The problem is the executable you're using, gswin32.exe; you have to use gswin32c.exe (c == console). Here's how I called it: gswin32.exe -q -dNODISPLAY -dSAFER -dDELAYBIND -dWRITESYSTEMDICT -dSIMPLE -c save -f ps2ascii.ps "c:\test.pdf" -c quit >"c:\test.txt" – serge_gubenko Dec 23 '09 at 17:13
  • Oops, sorry: gswin32c.exe -q -dNODISPLAY -dSAFER -dDELAYBIND -dWRITESYSTEMDICT -dSIMPLE -c save -f ps2ascii.ps "c:\test.pdf" -c quit >"c:\test.txt" – serge_gubenko Dec 23 '09 at 17:13
  • Wow!! It works!! Thanks!!! But there is still a tiny problem: in some PDF files, bold words are not parsed correctly, and the word is cut in the middle or split apart. Is there anything that can be done about that? I uploaded an example file. You can see it clearly in the first line, but there are other words like that in the other lines (where there was a bold line): http://www.megafileupload.com/en/file/170969/test-txt.html Another question: I need to convert 15000 PDF files (for my project). Is it OK if I write a loop in C# and run this program for each file from cmd? – aharon Dec 23 '09 at 20:55
  • Regarding the 15000 PDF files: check the link I gave you in the original reply, http://www.codeproject.com/KB/cs/GhostScriptUseWithCSharp.aspx, for details on how you can use gsdll32.dll in your C# project. 15k files is a lot but shouldn't be a problem for gs; besides, you never said whether that is a total number or whether you're going to receive that many, for instance, per hour. As an alternative, you can call 2..n instances of gswin32c.exe in parallel from different threads and point them to different files from your set; this shouldn't require a lot of coding to implement. I'll take a look at the file... – serge_gubenko Dec 23 '09 at 21:38
  • Sorry, I misunderstood your question about whether it's OK to run the program from cmd for your whole set of files. Yes, I don't see any problem with it; it should work fine. Regarding word separation: I don't think gs would be able to remove those, but I guess you can post-process the txt file afterwards and remove those in your application. – serge_gubenko Dec 23 '09 at 21:59
  • OK. Thanks a lot! You really helped me!! I'll make a program that calls gs from C#. There is no need for what was described in your link, because I can execute the cmd command from C#, so I'll just make a loop. The time is OK; it can take 24 hours, I don't care. I can't post-process the txt files, there are a lot of them... Anyway, thanks!!! – aharon Dec 24 '09 at 06:47
  • About the word separation: I don't want to remove them, they are important. I want them in normal form (unseparated). – aharon Dec 24 '09 at 06:49
  • I didn't succeed in running it through C# or Java. Is there an automatic way to run it with the parameters you gave me and change the input and output files? – aharon Dec 24 '09 at 16:22
  • Check this thread for details on how you can run gswin32c.exe with parameters from your C# application: http://stackoverflow.com/questions/1941118/asp-converting-pdf-to-a-collection-of-images-on-the-server-using-ghostscript/1944348#1944348 – serge_gubenko Dec 24 '09 at 17:27
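
The per-file loop discussed in the comments above, with a few gswin32c.exe instances running in parallel, could look roughly like this. It is a sketch under assumptions: the class name BatchPdfToText, the helper OutputPathFor, the folder parameters and the instance count are all made up for illustration, and gswin32c.exe is assumed to be on the PATH. Only the Ghostscript switches come from the answer.

```csharp
using System;
using System.Diagnostics;
using System.IO;
using System.Threading.Tasks;

public static class BatchPdfToText
{
    // Hypothetical helper: maps <inputDir>\a.pdf to <outputDir>\a.txt.
    public static string OutputPathFor(string pdfPath, string outputDir)
    {
        return Path.Combine(outputDir, Path.GetFileNameWithoutExtension(pdfPath) + ".txt");
    }

    // Runs one gswin32c.exe instance and saves its stdout as the text file.
    private static void ConvertOne(string pdfPath, string txtPath)
    {
        var startInfo = new ProcessStartInfo
        {
            FileName = "gswin32c.exe", // assumption: on the PATH
            Arguments = "-q -dNODISPLAY -dSAFER -dDELAYBIND -dWRITESYSTEMDICT -dSIMPLE " +
                        "-c save -f ps2ascii.ps \"" + pdfPath + "\" -c quit",
            UseShellExecute = false,
            RedirectStandardOutput = true,
            CreateNoWindow = true
        };
        using (Process process = Process.Start(startInfo))
        {
            string text = process.StandardOutput.ReadToEnd();
            process.WaitForExit();
            File.WriteAllText(txtPath, text);
        }
    }

    public static void ConvertFolder(string inputDir, string outputDir, int instances)
    {
        Directory.CreateDirectory(outputDir);
        string[] pdfFiles = Directory.GetFiles(inputDir, "*.pdf");
        var options = new ParallelOptions { MaxDegreeOfParallelism = instances };

        // Each iteration blocks on its own Ghostscript process, so at most
        // `instances` copies of gswin32c.exe run at the same time.
        Parallel.ForEach(pdfFiles, options, pdfPath =>
        {
            try
            {
                ConvertOne(pdfPath, OutputPathFor(pdfPath, outputDir));
            }
            catch (Exception ex)
            {
                // Log and continue so one bad file does not stop the batch.
                Console.Error.WriteLine(pdfPath + ": " + ex.Message);
            }
        });
    }
}
```

Starting with a small instance count (for example 2) and raising it while watching CPU and disk load seems a reasonable way to size this for a 15000-file batch.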
1

Converting PDF to text is not really straightforward, and you won't see anyone posting code here that converts PDF to text directly. Your best bet is to use a library that does the job for you. A good one is PDFBox; you can google it. You'll probably find it written in Java, but fortunately you can use IKVM to convert it to .NET.

Zaid Amir
1

As an alternative to Don's solution, I found the following:

Extract Text from PDF in C# (100% .NET)

Justin
0

Docotic.Pdf library can extract text from PDF files (formatted or not).

Here is sample code that shows how to extract formatted text from a PDF file and save it to another file.

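using System.IO;
using BitMiracle.Docotic.Pdf;

public static void ExtractFormattedText(string pdfFile, string textFile)
{
    // Open the PDF, extract its text with formatting, and write it to the text file.
    using (PdfDocument doc = new PdfDocument(pdfFile))
    {
        string text = doc.GetTextWithFormatting();
        File.WriteAllText(textFile, text);
    }
}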

Also, there is an article on our site that shows other options for extracting text from PDF files.

Disclaimer: I work for Bit Miracle, vendor of the library.

Bobrovsky
  • But it's a simple conversion. I needed conversion for articles where the PDF files are in different layouts... – aharon Dec 21 '11 at 09:04
-2
    // Requires the iTextSharp library (5.x) and a WinForms form with a RichTextBox named richTextBox1.
    using System;
    using System.IO;
    using System.Text;
    using System.Windows.Forms;
    using iTextSharp.text.pdf;
    using iTextSharp.text.pdf.parser;

    public void PDF_TEXT()
    {
        richTextBox1.Text = string.Empty;

        ReadPdfFile(@"C:\Myfile.pdf");  // read the PDF file from this location
    }

    public void ReadPdfFile(string fileName)
    {
        StringBuilder text = new StringBuilder();
        try
        {
            if (File.Exists(fileName))
            {
                // Open the reader only once, after confirming the file exists.
                PdfReader pdfReader = new PdfReader(fileName);

                // Extract the text page by page; page numbers start at 1.
                for (int page = 1; page <= pdfReader.NumberOfPages; page++)
                {
                    ITextExtractionStrategy strategy = new SimpleTextExtractionStrategy();
                    string currentText = PdfTextExtractor.GetTextFromPage(pdfReader, page, strategy);
                    text.Append(currentText);
                }
                pdfReader.Close();
            }
        }
        catch (Exception ex)
        {
            MessageBox.Show(ex.Message);
        }
        richTextBox1.Text = text.ToString();
    }

    private void Save_TextFile_Click(object sender, EventArgs e)
    {
        DialogResult messageResult = MessageBox.Show("Save this file into Text?", "Text File", MessageBoxButtons.OKCancel);
        if (messageResult == DialogResult.Cancel)
        {
            return;
        }

        SaveFileDialog sfd = new SaveFileDialog();
        sfd.Title = "Save As Textfile";
        sfd.InitialDirectory = @"C:\";
        sfd.Filter = "TextDocuments|*.txt";

        if (sfd.ShowDialog() == DialogResult.OK)
        {
            if (richTextBox1.Text != "")
            {
                // Write the extracted text as plain text, then clear the box.
                richTextBox1.SaveFile(sfd.FileName, RichTextBoxStreamType.PlainText);
                richTextBox1.Text = "";
                MessageBox.Show("Text Saved Successfully", "Text File");
            }
            else
            {
                MessageBox.Show("Please Upload Your Pdf", "Text File",
                    MessageBoxButtons.OKCancel, MessageBoxIcon.Asterisk);
            }
        }
    }
shuvo sarker
  • Just pasting some code is not helpful. – mkl Sep 03 '15 at 08:47
  • I don't think there is anything here difficult enough to need describing. – shuvo sarker Sep 03 '15 at 09:06
  • *I don't think there is anything here difficult enough to need describing.* - Well, out of the box your code does not even compile, for the simple reason that you did not mention the dependencies. Neither the question nor your answer mentions iTextSharp. Anyone not recognizing the classes in question will be instantly lost. Furthermore, you have unnecessary code elements; if the OP wants to create a command-line application, GUI element event listeners are inappropriate. As a good example, look at @Bobrovsky's answer: he both mentioned the library dependency and presented only pivotal code. – mkl Sep 03 '15 at 10:20