How to program a text search and replace in PDF files

Question

How can I programmatically search and replace some text in a large number of PDF files? I would like to remove a URL that has been added to a set of files. I have been able to remove the link using JavaScript under Batch Processing in Adobe Acrobat Pro, but the link text remains. I have seen recommendations to use text touchup, which works manually, but I don't want to modify 1300 files manually.

I know it's really old, but I came along to this problem and you are the first result on google. What did you use at the end? — eri0o, Jan 28 '15 at 19:10
I used Perl, the CAM::PDF module and the sample changepagestring.pl program as suggested in Chris Dolan's answer. That was a one-time thing, so don't ask me how to do this now ;-) — rpilkey, Jan 29 '15 at 14:05

score 21 · Chris Dolan · Accepted Answer · 2021-09-21T15:43:14.537

Finding text in a PDF can be inherently hard because of the graphical nature of the document format -- the letters you are searching for may not be contiguous in the file. That said, CAM::PDF has some search-replace capabilities and heuristics. Give changepagestring.pl a try and see if it works on your PDFs.

To install:

 $ cpan install CAM::PDF
 # start a new terminal if this is your first cpan module
 $ changepagestring.pl input.pdf oldtext newtext output.pdf
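
Since the question involves about 1300 files, here is a minimal sketch of running the command above over a whole folder. It assumes `changepagestring.pl` is on your PATH and that it exits with a non-zero status on failure (an assumption, not something documented here). The names `build_command` and `replace_in_folder` are made up for this example.

```python
import subprocess
from pathlib import Path


def build_command(src: Path, dst: Path, old: str, new: str) -> list[str]:
    """Argument order follows the usage line above: input, old text, new text, output."""
    return ["changepagestring.pl", str(src), old, new, str(dst)]


def replace_in_folder(in_dir: Path, out_dir: Path, old: str, new: str) -> list[Path]:
    """Run changepagestring.pl on every PDF in in_dir and return the inputs that failed."""
    out_dir.mkdir(parents=True, exist_ok=True)
    failed = []
    for src in sorted(in_dir.glob("*.pdf")):
        cmd = build_command(src, out_dir / src.name, old, new)
        # Assumption: a non-zero exit status means the file was not processed.
        result = subprocess.run(cmd, capture_output=True, text=True)
        if result.returncode != 0:
            failed.append(src)
    return failed
```

Writing to a separate output folder keeps the originals untouched, which matters because the replacement may silently do nothing on PDFs whose text is split by kerning.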

edited Sep 21 '21 at 15:43

answered Oct 21 '08 at 04:52


Thanks a lot Chris, for the answer, and for the module on CPAN. That worked nicely for me. Hopefully Google picks up this page, I didn't see the CAM::PDF module in my searches. Roger – rpilkey Oct 21 '08 at 18:26
for anyone else looking, I tried the trial version of http://www.verypdf.com/app/pdf-text-replacer/pdf-find-and-replace-assistant.html and it worked nicely. – RozzA Jan 07 '15 at 01:21
@rpilkey Can anyone provide me with a sample example? I am new to Perl and I am not aware of how to run that package. – Sundeep Pidugu Apr 29 '19 at 07:45
Only seems to work for simple text, not any TJ boxes with glyph offsets, which seems common ... https://stackoverflow.com/questions/220445/how-to-program-a-text-search-and-replace-in-pdf-files/67932076#67932076 – rogerdpack Sep 10 '21 at 04:47
I got `Warning: Cannot install CAM-PDF, don't know what it is.` using `cpan install CAM::PDF` worked though – Matthew Lock Sep 20 '21 at 06:56
@MatthewLock thanks, I changed the answer from "CAM-PDF" to "CAM::PDF". cpan must have changed somewhat in the intervening 13 years :-D As for 3 letters, yeah, there's probably kerning in your doc that breaks the text string up into pieces so CAM::PDF's rudimentary search/replace can't find it. – Chris Dolan Sep 21 '21 at 15:45
In Debian/unstable, `changepagestring` does not work at all (I've tried on a single word, so this is simpler than a regexp), even on a simple PDF file obtained with `pdflatex`, for which `pdftotext` can find the word. [Debian bug 1019979](https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=1019979). – vinc17 Sep 18 '22 at 02:15
@vinc17 the letters you are searching for may not be contiguous in the file. LaTeX does rather sophisticated kerning so the words are likely not organized as simple strings. Other tools are using heuristics to decide which letters are part of the same word, and mine doesn't do that. – Chris Dolan Sep 20 '22 at 00:48
@ChrisDolan Yes, additional positive or negative spacing between letters may be needed for justified text in paragraphs. After uncompressing the data streams with `qpdf --stream-data=uncompress`: `[(T)-0.200947(h)-0.599165(e)-333.387(f)-0.599165(ol)-0.800112(l)-0.800112(o)26.9967(w)-0.200947(i)-0.798886(n)-0.599165(g)-332.981(l)-0.798886(i)-0.801337(n)-0.597939(k)-333.785(w)27.8017(or)-0.698413(k)-0.801337(s)-334.415(:)-666.803([)-0.798886(1])-334.812(\()-0.90181(f)-0.597939(o)-26.9832(o:)-0.798886(b)-0.60039(ar)-0.698413(\))-0.90181]TJ` – vinc17 Sep 20 '22 at 09:30
@ChrisDolan Actually this paragraph has only one line. So this is just kerning. But it should be possible to detect the small values (see the values less than 1 vs something around 333 for a space). – vinc17 Sep 20 '22 at 09:35
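
The idea in the last comment (small offsets are kerning, offsets around 333 are word spaces) can be sketched as follows. This is only an illustration, not what CAM::PDF does. The function name `tj_to_text` and the threshold of 200 are assumptions, and the parser is simplified: it handles single-character backslash escapes but not octal escapes, hex strings, or nested unescaped parentheses.

```python
import re

# One TJ array element: a literal string "(...)" with backslash escapes, or a number.
_TJ_ITEM = re.compile(r"\((?P<text>(?:\\.|[^\\()])*)\)|(?P<num>[-+]?\d*\.?\d+)")
_ESCAPE = re.compile(r"\\(.)", re.DOTALL)


def tj_to_text(array_body: str, space_threshold: float = 200.0) -> str:
    """Join the strings of a TJ array, turning large negative offsets into spaces."""
    out = []
    for m in _TJ_ITEM.finditer(array_body):
        if m.group("text") is not None:
            # Simplified unescaping: "\(" becomes "(", "\\" becomes "\".
            out.append(_ESCAPE.sub(r"\1", m.group("text")))
        elif -float(m.group("num")) >= space_threshold:
            # A large negative number moves the next glyph far to the right: treat as a space.
            out.append(" ")
    return "".join(out)


print(tj_to_text("[(T)-0.2(h)-0.6(e)-333.387(f)-0.6(o)-26.98(o)]TJ"))  # The foo
```

A suitable threshold depends on the font and font size, so a fixed value like 200 is only a starting point.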

score 9 · Answer 2 · edited Jan 19 '23 at 14:54

I had also become desperate: after 10 PDF editor installations, all of which cost money, I had no success.

It turns out that pdftk plus a text editor suffice:

Replace Text in PDF Files

Use pdftk to uncompress PDF page streams

pdftk original.pdf output original.uncompressed.pdf uncompress

Replace the text (sometimes this works, sometimes it doesn't) within original.uncompressed.pdf

Repair the modified (and now broken) PDF

pdftk original.uncompressed.pdf output original.uncompressed.fixed.pdf

(from Joel Dare)
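
For many files, the three steps above can be scripted. The following is a minimal sketch, assuming `pdftk` is on your PATH and using the two pdftk invocations shown above. The function names are made up for this example. As noted above, it only works when the text is stored as one contiguous string in the page stream, and it replaces every occurrence of the bytes anywhere in the file.

```python
import subprocess
from pathlib import Path


def replace_bytes(data: bytes, old: str, new: str) -> tuple[bytes, int]:
    """Replace old with new in raw PDF bytes; return the new bytes and the hit count."""
    old_b, new_b = old.encode("latin-1"), new.encode("latin-1")
    return data.replace(old_b, new_b), data.count(old_b)


def pdftk_replace(src: Path, dst: Path, old: str, new: str) -> int:
    """Uncompress with pdftk, patch the bytes, then let pdftk repair the file."""
    tmp = dst.with_suffix(".uncompressed.pdf")
    # Step 1: uncompress the page streams.
    subprocess.run(["pdftk", str(src), "output", str(tmp), "uncompress"], check=True)
    # Step 2: plain byte replacement (this is the step that sometimes finds nothing).
    patched, hits = replace_bytes(tmp.read_bytes(), old, new)
    tmp.write_bytes(patched)
    # Step 3: run the file through pdftk again to repair the now-broken structure.
    subprocess.run(["pdftk", str(tmp), "output", str(dst)], check=True)
    tmp.unlink()
    return hits
```

A return value of 0 means the text was not found as a contiguous string, so the output is effectively an unchanged copy.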

score 1 · Answer 3 · answered Dec 18 '12 at 14:07

This is just half a solution, but I used Touch up combined with AppleScript's support for sending keystrokes to replace a string in thousands of table cells. Depending on how your pages are laid out, it could work for you. In my case I had to manually insert the cursor at the beginning of every table (tens of tables, which is quite manageable for a manual process), but after that I replaced thousands of cells automatically.

score 1 · Answer 4 · edited Feb 29 '12 at 05:52

You can use the 'redaction' feature in Adobe Acrobat Pro to find & replace all references in a single document in one step...not sure if it can be automated to multiple steps.

http://help.adobe.com/en_US/Acrobat/9.0/Professional/WS5E28D332-9FF7-4569-AFAD-79AD60092D4D.w.html

edited Feb 29 '12 at 05:52

Calydon


answered Jul 28 '10 at 17:44

davr


score 1 · Answer 5 · answered Nov 12 '11 at 13:54

I just finished trying out Infix on a text laden with diacritics, hoping to generate another text in which characters with double and composed diacritics are replaced by alternates with single diacritics. Infix is definitely a good solution for someone who does not want the trouble of understanding how programmatic solutions work. All the requested changes were made. I still need to understand how to reflow words when a change alters the layout of the text.

score 0 · Dimitar · Answer 6 · 2015-01-17T15:30:34.420

The question is for a programmatic solution, but I will still share this free online tool which helped me mass replace text in some PDF files:

http://www.pdfdu.com/pdf-replace-text.aspx

I did not notice any ads or other modifications in the resulting PDF files after replacing the text.

I was not able to make the changes locally with the software I tried. I think the main problem was that I was missing the font used in the PDF and it did not work properly, even with Acrobat Pro. The online tool did not complain and produced a great result.

edited Jan 17 '15 at 15:30

answered Jan 14 '15 at 22:26


The OP asked for a **programmatical** solution, not a manual one. – mkl Jan 15 '15 at 09:53
@mkl You're right, thanks for pointing this out. I edited my answer to make that more clear. I came upon this question in my search for a one-time solution of mass-replacing some text in PDFs. I was okay with a programmatical solution, but nothing I tried worked. That online tool did work, however, so I decided to share it anyway. – Dimitar Jan 17 '15 at 15:35

score 0 · Answer 7 · answered Jul 14 '17 at 00:30

You can use the VeryPDF PDF Text Replacer Command Line software to batch replace text in PDF pages by running pdftr.exe, for example:

pdftr.exe -contentreplace "My Name=>Your Name" D:\in.pdf D:\out.pdf

pdftr.exe -searchandoverlaytext "My Name=>Your Name" D:\in.pdf D:\out.pdf

pdftr.exe -searchandoverlaytext "My Name=>D:\temp\myname.png*20*20" D:\in.pdf D:\out.pdf

pdftr.exe -pagerange 1-3 -contentreplace "Old Text=>New Text||VeryPDF=>VeryDOC||My Name=>Your Name" D:\in.pdf D:\out.pdf

pdftr.exe -searchtext "string" C:\in.pdf

pdftr.exe -pagerange 1 -searchtext "string" C:\in.pdf

pdftr.exe -pagerange 1 -searchandoverlaytext "Old Text=>New Text||VeryPDF=>VeryDOC||My Name=>Your Name" D:\in.pdf D:\out.pdf

pdftr.exe -overlaytextfontname "Arial" -overlaytextcolor FF0000 -overlaybgcolor 00FF00 -searchandoverlaytext "Old Text=>New Text||VeryPDF=>VeryDOC||My Name=>Your Name" D:\in.pdf D:\out.pdf

pdftr.exe -opw 123 -upw 456 -contentreplace "Old Text=>New Text||VeryPDF=>VeryDOC||My Name=>Your Name" D:\in.pdf D:\out.pdf

pdftr.exe -searchandoverlaytext "PDFcamp Printer=>VeryPDF Printer" -overlaytextfontsize 8 D:\in.pdf D:\out.pdf

pdftr.exe -searchandoverlaytext "PDFcamp Printer=>VeryPDF Printer" -overlaytextfontsize 80% D:\in.pdf D:\out.pdf

Doesn't seem to be free, and it is Windows only. – rogerdpack Jun 11 '21 at 06:05

score 0 · Answer 8 · answered Jan 07 '11 at 04:13

Not sure I would want to do all the work of writing code to modify your 1300 files when there is a program that can do it for you. The other day, I used the Professional version of Infix to batch modify almost 100 files using its "Find and Replace in Files" feature. It works great. I have evaluated other programs in the hope of finding find-and-replace functionality similar to Microsoft Word's. Infix was the only one I found that can do it. Check out: http://www.iceni.com/infix-pro.htm

score 0 · rogerdpack · Answer 9 · 2023-03-30T02:47:42.730

It appears that even in uncompressed PDFs, text is sometimes stored in a funky format. This makes "normal" text replacement, a la sed, fail or at least be non-trivial.

I couldn't find anything that seemed to work with glyph spacing offsets, which seem very common in PDFs. In this example, the phrase "Other information" is stored like this:

 [(O)-16(ther i)-20(nformati)-11(on )]TJ

I have attempted to write a tool that satisfies this myself. It works OK for common use cases. Check it out here.

First uncompress your pdf, then cd to the checked out git code and:

Syntax

 $ crystal replaceinpdf.cr input_filename.pdf "something you want replaced" "what you want it replaced with" output_filename.pdf

Enjoy! Requests welcome.
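
To illustrate the glyph-offset problem described in this answer (this is not the code of the tool above), here is a minimal sketch that builds a regular expression tolerating a `)number(` break between any two characters of the search text. The name `kerning_tolerant_pattern` is made up for this example. It assumes the text is stored as literal strings in the font's standard encoding, it does not handle spaces encoded as large offsets, and replacing the match discards the original kerning.

```python
import re

# Optional break between two characters inside a TJ array: ")", a number, "(".
_BREAK = r"(?:\)\s*[-+]?\d*\.?\d+\s*\()?"


def kerning_tolerant_pattern(text: str) -> re.Pattern:
    """Build a regex that matches text even when TJ offsets split it into pieces."""
    return re.compile(_BREAK.join(re.escape(ch) for ch in text))


stream = "[(O)-16(ther i)-20(nformati)-11(on )]TJ"
pattern = kerning_tolerant_pattern("Other information")
print(pattern.sub("Other details", stream))  # [(Other details )]TJ
```

Because the whole match is swapped for one plain string, the replaced text may be spaced slightly differently from its neighbours when rendered.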

edited Mar 30 '23 at 02:47

answered Jun 11 '21 at 06:16


I tried this and it said, "no changes, is pdf compressed perhaps?" with exit code 0. Command was `replaceinpdf SC13R0J_HTML/MANUAL.HTM/rm13r0j/ewd/contents/relay/pdf/JC_01.pdf 黒色 black out.pdf` – Douglas Held Mar 06 '23 at 23:20
File an issue with all details :) – rogerdpack Mar 07 '23 at 04:53

score -1 · Answer 10 · answered May 07 '21 at 17:36

Although this is quite an old thread, I just wanted to share a Node.js package option for searching and replacing text in PDFs: Aspose.PDF Cloud SDK for Node.js. It is a paid product, but it provides 150 free monthly API calls.


const fs = require("fs");
const { PdfApi } = require("asposepdfcloud");
const { TextReplaceListRequest } = require("asposepdfcloud/src/models/textReplaceListRequest");
const { TextReplace } = require("asposepdfcloud/src/models/textReplace");

// Get Client ID and Client Secret from https://dashboard.aspose.cloud/
const pdfApi = new PdfApi("xxxxx-xxxx-xxxx-xxxx-xxxxxxxxxx", "xxxxxxxxxxxxxxxxxxxxxx");

// The file is assumed to already exist in cloud storage under remoteTempFolder,
// because the upload step below is commented out.
const name = "02_pages.pdf";
const remoteTempFolder = "Temp";
// const localTestDataFolder = "C:\\Temp";
// const path = remoteTempFolder + "\\" + name;
// const outputFile = "Replace_output.pdf";

// Upload file (optional)
// pdfApi.uploadFile(path, fs.readFileSync(localTestDataFolder + "\\" + name)).then((result) => {
//     console.log("Uploaded File");
// }).catch(function (err) {
//     // Deal with an error
//     console.log(err);
// });

// First replacement: plain-text match, not a regular expression
const textReplace = new TextReplace();
textReplace.oldValue = "origami";
textReplace.newValue = "aspose";
textReplace.regex = false;

// Second replacement
const textReplace1 = new TextReplace();
textReplace1.oldValue = "candy";
textReplace1.newValue = "biscuit";
textReplace1.regex = false;

// Bundle both replacements into a single request
const trr = new TextReplaceListRequest();
trr.textReplaces = [textReplace, textReplace1];

// Replace text
pdfApi.postDocumentTextReplace(name, trr, null, remoteTempFolder).then((result) => {
    console.log(result.body.code);
}).catch(function (err) {
    // Deal with an error
    console.log(err);
});

// Download file (optional)
// const outputPath = "C:/Temp/" + outputFile;
// pdfApi.downloadFile(path).then((result) => {
//     fs.writeFileSync(outputPath, result.body);
//     console.log("File Downloaded");
// }).catch(function (err) {
//     // Deal with an error
//     console.log(err);
// });

score -1 · Answer 11 · answered Mar 16 '22 at 06:59

This library has extensive support. Check it out.

PDF-LIB

answered Mar 16 '22 at 06:59

Ahmet Firat Keler


Does it do text replacement? – rogerdpack Mar 30 '23 at 04:21
