Is there a way to edit the raw text of a PDF without any special paid software? Some PDFs have highlightable text, so I assume that the text is stored somewhere in the file.

I tried dragging and dropping a PDF into VS Code, but it only showed me unknown characters and a little metadata text, and if I edit the metadata the file ends up mostly corrupted. Apart from that, I could not find any of the text content of my PDF in the VS Code editor.

Does anyone know of a solution, like inspecting and changing the source code somehow, that needs no special software? I want to edit the content, not the metadata.

(I use macOS)

Ljonja

1 Answer

The text you see on a PDF page can be constructed in dozens of different ways. There are millions of users, using potentially hundreds if not thousands of different methods.

Update: The question is about macOS, but a natively cross-platform approach means writing the PDF itself as plain text. As an example of how that is possible, on Windows you can write one line by line with cmd. Here is a snippet from what was a few dozen lines :-)

echo %%PDF-1.0>demo.pdf
echo %%µ¶µ¶>>demo.pdf
echo/>>demo.pdf

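rem Record the current file size: the byte offset where object 1 starts (needed later for the xref table)
for %%Z in (demo.pdf) do set "FZ1=%%~zZ"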
echo 1 0 obj>>demo.pdf
echo ^<^</Type/Catalog/Pages 2 0 R^>^>>>demo.pdf
echo endobj>>demo.pdf
echo/>>demo.pdf

The fuller version, which through feature creep is now over 100 lines and counting, is at
https://github.com/GitHubRulesOK/MyNotes/raw/master/MAKE-PDF.cmd
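Since the asker is on macOS, the same line-by-line idea can be sketched in Python, which ships for all platforms. This is only a minimal illustration, not a port of the script above: the function name `build_minimal_pdf`, the page size, and the choice of the built-in Helvetica font are my own assumptions. It shows the part that matters when editing by hand, namely that the xref table stores the byte offset of every object, so changing the length of anything means the offsets have to be recomputed.

```python
def build_minimal_pdf(text="Hello World"):
    """Assemble a one-page, uncompressed PDF and return it as bytes."""
    # Escape the characters that are special inside a PDF literal string.
    escaped = text.replace("\\", "\\\\").replace("(", "\\(").replace(")", "\\)")
    # Page content: select font F1 at 18pt, move to (20, 100), show the text.
    # Assumption: the text fits in Latin-1; other characters raise an error.
    content = f"BT /F1 18 Tf 20 100 Td ({escaped}) Tj ET".encode("latin-1")

    objects = [
        b"<</Type/Catalog/Pages 2 0 R>>",
        b"<</Type/Pages/Count 1/Kids[3 0 R]>>",
        b"<</Type/Page/Parent 2 0 R/MediaBox[0 0 200 200]"
        b"/Resources<</Font<</F1 4 0 R>>>>/Contents 5 0 R>>",
        b"<</Type/Font/Subtype/Type1/BaseFont/Helvetica>>",
        b"<</Length %d>>\nstream\n" % len(content) + content + b"\nendstream",
    ]

    out = bytearray(b"%PDF-1.4\n")
    offsets = []
    for number, body in enumerate(objects, start=1):
        offsets.append(len(out))  # byte offset of "N 0 obj", needed by xref
        out += b"%d 0 obj\n" % number + body + b"\nendobj\n"

    xref_position = len(out)
    out += b"xref\n0 %d\n" % (len(objects) + 1)
    out += b"0000000000 65535 f \n"  # entry 0 is always the free-list head
    for offset in offsets:
        out += b"%010d 00000 n \n" % offset  # each entry is exactly 20 bytes
    out += b"trailer\n<</Size %d/Root 1 0 R>>\n" % (len(objects) + 1)
    out += b"startxref\n%d\n%%%%EOF\n" % xref_position
    return bytes(out)
```

Writing the result with `open("hello.pdf", "wb").write(build_minimal_pdf())` should give a file you can open in a viewer and also read in a text editor. If you then edit the text in the editor and its length changes, the `/Length` value and the offsets no longer match, which is the likely reason a hand-edited PDF ends up reported as corrupted.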

Here is a JScript variant where "Hello World" can be read by your phone or a robot, because it is drawn as a QR code. However, the binary image data gets corrupted when pasted into a web page, so the QR code string below is distorted; a raw download is available at https://github.com/GitHubRulesOK/MyNotes/tree/master/JScriptSamples

var ByteStream = new ActiveXObject("ADODB.Stream");
ByteStream.Type = 2; // adTypeText: the stream is written as text
ByteStream.Charset = "Windows-1252"; // Best for a PDF writer: single-byte, one character per byte
var BS = ByteStream; // Abbreviate for ease of editing
BS.Open();
BS.Position = 0;

BS.WriteText("%PDF-1.0\n");
BS.WriteText("%Åѧ¡\n");

BS.WriteText("1 0 obj <</Type/Catalog/Pages 2 0 R>> endobj\n");
BS.WriteText("2 0 obj <</Type/Pages/Count 1/Kids[3 0 R]>> endobj\n");
BS.WriteText("3 0 obj <</Type/Page/MediaBox[0 0 144 144]/Rotate 0/Resources<</XObject<</Img0 4 0 R>>>>/Contents 5 0 R/Parent 2 0 R>> endobj\n");
BS.WriteText("4 0 obj <</Type/XObject/Subtype/Image/Height 25/Width 24/BitsPerComponent 1/Length 75/ColorSpace[/Indexed/DeviceRGB 1<FF0000FFFFFF>]>> stream\n");
BS.WriteText('ÿÿÿÿÿÿÀmß[}ÑoEÑ[EÑqEßE}ÀUÿñÿÁ«Á¬ÛZcýÖÇÈ"}ÿÕïÀMsß`§Ñ]9ÑNÑE·ßLÇÀA[ÿÿÿÿÿÿ');
BS.WriteText("\nendstream\nendobj\n");
var Pos1 = "000000000"+BS.Position; // Zero-padded byte offset of object 5 for the xref table
BS.WriteText("5 0 obj <</Length 101>> stream\n");
BS.WriteText("q\n1 0 0 -1 18 54 cm\n35 0 0 -36 0 36 cm\n/Img0 Do\nQ\nq\n1 0 0 -1 71 144 cm\n70 0 0 -72 0 72 cm\n/Img0 Do\nQ\n");
BS.WriteText("\nendstream\nendobj\n\n");
var Pos2 = BS.Position; // Byte offset of the xref table, written after startxref
BS.WriteText("xref\n0 6\n");
BS.WriteText("0000000000 65535 f \n0000000015 00000 n \n0000000060 00000 n \n0000000111 00000 n \n0000000237 00000 n \n"+Pos1.slice(-10)+" 00000 n \n");
BS.WriteText("\ntrailer\n<</Size 6/Info<</Producer(JScrip2pdf)>>/Root 1 0 R>>\nstartxref\n"+Pos2+"\n%%EOF\n");

BS.SaveToFile("HelloWorldR&W.pdf", 2);
BS.Close();

However, although plain text is the simplest form, it is rarely used except to prove the point that it is possible. The rest of the time, "special software" as you call it (a PDF generator/editor) compresses the file objects, most often into binary streams.
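That compression is why a text editor shows unknown characters. As a rough sketch of how to look inside without paid software, the Python below scans a PDF for `stream ... endstream` sections and keeps whatever zlib can inflate, which covers the common FlateDecode filter. The function name `inflate_streams` is made up for this example, and the approach is deliberately naive: it ignores encryption, other filters, and filter chains, and it does not parse the object structure at all.

```python
import re
import zlib

# A stream's data starts after the keyword "stream" and an end-of-line marker.
STREAM_RE = re.compile(rb"stream\r?\n(.*?)endstream", re.DOTALL)


def inflate_streams(pdf_bytes):
    """Return every stream body that zlib can decompress (FlateDecode data)."""
    results = []
    for match in STREAM_RE.finditer(pdf_bytes):
        raw = match.group(1)
        try:
            # decompressobj tolerates the trailing newline before "endstream".
            results.append(zlib.decompressobj().decompress(raw))
        except zlib.error:
            continue  # not Flate data (e.g. an image or another filter): skip
    return results
```

Calling it as `inflate_streams(open("some.pdf", "rb").read())` returns a list of byte strings. Page content streams among them contain operators such as `BT ... Tj ... ET`, but the strings being shown are often font-specific glyph codes rather than readable letters, so even the inflated data is not always something you can edit like HTML.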

So some text may be scanned pixels, other text may be line shapes that look like letters, and at other times it is plain letters with no font included, only a named style, or letters with the font included (embedded) in the file (the preferred option).

Each page may be built differently from the others, so no two PDFs will generally share the same structure, unless, like a bank statement, they use a format that changes little from month to month, even if the balance wobbles about.

So in summary, the tool that will work best is the one that covers every single permutation that Adobe dreamed of and still keeps the result a valid PDF.

Thus Acrobat PRO 3D is on my shelf (even if it goes unused from one year to the next).

There are many cheaper editors; the ones I use more often for small modifications are Tracker Xchange and FreePDF PRO, and both have different limitations.

Your choices on macOS will be more limited, so search for the best one you are willing to pay for.

K J
  • Thank you, I see your point about there being many possibilities. I was hoping to edit the PDF like an HTML file (I'm clearly not that deep into understanding how PDF works). But do you know of a techy solution that involves changing the source code (if possible), or are there only applications that do this entirely for you, without being able to do it in a coding-like way? – Ljonja Mar 09 '22 at 07:20
  • Gonna check it out! Apache PDFBox also looks interesting for manipulating PDFs; it seems to work with Java – Ljonja Mar 09 '22 at 10:40