Can Ghostscript.NET divide a PDF file into multiple sections?

Question

I have a very long PDF file (58 x 500 inches). The goal is to divide this one large vector PDF into sections that are each a given percentage of the original height. For example, 25% = 125 inches in height, while the width stays the same, so the one large PDF would be divided into 4 pages.

ImageMagick was able to do this, but it crashes if I change the DPI to 300. Is it possible to do this with Ghostscript? I am currently using Ghostscript.NET and C#.

Can someone point me in the right direction?

ImageMagick uses Ghostscript, which always rasterizes your PDF. That is why you are running out of RAM at 300 dpi with such a large PDF. You can adjust the ImageMagick resources to use disk space if you run out of RAM. But I doubt you want a rasterized PDF output. So neither ImageMagick nor Ghostscript will preserve the vector data. — fmw42, Jan 12 '20 at 22:56
netvips https://github.com/kleisauke/net-vips will do progressive PDF rendering (it uses poppler rather than ghostscript), so you can render the whole page at 300 DPI and write it out as four huge raster files. If four huge rasters is OK. As fmw42 says, you might prefer vector images. — jcupitt, Jan 13 '20 at 08:42
It's possible to do this, and retain the content as vectors, but you need to run the PDF 4 times to achieve it. Basically, each time you need to set a fixed media size, translate the input PDF content that you want to be in the output onto the fixed media, and run the PDF file. Repeat once for each segment. I can't post an answer as I'm on vacation, but if you search in the Ghostscript tag, I've posted programs previously to extract portions of a PDF. — KenS, Jan 13 '20 at 17:20
Thanks everyone. @jcupitt I will definitely check this one out. That sounds exactly what I need it to do. — Lestrin, Jan 13 '20 at 18:34
@KenS This sounds promising as well. I wonder if it's possible to read parts of the large PDF file and then write the rasterized file? That way it would only need to read the PDF once. — Lestrin, Jan 13 '20 at 18:36
No sorry you can't do that, and as everyone says, I really think you would be better to avoid rendering the file to a bitmap. — KenS, Jan 13 '20 at 20:50

score 0 · Accepted Answer · answered Jan 14 '20 at 09:04

I mentioned netvips in a comment -- it will do progressive PDF rendering (it uses poppler rather than ghostscript), so you can load the whole page at 300 DPI and write it out as four huge raster files.

I don't actually have C# on this laptop, but here's what you'd do in Python. The C# code would be almost the same.

import sys
import pyvips

image = pyvips.Image.new_from_file(sys.argv[1], dpi=300, access="sequential")
n_pages = 4

for n in range(n_pages):
    filename = f"page-{n}.tif"
    print(f"rendering {filename} ...")

    y = int(n * image.height / n_pages)
    page_height = int(min(image.height / n_pages, image.height - y))
    page = image.crop(0, y, image.width, page_height)
    page.write_to_file(filename)

The access="sequential" puts libvips into sequential mode -- pixels will only be computed on demand from the final write operation. You should be able to render your 150,000 pixel high image (500 inches at 300 DPI) using only a modest amount of memory.
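To put numbers on that, here is a rough sketch of the arithmetic for the page in the question. The 3 bytes per pixel figure is my assumption (8-bit RGB, no alpha), and strip_bounds is just an illustrative helper, not part of pyvips. It uses integer division so the strips tile the image exactly even when the height does not divide evenly.

```python
# Page size from the question, in inches, and the render resolution.
WIDTH_IN, HEIGHT_IN, DPI = 58, 500, 300
N_PAGES = 4
BYTES_PER_PIXEL = 3  # assumption: 8-bit RGB, no alpha channel

width_px = WIDTH_IN * DPI    # 17400
height_px = HEIGHT_IN * DPI  # 150000
full_bytes = width_px * height_px * BYTES_PER_PIXEL


def strip_bounds(height, n_pages):
    """Return (top, strip_height) pairs that tile `height` rows exactly."""
    edges = [n * height // n_pages for n in range(n_pages + 1)]
    return [(edges[n], edges[n + 1] - edges[n]) for n in range(n_pages)]


strips = strip_bounds(height_px, N_PAGES)

print(f"full image: {width_px} x {height_px} px, "
      f"{full_bytes / 2**30:.1f} GiB uncompressed")
for top, rows in strips:
    print(f"strip at y={top}: {rows} rows")
```

Holding the whole uncompressed image in memory at once would take several gigabytes under that assumption, which is why rendering it strip by strip on demand matters here.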

You don't need to use TIFF, of course. JPEG might be more sensible, and if this is for printing, few people will notice the difference.

As everyone said, it would be better to keep it in a vector format for as long as you can.

This works like a charm. Thank you so much. Good to know there are alternatives out there besides Ghostscript. — Lestrin, Jan 15 '20 at 19:43

score 0 · Answer 2 · edited Jun 19 '22 at 09:52

See this previous answer of mine. It demonstrates how to render a portion of the original input file to a bitmap. I'd suggest you use the exact same technique, but use the pdfwrite device instead of the png16m device, so that you get a PDF file as the output, thus maintaining the vector nature of the input.

So to paraphrase the answer there, this:

gs -dDEVICEWIDTHPOINTS=72 -dDEVICEHEIGHTPOINTS=144 -dFIXEDMEDIA -r300 -sDEVICE=pdfwrite -o out.pdf -c "<</PageOffset [-180 -108]>> setpagedevice" -f input.pdf

Will create a 'window' 1 inch wide by 2 inches high (72 by 144 points), starting 2.5 inches from the left of the original and 1.5 inches up from the bottom (180 and 108 points). It then runs the input, and every portion of it which lies within that window is preserved; everything which lies outside it is dropped.

You'd need to do that multiple times, once for each section you want.
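For the 58 x 500 inch page in the question, the repeated runs could be scripted along these lines. This is only a sketch under some assumptions: the input is a single page whose origin is at the bottom-left corner (a MediaBox starting at 0 0), the executable is called gs (on Windows it is typically gswin64c), and the section-N.pdf output names are made up for illustration. It uses the same switches as the command above, leaving out -r300.

```python
import shlex
import subprocess
import sys

POINTS_PER_INCH = 72


def build_commands(input_pdf, width_in, height_in, n_sections, gs="gs"):
    """Build one Ghostscript argument list per section, top section first."""
    width_pt = round(width_in * POINTS_PER_INCH)
    total_pt = round(height_in * POINTS_PER_INCH)
    # Section edges measured up from the bottom of the page, in points.
    edges = [k * total_pt // n_sections for k in range(n_sections + 1)]
    commands = []
    for page, k in enumerate(reversed(range(n_sections)), start=1):
        bottom, top = edges[k], edges[k + 1]
        commands.append([
            gs,
            f"-dDEVICEWIDTHPOINTS={width_pt}",
            f"-dDEVICEHEIGHTPOINTS={top - bottom}",
            "-dFIXEDMEDIA",
            "-sDEVICE=pdfwrite",
            "-o", f"section-{page}.pdf",  # hypothetical output name
            # Shift the content down so this section lands on the fixed media.
            "-c", f"<</PageOffset [0 {-bottom}]>> setpagedevice",
            "-f", input_pdf,
        ])
    return commands


if __name__ == "__main__" and len(sys.argv) > 1:
    for cmd in build_commands(sys.argv[1], 58, 500, 4):
        print(shlex.join(cmd))
        subprocess.run(cmd, check=True)
```

Each run gets media of 4176 x 9000 points (58 x 125 inches), and the vertical offset steps through 27000, 18000, 9000 and 0 points so that the top quarter is written first.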

I should mention that Ghostscript itself is perfectly capable of rendering the entire PDF file to a bitmap. For very large output files it uses the same kind of display list approach: it creates a (simplified) representation of the original input, and runs that description multiple times. Each time it renders one horizontal band of the final output, then moves down to the next band, and so on.

In my opinion, it's likely that the limiting factor at 300 dpi in your original experience is ImageMagick rather than Ghostscript. I know that Ghostscript is able to render input which is several metres in each dimension at 1200 dpi or more, though it does, of course, take a long time to produce the gigabytes of data.
