6

Given a directory with several jpg files (photos), I would like to create a single pdf file with one photo per page. However, I would like the photos to be stored in the pdf file unchanged; i.e., I would like to avoid decoding and recoding. So ideally I would like to be able to extract the original jpg files (maybe minus the metadata) from the pdf file, using, e.g., a Linux command line tool like pdfimages.

My ideas so far:

  • ImageMagick convert. However, I am confused by the compression options: if I choose 100% quality, does it mean that the jpg is internally decoded and then encoded losslessly? (That is obviously not what I want.)
  • pdflatex. Some people claim that the graphics package includes images losslessly, while others dispute that. In any case, pdflatex would be slightly more cumbersome (I would first have to find out the dimensions of the photos, then set the page size accordingly, and make sure that there are no margins, headers, etc.).
Jakob
  • 1
    Imagemagick will decode then wrap in a PDF vector shell. It will not be lossless. quality 100 for normal JPG is still lossy. – fmw42 Sep 29 '17 at 20:10
  • 1
    You could follow any method you like and then *attach* the JPGs to the PDF. The PDF will provide a visual of the images, while the attachments should be lossless and can be downloaded/extracted. [The PDF Toolkit](https://www.pdflabs.com/tools/pdftk-the-pdf-toolkit/) can help with that. – Werner Sep 30 '17 at 16:30

4 Answers

11

img2pdf (PyPI page):

Losslessly convert raster images to PDF without re-encoding PNG, JPEG, and JPEG2000 images. This leads to a lossless conversion of PNG, JPEG and JPEG2000 images with the only added file size coming from the PDF container itself. Other raster graphics formats are losslessly stored using the same encoding that PNG uses. Since PDF does not support images with transparency and since img2pdf aims to never be lossy, input images with an alpha channel are not supported.

(pdfimages -all does the exact opposite.)
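To illustrate why no re-encoding is needed: PDF can store JPEG data directly as an image stream with the `/DCTDecode` filter, so a tool only has to read the image dimensions and wrap the bytes. The following is a minimal sketch of that idea in plain Ruby, not img2pdf's actual implementation. The helper names `jpeg_info` and `jpeg_to_pdf` are made up for this example, only grayscale and three-component JPEGs are handled, one pixel is mapped to one point, and the marker scan is simplified.

```ruby
# Sketch only: wrap a single JPEG in a one-page PDF without decoding it.
# jpeg_info and jpeg_to_pdf are hypothetical helper names.

# Walk the JPEG segments up to the start-of-frame marker and return
# [width, height, number_of_components]. Simplified: assumes every segment
# before the frame header carries a length field and there are no fill bytes.
def jpeg_info(data)
  unless data.getbyte(0) == 0xFF && data.getbyte(1) == 0xD8
    raise ArgumentError, 'not a JPEG (missing SOI marker)'
  end
  pos = 2
  while pos + 4 <= data.bytesize
    raise ArgumentError, 'unexpected byte where a marker should be' unless data.getbyte(pos) == 0xFF
    marker = data.getbyte(pos + 1)
    length = data.byteslice(pos + 2, 2).unpack1('n')
    # SOF0..SOF15, except DHT (C4), JPG (C8) and DAC (CC)
    if (0xC0..0xCF).cover?(marker) && ![0xC4, 0xC8, 0xCC].include?(marker)
      height, width, comps = data.byteslice(pos + 5, 5).unpack('nnC')
      return [width, height, comps]
    end
    pos += 2 + length
  end
  raise ArgumentError, 'no start-of-frame marker found'
end

# Build the PDF as a binary string; the JPEG bytes are copied in verbatim.
def jpeg_to_pdf(jpeg)
  width, height, comps = jpeg_info(jpeg)
  colorspace = { 1 => '/DeviceGray', 3 => '/DeviceRGB' }.fetch(comps) do
    raise ArgumentError, "#{comps}-component JPEGs are not handled in this sketch"
  end
  # Page content: scale the unit square to the image size and paint the image.
  content = "q #{width} 0 0 #{height} 0 0 cm /Im0 Do Q"
  image_dict = "<< /Type /XObject /Subtype /Image /Width #{width} /Height #{height} " \
               "/ColorSpace #{colorspace} /BitsPerComponent 8 /Filter /DCTDecode " \
               "/Length #{jpeg.bytesize} >>"
  objects = [
    '<< /Type /Catalog /Pages 2 0 R >>',
    '<< /Type /Pages /Kids [3 0 R] /Count 1 >>',
    "<< /Type /Page /Parent 2 0 R /MediaBox [0 0 #{width} #{height}] " \
      '/Resources << /XObject << /Im0 4 0 R >> >> /Contents 5 0 R >>',
    image_dict.b + "\nstream\n".b + jpeg.b + "\nendstream".b,
    "<< /Length #{content.bytesize} >>\nstream\n#{content}\nendstream"
  ]
  pdf = "%PDF-1.4\n".b
  # Remember the byte offset of each object for the cross-reference table.
  offsets = objects.each_with_index.map do |body, i|
    offset = pdf.bytesize
    pdf << "#{i + 1} 0 obj\n".b << body.b << "\nendobj\n".b
    offset
  end
  xref_pos = pdf.bytesize
  pdf << "xref\n0 #{objects.size + 1}\n0000000000 65535 f \n"
  offsets.each { |offset| pdf << format("%010d 00000 n \n", offset) }
  pdf << "trailer\n<< /Size #{objects.size + 1} /Root 1 0 R >>\n"
  pdf << "startxref\n#{xref_pos}\n%%EOF\n"
  pdf
end

# Possible usage (file names are placeholders):
#   File.binwrite('photo.pdf', jpeg_to_pdf(File.binread('photo.jpg')))
```

Because the stream is the untouched JPEG file, extracting it again should give back the same bytes. A real tool additionally has to deal with CMYK data, resolution metadata and multiple pages, which is why a dedicated program is the more practical choice.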

Geremia
  • *"Since PDF does not support images with transparency"* - this is not entirely correct, certain soft masks in JPEG2000 are supported directly, see the **SMaskInData** property. In other cases one simply has to separate the alpha channel into a separate gray scale image used as a soft mask which in general also is lossless, merely not as fast. – mkl Feb 13 '20 at 17:54
  • Great software! Adds only a few bytes of PDF markup. And the defaults perfectly did the job: PDF exactly with dimension and the color space from the input image (JPEG metadata has had resolution and color space flag). Installed on macOS with `pipx` and then run the command. Installation and conversion both worked on my first invocation attempt! No errors or unintended results. Very well done! – porg May 16 '23 at 08:39
2

You could use the following small script which relies on HexaPDF (note: I'm the author of HexaPDF) to do this.

Note: Make sure you have Ruby 2.4 installed, then run gem install hexapdf to install hexapdf.

Here is the script:

require 'hexapdf'

doc = HexaPDF::Document.new

# Each command-line argument is an image file; it gets its own page.
ARGV.each do |image_file|
  image = doc.images.add(image_file)
  page = doc.pages.add
  # Image size and page (media box) size
  iw = image.info.width.to_f
  ih = image.info.height.to_f
  pw = page.box(:media).width.to_f
  ph = page.box(:media).height.to_f
  # Scale uniformly so that the image fits onto the page
  rw, rh = pw / iw, ph / ih
  ratio = [rw, rh].min
  iw, ih = iw * ratio, ih * ratio
  # Center the scaled image on the page
  x, y = (pw - iw) / 2, (ph - ih) / 2
  page.canvas.image(image, at: [x, y], width: iw, height: ih)
end

doc.write('images.pdf')

Just supply the images as arguments on the command line; the output file will be named images.pdf. Most of the code deals with centering and scaling the images to nicely fit onto the pages.

gettalong
  • thank you so much. I have never used ruby so far, but will install it now and try the script. I will accept the answer if it works for me. – Jakob Sep 30 '17 at 11:09
  • Works great, thank you so much! I will modify the script to disable scaling etc. and instead adjust each page size to match the corresponding jpg. (It seems that I have to do that at the doc.pages.add step; I will play around with it a bit) – Jakob Sep 30 '17 at 11:25
  • Adjusting the size of the page can be done by providing an array with four values -- the media box -- to `doc.pages.add`. In your case you would do something like `doc.pages.add([0, 0, iw, ih])`. Then leave everything out until `page.canvas...` and change it to `page.canvas.image(image, at: [0, 0])`. – gettalong Sep 30 '17 at 19:58
  • Thanks again! Some photos are rotated (a metadata information), e.g.: the jpg is 100x10 pixels, but the metadata implies it should be interpreted as 10x100. So I guess in that case I should: (a) create a 10x100 pt page, and then (b) do something like canvas.rotate(90) do canvas.image(image, at:[0,0]) end ? (Hope the follow up questions are not impolite; in any case I am very grateful for the information given so far, and, of course, for HexaPDF!) – Jakob Oct 01 '17 at 09:23
  • Yeah, (a) is correct, but in (b) you would need -90 degrees since the positive y-axis is upwards in the PDF coordinate system (i.e. the origin is in the lower-left corner of the page). However, you would still need a method for determining the JPEG meta information since this is not something HexaPDF currently provides. – gettalong Oct 01 '17 at 16:49
  • Thanks again! I am already reading the meta information (in a very primitive way, using imagemagick's identify -verbose and then grep for exif:Orientation:) – Jakob Oct 02 '17 at 07:20
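The orientation handling discussed in these comments could be sketched roughly as follows. This is plain Ruby and not part of HexaPDF; the names `orientation_from_identify` and `page_layout` are invented for illustration. It assumes the usual EXIF convention (6 means the image must be turned 90 degrees clockwise for display, 8 means 90 degrees counter-clockwise, 3 means 180 degrees) and expresses angles in the PDF convention, where positive values are counter-clockwise. Mirrored orientations are not handled, and the translation needed to move the rotated image back onto the page is left out.

```ruby
# Sketch only; both helpers are hypothetical and independent of HexaPDF.

# EXIF orientation value => rotation in degrees (positive = counter-clockwise).
ROTATION_FOR_ORIENTATION = { 1 => 0, 3 => 180, 6 => -90, 8 => 90 }.freeze

# Pull the orientation out of `identify -verbose` output, which contains a
# line such as "exif:Orientation: 6". Falls back to 1 (upright) if absent.
def orientation_from_identify(text)
  value = text[/exif:Orientation:\s*(\d+)/, 1]
  value ? value.to_i : 1
end

# Return the media box for the page and the angle to rotate the image by.
# Width and height are swapped when the image is turned by a quarter.
def page_layout(width, height, orientation)
  angle = ROTATION_FOR_ORIENTATION.fetch(orientation) do
    raise ArgumentError, "orientation #{orientation} is not handled in this sketch"
  end
  page_w, page_h = angle.abs == 90 ? [height, width] : [width, height]
  { media_box: [0, 0, page_w, page_h], angle: angle }
end
```

For the 100x10 example above with orientation 6, `page_layout(100, 10, 6)` gives a 10x100 media box and an angle of -90. The media box is the kind of four-value array that the comments pass to `doc.pages.add`.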
2

Another possibility for storing jpg images into a pdf file in a "lossless" way is provided by PoDoFo:

podofoimg2pdf is able to perform lossless conversion from JPEG to PDF by embedding the jpg file into the pdf container.

podofoimg2pdf
Usage: podofoimg2pdf [output.pdf] [-useimgsize] [image1 image2 image3 ...]

Options:
 -useimgsize    Use the imagesize as page size, instead of A4
c72578
1

Depending on what you wish to do with the files: on Windows, if the images are simple JPEG/GIF/TIF/PNG files, you can store them in a CBZ, ZIP, folder or zipped folder and view them with SumatraPDF, which has a Save As PDF option, so everything is done with a single exe.


It will fail with files that are viewable but not acceptable as PDF inputs, such as WebP or HEIC, so check the filename extension in the viewer beforehand.

It should be lossless in practically all cases; however, you should round-trip with pdfimages -all and do a file compare between input and output to check that no bytes needed to be converted.
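A file compare of this kind could be scripted along these lines. This is only a sketch: `same_bytes?` and `roundtrip_mismatches` are made-up helper names, and it assumes the extracted files are passed in the same order as the originals (for example, sorted by the numeric suffix that pdfimages appends to its output prefix).

```ruby
require 'digest'

# Sketch only: true if both files contain exactly the same bytes.
def same_bytes?(path_a, path_b)
  Digest::SHA256.file(path_a).hexdigest == Digest::SHA256.file(path_b).hexdigest
end

# Return the originals whose extracted counterpart is missing or differs.
# Both arguments are arrays of file paths in matching order.
def roundtrip_mismatches(originals, extracted)
  originals.each_with_index.reject do |original, i|
    extracted[i] && same_bytes?(original, extracted[i])
  end.map(&:first)
end

# Possible usage (directory names are placeholders):
#   bad = roundtrip_mismatches(Dir['photos/*.jpg'].sort, Dir['extracted/*.jpg'].sort)
#   puts bad.empty? ? 'all images are byte-identical' : bad
```

An empty result means every image survived the round trip byte for byte. A mismatch does not necessarily mean the pixels changed; a tool that drops metadata would also produce different bytes.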

K J