I'd like to run pdftohtml on a PDF file and have it write its output to /dev/stdout, or to anything else that lets me capture the output directly from subprocess.

My code:

cmd = ['pdftohtml', '-c', '-s', '-i', '-fontfullname', filename, '-stdout', '/dev/stdout']

result = subprocess.run(cmd, stdout=PIPE, stderr=STDOUT, text=True)

The code above exits with code -11.

I'm running it with Ubuntu 18.04 inside WSL 2.

I've tried to execute the same command in bash:

[1]    14041 segmentation fault (core dumped)  pdftohtml -c -s -i -fontfullname  -stdout /dev/stdout

Passing "-" as the output filename doesn't work either.

What can I do to get html output direct from subprocess.run?

I know it's possible to write to a file and then cat that file, but that's not what I'm looking for.

The solution must be compatible with WSL2 and python stretch docker image. However, any clarification would be helpful : )

Kfcaio
  • Why are you adding `/dev/stdout` after `-stdout`? Also, I don't see the _input file_ listed in your segfault message; if it's an empty string being passed in that position, that would explain things. If it's treating `/dev/stdout` as the input file to read from (as one would expect when it's passed as the first/only positional argument), well, a crash is not surprising. – Charles Duffy Jul 22 '21 at 19:18
  • I added `/dev/stdout` after `-stdout` in an attempt to "print" the output to the screen rather than write it to a file. The input file is a valid one-page PDF. I can convert it to html if I pass a regular html path to pdftohtml, but passing `/dev/stdout` does not work, as explained – Kfcaio Jul 22 '21 at 19:24
  • Yes, I don't _expect_ passing `/dev/stdout` to work. Why should it? `-stdout` means that FD 1, which is the same thing as `/dev/stdout`, should be used. – Charles Duffy Jul 22 '21 at 19:26
  • ...so, who would design a program to require a filename to be given after the `-stdout` argument, and why would they do that? It serves no purpose, because "stdout" is an instruction to use a preopened file descriptor number; if it's preopened, you don't need a filename. – Charles Duffy Jul 22 '21 at 19:26
  • Now, the thing that needs a filename passed is the _input file_, especially when a multi-pass parsing strategy is going to be used, because stdin is frequently (albeit not always) a non-seekable FIFO, so the safe course of action is to assume that its contents can only be read once, front-to-back, with no seeking/rewinding/etc. – Charles Duffy Jul 22 '21 at 19:28
  • BTW, there's more than one program named `pdftotext`. There's the one that ships with `xpdf`; the one that ships with `poppler`; one that's implemented as a Python module; and probably others. It would be helpful to unambiguously specify which one you're asking about. (Folks with the same version of Ubuntu can just install the package and see what they get, but not everyone is going to be running the same OS you have). – Charles Duffy Jul 22 '21 at 19:30
  • (BTW, this isn't really a bash question; when you don't use `shell=True`, `subprocess.Popen` doesn't start any shell at all; and when you _do_ use `shell=True`, the shell it starts by default is `/bin/sh` rather than bash). – Charles Duffy Jul 22 '21 at 19:31
  • ...the poppler version of xpdf's `pdftotext` (from poppler-utils 21.02.0) _does_ support `-` as an output filename and doesn't have a `-stdout`, so that's clearly not the same one you're having trouble with. But I'm not much inclined to go through the others to find a version that matches the behavior reported here. – Charles Duffy Jul 22 '21 at 19:34
  • Ok, I'm using the pdftohtml from poppler. If I get it right, the `-stdout` flag is not meant to take a path, because it will write to /dev/stdout no matter what. Hence, by correcting the call from python I could get it to work: `CompletedProcess(args=['pdftohtml', '-c', '-s', '-i', '-fontfullname', '/tmp/split-diario-oficial-dje-tjgo-tjgo-i-3254-2021-06-18-80.pdf', '-stdout'], returncode=0, stdout='Page-1\n')`. If it's right as it seems, where is the html output? – Kfcaio Jul 22 '21 at 19:35
  • I'm using pdftohtml version 0.90.1 from poppler – Kfcaio Jul 22 '21 at 19:35
  • Interesting. I installed poppler-utils with [Nix](https://nixos.org/), and the version it provided (as aforementioned, 21.02) explicitly supports `-` as an output filename, and neither its man page nor its `--help` output has any mention of `-stdout` even being a thing that exists. – Charles Duffy Jul 22 '21 at 19:38
  • Well, I'll try to upgrade poppler-utils to the current version. Anyway, thank you for your time, sir – Kfcaio Jul 22 '21 at 19:40
  • Ahh! I was looking at `pdftotext` instead of `pdftohtml`; there is indeed a `-stdout` even in the newer `pdftohtml`. My apologies for that error. – Charles Duffy Jul 22 '21 at 19:48
  • FWIW, `pdftohtml -s -i -stdout filename.pdf` does Work For Me. (I'm pretty skeptical about using `-c` together with `-s`; "complex output" mode I would expect to only work when writing to multiple files, since it's setting up multiple frames). – Charles Duffy Jul 22 '21 at 19:50
  • Great, it works here too. `-c` is the culprit. If you submit it as an answer, I'll accept it. Thank you for helping me – Kfcaio Jul 22 '21 at 19:54

1 Answer

"Complex output mode", -c, specifies output using frames. This only works when writing to files.

If you want to write to stdout, stick to only -s without -c -- and leave out /dev/stdout as an argument ("stdout" is a pre-opened file descriptor; because it's already opened, there's no reason to use a name to open it, so -stdout is a flag-type option, rather than an option that takes an option-argument).
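For the Python side, a minimal sketch of the corrected call might look like the following. It uses only the flags confirmed to work in the comments (`-s -i -stdout`, with the input file as the sole positional argument); the helper names `build_cmd`, `describe_returncode`, and `pdf_to_html` are made up for illustration. It also keeps stderr separate from stdout so diagnostics don't get mixed into the HTML, and it decodes a negative return code such as the -11 from the question, which on POSIX is how `subprocess` reports that the child was killed by a signal.

```python
import signal
import subprocess


def build_cmd(pdf_path):
    # -s: single document, -i: ignore images, -stdout: write HTML to stdout.
    # No -c, and no filename after -stdout (it is a flag, not an option with a value).
    return ["pdftohtml", "-s", "-i", "-stdout", pdf_path]


def describe_returncode(returncode):
    # On POSIX, subprocess reports "killed by signal N" as return code -N.
    if returncode < 0:
        return "killed by " + signal.Signals(-returncode).name
    return "exited with status {}".format(returncode)


def pdf_to_html(pdf_path):
    # Capture stdout and stderr separately so warnings stay out of the HTML.
    result = subprocess.run(
        build_cmd(pdf_path),
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        text=True,
    )
    if result.returncode != 0:
        raise RuntimeError(
            describe_returncode(result.returncode) + ": " + result.stderr
        )
    return result.stdout
```

Calling `pdf_to_html("filename.pdf")` then returns the HTML as a string, assuming poppler's `pdftohtml` is on the PATH. `text=True` needs Python 3.7 or newer, as in the question's own code.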

Charles Duffy