I am working on a virtual printer project, and I want to convert PS files to TXT and PDF. I am using ps2pdf, and it converts to PDF well. But when I convert a PS file to TXT with ps2ascii, I run into a problem: the PS file contains Russian characters, and the output is unreadable. How can I convert a PS file containing Russian text to TXT? I read on the web that this is a Unicode problem.
1 Answer
ps2ascii only handles ASCII (the clue is, obviously, in the name). The ps2ascii shell script and PostScript program were removed from the standard Ghostscript source tree some time back, because they were too limited and there is a better option.
The problem with using PostScript is that there is no guaranteed way to relate the character codes used to render the text to Unicode, or any other standard text encoding. PostScript is a language intended for printing, not for editing.
You may be lucky; it depends entirely on the fonts and the Encoding/CMap used by the PostScript program you produce. I note that you are talking about a 'virtual printer'. Is this on Windows? If so, you may be in luck: the Windows PostScript printer driver adds extra (entirely non-standard) information to at least some fonts when it embeds them in the PostScript program. This additional information can be used to retrieve Unicode code points.
I would start by trying the txtwrite device from Ghostscript (and you should use Ghostscript directly instead of using pre-baked scripts) on the PostScript and see if that is able to extract the text.
If not, then try creating a PDF file from the PostScript, and then use the txtwrite device on the PDF file. I'm not absolutely certain that the txtwrite device has all the bells and whistles of the pdfwrite device. It may not be able to use the Unicode information from the font directly, but it can certainly use it from the PDF file.
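The two routes above can be sketched as a small Python wrapper around Ghostscript. This is only a sketch under assumptions: the helper names (`gs_command`, `extraction_plan`, `run_plan`) and the output file names are made up for illustration, and the executable is assumed to be `gs` on the PATH (on Windows the console binary is usually called `gswin64c`).

```python
import subprocess
from pathlib import Path


def gs_command(device, src, dst, gs="gs"):
    """Build a Ghostscript command line that runs `src` through `device` into `dst`."""
    return [
        gs, "-q", "-dSAFER", "-dBATCH", "-dNOPAUSE",
        f"-sDEVICE={device}",
        f"-sOutputFile={dst}",
        src,
    ]


def extraction_plan(ps_path, gs="gs"):
    """Return the commands for both routes: PS -> text, and PS -> PDF -> text."""
    stem = Path(ps_path).with_suffix("")  # hypothetical naming scheme for outputs
    pdf_path = f"{stem}.pdf"
    return [
        # Route 1: txtwrite directly on the PostScript file.
        gs_command("txtwrite", ps_path, f"{stem}.direct.txt", gs),
        # Route 2: make a PDF first, then run txtwrite on that PDF.
        gs_command("pdfwrite", ps_path, pdf_path, gs),
        gs_command("txtwrite", pdf_path, f"{stem}.via-pdf.txt", gs),
    ]


def run_plan(plan):
    """Run each command in order; raises CalledProcessError if Ghostscript fails."""
    for cmd in plan:
        subprocess.run(cmd, check=True)
```

Calling `run_plan(extraction_plan("job.ps"))` would leave two text files side by side, so you can compare what each route recovered.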
I should probably direct you to read the licence for Ghostscript as well, it's the AGPL version 3, just so you don't end up wasting time on something you then discover you can't use for legal reasons.
Edit
After a quick check, it seems we removed the ps2ascii PostScript program, but changed the ps2ascii script to use the txtwrite device instead. So if you use a reasonably recent version of Ghostscript, that's what will be happening. If that's not producing acceptable text, then try creating a PDF file and running ps2ascii on that. If that doesn't work, then most likely you simply can't do what you want: the information has been lost in the process of printing.
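One rough way to judge whether the output is "acceptable" for a Russian document is to measure how many of the extracted letters are actually Cyrillic. The sketch below is a heuristic only: the function name `cyrillic_ratio` is made up for illustration, and it assumes you have already read the extracted text into a Python string (I believe txtwrite writes UTF-8 by default, but check that for your version).

```python
import unicodedata


def cyrillic_ratio(text):
    """Return the fraction of alphabetic characters in `text` that are Cyrillic."""
    letters = [ch for ch in text if ch.isalpha()]
    if not letters:
        return 0.0
    # Unicode character names for Cyrillic letters contain the word "CYRILLIC".
    cyrillic = sum(1 for ch in letters if "CYRILLIC" in unicodedata.name(ch, ""))
    return cyrillic / len(letters)
```

A ratio near zero on a document you know is in Russian suggests the character codes were never mapped to Unicode, which is the case where no converter can help.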
If you make an example PostScript file available which doesn't work, I could say more definitely.
-
Thank you KenS, I will try txtwrite. The printer must work on Windows, and on Linux as well. – ant_dev Nov 09 '19 at 20:14
-
It converts PS to PDF perfectly with the ps2pdf command, but I can't convert PS to TXT: the converted file contains unreadable characters. The PS file contains Russian text, so I think it is a Unicode problem. – ant_dev Nov 09 '19 at 21:33
-
Which alternative tools can I use for the conversion? I tried converting that PS file with online PS-to-TXT converters, but none of them worked. I read this article and I think this is the cause: **Usually text files with Russian (Cyrillic) text are created in Windows with Windows-1251 (or CP-1251) encoding. Less often they use ISO 8859-5. While modern systems use UTF-8.** – ant_dev Nov 09 '19 at 22:51
-
As I said, PostScript is not intended as a portable format; it's intended for printing **only**. It's entirely possible that the information you want simply is not present in the PostScript program. Windows code pages are irrelevant here; the information has been lost in the process of printing. While your approach may work with Windows, it certainly will not work reliably with Linux, because PostScript produced by Linux applications does not contain the (non-standard) information that the Windows driver emits. Essentially, I do not think your goal is reliably achievable. – KenS Nov 10 '19 at 09:58