How do I convert LaTeX to plain-text (ASCII)?

Question

Scenario:
I have a document I created using LaTeX (my resume in this case). It compiles correctly with pdflatex and outputs exactly what I'd like. Now I need the same document converted to plain old ASCII.

Example:
I have seen this done (at least once) here, where the author has a PDF version and an ASCII version that matches the PDF version in almost every way, including margins, spacing and bullet points.

I realize this type of conversion cannot be exact due to limitations in the ASCII format, but a very close approximation does seem possible based on what I have found so far. What is the process for doing this?

From the second-to-last paragraph of the Todd C. Miller page you linked to (emphasis mine): "Please note that **the ASCII version was hand-formatted**. I'm not aware of a latex to ascii translator that preserves formatting, though detex can be used to extract the actual text." — Kevin J. Chase, Apr 07 '17 at 23:18

score 46 · Answer 1 · edited Dec 05 '18 at 19:16

Opendetex is available both for Windows and Linux (compiles fine on a Mac as well). It can be downloaded from https://github.com/pkubowicz/opendetex

Usage:

    detex project

opens project.tex, reads all files included using \include or \includeonly commands, and outputs the resulting text to standard output.

    detex -n project > out.txt

opens project.tex, does not follow \include or \includeonly commands, and outputs the resulting text to out.txt.

    detex --help

shows the full help.

Extract it to any directory of your choice. Say you extracted it to your Downloads directory.

Create another directory inside it, with any name (this is optional but recommended). Let's say the directory is named “my_paper”. Put your paper in the “my_paper” directory. Assume your paper is named project.tex.

Navigate to the path

    cd ~/Downloads/opendetex

Run the command

    detex my_paper/project.tex  > out.txt

Generic form:

    detex -n full_path_to_tex_file.tex > output_text_file.txt
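A comment on this answer reports `detex: warning: can't open file`. As a minimal sketch, the wrapper below checks that the input is readable before calling detex. The function name `detex_to_txt` is hypothetical, and the sketch assumes the `detex` binary is on your PATH:

```shell
# Hypothetical helper: convert one .tex file to a .txt file next to it.
# Assumes the opendetex binary `detex` is on PATH.
detex_to_txt() (
    src=$1
    # Fail early with a clear message instead of letting detex warn.
    if [ ! -r "$src" ]; then
        echo "cannot read input file: $src" >&2
        exit 1
    fi
    # Strip a trailing .tex (if any) and write the text beside the source.
    out="${src%.tex}.txt"
    detex "$src" > "$out"
)
```

With this in place, `detex_to_txt my_paper/project.tex` would write `my_paper/project.txt`.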

edited Dec 05 '18 at 19:16

Flow

answered Jan 14 '13 at 14:25

Mayank Agarwal

This is the best answer, except you probably shouldn't be using the `-n` flag by default. – naught101 Feb 10 '13 at 02:08
Hi, is there a way to fix this error? `detex: warning: can't open file` – Wet Feet Jan 07 '14 at 02:15
@WetFeet I guess you have given the wrong input file name, or you're in a directory where you don't have write permission. Ensure you can create files in that directory. – Mayank Agarwal Apr 23 '14 at 23:49
@WetFeet you have to use linux style filenames, meaning forward slashes and no drive letters on windows – stryba Jan 27 '15 at 14:43
This gives me an empty text file as output. (Mac OSX, opendetex installed via Homebrew; .tex file gets digested fine by Pandoc). – eric_kernfeld Jun 17 '16 at 17:43
Just tried opendetex and didn't work either on OSX 10.11, pandoc worked fine. – Josep Valls Jun 07 '17 at 14:13
Can it also ignore whitespace and newlines? – alper May 27 '20 at 12:45
It can't parse macros, so it's not as good as pdflatex – tribbloid Sep 05 '22 at 01:35

score 17 · Accepted Answer · answered Feb 09 '09 at 21:45

CatDVI can convert DVI to text and attempts to preserve the formatting.

answered Feb 09 '09 at 21:45

Beardo

Do you know how to turn off "justified" alignment? – chuckg Feb 09 '09 at 22:35
Try piping it through fmt(1) with the `-u` option. – Nietzche-jou Jan 20 '10 at 19:36
Just remove the excess spacing, e.g. like this `catdvi foo.dvi | perl -pe 's/[ ]+/ /g'` gives me more reasonable output than `fmt` – Frank May 13 '10 at 18:44
It does not link to a binary installation, and compiling the source code fails with `cannot find -lkpathsea` – ar2015 Apr 13 '16 at 02:06

score 14 · Answer 3 · edited Sep 23 '15 at 16:44

You can try some of the programs proposed here:

TeX to ASCII

edited Sep 23 '15 at 16:44

answered Feb 09 '09 at 21:45

Diego Sevilla

score 13 · Answer 4 · edited Dec 11 '22 at 12:15

pdftotext can preserve layout

If you are using pdflatex, you probably don't want to mess around with your package options to switch to latex to generate a DVI.

Instead, take your pdf file and convert that. This worked for my CV/resume made with the Curve package:

    pdftotext -layout MyResume.pdf

Note the -layout option: it gives a result meant for human reading that resembles the structure of the original PDF, but it breaks lines to achieve that. Leave off -layout for a result that doesn't break lines and is more suitable for further processing.
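To compare the two modes side by side, here is a small sketch, assuming `pdftotext` (from Poppler or Xpdf) is installed. The function name `pdf_to_both` and the output file names are made up for illustration. pdftotext accepts an optional second argument naming the output text file, which the sketch uses to keep both results:

```shell
# Hypothetical helper: produce both variants of the text for one PDF.
# Assumes pdftotext is on PATH.
pdf_to_both() (
    pdf=$1
    base="${pdf%.pdf}"
    # Layout-preserving version, meant for reading.
    pdftotext -layout "$pdf" "$base-layout.txt" || exit 1
    # Version without -layout, meant for further processing.
    pdftotext "$pdf" "$base-flow.txt"
)
```

Running `pdf_to_both MyResume.pdf` would then leave `MyResume-layout.txt` and `MyResume-flow.txt` next to the PDF.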

edited Dec 11 '22 at 12:15

answered Mar 09 '15 at 13:41

ahcox

This solution works great for me - thank you! I tried the python script above, and got an error, and pandoc.org/try didn't return anything while the console listed a 500 error for a GET request. I didn't have much time to debug either one, but this works great! – modulitos Sep 15 '17 at 04:19
One issue with this solution is, that it includes line-wraps. In case that is not wanted, you should leave out `-layout`. – darthn Dec 11 '22 at 11:15

score 9 · Answer 5 · edited Apr 28 '17 at 08:31

You can also try Pandoc, which can transform LaTeX to many other formats. I suggest reading its documentation, because there may be some tricky cases where you need to pass extra arguments.

edited Apr 28 '17 at 08:31

MajidL

answered Apr 27 '13 at 01:22

LittleSweet

Pandoc is superb. For programmatic conversion in Python, including automatic conversion to plain text of many mathematical constructs with reasonable plain text equivalents, I made a little hacky function which might be useful: http://pastebin.com/z7EMvfkZ – andybuckley Jun 10 '13 at 13:30

score 8 · Answer 6 · answered Feb 09 '09 at 23:44

Another option is to use htlatex to create a web page from the LaTeX sources, then use links to convert to plain text. I used the command line

    links -dump -no-numbering -no-references input.html > output.txt

in the past, which gave a rather nice result. The output will of course match the rendered HTML rather than the original PDF, so it may not be exactly what you want.

score 3 · Answer 7 · edited Oct 29 '12 at 17:14

The solution that works best for me is the following. Assuming you have the latex document name (without extension) stored in ${BASENAME} you apply these 3 steps:

    htlatex ${BASENAME}.tex
    iconv -f iso-8859-1 -t utf-8 ${BASENAME}.html > ${BASENAME}-utf8.html
    html2markdown ${BASENAME}-utf8.html > ${BASENAME}.txt

Apparently, you need to have tex4ht and python-html2text installed.
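The three steps can be wrapped in one function so that the chain stops at the first failing step. This is only a sketch: the function name `latex_to_txt_via_html` is hypothetical, and it assumes htlatex, iconv and html2markdown are all installed. It also derives BASENAME from the file name instead of expecting it to be set beforehand:

```shell
# Hypothetical wrapper around the three steps above.
# Runs in a subshell so the cd does not affect the caller.
latex_to_txt_via_html() (
    # Work in the directory that holds the .tex file.
    cd "$(dirname "$1")" || exit 1
    # Derive the name without extension, e.g. cv.tex -> cv
    BASENAME=$(basename "$1" .tex)
    # Each step runs only if the previous one succeeded.
    htlatex "${BASENAME}.tex" &&
    iconv -f iso-8859-1 -t utf-8 "${BASENAME}.html" > "${BASENAME}-utf8.html" &&
    html2markdown "${BASENAME}-utf8.html" > "${BASENAME}.txt"
)
```

Calling `latex_to_txt_via_html path/to/cv.tex` would leave `cv.txt` in the same directory as the source, along with the intermediate HTML files.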

edited Oct 29 '12 at 17:14

Bo Persson

answered Oct 29 '12 at 16:46

Jannis Weide

score 3 · Answer 8 · answered Jan 20 '10 at 19:24

Try the steps here: http://zanedp.livejournal.com/201222.html

Here is a sequence that converts my LaTeX file to plain text:

    $ latex file.tex
    $ catdvi -e 1 -U file.dvi | sed -re "s/\[U\+2022\]/*/g" | sed -re "s/([^^[:space:]])\s+/\1 /g" > file.txt

The -e 1 option to catdvi tells it to output ASCII. If you use 0 instead of 1, it will output Unicode. Unicode will include all the special characters like bullets, em dashes, and Greek letters. It also includes ligatures for some letter combinations like "fi" and "fl". You may not like that, so use -e 1 instead. Use the -U option to tell it to print the Unicode value for unknown characters so that you can easily find and replace them.

The second part of the command finds the string [U+2022] which is used to designate bullet characters (•) and replaces them with an asterisk (*).

The third part eats up all the extra whitespace catdvi threw in to make the text full-justified while preserving spaces at the start of lines (indentation).

After running these commands, you would be wise to search the .txt file for the string [U+ to make sure no Unicode characters that can't be mapped to ASCII were left behind and fix them.
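That final check can be done with grep. A minimal sketch, where the function name `find_unmapped` is made up for illustration:

```shell
# Hypothetical check: list lines that still contain a [U+XXXX] marker.
# Prints matching lines with their line numbers; the exit status is 0
# if at least one marker was found and 1 if none were found.
find_unmapped() {
    # -F treats the pattern as a fixed string, so [ and + need no escaping.
    grep -n -F '[U+' "$1"
}
```

For example, `find_unmapped file.txt || echo "no markers left"` prints either the offending lines or a short confirmation.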

This answer may still be useful, but for me it messes up many letter combinations, including all double-"f"s. Should I specify some non-proportional font or similar first to avoid those problems? — CPBL, Jan 26 '22 at 15:24

score 3 · Answer 9 · answered Feb 09 '09 at 21:55

My usual strategy is to use hyperlatex to turn it into a web page, and then copy and paste from a web browser. I find that this gives the best formatting.

I usually then have to go through and manually fix some line-wrapping...

answered Feb 09 '09 at 21:55

Brian Postow

I tried this out, but unfortunately it doesn't support using an external `cls` file. I'm using a class file to handle repetitive formatting tasks, along with the enumitem class. Thanks though! – chuckg Feb 09 '09 at 22:02
hmmm, I don't think I've had problems with that... but it's been a while since I've used it... and I don't have any of my files at work... – Brian Postow Feb 10 '09 at 14:48

score 3 · Answer 10 · answered Feb 12 '12 at 16:08

When I needed to get the plain text from my TeX file for indexing and searching, I found LaTeX2RTF to be a good solution. It has an installer and GUI for Windows, and it produced an RTF file of my 50-page thesis that I could open in Word.

answered Feb 12 '12 at 16:08

tsvikas

An RTF document still is not really *plain text*, though. – Paŭlo Ebermann Feb 12 '12 at 17:38
I agree. I posted it since it might still be useful to others, looking (as I did) to extract the text in such manner. – tsvikas Feb 22 '12 at 12:03

score 2 · Answer 11 · answered Oct 31 '17 at 06:22

Pandoc allows you to convert files from one format to another. Use the following pandoc command:

    pandoc -s /path/to/foobar.tex -o foobar.txt

If you want your lines to break at a certain column, use the --columns flag. Use --columns 10000 for non-breaking lines.

You can change -o foobar.txt to a number of other output formats, such as markdown (.md). If you don't specify -o foobar.txt, pandoc will print HTML to standard output, which you can render in any online tool.
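As a hedged sketch of a variant: pandoc also has a `plain` writer (`-t plain`) and a `--wrap=none` option, which together ask for plain text without hard line wrapping. Naming the writer explicitly avoids relying on how pandoc interprets the `.txt` extension. The function name `tex_to_plain` is made up for illustration, and the sketch assumes pandoc is on your PATH:

```shell
# Hypothetical helper: LaTeX -> plain text with no hard line wrapping.
# Assumes pandoc is installed and on PATH.
tex_to_plain() {
    # -t plain selects the plain-text writer; --wrap=none disables wrapping.
    # The output file is named after the input, e.g. foobar.tex -> foobar.txt
    pandoc -s "$1" -t plain --wrap=none -o "${1%.tex}.txt"
}
```

For example, `tex_to_plain foobar.tex` would write `foobar.txt` in the current directory.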

To install pandoc follow this official documentation

answered Oct 31 '17 at 06:22

Shubham Chaudhary

Pandoc does not include bibliography – scs Sep 13 '19 at 15:27
it's actually the worst in terms of macro compatibility – tribbloid Sep 05 '22 at 01:35

score 2 · Answer 12 · answered Jul 11 '11 at 02:28

I've tried LyX and it works pretty well. The only nuance is that if you have a TeX file that is including other TeX files, you will need to export them all separately, unless I'm missing something.

answered Jul 11 '11 at 02:28

literal jdm

score 0 · Answer 13 · answered Nov 01 '09 at 19:09

You can import the file into LyX and use LyX's export-to-text feature.

This is kind of silly if you don't use LyX, but if you already have it, it is a very quick and easy solution. I got good results, although to be fair my files are pretty simple. I'm not sure how more elaborate files get converted.

score 0 · Answer 14 · answered May 10 '14 at 17:28

Emacs has the commands iso-iso2tex and iso-tex2iso, which work very well, except that they don't convert single commands like \OE to Œ.

answered May 10 '14 at 17:28

Geremia
