from pybtex.database.input import bibtex
parser = bibtex.Parser()
bibdata = parser.parse_file("sample.bib")

The snippet above parses a .bib file well, but it does not seem to support accented characters written in LaTeX notation, such as {\"u} or \"{u}. Can anyone confirm whether pybtex supports them?

For example, according to LaTeX/Special Characters and How to write “ä” and other umlauts and accented letters in bibliography?, both \"{o} and {\"o} should convert to ö.

Drake Guan

2 Answers

4

Update: this feature is now supported by pybtex since version 0.20.

Before version 0.20, pybtex did not support this. As a workaround, you can read the .bib file through a LaTeX codec before processing it with pybtex, e.g. with latexcodec (https://pypi.python.org/pypi/latexcodec/). This codec converts a wide range of LaTeX commands to Unicode for you.

However, you'll have to remove brackets in a post-processing stage. Why? In order to handle bibtex code gracefully, \"{U} has to be converted into {Ü} rather than into Ü to prevent it from being lower cased in titles. The following example demonstrates this behaviour:

import codecs

import latexcodec  # importing this registers the "latex" codec
import pybtex.database.input.bibtex
import pybtex.plugin

style = pybtex.plugin.find_plugin('pybtex.style.formatting', 'plain')()
backend = pybtex.plugin.find_plugin('pybtex.backends', 'latex')()
parser = pybtex.database.input.bibtex.Parser()
with codecs.open("test.bib", encoding="latex") as stream:
    # this shows what the latexcodec does to the source
    print(stream.read())
with codecs.open("test.bib", encoding="latex") as stream:
    data = parser.parse_stream(stream)
# format every entry with the "plain" style and render it as LaTeX
for entry in style.format_entries(data.entries.values()):
    print(entry.text.render(backend))

where test.bib is

@Article{test,
  author =       {John Doe},
  title =        {Testing \"UTEST \"{U}TEST},
  journal =      {Journal of Test},
  year =         {2000},
}

This will print how the latexcodec converted test.bib into unicode (edited for readability):

@Article{test,
   author = {John Doe}, title = {Testing ÜTEST {Ü}TEST},
   journal = {Journal of Test}, year = {2000},
}

followed by the pybtex rendered entry (in this case, producing latex code):

John Doe.
\newblock Testing ütest {Ü}test.
\newblock \emph{Journal of Test}, 2000.

If the codec stripped the brackets, pybtex would convert the case wrongly: {Ü}TEST would become Ütest only by accident of post-processing, while a bare ÜTEST is lowercased to ütest. Further, in (pathological) cases like journal = {\"u}, the brackets clearly cannot be removed either.
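To make the role of the brackets concrete, here is a minimal sketch of bibtex-style title lowercasing. The function name `sentence_case` is hypothetical and this is a simplification for illustration, not pybtex's actual implementation: it keeps the first character, lowercases everything outside braces, and leaves brace-protected text alone.

```python
def sentence_case(title: str) -> str:
    """Lowercase a title except its first character and brace-protected text.

    Simplified illustration only; not the algorithm pybtex itself uses.
    """
    out = []
    depth = 0  # current brace nesting level
    for i, ch in enumerate(title):
        if ch == "{":
            depth += 1
            out.append(ch)
        elif ch == "}":
            depth = max(depth - 1, 0)
            out.append(ch)
        elif depth == 0 and i > 0:
            # unprotected text after the first character is lowercased
            out.append(ch.lower())
        else:
            # first character, or text inside braces: keep as written
            out.append(ch)
    return "".join(out)


# Braces kept by the codec: the second Ü survives.
print(sentence_case("Testing ÜTEST {Ü}TEST"))  # Testing ütest {Ü}test
# Braces stripped too early: both are lowercased.
print(sentence_case("Testing ÜTEST ÜTEST"))    # Testing ütest ütest
```

The first call reproduces the title from the rendered entry above; the second shows what would be lost if the brackets were removed before the case conversion.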

An obvious downside is that if you render to a non-LaTeX backend, then you have to remove the brackets in a post-processing stage. But you may want to do that anyway to process any special LaTeX commands (such as \url). It would be nice if pybtex could somehow do that for you, but it doesn't at the moment.
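A minimal sketch of such a post-processing step, assuming the rendered text is already a Unicode string. The helper name `strip_braces` is hypothetical; unlike the plain `.replace('{', '').replace('}', '')` approach, it leaves escaped braces (`\{` and `\}`) in place, but it does not attempt to handle other LaTeX commands or an escaped backslash directly before a brace.

```python
import re

# Match a brace that is not immediately preceded by a backslash.
_UNESCAPED_BRACE = re.compile(r"(?<!\\)[{}]")


def strip_braces(text: str) -> str:
    """Remove grouping braces, keeping escaped ones such as \\{ and \\}."""
    return _UNESCAPED_BRACE.sub("", text)


print(strip_braces("Testing ütest {Ü}test."))  # Testing ütest Ütest.
```

This should run only after pybtex has formatted the entry, so that the case protection has already taken effect.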

xuhdev
  • Thanks for this great information. I did a quick test on `"Heged\"{u}s".decode("latex")` and it returns `Heged{ü}s` instead of `Hegedüs`. Kinda confusing to me now. – Drake Guan Nov 03 '13 at 16:10
  • At the moment, the codec does not remove brackets because they matter for bibtex: brackets are used to prevent decapitalization of letters in titles. You can simply remove brackets in a post-processing stage if so desired: ``text.decode('latex').replace('{', '').replace('}', '')``. Does this make sense? – Matthias C. M. Troffaes Nov 04 '13 at 17:25
  • Interesting! Do you mean `\"u` is equivalent to `\"{u}`? – Drake Guan Nov 05 '13 at 03:17
  • @Drake That's not exactly what I meant, but you're correct that in plain LaTeX, \"u is indeed identical to \"{u}. That said, it should be fairly easy to add an option to the codec to drop brackets whenever it can. For pybtex I would not recommend this approach, though, because you risk ending up with lower case symbols where upper case symbols were intended. For example, for bibtex (or pybtex), in titles, ``\"U`` is not the same as ``\"{U}`` because the former would be converted to ``\"u`` and the latter would not. Therefore, the codec takes a safe approach and always keeps brackets. – Matthias C. M. Troffaes Nov 07 '13 at 15:23
  • I've added a full example trying to explain why latexcodec does not remove brackets at the moment; in essence, due to the fact that these brackets are not always redundant in the bibtex format. – Matthias C. M. Troffaes Nov 07 '13 at 16:09
  • According to [LaTeX/Special Characters](http://en.wikibooks.org/wiki/LaTeX/Special_Characters#Escaped_codes) and [How to write “ä” and other umlauts and accented letters in bibliography?](http://tex.stackexchange.com/a/57745/13513), both `\"{o}` and `{\"o}` should convert to `ö`. I'm confused now. – Drake Guan Nov 07 '13 at 16:36
3

You can also use pylatexenc (https://pypi.org/project/pylatexenc/), which converts LaTeX markup to plain Unicode text:

from pylatexenc.latex2text import LatexNodes2Text 

latex_text = 'Gl{\\"o}ckner'
text = LatexNodes2Text().latex_to_text(latex_text)

print(text) # Glöckner


oms004
  • What is the relationship to [latexcodec](https://github.com/mcmtroffaes/latexcodec), with which it seems to share contributors? Oh, I see now the latter recommends the former. – Alan Aug 09 '21 at 21:15