
I'm trying to extract Hebrew text from a Bible in OpenDocument Text (odt) format with the following code:

from odf import text, teletype
from odf.opendocument import load

textdoc = load("Heb-OT.odt")
texts = textdoc.getElementsByType(text.P)
alltext=teletype.extractText(texts[0])
print alltext

This does not print anything, and I don't know what's wrong. The document is very long (about 1000 pages), but I need to search all of it.

Myk Willis
  • I found the code that seems to correspond to the implementation of teletype here: http://bit.ly/1pfgWn7 but it does not help me. Here is the original document I use: http://bit.ly/1Ll99hs (converted to odt by LibreOffice) – Raphaël Poli Feb 27 '16 at 14:41
  • Apparently the text extraction stops at a newline... I still don't know how to change that. – Raphaël Poli Feb 27 '16 at 15:06
  • I converted the whole file to UTF-8 txt with odt2txt, and then I was able to extract the characters with codecs.open ... but if someone can answer the question, I am still interested, just for knowledge. – Raphaël Poli Feb 27 '16 at 15:40
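
One possible explanation, offered as a guess rather than a confirmed diagnosis: `texts[0]` is only the first paragraph of the document, and if that paragraph is empty, nothing is printed. The sketch below walks every paragraph instead of just the first. It deliberately swaps odfpy for the standard library, relying on the fact that an .odt file is a zip archive whose body is stored in `content.xml`. The helper names `iter_paragraphs` and `find_paragraphs` are hypothetical, not part of any library.

```python
import zipfile
import xml.etree.ElementTree as ET

# Namespace URI of text elements (text:p, text:span, ...) in content.xml.
TEXT_NS = "urn:oasis:names:tc:opendocument:xmlns:text:1.0"
PARAGRAPH_TAG = "{%s}p" % TEXT_NS


def iter_paragraphs(odt_path):
    """Yield the plain text of every text:p element, in document order.

    Hypothetical helper. Unlike odfpy's teletype, it ignores the special
    whitespace elements (text:s, text:tab, text:line-break).
    """
    with zipfile.ZipFile(odt_path) as archive:
        root = ET.fromstring(archive.read("content.xml"))
    for element in root.iter(PARAGRAPH_TAG):
        # itertext() also collects text nested in child elements
        # such as text:span.
        yield "".join(element.itertext())


def find_paragraphs(odt_path, needle):
    """Return (index, text) pairs for paragraphs that contain needle."""
    return [
        (index, paragraph)
        for index, paragraph in enumerate(iter_paragraphs(odt_path))
        if needle in paragraph
    ]
```

With the file from the question, `find_paragraphs("Heb-OT.odt", some_hebrew_word)` would list every matching paragraph together with its position. The same loop-over-all-paragraphs idea should carry over to the odfpy code above by calling `teletype.extractText` on each element of `texts` rather than only on `texts[0]`.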

0 Answers