Extracting Hebrew text from an OpenDocument Text Bible

Asked Feb 26 '16 at 16:53

Active Feb 26 '16 at 17:15

Viewed 100 times

I'm trying to extract Hebrew text from a Bible in OpenDocument Text (odt) format with the following code:

from odf import text, teletype
from odf.opendocument import load

textdoc = load("Heb-OT.odt")
texts = textdoc.getElementsByType(text.P)
alltext=teletype.extractText(texts[0])
print alltext

This does not print anything, and I don't know what's wrong. The document is very long (1000 pages), but I need to search all of it.
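
One possible explanation, offered as an assumption rather than a diagnosis: texts[0] is only the first paragraph of the document, and if that paragraph is empty, nothing is printed. The sketch below collects every paragraph instead. It does not use odfpy; it relies on the fact that an .odt file is a ZIP archive whose content.xml stores the text in text:p elements. The function name extract_paragraphs is hypothetical. Unlike teletype.extractText, it does not expand text:s, text:tab or text:line-break elements, and text inside nested notes may appear more than once.

```python
import zipfile
import xml.etree.ElementTree as ET

# Namespace of the OpenDocument "text" elements (text:p, text:span, ...).
TEXT_NS = "urn:oasis:names:tc:opendocument:xmlns:text:1.0"


def extract_paragraphs(odt_path):
    """Return the text of every text:p element in the document, in order.

    Hypothetical helper: reads content.xml straight from the .odt archive.
    """
    with zipfile.ZipFile(odt_path) as archive:
        root = ET.fromstring(archive.read("content.xml"))

    paragraphs = []
    for elem in root.iter("{%s}p" % TEXT_NS):
        # itertext() also walks child elements such as text:span,
        # so formatted runs inside a paragraph are included.
        paragraphs.append("".join(elem.itertext()))
    return paragraphs
```

Calling extract_paragraphs("Heb-OT.odt") and joining the result with newlines would give a single string covering the whole document, which can then be searched.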

edited Feb 26 '16 at 17:15

Myk Willis

asked Feb 26 '16 at 16:53

Raphaël Poli

I found the code that seems to correspond to the implementation of teletype here: http://bit.ly/1pfgWn7 but it does not help me. Here is the original document I use: http://bit.ly/1Ll99hs (converted to odt by LibreOffice) – Raphaël Poli Feb 27 '16 at 14:41
Apparently the text extraction stops at a newline... I still don't know how to change that. – Raphaël Poli Feb 27 '16 at 15:06
I converted the whole file to UTF-8 txt with odt2txt and was then able to extract the characters with codecs.open ... but if someone can answer the question, I am still interested, just for my own knowledge. – Raphaël Poli Feb 27 '16 at 15:40
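
The workaround in the last comment can be sketched as follows, assuming odt2txt has already written the document to a UTF-8 text file. The helper name find_lines is hypothetical, and io.open is swapped in for codecs.open because it behaves the same way on Python 2 and 3.

```python
import io


def find_lines(txt_path, needle):
    """Return (line_number, line) pairs for every line containing needle.

    Hypothetical helper: txt_path is assumed to be a UTF-8 text file,
    for example the output of odt2txt.
    """
    matches = []
    with io.open(txt_path, encoding="utf-8") as handle:
        # Reading line by line keeps memory use low for a long document.
        for number, line in enumerate(handle, start=1):
            if needle in line:
                matches.append((number, line.rstrip("\n")))
    return matches
```

The search string has to be a Unicode string in the same form as the text in the file, so a word typed without vowel points will not match text that is stored with them.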
