How can I extract text from '.odt' and '.doc' files at a URL using Python? I searched for a way to do this but couldn't find anything.

Any lead would be helpful.

from odf import text, teletype
from odf.opendocument import load
 
textdoc = load(r"C:\Users\OMS\Downloads\sample1.odt")
allparas = textdoc.getElementsByType(text.P)
for para in allparas:
    print(teletype.extractText(para))

This works for a local .odt file, but now I need to extract the text from a file at a URL such as:

"https://abc.s3.ap-south-1.amazonaws.com/sample1.odt"

Assume the connection to AWS S3 has already been set up using boto3.

  • If you have a URL to download the file then you can use the `requests` module like [this](https://www.geeksforgeeks.org/downloading-files-web-using-python/) – Girish Jan 21 '21 at 11:39
  • Also, see [this](https://stackoverflow.com/questions/50100221/download-file-from-aws-s3-using-python) to download files from S3 using python – Girish Jan 21 '21 at 11:55
  • I don't want to download the file from S3; I need to read it directly. That's the case! @Girish –  Jan 21 '21 at 12:36

1 Answer

The following was tested with Python 3.6 and a test .odt file:

import boto3
import io
from odf import text, teletype
from odf.opendocument import load

s3_resource = boto3.resource('s3')  # TODO: change the AWS connection logic as per your setup


def get_contents(file_name):
    """Read an S3 object into memory and parse it as an ODF document (nothing is saved to disk)."""
    obj = s3_resource.Object('s3_bucket_name', file_name)  # TODO: change the S3 bucket name as per your setup
    body = obj.get()['Body'].read()  # the whole object as bytes
    return load(io.BytesIO(body))  # wrap the bytes in a file-like object for load()


textdoc = get_contents("test.odt")  # TODO: change the .odt file name as per your setup
allparas = textdoc.getElementsByType(text.P)
for para in allparas:
    print(teletype.extractText(para))
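
If the file is reachable through a plain HTTPS URL (which, for S3, assumes the object is public or the URL is presigned), the same in-memory idea can be sketched with the standard library only. This is not the odfpy approach above but a swapped-in alternative: an .odt file is a ZIP archive whose text lives in `content.xml`, so the sketch unzips it in memory and collects the `text:p` elements. The function names `fetch_bytes` and `extract_odt_paragraphs` are hypothetical, and the sketch ignores ODF whitespace elements such as `text:s` and `text:tab`, so spacing can differ from what `teletype.extractText` returns. It does not handle '.doc' files.

```python
import io
import urllib.request
import zipfile
import xml.etree.ElementTree as ET

# ODF "text" namespace; paragraphs in content.xml are <text:p> elements.
TEXT_NS = "urn:oasis:names:tc:opendocument:xmlns:text:1.0"


def fetch_bytes(url, timeout=30):
    """Download the resource at `url` into memory (hypothetical helper)."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.read()


def extract_odt_paragraphs(data):
    """Return the text of each <text:p> in an .odt file given as bytes."""
    # An .odt file is a ZIP archive; the document body is in content.xml.
    with zipfile.ZipFile(io.BytesIO(data)) as archive:
        root = ET.fromstring(archive.read("content.xml"))
    paragraph_tag = f"{{{TEXT_NS}}}p"
    # itertext() joins the text of the paragraph and its nested spans.
    return ["".join(p.itertext()) for p in root.iter(paragraph_tag)]


# Usage (assumes the URL is publicly readable or presigned):
# data = fetch_bytes("https://abc.s3.ap-south-1.amazonaws.com/sample1.odt")
# for line in extract_odt_paragraphs(data):
#     print(line)
```

The bytes returned by `fetch_bytes` could equally be passed to odfpy as `load(io.BytesIO(data))`, which keeps the original `teletype.extractText` loop unchanged.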


amitd