
I am trying to convert a PDF into a text file using scraperwiki and bs4, but I am getting a TypeError. I am very new to Python and would really appreciate some assistance.

Error occurs here:

File "scraper_wiki_download.py", line 53, in write_file
f.write(soup)

This is my code:

# Get content, regardless of whether an HTML, XML or PDF file
def send_Request(url):        
    response = http.urlopen('GET', url, preload_content=False)
    return response

# Use this to get the PDF and convert it to XML
def process_PDF(fileLocation):
    pdfToProcess = send_Request(fileLocation)
    pdfToObject = scraperwiki.pdftoxml(pdfToProcess.read())
    return pdfToObject

# Returns a navigable tree, which you can iterate through
def parse_HTML_tree(contentToParse):
    soup = BeautifulSoup(contentToParse, 'lxml')
    return soup

pdf = process_PDF('http://www.sfbos.org/Modules/ShowDocument.aspx?documentid=54790')
pdfToSoup = parse_HTML_tree(pdf)
soupToArray = pdfToSoup.findAll('text')

def write_file(soup_array):
    with open('test.txt', "wb") as f:
        f.write(soup_array)

write_file(soupToArray)
Bono
tonestrike

2 Answers


I guess soupToArray = pdfToSoup.findAll('text') returns some kind of list (in bs4, a ResultSet of tags), but f.write() accepts only a single string (or bytes, since the file is opened in "wb" mode). So you have to iterate over the result and convert each element to a string in some way. Print soupToArray to see exactly what it looks like.
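
A minimal sketch of that iteration, assuming Python 3 and that one line of text per element is the desired output. The helper name write_items is hypothetical (it is not part of bs4 or scraperwiki), and it accepts any iterable so it can be tried without downloading the PDF:

```python
def write_items(items, path):
    """Write each item on its own line, converting it to str first.

    `write_items` is a hypothetical helper for illustration only.
    Returns the number of items written.
    """
    count = 0
    # Open in text mode ("w", not "wb") because we are writing str, not bytes.
    with open(path, "w", encoding="utf-8") as f:
        for item in items:
            # str() also works on bs4 tags, but it includes the markup;
            # pass tag.get_text() values instead if only the text is wanted.
            f.write(str(item) + "\n")
            count += 1
    return count
```

With the variables from the question, this might be called as write_items((tag.get_text() for tag in soupToArray), 'test.txt'). If soupToArray is empty, the file will simply be empty, so printing the list first is still worthwhile.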

polku
  • It looks like you are right. Unfortunately, I am getting an empty list. It doesn't seem like pdfToSoup is doing its job. – tonestrike May 16 '16 at 12:14

I have never used scraperwiki before, but this gets the text:

import scraperwiki
import requests
from bs4 import BeautifulSoup

pdf_xml = scraperwiki.pdftoxml(requests.get('http://www.sfbos.org/Modules/ShowDocument.aspx?documentid=54790').content)
print(BeautifulSoup(pdf_xml, "lxml").find_all("text"))
Padraic Cunningham