Gathering Formatted Content From Multiple Webpages

Question

I'm doing a research project and need the contents of a show's transcripts as data. The problem is that the transcripts are formatted for one particular wiki (the Arrested Development wiki), whereas I need them to be machine-readable.

What's the best way to go about downloading all of these transcripts and reformatting them? Is Python's HTMLParser my best bet?

score 2 · Answer 1 · answered Mar 31 '14 at 17:51

I wrote a Python script that takes the link to a wiki transcript as input and writes a plaintext version of the transcript to a text file. I hope this helps with your project.

# Python 2 script: raw_input and cStringIO do not exist in Python 3.
import cStringIO
import re

import pycurl

link = raw_input("Link to transcript: ")
filename = link.split("/")[-1] + ".txt"

# Download the page into an in-memory buffer.
buf = cStringIO.StringIO()
c = pycurl.Curl()
c.setopt(c.URL, link)
c.setopt(c.WRITEFUNCTION, buf.write)
c.perform()
c.close()
html = buf.getvalue()
buf.close()

# Spoken lines look like "<b>Speaker:</b> dialogue<...", so find every ":</b>".
speaker_positions = [m.start() for m in re.finditer(':</b>', html)]

with open(filename, 'w') as out:
    for pos in speaker_positions:
        text_start = pos + 5  # len(':</b>') == 5
        # Skip matches at the very end of the page, or followed by a tag
        # instead of dialogue text.
        if text_start >= len(html) or html[text_start] == "<":
            continue

        # Speaker name: everything between the preceding ">" and the colon.
        name_start = html.rfind(">", 0, pos) + 1

        # Dialogue: everything up to the next "<" (or the end of the page).
        text_end = html.find("<", text_start)
        if text_end == -1:
            text_end = len(html)

        line = html[name_start:pos] + ": " + html[text_start:text_end]
        out.write(line.replace("&#160;", "") + "\n")
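
On the HTMLParser part of the question: a minimal sketch of the same extraction with the standard-library parser, assuming Python 3 (`html.parser`) and the `<b>Speaker:</b> dialogue` markup that the script above relies on. The names `TranscriptParser` and `transcript_lines` are made up for illustration, and treating `</p>` as the end of a line of dialogue is an assumption about the wiki's markup.

```python
from html.parser import HTMLParser


class TranscriptParser(HTMLParser):
    """Collect 'Speaker: dialogue' strings from <b>Speaker:</b> dialogue markup."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.lines = []         # finished "Speaker: dialogue" strings
        self._in_bold = False   # True while inside a <b> element
        self._bold_text = ""    # text seen inside the current <b>
        self._speaker = None    # current speaker, or None outside dialogue
        self._dialogue = []     # text chunks spoken by the current speaker

    def handle_starttag(self, tag, attrs):
        if tag == "b":
            self._in_bold = True
            self._bold_text = ""

    def handle_data(self, data):
        if self._in_bold:
            self._bold_text += data
        elif self._speaker is not None:
            self._dialogue.append(data)

    def handle_endtag(self, tag):
        if tag == "b" and self._in_bold:
            self._in_bold = False
            label = self._bold_text.strip()
            if label.endswith(":"):
                # A bold run ending in a colon starts a new speaker.
                self._flush()
                self._speaker = label[:-1]
            elif self._speaker is not None:
                # Other bold text is treated as emphasis inside the dialogue.
                self._dialogue.append(self._bold_text)
        elif tag == "p":
            # Assumption: a closing paragraph ends the current line.
            self._flush()

    def _flush(self):
        if self._speaker is not None:
            # split()/join collapses whitespace, including the non-breaking
            # space that &#160; decodes to.
            text = " ".join("".join(self._dialogue).split())
            if text:
                self.lines.append(self._speaker + ": " + text)
        self._speaker = None
        self._dialogue = []

    def close(self):
        super().close()
        self._flush()


def transcript_lines(html):
    """Return the list of 'Speaker: dialogue' strings found in html."""
    parser = TranscriptParser()
    parser.feed(html)
    parser.close()
    return parser.lines
```

Unlike the string scan, this keeps dialogue that contains inline tags such as `<i>`, because the parser hands over the text on both sides of the tag.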

austin-schick

No problem! If you don't mind me asking, what are you researching? – austin-schick Mar 31 '14 at 21:57
Not at all. I'm studying double entendres, and figure Arrested Development has tons of them. – Adam_G Apr 01 '14 at 10:24
I seem to be having trouble getting pycurl to work in PyCharm. I've created a new SO question for that. Any idea what's going on? http://stackoverflow.com/questions/23089718/using-pycurl-in-pycharm – Adam_G Apr 15 '14 at 17:30
Hi - I'm still unable to use pycurl. – Adam_G Apr 16 '14 at 20:45
This does not appear to be working. When I enter `link = raw_input("http://arresteddevelopment.wikia.com/wiki/Transcript_of_Pilot")`, all I get is `http://arresteddevelopment.wikia.com/wiki/Transcript_of_Pilot` as output. – Adam_G Apr 17 '14 at 02:02
@Adam_G, as Schickmeister says, just run this script as he's written it. Then, at the prompt that says `Link to transcript: `, type in `http://arresteddevelopment.wikia.com/wiki/Transcript_of_Pilot` yourself. To automate this, make a list of all the transcript URLs and iterate over them with a for loop. – Matthew Turner Apr 17 '14 at 19:54
Oh, it's a prompt?? Thanks! – Adam_G Apr 17 '14 at 20:55
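
The loop suggested in the comments could look roughly like this. It is a sketch under a few assumptions: Python 3, the standard library's `urllib.request` in place of pycurl, and a regex that mirrors the answer's `:</b>` scan. The function names `fetch`, `extract_lines` and `save_transcripts` are hypothetical, and only the Pilot URL appears in this thread, so the rest of the URL list has to be filled in by hand.

```python
import os
import re
import urllib.request


def fetch(url):
    """Download a page and return its HTML as text (urllib here, not pycurl)."""
    with urllib.request.urlopen(url) as response:
        return response.read().decode("utf-8", errors="replace")


def extract_lines(html):
    """Text before ':</b>' is the speaker, text after it is the dialogue."""
    lines = []
    for name, text in re.findall(r">([^<>]*):</b>([^<]+)", html):
        text = text.replace("&#160;", " ").strip()
        if text:
            lines.append(name.strip() + ": " + text)
    return lines


def save_transcripts(urls, out_dir=".", fetch_page=fetch):
    """Write one .txt file per URL, named after the last path segment."""
    written = []
    for url in urls:
        lines = extract_lines(fetch_page(url))
        path = os.path.join(out_dir, url.split("/")[-1] + ".txt")
        with open(path, "w", encoding="utf-8") as out:
            out.write("\n".join(lines) + "\n")
        written.append(path)
    return written
```

Calling `save_transcripts(["http://arresteddevelopment.wikia.com/wiki/Transcript_of_Pilot"])` would write `Transcript_of_Pilot.txt` to the current directory; passing a different `fetch_page` makes it possible to try the loop without touching the network.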
