scrapy, urllib, urllib2 and BeautifulSoup are your friends when it comes to munging data off websites.
How you extract the text depends on the individual site and on where its authors put the text on the page. Most of the time, you can find the text inside <p>...</p> elements.
For example, on this page (http://www.yoursingapore.com/content/traveller/en/browse/see-and-do/nightlife/dance-clubs/zouk.html), the text you want is:
If you only have time for one club in Singapore, then it simply has to
be Zouk. Probably Singapore’s only nightspot of international repute,
Zouk remains both an institution and a rite of passage for young
people in the city-state.
It has spawned several other clubs in neighbouring countries like
Malaysia, and even has its own dance festival – Sentosa’s ZoukOut.
Zouk is made up of three clubs and a wine bar, with the main room
showcasing techno and house music. Velvet Underground is more relaxed
and exclusive, while Phuture is experimental and racier than the rest,
just as its name suggests.
Zouk’s global reputation means it’s home to all manner of leading
world DJs, from Carl Cox and Paul Oakenfold to the Chemical Brothers
and Primal Scream. Zouk also holds its famous Mambo Jambo retro nights
on Wednesdays, another reason why a night at Zouk is one to savour.
There is other text on the page, but normally you only want the main text, not the navigation bars and other boilerplate.
You can get it simply by:
>>> import urllib2
>>> from bs4 import BeautifulSoup as bsoup
>>> url = "http://www.yoursingapore.com/content/traveller/en/browse/see-and-do/nightlife/dance-clubs/zouk.html"
>>> page = urllib2.urlopen(url).read()
>>> for i in bsoup(page).find_all('p'):
...     print i.text.strip()
...
If you only have time for one club in Singapore, then it simply has to be Zouk. Probably Singapore’s only nightspot of international repute, Zouk remains both an institution and a rite of passage for young people in the city-state.
It has spawned several other clubs in neighbouring countries like Malaysia, and even has its own dance festival – Sentosa’s ZoukOut. Zouk is made up of three clubs and a wine bar, with the main room showcasing techno and house music. Velvet Underground is more relaxed and exclusive, while Phuture is experimental and racier than the rest, just as its name suggests.
Zouk’s global reputation means it’s home to all manner of leading world DJs, from Carl Cox and Paul Oakenfold to the Chemical Brothers and Primal Scream. Zouk also holds its famous Mambo Jambo retro nights on Wednesdays, another reason why a night at Zouk is one to savour.
Find us on Facebook Twitter Youtube Wikipedia Singapore Reviews
Copyright © 2013 Singapore Tourism Board. Website Terms of Use | Privacy Statement | Photo Credits
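The session above is Python 2 (urllib2 and the print statement). On Python 3, urlopen lives in urllib.request instead. Here is a minimal sketch of just the fetch step; the helper name fetch_html and the UTF-8 fallback are my own assumptions, not part of the original session:

```python
from urllib.request import urlopen


def fetch_html(url, fallback_encoding="utf-8"):
    """Download a page and return it as text (hypothetical helper name)."""
    with urlopen(url) as response:
        raw = response.read()
        # Prefer the charset declared in the Content-Type header, if any;
        # the UTF-8 fallback is an assumption, not something the site promises.
        charset = response.headers.get_content_charset() or fallback_encoding
    return raw.decode(charset, errors="replace")
```

Calling fetch_html(url) with the url defined earlier should give you a string you can hand to BeautifulSoup in the same way as page.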
The find_all('p') output contains more than you need: its last two lines are footer boilerplate. You can narrow the bsoup(page).find_all() call by selecting the <div class="paragraph section">...</div> elements and reading the text inside them:
>>> for i in bsoup(page).find_all(attrs={'class':'paragraph section'}):
...     print i.text.strip()
...
If you only have time for one club in Singapore, then it simply has to be Zouk. Probably Singapore’s only nightspot of international repute, Zouk remains both an institution and a rite of passage for young people in the city-state.
It has spawned several other clubs in neighbouring countries like Malaysia, and even has its own dance festival – Sentosa’s ZoukOut. Zouk is made up of three clubs and a wine bar, with the main room showcasing techno and house music. Velvet Underground is more relaxed and exclusive, while Phuture is experimental and racier than the rest, just as its name suggests.
Zouk’s global reputation means it’s home to all manner of leading world DJs, from Carl Cox and Paul Oakenfold to the Chemical Brothers and Primal Scream. Zouk also holds its famous Mambo Jambo retro nights on Wednesdays, another reason why a night at Zouk is one to savour.
And voilà, there you have the text. As noted earlier, though, how you extract the main text depends on how the page is written.
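If you cannot install BeautifulSoup, the same idea (keep only the text inside elements with a given class attribute) can be sketched with the standard library's html.parser. To be clear, this is a different tool from the one used above, and it is only a rough sketch in Python 3: ClassTextExtractor, extract_by_class and the SAMPLE markup are names and data I made up for illustration, and it assumes reasonably well-formed HTML where non-void tags are closed.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag, so they must not affect nesting depth.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class ClassTextExtractor(HTMLParser):
    """Collect the text inside elements whose class attribute equals class_value."""

    def __init__(self, class_value):
        super().__init__()
        self.class_value = class_value
        self.depth = 0      # > 0 while we are inside a matching element
        self.blocks = []    # one text string per matching element

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        if self.depth:
            self.depth += 1
        elif dict(attrs).get("class") == self.class_value:
            self.depth = 1
            self.blocks.append("")

    def handle_endtag(self, tag):
        if self.depth and tag not in VOID_TAGS:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth:
            self.blocks[-1] += data


def extract_by_class(html, class_value):
    parser = ClassTextExtractor(class_value)
    parser.feed(html)
    parser.close()
    # Collapse runs of whitespace so each block is a single clean line.
    return [" ".join(block.split()) for block in parser.blocks]


# Made-up markup that mimics the layout described above.
SAMPLE = """
<html><body>
<div class="nav"><p>Home | About</p></div>
<div class="paragraph section">
  <p>Zouk is a club.</p>
  <p>It has a wine bar.</p>
</div>
<div class="footer"><p>Copyright notice</p></div>
</body></html>
"""

print(extract_by_class(SAMPLE, "paragraph section"))
```

The navigation and footer paragraphs are skipped because their enclosing divs have a different class. On real-world pages with unclosed tags the depth counter can drift, which is one reason a forgiving parser like BeautifulSoup is usually the better choice.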
Here's the full code:
>>> import urllib2
>>> from collections import Counter
>>> from nltk import word_tokenize
>>> from bs4 import BeautifulSoup as bsoup
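>>> url = "http://www.yoursingapore.com/content/traveller/en/browse/see-and-do/nightlife/dance-clubs/zouk.html"
>>> page = urllib2.urlopen(url).read()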
>>> text = " ".join([i.text.strip() for i in bsoup(page).find_all(attrs={'class':'paragraph section'})])
>>> word_freq = Counter(word_tokenize(text))
>>> word_freq['Zouk']
4
>>> word_freq.most_common()
[(u',', 8), (u'and', 8), (u'to', 4), (u'of', 4), (u'Zouk', 4), (u'is', 4), (u'the', 4), (u'its', 3), (u'has', 3), (u'in', 3), (u'a', 3), (u'only', 2), (u'for', 2), (u'one', 2), (u'clubs', 2), (u'exclusive', 1), (u'all', 1), (u'Velvet', 1), (u'just', 1), (u'dance', 1), (u'global', 1), (u'rest', 1), (u'Chemical', 1), (u'Oakenfold', 1), (u'it\u2019s', 1), (u'young', 1), (u'passage', 1), (u'main', 1), (u'neighbouring', 1), (u'then', 1), (u'than', 1), (u'means', 1), (u'famous', 1), (u'made', 1), (u'world', 1), (u'like', 1), (u'DJs', 1), (u'bar', 1), (u'name', 1), (u'countries', 1), (u'night', 1), (u'showcasing', 1), (u'Paul', 1), (u'people', 1), (u'house', 1), (u'ZoukOut.', 1), (u'up', 1), (u'\u2013', 1), (u'Underground', 1), (u'home', 1), (u'even', 1), (u'Singapore', 1), (u'city-state.', 1), (u'retro', 1), (u'international', 1), (u'rite', 1), (u'be', 1), (u'institution', 1), (u'reason', 1), (u'techno', 1), (u'both', 1), (u'nightspot', 1), (u'festival', 1), (u'experimental', 1), (u'Singapore\u2019s', 1), (u'own', 1), (u'savour', 1), (u'suggests.', 1), (u'Zouk\u2019s', 1), (u'simply', 1), (u'another', 1), (u'Probably', 1), (u'Jambo', 1), (u'spawned', 1), (u'from', 1), (u'Brothers', 1), (u'remains', 1), (u'leading', 1), (u'.', 1), (u'Phuture', 1), (u'Carl', 1), (u'more', 1), (u'on', 1), (u'club', 1), (u'relaxed', 1), (u'If', 1), (u'with', 1), (u'Wednesdays', 1), (u'room', 1), (u'Primal', 1), (u'while', 1), (u'three', 1), (u'at', 1), (u'racier', 1), (u'it', 1), (u'an', 1), (u'Zouk.', 1), (u'as', 1), (u'manner', 1), (u'have', 1), (u'nights', 1), (u'Malaysia', 1), (u'holds', 1), (u'also', 1), (u'other', 1), (u'repute', 1), (u'you', 1), (u'several', 1), (u'Sentosa\u2019s', 1), (u'Cox', 1), (u'Mambo', 1), (u'why', 1), (u'It', 1), (u'reputation', 1), (u'time', 1), (u'Scream.', 1), (u'music.', 1), (u'wine', 1)]
The above example comes from:
Liling Tan and Francis Bond. 2011. Building and annotating the
linguistically diverse NTU-MC (NTU-multilingual corpus). In
Proceedings of the 25th Pacific Asia Conference on Language,
Information and Computation (PACLIC 25). Singapore.
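One thing to notice in the word counts above: some tokens keep their trailing period ('Zouk.', 'music.', 'suggests.'), so 'Zouk' and 'Zouk.' are counted separately. If that matters for your frequencies, one option is to normalise the tokens before counting. Below is a minimal Python 3 sketch that swaps in a simple regex tokenizer in place of nltk's word_tokenize; the count_words helper, the regex and the sample sentence are my own assumptions, not part of the code above or of the cited paper.

```python
import re
from collections import Counter


def count_words(text):
    """Count lowercased word tokens (hypothetical helper, regex-based)."""
    # A token is a run of letters/digits, optionally joined by an internal
    # apostrophe (straight or curly) or hyphen, e.g. "city-state" or "it's".
    tokens = re.findall(r"[^\W_]+(?:['’-][^\W_]+)*", text.lower())
    return Counter(tokens)


# Made-up sample text, loosely based on the page content.
sample = "It simply has to be Zouk. Zouk remains an institution. Zouk’s reputation is global."
freq = count_words(sample)
print(freq["zouk"])
print(freq["zouk’s"])
```

Here "Zouk." and "Zouk" fall into the same bucket, while the possessive "Zouk’s" stays a separate token. Whether you want to fold possessives and case together depends on what you are counting for.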