I am trying to write a program that reads the URL on each line of a .txt file and uses PyQuery to scrape lyrics data from LyricsWiki. Everything seems to work fine until I actually add the PyQuery code. For example, when I do:
full_lyrics = ""
#open up the input file
links = open('links.txt')
for line in links:
    full_lyrics += line
print(full_lyrics)
links.close()
It prints everything out as expected: one big string with all the data in it. However, when I add the actual HTML parsing, it only pulls the lyrics from the last URL and skips all the previous ones.
import requests, re, sqlite3
from pyquery import PyQuery
from collections import Counter
full_lyrics = ""
#open up the input file
links = open('links.txt')
output = open('web.txt', 'w')
output.truncate()
for line in links:
    r = requests.get(line)
    #create the PyQuery object and parse text
    results = PyQuery(r.text)
    results = results('div.lyricbox').remove('script').text()
    full_lyrics += (results + " ")
output.write(full_lyrics)
links.close()
output.close()
I am writing to a txt file to avoid encoding issues with PowerShell. Anyway, after I run the program and open the txt file, it only shows the lyrics for the last link in links.txt.
For reference, 'links.txt' should contain several links to LyricsWiki song pages, one per line, like this:
http://lyrics.wikia.com/Taylor_Swift:Shake_It_Off
http://lyrics.wikia.com/Maroon_5:Animals
'web.txt' should be a blank output file.
Why does PyQuery break the for loop? The loop clearly works when it's doing something simpler, like just concatenating the individual lines of a file.
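One thing that may be worth checking (a guess, not a confirmed diagnosis): iterating over a file yields each line with its trailing newline still attached, so every URL except possibly the last one reaches requests.get() with a '\n' on the end. The sketch below uses an in-memory stand-in for links.txt to show what the loop actually sees and one way to clean it up. The helper name clean_urls is hypothetical, and the sample assumes the last line of the file has no trailing newline.

```python
import io


def clean_urls(lines):
    """Yield each non-empty line with surrounding whitespace removed."""
    for line in lines:
        url = line.strip()
        if url:  # skip blank lines
            yield url


# Stand-in for open('links.txt'); assumes the final line has no newline.
sample = io.StringIO(
    "http://lyrics.wikia.com/Taylor_Swift:Shake_It_Off\n"
    "http://lyrics.wikia.com/Maroon_5:Animals"
)

raw_lines = list(sample)
for line in raw_lines:
    # repr() makes any trailing '\n' visible
    print(repr(line))

urls = list(clean_urls(raw_lines))
print(urls)
```

If the printed lines do end in '\n', passing line.strip() to requests.get() instead of line would be the first thing to try.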