
I am trying to write a program that reads a URL from each line of a .txt file and uses PyQuery to scrape lyrics data from LyricsWiki. Everything seems to work fine until I actually put the PyQuery code in. For example, when I do:

full_lyrics = ""        
#open up the input file
links = open('links.txt')

for line in links:
    full_lyrics += line

print(full_lyrics)
links.close()

It prints everything out as expected, one big string with all the data in it. However, when I implement the actual html parsing, it only pulls the lyrics from the last url and skips through all the previous ones.

import requests, re, sqlite3
from pyquery import PyQuery
from collections import Counter

full_lyrics = ""        
#open up the input file
links = open('links.txt')
output = open('web.txt', 'w')
output.truncate()

for line in links:
    r = requests.get(line)
    #create the PyQuery object and parse text
    results = PyQuery(r.text)
    results = results('div.lyricbox').remove('script').text()
    full_lyrics += (results + " ")

output.write(full_lyrics)
links.close()
output.close()

I am writing to a txt file to avoid encoding issues with PowerShell. Anyway, after I run the program and open the txt file, it only shows the lyrics from the last link in links.txt.

For reference, 'links.txt' should contain several links to LyricsWiki song pages, one per line, like this:

http://lyrics.wikia.com/Taylor_Swift:Shake_It_Off
http://lyrics.wikia.com/Maroon_5:Animals

'web.txt' should be a blank output file.

Why does PyQuery break the for loop? The loop clearly works when it's doing something simpler, like just concatenating the individual lines of a file.

Ansgar Wiechers
thenorm

1 Answer


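The problem is the trailing newline character on every line that you read from the file (links.txt). The loop itself runs for every line, but each URL except the last is passed to requests with a "\n" still attached, so those requests do not reach the intended pages. The last line works only because it has no newline after it. Try adding a newline after the last entry in your links.txt and you'll see that even the last entry will not be processed.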
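A quick way to see the newline is to print each line with repr(). This is only a sketch: the io.StringIO object below is an in-memory stand-in for links.txt, used here as an assumption so the snippet runs without the real file.

```python
import io

# In-memory stand-in for links.txt (illustrative assumption, not the real file).
links = io.StringIO(
    "http://lyrics.wikia.com/Taylor_Swift:Shake_It_Off\n"
    "http://lyrics.wikia.com/Maroon_5:Animals"
)

raw_lines = list(links)
for line in raw_lines:
    # repr() makes the otherwise invisible trailing "\n" visible.
    print(repr(line), "->", repr(line.rstrip()))
```

Every line except the last should show a "\n" before the closing quote, and rstrip() removes it.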

I recommend that you strip trailing whitespace from the line variable at the top of the loop body, like this:

for line in links:
    line = line.rstrip()
    r = requests.get(line)
    ...

It should work.

I also think that you don't need requests to get the HTML, since PyQuery can download a page itself. Try results = PyQuery(url=line) on the stripped line and see if it works.
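Putting both suggestions together, the whole script might look roughly like the sketch below. The function names (fetch_lyrics, collect_lyrics, main), the blank-line skipping, and the UTF-8 output encoding are assumptions added for illustration; the selector and file names come from the question.

```python
def fetch_lyrics(url):
    """Fetch one page and return the text of its lyric box (selector from the question)."""
    # Imported here so the rest of the sketch can run without pyquery installed.
    from pyquery import PyQuery
    page = PyQuery(url=url)  # let PyQuery download the page itself
    return page('div.lyricbox').remove('script').text()


def collect_lyrics(lines, fetch=fetch_lyrics):
    """Strip each line, skip blank ones, and join the fetched lyrics with spaces."""
    parts = []
    for line in lines:
        url = line.strip()  # drop the trailing newline before using the URL
        if not url:
            continue  # ignore empty lines, e.g. a newline at the end of the file
        parts.append(fetch(url))
    return " ".join(parts)


def main(links_path='links.txt', output_path='web.txt'):
    """Read URLs from links_path and write all lyrics to output_path."""
    # encoding='utf-8' is an assumption; adjust it if another encoding is needed.
    with open(links_path) as links, open(output_path, 'w', encoding='utf-8') as output:
        output.write(collect_lyrics(links))
```

Calling main() would read links.txt and write web.txt. Passing the fetch function as a parameter is just a convenience so the loop can be tried with a stand-in function before hitting the network.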

jheyse