from bs4 import BeautifulSoup
import requests
import os


url = "http://nos.nl/artikel/2093082-steeds-meer-nekklachten-bij-kinderen-door-gebruik-tablets.html"
r  = requests.get(url)
soup = BeautifulSoup(r.content.decode('utf-8', 'ignore'))
data = soup.find_all("article", {"class": "article"})

with open("data1.txt", "wb") as file:
   content=‘utf-8’
for item in data:
    content+='''{}\n{}\n\n{}\n{}'''.format( item.contents[0].find_all("time", {"datetime": "2016-03-16T09:50:30+0100"})[0].text,
                                            item.contents[0].find_all("a", {"class": "link-grey"})[0].text,
                                            item.contents[0].find_all("img", {"class": "media-full"})[0],
                                            item.contents[1].find_all("div", {"class": "article_textwrap"})[0].text,
                                            )
with open("data1.txt".format(file_name), "wb") as file:
    file.write(content)

I recently solved a UTF-8/Unicode problem, but now the script isn't saving the output as a .txt file, or saving anything at all. What do I need to do?

Danisk
  • A: You are opening the file for writing in bytes and then trying to write a string to it. B: `"data1.txt".format(file_name)` isn't really doing anything with `file_name`, and `file_name` isn't defined, so I'm really confused as to what you are trying to do... – Tadhg McDonald-Jensen Mar 16 '16 at 19:10
  • What do you think `"data1.txt".format(file_name)` is doing? Also, why are you opening in `wb` mode? – styvane Mar 16 '16 at 19:11
  • I'm trying to save my content from all the item.contents to a .txt file. (http://stackoverflow.com/questions/36039919/beautifulsoup-output-to-txt-file) – Danisk Mar 16 '16 at 19:12
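
The two points raised in the comments can be checked in isolation. The following is a minimal sketch for Python 3; the helper `try_write` and the sample text are made up for illustration and are not part of the original script. In the original script, the undefined `file_name` would raise a `NameError` before the file is even opened.

```python
import os
import tempfile


def try_write(mode, payload):
    """Write payload to a temp file opened with `mode`.

    Returns None on success, or the exception type name on a TypeError.
    (Hypothetical helper, for illustration only.)
    """
    fd, path = tempfile.mkstemp(suffix=".txt")
    os.close(fd)
    try:
        with open(path, mode) as fh:
            fh.write(payload)
        return None
    except TypeError as exc:
        return type(exc).__name__
    finally:
        os.remove(path)


# Python 3: a str cannot be written to a file opened in binary mode.
print(try_write("wb", "tekst"))                  # TypeError
# Encoded bytes in binary mode, or a str in text mode, both work.
print(try_write("wb", "tekst".encode("utf-8")))  # None
print(try_write("w", "tekst"))                   # None

# With no {} placeholder, format() returns the string unchanged.
print("data1.txt".format("anything"))            # data1.txt
```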

1 Answer

If you want to write the data to the file as UTF-8, try `codecs.open`, like this:

from bs4 import BeautifulSoup
import requests
import codecs


url = "http://nos.nl/artikel/2093082-steeds-meer-nekklachten-bij-kinderen-door-gebruik-tablets.html"
r = requests.get(url)
soup = BeautifulSoup(r.content)
data = soup.find_all("article", {"class": "article"})

# codecs.open returns a writer that encodes every string written to it as UTF-8.
with codecs.open("data1.txt", "wb", "utf-8") as filen:
    for item in data:
        # Publication time
        filen.write(item.contents[0].find_all("time", {"datetime": "2016-03-16T09:50:30+0100"})[0].get_text())
        filen.write('\n')
        # Category link
        filen.write(item.contents[0].find_all("a", {"class": "link-grey"})[0].get_text())
        filen.write('\n\n')
        # An <img> tag normally has no text content, so get_text() likely yields an empty string here.
        filen.write(item.contents[0].find_all("img", {"class": "media-full"})[0].get_text())
        filen.write('\n')
        # Article body
        filen.write(item.contents[1].find_all("div", {"class": "article_textwrap"})[0].get_text())

I'm unsure about `filen.write(item.contents[0].find_all("img", {"class": "media-full"})[0])`, because that expression returned a Tag instance for me rather than text.
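
As an alternative to `codecs.open`, here is a minimal sketch that assumes Python 3 and opens the file in text mode with an explicit encoding. The helpers `save_pieces` and `read_back` and the sample data are hypothetical stand-ins for the scraped fields. Non-string objects are converted with `str()` before writing; for a BeautifulSoup Tag, `str()` should give the tag's markup, which may or may not be what you want in the file.

```python
import io
import os
import tempfile


def save_pieces(path, pieces):
    """Write every piece to `path` as UTF-8 text, converting non-strings with str()."""
    with io.open(path, "w", encoding="utf-8") as fh:
        for piece in pieces:
            fh.write(piece if isinstance(piece, str) else str(piece))


def read_back(path):
    """Read the whole file back as UTF-8 text."""
    with io.open(path, encoding="utf-8") as fh:
        return fh.read()


# Hypothetical sample data: text with non-ASCII characters and one non-string object.
sample = ["16-03-2016", "\n", "caf\u00e9\u00a0NOS", "\n", 42]
path = os.path.join(tempfile.mkdtemp(), "data1.txt")
save_pieces(path, sample)
print(repr(read_back(path)))
```

Opening in text mode keeps the encoding step in one place, so the loop body only deals with strings.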

jayme
  • What is the best way to put this into a .txt file? – Danisk Mar 16 '16 at 20:05
  • It actually is a ".txt" file (UTF-8 Unicode text). If you want it to be ASCII, you will have to replace those non-breaking space (`\xa0`) characters with something ASCII. Have a look at: http://stackoverflow.com/questions/19508442/beautiful-soup-and-unicode-problems – jayme Mar 17 '16 at 07:03
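
The replacement suggested in the last comment could look roughly like the sketch below. The function name `to_ascii` is made up for illustration. Dropping every remaining non-ASCII character with `"ignore"` is an assumption and is lossy (accented letters disappear), so it only fits if pure ASCII output really is required.

```python
def to_ascii(text):
    """Replace non-breaking spaces with plain spaces, then drop anything still non-ASCII."""
    cleaned = text.replace("\u00a0", " ")
    # "ignore" silently discards characters that ASCII cannot represent.
    return cleaned.encode("ascii", "ignore").decode("ascii")


print(to_ascii("16\u00a0maart 2016"))  # 16 maart 2016
```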