I am using kafka-python and BeautifulSoup to scrape a website that I visit often and to send messages to a Kafka broker with a Python producer.
Whenever a new post is uploaded to the website (it is a community site similar to Reddit, mostly used by Korean hip-hop fans to share information), I want that post to be sent to the Kafka broker.
However, my first problem is that inside the while loop only the latest post is sent to the Kafka broker, over and over. This is not what I want.
My second problem is that when a new post is loaded, an
HTTP Error 502: Bad Gateway
error occurs on this line:
    soup = BeautifulSoup(urllib.request.urlopen("http://hiphople.com/kboard").read(), "html.parser")
and no more messages are sent after that.
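A 502 is a server-side error and may be transient, so one option is to retry the request a few times instead of letting the exception end the loop. The following is only a minimal sketch under that assumption; `fetch_with_retry` and its parameters (`retries`, `backoff`, `opener`, `sleep`) are hypothetical names, not part of the original scripts or of any library.

```python
import time
import urllib.error
import urllib.request


def fetch_with_retry(url, retries=3, backoff=2.0,
                     opener=urllib.request.urlopen, sleep=time.sleep):
    """Return the response body, retrying on HTTP 5xx errors.

    Hypothetical helper: the name and all parameters are illustrative.
    `opener` and `sleep` are injectable so the function can be tested offline.
    """
    for attempt in range(retries):
        try:
            with opener(url) as response:
                return response.read()
        except urllib.error.HTTPError as err:
            # Retry only server-side errors such as 502; re-raise client
            # errors immediately, and re-raise once the attempts run out.
            if err.code < 500 or attempt == retries - 1:
                raise
            # Wait a little longer after each failed attempt.
            sleep(backoff * (attempt + 1))
```

In parseContent(), the call urllib.request.urlopen(...).read() could then be replaced with fetch_with_retry(...). If the site keeps returning 502 after the last attempt, the exception is still raised, so the calling loop would need its own try/except to keep running.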
This is dataScraping.py:
from bs4 import BeautifulSoup
import re
import urllib.request

pattern = re.compile('[0-9]+')


def parseContent():
    soup = BeautifulSoup(urllib.request.urlopen("http://hiphople.com/kboard").read(), "html.parser")
    # Drop pinned notice rows so that the first row is the latest regular post
    for div in soup.find_all("tr", class_="notice"):
        div.decompose()

    # Only the first (latest) row of the board is read
    key_num = pattern.findall(soup.find_all("td", class_="no")[0].text)  # list of digit strings
    category = soup.find_all("td", class_="categoryTD")[0].find("span").text
    author = soup.find_all("td", class_="author")[0].find("span").text
    title = soup.find_all("td", class_="title")[0].find("a").text
    link = "http://hiphople.com" + soup.find_all("td", class_="title")[0].find("a").attrs["href"]

    # Fetch the post page and strip HTML tags and non-breaking spaces from its body
    soup2 = BeautifulSoup(urllib.request.urlopen(link).read(), "html.parser")
    content = str(soup2.find_all("div", class_="article-content")[0].find_all("p"))
    content = re.sub("<.+?>", "", content, 0).strip()
    content = re.sub("\xa0", "", content, 0).strip()

    result = {"key_num": key_num, "category": category, "title": title, "author": author, "content": content}
    return result


if __name__ == "__main__":
    print("data scraping from website")
And this is PythonWebScraping.py:
import json
from kafka import KafkaProducer
from dataScraping import parseContent


def json_serializer(data):
    return json.dumps(data).encode("utf-8")


producer = KafkaProducer(acks=1, compression_type="gzip", bootstrap_servers=["localhost:9092"],
                         value_serializer=json_serializer)

if __name__ == "__main__":
    while True:
        result = parseContent()
        producer.send("hiphople", result)
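The loop above sends whatever parseContent() returns on every iteration, which is why the same post is sent repeatedly. One possible approach, shown here only as a sketch, is to remember the key_num of the last post sent and skip the send when it has not changed, with a pause between polls. `publish_new_posts` and its parameters (`fetch_latest`, `send`, `poll_seconds`, `max_polls`, `sleep`) are hypothetical names introduced for illustration.

```python
import time


def publish_new_posts(fetch_latest, send, poll_seconds=30,
                      max_polls=None, sleep=time.sleep):
    """Poll fetch_latest() and call send() only when key_num changes.

    Hypothetical helper: fetch_latest is assumed to return a dict with a
    "key_num" entry, as parseContent() does. max_polls=None polls forever.
    Returns the number of posts that were sent.
    """
    last_key = None  # key_num of the most recently sent post
    sent = 0
    polls = 0
    while max_polls is None or polls < max_polls:
        polls += 1
        post = fetch_latest()
        if post["key_num"] != last_key:
            # The latest post differs from the last one sent, so publish it.
            send(post)
            last_key = post["key_num"]
            sent += 1
        # Pause between polls to avoid hammering the website.
        sleep(poll_seconds)
    return sent
```

With the scripts above, this could be called as publish_new_posts(parseContent, lambda post: producer.send("hiphople", post)). Note the limits of this sketch: it only looks at the single latest post, so if several posts appear between two polls the earlier ones are missed, and last_key lives in memory, so the latest post is sent again after a restart.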
Please let me know how to fix my code so that each newly created post is sent to the Kafka broker only once, as I expected.