
I am using kafka-python and BeautifulSoup to scrape a website I visit often, and to send a message to a Kafka broker with a Python producer.

What I want is that whenever a new post is uploaded on the website (it is a community site, a bit like reddit, mostly used by Korean hip-hop fans to share information), that post should be sent to the Kafka broker.

However, my problem is that inside the while loop, only the latest post keeps being sent to the Kafka broker over and over. This is not what I want.

Also, the second problem is that when a new post is loaded, an HTTP Error 502: Bad Gateway occurs on

soup = BeautifulSoup(urllib.request.urlopen("http://hiphople.com/kboard").read(), "html.parser")

and no message is sent anymore.

This is dataScraping.py:

from bs4 import BeautifulSoup
import re
import urllib.request

pattern = re.compile('[0-9]+')

def parseContent():
    soup = BeautifulSoup(urllib.request.urlopen("http://hiphople.com/kboard").read(), "html.parser")
    for div in soup.find_all("tr", class_="notice"):
        div.decompose()

    key_num = pattern.findall(soup.find_all("td", class_="no")[0].text)
    category = soup.find_all("td", class_="categoryTD")[0].find("span").text
    author = soup.find_all("td", class_="author")[0].find("span").text
    title = soup.find_all("td", class_="title")[0].find("a").text
    link = "http://hiphople.com" + soup.find_all("td", class_="title")[0].find("a").attrs["href"]

    soup2 = BeautifulSoup(urllib.request.urlopen(link).read(), "html.parser")
    content = str(soup2.find_all("div", class_="article-content")[0].find_all("p"))
    content = re.sub("<.+?>","", content,0).strip()
    content = re.sub("\xa0","", content, 0).strip()

    result = {"key_num": key_num, "category": category, "title": title, "author": author, "content": content}
    return result

if __name__ == "__main__":
    print("data scraping from website")

And this is PythonWebScraping.py:

import json
from kafka import KafkaProducer
from dataScraping import parseContent

def json_serializer(data):
    return json.dumps(data).encode("utf-8")


producer = KafkaProducer(acks=1, compression_type = "gzip", bootstrap_servers=["localhost:9092"],
                         value_serializer = json_serializer)
    
if __name__ == "__main__":
    while (True):
        result = parseContent()
        producer.send("hiphople",result)

Please let me know how to fix my code so that newly created posts are sent to the Kafka broker as I expected.

ethany21

1 Answer


Your function is working, but it is true that it returns only one event. I did not get a 502 Bad Gateway myself; you may be hitting the site's DDoS protection because you request the URL too often, so try adding delays/sleep between requests, or your IP may have been banned to stop it from scraping the site.
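If the 502 is only transient, one common approach is to wrap urllib.request.urlopen in a small retry helper that backs off between attempts. This is just a sketch: the fetch_soup helper name, the retry count and the delay are my own assumptions, not anything from your setup.

import time
import urllib.request
import urllib.error
from bs4 import BeautifulSoup

def fetch_soup(url, retries=3, delay=5):
    # Fetch a page and return a BeautifulSoup object,
    # retrying with a pause on server-side errors such as 502/503.
    for attempt in range(retries):
        try:
            html = urllib.request.urlopen(url).read()
            return BeautifulSoup(html, "html.parser")
        except urllib.error.HTTPError as e:
            if e.code in (502, 503) and attempt < retries - 1:
                time.sleep(delay)  # back off before trying again
            else:
                raise

parseContent could then call fetch_soup("http://hiphople.com/kboard") instead of calling urlopen directly, and the main loop should also sleep between polls so you are not hitting the site in a tight loop.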

For your other problem: your function only scrapes and returns the single latest post, and you send that result to Kafka on every iteration of the loop, which is why you see the same message over and over again.

What did you want your function to do? One option is to remember the previous result and only act when it changes:

prevResult = ""
while(True):
result = parseContent()
if(prevResult!=result):
prevResult = result
print( result )
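
Applied to your producer loop, that could look roughly like this. It is only a sketch, assuming that key_num identifies the newest post and that a 10-second poll interval is acceptable:

import time

prev_key = None
while True:
    result = parseContent()
    # only send when the newest post has changed since the last poll
    if result["key_num"] != prev_key:
        prev_key = result["key_num"]
        producer.send("hiphople", result)
        producer.flush()
    time.sleep(10)  # avoid hammering the site in a tight loop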

Ran Lupovich