
The request succeeds (Response [200]) when I type the account name into the code by hand, but when I read the account names from a file, every request returns Response [400].

This is the code:

import requests
from bs4 import BeautifulSoup

def people_spider():
    file = "D:\OneDrive\Documents\GPIP\Files\scraping\idtwitter.csv"
    dataset = open(file, "r")
    for account in dataset:
        href = 'https://twitter.com/' + account
        get_single_item_data(href)

def get_single_item_data(item_url):
    source_code = requests.get(item_url)
    print(source_code)
    print(item_url)
    plain_text = source_code.text
    soup = BeautifulSoup(plain_text, features='html.parser')
    for item_name in soup.findAll('p', {'dir': 'ltr'}):
        print(item_name.string)


people_spider()

and this is the result:

<Response [400]>
https://twitter.com/mr_adhani

<Response [400]>
https://twitter.com/RahayuNarti

<Response [400]>
https://twitter.com/AllMicroJobs

<Response [400]>
https://twitter.com/adibambang05

<Response [400]>
https://twitter.com/NatasyaRD1

<Response [400]>
https://twitter.com/arumyuniadis

<Response [400]>
https://twitter.com/harusan_osk

<Response [400]>
https://twitter.com/LailyFauziana

<Response [400]>
https://twitter.com/Dovia_Liata707

<Response [400]>
https://twitter.com/hapzah_putry

I also tried changing the file extension, but that made no difference.

  • Response 400 corresponds to a bad HTTP request. You may want to check if you're creating the right request object or not. Also when you iterate over files like this, python won't remove the lingering newline character from the "account" variable. – Kartik Anand Dec 26 '18 at 05:11
  • Print the `item_url` to see if it is correct. – Klaus D. Dec 26 '18 at 05:12

1 Answer

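The problem is that you are not stripping the account variable. Each line read from the file still ends with a newline character, so it ends up in the URL (this is also why your output shows a blank line after every printed URL). Call strip() on the line before building the URL: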

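def people_spider():
    # Raw string so the backslashes in the Windows path are taken literally
    file = r"D:\OneDrive\Documents\GPIP\Files\scraping\idtwitter.csv"
    # The with block closes the file once the loop finishes
    with open(file, "r") as dataset:
        for account in dataset:
            # strip() removes the trailing newline that each line carries
            href = 'https://twitter.com/' + account.strip()
            get_single_item_data(href)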
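
To see the stray newline without making any network requests, here is a minimal sketch. The helper name build_profile_urls and the io.StringIO sample data are assumptions for illustration only; they stand in for the CSV file and are not part of the original code.

```python
import io

def build_profile_urls(lines, base="https://twitter.com/"):
    """Return one URL per non-empty line, with surrounding whitespace removed."""
    urls = []
    for line in lines:
        account = line.strip()  # drops the trailing "\n" (and "\r" from Windows files)
        if account:             # skip blank lines
            urls.append(base + account)
    return urls

# io.StringIO stands in for the CSV file (hypothetical sample data)
sample = io.StringIO("mr_adhani\nRahayuNarti\n\nAllMicroJobs\n")

# Without strip(), the newline is part of the URL; repr() makes it visible
raw_first = "https://twitter.com/" + sample.readline()
print(repr(raw_first))  # 'https://twitter.com/mr_adhani\n'

sample.seek(0)
print(build_profile_urls(sample))
```

Printing with repr() is a quick way to spot invisible characters such as newlines before a URL is passed to requests.get.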
Reza Torkaman Ahmadi