I'm a total newbie, just starting with web scraping as a hobby.

I want to scrape data from a forum (total number of posts, total number of topics, and total number of users) at https://www.fly4free.pl/forum/

[Image: the data I want to scrape]

After watching some tutorials, I came up with this code:

from bs4 import BeautifulSoup
import requests
import datetime
import csv

source = requests.get('https://www.fly4free.pl/forum/').text
soup = BeautifulSoup(source, 'lxml')

csv_file = open('4fly_forum.csv', 'w')
csv_writer = csv.writer(csv_file)
csv_writer.writerow(['Data i godzina', 'Wszytskich postów', 'Wszytskich tematów', 'Wszytskich użytkowników'])

czas = datetime.datetime.now()
czas = czas.strftime("%Y-%m-%d %H:%M:%S")
print(czas)

dane = soup.find('p', class_='genmed')

posty = dane.find_all('strong')[0].text
print(posty)

tematy = dane.find_all('strong')[1].text
print(tematy)

user = dane.find_all('strong')[2].text
print(user)

print()

csv_writer.writerow([czas, posty, tematy, user])    
csv_file.close()

I don't know how to make it run once a day, or how to add a new row of data to the file on each run. Sorry if my questions seem basic to you pros ;), it's my first training assignment.

Also, my resulting CSV file doesn't look nice; I would like the data to be neatly arranged in columns.

Any help and insight will be much appreciated. Thanks in advance, Dejvciu

Dejvciu

2 Answers

You can use the schedule library in Python to do this. First, install it:

pip install schedule

Then you can modify your code to run at intervals of your choice:

import schedule
import time

def scrape():
    # your web scraping code here
    print('web scraping')

schedule.every().day.at("10:30").do(scrape) # change 10:30 to time of your choice

while True:
    schedule.run_pending()
    time.sleep(1)

This will call scrape() every day at 10:30, but only for as long as the script itself keeps running. You can easily host it for free to keep it running continually.

Here's how to save the results to a CSV file in a nicely formatted way, with the field names (czas, posty, tematy and user) as column names.
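One thing to watch with a loop like this: as far as I know, schedule does not catch exceptions raised inside a job, so a single failed request could end the loop. Below is a minimal sketch of a guard. The names `keep_running` and `run_daily` are hypothetical helpers made up for this example; they are not part of schedule.

```python
import functools
import logging
import time


def keep_running(job):
    """Log any exception raised by `job` instead of letting it end the loop."""
    @functools.wraps(job)
    def wrapper(*args, **kwargs):
        try:
            return job(*args, **kwargs)
        except Exception:
            logging.exception("%s failed; it will be tried again at the next run", job.__name__)
            return None
    return wrapper


@keep_running
def scrape():
    # your web scraping code here
    print('web scraping')


def run_daily(at="10:30"):
    """Run scrape() every day at the given time until the script is stopped."""
    import schedule  # third-party: pip install schedule

    schedule.every().day.at(at).do(scrape)
    while True:
        schedule.run_pending()
        time.sleep(1)
```

To use it, call `run_daily()` at the bottom of the script. The import sits inside the function only so that the rest of the file can be loaded without schedule installed.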

import csv
from os import path

# Check whether the file already exists, so that the headers (field names / column names)
# are written only once instead of every time the script runs.
file_status = path.isfile('filename.csv')

with open('filename.csv', 'a+', newline='') as csvfile:
    fieldnames = ['czas', 'posty', 'tematy', 'user']
    writer = csv.DictWriter(csvfile, fieldnames=fieldnames)

    if not file_status:
        writer.writeheader() 
    writer.writerow({'czas': czas, 'posty': posty, 'tematy': tematy, 'user': user})
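
The same idea can be wrapped in a small function so the scraping code only has to pass in one row. This is a sketch, not a definitive version: `append_row` and `FIELDNAMES` are names invented for this example, and treating an empty file like a missing one is an extra assumption on top of the snippet above.

```python
import csv
from pathlib import Path

# Column names, matching the variables used in the question's script.
FIELDNAMES = ['czas', 'posty', 'tematy', 'user']


def append_row(csv_path, row):
    """Append one row to csv_path, writing the header only for a new or empty file."""
    path = Path(csv_path)
    needs_header = not path.exists() or path.stat().st_size == 0
    # Mode 'a' keeps existing rows; newline='' avoids blank lines on Windows.
    with path.open('a', newline='', encoding='utf-8') as csvfile:
        writer = csv.DictWriter(csvfile, fieldnames=FIELDNAMES)
        if needs_header:
            writer.writeheader()
        writer.writerow(row)
```

In the question's script it could be called as `append_row('4fly_forum.csv', {'czas': czas, 'posty': posty, 'tematy': tematy, 'user': user})` in place of the `csv_writer` lines.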


stuckoverflow
  • Awesome, thanks. Do you know how to make it create a new line in the CSV every time it runs? I think right now it creates a new file with only one data set each time. – Dejvciu Jan 14 '21 at 19:27
  • I'm getting this after adding it: File "C:\Users\dejvc\Desktop\tutorial\1a.py", line 36 print('web scraping') ^ IndentationError: unindent does not match any outer indentation level – Dejvciu Jan 14 '21 at 19:37
  • The error was due to mixing spaces and tabs; I've fixed it. – Dejvciu Jan 14 '21 at 19:45
  • I have edited the answer. It will append the data each time to the previous csv file in a nice formatted manner. Consider upvoting if it helped you. :) – stuckoverflow Jan 14 '21 at 19:55
  • Thanks a lot. I tried to upvote, but my reputation is too low ;) – Dejvciu Jan 14 '21 at 20:35
  • @Dejvciu you can still mark it as the accepted answer to help the community :) – stuckoverflow Feb 14 '21 at 10:45

I'm also not very experienced, but I think you can use your computer's task scheduler to run the script once a day. Maybe this video will help you with the task scheduler: https://www.youtube.com/watch?v=s_EMsHlDPnE
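
One detail that may matter with a task scheduler: the script can be started from a different working directory than the one you test in, so a relative file name such as '4fly_forum.csv' might end up somewhere unexpected. A small sketch of one way around that, assuming a hypothetical helper named `csv_next_to_script` (not taken from the question):

```python
from pathlib import Path


def csv_next_to_script(script_file, name='4fly_forum.csv'):
    """Return an absolute path for `name` in the same folder as the script."""
    return Path(script_file).resolve().parent / name
```

Inside the script it could be used as `csv_file = open(csv_next_to_script(__file__), 'a', newline='')`, so the CSV always lands beside the script regardless of where the scheduler starts it.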

  • And if your CSV doesn't look nice because there are empty lines, use this: `csv_file = open('4fly_forum.csv', 'w',newline="")` – Samuel Molina Perales Jan 14 '21 at 19:15
  • I've added it, but I'm still getting a "new file" every time; I would like each run of the script to add a line to the CSV file. – Dejvciu Jan 14 '21 at 19:20
  • To append data to an existing file, you use mode='a', i.e., `csv_file = open('4fly_forum.csv', 'a')`. –  Jan 14 '21 at 19:24
  • If I understand you correctly, the solution would be `csv_file = open('4fly_forum.csv', 'a',newline="")`. That adds the new information while keeping the old, instead of rewriting the whole file each time with just the new information. – Samuel Molina Perales Jan 14 '21 at 19:25
  • Thanks, it worked, but now it adds both the column names and the data every time. How can I fix it so that it adds only the data? – Dejvciu Jan 14 '21 at 19:34
  • If you don't mind doing it non-programmatically, you could just write the column names in the CSV by hand and delete this line: `csv_writer.writerow(['Data i godzina', 'Wszytskich postów', 'Wszytskich tematów', 'Wszytskich użytkowników'])` I'm sure there's a better way and I'll think about it, but I'm not very experienced in Python. – Samuel Molina Perales Jan 14 '21 at 19:39
  • Maybe this will help you: https://stackoverflow.com/questions/28325622/python-csv-writing-headers-only-once – Samuel Molina Perales Jan 14 '21 at 19:42