I am trying to parse the data in the bottom-right table on the following page, the table labelled "Current schedule submissions":

dnedesign.us.to/tables/

I was able to parse it down to:

{s:12:"cfdb7_status";s:6:"unread";s:3:"Day";s:6:"Sunday";s:9:"startTime";s:5:"14:30";s:7:"endTime";s:5:"16:30";}
{s:12:"cfdb7_status";s:6:"unread";s:3:"Day";s:6:"Sunday";s:9:"startTime";s:5:"14:30";s:7:"endTime";s:5:"15:30";}
{s:12:"cfdb7_status";s:6:"unread";s:3:"Day";s:6:"Sunday";s:9:"startTime";s:5:"16:30";s:7:"endTime";s:5:"18:30";}
{s:12:"cfdb7_status";s:6:"unread";s:3:"Day";s:6:"Sunday";s:9:"startTime";s:0:"";s:7:"endTime";s:0:"";}
{s:12:"cfdb7_status";s:6:"unread";s:3:"Day";s:6:"Sunday";s:9:"startTime";s:0:"";s:7:"endTime";s:0:"";}
{s:12:"cfdb7_status";s:6:"unread";s:3:"Day";s:6:"Sunday";s:9:"startTime";s:5:"12:30";s:7:"endTime";s:5:"16:30";}
{s:12:"cfdb7_status";s:6:"unread";s:3:"Day";s:6:"Sunday";s:9:"startTime";s:5:"12:30";s:7:"endTime";s:5:"16:30";}
{s:12:"cfdb7_status";s:6:"unread";s:3:"Day";s:6:"Sunday";s:9:"startTime";s:5:"12:30";s:7:"endTime";s:5:"14:30";}
{s:12:"cfdb7_status";s:6:"unread";s:3:"Day";s:7:"Tuesday";s:9:"startTime";s:5:"14:30";s:7:"endTime";s:5:"16:30";}

and here is the code that performs the parsing to get the above:

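try:
    from urllib.request import urlopen  # Python 3
except ImportError:
    from urllib2 import urlopen  # Python 2

# Import at module level so it runs on both Python versions,
# not only when the urllib.request import fails.
from bs4 import BeautifulSoup
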
url = 'http://dnedesign.us.to/tables/'
page = urlopen(url)
soup = BeautifulSoup(page, "html.parser")
for rows in soup.find_all('tr'):
    for td in rows.find_all('td'):      
        if 'a:' in td.text:
            print(td.text[4:])

I am trying to parse it down to the following:

Day:Tuesday    Starttime:14:30    Endtime:16:30
Day:Sunday     Starttime:12:30    Endtime:14:30
Day:Sunday     Starttime:12:30    Endtime:16:30
Day:Sunday     Starttime:12:30    Endtime:16:30
....
....

And so on for the rest of the table.

I am using Python 3.6.9 and HTTPie 0.9.8 on Linux Mint 19.1 Cinnamon. This is for my graduation project; any help would be appreciated. Thanks, Neil M.

Pankaj

1 Answer

You can use regex to parse the well-formed table data, taking care to look out for empty strings:

try:
    from urllib.request import urlopen
except ImportError:
    from urllib2 import urlopen

import re
from bs4 import BeautifulSoup

url = 'http://dnedesign.us.to/tables/'
soup = BeautifulSoup(urlopen(url), "html.parser")
data = []

for rows in soup.find_all('tr'):
    for td in rows.find_all('td'):      
        if 'a:' in td.text:
            cols = re.findall(r"s:\d+:\"(.*?)\"", td.text)
            data.append({cols[x]: cols[x+1] for x in range(0, len(cols), 2)})

for row in data[::-1]:
    row = {
        k: re.sub(
            r"[a-zA-Z]+", lambda x: x.group().capitalize(), "%s:%s" % (k, v)
        ) for k, v in row.items()
    }
    print("    ".join([row["Day"], row["startTime"], row["endTime"]]))

Output:

Day:Tuesday    Starttime:14:30    Endtime:16:30
Day:Sunday    Starttime:12:30    Endtime:14:30
Day:Sunday    Starttime:12:30    Endtime:16:30
Day:Sunday    Starttime:12:30    Endtime:16:30
Day:Sunday    Starttime:    Endtime:
Day:Sunday    Starttime:    Endtime:
Day:Sunday    Starttime:16:30    Endtime:18:30
Day:Sunday    Starttime:14:30    Endtime:15:30
Day:Sunday    Starttime:14:30    Endtime:16:30

The second loop only formats each row to match your requested output. The real work happens in the first loop, which builds the data list: one dict of key-value pairs per table row.
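
To see that intermediate step in isolation, here is a minimal sketch that runs the same regex on two cell strings copied from the question, without fetching the page. The helper name parse_row is hypothetical, and pairing the matches with zip is just an alternative to the index-based dict comprehension above:

```python
import re

# Cell text copied from the question, so no network access is needed.
tuesday = ('{s:12:"cfdb7_status";s:6:"unread";s:3:"Day";s:7:"Tuesday";'
           's:9:"startTime";s:5:"14:30";s:7:"endTime";s:5:"16:30";}')
empty = ('{s:12:"cfdb7_status";s:6:"unread";s:3:"Day";s:6:"Sunday";'
         's:9:"startTime";s:0:"";s:7:"endTime";s:0:"";}')


def parse_row(text):
    """Hypothetical helper: turn one serialized cell into a dict."""
    # Each s:<length>:"<value>" token yields one captured value;
    # the values alternate key, value, key, value, ...
    cols = re.findall(r's:\d+:"(.*?)"', text)
    return dict(zip(cols[0::2], cols[1::2]))


row = parse_row(tuesday)
blank = parse_row(empty)
print(row["Day"], row["startTime"], row["endTime"])
```

Because the capture group is lazy, an empty value such as s:0:"" comes back as an empty string rather than being skipped, so the key/value pairing stays aligned.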


As for your request to store the items in a class, you can create Schedule instances and populate their fields instead of building dictionaries:

try:
    from urllib.request import urlopen
except ImportError:
    from urllib2 import urlopen

import re
from bs4 import BeautifulSoup


class Schedule: 
    def __init__(self, day, start, end): 
        self.day = day
        self.start = start 
        self.end = end 


url = 'http://dnedesign.us.to/tables/'
soup = BeautifulSoup(urlopen(url), "html.parser")
schedules = []

for rows in soup.find_all('tr'):
    for td in rows.find_all('td'):      
        if 'a:' in td.text:
            cols = re.findall(r"s:\d+:\"(.*?)\"", td.text)
            data = {cols[x]: cols[x+1] for x in range(0, len(cols), 2)}
            schedules.append(Schedule(data["Day"], data["startTime"], data["endTime"]))

for schedule in schedules:
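    # Format explicitly: on Python 2, print(a, b, c) prints a tuple,
    # which shows every string with a u'...' prefix.
    print("%s %s %s" % (schedule.day, schedule.start, schedule.end))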
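
If the goal is to keep the parsed entries around for later use, one possible follow-up is to group the Schedule objects by day. This is only a sketch under assumptions: the by_day mapping, the __repr__ method, and the choice to skip rows with empty times are illustrative additions, not part of the original requirements. It uses three cell strings copied from the question so it runs offline:

```python
import re
from collections import defaultdict


class Schedule:
    def __init__(self, day, start, end):
        self.day = day
        self.start = start
        self.end = end

    def __repr__(self):
        # Added for readable debugging output (illustrative).
        return "Schedule(%r, %r, %r)" % (self.day, self.start, self.end)


# Cell text copied from the question.
cells = [
    '{s:12:"cfdb7_status";s:6:"unread";s:3:"Day";s:7:"Tuesday";'
    's:9:"startTime";s:5:"14:30";s:7:"endTime";s:5:"16:30";}',
    '{s:12:"cfdb7_status";s:6:"unread";s:3:"Day";s:6:"Sunday";'
    's:9:"startTime";s:5:"12:30";s:7:"endTime";s:5:"14:30";}',
    '{s:12:"cfdb7_status";s:6:"unread";s:3:"Day";s:6:"Sunday";'
    's:9:"startTime";s:0:"";s:7:"endTime";s:0:"";}',
]

by_day = defaultdict(list)  # hypothetical container: day name -> schedules

for cell in cells:
    cols = re.findall(r's:\d+:"(.*?)"', cell)
    data = dict(zip(cols[0::2], cols[1::2]))
    # Assumption: submissions without both times are not useful later.
    if not data["startTime"] or not data["endTime"]:
        continue
    by_day[data["Day"]].append(
        Schedule(data["Day"], data["startTime"], data["endTime"])
    )

for schedule in by_day["Sunday"]:
    print("%s %s %s" % (schedule.day, schedule.start, schedule.end))
```

Keying by day makes lookups such as by_day["Sunday"] straightforward; a plain list, as in the answer above, is equally valid if order of submission matters more.
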
ggorlen
  • Wonderful, thank you. For some reason I get the input inverted as such, first column is end time , second is start time and the last column is day: Endtime:16:30 Starttime:14:30 Day:Tuesday Endtime:14:30 Starttime:12:30 Day:Sunday Endtime:16:30 Starttime:12:30 Day:Sunday Endtime:16:30 Starttime:12:30 Day:Sunday Endtime: Starttime: Day:Sunday Endtime: Starttime: Day:Sunday Endtime:18:30 Starttime:16:30 Day:Sunday Endtime:15:30 Starttime:14:30 Day:Sunday Endtime:16:30 Starttime:14:30 Day:Sunday any idea why this is happening? – Neil Nabil Il Saidawi Mar 12 '19 at 02:51
  • Updated! Although dicts preserve insertion order in Python 3.6, I'm working in 3.4 and didn't take into account the original key ordering. Let me know if this doesn't work out for you. – ggorlen Mar 12 '19 at 03:01
  • Works now, thank you. I would like to modify the program slightly by creating a class and storing the parsed elements in the class so they can be saved and used later, it is a requirement for the project I am working on. Something like the following 'class Schedule: def __init__(day, start, end): day.start = start day.end = end Saturday = Schedule(22:00:00, 23:00:00) print(Saturday.start) print(Saturday.end) ' @@ggloren would this be possible? – Neil Nabil Il Saidawi Mar 13 '19 at 22:35
  • Sure, it's possible, but I recommend accepting the answer to this question if the problem is solved and opening a new question with the fresh set of requirements and your attempt so far. – ggorlen Mar 13 '19 at 22:44
  • Thank you. I will make a new post. As for the answer, I did accept it. But because I have accepted below 15 answers because my account is new it is not showing but they say that they recorded my response. Thanks again. – Neil Nabil Il Saidawi Mar 14 '19 at 18:51
  • Hi @ggloren Sorry for the late reply , but for some reason I can't run this program. – Neil Nabil Il Saidawi Mar 30 '19 at 04:06
  • The letter u is printed next to all parsed elements – Neil Nabil Il Saidawi Mar 30 '19 at 18:59