
I wrote a script that I'm using to push updates to Pushbullet channels whenever a new Nexus factory image is released. A separate channel exists for each of the first 11 devices on that page, and I'm using a rather convoluted script to watch for updates. The full setup is here (specifically this script), but I'll briefly summarize the script below. My question is this: This is clearly not the correct way to be doing this, as it's very susceptible to multiple points of failure. What would be a better method of doing this? I would prefer to stick with Python, but I'm open to other languages if they would be simpler/better.

(This question is prompted by the fact that I updated my Apache 2.4 config tonight, and it apparently triggered a slight change in the output of the local files that are watched by urlwatch, so ALL 11 channels got an erroneous update pushed to them.)

Basic script functionality (some nonessential parts are not included):

  • Create dictionary of each device codename associated with its full model name
  • Get existing Nexus Factory Images page using Requests
  • Make bs4 object from source code
  • For each of the 11 devices in the dictionary (loop), do the following:
    • Open/create page in public web directory for the device
    • Write source to that page, filtered using bs4: str(soup.select("h2#" + dev + " ~ table")[0])
    • Call urlwatch on the page to check for updates, save output to temp file
    • If temp file size is > 0 then the page has changed, so push update to the appropriate channel
    • Remove webpage and temp file
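The extraction step above can be sketched roughly as follows. This is a minimal illustration using inline sample markup in place of the real page (which the script fetches with Requests); the `device_table` helper name and the sample HTML are my own, but the selector is the one quoted in the list:

```python
from bs4 import BeautifulSoup

# Sample markup standing in for the factory-images page (the real script
# downloads the page with Requests before parsing it)
html = """
<h2 id="mantaray">Nexus 10</h2>
<table><tr><td>5.1.0 (LMY47I)</td></tr></table>
"""
soup = BeautifulSoup(html, "html.parser")

def device_table(soup, dev):
    """Return the HTML of the first table following the device's h2, or None."""
    tables = soup.select("h2#" + dev + " ~ table")
    return str(tables[0]) if tables else None

snippet = device_table(soup, "mantaray")
# `snippet` is what gets written to the public web directory for urlwatch
```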

A thought that I had while typing this question: Would a possible solution be to save each current version string (for example: 5.1.0 (LMY47I)) as a pickled variable, then if urlwatch detects a difference it would compare the new version string to the pickled one and only push if they're different? I would throw regex matching in as well to ensure that the new format matches the old format and just has updated data, but could this at least be a good temporary measure to try to prevent future false alarms?
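For what it's worth, the idea in the previous paragraph might look something like this. The function name, state-file name, and version-string regex are all assumptions of mine, not part of the existing script:

```python
import os
import pickle
import re

# Matches strings like "5.1.0 (LMY47I)" -- assumed format, adjust as needed
VERSION_RE = re.compile(r"^\d+(\.\d+)* \([A-Z0-9]+\)$")

def should_push(dev, new_version, state_file="versions.pkl"):
    """Push only when the version string is well-formed and actually changed."""
    versions = {}
    if os.path.exists(state_file):
        with open(state_file, "rb") as f:
            versions = pickle.load(f)
    if not VERSION_RE.match(new_version):
        return False  # malformed scrape; treat as a false alarm, don't push
    if versions.get(dev) == new_version:
        return False  # same version as last time; nothing new
    versions[dev] = new_version
    with open(state_file, "wb") as f:
        pickle.dump(versions, f)
    return True
```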

vaindil

1 Answer


Scraping is inherently fragile, but as long as the source format doesn't change, it should be pretty straightforward in this case. You should parse the webpage into a data structure; using bs4 is fine for this. The end result should be a Python dictionary:

{
    'mantaray': {
        '4.2.2 (JDQ39)': {'link': 'https://...'},
        '4.3 (JWR66Y)': {'link': 'https://...'},
    },
    ...
}

Save this structure with json.dumps. Then, every time you parse the page, you can generate the same kind of structure and compare it to the one you have on disk (updating the saved copy each time after you are done).
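The save/load part might look like this. This is a sketch; the helper names and the `images.json` filename are placeholders:

```python
import json
import os

def load_previous(path="images.json"):
    """Load the previously saved structure, or an empty dict on first run."""
    if os.path.exists(path):
        with open(path) as f:
            return json.load(f)
    return {}

def save_current(data, path="images.json"):
    """Persist the freshly scraped structure for the next run to compare against."""
    with open(path, "w") as f:
        f.write(json.dumps(data, indent=2, sort_keys=True))
```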

Then the only part left is comparing the data structures. You can iterate over all models and check that each version in the current copy of the page also exists in the previous one. If it does not, you have a new version.
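The comparison described above could be as simple as (function name is my own):

```python
def new_versions(previous, current):
    """Yield (model, version) pairs present in `current` but not in `previous`."""
    for model, versions in current.items():
        for version in versions:
            if version not in previous.get(model, {}):
                yield model, version

# Example: one new version appeared for mantaray
prev = {'mantaray': {'4.2.2 (JDQ39)': {'link': 'https://...'}}}
curr = {'mantaray': {'4.2.2 (JDQ39)': {'link': 'https://...'},
                     '4.3 (JWR66Y)': {'link': 'https://...'}}}
for model, version in new_versions(prev, curr):
    print(model, version)  # push to the matching channel here
```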

You can also potentially generate an easy-to-use API for this using https://www.kimonolabs.com/ instead of doing the parsing yourself.

Chris Pushbullet