
I'm making an application that monitors URLs for changes. To program the application logic I am using Google Apps Script and a Google Sheet.

Let me explain the monitoring mechanism I have in mind. First, the script reads data from a sheet with the following columns:

URL: The URLs we want to monitor.

First Time: Whether this is the first time the URL is analyzed.

Changes: Whether changes were detected compared to the previous analysis.

HashValue: MD5 hash of the HTML code fetched from the URL.

When the script runs, it reads the rows of the sheet one by one. For each row:

  1. Read the URL and call UrlFetchApp.fetch to get a response from that web page.
  2. Call getContentText on the response to obtain the page's HTML code and save it in a variable.
  3. Apply the MD5 hash algorithm to the HTML code and save the result in a variable.
  4. If the URL is being analyzed for the first time, mark the Changes column as "no changes" (it is the first analysis) and save the hashed HTML code in the HashValue column.
  5. If the URL has been analyzed before, compare the previously stored HashValue with the one just computed.
  6. If the values differ, mark the Changes column as "changed" and save the new hash in the HashValue column.
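The steps above can be sketched roughly like this in Apps Script. The sheet name ("Monitor"), the column order (URL | First Time | Changes | HashValue), and the YES/NO markers are assumptions; adjust them to your actual sheet:

```javascript
// Pure helper: Utilities.computeDigest returns signed bytes (-128..127),
// so convert them to an unsigned hex string before storing/comparing.
function bytesToHex(bytes) {
  return bytes
    .map(function (b) { return ((b + 256) % 256).toString(16).padStart(2, '0'); })
    .join('');
}

// Sketch of the monitoring loop (assumed sheet layout: URL | First Time | Changes | HashValue).
function checkUrls() {
  var sheet = SpreadsheetApp.getActiveSpreadsheet().getSheetByName('Monitor');
  var rows = sheet.getDataRange().getValues();
  for (var i = 1; i < rows.length; i++) {            // skip the header row
    var url = rows[i][0];
    var html = UrlFetchApp.fetch(url).getContentText();
    var hash = bytesToHex(
      Utilities.computeDigest(Utilities.DigestAlgorithm.MD5, html)
    );
    var firstTime = rows[i][1] === 'YES';
    var prevHash = rows[i][3];
    if (firstTime) {
      sheet.getRange(i + 1, 2).setValue('NO');       // no longer the first time
      sheet.getRange(i + 1, 3).setValue('NO');       // first analysis: no changes
      sheet.getRange(i + 1, 4).setValue(hash);
    } else if (prevHash !== hash) {
      sheet.getRange(i + 1, 3).setValue('YES');      // hash differs: changed
      sheet.getRange(i + 1, 4).setValue(hash);
    } else {
      sheet.getRange(i + 1, 3).setValue('NO');
    }
  }
}
```

Note the signed-byte conversion: without it, two identical pages can still produce different-looking hash strings depending on how the byte array is stringified.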

I have already programmed the code, and it works with some websites, but with others it does not. After comparing the HTML of the failing websites with an online text comparator, I noticed the following:

Some websites change their code slightly every time the page is reloaded, even though the content is static. For example, an HTML tag may have the ID box-wrap-140 on one load and box-wrap-148 on the next.

So the script, as implemented, detects changes, because the HTML code really is different. After a lot of research I can't find an alternative that solves this problem, hence the question in the title.
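One way to work around auto-generated attribute values is to normalize the HTML before hashing, so the volatile parts don't affect the hash. A minimal sketch: strip id attributes whose value ends in digits (like box-wrap-140 / box-wrap-148) and collapse whitespace. The regex here is an assumption; tune it to the patterns your target sites actually produce:

```javascript
// Normalize HTML before hashing so volatile, auto-generated parts are ignored.
function normalizeHtml(html) {
  return html
    .replace(/\s+id="[^"]*\d+"/g, '')   // drop id attributes with numeric suffixes
    .replace(/\s+/g, ' ')               // collapse whitespace differences
    .trim();
}
```

In step 3 of the algorithm you would then hash `normalizeHtml(html)` instead of the raw HTML, so two loads that differ only in those IDs produce the same hash.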

PS: We can ignore details such as the website being down or returning 404, 301, etc. response codes. That part is already programmed and works correctly.

PS2: Sorry for my level of English.

Ragnarsito
  • If the HTML is not malformed you can use xmlParse, but often that doesn't work and you have to resort to regular expressions, which can get messy and often causes inconsistent problems. Often sites don't want you to scrape them for various reasons, so they purposefully add malformed content to make things difficult for you, or they may ASCII-encode portions of their site. – Cooper Mar 02 '22 at 16:00
  • How did it go? Care to share? – David d C e Freitas Aug 22 '22 at 21:17
  • I did it too. I store the last fetched page content. I cut down the size by using selector to narrow down to the part I want to monitor for changes. Then I compress it with gzip and encode it to be safe, even if it bloats a bit. e.g. ```function encodeForStorage(input) { return Utilities.base64Encode(Utilities.gzip(Utilities.newBlob(input, 'application/octet-stream')).getBytes()); }``` and decode with ```const b64decoded = Utilities.base64Decode(input); const b64decodedBlob = Utilities.newBlob(b64decoded, 'application/x-gzip'); return Utilities.ungzip(b64decodedBlob).getDataAsString();``` – David d C e Freitas Sep 13 '22 at 19:12

1 Answer


You can use the cheeriogs library to look for specific tags and exclude them from the comparison (e.g. <footer>) or include only them (e.g. a particular <div>).
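cheeriogs gives you jQuery-like selectors inside Apps Script (`const $ = Cheerio.load(html)`), but the same exclude idea can be sketched without the extra library. This rough version drops every `<footer>…</footer>` block before hashing; it assumes footers are not nested:

```javascript
// Remove footer blocks so volatile footer content doesn't affect the hash.
// (With cheeriogs you would instead do: const $ = Cheerio.load(html);
// $('footer').remove(); and hash $.html().)
function stripFooters(html) {
  return html.replace(/<footer[\s\S]*?<\/footer>/gi, '');
}
```

You would apply this (or the cheeriogs equivalent) to the fetched HTML before computing the MD5 hash in step 3.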

TheMaster