
I have been trying to tackle a problem I am having extracting text from a website and filtering it to get the information I want. I have gotten to the point where I create a TextEdit file from the website that looks like this:

7:00
Name of Meeting: Location Bad
Address
Area
8:00
Name of Meeting: Location Good
Address
Area
Noon
Name of Meeting: Location Good 2
Address
Area
3:00 pm
Name of Meeting: Location Bad 2
Area

My goal is to extract all meetings at certain locations (Location Good and Location Good 2), ideally reducing the text to just this information: Time @ Location Good, Time @ Location Good 2.

I do not know how to format the text in order to get this done. I have tried filtering it, but since the information is spread across separate lines, the filter returns only the keyword I am filtering on (using Automator). To work around this, I've done it manually and set up an AppleScript to send me a text message with the information I hand-filtered. This works for now, but when the information on the website changes, mine will be out of date.

Here is the website: https://loukyaa.org/meetings/?tsml-day=6&tsml-region=louisville

My question is: how do I manipulate the text in order to filter out the information that I want? I am interested in filtering all meetings for "Icehouse" and "Token 3 Club." Thank you!

  • You didn't actually ask a question, but I can have a guess at what it is. But for anyone to have a chance to offer any help, you need to show us exactly what you're dealing with. Ideally, a URL to the website where you're extracting the text from will get you the best help. If not, then you need to format the text in your question so that it exactly mirrors what you're handling at your end. Otherwise, code someone writes to parse text displaying one way on our machine isn't going to work particularly well when the text is formatted differently on yours. – CJK Mar 14 '20 at 07:11

2 Answers


With the incomplete information presented in your question, let me offer a solution for both Safari and Google Chrome. It opens the target URL in a new window, uses JavaScript to get the inner text of the table of meetings, closes the window, and filters the text to the form Time @ Location, e.g. 7:00 am @ Token 3 Club, keeping only the meeting time and location for Icehouse and Token 3 Club.

In this use case, the JavaScript returns paragraphs of tab-delimited text in the variable foo, which is then filtered using awk in a do shell script command. The final output is stored in a variable named bar, which you can then do whatever you'd like with.
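To see what the awk step does in isolation, here is a minimal sketch that can be run in Terminal rather than from AppleScript. The sample rows are made up for illustration (the meeting names are borrowed from this page, the fourth column is invented), and it assumes the page's text comes back as time, meeting name, location in tab-separated columns; the real text may differ. The names sample and filter_meetings are hypothetical helpers, not part of the AppleScript code.

```shell
#!/bin/sh
# Hypothetical sample rows: time<TAB>meeting name<TAB>location<TAB>region.
# The text returned by the real page may have more or different columns.
sample=$(printf '%s\t%s\t%s\t%s\n' \
    '7:00 am' 'Saturday @ 7' 'Token 3 Club' 'Louisville' \
    '8:30 am' 'Saturday Morning Meditation Group' 'Christ Church United Methodist' 'Louisville' \
    '8:30 am' 'Saturday Morning Gratitude Group' 'Icehouse' 'Louisville')

# Same awk program as used in this answer: split input on tabs, join output
# fields with " @ ", and print fields 1 and 3 of every row that mentions
# either location.
filter_meetings() {
    awk 'BEGIN{FS="\t"; OFS=" @ "} /Icehouse|Token 3 Club/ {print $1, $3}'
}

printf '%s\n' "$sample" | filter_meetings
```

With these three rows it prints 7:00 am @ Token 3 Club and 8:30 am @ Icehouse, and skips the Christ Church United Methodist row.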

The following example AppleScript code is for Safari:

set theURL to "https://loukyaa.org/meetings/?tsml-day=6&tsml-region=louisville"

tell application "Safari" to make new document with properties {URL:theURL}

tell application "System Events"
    repeat until exists ¬
        (buttons of UI elements of groups of toolbar 1 of window 1 of ¬
            application process "Safari" whose name = "Reload this page")
        delay 0.5
    end repeat
end tell

tell application "Safari"
    set foo to do JavaScript ¬
        "document.getElementById('meetings_tbody').innerText;" in document 1
    close its front window
end tell

set awkCommand to ¬
    "awk 'BEGIN{FS=\"\t\"; OFS=\" @ \"}/Icehouse|Token 3 Club/{print $1,$3}'"

set bar to do shell script awkCommand & " <<< " & foo's quoted form

  • NOTE: This code was tested under macOS High Sierra. For macOS Mojave and later, remove the words buttons of from the repeat until exists ¬ ... code.

  • NOTE: do JavaScript only works if Allow JavaScript from Apple Events is checked on the Safari > Develop menu, which is hidden by default and can be shown by checking [√] Show Develop menu in menu bar in: Safari > Preferences… > Advanced


The following example AppleScript code is for Google Chrome:

set theURL to "https://loukyaa.org/meetings/?tsml-day=6&tsml-region=louisville"

tell application "Google Chrome"
    set URL of active tab of (make new window) to theURL
    repeat until (loading of tab 1 of window 1 is false)
        delay 0.5
    end repeat
    tell active tab of front window to set foo to ¬
        execute javascript ¬
            "document.getElementById('meetings_tbody').innerText;"
    close its front window
end tell

set awkCommand to ¬
    "awk 'BEGIN{FS=\"\t\"; OFS=\" @ \"}/Icehouse|Token 3 Club/{print $1,$3}'"

set bar to do shell script awkCommand & " <<< " & foo's quoted form

NOTE: This should work by default, as Google Chrome allows execution of JavaScript.


In either case the variable bar contains e.g.:

7:00 am @ Token 3 Club
8:00 am @ Token 3 Club
8:30 am @ Icehouse
8:30 am @ Icehouse
10:30 am @ Icehouse
2:00 pm @ Token 3 Club
4:00 pm @ Token 3 Club
6:00 pm @ Icehouse
6:00 pm @ Icehouse
6:00 pm @ Token 3 Club
8:00 pm @ Icehouse
8:00 pm @ Token 3 Club
10:30 pm @ Token 3 Club

You can then do with it as you wish.
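One thing to be aware of: the pattern /Icehouse|Token 3 Club/ is tested against the whole row, so a meeting whose name or address happens to contain either string would also be printed. If that matters, a possible variant compares the location field directly. It is sketched here for a shell and assumes the location is exactly the third tab-separated field with no surrounding whitespace. The rows (including "Friends of Icehouse" and "Some Other Hall") and the name filter_by_location are invented for the example.

```shell
#!/bin/sh
# Hypothetical rows; the second one has "Icehouse" in the meeting name only.
rows=$(printf '%s\t%s\t%s\n' \
    '8:30 am' 'Saturday Morning Gratitude Group' 'Icehouse' \
    '9:00 am' 'Friends of Icehouse' 'Some Other Hall' \
    '2:00 pm' 'Afternoon Meeting' 'Token 3 Club')

# Compare field 3 exactly instead of matching the pattern anywhere in the row.
filter_by_location() {
    awk 'BEGIN{FS="\t"; OFS=" @ "}
         $3 == "Icehouse" || $3 == "Token 3 Club" {print $1, $3}'
}

printf '%s\n' "$rows" | filter_by_location
```

Here the 9:00 am row is dropped, whereas the whole-row pattern would have printed it as 9:00 am @ Some Other Hall. If the real text carries stray spaces around the location, the exact comparison would miss it, so the whole-row pattern is the more forgiving choice.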

Also note that the \t in the FS=\"\t\"; portion of the awk command is expanded to a literal tab character when the script is compiled in, e.g., Script Editor. Writing \t is necessary when posting code on this site; otherwise it would show as, e.g., FS=\" \"; and the copied code would no longer contain a tab character once compiled.


Note: The example AppleScript code is just that and does not contain any error handling. The onus is upon the user to add whatever error handling is appropriate, needed or wanted. Have a look at the try statement and error statement in the AppleScript Language Guide. See also, Working with Errors. Additionally, the delay command may be needed between events, e.g. delay 0.5, with the value of the delay set appropriately.

user3439894
  • Thank you user3439894! I am now learning how to use the awk command, as it is clearly very useful! This is exactly the text manipulation I was looking for! The Google Chrome version works like a charm; the Safari one, however, does not... it just keeps running until I stop it. I double-checked the JavaScript permissions, and both my Safari and Chrome have it enabled. – just_dabbling Mar 16 '20 at 23:31
  • @just_dabbling, All of the example AppleScript code shown in my answer works for me as is on my system running macOS High Sierra. What version of macOS are you running? – user3439894 Mar 17 '20 at 01:27
  • Catalina. 10.15.3 – just_dabbling Mar 17 '20 at 02:13
  • @just_dabbling, I've now tested it in macOS Mojave and macOS Catalina, so with Safari for macOS Mojave and later, remove the words buttons of from the repeat until exists ¬ ... code and it will work. – user3439894 Mar 17 '20 at 02:26

@user3439894's answer is excellent, and he's shown you some good, robust techniques for determining whether a webpage has loaded; some elementary JavaScript; and the power of awk.

I decided to do it a different way. I use JavaScript to do all the heavy processing, largely because my ultimate goal was to obtain a list of record objects, each representing a single event listed on the webpage, from which I extracted the name, location and time of each event.

tell application id "com.apple.Safari" to tell ¬
    document 1 to set allEvents to do JavaScript ¬
    "Array.from(document
               .querySelectorAll('tbody#meetings_tbody '+
                                'tr '+
                                'td.name,'+
                                'td.time,'+
                                'td.location'))
               .reduce((ξ,x,i,L) => {
                        i%3==1 && ξ.push({
                                'name': L[i].innerText,
                                'time': L[i-1].innerText,
                                'location': L[i+1].innerText
                        });
                        return ξ;
               }, []);"

The variable allEvents should then contain something like this:

{{|name|:"Saturday @ 7", |time|:"7:00 am", location:"Token 3 Club"},
 {|name|:"Early Bird Meeting", |time|:"8:00 am", location:"Token 3 Club"},
 {|name|:"Saturday Morning Meditation Group", |time|:"8:30 am", location:"Christ Church United Methodist"},
 {|name|:"Saturday Morning Gratitude Group", |time|:"8:30 am", location:"Icehouse"},
 ...,
 {|name|:"Agape", |time|:"10:30 pm", location:"Token 3 Club"}}

I'm not sure how familiar you are with AppleScript list or record objects. If you examine the contents carefully, you'll see that each event is represented by an object that looks like this:

{|name|:"...", |time|:"...", location:"..."}

That is a record, which contains three properties: |name|, |time|, and location. Each property has a value, which you retrieve by referencing the <property> of <record>. So, if one creates a record object and assigns it to a variable thus:

set R to {a:1, b:"two", c:pi}

then:

set myvar to b of R

will retrieve the value of property b belonging to record R and store it in the variable myvar. So myvar will now evaluate to "two".

allEvents isn't just one record object; it's many. It's a list of them. Here's an example of a list:

set L to {1, "two", pi, 2^2, "5.0"}

It doesn't contain properties; it only contains values, and these are termed items. A list is held in strict order, whereas a record is not. Therefore, the value "two" will always appear as the second item in that list, while in the record it can appear at the beginning, middle, or end, yet it will always be attached to the property b. To retrieve an item from a list:

set myvar to item 2 of L

So, skipping to the end somewhat, if you want the location of the 4th event in that list:

return the location of item 4 in allEvents --> "Icehouse"

You'll still want to follow @user3439894's example, and implement a test to determine when the page has loaded (unless you intend to trigger the script manually only after loading the page yourself). @user3439894 has also shown you how to adapt the code to a Chromium-based browser (Google Chrome, Vivaldi, Brave).

CJK