Remove HTML tag associated with a class

Question

I am forcing myself to learn how to script solely in AppleScript, but I am stuck trying to remove a particular tag with a given class. I've tried to find solid documentation and examples, but at this time they seem to be very limited.

Here is the HTML I have:

<p>Bacon ipsum dolor amet pork chop landjaeger short ribs boudin short loin jowl <span class="foo">shoulder</span> biltong shankle capicola drumstick pork loin rump spare ribs ham hock. <span class="bar">Pig brisket</span> jowl ham pastrami <span class="foo">jerky</span> strip steak bacon doner. Short loin leberkas jowl, filet mignon turducken chicken ribeye shank tail swine strip steak pork loin sausage. Frankfurter ground round porchetta, pork short ribs jowl alcatra flank sausage.</p>

What I am trying to do is remove the tag for a particular class while keeping its text, so it would remove <span class="foo"> and its closing </span>, with this result:

<p>Bacon ipsum dolor amet pork chop landjaeger short ribs boudin short loin jowl shoulder biltong shankle capicola drumstick pork loin rump spare ribs ham hock. <span class="bar">Pig brisket</span> jowl ham pastrami jerky strip steak bacon doner. Short loin leberkas jowl, filet mignon turducken chicken ribeye shank tail swine strip steak pork loin sausage. Frankfurter ground round porchetta, pork short ribs jowl alcatra flank sausage.</p>

I know how to do this with do shell script and through the terminal but I am wanting to learn what is available through AppleScript's dictionary.

In my research I found a way to strip all HTML tags with:

on removeMarkupFromText(theText)
    set tagDetected to false
    set theCleanText to ""
    repeat with a from 1 to length of theText
        set theCurrentCharacter to character a of theText
        if theCurrentCharacter is "<" then
            set tagDetected to true
        else if theCurrentCharacter is ">" then
            set tagDetected to false
        else if tagDetected is false then
            set theCleanText to theCleanText & theCurrentCharacter as string
        end if
    end repeat
    return theCleanText
end removeMarkupFromText

but that removes all HTML tags, which is not what I want. Searching SO, I found how to extract text between tags in "Parsing HTML source code using AppleScript", but I'm not looking to parse the file.

I am familiar with BBEdit's Balance Tags known as Balance in the drop down but when I run:

tell application "BBEdit"
    activate
    find "<span class=\"foo\">" searching in text 1 of text document "test.html" options {search mode:grep, wrap around:true} with selecting match
    balance tags
end tell

it turns greedy: it selects everything on the line from the first tag to the second-to-last closing tag, including the text in between, instead of limiting itself to the first tag and its text.

Researching further in the dictionary under tag, I ran across find tag, which lets me do: set spanTarget to (find tag "span" start_offset counter). I can then target the tag by its class with |class| of attributes of tag of spanTarget and use balance tags, but I still run into the same issue as before.

So in pure AppleScript how can I remove a tag associated with a class without it being greedy?

Why? Where is this HTML coming from and what do you want to use it for? If you need to extract text content from other people's web pages, the only thing smart enough to parse web pages correctly is a web browser, so you should look at scripting [e.g.] Safari instead. If you're trying to clean up old HTML files by removing unwanted tags, there are likely tools already out there that'll do it for you, e.g. have you looked at BBEdit's docs? At any rate, AppleScript is the worst language you could use for any sort of 'text' processing; use AS for driving apps, and let the apps do the real work. — foo, Jun 09 '16 at 17:58

score 1 · Answer 1 · answered Jun 10 '16 at 18:07

You can use a regex in the find command for BBEdit or TextWrangler:

To select the tag (Non-Greedy), use this command:

find "<span class=\"foo\">.+?</span>" searching in text 1 of text document 1 options {search mode:grep, wrap around:true} with selecting match

How the .+? part of the pattern works:

. matches any character (except a line break)
+ means one or more repetitions of the preceding item, here any character
? after a quantifier makes it non-greedy, so it matches as few characters as possible

So the pattern matches an opening span tag, followed by one or more characters other than a line break, followed by a closing span tag. The non-greedy quantifier gives the result we want: it stops BBEdit at the first closing </span> tag instead of overrunning it and matching across several tags.
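
To see the difference between the greedy and non-greedy forms outside BBEdit, here is a minimal sketch that can be run in Script Editor. It assumes the AppleScriptObjC bridge and uses Foundation's regular expression search rather than BBEdit's grep engine, so it only illustrates the quantifier behaviour; the sample text and the "[match]" marker are made up for the illustration.

```applescript
use AppleScript version "2.5"
use framework "Foundation"
use scripting additions

-- Sample text with two foo spans on one line (made up for this illustration)
set theHTML to "a <span class=\"foo\">one</span> b <span class=\"foo\">two</span> c"
set theString to current application's NSString's stringWithString:theHTML
set fullRange to {location:0, |length|:theString's |length|()}
set regexSearch to current application's NSRegularExpressionSearch

-- Greedy: .+ runs on to the last </span> on the line
set greedyResult to (theString's stringByReplacingOccurrencesOfString:"<span class=\"foo\">.+</span>" withString:"[match]" options:regexSearch range:fullRange) as text

-- Non-greedy: .+? stops at the first </span> it reaches
set lazyResult to (theString's stringByReplacingOccurrencesOfString:"<span class=\"foo\">.+?</span>" withString:"[match]" options:regexSearch range:fullRange) as text

return {greedyResult, lazyResult}
-- {"a [match] c", "a [match] b [match] c"}
```

The greedy form swallows both spans and the text between them as a single match, which is the same overrun the question describes.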

To match the pattern across line breaks, just put (?s) at the beginning of the pattern, like this:

find "(?s)<span class=\"foo\">.+?</span>" searching in text 1 of text document 1 options {search mode:grep, wrap around:true} with selecting match

The command matches a tag with no line break:

<span class="foo">shoulder</span>

Or a tag containing a line break:

<span class="foo">shoulder
</span>

Or a tag spanning multiple lines:

<span class="foo">shoulder
xxxx
yyyy
zzzz</span>

From an AppleScript, you can use the replace command (BBEdit or TextWrangler) to find a pattern and delete all matched strings, like this:

replace "(?s)<span class=\"foo\">.+?</span>" using "" searching in text 1 of text document 1 options {search mode:grep, wrap around:true}
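
Note that the command above deletes the enclosed text along with the tags. If the goal is to keep the text, as in the question's expected result, one possible variant is to capture the text in a group and use it as the replacement. This is a sketch, assuming BBEdit's grep replacement syntax, where \1 refers to the first capture group (written as "\\1" inside an AppleScript string), and assuming the document of interest is text document 1:

```applescript
tell application "BBEdit"
	-- (.+?) captures the span's text; "\\1" puts only that text back
	replace "(?s)<span class=\"foo\">(.+?)</span>" using "\\1" searching in text 1 of text document 1 options {search mode:grep, wrap around:true}
end tell
```

Spans with other classes, such as bar, do not match the pattern and are left untouched.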

Ron Reuter · Answer 2 · 2016-06-10T03:26:30.400

This is a job for Regular Expressions, which are available through the use of the now-supported AppleScriptObjC bridge. Paste this code into Script Editor and run it:

use AppleScript version "2.5" -- for El Capitan or later
use framework "Foundation"
use scripting additions

on stringByMatching:thePattern inString:theString replacingWith:theTemplate
    set theNSString to current application's NSString's stringWithString:theString
    -- Options used when compiling the pattern
    set theOptions to (current application's NSRegularExpressionDotMatchesLineSeparators as integer) + (current application's NSRegularExpressionAnchorsMatchLines as integer)
    set theExpression to current application's NSRegularExpression's regularExpressionWithPattern:thePattern options:theOptions |error|:(missing value)
    -- Matching options are a separate set of flags from the compile options; 0 means none
    set theResult to theExpression's stringByReplacingMatchesInString:theNSString options:0 range:{location:0, |length|:theNSString's |length|()} withTemplate:theTemplate
    return theResult as text
end stringByMatching:inString:replacingWith:

set theHTML to "<p>Bacon ipsum dolor amet pork chop landjaeger short ribs boudin short loin jowl <span class='foo'>SHOULDER</span> biltong shankle capicola drumstick pork loin rump spare ribs ham hock. <span class='bar'>PIG BRISKET</span> jowl ham pastrami <span class='foo'>JERKY</span> strip steak bacon doner. Short loin leberkas jowl, filet mignon turducken chicken ribeye shank tail swine strip steak pork loin sausage. Frankfurter ground round porchetta, pork short ribs jowl alcatra flank sausage.</p>"

set modifiedHTML to its stringByMatching:"<span .*?>(.*?)</span>" inString:theHTML replacingWith:"$1"

This works with well-formed HTML. As user foo pointed out above, a browser can cope with badly formed HTML, but your script probably can't.
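
As written, the pattern <span .*?> matches every span, so the bar span would be unwrapped too. A possible way to restrict it to one class is sketched below. The handler name unwrapSpans and its parameters are hypothetical, and the sketch assumes the class name contains no regex metacharacters, that class is the span's only attribute, and that the spans are not nested.

```applescript
use AppleScript version "2.5"
use framework "Foundation"
use scripting additions

-- Hypothetical helper: unwraps <span class="..."> elements of one class, keeping their text
on unwrapSpans(theHTML, className)
	set theNSString to current application's NSString's stringWithString:theHTML
	-- ['\"] accepts either quote style; (.*?) lazily captures the span's text
	set thePattern to "<span class=['\"]" & className & "['\"]>(.*?)</span>"
	set theExpression to current application's NSRegularExpression's regularExpressionWithPattern:thePattern options:(current application's NSRegularExpressionDotMatchesLineSeparators) |error|:(missing value)
	-- $1 in the template stands for the captured text
	set theResult to theExpression's stringByReplacingMatchesInString:theNSString options:0 range:{location:0, |length|:theNSString's |length|()} withTemplate:"$1"
	return theResult as text
end unwrapSpans

unwrapSpans("x <span class='foo'>A</span> y <span class=\"bar\">B</span> z", "foo")
-- "x A y <span class=\"bar\">B</span> z"
```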

score 0 · Accepted Answer · answered Jun 10 '16 at 03:32

I believe Ron's answer is a good approach, but if you don't want to use regular expressions this can be achieved with the code below. I wasn't going to post it after seeing Ron had answered, but I had already created it so I figured I would at least give you a second option since you are trying to learn.

on run
    set theHTML to "<p>Bacon ipsum dolor amet pork chop landjaeger short ribs boudin short loin jowl <span class=\"foo\">shoulder</span> biltong shankle capicola drumstick pork loin rump spare ribs ham hock. <span class=\"bar\">Pig brisket</span> jowl ham pastrami <span class=\"foo\">jerky</span> strip steak bacon doner. Short loin leberkas jowl, filet mignon turducken chicken ribeye shank tail swine strip steak pork loin sausage. Frankfurter ground round porchetta, pork short ribs jowl alcatra flank sausage.</p>" 
    set theHTML to removeTag(theHTML, "<span class=\"foo\">", "</span>")
end run

on removeTag(theText, startTag, endTag)
    if theText contains startTag then
        -- Split the text at every occurrence of the opening tag
        set AppleScript's text item delimiters to startTag
        set tempText to text items of (theText as string)
        set AppleScript's text item delimiters to {""}

        -- Item 2 is the text that follows the first opening tag
        set middleText to item 2 of tempText as string
        if middleText contains endTag then
            -- Drop only the first closing tag after that opening tag
            set AppleScript's text item delimiters to endTag
            set tempText2 to text items of (middleText as string)
            set AppleScript's text item delimiters to {""}
            set newString to implode(tempText2, endTag)
            set item 2 of tempText to newString
        end if
        -- Drop the first opening tag, then recurse to handle the remaining ones
        set newString to implode(tempText, startTag)
        return removeTag(newString, startTag, endTag) -- recursive
    else
        return theText
    end if
end removeTag

-- Joins items 1 and 2 of parts directly (dropping one tag) and rejoins the rest with tag
on implode(parts, tag)
    set newString to items 1 thru 2 of parts as string
    if (count of parts) > 2 then
        -- Build one flat list: the merged head followed by the remaining items
        set newList to {newString} & (items 3 thru -1 of parts)
        set AppleScript's text item delimiters to tag
        set newString to (newList as string)
        set AppleScript's text item delimiters to {""}
    end if
    return newString
end implode

Interesting approach. Do you mind explaining what the `text item delimiters` are doing? Also, isn't it stated that the `on run` should be placed at the end of the script to work properly? — DᴀʀᴛʜVᴀᴅᴇʀ, Jun 10 '16 at 18:40
@Darth_Vader - Text item delimiters are similar to an "explode" or "delimit" command in other languages. Turns a string into a list broken into the parts at the delimiter. I've never had an issue with the on run being placed at the top. — ThrowBackDewd, Jun 10 '16 at 19:32
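
For readers new to text item delimiters, here is a minimal standalone sketch of the split and join steps the answer relies on. The sample string and variable names are made up for the illustration.

```applescript
-- Remember the current delimiters so they can be restored afterwards
set savedDelimiters to AppleScript's text item delimiters

-- Split: with "</span>" as the delimiter, the text items are the pieces between each occurrence
set AppleScript's text item delimiters to "</span>"
set theParts to text items of "one</span>two</span>three"
-- theParts is {"one", "two", "three"}

-- Join: coercing a list to text inserts the current delimiter between the items
set AppleScript's text item delimiters to "-"
set joinedText to theParts as text

-- Restore the delimiters so later code is not affected
set AppleScript's text item delimiters to savedDelimiters
return joinedText -- "one-two-three"
```

Splitting on a tag and joining with an empty delimiter removes every occurrence of that tag, which is why the answer joins only the first two items with "" and rejoins the rest with the tag itself.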
