In Ruby, how can I remove non-word characters?

Question

This script is part of a bigger one. When I run it, "<p></p>" gets printed out as well. How can I remove this?

I used this regex: m.gsub!(/(?=\S)(\d|\W)/,"")

But it only removed the characters "<" and "/>".

Here is my script:

require 'open-uri'
require 'rexml/document'
include REXML

doc = REXML::Document.new(open('http://testnavet.skolverket.se/SusaNavExport/EmilObjectExporter?id=184594606&strId=info.uh.gu.GS5&EMILVersion=1.1').read)

doc.elements.each("//*[name()='ct:text'] | /ns:educationInfo/ns:extensionInfo/gu:guInfoExtensions/gu:guSubject/gu:descriptions/gu:description") do |e|
  m = e.text
  puts "Description: " + m
end

What do you define a "word" as? What you have *is* removing non-word characters. `p` is a word character so it remains. — Andrew Marshall, Mar 03 '12 at 20:36
I'd like to remove the HTML characters. @AndrewMarshall yeah, I know, buddy. — , Mar 03 '12 at 21:06
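
To illustrate the point in the comment above, here is a small sketch of what the regex from the question does. The sample string is made up for illustration; it is not taken from the feed in the question.

```ruby
# Hypothetical sample input, chosen only to show which characters survive.
m = "<p>Text 42</p>"

# (?=\S) requires the next character to be non-whitespace;
# (\d|\W) then matches it only if it is a digit or a non-word character.
result = m.gsub(/(?=\S)(\d|\W)/, "")

puts result.inspect  # "pText p"
```

The angle brackets, the slash, and the digits are removed, but the letters `p` are word characters, so the tag names are left behind.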

Jwosty · Accepted Answer · 2012-03-03T21:05:52.800

1

Ah, so you want to remove HTML tags. If so, you can do this:

str.gsub(/<.+?>/, "")

Thus, "<div>Hello world!</div>" becomes "Hello world!"
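
A minimal sketch comparing a greedy pattern with the non-greedy one used here, plus a negated character class as one possible alternative. The variable names are only for illustration.

```ruby
html = "<div>Hello world!</div>"

# Greedy: .+ runs from the first "<" to the last ">", so the whole string matches.
greedy = html.gsub(/<.+>/, "")

# Non-greedy: .+? stops at the first ">", so each tag is matched on its own.
lazy = html.gsub(/<.+?>/, "")

# Alternative: match anything except ">" between the brackets.
negated = html.gsub(/<[^>]+>/, "")

puts greedy.inspect   # ""
puts lazy.inspect     # "Hello world!"
puts negated.inspect  # "Hello world!"
```

Keep in mind that `.` does not match a newline unless the `m` flag is given, so a tag that spans several lines would slip past `/<.+?>/`, while `[^>]+` would still match it. Neither pattern is a real HTML parser; for anything beyond simple markup a proper parser is probably the safer choice.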

edited Mar 03 '12 at 21:05

answered Mar 03 '12 at 20:38

**No, it doesn’t.** It becomes `""`. – tchrist Mar 03 '12 at 20:43
That's because it should be: `/<.+?>/` where you do the non-greedy match: `+?`. Note, this is a base-case and escaped > characters would defeat this. Is that what the OP is looking for? – Mike Ryan Mar 03 '12 at 20:54
Just for those who don't know: http://rubular.com/ is a great place for playing around with Ruby regexps – Hugo Mar 03 '12 at 21:10
It should be: str.gsub!(/<.+?>/, "")... you forgot the "!" char – Mar 03 '12 at 21:16
@SHUMAcupcake Note that `gsub` does work, it just returns the result, rather than modifying `str` like `gsub!` does. – Andrew Marshall Mar 03 '12 at 21:23
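
A short sketch of the difference between `gsub` and `gsub!` described in the comment above, using a made-up sample string:

```ruby
str = "<p>Some text</p>"

# gsub returns a new string and leaves str alone.
stripped = str.gsub(/<.+?>/, "")
puts str       # <p>Some text</p>
puts stripped  # Some text

# gsub! modifies str in place.
str.gsub!(/<.+?>/, "")
puts str       # Some text

# gsub! returns nil when nothing was replaced, so don't chain on its return value.
second_pass = str.gsub!(/<.+?>/, "")
puts second_pass.inspect  # nil
```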
@AndrewMarshall Aha, cool. Do you know how I should handle the outputs that don't have anything in them? – Mar 03 '12 at 21:26
That just means there was nothing inside the tag – Jwosty Mar 06 '12 at 00:43
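
One possible way to handle the empty results, sketched under assumptions: the array below is hypothetical sample data standing in for the values `e.text` might return in the question's loop, with `nil` modelling an element that has no text at all.

```ruby
# Hypothetical sample values; not taken from the real feed.
raw_descriptions = ["<p>Course overview</p>", "<p></p>", nil, "  <p> </p> "]

# Strip tags and surrounding whitespace; nil.to_s is "", so nil is safe here.
cleaned = raw_descriptions.map do |text|
  text.to_s.gsub(/<.+?>/, "").strip
end

# Skip anything that ended up empty after stripping.
non_empty = cleaned.reject(&:empty?)

non_empty.each do |text|
  puts "Description: " + text
end
```

In the original script, the same idea would amount to cleaning `e.text` inside the block and moving on to the next element when the result is empty.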
