
OK, this is kind of tricky. I have this text:

<something>
   <h1> quoiwuqoiuwoi aoiuoisquiooi

       <script> dsadsa  dsa </script>

       Some text here in the middle! =)   

       <script> dsadsa  dsa </script>

   </h1>
</something>

I want to get only the content of the h1 without the script tags, in other words:

   <h1> quoiwuqoiuwoi aoiuoisquiooi


       Some text here in the middle! =)   


   </h1>

Including the h1 tags themselves.

After some research I found that I can get everything between the h1 tags with the following regex:

   /<h1([^]*)h1>/

However, I can't find a way to exclude what's between the script tags, including the script tags themselves. Any help would be much appreciated.
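One way to do what is asked here, as a minimal sketch rather than a definitive solution: match the first h1 block non-greedily, then delete every script (and iframe) element inside it. The function name extractH1WithoutScripts is hypothetical, and the sketch assumes the markup is simple enough for regexes, for example that no script body contains the literal text </script>.

```javascript
// Hypothetical helper (name is an assumption, not from the question):
// returns the first <h1>...</h1> block with <script> and <iframe> elements removed.
function extractH1WithoutScripts(html) {
  // Non-greedy match so the capture stops at the first closing </h1>.
  var match = /<h1\b[^>]*>[\s\S]*?<\/h1>/i.exec(html);
  if (!match) {
    return null; // no h1 found
  }
  // Remove each script/iframe pair, including attributes and multi-line bodies.
  // \1 refers back to whichever tag name opened the element.
  return match[0].replace(/<(script|iframe)\b[^>]*>[\s\S]*?<\/\1\s*>/gi, '');
}

// The sample text from the question.
var sample = [
  '<something>',
  '   <h1> quoiwuqoiuwoi aoiuoisquiooi',
  '',
  '       <script> dsadsa  dsa </script>',
  '',
  '       Some text here in the middle! =)   ',
  '',
  '       <script> dsadsa  dsa </script>',
  '',
  '   </h1>',
  '</something>'
].join('\n');

console.log(extractH1WithoutScripts(sample));
```

Unlike /<h1([^]*)h1>/, which is greedy and runs to the last "h1>" in the input, the non-greedy form stops at the first closing tag, which matters when a page has more than one h1.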

In case anyone is wondering why I need that, here is a brief explanation:

I'm using this code to scrape some data from a site using Google Sheets (Google Apps Script):

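function doGet() {
  // Fetch the product page as a string.
  var html = UrlFetchApp.fetch('https://www.nespresso.com/br/pt/product/maquina-de-cafe-espresso-pixie-clips-c60-preta-e-lima-neon-110v').getContentText();
  // Greedy match: everything from the first "<h1" to the last "h1>".
  var regExp = new RegExp("<h1([^]*)h1>", "gi");
  var h1 = regExp.exec(html);
  Logger.log(h1);
  // This is where it breaks when the matched markup contains script or iframe tags.
  var doc = XmlService.parse(h1[0]);
  var root = doc.getRootElement();
  // getElementsByClassName is not built in; it is assumed to be a user-defined helper.
  var menu = getElementsByClassName(root, 'nes_pdp-title nes_pdp-title-sep-none')[0];
  var output = menu.getText();
  Logger.log(output);
}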

However, it has a problem parsing script tags and iframes. The only solution I could find was to strip them from the markup. If anyone has a better solution, I'm all ears.

If I don't remove the script and iframe tags, the code breaks before I can call .getElementsByTagName. It breaks when I call XmlService.parse(). I can only pass a value to XmlService.parse() if it contains neither a script tag nor an iframe tag. Thank you again!
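If only the heading text is needed, one possible alternative is to skip the XML parser altogether and reduce the h1 to plain text with string operations. This is a sketch under assumptions: textOfFirstH1 is a hypothetical name, the sample page string is made up for illustration, HTML entities are not decoded, and it takes the first h1 on the page rather than selecting by class name.

```javascript
// Hypothetical helper (not part of Apps Script): returns the visible text
// of the first <h1>, or null when the page has no h1.
function textOfFirstH1(html) {
  var match = /<h1\b[^>]*>([\s\S]*?)<\/h1>/i.exec(html);
  if (!match) {
    return null;
  }
  return match[1]
    // Drop script/iframe elements together with their content.
    .replace(/<(script|iframe)\b[^>]*>[\s\S]*?<\/\1\s*>/gi, ' ')
    // Drop any remaining tags but keep the text they wrap.
    .replace(/<[^>]+>/g, ' ')
    // Collapse runs of whitespace and trim the ends.
    .replace(/\s+/g, ' ')
    .trim();
}

// Made-up sample markup, for illustration only.
var page = '<body><h1 class="nes_pdp-title">\n' +
  '<script>var a = 1 < 2;</script>\n' +
  '<span>Pixie Clips</span>\n' +
  '<iframe src="about:blank"></iframe>\n' +
  '</h1></body>';

console.log(textOfFirstH1(page)); // prints "Pixie Clips"
```

In the script from the question this would be called as textOfFirstH1(UrlFetchApp.fetch(url).getContentText()), so XmlService.parse() never sees the script or iframe tags.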

user3347814

1 Answer


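Try replacing the .innerHTML of the h1 element using String.prototype.replace() with a RegExp that matches each script tag together with the text inside it, then call .trim() on the result:

var h1 = document.getElementsByTagName("something")[0].querySelector("h1");
// Remove every <script>...</script> pair. [\s\S]*? also spans line breaks
// and stops at the first closing tag; [^>]* allows attributes on the tag.
h1.innerHTML = h1.innerHTML.replace(/<script\b[^>]*>[\s\S]*?<\/script>/gi, "")
              .trim();
console.log(h1.outerHTML)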

applied to this HTML:

<something>
   <h1> quoiwuqoiuwoi aoiuoisquiooi

       <script> dsadsa  dsa </script>

       Some text here in the middle! =)   

       <script> dsadsa  dsa </script>

   </h1>
</something>
guest271314
  • Thank you, but the code breaks on the script tags before I can call .getElementsByTagName. It breaks when I call XmlService.parse(). I can only pass a value to XmlService.parse() if it contains neither a script tag nor an iframe tag. – user3347814 Nov 23 '15 at 10:42
  • @user3347814 What is the requirement? Is the expected result to put the text returned from `.replace()` inside the existing `h1`, e.g. `document.getElementsByTagName("h1")[0].innerHTML = res;`? – guest271314 Nov 23 '15 at 10:45