I want to extract text from HTML like this:

<div id="sn1058961" class="soundTrack soda odd">Boom Shack-a-Lak<br />
Written by <a href="/name/nm0031896?ref_=ttsnd_snd_1">Apache Indian</a> (as  Stephen Kapur) and Ervin Barrington Woolley<br />
Performed by <a href="/name/nm0031896?ref_=ttsnd_snd_1">Apache Indian</a><br   />
Courtesy of Island Records Ltd.<br />
Under license from Universal Music Enterprises<br />

in the following form.

If I use the following xpath

//*[@id="soundtracks_content"]/div[2]/div[1]/node()[count(preceding-sibling::br)=1][normalize-space()]

then I expect it to extract one single piece of text, "Written by Apache Indian (as Stephen Kapur) and Ervin Barrington Woolley", but the above expression extracts three separate nodes: "Written by", "Apache Indian" and "(as Stephen Kapur) and Ervin Barrington Woolley". Can you suggest another XPath that would extract a single string from the above HTML? I have been practising my XPath on the URL "http://www.imdb.com/title/tt2096672/soundtrack?ref_=tt_ql_trv_7".

I am using import.io to scrape data through XPath, but I am not allowed to enter the entire XPath; I just enter

node()[count(preceding-sibling::br)=1][normalize-space()]

I have attached a picture of what I am actually doing. Please note that I also need the anchor text.

1 Answer

With XPath 2.0:

string-join(//*[@id="soundtracks_content"]/div[2]/div[1]/node()[count(preceding-sibling::br)=1][normalize-space()], "")
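
If no XPath 2.0 engine is at hand, the same grouping can be checked offline with a few lines of Python. This is only a rough sketch using the standard library's `xml.etree.ElementTree`, not something import.io runs. It assumes the fragment from the question is well-formed once a closing `</div>` is added, and the helper name `segments_between_br` is made up for illustration. It splits the div's content at each direct `<br>` child and joins the string value of everything in between, so the anchor text is kept.

```python
import xml.etree.ElementTree as ET

# Fragment from the question; the closing </div> is added here (an assumption)
# so that the snippet parses as XML.
FRAGMENT = (
    '<div id="sn1058961" class="soundTrack soda odd">Boom Shack-a-Lak<br />\n'
    'Written by <a href="/name/nm0031896?ref_=ttsnd_snd_1">Apache Indian</a> '
    '(as  Stephen Kapur) and Ervin Barrington Woolley<br />\n'
    'Performed by <a href="/name/nm0031896?ref_=ttsnd_snd_1">Apache Indian</a><br   />\n'
    'Courtesy of Island Records Ltd.<br />\n'
    'Under license from Universal Music Enterprises<br />\n'
    '</div>'
)


def segments_between_br(div):
    """Return one whitespace-normalized string per run of content between direct <br> children."""
    segments = [div.text or ""]          # text before the first child element
    for child in div:
        if child.tag == "br":
            segments.append("")          # each <br> starts a new segment
        else:
            # itertext() yields the element's text and its descendants' text,
            # which is how the anchor text ends up in the result
            segments[-1] += "".join(child.itertext())
        segments[-1] += child.tail or ""  # text that follows this child
    # collapse runs of whitespace, similar in spirit to normalize-space()
    return [" ".join(s.split()) for s in segments]


if __name__ == "__main__":
    div = ET.fromstring(FRAGMENT)
    # index 1 is the content with exactly one preceding <br> sibling
    print(segments_between_br(div)[1])
    # prints: Written by Apache Indian (as Stephen Kapur) and Ervin Barrington Woolley
```

Index 1 of the returned list plays the role of `count(preceding-sibling::br)=1`; index 2 would give the "Performed by" line, and so on.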
eLRuLL