
I want to scrape a .mhtml file using bash. Originally I only used curl+xidel to scrape the HTML file, but now the site has "something" that prevents me from scraping.

This is some of the content:

QuoteStrip-watchLiveLink">LIVE<img src=3D"https://static-redesign.cnbcfm.co=
m/dist/4db8932b7ac3e84e3f64.svg" alt=3D"Watch live logo" class=3D"QuoteStri=
p-watchLiveLogo"></a><a href=3D"https://www.cnbc.com/live-tv/" style=3D"col=
or: rgb(0, 47, 108);">SHARK TANK</a></div></div></div><div class=3D"QuoteSt=
rip-quoteStripSubHeader"><span>RT Quote</span><span> | <!-- -->Exchange</sp=
an><span> | <!-- -->USD</span></div><div class=3D"QuoteStrip-dataContainer"=
><div class=3D"QuoteStrip-lastTimeAndPriceContainer"><div class=3D"QuoteStr=
ip-lastTradeTime">Last | 11:46 PM EDT</div><div class=3D"QuoteStrip-lastPri=
ceStripContainer"><span class=3D"QuoteStrip-lastPrice">1,621.41</span><span=
 class=3D"QuoteStrip-changeDown"><img class=3D"QuoteStrip-changeIcon" src=
=3D"https://static-redesign.cnbcfm.com/dist/4ee243ff052e81044388.svg" alt=
=3D"quote price arrow down"><span>-6.2537</span><span> (<!-- -->-0.3842%<!-=
- -->)</span></span></div></div></div></div><div class=3D"PhoenixChartWrapp=

question: How can I get only 1,621.41 as output in bash?

My regular program:

#!/bin/bash
curl -s -o ~/Desktop/xau.html -- https://www.cnbc.com/quotes/XAU=
gold=$(xidel -se /html/body/div[2]/div/div[1]/div[3]/div/div[2]/div[1]/div[2]/div[3]/div/div[2]/span[1] ~/Desktop/xau.html | sed 's/\,//g')
echo $gold
exit 0

output: some numbers

foopeen
  • How do you visually identify these numbers inside the text? Do they always come after `lastPrice">` and before `</span>`? – user1934428 Oct 21 '22 at 06:30
  • when I'm using XPath, it always comes after the lastPrice tag, and it's something like `/html/body/div[2]/div/div[1]/div[3]/div/div[2]/div[1]/div[2]/div[3]/div/div[2]/span[1]` for the full path, or `//*[@id="quote-page-strip"]/div[3]/div/div[2]/span[1]`, something like that. From there, that's how I search for the value – foopeen Oct 21 '22 at 06:51
  • So your question boils down to: How to extract text between two pieces of other (fixed) text. Is this correct? I ask, because _something like this_ is a somewhat vague problem definition. – user1934428 Oct 21 '22 at 06:54
  • I'm sorry, but the lines of code I provided are just an excerpt of the whole; I just want to keep it clean. – foopeen Oct 21 '22 at 06:54
  • @user1934428 kinda yes. Before, I thought there were other random texts around those "fixed" texts, so yeah, that's why I changed the question. – foopeen Oct 21 '22 at 06:57
  • Without giving an exact definition how the numbers to be looked up are delimited, you can not expect to find an algorithm. Since you are already using xpath now, can't you extract it with xpath itself? – user1934428 Oct 21 '22 at 06:58
  • I said in my intro that "something" prevents me from using the XPath extraction; I think it's something coming from the Cloudflare anti-scraping machine. – foopeen Oct 21 '22 at 07:04
  • But you can download the document, and you have installed xpath locally. If the file is valid XML, what should prevent you using it? OTOH, if it is html, you could use a parser for HTML ... perhaps via Node.js or Perl? – user1934428 Oct 21 '22 at 07:09
  • I don't know. It just goes blank when I open the file; I don't know what's wrong. When I try to extract it, it returns null in bash. Yesterday it was still fine; it returned some numbers. – foopeen Oct 21 '22 at 07:41
  • If the file is blank, there is not much you can extract of course. However, what do you mean by _opening_ the file? I would first of all look at the file with something like `xxd` to get an idea what you have. Could it be that your real problem is not getting information from a certain file, but getting the file itself in a certain format reproducibly? – user1934428 Oct 21 '22 at 07:45
  • I opened the html file in a browser; the file should open just fine, even without internet. It opens up showing the content for 1 second and then goes to a "blank white canvas" – foopeen Oct 21 '22 at 08:03
  • it can be opened without problem in TextPad, but it's just text there, not HTML that can be processed using xidel. – foopeen Oct 21 '22 at 08:05
  • I guess _textpad_ is a text editor you are using. So, what does the text look like then? – user1934428 Oct 21 '22 at 08:10
  • @user1934428 let me clarify this again, my regular program goes like this: `#!/bin/bash curl -s -o ~/Desktop/xau.html -- https://www.cnbc.com/quotes/XAU= gold=$(xidel -se "/html/body/div[2]/div/div[1]/div[3]/div/div[2]/div[1]/div[2]/div[3]/div/div[2]/span[1] ~/Desktop/xau.html" | sed 's/\,//g') echo $gold exit 0`. You are welcome to try – foopeen Oct 21 '22 at 08:40

2 Answers


I only use curl+xidel to scrape the html file

xidel can open URLs without a problem, so there's no need for curl.

/html/body/div[2]/div/div[1]/div[3]/div/div[2]/div[1]/div[2]/div[3]/div/div[2]/span[1]
                                            ^

This particular `div[2]` doesn't exist; there's only one `div` at that level. So this should work:

$ xidel -s "https://www.cnbc.com/quotes/XAU=" -e '
  /html/body/div[2]/div/div[1]/div[3]/div/div/div[1]/div[2]/div[3]/div/div[2]/span[1]
'

Also please be sure to quote the extraction-query. This will prevent situations where you'd otherwise have to escape lots of characters.
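A quick way to see why quoting matters (the file name below is a made-up example): an unquoted query containing `[` and `]` is a glob pattern to the shell and can be rewritten before xidel ever sees it:

```shell
# In a scratch directory, create a file whose name matches the glob div[2].
cd "$(mktemp -d)"
touch div2

echo div[2]     # the shell glob-expands this to the matching file name: div2
echo 'div[2]'   # quoted, the argument reaches the command verbatim: div[2]
```

With single quotes around the XPath expression, none of `[`, `]`, `$`, `*` or spaces need escaping.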

The website's HTML-source is minified. To have a better overview of all the HTML element-nodes I suggest you prettify the source again:

$ xidel -s "https://www.cnbc.com/quotes/XAU=" -e . \
  --output-format=html --output-node-indent > ~/Desktop/xau.html

That way you can see that the query can be simplified to:

$ xidel -s "https://www.cnbc.com/quotes/XAU=" -e '
  //span[@class="QuoteStrip-lastPrice"]
'

Or alternatively from one of the JSONs in the <head>-node:

$ xidel -s "https://www.cnbc.com/quotes/XAU=" -e '
  parse-json(//script[@type="application/ld+json"][2])/price
'
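If xidel isn't available, the same idea can be sketched with plain grep/sed/cut. This assumes the page has been saved locally as `xau.html` (plain HTML), that the ld+json blocks are minified onto single lines, and that `price` appears as a quoted string; these are assumptions about the page, not guarantees:

```shell
# Pull every <script type="application/ld+json"> payload (one match per line),
# keep the second one, and cut out the quoted "price" value.
grep -o '<script type="application/ld+json">[^<]*</script>' xau.html |
  sed -n '2p' |
  grep -o '"price":"[^"]*"' |
  cut -d'"' -f4
```

This is more brittle than a real parser; if the site changes its markup or emits unquoted numbers, the xidel/`parse-json` approach above degrades more gracefully.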
Reino
  • about the prettify thing, `Error Unknown option: output-node-indent (when reading argument: output-node-indent) The following command line options are valid: --data= Data/URL/File/Stdin(-) to process (--data= prefix can be omitted) bla bla bla` – foopeen Oct 22 '22 at 02:57
  • I actually used only `xidel` before. I ran it every x secs, but sometimes it caused problems and stopped halfway, so I had to ctrl+c it and re-run the bash script; that's why I prefer downloading the page and processing it locally. Btw, about `This particular div doesn't exist. There's only one. So this should work:` how did you find this out? I copied the XPath directly from Chrome's Inspect; are you saying it cheated me? – foopeen Oct 22 '22 at 03:05
  • in truth, this is what I'm looking for, but I still have some questions about why it suddenly doesn't work "my" way; I'll put it in another thread. I want to give "useful" credit on your answer but I still can't do that, sorry. – foopeen Oct 22 '22 at 05:28
  • @foopeen I found it out by prettifying the HTML-source, as mentioned above. I don't use Chrome, so I can't comment on that. About the error-message: please [update](https://videlibri.sourceforge.net/xidel.html#downloads) to a binary from the v0.9.9 development branch. – Reino Oct 22 '22 at 19:52
  • since you are the one who solved this problem, I think you should go [here](https://stackoverflow.com/questions/74161494/whats-the-difference-between-this-two-full-xpath) and post your solution; also please state which tool you used to find out that problem. Thank you – foopeen Oct 23 '22 at 02:56
  • @foopeen I thought it was obvious that the only tool I use is `xidel`. Again, I don't use Chrome, so I can't comment on that or post an answer. Btw, you say I solved your problem, yet you accepted the dubious `sed` approach... – Reino Oct 23 '22 at 09:32
  • you don't have to use Chrome to be able to know the XPath of the HTML file; that "Chrome Elements Inspect" is really just a tool. You can open any HTML file using any simple code editor (Sublime Text for example). It's not obvious because it seems hidden; the only thing I see from the structure is that the price tag is in the div[2] tag. As for the accepted answer, the original question was about extracting information from an "mhtml" file, which contains a messy structure and weighs a lot more than just a regular "html" file; if I accepted your answer, that would confuse people. – foopeen Oct 24 '22 at 02:09
  • again, if you want me to accept your answer, I'll have to change the whole question, including the title. Sorry, it's the best I can think of. – foopeen Oct 24 '22 at 02:11
  • you mentioned the html prettify before, so I tried it that way: pasted the html file I downloaded, prettified it, and opened it using Sublime Text. I found out that the div[1] which is supposed to contain `...` is not there; instead it became the place of `...`. Hmm... it becomes a problem now; the question now changes to "why did it disappear?" – foopeen Oct 24 '22 at 04:00
  • you are welcome to answer the [post](https://stackoverflow.com/questions/74176667/why-is-the-web-page-html-element-absolute-xpath-dissapeared-when-downloaded-as) if you know what's going on – foopeen Oct 24 '22 at 04:20

One difficulty is that the MHTML body is quoted-printable encoded, so lines can be broken almost anywhere with a soft line break (a trailing `=` before the newline, i.e. `=\n`). First join the lines, then extract what you are looking for:

$ sed -En ':a
s/=\n//g;s!.*<span class=3D"QuoteStrip-lastPrice">([^<]*)</span>.*!\1!p;tb
N;ba
:b
q' file
1,621.41

Or, with GNU sed and its -z option (which reads NUL-separated records, so here the whole file ends up in the pattern space and the embedded newlines can be matched):

$ sed -Ez 's!=\n!!g;s!.*<span class=3D"QuoteStrip-lastPrice">([^<]*)</span>.*!\1!' file
1,621.41
Renaud Pacalet