
I want to scrape a .mhtml file using bash. Originally I only used curl+xidel to scrape the HTML file, but now the site has "something" that prevents me from scraping.

This is some of the content:

QuoteStrip-watchLiveLink">LIVE<img src=3D"https://static-redesign.cnbcfm.co=
m/dist/4db8932b7ac3e84e3f64.svg" alt=3D"Watch live logo" class=3D"QuoteStri=
p-watchLiveLogo"></a><a href=3D"https://www.cnbc.com/live-tv/" style=3D"col=
or: rgb(0, 47, 108);">SHARK TANK</a></div></div></div><div class=3D"QuoteSt=
rip-quoteStripSubHeader"><span>RT Quote</span><span> | <!-- -->Exchange</sp=
an><span> | <!-- -->USD</span></div><div class=3D"QuoteStrip-dataContainer"=
><div class=3D"QuoteStrip-lastTimeAndPriceContainer"><div class=3D"QuoteStr=
ip-lastTradeTime">Last | 11:46 PM EDT</div><div class=3D"QuoteStrip-lastPri=
ceStripContainer"><span class=3D"QuoteStrip-lastPrice">1,621.41</span><span=
 class=3D"QuoteStrip-changeDown"><img class=3D"QuoteStrip-changeIcon" src=
=3D"https://static-redesign.cnbcfm.com/dist/4ee243ff052e81044388.svg" alt=
=3D"quote price arrow down"><span>-6.2537</span><span> (<!-- -->-0.3842%<!-=
- -->)</span></span></div></div></div></div><div class=3D"PhoenixChartWrapp=

question: How can I get only 1,621.41 as output in bash?

My regular program:

#!/bin/bash
curl -s -o ~/Desktop/xau.html -- https://www.cnbc.com/quotes/XAU=
gold=$(xidel -se /html/body/div[2]/div/div[1]/div[3]/div/div[2]/div[1]/div[2]/div[3]/div/div[2]/span[1] ~/Desktop/xau.html | sed 's/\,//g')
echo $gold
exit 0

output: some numbers

foopeen
  • How do you visually identify these numbers inside the text? Do they always come after `lastPrice">` and before `</span>`? – user1934428 Oct 21 '22 at 06:30
  • when I'm using XPath, it always comes after the lastPrice tag, and it's something like `/html/body/div[2]/div/div[1]/div[3]/div/div[2]/div[1]/div[2]/div[3]/div/div[2]/span[1]` for the full path, or `//*[@id="quote-page-strip"]/div[3]/div/div[2]/span[1]`, something like that. From there, that's how I search for the value – foopeen Oct 21 '22 at 06:51
  • So your question boils down to: How to extract text between two pieces of other (fixed) text. Is this correct? I ask, because _something like this_ is a somewhat vague problem definition. – user1934428 Oct 21 '22 at 06:54
  • I'm sorry, but the lines of code I provided are just an excerpt of the whole; I just want to keep it clean. – foopeen Oct 21 '22 at 06:54
  • @user1934428 kinda yes. Before, I thought there were other random texts around those "fixed" texts, so yeah, that's why I changed the question. – foopeen Oct 21 '22 at 06:57
  • Without giving an exact definition how the numbers to be looked up are delimited, you can not expect to find an algorithm. Since you are already using xpath now, can't you extract it with xpath itself? – user1934428 Oct 21 '22 at 06:58
  • I said in my intro that "something" prevents me from using the XPath extraction; I think it's something coming from the Cloudflare anti-scraping machine. – foopeen Oct 21 '22 at 07:04
  • But you can download the document, and you have installed xpath locally. If the file is valid XML, what should prevent you using it? OTOH, if it is html, you could use a parser for HTML ... perhaps via Node.js or Perl? – user1934428 Oct 21 '22 at 07:09
  • I don't know. It just goes blank when I open the file; I don't know what's wrong. When I try to extract it, it returns null in bash. Yesterday it was still fine; it returned some numbers. – foopeen Oct 21 '22 at 07:41
  • If the file is blank, there is not much you can extract of course. However, what do you mean by _opening_ the file? I would first of all look at the file with something like `xxd` to get an idea what you have. Could it be that your real problem is not getting information from a certain file, but getting the file itself in a certain format reproducibly? – user1934428 Oct 21 '22 at 07:45
  • I opened the html file in a browser; the file should open just fine, even without internet. It opens up showing the content for 1 second and then goes to a "blank white canvas" – foopeen Oct 21 '22 at 08:03
  • it can be opened without problem in TextPad, but it's just text there, not HTML that can be processed using xidel. – foopeen Oct 21 '22 at 08:05
  • I guess _textpad_ is a text editor you are using. So, what does the text look like then? – user1934428 Oct 21 '22 at 08:10
  • @user1934428 let me clarify this again, my regular program goes like this: `#!/bin/bash curl -s -o ~/Desktop/xau.html -- https://www.cnbc.com/quotes/XAU= gold=$(xidel -se "/html/body/div[2]/div/div[1]/div[3]/div/div[2]/div[1]/div[2]/div[3]/div/div[2]/span[1] ~/Desktop/xau.html" | sed 's/\,//g') echo $gold exit 0`. You are welcome to try – foopeen Oct 21 '22 at 08:40

2 Answers


I only use curl+xidel to scrape the html file

xidel can open URLs without a problem, so there's no need for curl.

/html/body/div[2]/div/div[1]/div[3]/div/div[2]/div[1]/div[2]/div[3]/div/div[2]/span[1]
                                            ^

This particular `div[2]` doesn't exist; there's only one `div` at that level. So this should work:

$ xidel -s "https://www.cnbc.com/quotes/XAU=" -e '
  /html/body/div[2]/div/div[1]/div[3]/div/div/div[1]/div[2]/div[3]/div/div[2]/span[1]
'

Also please be sure to quote the extraction-query. This will prevent situations where you'd otherwise have to escape lots of characters.
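A quick way to see why quoting matters (the file name below is a made-up example): an unquoted query containing `[` and `]` is a glob pattern to the shell and can be rewritten before xidel ever sees it:

```shell
# In a scratch directory, create a file whose name matches the glob div[2].
cd "$(mktemp -d)"
touch div2

echo div[2]     # the shell glob-expands this to the matching file name: div2
echo 'div[2]'   # quoted, the argument reaches the command verbatim: div[2]
```

With single quotes around the XPath expression, none of `[`, `]`, `$`, `*` or spaces need escaping.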

The website's HTML-source is minified. To have a better overview of all the HTML element-nodes I suggest you prettify the source again:

$ xidel -s "https://www.cnbc.com/quotes/XAU=" -e . \
  --output-format=html --output-node-indent > ~/Desktop/xau.html

That way you can see that the query can be simplified to:

$ xidel -s "https://www.cnbc.com/quotes/XAU=" -e '
  //span[@class="QuoteStrip-lastPrice"]
'

Or alternatively from one of the JSONs in the <head>-node:

$ xidel -s "https://www.cnbc.com/quotes/XAU=" -e '
  parse-json(//script[@type="application/ld+json"][2])/price
'
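If xidel isn't available, the same idea can be sketched with plain grep/sed/cut. This assumes the page has been saved locally as `xau.html` (plain HTML), that the ld+json blocks are minified onto single lines, and that `price` appears as a quoted string; these are assumptions about the page, not guarantees:

```shell
# Pull every <script type="application/ld+json"> payload (one match per line),
# keep the second one, and cut out the quoted "price" value.
grep -o '<script type="application/ld+json">[^<]*</script>' xau.html |
  sed -n '2p' |
  grep -o '"price":"[^"]*"' |
  cut -d'"' -f4
```

This is more brittle than a real parser; if the site changes its markup or emits unquoted numbers, the xidel/`parse-json` approach above degrades more gracefully.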
Reino
  • about the prettify thing, `Error Unknown option: output-node-indent (when reading argument: output-node-indent) The following command line options are valid: --data= Data/URL/File/Stdin(-) to process (--data= prefix can be omitted) bla bla bla` – foopeen Oct 22 '22 at 02:57
  • I actually used only `xidel` before. I ran it every x secs, but sometimes it caused problems and stopped halfway, so I had to ctrl+c it and re-run the bash script; that's why I prefer downloading the page and processing it locally. Btw, about `This particular div doesn't exist. There's only one. So this should work:` how did you find this out? I copied the XPath directly from Chrome's Inspect; are you saying it cheated me? – foopeen Oct 22 '22 at 03:05
  • in truth, this is what I'm looking for, but I still have some questions about why it suddenly doesn't work "my" way; I'll put it in another thread. I want to give "useful" credit on your answer but I still can't do that, sorry. – foopeen Oct 22 '22 at 05:28
  • @foopeen I found it out by prettifying the HTML-source, as mentioned above. I don't use Chrome, so I can't comment on that. About the error-message: please [update](https://videlibri.sourceforge.net/xidel.html#downloads) to a binary from the v0.9.9 development branch. – Reino Oct 22 '22 at 19:52
  • since you are the one who solved this problem, I think you should go [here](https://stackoverflow.com/questions/74161494/whats-the-difference-between-this-two-full-xpath) and post your solution; also please state which tool you used to find out that problem. Thank you – foopeen Oct 23 '22 at 02:56
  • @foopeen I thought it was obvious that the only tool I use is `xidel`. Again, I don't use Chrome, so I can't comment on that or post an answer. Btw, you say I solved your problem, yet you accepted the dubious `sed` approach... – Reino Oct 23 '22 at 09:32
  • you don't have to use Chrome to be able to know the XPath of the HTML file; that "Chrome Elements Inspect" is really just a tool. You can open any HTML file using any simple code editor (Sublime Text for example). It's not obvious because it seems hidden; the only thing I see from the structure is that the price tag is in the div[2] tag. As for the accepted answer, the original question was about extracting information from an "mhtml" file, which contains a messy structure and weighs a lot more than just a regular "html" file; if I accepted your answer, that would confuse people. – foopeen Oct 24 '22 at 02:09
  • again, if you want me to accept your answer, I'll have to change the whole question, including the title. Sorry, it's the best I can think of. – foopeen Oct 24 '22 at 02:11
  • you mentioned the html prettify before, so I tried it that way: pasted the html file I downloaded, prettified it, and opened it using Sublime Text. I found out that the div[1] which is supposed to contain `...` is not there; instead it became the place of `...`. Hmm... it becomes a problem now; the question now changes to "why did it disappear?" – foopeen Oct 24 '22 at 04:00
  • you are welcome to answer the [post](https://stackoverflow.com/questions/74176667/why-is-the-web-page-html-element-absolute-xpath-dissapeared-when-downloaded-as) if you know what's going on – foopeen Oct 24 '22 at 04:20

One difficulty is that the MHTML body is quoted-printable encoded, so lines can be broken almost anywhere with a soft line break (a trailing `=` before the newline, i.e. `=\n`). First join the lines, then extract what you are looking for:

$ sed -En ':a
s/=\n//g;s!.*<span class=3D"QuoteStrip-lastPrice">([^<]*)</span>.*!\1!p;tb
N;ba
:b
q' file
1,621.41

Or, with GNU sed and its -z option (which reads NUL-separated records, so here the whole file ends up in the pattern space and the embedded newlines can be matched):

$ sed -Ez 's!=\n!!g;s!.*<span class=3D"QuoteStrip-lastPrice">([^<]*)</span>.*!\1!' file
1,621.41
Renaud Pacalet