Is there an alternate way of grabbing text from a site that provides no API?

Question

We have a bot for Slack that will take input such as:

bible John 3:17 (ESV)

This will transform into a call to

https://www.biblegateway.com/passage/?search=John+3:17&version=ESV
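As a rough illustration of that transformation, here is a minimal Python sketch. The regex, the `build_url` helper, and the fallback to ESV when no version is given are assumptions for the example, not how the bot is actually written.

```python
import re
from urllib.parse import urlencode

BASE_URL = "https://www.biblegateway.com/passage/"
DEFAULT_VERSION = "ESV"  # assumption: fallback when no "(VERSION)" is given

# "bible <reference> (<version>)", with the version part optional.
COMMAND_RE = re.compile(
    r"^bible\s+(?P<ref>.+?)\s*(?:\((?P<version>[^)]+)\))?\s*$"
)


def build_url(command):
    """Turn e.g. 'bible John 3:17 (ESV)' into a Bible Gateway passage URL."""
    match = COMMAND_RE.match(command.strip())
    if match is None:
        raise ValueError(f"not a bible command: {command!r}")
    params = {
        "search": match.group("ref"),
        "version": match.group("version") or DEFAULT_VERSION,
    }
    # safe=":" keeps the colon in "3:17" unescaped, matching the URL above.
    return BASE_URL + "?" + urlencode(params, safe=":")


print(build_url("bible John 3:17 (ESV)"))
```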

So, what we've done for now is to just grep out the og:description meta tag, e.g., for the above, we'd run the following and get:

curl 'https://www.biblegateway.com/passage/?search=John+3:17&version=ESV' | grep "og:description" | sed 's/.*content="//' | sed 's/".*//'

For God did not send his Son into the world to condemn the world, but in order that the world might be saved through him.

This works great for small requests, e.g., bible John 3:1-4. However, if we request larger sections, the description field is truncated at a certain point. So if we were to do bible John 3, it would only return the first 5 or so verses of John 3.

Is there a better way to go about this than the curl above? The only other place in the response that contains the full text is the raw HTML, e.g.:

<h1 class="passage-display"> <span class="passage-display-bcv">John 3</span><span class="passage-display-version">English Standard Version (ESV)</span></h1> [ ... etc. etc. ... ]

Should we be looking at something other than just http requests for this?

web scraping with any language you like, perhaps Java and JSoup or something pythonic ... — Marged, Jan 21 '16 at 21:17

score 1 · Answer 1 · answered Jan 21 '16 at 21:28

If you want to stick with a one-liner but have more precision about what you scrape, you can try the Mojolicious Perl project. Here's an example:

perl -Mojo -E 'say g("mojolicious.org")->dom->at("title")->text'

That would parse out the text from the <title> tag. For tasks too complex for one line, see the Mojo web scraping example.

Installing Mojolicious is easy:

curl -L https://cpanmin.us | perl - -M https://cpan.metacpan.org -n Mojolicious

Even if you don't know Perl, you may well be able to piece together what you need for your scraping job, as the syntax for the DOM traversal may be familiar if you've used jQuery.
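If Perl is not an option, the same idea (select an element, then take all the text nested inside it) can be sketched with Python's standard-library HTML parser. This is only a sketch: the `passage-display-bcv` class comes from the HTML in the question, while the `passage-text` container, the sample markup, and the helper names are assumptions made up for the example. It also assumes well-formed markup where every non-void tag is closed.

```python
from html.parser import HTMLParser

# Void elements never get a closing tag, so they must not affect nesting depth.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class ClassTextExtractor(HTMLParser):
    """Collect all text nested inside elements that carry a given class."""

    def __init__(self, target_class):
        super().__init__(convert_charrefs=True)
        self.target_class = target_class
        self.depth = 0      # > 0 while we are inside a matching element
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        classes = (dict(attrs).get("class") or "").split()
        if self.depth > 0 or self.target_class in classes:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag in VOID_TAGS:
            return
        if self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth > 0:
            self.chunks.append(data)


def text_for_class(html, target_class):
    parser = ClassTextExtractor(target_class)
    parser.feed(html)
    parser.close()
    # Collapse runs of whitespace left over from the markup layout.
    return " ".join("".join(parser.chunks).split())


# Hypothetical sample: the <h1> is from the question, the <div> is invented.
sample = (
    '<h1 class="passage-display"> '
    '<span class="passage-display-bcv">John 3</span>'
    '<span class="passage-display-version">English Standard Version (ESV)</span>'
    '</h1>'
    '<div class="passage-text"><p><span class="text">First <i>nested</i> verse.</span> '
    '<span class="text">Second verse.</span></p></div>'
)

print(text_for_class(sample, "passage-display-bcv"))  # John 3
print(text_for_class(sample, "passage-text"))
```

Because the extractor tracks nesting depth rather than a single tag, it picks up text from every child element of the container, which is the "all the tags I want" case. A real page would need its actual class names checked in the browser's inspector first.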

So what I'm confused about with scraping specific tags is that this site seems somewhat complex in its tag encapsulation, so I'm not sure how to grab the tag I want (or I guess, *all* the tags I want): http://i.imgur.com/nNLJYSr.png — MrDuk, Jan 21 '16 at 21:30

score 0 · Answer 2 · answered Sep 24 '16 at 02:33

The CLI for scripture_lookup is very fast and easy to use.

It provides a clean interface to common scripture providers, such as Crossway's ESV and Bible Gateway.

The current (default) provider is BibleGatewayScraper, which pulls scripture from Bible Gateway.

https://github.com/wrightling/scripture_lookup
