How do I sequentially extract English text from different web pages of a Tamil website?

Question

The Naalayira Divya Prabandham is a 4000-verse collection of Hindu poems written in the Tamil language. The website http://dravidaveda.org has a web page for each of the 4000 verses. Each verse page gives the Tamil verse, a word-by-word Tamil commentary on the verse, and an English translation. For instance, here is the web page for verse 1008.

My question is: is there any way I can extract the English translations of all 4000 verses in order, so that I have a complete English translation of the Naalayira Divya Prabandham in a single document? For instance, from the web page I linked to above, I want to extract "Singavel-Kundram is the place where the pure Lord came as a man-lion,-while the world stood awe-struck,-and tore the Asura Hiranya’s chest with his claws. Red eyed lions offer worship by heaping elephant tusks at his feet with reverence." along with the number 1008, and put it in the 1008th place in my document.

So how would I go about doing that? I assume this might require some kind of programming, but I don't have much of a technical background, so can someone tell me what I would need to do? Note that the article IDs, for instance the number 1379 in the URL "dravidaveda.org/index.php?option=com_content&view=article&id=1379&ml=1", don't go sequentially by verse, so that may pose a bit of a problem from a programming standpoint.

Do you know any programming at all? If not, do you want to learn to program for this project? — Joni, May 15 '17 at 07:04
from which menu all the verses can be accessed? @KeshavSrinivasan — Ambrish Pathak, May 15 '17 at 07:24
@AmbrishPathak That's part of the problem, there is no webpage that has links to all 4000 verse pages. But if you start at the homepage and scroll down, you'll see a menu with a bunch of green arrows on the right hand side. If you click one of the options you'll get a bunch of links, and if you click one of those links you'll be taken to a webpage that has links to 10 verses out of the 4000. — Keshav Srinivasan, May 15 '17 at 08:07
@AmbrishPathak But I find the easiest way to navigate to a specific verse page is through Google searching. Like to find verse 1008 I just type site:dravidaveda.org 1008 — Keshav Srinivasan, May 15 '17 at 08:09
So, this can be solved in two parts: the first part would extract all 4000 links to an Excel sheet, and then, using these links, the English part can be extracted one by one using VBA — Ambrish Pathak, May 15 '17 at 08:15
@AmbrishPathak OK, but is there any way to get all 4000 links, short of manually doing it one by one? — Keshav Srinivasan, May 15 '17 at 08:30
Yes, this can be done using VBA: search Google using site:dravidaveda.org 1008, replace 1008 with the loop number, and extract all the links first. I know it is a lengthy process. — Ambrish Pathak, May 15 '17 at 08:34
@AmbrishPathak But how would I put a Google search in a loop? What language can do automated Google searching? — Keshav Srinivasan, May 15 '17 at 08:45
Search and try using VBA (excel macro), i can provide you with the vba code that can pull english text for 1008th link — Ambrish Pathak, May 15 '17 at 09:01
@KeshavSrinivasan If you're going to cross-post something, you should note that you've done so and link to the question on other sites. — ArtOfCode, May 15 '17 at 12:39

Pandya · Answer 1 · 2017-05-16T11:00:46.430

You can do this with programs that dump the content of web pages to the terminal or console, e.g. lynx, w3m, or links. (It is also possible with wget, curl, aria2, etc.) See the manual page of each command for further information.

Here is an example using lynx:

#!/bin/bash
for i in {47..4568}
do
  url="http://dravidaveda.org/index.php?option=com_content&view=article&id=$i&ml=1"
  # Verse number, taken from the first line of the page, e.g. (1008)
  lynx -dump "$url" | head -n 1 >> ndp.txt
  echo -e "\n" >> ndp.txt
  # "English Translation" heading and the 10 lines after it
  lynx -dump "$url" | grep -A 10 'English Translation' >> ndp.txt
  echo -e "\n\n" >> ndp.txt
done

Here {47..4568} expands to 47, 48, ..., 4568 in sequence. (I've found that the Nalayira Divya Prabandham pages can be fetched from this range of ids.)

The first lynx command writes the verse number, e.g. (1008), to the file ndp.txt.

The second lynx command writes the "English Translation" for that verse to ndp.txt.

Hence, with the for loop and the range provided, you'll get the verse numbers with their English translations in the file ndp.txt.
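To see what the two filters keep, the same head and grep -A steps can be tried on saved text without touching the network. This is only a minimal sketch: the function name extract_verse and the sample text are made up for illustration, and the real page layout may differ.

```shell
#!/bin/bash
# extract_verse: hypothetical helper. Reads one page dump on stdin and prints
# the first line (the verse number) followed by the "English Translation" block.
extract_verse() {
  dump=$(cat)                                        # read the whole dump once
  printf '%s\n' "$dump" | head -n 1                  # first line, e.g. (1008)
  printf '%s\n' "$dump" | grep -A 10 'English Translation'
}

# Made-up sample standing in for the output of: lynx -dump "<page url>"
sample='(1008)
Tamil verse text
English Translation
Singavel-Kundram is the place where the pure Lord came as a man-lion'

printf '%s\n' "$sample" | extract_verse
```

On the sample, this prints the (1008) line, the "English Translation" heading, and the line after it; the Tamil line is dropped.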

Note that, as you've mentioned, the page ids don't go sequentially, so it is difficult to predict which ids to skip while coding. Anyway, I think you can easily remove the lines that come from unwanted page ids from ndp.txt once you have it.

However, if you want, you can skip dumping those pages by using a check like:

if [[ $(lynx -dump "http://dravidaveda.org/index.php?option=com_content&view=article&id=$i&ml=1" | head -c 1) = "(" ]]
then
[Your commands here]
fi

Here the expression in the if condition checks whether the first character of the page we're about to dump is "(" or not.
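The check itself can be tried on plain text before running it against the site. A minimal sketch, assuming (as above) that verse pages start with "("; the function name is_verse_page and the two sample inputs are made up for illustration.

```shell
#!/bin/bash
# is_verse_page: hypothetical helper. Succeeds when the text on stdin
# starts with "(", which is how verse pages such as (1008) begin.
is_verse_page() {
  first=$(head -c 1)          # read only the first character
  case "$first" in
    "(") return 0 ;;
    *)   return 1 ;;
  esac
}

# Made-up inputs, used here instead of real lynx output.
if printf '(1008)\nverse text\n' | is_verse_page; then
  echo "verse page"
fi
if printf 'Home\nmenu text\n' | is_verse_page; then
  echo "verse page"
else
  echo "skipped"
fi
```

In the real loop, the input would come from lynx -dump instead of printf.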

So the following script may work, depending on the contents of the web pages:

#!/bin/bash
for i in {47..4568}
do
  url="http://dravidaveda.org/index.php?option=com_content&view=article&id=$i&ml=1"
  # Only dump pages whose first character is "(", i.e. verse pages such as (1008).
  if [[ $(lynx -dump "$url" | head -c 1) = "(" ]]
  then
    # Verse number, taken from the first line of the page
    lynx -dump "$url" | head -n 1 >> ndp.txt
    echo -e "\n" >> ndp.txt
    # "English Translation" heading and the 10 lines after it
    lynx -dump "$url" | grep -A 10 'English Translation' >> ndp.txt
    echo -e "\n\n" >> ndp.txt
  fi
done

I've checked, and the above script runs fine on my PC.

Update/Improvement:

The file ndp.txt has the verses in non-sequential order, since that is the order in which we get them from the website. So, finally, it can be sorted with the following command (thanks to @terdon for the perl code):

perl -ne 'if(/^\((\d+)\)\s*$/){$d=$1;} push @{$k{$d}},$_; END{print "@{$k{$_}}\n" for sort { $a <=> $b} keys(%k)} ' ndp.txt
