0

I am trying to parse an HTML page using XPath with xidel. The page has a table with multiple rows and columns. I need to get the values from columns 2 and 5 (IP and port) of each row and store them in a CSV-like file. Here is my script:

#!/bin/bash
for (( i = 2; i <= 100; i++ ))
do
xidel http://www.vpngate.net/en/ -e '//*[@id="vg_hosts_table_id"]/tbody/tr["'$i'"]/td[2]/span[1]' >> "$i".txt #get value from first column
xidel http://www.vpngate.net/en/ -e '//*[@id="vg_hosts_table_id"]/tbody/tr["'$i'"]/td[5]' >> "$i".txt #get value from second column
sed -i ':a;N;$!ba;s/\n/^/g' "$i".txt #replace newline with custom delimiter
sed -i '/\s/d' "$i".txt #remove blanks
cat "$i".txt >> ip_port_list #create list
zip -m ips.zip "$i".txt #archive unneeded texts
done

Performance is not an issue. When I manually increment the tr index, the output looks perfect, but not when the index comes from the loop variable. I want to get a pair of values from each row; instead I get only partial data or even an empty file.

Exabyte
  • Welcome to StackOverflow! Could you edit your question and add some information on what exactly the problem is (error message, output differs from expected output)? – Michael Jaros Mar 08 '15 at 18:28
  • Fetching the same page 198 times seems incredibly wasteful, anyway. If `xidel` doesn't allow you to extract the fields you want in one go, perhaps consider switching to a different tool. – tripleee Mar 09 '15 at 12:35
  • [The template language](http://benibela.de/documentation/internettools/extendedhtmlparser.THtmlTemplateParser.html) looks vaguely like an ad-hoc reinvention of XSLT. Not necessarily worse, but I would go with the standard tool. – tripleee Mar 09 '15 at 13:48
  • @tripleee - it is just an example; I download the page only once :) The first line before the loop is wget page -O page.html, and then I pass that file name to xidel. – Exabyte Mar 09 '15 at 20:07
  • And thanks for the template language link - I'll read about it. – Exabyte Mar 09 '15 at 20:10

2 Answers

2

I need to get values from each row from columns 2 and 5 (IP and port) and store them in csv-like file.

xidel -s "https://www.vpngate.net/en/" -e '
  (//table[@id="vg_hosts_table_id"])[3]//tr[not(td[@class="vg_table_header"])]/concat(
    td[2]/span[@style="font-size: 10pt;"],
    ",",
    extract(
      td[5],
      "TCP: (\d+)",
      1
    )
  )
'
220.218.70.177,443
211.58.36.54,995
1.239.223.190,1351
[...]
153.207.18.229,1542
  • (//table[@id="vg_hosts_table_id"])[3]: Select the 3rd table of its kind. The one you want.
  • //tr[not(td[@class="vg_table_header"])]: Select all rows, except the headers.
  • td[2]/span[@style="font-size: 10pt;"]: Select the 2nd column and the <span> that contains just the IP-address.
  • extract(td[5],"TCP: (\d+)",1): Select the 5th column and extract (regex) the numerical value after "TCP: ".
Reino
0

Maybe this xidel line will come in handy:

xidel -q http://www.vpngate.net/en/ -e '//*[@id="vg_hosts_table_id"]/tbody/tr[*]/concat(td[2]/span[1],",",substring-after(substring-before(td[5],"UDP:"),"TCP: "))'

This will only do one fetch (so the admins of vpngate won't block you) and it'll also create a CSV output (ip,port)... Hopefully that is what you were looking for?
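
One difference between this and the regex-based extract() approach may matter: substring-before() returns an empty string when its search text is missing, so a row without a "UDP:" part would yield an empty port. The sketch below mimics both strategies in plain shell to show that edge case. The helper names and the two sample cell strings are hypothetical, not taken from the live page.

```shell
# Shell stand-ins for the two XPath strategies (hypothetical helper names).

# Like XPath substring-before(): text before the first match, "" if absent.
substring_before() {
  case $1 in
    *"$2"*) printf '%s' "${1%%"$2"*}" ;;
    *) : ;;
  esac
}

# Like XPath substring-after(): text after the first match, "" if absent.
substring_after() {
  case $1 in
    *"$2"*) printf '%s' "${1#*"$2"}" ;;
    *) : ;;
  esac
}

# Strategy from this answer: cut at "UDP:", then take what follows "TCP: ".
port_by_substring() {
  substring_after "$(substring_before "$1" 'UDP:')" 'TCP: '
}

# Strategy from the other answer: capture the digits that follow "TCP: ".
port_by_regex() {
  printf '%s\n' "$1" | sed -n 's/.*TCP: \([0-9][0-9]*\).*/\1/p'
}

# Assumed cell contents, for illustration only.
cell_both='TCP: 443UDP: 1195'
cell_tcp_only='TCP: 995'

echo "substring, both ports: [$(port_by_substring "$cell_both")]"
echo "substring, TCP only:   [$(port_by_substring "$cell_tcp_only")]"
echo "regex, both ports:     [$(port_by_regex "$cell_both")]"
echo "regex, TCP only:       [$(port_by_regex "$cell_tcp_only")]"
```

If every row on the page lists both protocols, the two strategies give the same result and the choice is a matter of taste.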

MatrixView
  • Yes, this code did the job. This is what I was looking for :) Thank you. But my question is still unresolved: how do I pass a number from a variable into tr[*]? For example, I need only even rows, or rows 5, 8, 14, 21 (numbers from an array). – Exabyte Mar 25 '15 at 21:14
  • You mean something like this (Windows - batch): FOR %A IN (5,8,14,21) DO xidel -q http://www.vpngate.net/en/ -e '//*[@id="vg_hosts_table_id"]/tbody/tr[%A]/concat(td[2]/span[1],",",substring-after(substring-before(td[5],"UDP:"),"TCP: " ))' This works, however, it will call xidel 4 times. – MatrixView Mar 30 '15 at 06:53
  • Thank you. I can use this piece on Windows :) But I am looking for a Linux version. It is all about variable evaluation. I am trying to sort this out, but no luck yet. – Exabyte Mar 31 '15 at 01:53
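
On the variable-evaluation question for Linux: in the original script, tr["'$i'"] expands to tr["2"], which is a string predicate rather than a positional one. A non-empty string in an XPath predicate is simply true, so every row matches, which is probably why the loop output differs from a hand-typed tr[2]. Below is a minimal sketch of building the expression so the number arrives unquoted, plus a variant that selects several rows in one call. The function names are hypothetical, and the multi-row form assumes xidel evaluates XPath 2.0 sequence comparisons.

```shell
# Hypothetical helpers; only the XPath strings mirror the thread.

# XPath for the IP cell of one row. %d forces a bare number, so the
# predicate is positional: tr[5], not tr["5"].
row_xpath() {
  printf '//*[@id="vg_hosts_table_id"]/tbody/tr[%d]/td[2]/span[1]' "$1"
}

# XPath selecting several rows at once, e.g. tr[position() = (5,8,14,21)].
# The arguments are joined with commas inside a subshell, so the caller's
# IFS is left untouched.
rows_xpath() {
  list=$(IFS=,; printf '%s' "$*")
  printf '//*[@id="vg_hosts_table_id"]/tbody/tr[position() = (%s)]/td[2]/span[1]' "$list"
}

i=5
# Quoting as in the question: the double quotes end up inside the XPath.
broken='//*[@id="vg_hosts_table_id"]/tbody/tr["'$i'"]/td[2]/span[1]'

echo "question's form: $broken"
echo "single row:      $(row_xpath "$i")"
echo "several rows:    $(rows_xpath 5 8 14 21)"

# Intended use (not run here); page.html is the local copy fetched once
# with wget, as described in the comments under the question:
#   xidel -s page.html -e "$(rows_xpath 5 8 14 21)"
```

For even rows only, the same idea should work with a predicate such as tr[position() mod 2 = 0], which needs no shell variable at all.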