I am using Xidel to scrape information from a webpage, and I am stuck on exporting the information in a different order than it appears on the page.

Example:

<tr>
<td></td>
<td></td>
<td></td>
<td><a><font><b>{ location:=. }</b></font>{ title:=. }</a></td>
<td>{ dates:=. }</td>
<td></td>
</tr>

This code exports the title first, and then the subtitle. Is there any way in Xidel to change the order?

dirkk
Jirka Matousek

2 Answers

This may be as easy as:

xidel -q page.html -e subtitle:=//h2,title:=//h1

Something like the following (with several "-e" params) would also work, but like the previous code it will first group all subtitles and then all titles on the page, which is probably not what you want...

xidel -q page.html -e "<div><h2>{subtitle:=.}</h2></div>+" -e "<div><h1>{title:=.}</h1></div>+" 

AFAIK, there's no ordering feature in Xidel for your case. What you CAN do is write a script that saves the values as environment variables using xidel's --output-format cmd (on Windows) and then echoes or processes those variables in the order you want.

Dirkk has given a great tip (to not group). With that, your line could look something like this:

xidel -q page.html --xquery "for $i in //div return (concat('sub:=',$i/h2), concat('title:=',$i/h1))"
MatrixView

I have never used this tool, but given a quick look at the documentation and seeing that it supports XQuery, the following should work I guess:

xidel -q page.html --xquery "for $div in //div return ($div/h2, $div/h1)" --output-format xml 

This assumes you have several such div elements in your page and want each subtitle placed before its own title, i.e. not all subtitles first. Also, as you have not given a more specific example XML, it simply selects all divs and iterates over them. In real-world HTML you probably want to select on more characteristic features (like id attributes).
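To make the difference between grouped and per-div output concrete outside of Xidel, here is a minimal Python sketch using only the standard library. The sample page and the function names `grouped` and `per_div` are made up for illustration; it mirrors the idea of the XQuery `for` loop above rather than Xidel's own behaviour.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample page: two <div> blocks, each with a title and a subtitle.
PAGE = """
<html><body>
  <div><h1>Title A</h1><h2>Subtitle A</h2></div>
  <div><h1>Title B</h1><h2>Subtitle B</h2></div>
</body></html>
"""

root = ET.fromstring(PAGE)


def grouped(root):
    """Collect every subtitle on the page first, then every title."""
    subtitles = [e.text for e in root.iter("h2")]
    titles = [e.text for e in root.iter("h1")]
    return subtitles + titles


def per_div(root):
    """Visit each <div> in turn and emit its subtitle, then its title."""
    out = []
    for div in root.iter("div"):
        out.append(div.findtext("h2"))
        out.append(div.findtext("h1"))
    return out


print(grouped(root))
print(per_div(root))
```

The second function keeps each subtitle next to its own title, which is what iterating over the containing element buys you.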

dirkk
  • Thank you! I updated the code to show a better example of what I am trying to solve. How would I evaluate that in XQuery? – Jirka Matousek Oct 15 '14 at 11:08
  • First of all, if you edit, you can and should modify the question directly; there is no need for an Update section. Your XML is basically still the same. You would select all table rows by using `//tr`, but if you have other tables in this page, these would be selected as well. Use some uniquely identifying element of the webpage, e.g. an id, an h1 or h2 header, some characteristic link... – dirkk Oct 15 '14 at 12:14
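
The advice in this comment thread (iterate over the rows of one identifiable table only) can be sketched in plain Python as follows. This is an illustration under assumptions, not Xidel code: the table id `events`, the sample cell values, the function name `reordered_rows`, and the output order (dates, title, location) are all hypothetical. The row layout follows the template in the question.

```python
import xml.etree.ElementTree as ET

# Hypothetical page: an unrelated table plus the table we actually want.
PAGE = """
<html><body>
  <table id="menu"><tr><td>Home</td></tr></table>
  <table id="events">
    <tr>
      <td></td><td></td><td></td>
      <td><a><font><b>Prague</b></font>Concert</a></td>
      <td>15.10.2014</td>
      <td></td>
    </tr>
  </table>
</body></html>
"""


def reordered_rows(page, table_id):
    """Return (dates, title, location) for each row of one table only."""
    root = ET.fromstring(page)
    # Restrict the search to the table with the given id (assumed to exist once).
    table = root.find(f".//table[@id='{table_id}']")
    if table is None:
        return []
    result = []
    for tr in table.iter("tr"):
        cells = tr.findall("td")
        link = cells[3].find("a")
        font = link.find("font")
        location = font.findtext("b")
        # The title is the text that follows </font> inside <a>.
        title = font.tail
        dates = cells[4].text
        result.append((dates.strip(), title.strip(), location.strip()))
    return result


print(reordered_rows(PAGE, "events"))
```

Because the fields are read per row and then emitted as a tuple, the output order is independent of the order in which they appear on the page.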