I have a bunch of Yandex.XML files with search results. http://api.yandex.com/xml/doc/dg/concepts/response.xml

I want to find out the queries (//yandexsearch/request/query) for all such XML files where the first URL ((//yandexsearch/response/results/grouping/group/doc/url)[1]) equals a certain value (say, http://www.example.org/).

By analogy with grep, I'd first use something like the -l flag to list the matching documents, and then pipe that list to xargs xmllint to extract the original query. However, I haven't found an xmllint flag similar to -l for the initial matching, and perhaps xmllint (or another OS X tool) has a better way to do the whole thing.

cnst

1 Answer

Select the yandexsearch element whose first result URL contains the string you're looking for, then step down to its query.

/yandexsearch[
  contains(
    (response/results/grouping/group/doc/url)[1],
    "http://www.example.org"
  )]/request/query

For the example XML given on that page and the search string http://www.yandex.ru, it returns the following element:

<query>yandex</query>

If your search string is always a prefix of the URL, you might want to use starts-with(...) instead of contains(...). If the URL has to match exactly, as in the question, compare with = instead.
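If xmllint turns out to be awkward to run over many files, the same check can be sketched with Python's standard library, which ships with OS X. This is only a rough sketch, not a drop-in replacement for the XPath above: the function names `query_if_first_url_matches` and `matching_queries` are made up for illustration, it implements the starts-with variant, and it assumes each file has a `yandexsearch` root element laid out as in the question.

```python
import xml.etree.ElementTree as ET

# Relative to the <yandexsearch> root; mirrors the path used in the XPath above.
URL_PATH = "response/results/grouping/group/doc/url"


def query_if_first_url_matches(xml_data, prefix):
    """Return the query text if the first result URL starts with prefix, else None."""
    root = ET.fromstring(xml_data)
    if root.tag != "yandexsearch":
        return None
    # find() returns the first match in document order, like (...)[1] in XPath.
    first_url = root.find(URL_PATH)
    if first_url is None:
        return None
    if not (first_url.text or "").strip().startswith(prefix):
        return None
    return root.findtext("request/query")


def matching_queries(paths, prefix):
    """Yield (path, query) for every file whose first result URL matches."""
    for path in paths:
        with open(path, "rb") as fh:
            query = query_if_first_url_matches(fh.read(), prefix)
        if query is not None:
            yield path, query
```

Calling `matching_queries(glob.glob("*.xml"), "http://www.example.org/")` and printing each pair would give something close to `grep -l` plus the extracted query. Files that do not match are skipped silently because the helper returns `None` for them.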

Jens Erat
  • Looks like it's supposed to do what I'm asking, but I'm getting a "Segmentation fault"! – cnst Jan 11 '14 at 19:24
  • Hard to tell what's the problem now; it could be broken software (accessing memory it may not) or even broken hardware (defective memory). Try using a newer version of `xmllint`, and post more detailed error information if there is _any_. How are you calling `xmllint`? – Jens Erat Jan 11 '14 at 19:26
  • OK, so I get results only when I pass a single file that is supposed to match; otherwise I just get a segmentation fault, even if the input is merely one single file. I'm calling xmllint with pretty much the same expression, just with a different string than "http://www.example.org/", and the files are all rather small, too. – cnst Jan 11 '14 at 19:28
  • I'm not getting any messages other than the segmentation fault. – cnst Jan 11 '14 at 19:30
  • I think it might be related to the fact that no selection is taking place, so, the whole xpath expression resolves to nil. Is there, perhaps, some safe empty string that I could somehow select if the expression, as above, returns nothing? – cnst Jan 11 '14 at 19:33
  • I've tried adding ` | //yandexsearch/request/groupings/groupby/text()` to the xpath expression, and then I'm no longer getting segmentation fault, but instead get a message: `XPath set is empty`. I guess xml and xpath are not really command-line friendly... :/ – cnst Jan 11 '14 at 19:38
  • A segmentation fault should never happen; this is definitely a bug in xmllint. You could try Perl's XPath wrapper, which is available as the `xpath` command on the command line, or xmlstarlet, or a larger XPath/XQuery processor with command-line support such as BaseX or Saxon (among others). – Jens Erat Jan 11 '14 at 21:05