
Hi all, I'm trying to parse/extract HTML data with Clojure and Enlive (are there any better choices?).

I am trying to get all the ul > li tags that are NOT inside the <nav> tag. I think I should use the (html/but) function from Enlive, but I can't seem to make it work.

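;; test-enlive.clj
(require '[net.cgrand.enlive-html :as html])

;; test-dom is assumed to be the parsed test.html shown below,
;; e.g. (html/html-resource (java.io.File. "test.html"))

;; Run every selector in tag-list against dom;
;; returns a vector with one vector of matched nodes per selector.
(defn get-tags [dom tag-list]
  (mapv #(vec (html/select dom %)) tag-list))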

;;Gives NO tags
(get-tags test-dom [[[(html/but :nav) :ul :> :li]]])

;;Gives ALL the LI-tags
(get-tags test-dom [[:ul :> :li]])
<!-- test.html -->
<html>
<head><title>Test page</title>  </head>
<body>
    <div>
        <nav>
            <ul>
                <li>
                    skip these navs-li
                </li>
                
            </ul>
        </nav>
        <h1>Hello World<h1>                 
        <ul><li>get only these li's</li>                
        </ul>           
    </div>  
</body></html>
cfrick
user914584

3 Answers


If you had valid XHTML, you could use XPath via the sigel library:

(require '[sigel.xpath.core :as xpath])
(let [data "<html><head><title>Test page</title></head>
                <body><div><nav><ul><li>skip these navs-li</li></ul></nav>
                <h1>Hello World</h1>
                <ul><li>get only these li's</li></ul>
                </div></body></html>"]
        (xpath/select data "//li[not(ancestor::nav)]"))
akond

You could do this with the Tupelo Forest library. Watch the video and see the examples in the unit tests.

Here is one way to solve your problem:

(ns tst.tupelo.forest-examples
  (:use tupelo.core tupelo.forest tupelo.test)
  (:require ...))

<snip>

(verify
  (let [html-data "<html>
                      <head><title>Test page</title>  </head>
                      <body>
                          <div>
                              <nav>
                                  <ul>
                                      <li>
                                          skip these navs-li
                                      </li>

                                  </ul>
                              </nav>
                              <h1>Hello World<h1>
                              <ul><li>get only these li's</li>
                              </ul>
                          </div>
                      </body>
                  </html> "]

The interesting part comes next:

    (hid-count-reset)
    (with-forest (new-forest)
      (let [root-hid   (add-tree-html html-data)
            out-hiccup (hid->hiccup root-hid)
            result-1   (find-paths root-hid [:html :body :div :ul :li])
            li-hid     (last (only result-1))
            li-hiccup  (hid->hiccup li-hid)]
        (is= out-hiccup [:html
                         [:head [:title "Test page"]]
                         [:body
                          [:div
                           [:nav
                            [:ul
                             [:li
                              "\n                                          skip these navs-li\n                                      "]]]
                           [:h1 "Hello World"]
                           [:ul [:li "get only these li's"]]]]])
        (is= result-1 [[1011 1010 1009 1008 1007]])
        (is= li-hid 1007)
        (is= li-hiccup [:li "get only these li's"])))))

The above code can be seen live in the examples.

Alan Thompson

I was able to select the target li with Hickory, so if you don't mind switching libraries:

Dependency: [hickory "0.7.1"]

Require: [hickory.core :as h] [hickory.select :as s]

(s/select (s/and
            (s/descendant (s/tag :ul)
                          (s/tag :li))
            (s/not (s/descendant (s/tag :nav)
                                 (s/tag :li))))
          (h/as-hickory (h/parse (slurp "resources/site.html"))))

=> [{:type :element, :attrs nil, :tag :li, :content ["get only these li's"]}]
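
If staying with Enlive is preferred, one possible workaround is to drop the nav subtree first and select afterwards. As far as I understand, html/but tests the element itself rather than its ancestors, so it may not be able to express "not inside nav" on its own. The sketch below is only an illustration: the name li-outside-nav and the inline markup (a well-formed version of the question's page) are assumptions, not part of the question.

```clojure
(require '[net.cgrand.enlive-html :as html])

;; Assumed test input: a well-formed version of the page from the question.
(def test-dom
  (html/html-snippet
   "<div><nav><ul><li>skip these navs-li</li></ul></nav>"
   "<h1>Hello World</h1>"
   "<ul><li>get only these li's</li></ul></div>"))

;; Hypothetical helper: remove every nav element, then select the li
;; elements that are direct children of a ul in what is left.
(defn li-outside-nav [dom]
  (-> dom
      (html/at [:nav] nil)          ; a nil transformation removes the matched nodes
      (html/select [:ul :> :li])))

;; Usage: extract the text of the remaining li elements.
(map html/text (li-outside-nav test-dom))
```

The trade-off is that the document is transformed before selecting, which is fine for extraction but does an extra pass over the tree.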
Martin Půda