
I have an HTML page with a structure that I want to turn into a Clojure data structure. I’m hitting a mental block on how to approach this in an idiomatic way.

This is the structure I have:

<div class=“group”>
  <h2>title1<h2>
  <div class=“subgroup”>
    <p>unused</p>
    <h3>subheading1</h3>
    <a href=“path1” />
  </div>
  <div class=“subgroup”>
    <p>unused</p>
    <h3>subheading2</h3>
    <a href=“path2” />
  </div>
</div>
<div class=“group”>
  <h2>title2<h2>
  <div class=“subgroup”>
    <p>unused</p>
    <h3>subheading3</h3>
    <a href=“path3” />
  </div>
</div>

Structure I want:

'(
[“Title1” “subhead1” “path1”]
[“Title1” “subhead2” “path2”]
[“Title2” “subhead3” “path3”]
[“Title3” “subhead4” “path4”]
[“Title3” “subhead5” “path5”]
[“Title3” “subhead6” “path6”]
)

The repetition of titles is intentional.

I’ve read David Nolen’s enlive tutorial. It offers a good solution when there is parity between groups and subgroups, but in this case the number of subgroups per group is arbitrary.

Thanks for any advice.

user619882
  • Do you have typos in your HTML? It seems like `<h2>title1<h2>` and `<h2>title2<h2>` should be `<h2>title1</h2>` and `<h2>title2</h2>`, respectively. – Sam Estep Aug 08 '17 at 18:21

3 Answers

3

You can use Hickory for parsing, and then Clojure has some very nice tools for transforming the parsed HTML to the form you want:

(require '[hickory.core :as html])

(defn classifier [tag klass]
  (comp #{[:element tag klass]} (juxt :type :tag (comp :class :attrs))))

(def group? (classifier :div "“group”"))
(def subgroup? (classifier :div "“subgroup”"))
(def path? (classifier :a nil))
(defn identifier? [tag] (classifier tag nil))

(defn only [x]
  ;; https://stackoverflow.com/a/14792289/5044950
  {:pre [(seq x)
         (nil? (next x))]}
  (first x))

(defn identifier [tag element]
  (->> element :content (filter (identifier? tag)) only :content only))

(defn process [data]
  (for [group (filter group? (map html/as-hickory (html/parse-fragment data)))
        :let [title (identifier :h2 group)]
        subgroup (filter subgroup? (:content group))
        :let [subheading (identifier :h3 subgroup)]
        path (filter path? (:content subgroup))]
    [title subheading (:href (:attrs path))]))

Example:

(require '[clojure.pprint :as pprint])

(def data
"<div class=“group”>
  <h2>title1</h2>
  <div class=“subgroup”>
    <p>unused</p>
    <h3>subheading1</h3>
    <a href=“path1” />
  </div>
  <div class=“subgroup”>
    <p>unused</p>
    <h3>subheading2</h3>
    <a href=“path2” />
  </div>
</div>
<div class=“group”>
  <h2>title2</h2>
  <div class=“subgroup”>
    <p>unused</p>
    <h3>subheading3</h3>
    <a href=“path3” />
  </div>
</div>")

(pprint/pprint (process data))
;; (["title1" "subheading1" "“path1”"]
;;  ["title1" "subheading2" "“path2”"]
;;  ["title2" "subheading3" "“path3”"])
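Note that the paths in the output above still carry the typographic quotes (“ and ”), because the sample HTML uses them instead of straight quotes and the parser treats them as part of the attribute value. If fixing the HTML at the source is not an option, a minimal sketch for cleaning the values afterwards could look like this. The helper name `strip-curly-quotes` is made up for illustration and is not part of Hickory:

```clojure
(require '[clojure.string :as str])

;; Hypothetical helper (not part of Hickory): removes the typographic
;; double quotes U+201C and U+201D from a string.
(defn strip-curly-quotes [s]
  (str/replace s #"[\u201C\u201D]" ""))

;; Applied to one row produced by `process`:
(mapv strip-curly-quotes ["title1" "subheading1" "\u201Cpath1\u201D"])
```

With that in place, the class names passed to `classifier` would still need to match whatever the parser actually returns, so the cleanup is probably best applied only to the final rows.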
Sam Estep
  • Thank you. That's incredibly succinct. I was struggling to get at the right nodes with enlive. I was not aware of Hickory. I'll read its docs - it looks really useful, and it's a good name for a Clojure library. – user619882 Aug 08 '17 at 20:17
0

The solution can be split into two parts:

  • Parsing: parse the page with a Clojure HTML parser, or any other parser.
  • Custom data structure: transform the parsed HTML into the shape you need; you can use clojure.walk for that if you want.
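As a rough sketch of the second part only, assuming the parser returns nested maps with `:tag`, `:attrs` and `:content` keys (the shape Hickory and enlive use) and that the HTML attributes are written with straight quotes. The `parsed` value below is a hand-written stand-in for real parser output, and all function names are made up for illustration:

```clojure
(require '[clojure.string :as str]
         '[clojure.walk :as walk])

;; Hand-written stand-in for parser output (assumed shape, not real output).
(def parsed
  [{:tag :div, :attrs {:class "group"}
    :content ["\n  "
              {:tag :h2, :attrs nil, :content ["title1"]}
              "\n  "
              {:tag :div, :attrs {:class "subgroup"}
               :content [{:tag :p, :attrs nil, :content ["unused"]}
                         {:tag :h3, :attrs nil, :content ["subheading1"]}
                         {:tag :a, :attrs {:href "path1"}, :content nil}]}
              {:tag :div, :attrs {:class "subgroup"}
               :content [{:tag :p, :attrs nil, :content ["unused"]}
                         {:tag :h3, :attrs nil, :content ["subheading2"]}
                         {:tag :a, :attrs {:href "path2"}, :content nil}]}]}
   {:tag :div, :attrs {:class "group"}
    :content [{:tag :h2, :attrs nil, :content ["title2"]}
              {:tag :div, :attrs {:class "subgroup"}
               :content [{:tag :p, :attrs nil, :content ["unused"]}
                         {:tag :h3, :attrs nil, :content ["subheading3"]}
                         {:tag :a, :attrs {:href "path3"}, :content nil}]}]}])

(defn drop-blank-strings
  "Optional tidy-up with clojure.walk: removes whitespace-only strings
  from every :content collection in the tree."
  [tree]
  (walk/postwalk
    (fn [node]
      (if (and (map? node) (contains? node :content))
        (update node :content
                #(vec (remove (fn [c] (and (string? c) (str/blank? c))) %)))
        node))
    tree))

(defn children-with-tag
  "Direct children of node whose :tag equals tag (text nodes are skipped)."
  [tag node]
  (filter #(= tag (:tag %)) (:content node)))

(defn text-of
  "Text of the first direct child of node with the given tag."
  [tag node]
  (-> (children-with-tag tag node) first :content first))

(defn rows
  "One [title subheading path] vector per link; the title repeats per subgroup."
  [tree]
  (for [group    tree
        :when    (= "group" (get-in group [:attrs :class]))
        subgroup (children-with-tag :div group)
        link     (children-with-tag :a subgroup)]
    [(text-of :h2 group) (text-of :h3 subgroup) (get-in link [:attrs :href])]))

(rows (drop-blank-strings parsed))
;; => (["title1" "subheading1" "path1"]
;;     ["title1" "subheading2" "path2"]
;;     ["title2" "subheading3" "path3"])
```

The nested `for` is what copes with a varying number of subgroups per group: each group contributes as many rows as it has links, and the title is repeated for each one.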
hernan
0

You can solve this problem with the tupelo.forest library. Here is an annotated unit test showing the approach. You can find more information in the API docs and both the unit tests and the example demos. Additional documentation is forthcoming.

(dotest
  (with-forest (new-forest)
    (let [html-str        "<div class=“group”>
                              <h2>title1</h2>
                              <div class=“subgroup”>
                                <p>unused</p>
                                <h3>subheading1</h3>
                                <a href=“path1” />
                              </div>
                              <div class=“subgroup”>
                                <p>unused</p>
                                <h3>subheading2</h3>
                                <a href=“path2” />
                              </div>
                            </div>
                            <div class=“group”>
                              <h2>title2</h2>
                              <div class=“subgroup”>
                                <p>unused</p>
                                <h3>subheading3</h3>
                                <a href=“path3” />
                              </div>
                            </div>"

          enlive-tree     (->> html-str
                            java.io.StringReader.
                            en-html/html-resource
                            first)
          root-hid        (add-tree-enlive enlive-tree)
          tree-1          (hid->hiccup root-hid)

          ; Removing whitespace nodes is optional; just done to keep things neat
          blank-leaf-hid? (fn fn-blank-leaf-hid? ; whitespace pred fn
                            [hid]
                            (let [node (hid->node hid)]
                              (and (contains-key? node ::tf/value)
                                (ts/whitespace? (grab ::tf/value node)))))
          blank-leaf-hids (keep-if blank-leaf-hid? (all-leaf-hids)) ; find whitespace nodes
          >>              (apply remove-hid blank-leaf-hids) ; delete whitespace nodes found
          tree-2          (hid->hiccup root-hid)
          >>              (is= tree-2 [:html
                                       [:body
                                        [:div {:class "“group”"}
                                         [:h2 "title1"]
                                         [:div {:class "“subgroup”"}
                                          [:p "unused"]
                                          [:h3 "subheading1"]
                                          [:a {:href "“path1”"}]]
                                         [:div {:class "“subgroup”"}
                                          [:p "unused"]
                                          [:h3 "subheading2"]
                                          [:a {:href "“path2”"}]]]
                                        [:div {:class "“group”"}
                                         [:h2 "title2"]
                                         [:div {:class "“subgroup”"}
                                          [:p "unused"]
                                          [:h3 "subheading3"]
                                          [:a {:href "“path3”"}]]]]])

          ; find consecutive nested [:div :h2] pairs at any depth in the tree
          div-h2-paths    (find-paths root-hid [:** :div :h2])
          >>              (is= (format-paths div-h2-paths)
                            [[{:tag :html}
                              [{:tag :body}
                               [{:class "“group”", :tag :div}
                                [{:tag :h2, :tupelo.forest/value "title1"}]]]]
                             [{:tag :html}
                              [{:tag :body}
                               [{:class "“group”", :tag :div}
                                [{:tag :h2, :tupelo.forest/value "title2"}]]]]])

          ; find the hid for each top-level :div (i.e. "group"); the next-to-last (-2) hid in each vector
          div-hids        (mapv #(idx % -2) div-h2-paths)
          ; for each of div-hids, find and collect nested :h3 values
          dif-h3-paths    (vec
                            (lazy-gen
                              (doseq [div-hid div-hids]
                                (let [h2-value  (find-leaf-value div-hid [:div :h2])
                                      h3-paths  (find-paths div-hid [:** :h3])
                                      h3-values (it-> h3-paths (mapv last it) (mapv hid->value it))]
                                  (doseq [h3-value h3-values]
                                    (yield [h2-value h3-value]))))))
          ]
      (is= dif-h3-paths
        [["title1" "subheading1"]
         ["title1" "subheading2"]
         ["title2" "subheading3"]])

      )))
Alan Thompson