
Could someone explain how to scrape the content of the <td> tags in a row whose <th> contains the text "Row1 title" (in this case the text sits inside a <b> tag, which I need for the matching operation), without scraping the <th> tag or any of its content in the process? Here is my test HTML:

<table class="table_class"> 
                    <tbody> 
                       <tr> 
                         <th>
                           <b>
                              Row1 title
                           </b>
                         </th> 
                         <td>2.660.784</td> 
                         <td>2.944.552</td> 
                         <td>Correct, has 3 td elements</td> 
                       </tr> 
                       <tr> 
                         <th>                                
                              Row2 title                                
                          </th> 
                         <td>2.660.784</td> 
                         <td>2.944.552</td> 
                         <td>Correct, has 3 td elements</td> 
                       </tr> 
                    </tbody>
</table>

The data I want to extract should come from these tags:

                     <td>2.660.784</td> 
                     <td>2.944.552</td> 
                     <td>Correct, has 3 td elements</td> 

I have managed to create a function that returns the entire content of the table, but I would like to exclude the <th> node from the result and return only the data from the <td> nodes, whose content I can use for further parsing. Can anyone help me with this?

Мitke

1 Answer


With enlive, something like this:

(ns tutorial.so-scrape
  (:require [net.cgrand.enlive-html :as html]))

;; Fetch the page at url and select every td node inside a table.
(defn parse-tds [url]
  (html/select (html/html-resource (java.net.URL. url)) [:table :td]))

This should give you a sequence of all the td nodes, each of the form {:tag :td :attrs {...} :content (...)}. I am not aware that enlive lets you get the content of those nodes directly, but I could be wrong.

You could then extract the content of each node in that sequence (bound to ws-content here) with something along the lines of
(for [line ws-content] (apply str (:content line)))
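To make the shape of the result concrete, here is a minimal sketch of that extraction step run on hand-written nodes. The names sample-tds and td-texts are made up for illustration, and the data is typed in by hand in the map shape described above rather than scraped from a page.

```clojure
;; Hand-written nodes in the {:tag ... :attrs ... :content ...} shape
;; (sample data for illustration, not the result of a real html/select call).
(def sample-tds
  [{:tag :td :attrs nil :content ["2.660.784"]}
   {:tag :td :attrs nil :content ["2.944.552"]}
   {:tag :td :attrs nil :content ["Correct, has 3 td elements"]}])

(defn td-texts
  "Joins the :content of each node into a single string."
  [nodes]
  (for [node nodes]
    (apply str (:content node))))

(td-texts sample-tds)
;; => ("2.660.784" "2.944.552" "Correct, has 3 td elements")
```

This only works cleanly when :content holds plain strings; a td with nested tags would need its child nodes walked as well.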

In regard to the question you posted yesterday (I am assuming you are still working with that page): the solution I gave there was a little complex, but it's also flexible. For example, if you change the tag-type function like this

(defn tag-type [node]
  (case (:tag node)
    :td ::TerminalNode
    ::IgnoreNode))

(that is, change the return value for all nodes to ::IgnoreNode except for :td), then it just gives you a sequence of the content of the :tds, which is probably close to what you want. Let me know if you need more help.

EDIT (in reply to the comments below): I don't think selecting nodes based on their :content is possible with enlive alone, but you can certainly do so with Clojure.

For example, something like this could work:

(for [line ws-content
      :when (re-find (re-pattern "WHAT YOU WANT TO MATCH")
                     (apply str (:content line)))]
  (:content line))

(You might have to tweak the (:content line) form a little, since :content is a sequence of child nodes rather than a single string.)
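For the original question (the td cells of the row whose th reads "Row1 title", without returning the th itself), one possible approach is to work row by row in plain Clojure. The sketch below is not enlive API: node-text, tds-for-title and sample-rows are hypothetical names, and the rows are hand-written in the node-map shape shown earlier. With a real page, the rows would presumably come from selecting [:table :tr] instead.

```clojure
(require '[clojure.string :as string])

(defn node-text
  "Recursively concatenates all text below a node (plain strings are text nodes)."
  [node]
  (cond
    (string? node) node
    (map? node)    (apply str (map node-text (:content node)))
    :else          ""))

(defn tds-for-title
  "Returns the trimmed text of every :td in those rows whose :th text,
   once trimmed, equals title. The :th itself is never returned."
  [rows title]
  (for [row   rows
        :let  [cells (filter map? (:content row)) ; drop whitespace text nodes
               th    (first (filter #(= :th (:tag %)) cells))]
        :when (and th (= title (string/trim (node-text th))))
        td    cells
        :when (= :td (:tag td))]
    (string/trim (node-text td))))

;; Hand-written rows mirroring the test HTML (sample data, not scraped).
(def sample-rows
  [{:tag :tr :attrs nil
    :content ["\n"
              {:tag :th :attrs nil
               :content [{:tag :b :attrs nil :content ["\n  Row1 title\n"]}]}
              {:tag :td :attrs nil :content ["2.660.784"]}
              {:tag :td :attrs nil :content ["2.944.552"]}
              {:tag :td :attrs nil :content ["Correct, has 3 td elements"]}]}
   {:tag :tr :attrs nil
    :content [{:tag :th :attrs nil :content ["  Row2 title  "]}
              {:tag :td :attrs nil :content ["other"]}]}])

(tds-for-title sample-rows "Row1 title")
;; => ("2.660.784" "2.944.552" "Correct, has 3 td elements")
```

Matching on the trimmed text of the whole th means it does not matter whether the title sits directly in the th or inside a nested b tag.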

Paul
  • (pseudo code) `(map #(html/select % [[:th :b ("WHERE :CONTENT OF :b tag is equal to "UKUPNA AKTIVA")]]) (remove nil? (map get-data-balance_sheet (h3+table "http://www.belex.rs/trgovanje/prospekt/VZAS/show"))))` I get (()), but if only th is there, I get a sequence of all the th elements in the table. I want just the one that has a `<b>` element with the content "UKUPNA AKTIVA". I will then try to use this result to get the `<td>` nodes that come after it. – Мitke Oct 19 '11 at 12:01
  • The result I'm hoping for is a sequence of elements where the th tag (or in my case the b tag) has the specified value. – Мitke Oct 19 '11 at 12:05
  • I don't think what you want is possible with enlive alone: it selects nodes based on their ids and classes, not on their content (as far as I know). But you can of course do this with Clojure, and I don't see a reason why you wouldn't do it that way (see the change in my answer). – Paul Oct 19 '11 at 12:26
  • The edit you posted worked great for me. In the meantime, I managed to wrestle up some code that does the same thing, but yours is way more elegant. Here is my code (or, to be precise, a heavy-duty hack): `(apply str (flatten (map :content (flatten (map #(html/select % [:th :b]) (remove nil? (map get-data-balance_sheet (h3+table "http://www.belex.rs/trgovanje/prospekt/VZAS/show"))))))))` – Мitke Oct 19 '11 at 13:24
  • Well, yours is far shorter, on the other hand ;) Good luck and fun with clojure! – Paul Oct 19 '11 at 13:26
  • let us [continue this discussion in chat](http://chat.stackoverflow.com/rooms/4393/discussion-between-mitke-and-paul) – Мitke Oct 19 '11 at 18:10