
My objective is to extract and parse a series of bibliographical references from a webpage for entry into a database later. The references are all in MLA format. This should be a general solution, for all instances of MLA-format bibliographies, and should work on more than just the webpage indicated below.

Here is my attempt, which doesn't work:

(use '[net.cgrand.enlive-html])

(def ^:dynamic *base-url* "https://www.impacttest.com/research/?Clinical-Research-Database-4")
(def ^:dynamic *ref-selector*     [:div#content_1 :ul :li])


(defn fetch-url [url]
  (html-resource (java.net.URL. url)))

(defn references []
  (select (fetch-url *base-url*) *ref-selector*))

(def ^:dynamic *ref-regex*    #"\s([A-Z]{1}[\w|\s]+)[,|\.]")
(def ^:dynamic *ref-modifier* `(remove :content))

(defmacro extract-re [node re modifier]
  `(doseq [seqs (map :content (node))]
    (re-find re (apply str (modifier seqs)))))

(extract-re references *ref-regex* *ref-modifier*)

(macroexpand-1 '(extract-re references *ref-regex* *ref-modifier*))

I would like the macro extract-re to create a doseq that runs a regex matcher (re-find) on all of the enlive nodes. There are two variables that need to change: one is the regex itself, and the other is the modifier, which modifies the enlive node before it's processed. Without the modifier, the regex will match both the authors and some titles. I tried writing a function, but couldn't get it to work in a general case, so I think a macro is the way to go.

On MLA references, I think it's easier to use the modifier on the enlive node than to do all of the extraction with regex, although I may be wrong on that. I can't think of how to do a regex that will only match the title or only the authors.

So, how do I pass the modifier to the macro and have it execute properly? I don't fully understand the quoting details of macros, so I may be way off on how I wrote the macro to begin with, or even if a macro is necessary.

Ben
  • "I tried writing a function, but couldn't get it to work in a general case, so I think a macro is the way to go." -- macros are for altering syntax, outside of that there is nothing in clojure that a new macro could do but a function could not. How does a new syntax help you apply a regex properly? – noisesmith Jul 10 '14 at 14:34
  • It helps me apply the modifier properly; it has nothing to do with the regex. Other modifiers might not take the same form as the one here. – Ben Jul 10 '14 at 14:43
  • in fact it does not help you apply the modifier at all, see my fix posted shortly – noisesmith Jul 10 '14 at 15:07
  • Edited the title to something more relevant to the question/answer. Please revise if you have better wording. – A. Webb Jul 10 '14 at 16:53

2 Answers


There are numerous issues with this code.

'(use [net.cgrand.enlive-html])

This does not bring in a library. It creates a literal list and does nothing with it:

user> (class '(use [net.cgrand.enlive-html]))
clojure.lang.PersistentList

It is effectively a no-op.

(def ^:dynamic *ref-modifier* `(remove :content))

This creates a two element list, not a "modifier" of any sort.
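To make the difference concrete, here is a minimal sketch comparing the syntax-quoted form with a function built by partial. The names quoted-modifier, modifier-fn and sample-content are hypothetical and used only for illustration; sample-content imitates the mix of strings and element maps found in an enlive :content sequence.

```clojure
;; Syntax-quoting a form yields data (a two-element sequence), not something callable.
(def quoted-modifier `(remove :content))

;; partial builds a real function that can be passed around and called.
(def modifier-fn (partial remove :content))

;; Hypothetical sample: strings mixed with an element map.
(def sample-content ["a " {:tag :em, :content ["b"]} " c"])

(seq? quoted-modifier)        ; true, it is just a sequence of forms
(count quoted-modifier)       ; 2
(fn? modifier-fn)             ; true
(modifier-fn sample-content)  ; ("a " " c"), the map with :content is removed
```

Strings survive because looking up :content on a string returns nil, so only the nested element maps are dropped.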

(defmacro extract-re [node re modifier]
  `(doseq [seqs (map :content (node))]
    (re-find re (apply str (modifier seqs)))))

Here you use syntax-quote, but you never unquote anything inside it. The macro doesn't use any of its arguments in any way.

You seem to want to apply modifier as if it were a function (this does not even begin to happen, see the above quoting issues), but as we see in the actual call, modifier is a two element list, and would cause an error if called.
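The quoting issue can be seen in isolation with two tiny macros. This is only a sketch; no-unquote and with-unquote are hypothetical names that do not appear in the question.

```clojure
;; Without unquoting, syntax-quote namespace-qualifies the symbol x
;; instead of inserting the macro argument.
(defmacro no-unquote [x]
  `(inc x))

;; With ~ (unquote), the argument form is inserted into the expansion.
(defmacro with-unquote [x]
  `(inc ~x))

;; Expands to (clojure.core/inc some-ns/x), where some-ns is whatever
;; namespace the macro was defined in; the argument 41 is ignored.
(macroexpand-1 '(no-unquote 41))

(macroexpand-1 '(with-unquote 41)) ; (clojure.core/inc 41)
(with-unquote 41)                  ; 42
```

The original extract-re has the same shape as no-unquote: node, re and modifier appear inside the syntax-quote without ~, so the expansion never refers to the arguments.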

Finally, doseq only works for side effects, and always returns nil. The doseq block does not use the value generated by the re-find, so the doseq body is effectively a no-op.
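A short sketch of the difference, using a made-up input vector and regex rather than the scraped data:

```clojure
;; doseq evaluates its body for side effects and discards the results.
(def from-doseq
  (doseq [s ["a1" "b2"]]
    (re-find #"\d" s)))

;; for builds a lazy sequence of the body's results; doall forces it.
(def from-for
  (doall
    (for [s ["a1" "b2"]]
      (re-find #"\d" s))))

from-doseq ; nil
from-for   ; ("1" "2")
```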

Additionally, I see dubious utility in using dynamic var declarations for vars that will be supplied as explicit function arguments.

With all of these issues addressed, I think we are closer to something that works:

(use 'net.cgrand.enlive-html)

(def ^:dynamic *base-url*
  "https://www.impacttest.com/research/?Clinical-Research-Database-4")

(def ^:dynamic *ref-selector* [:div#content_1 :ul :li])


(defn fetch-url [url]
  (html-resource (java.net.URL. url)))

(defn references []
  (select (fetch-url *base-url*) *ref-selector*))

(def ^:dynamic *ref-regex* #"\s([A-Z]{1}[\w|\s]+)[,|\.]")

(def ^:dynamic *ref-modifier* (partial remove :content))

(defn extract-re [node re modifier]
  (doall
    (for [sq (map :content (node))]
      (re-find re (apply str (modifier sq))))))

and in action:

user> (extract-re references *ref-regex* *ref-modifier*)

([" Dambinova SA," "Dambinova SA"] [" Zuckerman SL," "Zuckerman SL"] [" Conklin HM," "Conklin HM"] [" Covassin T," "Covassin T"] [" Maerlender A," "Maerlender A"] [" Fedor A," "Fedor A"] [" Resch J," "Resch J"] [" Elbin RJ," "Elbin RJ"] [" Rabinowitz AR," "Rabinowitz AR"] [" Kinnaman KA," "Kinnaman KA"] [" Tsushima WT," "Tsushima WT"] [" Amonette WE," "Amonette WE"] [" Lovell MR," "Lovell MR"] [" Schatz P," "Schatz P"] [" McGrath N," "McGrath N"] [" Kontos AP," "Kontos AP"] [" AB," "AB"] [" Meehan WP," "Meehan WP"] [" Rieger BP," "Rieger BP"] [" Solomon GS," "Solomon GS"] [" Sandel NK," "Sandel NK"] [" Schatz P," "Schatz P"] [" Schatz P," "Schatz P"] [" Lebrun CM," "Lebrun CM"] [" Brooks B," "Brooks B"] [" Meehan WP," "Meehan WP"] [" Fakhran S," "Fakhran S"] [" Cole WR," "Cole WR"] [" Tsushima M," "Tsushima M"] [" Zuckerman SL," "Zuckerman SL"] [" JK," "JK"] [" Covassin T," "Covassin T"] [" Moser RS," "Moser RS"] [" Mayers LB," "Mayers LB"] [" McAllister TW," "McAllister TW"] [" Meehan WP 3rd," "Meehan WP 3rd"] [" Neal MT," "Neal MT"] [" Lau BC," "Lau BC"] [" Kontos AP," "Kontos AP"] [" Gardner A," "Gardner A"] [" Elbin RJ," "Elbin RJ"] [" Wolf EG," "Wolf EG"] [" Reddy CC," "Reddy CC"] [" Moser RS," "Moser RS"] [" Guerriero RM," "Guerriero RM"] [" Deibert E," "Deibert E"] [" Wiebe DJ," "Wiebe DJ"] [" Baillargeon A," "Baillargeon A"] [" Erdal K." "Erdal K"] [" Maugans TA," "Maugans TA"] [" Iverson GL," "Iverson GL"] [" Ponsford J," "Ponsford J"] [" Schatz P," "Schatz P"] [" Mulligan I," "Mulligan I"] [" Echlin PS," "Echlin PS"] [" McLeod TC," "McLeod TC"] [" Zuckerman SL," "Zuckerman SL"] [" Kontos AP," "Kontos AP"] [" Zuckerman SL," "Zuckerman SL"] [" Schatz P," "Schatz P"] [" Kontos AP," "Kontos AP"] [" Covassin T," "Covassin T"] [" Covassin T," "Covassin T"] [" Duhaime AC," "Duhaime AC"] [" Echemendia RJ," "Echemendia RJ"] [" Ramanathan DM," "Ramanathan DM"] [" Meehan WP 3rd," "Meehan WP 3rd"] [" Krol AL," "Krol AL"] [" Turgeon C," "Turgeon C"] [" Randolph C." 
"Randolph C"] [" Barlow M," "Barlow M"] [" Schatz P," "Schatz P"] [" Moser RS," "Moser RS"] [" Broglio SP," "Broglio SP"] [" Thomas DG," "Thomas DG"] [" Allen BJ," "Allen BJ"] [" Solomon GS," "Solomon GS"] [" Ponsford J," "Ponsford J"] [" Johnson EW," "Johnson EW"] [" Randolph C," "Randolph C"] [" Elbin RJ," "Elbin RJ"] [" Broglio SP," "Broglio SP"] [" Kontos AP," "Kontos AP"] [" Lau BC," "Lau BC"] [" Lau BC," "Lau BC"] [" Hettich T," "Hettich T"] [" Elbin T," "Elbin T"] [" Maerlender A," "Maerlender A"] [" Kontos AP," "Kontos AP"] [" Talavage TM," "Talavage TM"] [" Meehan WP 3rd," "Meehan WP 3rd"] [" Lange RT," "Lange RT"] [" Covassin T," "Covassin T"] [" Schatz P." "Schatz P"] [" Lange RT," "Lange RT"] [" Pardini JE," "Pardini JE"] [" Echlin PS," "Echlin PS"] [" Schatz P," "Schatz P"] [" Echlin PS," "Echlin PS"] [" Keightley ML," "Keightley ML"] [" McGrath N." "McGrath N"] [" Covassin T," "Covassin T"] [" Pontifex MB," "Pontifex MB"] [" AB," "AB"] [" Casson IR," "Casson IR"] [" McCrory P," "McCrory P"] [" Covassin T," "Covassin T"] [" Bruce JM," "Bruce JM"] [" Covassin T," "Covassin T"] [" Lovell M." "Lovell M"] [" Lau B," "Lau B"] [" Nance ML," "Nance ML"] [" Peterson SE," "Peterson SE"] [" Lovell M." "Lovell M"] [" Broglio SP," "Broglio SP"] [" Broglio SP," "Broglio SP"] [" Colvin AC," "Colvin AC"] [" Reddy CC," "Reddy CC"] [" Solomon GS," "Solomon GS"] [" Covassin T," "Covassin T"] [" Majerske CW," "Majerske CW"] [" Lovell MR," "Lovell MR"] [" AB," "AB"] [" Tsushima WT," "Tsushima WT"] [" Miller JR," "Miller JR"] [" Slobounov S," "Slobounov S"] [" Mihalik JP," "Mihalik JP"] [" Covassin T," "Covassin T"] [" Lovell MR," "Lovell MR"] [" Stoller KP." "Stoller KP"] [" Broglio SP," "Broglio SP"] [" Moser RS," "Moser RS"] [" Iverson G." 
"Iverson G"] [" Fazio VC," "Fazio VC"] [" Swanik CB," "Swanik CB"] [" Broglio SP," "Broglio SP"] [" Covassin T," "Covassin T"] [" Broglio SP," "Broglio SP"] [" Chen JK," "Chen JK"] [" Van Kampen DA," "Van Kampen DA"] [" Broglio SP," "Broglio SP"] [" Pellman EJ," "Pellman EJ"] [" Pellman EJ," "Pellman EJ"] [" Schatz P," "Schatz P"] [" Biasca N," "Biasca N"] [" Collins M," "Collins M"] [" Lovell MR," "Lovell MR"] [" Lovell MR," "Lovell MR"] [" Iverson GL," "Iverson GL"] [" Cantu RC," "Cantu RC"] [" McClincy MP," "McClincy MP"] [" Schatz P," "Schatz P"] [" Iverson GL," "Iverson GL"] [" Van Kampen DA," "Van Kampen DA"] [" Lovell M," "Lovell M"] [" Mihalik JP," "Mihalik JP"] [" Moser RS," "Moser RS"] [" Broshek DK," "Broshek DK"] [" Grove R," "Grove R"] [" McCrea M," "McCrea M"] [" McCrory P," "McCrory P"] [" Iverson GL," "Iverson GL"] [" Lovell MR," "Lovell MR"] [" Bruce JM," "Bruce JM"] [" Pellman EJ," "Pellman EJ"] [" Iverson GL," "Iverson GL"] [" Lovell MR," "Lovell MR"] [" Kontos A," "Kontos A"] [" Collins MW," "Collins MW"] [" Iverson GL," "Iverson GL"] [" Lovell M," "Lovell M"] [" Field M," "Field M"] [" Covassin T," "Covassin T"] [" Iverson GL," "Iverson GL"] [" Lovell MR," "Lovell MR"] [" Collins MW," "Collins MW"] [" Lovell MR," "Lovell MR"] [" Collins MW," "Collins MW"] [" Collins MW," "Collins MW"] [" Collins MW," "Collins MW"] [" Maroon JC," "Maroon JC"] [" Lovell MR," "Lovell MR"] [" Lovell MR." "Lovell MR"] [" Aubry M," "Aubry M"] [" Grindel SH," "Grindel SH"] [" Collins MW," "Collins MW"] [" Lovell MR," "Lovell MR"] [" Collins MW," "Collins MW"] [" Lovell MR," "Lovell MR"])
noisesmith
  • Perfect, thank you so much. The biggest thing I was struggling with was that I didn't understand macros/Clojure enough to even ask/formulate the question in the right way, and yet somehow you figured it out. Thanks again. – Ben Jul 10 '14 at 15:23

For the sake of illustration...

Note: I'll be prefixing enlive with html

(require '[net.cgrand.enlive-html :as html]
         '[clojure.string]) ; used below via its full name, clojure.string/trim and clojure.string/split

The output of (references) is a sequence of individual reference elements like

(def data-sample 
  '{:tag :li, :attrs nil, 
    :content 
    ("\n\t\t\t\t\t\t\t\t\t\t\t\t\t" 
      {:tag :strong, :attrs nil, 
       :content 
       ("AMPAR peptide values *snip*.")}
      " Dambinova SA, Shikuev, Weissman JD, Mullins, JD. "
      {:tag :em, :attrs nil, :content ("Military Medicine.")}
      " 2013, 178 (3):285-290.\t\t\t\t\t\t\t\t\t\t\t\t")})

You'll notice that the article title is in bold and the journal in italics, so we could use selectors to extract those at least. But, since changes in formatting are used to provide visual separation of the components, they'll also provide separation in the data.

(defn trimmed-text-only [html-data] 
  (as-> html-data x
    (html/select x [html/text-node])
    (map clojure.string/trim x)
    (remove empty? x)))

(trimmed-text-only data-sample)
;=> 
("AMPAR peptide values *snip*." 
 "Dambinova SA, Shikuev, Weissman JD, Mullins, JD." 
 "Military Medicine." 
 "2013, 178 (3):285-290.")

This already makes the components evident, but notice that, in this sample, each component ends with a period and no periods appear within a component. So we can also ignore formatting altogether and split on the periods, with the added benefit of removing them.

(defn extract-major-reference-components
  [html-data]
  (as-> html-data x
    (trimmed-text-only x)
    (apply str x)
    (clojure.string/split x #"\.")
    (zipmap [:title :authors :journal :issue-ref] x)))

(extract-major-reference-components data-sample)
;=> 
{:title "AMPAR peptide values *snip*",
 :authors "Dambinova SA, Shikuev, Weissman JD, Mullins, JD",
 :journal "Military Medicine",
 :issue-ref "2013, 178 (3):285-290"}

Now you can map this extraction function over the sequence of references. With the output maps, you can do further transformations with update-in and regexps to e.g. separate the individual authors or year, issue number, and pages from the issue-ref.
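As a rough sketch of that last step, the following assumes the issue-ref always looks like "year, volume (issue):pages" and that authors are comma-separated. The names reference, parse-issue-ref and refine are hypothetical, and the regex is an assumption based only on the sample above.

```clojure
(require '[clojure.string :as str])

;; Hypothetical input, shaped like the output of extract-major-reference-components.
(def reference
  {:title "AMPAR peptide values *snip*"
   :authors "Dambinova SA, Shikuev, Weissman JD, Mullins, JD"
   :journal "Military Medicine"
   :issue-ref "2013, 178 (3):285-290"})

(defn parse-issue-ref
  "Splits a string like \"2013, 178 (3):285-290\" into its parts.
  Returns nil when the string does not match this assumed layout."
  [s]
  (when-let [[_ year volume issue pages]
             (re-matches #"\s*(\d{4}),\s*(\d+)\s*\((\d+)\):\s*(\d+-\d+)\s*" s)]
    {:year year, :volume volume, :issue issue, :pages pages}))

(defn refine
  "Replaces the :authors string with a vector and :issue-ref with a map."
  [m]
  (-> m
      (update-in [:authors] #(str/split % #",\s*"))
      (update-in [:issue-ref] parse-issue-ref)))

(refine reference)
```

Splitting on commas is naive: in this sample "Mullins, JD" becomes two entries, so real data would likely need a more careful author pattern.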

A. Webb
  • While the original answer was fantastic and addressed my specific question, this answer applies to a much more general issue, which will be more relevant to the greater Clojure community. Also, the enlive library doesn't have very good docs, so we could use more of this kind of explanation. Thanks, I'll accept this answer, and edit my question to be more general. – Ben Jul 10 '14 at 17:27
  • Btw, why did you use `as->`? Wouldn't `->` have sufficed? – Ben Jul 10 '14 at 21:48
  • Never mind - I just noticed you have to change the threading location of the expr for the `(clojure.string/split x #"\.")` line. – Ben Jul 10 '14 at 21:50