How to represent a simple document as an s-exp?

Question

I'm trying to understand how to express a simple document as an s-expression. Here's what I mean. Let's say I have this simple HTML structure:

<h1>Document Title</h1>
<p>Paragraph with some text.</p>
<p>Paragraph with some <strong>bold</strong> text.</p>

Also, let's assume that I'm okay with losing the original tag provenance and just want to preserve the structure. How could this be expressed as an s-expression? My initial try (using Clojure) looks like this, but I'm not sure that it is correct:

(def sexp-doc '("Document Title"
                ("" ())
                ("This is a paragraph with some text." ())
                ("" ())
                ("This is a paragraph with" ("bold" ()) ("text." ()))))

score 3 · Accepted Answer · answered Feb 14 '17 at 17:00

I would recommend using Hiccup's syntax for cases like this:

(require '[clojure.string :as str]
         '[hiccup.core :as hiccup])

(def document
  [[:h1 "Document Title"]
   [:p "Paragraph with some text."]
   [:p "Paragraph with some " [:strong "bold"] " text."]])

(println (str/join "\n" (map #(hiccup/html %) document)))
;; <h1>Document Title</h1>
;; <p>Paragraph with some text.</p>
;; <p>Paragraph with some <strong>bold</strong> text.</p>
;;=> nil

If you don't need to convert back to an HTML string, then obviously you don't need the Hiccup dependency; I simply put it here to demonstrate that each of those three vectors is valid Hiccup.

Since this syntax uses vectors instead of lists, you don't need to quote things or use the list function directly, which gives you a couple of advantages:

- If you quote things, you can't call functions to construct inner forms in "Hiccup literals".
- If you have to call the list function for each form, it gets crowded and hard to read.
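
To make the first point concrete, here is a small sketch in plain Clojure (no Hiccup needed). `shout` is a hypothetical helper introduced only for illustration; the point is just that a quoted list keeps the call as data, while a vector literal evaluates it.

```clojure
(require '[clojure.string :as str])

;; Hypothetical helper, used only to show a function call inside a form.
(defn shout [s] (str/upper-case s))

;; Quoted list: nothing inside is evaluated, so the call stays as data.
(def quoted '(p "Some " (shout "bold") " text."))

;; Vector literal: each element is evaluated, so the call runs.
(def hiccup-style [:p "Some " (shout "bold") " text."])

(prn (nth quoted 2))       ; the unevaluated list (shout "bold")
(prn (nth hiccup-style 2)) ; the string "BOLD"
```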

And if you want to come up with your own tags to use instead of the existing HTML tags, there's nothing stopping you from doing that within Hiccup's syntax as well.
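
As a rough sketch of what that could look like, the document below uses made-up tags instead of HTML ones. The tag names (`:doc`, `:heading`, `:para`, `:bold`) and the helpers `tag` and `children` are assumptions chosen for illustration, not part of Hiccup.

```clojure
;; Hypothetical tag names; none of these are HTML elements.
(def custom-doc
  [:doc
   [:heading "Document Title"]
   [:para "Paragraph with some text."]
   [:para "Paragraph with some " [:bold "bold"] " text."]])

;; The document is ordinary data, so ordinary sequence functions work on it.
(defn tag [node] (first node))
(defn children [node] (rest node))

(prn (map tag (children custom-doc)))
;; (:heading :para :para)
```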

score 1 · Answer 2 · answered Feb 14 '17 at 09:51

S-expressions are trees, so the example below can serve as a representation of an HTML document:

'(html (head (title "some title") (meta "some meta"))
       (body (h1 "Hello, World!") (p "This is the" (strong "body") "text")))

Attributes can be implemented by giving every element a first child, tagged attributes, that holds them:

'(p (attributes
     (attribute (name "style")
                (value "margin: 10px;"))
     (attribute (name "title")
                (value "Ingress")))
    "Once upon a time ....")

It's not pretty, since the attributes really represent one level of key-value data for each tag, and that data needs to be structured somehow. I think at one point W3C actually suggested something like this, but it makes the document much more complex.
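
One lighter alternative, and the convention Hiccup uses, is to put the attributes in a map right after the tag. The sketch below is plain Clojure and only an illustration; `attrs` and `body` are hypothetical helper names.

```clojure
;; Attributes as a map in the second position.
(def para
  [:p {:style "margin: 10px;" :title "Ingress"}
   "Once upon a time ...."])

;; Returns the attribute map, or an empty map if the element has none.
(defn attrs [[_tag maybe-attrs]]
  (if (map? maybe-attrs) maybe-attrs {}))

;; Returns the children, skipping the attribute map when present.
(defn body [[_tag & more]]
  (if (map? (first more)) (rest more) more))

(prn (:title (attrs para))) ; "Ingress"
(prn (body para))           ; ("Once upon a time ....")
```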

Sylwester

What I'm actually trying to do is represent the document content just in terms of a single tag, and then store the HTML data (presuming there is some) in a metadata attribute. But I'm still figuring out how this would look. – fraxture Feb 14 '17 at 09:55
@fraxture With just a single tag you won't preserve the document structure. You don't need to model your structure to mimic HTML completely, just to the level of detail you need, and you can embed logic in the order. E.g. allowing only one headline followed by paragraphs can be done with a list of text with no tags whatsoever, since you can recreate the HTML based on your own rules. – Sylwester Feb 14 '17 at 10:01
I guess by structure I just mean that: header, maybe section, and then paragraphs. What you say here: "allowing only one headline then paragraphs can be done with a list of text with no tags whatsoever" seems to be the direction I want to go. But I'm not sure what this might look like. Was my example above flawed? If so, any chance you could give me an alternate example? – fraxture Feb 14 '17 at 10:10
@fraxture Not at all, but I didn't understand what the empty lists and the lists with empty strings were for. A minimalistic version of your document IMO would be `'("Document Title" ("This is a paragraph with some text.") ("This is a paragraph with " (b "bold") " text."))`. The self-defined rules in this example are that only top-level lists are paragraphs and that you can only have one heading (the first element). The lists under those are styling or other supported structure, like the `b` tag. – Sylwester Feb 14 '17 at 10:54
Thanks for that example. This is very helpful. A few more questions -- clearing out the confusions in my mind. The "rules" you refer to here are your imposed rules, right? Assumptions about the structure you provided? What do you mean by "top element lists"? And this `b` tag, again, is a tag that you're saying would be "supported" in your rules, rather than something that's part of sexps, right? – fraxture Feb 14 '17 at 11:02
Imposed, yes. When you remove information, you assume structure. The top-level elements are the lists directly under the containing list, so not the list with the `b` symbol. The tags differentiate what the lists represent; in my example I used the same name as an HTML tag, but you are free to choose which tag represents which modifier. Imagine you need a table in your document: you need a tag to indicate that, and other tags under it for the rows, columns, etc. You need enough information to be able to parse it and see that it's a table. The whole thing is the s-expression. – Sylwester Feb 14 '17 at 11:25
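
The imposed rules from the comments above (the first element is the heading, every other top-level list is a paragraph, nested lists carry a tag such as `b`) can be sketched as a small renderer. This is only an illustration under those rules: `render-inline` and `render-doc` are hypothetical names, and no HTML escaping is done.

```clojure
(require '[clojure.string :as str])

(def minimal-doc
  '("Document Title"
    ("This is a paragraph with some text.")
    ("This is a paragraph with " (b "bold") " text.")))

;; A string is emitted as-is; a tagged list such as (b "bold")
;; becomes an element named after its first symbol.
(defn render-inline [node]
  (if (string? node)
    node
    (let [[tag & children] node]
      (str "<" tag ">" (str/join (map render-inline children)) "</" tag ">"))))

;; Rule 1: the first element is the heading.
;; Rule 2: each remaining top-level list is a paragraph.
(defn render-doc [[title & paragraphs]]
  (str/join "\n"
            (cons (str "<h1>" title "</h1>")
                  (for [p paragraphs]
                    (str "<p>" (str/join (map render-inline p)) "</p>")))))

(println (render-doc minimal-doc))
```

Because the structure is implied by position, the data itself carries no `h1` or `p` markers; they are reintroduced only by the rendering rules.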
