Thursday 1 March 2012

Standard Library (4 of 5). XML and HTML Processing

In this post I show how Libretto handles XML and HTML documents – on both the server and client sides.

XML Document Model

The handling of XML structures is based on the following object model of an XML document:
package libretto/xml

class Node
class Namespace(var prefix: String, var uri: String)
class Named(var ns: Namespace, var name: String)
class Container(var contents: Node*)

// basic XML document components
class Text(var text: String) extends Node
class Attribute(var value: String) extends Named
class Element(var attributes: Attribute*) extends Named, Container, Node
class Comment(var comment: String) extends Node
class CDATA extends Text

// XML document 
class Document(root: Element) extends Container
A set of methods working with XML structures includes:
  • a: elem.a returns the sequence of attributes associated with the element elem
  • a: elem.a(key) returns the value of the attribute key of the element elem
  • add: elem.add(node) adds a node (an element, an attribute or a text) to the element or document elem
  • attr: attr(key, value) creates a new attribute
  • newDoc: newDoc creates the top node of an XML document
  • e: elem.e returns the sequence of direct subelements of the node elem in the standard order
  • e: elem.e(name) returns the sequence of all direct subelements with the name name
  • ee: elem.ee returns the sequence of all subelements of the element (at any depth)
  • ee: elem.ee(name) returns the sequence of all subelements of elem (at any depth) having the name name
  • elem: elem(name) creates a new element named name
  • xmlEsc: str.xmlEsc is the XML escaping function, for instance,
             "<a>b</a>".xmlEsc == "&lt;a&gt;b&lt;/a&gt;"
  • id: elem.id returns the id of the element, if it exists.
  • id: elem.id(idstring) looks in the context for the element with id == idstring.
  • text: text(str) creates a new text node containing the text str
  • t: elem.t returns the text from the direct text subnodes of the element elem
  • tt: elem.tt returns the text from all text subnodes of the element elem (at any depth).
These functions allow the programmer to work with XML documents in a style similar to that of XPath. For instance, the query “find the second step of cooking pasta” could be written in XPath as follows:

//ingredient[@name="pasta"]//preparation/step[position()=2]/text()

and in Libretto as follows:
cb.ee('ingredient)?[name=="pasta"].ee('preparation).e('step)(1).t
Here cb is the root object of the XML cookbook. Note that indexing in XPath starts from 1, and in Libretto from 0.
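For readers who do not have a Libretto environment at hand, the same navigation can be approximated in Python with the standard xml.etree.ElementTree module. This is only a rough analogue, not Libretto code: the cookbook document and the helper second_pasta_step are hypothetical, since the post does not show the actual XML. The sketch walks the tree step by step in the order of the Libretto path and then compares the result with the XPath form, which makes the 0-based versus 1-based indexing visible.

```python
import xml.etree.ElementTree as ET

# Hypothetical cookbook; the original post does not show the document itself.
COOKBOOK = """
<cookbook>
  <recipe>
    <ingredient name="pasta">
      <preparation>
        <step>Boil salted water.</step>
        <step>Cook the pasta until al dente.</step>
      </preparation>
    </ingredient>
    <ingredient name="sauce">
      <preparation>
        <step>Chop the tomatoes.</step>
        <step>Simmer for ten minutes.</step>
      </preparation>
    </ingredient>
  </recipe>
</cookbook>
""".strip()

cb = ET.fromstring(COOKBOOK)


def second_pasta_step(root):
    """Mirror the Libretto path one step at a time (hypothetical helper)."""
    found = []
    for ingredient in root.iter("ingredient"):              # ee('ingredient)
        if ingredient.get("name") != "pasta":               # ?[name == "pasta"]
            continue
        for preparation in ingredient.iter("preparation"):  # ee('preparation)
            steps = preparation.findall("step")             # e('step)
            if len(steps) > 1:
                found.append(steps[1].text)                 # (1).t, zero-based
    return found


steps = second_pasta_step(cb)

# The same query in ElementTree's XPath subset; positions are one-based here.
xpath_steps = [
    s.text
    for s in cb.findall(".//ingredient[@name='pasta']//preparation/step[2]")
]
```

Both steps and xpath_steps select the same step: index 1 in the manual walk corresponds to position 2 in the XPath expression.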

XML Notation

Libretto programs can contain XML segments in the standard notation, for instance,
fix xml = <memo type="city">Kiel</memo>
This XML expression is translated into the following executable Libretto code:
elem('memo).add(attr('type, "city")).add(text("Kiel"))
Libretto allows parametric XML expressions with parameters enclosed in curly brackets. This is an example of a parametric description:
{
  fix newattr = attr('season, "summer")
  fix cnts = ("is ", <c>Kiel</c>)

  fix xml = <memo type = {"ci" + "ty"} {newattr}>The city {cnts}</memo>

  xml.toXML
     //  <memo type="city" season="summer">The city is <c>Kiel</c></memo>
}
The expression assigned to the variable xml is translated into the following executable Libretto code:
elem('memo).add(attr('type, "ci" + "ty")).add(newattr).
                                  add(text("The city ")).add(cnts)
Three kinds of parameters are allowed in curly brackets:
  • an attribute value (the expression "ci" + "ty" in the example)
  • an attribute (the variable newattr)
  • the contents of an element (the variable cnts).
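The idea of turning an XML literal with parameters into a chain of builder calls can be illustrated outside Libretto as well. The following Python sketch uses the standard xml.etree.ElementTree module; it is an analogue under my own assumptions, not the code that the Libretto compiler generates. The function memo and its parameter names are hypothetical, and each of the three parameter kinds is marked in a comment.

```python
import xml.etree.ElementTree as ET


def memo(type_value, extra_attr, contents_text, contents_elem):
    """Build a <memo> element from parameters (hypothetical helper)."""
    el = ET.Element("memo")
    el.set("type", type_value)             # parameter kind 1: an attribute value
    name, value = extra_attr
    el.set(name, value)                    # parameter kind 2: a whole attribute
    el.text = "The city " + contents_text  # parameter kind 3: element contents
    el.append(contents_elem)               # (text followed by a sub-element)
    return el


c = ET.Element("c")
c.text = "Kiel"

xml = memo("ci" + "ty", ("season", "summer"), "is ", c)
out = ET.tostring(xml, encoding="unicode")
```

Here out holds the serialized element with both attributes and the mixed text and element contents.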
The programmer can define functions containing parametric XML structures:
def city(n) = <city>{n}</city>

<cities>{city("Kiel")} - {city("London")}</cities>.toXML
  //  <cities><city>Kiel</city> - <city>London</city></cities>

Handling HTML Documents

Libretto is designed for processing in distributed information environments. This is complemented by the ability to port Libretto programs to various computing platforms. In particular, a Libretto program can run on the JVM, or be translated to JavaScript and run in a thin client environment. There is also a sub-language of Libretto that can be translated to SQL for database manipulations. This technology allows the programmer to develop all components of compound, heterogeneous applications running in multiplatform environments.

The Libretto library, which supports HTML and thin client handling, is based on the Libretto DOM – a Libretto object model of an HTML document. Currently a large sub-language of Libretto can be translated to JavaScript with the exception of inheritance hierarchies and some other advanced components of the language.

HTML Document Object Model (Libretto DOM)

The Libretto DOM is based on the HTML 5 specification. The classes representing HTML elements have names corresponding to the names of HTML tags. For instance, the class META has the following definition:
class META extends Element, 
                   HasEventHandlerContentAttributes,
                   HasName, 
                   HasCharset {
  var content: String*
  var httpEquiv: String
}
Each class representing an element has fields corresponding to the attributes of this element. Attributes that are common to many elements are inherited from special Has-classes, which collect such attributes. Each element class contains
  • fields valid for all elements (id, class, hidden, etc.)
  • event fields valid for all elements except BODY (onclick, onblur, onfocus, etc.)
  • fields specific to a given element (for META this is content and httpEquiv).
The field style is associated with all element classes. This field contains an instance of the class Style, which defines the CSS style of the element. Style is based on the CSS 2.1 specification and defines fields corresponding to the standard CSS properties.

In addition, each element object has the field contents, which contains the sub-elements of the current element. Syntactic sugar based on & (see section Field contents and &) allows the programmer to work with element contents in a compact way. Text data is stored in objects of the class Text, which have the string field text.

Libretto DOM and LibrettoScript

Libretto programs working with the Libretto DOM are normal Libretto programs, which can be executed on all platforms supporting the language. But they are especially valuable in thin client environments, where only JavaScript is widely supported. Thus, to put the multiplatform idea into practice, a translator from a sub-language of Libretto to JavaScript has been implemented. This sub-language is called LibrettoScript. LibrettoScript allows the programmer to develop different components of web services within the same language, on both the server and client sides. The following components of Libretto are included in LibrettoScript (note that all examples below are also valid in the standard environment of Libretto).

Paths. In LibrettoScript, paths can be used. For instance, the following path counts the total number of rows in all tables of a document rooted in the object doc:
doc &&TR. size
Here doc is an object representing the root of some HTML document. Note that doc &&TABLE is equal to the sequence of all table elements in the document doc (at any depth, because && denotes the transitive closure of the field contents).

Predicates. In LibrettoScript, predicates are also supported, which behave as filters in object selection. For instance, the following query selects all paragraphs in English, which are sub-elements of an element block:
block &P ?[lang == "en"]
The following path counts the number of rows in a table with id equal to “myTableId”:
doc &&TABLE ?[id == "myTableId"]. &TBODY. &TR. size
Navigation with &. By using &, transitions from elements to their sub-elements are performed:
node &TR is syntactic sugar for node.contents?[TR]
&& collects all descendants of an element in the transitive closure of the field contents. This syntactic sugar can also be used in predicates. For instance, the following query selects all paragraphs, which are sub-elements of an element block, and contain at least one span element:
block &P ?[&SPAN]
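The difference between & (direct sub-elements) and && (sub-elements at any depth), and the use of predicates as filters, can be approximated in Python with the standard xml.dom.minidom module. This is a sketch under stated assumptions rather than a description of how LibrettoScript is implemented: the sample page and the helpers children and descendants are hypothetical, and lowercase tag names stand in for the Libretto DOM classes P, SPAN and TR.

```python
from xml.dom import minidom

# Hypothetical sample page (well-formed so that minidom can parse it).
PAGE = """<div id="block">
<p lang="en">Hello <span>world</span></p>
<p lang="de">Hallo</p>
<p lang="en">Bye</p>
<table id="myTableId"><tbody><tr><td>1</td></tr><tr><td>2</td></tr></tbody></table>
</div>"""


def children(node, tag):
    """Direct sub-elements named `tag` (roughly `node &TAG`)."""
    return [c for c in node.childNodes
            if c.nodeType == c.ELEMENT_NODE and c.tagName == tag]


def descendants(node, tag):
    """Sub-elements named `tag` at any depth (roughly `node &&TAG`)."""
    return list(node.getElementsByTagName(tag))


block = minidom.parseString(PAGE).documentElement

# block &P ?[lang == "en"]
english = [p for p in children(block, "p") if p.getAttribute("lang") == "en"]

# block &P ?[&SPAN]: a non-empty result inside the predicate counts as true
with_span = [p for p in children(block, "p") if children(p, "span")]

# block &&TR. size
row_count = len(descendants(block, "tr"))
```

In each case the list comprehension plays the role of the predicate, and the choice between children and descendants corresponds to the choice between & and &&.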
Blocks {...} are used in LibrettoScript for the modification of element properties. For instance, we can assign a hotkey:
doc &&BUTTON ?[id == "button1"] {accesskey = "s"}
or justify the text of all DIV elements:
doc&& DIV {align = "justify"}
or change the style properties of an element with id equal to “err”:
doc&& ?[id=="err"].style {color = "red"; fontSize = fontSize * 1.2}
If operator. In LibrettoScript, if-expressions can be used, for instance,
 
doc&& ?[id=="err"].style.
  if (color=="red") {color="green"; fontSize=10}
  else {color="red"; fontSize=20}
Sequence ordering. The sorting operator is also included in LibrettoScript. For instance, we can sort all unordered lists (ULs) of the document in the alphabetical order of their first items:
doc &&UL ^(&LI(0) &Text(0). text)
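The sort key in this query is the first text node of the first list item. A rough Python equivalent, again using xml.dom.minidom, is sketched below. The sample page and the helper first_item_text are hypothetical, and the sketch only produces a sorted sequence of list elements without reordering the document; whether the Libretto operator behaves the same way is an assumption on my part.

```python
from xml.dom import minidom

# Hypothetical sample page with three unordered lists.
PAGE = """<body>
<ul><li>pears</li><li>figs</li></ul>
<ul><li>apples</li></ul>
<ul><li>melons</li><li>kiwis</li></ul>
</body>"""


def first_item_text(ul):
    """Text of the first text node of the first <li> child (the sort key)."""
    first_li = [c for c in ul.childNodes
                if c.nodeType == c.ELEMENT_NODE and c.tagName == "li"][0]
    first_text = [c for c in first_li.childNodes
                  if c.nodeType == c.TEXT_NODE][0]
    return first_text.data


doc = minidom.parseString(PAGE)

# Sort the <ul> elements by their key; the document itself is left untouched.
sorted_lists = sorted(doc.getElementsByTagName("ul"), key=first_item_text)
keys = [first_item_text(ul) for ul in sorted_lists]
```

Note that this helper assumes every list has at least one item with text; a list without one would need extra handling.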
The context value $. The symbol $ can also be used for handling context values in paths. For instance, let us find in the text nodes the strings that contain the substring “hello”, and create a modified version of these strings:
doc &&Text. text ?[substring("hello")]. ($ + " and goodbye")
fix and var variables. As an example, let us collect in the variable divs all DIV elements of the document whose class field contains the substring “main”, and then remove all text sub-nodes from them:
{
  fix divs = doc &&DIV ?[`class`.substring("main")]
  divs &&Text --
}
Class definitions. In LibrettoScript, classes can be defined, but without inheritance:
class MyClass(myProp1: String) {
  myProp2: MyClass
}
We can also create instances of these classes, both declaratively and dynamically:
object myObj extends MyClass("string1")
var mc = MyClass("string2")
External fields on elements. It is allowed to create external fields defined on the elements of an HTML document. For instance,
var DIV idx: Int
doc&& DIV index i {idx = i} 
doc&& DIV ?[idx == 5]. ~contents& ?[idx == 5] --
defines a new external field idx, in which the indices of DIV nodes are stored. Then we delete node 5 from the document.

Functions. In LibrettoScript, both predefined functions (e.g., size or substring) and user-defined functions can be used. For instance, the following function counts the text nodes containing a given substring:
def countTxt(s: String) {
  doc &&Text ?[text.substring(s)]. size
}
Recursive functions can also be defined, for instance, a function that finds the paragraph node containing the maximum number of text subnodes:
def maxP(col) {
  if (P) {
    fix cc = &Text. size
    if (cc > col.count) col {el = this; count = cc}
  }
  &.maxP(col)
  col
}
This is the query:
fix res = doc. maxP(Any() # {var el; var count = -1 })
Data is collected in an object with two fields, which is created on the spot in a duck-typing style. The current maximum node is stored in el, and the current maximum number is saved in count. Note that the iterative tools of Libretto allow us to avoid the recursion (as in many other cases):
var col = Any() # {var el; var count = -1 }
doc&& P as p. {
  fix cc = & Text.size 
  if (cc > col.count) col {el = p; count = cc}
}. col 
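To make the comparison between the recursive and the iterative formulation concrete, here is a Python sketch of both, based on the standard xml.dom.minidom module. It is an analogue under my own assumptions, not a translation produced by any Libretto tool: the sample page and the function names are hypothetical, and a plain dictionary stands in for the object with the fields el and count that is created on the spot.

```python
from xml.dom import minidom

# Hypothetical sample page; the <br/> elements split the text into several nodes.
PAGE = """<body>
<p>one</p>
<div><p>a<br/>b<br/>c</p></div>
<p>x<br/>y</p>
</body>"""


def text_child_count(elem):
    """Number of direct text sub-nodes (roughly `&Text. size`)."""
    return sum(1 for c in elem.childNodes if c.nodeType == c.TEXT_NODE)


def max_p_recursive(node, best):
    """Depth-first walk; `best` plays the role of the collector `col`."""
    if node.nodeType == node.ELEMENT_NODE and node.tagName == "p":
        count = text_child_count(node)
        if count > best["count"]:
            best["el"], best["count"] = node, count
    for child in node.childNodes:
        max_p_recursive(child, best)
    return best


def max_p_iterative(doc):
    """The same search as a flat loop over all <p> elements."""
    best = {"el": None, "count": -1}
    for p in doc.getElementsByTagName("p"):
        count = text_child_count(p)
        if count > best["count"]:
            best["el"], best["count"] = p, count
    return best


doc = minidom.parseString(PAGE)
rec = max_p_recursive(doc.documentElement, {"el": None, "count": -1})
it = max_p_iterative(doc)
```

Both versions visit the paragraphs in document order and use a strict comparison, so they select the same paragraph when several share the maximum.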
Anonymous functions. In LibrettoScript, anonymous functions can be defined. They are used mainly for implementing dynamic features in HTML documents (like the anonymous functions function(...) {...} in JavaScript), for instance,
window {onload = %{&&A {`class` += "green"}}}
Assignment operators. In LibrettoScript, all assignment operators of Libretto are defined:
  • The block assignments =, .=, +=, and the deletion operator --. For instance, let us add an ellipsis to the text of the first paragraph containing the substring “hello”:
    def hasTxt(str) = &&Text.text.substring(str)
    
    doc &&P ?[hasTxt("hello")](0)& += Text("...")
    

  • The operators as and index. For instance, let us add to every P sub-element of a DIV element the class attribute of its parent DIV:
    doc&& DIV as d. &P {`class` += d.`class`} 
Boolean and arithmetic operators. In LibrettoScript, the standard arithmetic operators +, -, *, div, mod, the relations ==, !=, <, <=, >, >=, eq and the boolean connectives and, or, not are defined. The following query finds in a document all nodes with the attribute class equal to “main”, and containing more than one table (table sub-elements):
doc&& ?[`class` == "main" and &TABLE.size > 1]
Sequences. In LibrettoScript, sequences can be used explicitly. For instance, the following query finds empty paragraphs and inserts the text Hello, <br/> world! into them:
doc. {&&P ?[not &] & = (Text("Hello, "), BR(), Text("world!"))}
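Assigning a sequence of nodes to the contents of an element has a straightforward counterpart in conventional DOM programming. The following Python sketch with xml.dom.minidom shows one way to do it; the sample document is hypothetical, and the code is meant as an illustration of the idea rather than as the output of the LibrettoScript translator.

```python
from xml.dom import minidom

# Hypothetical document with two empty paragraphs and one non-empty paragraph.
doc = minidom.parseString("<body><p/><p>kept</p><p></p></body>")

for p in doc.getElementsByTagName("p"):
    if not p.childNodes:  # roughly ?[not &]: the paragraph has no contents
        # The sequence (Text("Hello, "), BR(), Text("world!"))
        for node in (doc.createTextNode("Hello, "),
                     doc.createElement("br"),
                     doc.createTextNode("world!")):
            p.appendChild(node)

result = doc.documentElement.toxml()
```

Only the two empty paragraphs receive the new contents; the paragraph that already had text is left as it was.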
Comments. In LibrettoScript, the comments // and /*…*/ can also be used.
