Commit 899aa0de authored by Pietro Abate

[r2005-07-30 14:11:51 by afrisch] Empty log message

Original author: afrisch
Date: 2005-07-30 14:11:51+00:00
parent 52c8e1a8
<?xml version="1.0" encoding="ISO-8859-1" standalone="yes"?>
<page name="ocaml">
<title>OCaml + CDuce</title>
<title>OCamlDuce</title>
<left>
<local-links href="index,documentation"/>
<p>On this page:</p>
<boxes-toc/>
</left>
<box>
<p>
OCaml+CDuce is a modified version of OCaml 3.08.3 with CDuce
extensions (expressions, types, patterns).
OCamlDuce is a merger between <a
href="http://caml.inria.fr/">OCaml</a> and
<local href="index">CDuce</local>. It comes as a modified
version of OCaml which integrates CDuce features: expressions, types,
patterns.
</p>
<p>
@@ -22,18 +27,25 @@ simple examples. There is also an ocamldoc-generated documentation
for the <a href="http://pauillac.inria.fr/~frisch/ocamlcduce/doc/">support library</a>.
</p>
</box>
<box title="Download and installation" link="install">
<p>
The package contains a bootstrapped compiler. The build procedure is
the same as for OCaml (<tt>configure, make world, make install</tt>).
The build procedure for OCamlDuce is exactly the same as for OCaml:
<tt>configure, make world, make install</tt>. The names of the tools
are unchanged: <tt>ocaml,ocamlc,ocamlopt</tt>. Currently, OCamlDuce
is based on CVS snapshots of OCaml (between 3.08.3 and the current
<tt>release308</tt> branch) and CDuce (between 0.3.91 and the head).
</p>
<ul>
<li><a
href="http://pauillac.inria.fr/~frisch/ocamlcduce/download/cduce-ocaml-0.0.5.tar.gz">Compiler,
version 0.0.5</a></li>
<li><a
<!--<li><a
href="http://pauillac.inria.fr/~frisch/ocamlcduce/download/xml-support-0.0.4.tar.gz">Support
library, version 0.0.4</a></li>
library, version 0.0.4</a></li>-->
</ul>
<p>
@@ -45,12 +57,637 @@ GODI_BUILD_SITES += http://pauillac.inria.fr/~frisch/ocamlcduce/godi
</sample>
<p>
and by forcing a recompilation of the <tt>godi-ocaml-src</tt>
and <tt>godi-ocaml</tt> packages. They should also build
the <tt>godi-xml-support</tt> library.
and <tt>godi-ocaml</tt> packages. <!--They should also build
the <tt>godi-xml-support</tt> library.-->
</p>
<!--
<p>
Some simple examples can be found <a -->
<!--href="http://pauillac.inria.fr/~frisch/ocamlcduce/tests/">here</a>.</p>
-->
</box>
<box title="Overview" link="overview">
<p>
In a nutshell, OCamlDuce extends OCaml with a new kind of values
(<em>x-values</em>) which represent XML documents, fragments, tags, and Unicode
strings. To describe these values, it also extends the type algebra
with so-called <em>x-types</em>. The philosophy behind these types is that they
represent <em>sets of x-values</em>. They can be very precise: indeed,
each value can be seen as a singleton type (a set with a single
value), and it is possible to form Boolean combinations of x-types
(intersection, union, difference).
</p>
<p>
OCamlDuce's type system can be understood as a refinement of OCaml.
For each sub-expression which is inferred to be of the x-kind (using
OCaml's unification-based type system), OCamlDuce will try to infer the
best possible sound x-type. Here, best means smallest for the natural
subtyping relation (set inclusion). The inference algorithm is
actually a data-flow analysis: the x-type will collect all the values
that can be produced by the expression, considering all the possible
data flows in the program. It is sometimes necessary to provide
explicit type annotations to help the type checker infer this type, in
particular when you define recursive functions or when you use
iterators.
</p>
<p>
Subtyping is implicit for x-types: if an expression is inferred to be
of x-type <code>t</code>, which is a subtype of <code>s</code>, then
it is possible to use this expression in any context which expects a
value of type <code>s</code>.
</p>
</box>
<box title="Getting started" link="start">
<p>
Most of the new language features are enclosed within double curly braces
<code>{{ON}}{{...}}</code>. For instance, the following code sample
defines a value <code>x</code> as an XML element (with tag
<code>a</code>, an attribute <code>href</code>, and a simple
string as content):
</p>
<sample><![CDATA[{{ON}}
# let x = {{ <a href="http://www.cduce.org">['CDuce'] }};;
val x : {{<a href=[ 'http://www.cduce.org' ]>[ 'CDuce' ]}} =
{{<a href="http://www.cduce.org">[ 'CDuce' ]}}
]]></sample>
<p>
What appears between the curly braces is called an x-expression.
Similarly, there are x-types (as seen above), and also x-patterns.
The delimiters <code>{{ON}}{{...}}</code> are only used
for syntactical reasons, to avoid clashes between OCaml and CDuce
syntaxes and lexical conventions. As a matter of fact,
an OCaml expression need not be a syntactical x-expression
(delimited by double curly braces) to evaluate to an x-value.
For instance, once <code>x</code> has been declared as above,
the expression <code>x</code> evaluates to an x-value.
</p>
<p>
It is possible to use an arbitrary
OCaml expression as part of an x-expression: it must simply be
protected by a new pair of double curly braces. For instance, there is
no <code>if-then-else</code> construction for x-expressions, but you
can write:
</p>
<sample><![CDATA[{{ON}}
# {{ <a href={{if true then {{"a"}} else {{"z"}}}}>[] }};;
- : {{<a href=[ 'a' | 'z' ]>[ ]}} = {{<a href="a">[ ]}}
]]></sample>
<p>
Only the highlighted parts are parsed as x-expressions. The
<code>if-then-else</code> sub-expression is parsed as an OCaml
expression, but its type is an x-type (namely <code>{{ON}}{{[ 'a' |
'z' ]}}</code>).
</p>
</box>
<box title="X-values" link="values">
<p>
X-values are intended to represent XML documents and fragments
thereof: elements, tags, text, sequences. In this section, we
present the x-value algebra, the syntax of the corresponding
x-expression constructors and the associated x-types.
</p>
<p>
There are three kinds of atomic x-values:
</p>
<ul>
<li>Unicode characters;</li>
<li>qualified names;</li>
<li>arbitrarily large integers.</li>
</ul>
<section title="Characters">
<p>
X-characters are different from OCaml characters. They can represent
the range of Unicode codepoints defined in the XML specification.
Character literals are delimited by single quotes. The escape
sequences \n, \r, \t, \b, \', \&quot;, \\ are recognized as usual. The
numerical escape sequences are written <code>\n;</code> where n is an integer
literal (note the extra semicolon). The source code is interpreted as
being encoded in iso-8859-1. As a consequence, Unicode characters which are not
part of the Latin1 character set must be introduced with this
numerical escape mechanism. The x-types for x-characters are:
</p>
<ul>
<li>singletons;</li>
<li>intervals, written <code>c -- d</code>, where <code>c</code> and
<code>d</code> are literals (example: <code>{{ON}}type t = {{ 'a'--'z'
}}</code>);</li>
<li>the type of all x-characters, written <code>Char</code>;</li>
<li>the type of all Latin1 characters, written <code>Latin1Char</code>
(defined as <code>\0; -- \255;</code>).</li>
</ul>
</section>
<section title="Integers">
<p>
X-integers are arbitrarily large. Literals must be written in decimal.
Negative literals must be in parentheses, e.g. <code>(-3)</code>.
The x-types for x-integers are:
</p>
<ul>
<li>singletons;</li>
<li>intervals, written <code>i -- j</code>, where <code>i</code> and
<code>j</code> are literals (example: <code>{{ON}}type t = {{ 10--20
}}</code>); it is possible to replace <code>i</code> or <code>j</code>
with <code>**</code> to define open-ended intervals, e.g.
<code>{{ON}}type pos = {{ 1 -- ** }}</code>;
</li>
<li>the type of all x-integers, written <code>Int</code>;</li>
<li>the type of all the integers which can be represented by a
signed 32 (resp. 64) bit machine word, written <code>Int32</code> (resp.
<code>Int64</code>).</li>
</ul>
</section>
<section title="Qualified names">
<p>
Qualified names are intended to represent XML tag names. Conceptually,
they are made of a namespace URI and a local name. Since URIs tend
to be long, literals are of the form <code>`prefix:local</code>
where <code>local</code> is the local name and <code>prefix</code>
is a <em>namespace prefix</em> bound to some URI (in the scope of the
literal). The local name follows the definitions from
the XML Namespaces specification; a dot character must be protected
by a backslash and non-Latin1 characters are written as character
literals <code>\n;</code>. <a href="#ns">See below</a> for an
explanation of how to bind prefixes to URIs. To refer
to the default namespace (or the absence of namespace if no default
has been defined), the syntax is simply <code>`local</code>.
The x-types for qualified names are:
</p>
<ul>
<li>singletons;</li>
<li>the type of all qualified names, written <code>Atom</code>;</li>
<li>the type of all qualified names from a specified namespace,
written <code>`ns:*</code>.</li>
</ul>
</section>
<section title="Records">
<p>
X-records are mainly used to represent the set of attributes of an XML
element. An x-record is a binding from a finite set of <em>labels</em>
to x-values. Labels follow the same syntax as for qualified names
without the leading backquote. However, if the namespace prefix is not
given, the default namespace does not apply (the namespace URI is
empty). The syntax for record x-expressions is <code> { l1=e1
... ln=en }</code> where the <code>li</code> are labels and the
<code>ei</code> are x-expressions. Fields can also be separated with a
semi-colon. It is legal to omit the expression for a field; the label is then
taken as the content of the field (a value with this name must be
defined in the current scope), e.g.: <code>{{ON}}let x = ... and y = ...
in {{ {x y z=3} }}</code> is equivalent to <code>{{ON}}let x = ... and
y = ... in {{ {x=x y=y z=3} }}</code>. The types for x-records specify
which labels are authorized/mandatory, and what the types of the
corresponding fields are. There are two kinds of record x-types:
</p>
<ul>
<li>
Closed record types, which only allow a finite number of fields:
<code>{ l1=t1 ... ln=tn }</code>;
</li>
<li>
Open record types, which allow additional fields (with arbitrary
type):
<code>{ l1=t1 ... ln=tn .. }</code> (the final two dots are
part of the syntax).
</li>
</ul>
<p>
In both cases, it is possible to make one of
the fields optional by changing = to =?.
</p>
<p>
The x-type of all x-records is thus <code>{ .. }</code>,
and the x-type of x-records with maybe a field <code>l</code>
of type <code>Int</code> and maybe arbitrary other fields is
<code>{ l=?Int .. }</code>.
</p>
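<p>
The following sketch puts closed, open and optional fields together.
The type names (<code>attrs</code>, <code>with_id</code>) and the
labels are made up for illustration.
</p>
<sample><![CDATA[{{ON}}
(* closed record type: href is mandatory, title is optional *)
type attrs = {{ { href=String title=?String } }}

let a : attrs = {{ { href="http://www.cduce.org" } }}
let b : attrs = {{ { href="http://www.cduce.org" title="CDuce" } }}

(* open record type: id is mandatory, any other field is allowed *)
type with_id = {{ { id=String .. } }}

let c : with_id = {{ { id="x1" lang="en" } }}
]]></sample>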
</section>
<section title="Sequences">
<p>
X-sequences are finite and ordered collections of x-values.
The syntax for a sequence x-expression is
<code>[ e1 ... en ]</code> (note that elements are <em>not</em> separated
by semicolons as in OCaml lists). Each item <code>ei</code>
can either be:
</p>
<ul>
<li>an x-expression;</li>
<li><code>!e</code> where <code>e</code> is an x-expression which
evaluates to a sequence (whose content is inserted in the sequence
which is currently defined); e.g.
<code>let x = [ 2 3 ] in [ 1 !x 4 ]</code> is equivalent to
<code>[ 1 2 3 4 ]</code>;</li>
<li>a string literal delimited by single quotes; e.g.
<code>[ 'abc' ]</code> is equivalent to <code>[ 'a' 'b' 'c' ]</code>.</li>
</ul>
<p>
X-types for sequences are of the form <code>[R]</code>
where <code>R</code> is a regular expression over x-types which
describes the possible contents of the sequences. The possible
forms of regular expressions are:
</p>
<ul>
<li><code>t</code> (one single element of x-type <code>t</code>)</li>
<li><code>R*</code> (zero or more repetitions)</li>
<li><code>R+</code> (one or more repetitions)</li>
<li><code>R?</code> (zero or one repetition)</li>
<li><code>R1 R2</code> (sequence)</li>
<li><code>R1|R2</code> (alternation)</li>
<li><code>(R)</code></li>
<li><code>/t</code> (guard: the tail of the sequence must comply with
<code>t</code>).</li>
<li><code>PCDATA</code> (equivalent to Char*).</li>
</ul>
<note>Sequences are actually encoded with embedded pairs and a
terminator, and sequence types are encoded with product types and
recursive types. The encoding is available to the programmer
but not described in this manual.
</note>
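<p>
A short sketch of sequence expressions and sequence types, using only
the constructions listed above. All identifiers are illustrative.
</p>
<sample><![CDATA[{{ON}}
(* one or more integers followed by any number of characters *)
type t = {{ [ Int+ Char* ] }}
let s : t = {{ [ 1 2 'ab' ] }}

(* !xs splices the content of xs into the enclosing sequence *)
let xs = {{ [ 2 3 ] }}
let ys = {{ [ 1 !xs 4 ] }}

(* repetition of a two-element group *)
type pairs = {{ [ (Int Char)* ] }}
let p : pairs = {{ [ 1 'a' 2 'b' ] }}
]]></sample>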
</section>
<section title="Strings">
<p>
Strings are nothing but sequences of characters. There are two
predefined types <code>String</code> and <code>Latin1</code>
(defined as <code>[ Char* ]</code> and <code>[ Latin1Char* ]</code>).
</p>
<p>
A string literal <code>[ '...' ]</code> can also be written
<code>"..." </code> (without the square brackets). Note that single
(resp. double) quotes need to be escaped only when the string is
delimited with single (resp. double) quotes.
</p>
</section>
<section title="XML elements">
<p>
An XML element is a triple of x-values. The syntax for
the corresponding x-expression constructor is
<code><![CDATA[<(e1) (e2)>e3]]></code>. When <code>e1</code> is a
qualified name literal, it is possible to omit the leading
backquote and the surrounding parentheses. Similarly,
when <code>e2</code> is an x-record literal, it is possible
to omit the curly braces and the parentheses. For instance,
one can simply write <code><![CDATA[<a href="abc">['def']]]></code>
instead of <code><![CDATA[<(`a) ({href="abc"})>['def']]]></code>.
</p>
<p>
XML element x-types are written <code><![CDATA[<(t1) (t2)>t3]]></code>,
and the same simplifications apply. For instance, if
the namespace prefix <code>ns</code> has been defined,
the following is a legal x-type <code><![CDATA[<ns:* ..>[]]]></code>;
it describes XML elements whose tag is in the namespace bound to
<code>ns</code>, with an empty content, and with an arbitrary set of
attributes. An underscore in place of <code>(t1)</code> is
equivalent to <code>(Atom)</code> (any tag).
</p>
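<p>
The sketch below shows the short and the fully parenthesized notations
side by side, together with an element x-type that uses the underscore
and an open attribute record. The names <code>link</code>,
<code>any_empty</code>, <code>l1</code>, <code>l2</code> and
<code>e</code> are illustrative.
</p>
<sample><![CDATA[{{ON}}
(* an <a> element with a mandatory href attribute and text content *)
type link = {{ <a href=String>[ PCDATA ] }}

(* short notation *)
let l1 : link = {{ <a href="abc">[ 'def' ] }}

(* same element, with explicit tag and attribute record *)
let l2 : link = {{ <(`a) ({href="abc"})>[ 'def' ] }}

(* any tag, any attributes, empty content *)
type any_empty = {{ <_ ..>[] }}
let e : any_empty = {{ <br>[] }}
]]></sample>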
</section>
</box>
<box title="X-expressions" link="expr">
<p>
In the previous section, we have seen the syntax for x-values
constructors (constant literals, sequence, record, element constructors).
In this section, we describe the other kinds of x-expressions.
</p>
<section title="Binary infix operators">
<p>
Some simple examples can be found <a href="http://pauillac.inria.fr/~frisch/ocamlcduce/tests/">here</a>.</p>
The arithmetic operators on integers follow the usual precedence.
They are written <code>+,*,-,div,mod</code> (they are all infix).
</p>
<p>
Record concatenation: <code>e1 ++ e2</code>. The x-expressions
<code>e1</code> and <code>e2</code> must evaluate to x-records.
The result is obtained by concatenating them. If a field with the same
label is present in both records, the right-most one is selected.
</p>
<p>
Sequence concatenation: <code>e1 @ e2</code>, equivalent
to <code>[!e1 !e2]</code>.
</p>
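<p>
A few illustrative uses of these operators (the bound names are
arbitrary). Following the rule stated above, the record <code>r</code>
is expected to take its field <code>b</code> from the right-hand
operand.
</p>
<sample><![CDATA[{{ON}}
(* integer arithmetic *)
let n = {{ (1 + 2) * 3 }}
let m = {{ 7 mod 3 }}

(* record concatenation: on a clash, the right-most field is kept *)
let r = {{ {a=1 b=2} ++ {b=3 c=4} }}

(* sequence concatenation, same as [ !e1 !e2 ] *)
let s = {{ [ 1 2 ] @ [ 3 4 ] }}
]]></sample>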
</section>
<section title="Projections, filtering">
<p>
If the x-expression <code>e</code> evaluates to a record or an XML
element, the construction <code>e.l</code> will extract the value of
field or attribute <code>l</code>. Similarly, the construction
<code>e.?l</code> will extract the value of field or attribute
<code>l</code> if present, and return the empty sequence
<code>[]</code> otherwise.
</p>
<p>
If the x-expression <code>e</code> evaluates to a record,
the construction <code>e -. l</code> will produce a new record
where the field <code>l</code> has been removed (if present).
</p>
<p>
If the x-expression <code>e</code> evaluates to an x-sequence,
the construction <code>e/</code> will result in a new x-sequence
obtained by taking in order all the children of the XML elements
from the sequence <code>e</code>. For instance, the x-expression
<code><![CDATA[[<a>[ 1 2 3 ] 4 5 <b>[ 6 7 8 ] ]/]]></code>
evaluates to the x-value <code>[ 1 2 3 6 7 8 ]</code>.
</p>
<p>
If the x-expression <code>e</code> evaluates to an x-sequence,
the construction <code>e.(t)</code> (where <code>t</code> is an
x-type) will result in a new x-sequence
obtained by filtering <code>e</code> to keep only the elements
of type <code>t</code>. For instance, the x-expression
<code><![CDATA[[<a>[ 1 2 3 ] 4 5 <b>[ 6 7 8 ] ].(Int)]]></code>
evaluates to the x-value <code>[ 4 5 ]</code>.
</p>
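<p>
The constructions of this section can be combined as in the following
sketch. The names are illustrative, and the comments state the results
one would expect from the rules above rather than actual toplevel
output.
</p>
<sample><![CDATA[{{ON}}
let e = {{ <a href="abc">[ 1 2 3 ] }}

(* attribute extraction *)
let h = {{ e.href }}

(* optional extraction: no title attribute, so [] is expected *)
let t = {{ e.?title }}

(* field removal on a record *)
let r = {{ {a=1 b=2} -. a }}

(* children of the XML elements of the sequence: [ 1 2 3 ] expected *)
let children = {{ [ e 4 5 ]/ }}

(* filtering by x-type: [ 4 5 ] expected *)
let ints = {{ [ e 4 5 ].(Int) }}
]]></sample>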
</section>
<section title="Dynamic type checking">
<p>
If <code>e</code> is an x-expression and <code>t</code> is an x-type,
the construction <code>(e :? t)</code> returns the same
result as <code>e</code> if it has type <code>t</code>, and otherwise
raises a <code>Failure</code> exception whose argument explains
why this is not the case.
</p>
<sample><![CDATA[{{ON}}
# let f (x : {{ Any }}) = {{ (x :? <a>[ Int* ] ) }} in
f {{ <a>[ 1 2 '3' ] }};;
Exception:
Failure
"Value <a>[ 1 2 '3' ] does not match type <a>[ Int* ]\nValue '3' does not match type Int\n".
]]></sample>
</section>
<section title="Pattern matching">
<p>
OCamlDuce comes with a powerful pattern matching operation.
X-patterns are described <a href="#patterns">below</a>.
The syntax for the pattern matching operation is:
<code>match e with p1 -> e1 | ... | pn -> en</code>.
The type system ensures exhaustivity for the pattern matching
and infers precise types for the capture variables.
It is also possible to use x-pattern matching as a regular
OCaml expression; x-patterns must be surrounded by {{..}}, e.g.:
<code>match e with {{p1}} -> e1 | ... | {{pn}} -> en</code> or
<code>function {{p1}} -> e1 | ... | {{pn}} -> en</code>.
</p>
<note>
Currently, it is impossible to mix normal OCaml patterns and x-patterns
in a single pattern matching.
</note>
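<p>
As a sketch of x-pattern matching used from regular OCaml code (the
function names <code>classify</code> and <code>href_of</code> are
hypothetical), the first function dispatches on the x-type of its
argument and the second one uses a capture variable:
</p>
<sample><![CDATA[{{ON}}
(* the two x-types cover Int | String, so the matching is exhaustive *)
let classify (v : {{ Int | String }}) =
  match v with
  | {{ Int }} -> "integer"
  | {{ String }} -> "string"

(* h captures the href attribute when it is present *)
let href_of (e : {{ <a href=?String>[ Any* ] }}) : {{ String }} =
  match e with
  | {{ <a href=h>_ }} -> h
  | {{ _ }} -> {{ "" }}
]]></sample>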
</section>
<section title="Local binding">
<p>
The x-expression <code>let p=e1 in e2</code> is equivalent to
<code>match e1 with p -> e2</code>. There is also a local binding
with an x-pattern in OCaml expressions: <code>let {{p}}=e1 in
e2</code>.
</p>
</section>
<section title="Iterators">
<p>
OCamlDuce comes with a sequence iterator
<code>map e with p1 -> e1 | ... | pn -> en</code> and
a tree iterator
<code>map* e with p1 -> e1 | ... | pn -> en</code>.
</p>
<p>
For both constructions, the argument must evaluate to a sequence.
The <code>map</code> iterator applies the patterns to each element
of this sequence in turn and produces a new sequence by concatenating
all the results (all the right-hand sides must thus produce a
sequence). The set of patterns must be exhaustive for all the possible
elements of the input sequence.
</p>
<p>
The tree iterator is similar except that the patterns need not be
exhaustive. If some element of the input sequence is not matched,
it is simply copied into the result unless it is an XML element. In
this case, the transformation is applied recursively to its content.
</p>
</section>
<section title="OCaml constructions">
<p>
As a convenience, some of the OCaml expression constructors
are allowed as x-expressions (without a need to go back to OCaml
with double curly braces): (unqualified) value identifiers and
function calls.
</p>
</section>
</box>
<box title="More on x-types" link="types">
<p>
We have seen how to write simple x-types. We can then combine
them with Boolean connectives:
</p>
<ul>
<li><code>t1 &amp; t2</code>: intersection;</li>
<li><code>t1 | t2</code>: union;</li>
<li><code>t1 - t2</code>: difference.</li>
</ul>
<p>
The empty x-type is written <code>Empty</code> (it contains no value),
and the universal x-type is written <code>Any</code> (it contains
all the x-values) or <code>_</code>.
</p>
<p>
When an x-type has been bound to some OCaml identifier
(<code>{{ON}}type t = {{...}}</code>), it is possible to use
this identifier in another x-type. Recursive definitions
are allowed:
</p>
<sample><![CDATA[{{ON}}
type t1 = {{ <a>[ t2* ] }}
and t2 = {{ <b>[ t1* ] }}
]]></sample>
<p>
Note that x-values are always finite and acyclic. The type checker
detects type definitions which would yield empty types:
</p>
<sample><![CDATA[{{ON}}
# type t = {{ <a>[ t+ ] }};;
This definition yields an empty type
]]></sample>
<p>
If <code>t1</code> and <code>t2</code> are record x-types,
we can combine them with the infix <code>++</code> operator, which
mimics the corresponding operator on expressions (record
concatenation). Similarly, we can use the infix <code>@</code>
concatenation operator on sequence x-types.
</p>
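<p>
The connectives and operators of this section can be sketched as
follows. All type names are made up for the example.
</p>
<sample><![CDATA[{{ON}}
type digit = {{ 0--9 }}

(* difference: every integer except 0 *)
type nonzero = {{ Int - 0 }}

(* intersection of two previously defined x-types *)
type nonzero_digit = {{ digit & nonzero }}

(* union *)
type scalar = {{ Int | Char }}

(* record concatenation at the type level *)
type base = {{ { id=String } }}
type titled = {{ base ++ { title=String } }}

(* sequence concatenation at the type level *)
type mixed = {{ [ Int+ ] @ [ Char* ] }}
]]></sample>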
</box>
<box title="X-patterns" link="patterns">
<p>
X-patterns follow the same syntax as x-types. In particular,
any x-type is a valid x-pattern. In addition to x-type constructors,
x-patterns can have:
</p>
<ul>
<li>capture variables (lowercase OCaml identifiers);</li>
<li>constant bindings <code>(x := c)</code> where x is a capture
variable and c is
a literal x-constant (this pattern always succeeds and returns the
binding x->c).</li>
</ul>
<p>
In record x-patterns, it is possible to omit the <code>=p</code> part of a field.
The content is then replaced with the label name considered as
a capture variable. E.g. <code>{ x y=p }</code> is equivalent to
<code>{ x=x y=p }</code>.</p>
<p>It is also possible to add an "else" clause:
<code>{ x = (a,_)|(a:=3) }</code>
will accept any record with at most the field <code>x</code>. If the content
is a pair, the capture variable <code>a</code> will be bound to its first component;
otherwise, it is set to <code>3</code>.</p>
<p>
In regular expressions, it is possible to extract whole subsequences
with the notation <code>x::R</code>, e.g.: <code>[ _* x::Int+ _* ]</code>
</p>
<p>
If the same sequence capture variable appears several times (or below a
repetition) in a regexp, it is bound to the concatenation of all
matched subsequences. E.g.: <code>[ (x::Int | _)* ]</code> will
collect in <code>x</code> all the elements of type <code>Int</code> from
a sequence.</p>
<p>
The regexp operators +,*,? are greedy by default (they match as long
as possible). They admit non-greedy variants +?,*?,??.
</p>
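<p>
The two functions below sketch sequence capture variables and the
non-greedy operators. The names <code>all_ints</code> and
<code>first_ints</code> are hypothetical. With the non-greedy
<code>_*?</code>, the skipped prefix is as short as possible, so
<code>x</code> is expected to be bound to the first run of integers.
</p>
<sample><![CDATA[{{ON}}
(* x collects every element of type Int, in order *)
let all_ints (l : {{ [ Any* ] }}) : {{ [ Int* ] }} =
  match l with
  | {{ [ (x::Int | _)* ] }} -> x

(* shortest possible prefix, then the longest run of integers *)
let first_ints (l : {{ [ Any* ] }}) : {{ [ Int* ] }} =
  match l with
  | {{ [ _*? x::Int+ _* ] }} -> x
  | {{ _ }} -> {{ [] }}
]]></sample>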
</box>
<box title="Namespace bindings" link="ns">
<p>
The binding of namespace prefixes to URIs
can be done either by toplevel phrases (structure items) or