Getting started

CDuce is a strongly-typed functional programming language adapted to the manipulation of XML documents. Its syntax is reminiscent of the ML family, but CDuce has a completely different type system.

Let us introduce directly some key concepts:
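As a first illustration, here is a minimal sketch of a local binding. The two string contents are assumed for illustration only; the identifiers x and y and the concatenation operator @ follow the description given next.

```cduce
(* Bind two strings to x and y, then concatenate them with @. *)
(* The string contents are placeholders chosen for this sketch. *)
let x = "Hello, " in
let y = "world!" in
x @ y;;
```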

The expression binds two strings to the value identifiers x and y, and then concatenates them. The general form of the local binding is:

    let %%p%% = %%e%% in %%e'%%

where %%p%% is a pattern and %%e%%, %%e'%% are expressions.

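CDuce uses its own notation to denote XML documents. Below we present an XML document, followed by the same document in CDuce notation.

XML document:

    <parentbook>
      <person gender="F">
        <name>Clara</name>
        <children>
          <person gender="M">
            <name>Pål André</name>
            <children/>
          </person>
        </children>
        <email>clara@lri.fr</email>
        <tel>314-1592654</tel>
      </person>
      <person gender="M">
        <name>Bob</name>
        <children>
          <person gender="F">
            <name>Alice</name>
            <children/>
          </person>
          <person gender="F">
            <name>Anne</name>
            <children>
              <person gender="M">
                <name>Charlie</name>
                <children/>
              </person>
            </children>
          </person>
        </children>
        <tel>271828</tel>
        <tel>66260</tel>
      </person>
    </parentbook>

CDuce notation:

    let parents : ParentBook =
    <parentbook>[
      <person gender="F">[
        <name>"Clara"
        <children>[
          <person gender="M">[
            <name>['Pål ' 'André']
            <children>[]
          ]
        ]
        <email>['clara@lri.fr']
        <tel>"314-1592654"
      ]
      <person gender="M">[
        <name>"Bob"
        <children>[
          <person gender="F">[
            <name>"Alice"
            <children>[]
          ]
          <person gender="F">[
            <name>"Anne"
            <children>[
              <person gender="M">[
                <name>"Charlie"
                <children>[]
              ]
            ]
          ]
        ]
        <tel>"271828"
        <tel>"66260"
      ]
    ]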

Note the straightforward correspondence between the two notations: instead of using a closing tag, we enclose the content of each element in square brackets. In CDuce, square brackets denote sequences, that is, heterogeneous (ordered) lists of blank-separated elements. Strings are not a primitive data type in CDuce: they are sequences of characters.

For the purposes of the example we used different notations for strings: in CDuce, "xyz", ['xyz'], ['x' 'y' 'z'], [ 'xy' 'z' ], and [ 'x' 'yz' ] all define the same string literal. Note also that the string "Pål André" is accepted, since CDuce supports Unicode characters.
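The equivalence of these notations can be checked directly. The sketch below assumes CDuce's structural equality operator = on values; each comparison is expected to evaluate to true.

```cduce
(* All of these literals denote the same sequence of three characters. *)
"xyz" = ['xyz'];;
['xyz'] = ['x' 'y' 'z'];;
[ 'xy' 'z' ] = [ 'x' 'yz' ];;
```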

The CDuce program in the previous section starts by binding the variable parents to the XML document. It also specifies that parents has the type ParentBook: this annotation is optional, but it usually allows type errors to be detected earlier. If the XML document shown there is stored in a file, say parents.xml, then the same binding can be obtained by loading the file as follows:

    let parents : ParentBook = load_xml "parents.xml";;

since load_xml converts an XML document stored in a file into the CDuce expression representing it.

First, we declare some types:

    type ParentBook = <parentbook>[Person*];;
    type Person = FPerson | MPerson;;
    type FPerson = <person gender="F">[ Name Children (Tel | Email)*];;
    type MPerson = <person gender="M">[ Name Children (Tel | Email)*];;
    type Name = <name>[ PCDATA ];;
    type Children = <children>[Person*];;
    type Tel = <tel kind=?"home"|"work">['0'--'9'+ '-'? '0'--'9'+];;
    type Echar = 'a'--'z' | 'A'--'Z' | '_' | '0'--'9';;
    type Email = <email>[ Echar+ ('.' Echar+)* '@' Echar+ ('.' Echar+)+ ];;

The type ParentBook describes XML documents that store information about persons. A tag <tag attr1=...; attr2=...; ...> followed by a sequence type denotes an XML document type. Sequence types classify ordered lists of heterogeneous elements; they are denoted by square brackets enclosing a regular expression over types. Note that a regular expression over types is not itself a type: it only describes the content of a sequence type, so it is meaningless unless enclosed in square brackets. The definitions above state that a ParentBook element is formed by a possibly empty sequence of persons. A person is of type either FPerson or MPerson, according to the value of the gender attribute. An equivalent definition for Person would thus be:

    type Person = <person gender="F"|"M">[ Name Children (Tel | Email)*];;

A person element consists of a sequence formed of a name element, a children element, and zero or more telephone and e-mail elements, in this order.

Name elements contain strings, which are encoded as sequences of characters. The PCDATA keyword is equivalent to the regexp Char*, so String, [Char*], [PCDATA], [PCDATA* PCDATA], ... are all equivalent notations. Children elements are composed of zero or more Person elements. Telephone elements have an optional (as indicated by =?) string attribute whose value is either "home" or "work", and their content is a single string made of two non-empty sequences of numeric characters separated by an optional dash. Had we wanted to state that a phone number is an integer with at least, say, 5 digits (which is meaningful only if no phone number starts with 0), we would have used an interval type such as <tel kind=?"home"|"work">[10000--*], where * denotes plus infinity.
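To make the Tel constraints concrete, the sketch below checks two values against the type. It assumes the element tag is spelled tel, as in the interval example above. The identifiers t1 and t2 and the choice of kind="work" are illustrative; the phone numbers are those of the sample document.

```cduce
(* Optional kind attribute; content is two non-empty digit runs
   separated by an optional dash. *)
type Tel = <tel kind=?"home"|"work">['0'--'9'+ '-'? '0'--'9'+];;

(* No kind attribute, dash present. *)
let t1 : Tel = <tel>"314-1592654";;

(* kind attribute present, no dash: the digits still split into
   two non-empty runs. *)
let t2 : Tel = <tel kind="work">"271828";;
```

Under the same definition, a value such as <tel>"5" (a single digit cannot be split into two non-empty digit runs) or one carrying kind="mobile" would be expected to be rejected by the type checker.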

Echar is the type of the characters allowed in e-mail addresses. It is used in the regular expression defining Email to precisely constrain the form of the addresses. An XML document satisfying these constraints is shown