Algebraic data types are a very convenient and expressive way
of modeling data. You're familiar with them if you've ever
used any ML-like language: you have products,
which are pairs or records:
("hello", 7) : String * Int
and sums, which are discriminated unions:
Left "hello" : String + Int
Right 7 : String + Int
Usually, modern ML-like languages provide a mechanism for
creating named sums, where you provide your own names for
each discriminant, or constructor:
data IntList = Cons (Int * IntList) | Nil
You also have the so-called unit or top type, which
is the type with a single possible value, usually written
()
,
and—at least in the underlying theory—the so-called bottom
type, which has no values. The top and bottom types
are sometimes written with the symbols \(\top\) and \(\bot\), and sometimes
with the numerals 1
and 0
.
This is because
they act as the identity elements in the algebra of types.
The types a+0
and 0+a
are both isomorphic to a
.
Likewise, a*1
and 1*a
are also both isomorphic to a
.
Algebraic data types first appeared as a usable feature in the
Hope programming language,
which was a very early statically typed functional
language. Despite some syntactic weirdness, they worked very
much like modern algebraic data types: the operator for product
types was written as
#
instead of *
, and sum constructors
were written with ++
in between variants, but
Hope type declarations
still look remarkably familiar 45 years later:data numlist == cons ( num # numlist ) ++ nil ;
Pattern-matching on those types also looks remarkably similar to
modern functional languages, complete with definition-level
pattern-matching as found in languages like Haskell:
dec sum : numlist -> num ;
--- sum ( nil ) <= 0 ;
--- sum ( cons ( x, xs ) ) <= x + sum xs ;
Algebraic data types also featured in early ML implemenations, but in a
very awkward way. The first DEC-10 ML implementation didn't have a
mechanism for defining named sums; instead, you had to use
the built-in sum operator
+
with its constructors:
The symbols *
and **
in these signatures are
type variables, like 'a
in modern versions of ML.inl : * -> * + **
inr : ** -> * + **
You couldn't pattern-match on those values: you had to use the
accessor functions
outl
and outr
instead, which would throw
exceptions if you tried to access the wrong value. You could
explicitly check the tag with the function isl
:outl : * + ** -> *
outr : * + ** -> **
isl : * + ** -> bool
We could create a data type by giving the algebraic
representation of the type in terms of products, sums, and existing
types. Declarations used the
abstype
or
absrectype
keywords, depending on whether the type was
recursive or not. Like Hope, ML used #
for pairs, and
also used the symbol .
to stand for unit
:absrectype intlist = (int # intlist) + . ;;
This definition would create a new type as well as two
functions,
absintlist
and repintlist
, with
the following types:absintlist : ((int # intlist) + .) -> intlist
repintlist : intlist -> ((int # intlist) + .)
Whenever we defined a type
T
as isomorphic to a
representation R
, the language would
define the functions absT : R -> T
and
repT : T -> R
, filling in whatever particular type
name and representation we give.
Those functions allow us to, for example, construct values
of type intlist
by "wrapping" a value of the
representation type:val mylist = absintlist(inl(1, absintlist(inr ())));;
Of course, constructing all our values like that would be
a nightmare, so DEC-10 ML programs often used the
with
construct.
The with
keyword is preceeded by a set of declarations and followed
by a different set of declarations. The value definitions
before the with
keyword are available to the
rest of the declarations in the with
construct, but
only the ones after the with
keyword are introduced
into the surrounding scope. In our case, this allows us
to write a type with a set of constructor and accessor
functions, but keep the abs
and rep
functions
hidden:absrectype intlist = (int # intlist) + .
with cons(x, xs) = absintlist(inl(x, xs))
and nil = absintlist(inr())
and head lst = fst(outl(repintlist lst))
and tail lst = snd(outl(repintlist lst))
and isempty lst = not(isl(repintlist lst));;
The semantics of
with
ensure that the absintlist
and
repintlist
functions don't escape the declaration block,
and lists can only be constructed the user-written
constructor functions. We can then construct our values much more
tersely:val mylist = cons(1, nil)
The successor to DEC-10 ML, VAX ML, did include the same datatype
DEC-10 ML and VAX ML are both collectively referred to as Edinburgh ML.
The names come via Luca Cardelli, who wrote VAX ML as a successor to and
improvement on the original DEC-10 ML.
mechanism as described above,
but also included both named sums and named products. Named products
corresponded to what are called structs or records, but
unlike structs in languages like C or Pascal, they didn't need to
have their types named or declared, as structural descriptions of
the types would suffice. The VAX ML syntax used
so-called "decorated parentheses":
(| a = "foo"; b = 7 |) : (| a: string; b: int |)
The named sum mechanism also did not need any declarations: the
type instead was written as the set of possible constructor names along
with their argument types:
[| c = "foo" |] : [| c: string; d: int |]
[| d = 7 |] : [| c: string; d: int |]
Unlike sums and products in DEC-10 ML, these sums and products
could be picked apart by pattern matching, using a syntax
very similar to the syntax used to create the values:
case e : [| c: string; d: int |] of
[| c = s . 0;
d = n . n + n
|]
The effort to create Standard ML started in the early 80's, and
it drew on features from both Hope and the existing ML languages. In
A Proposal for Standard ML, Robin Milner explains that:
…a difficult decision had to be made concerning HOPE's treatment of data types—present only in embryonic form in the original ML—and the labelled records and variants which Cardelli introduced in his VAX version. The latter have definite advantages which the former lack; on the other hand, the HOPE treatment is well-rounded in its own terms. Though a combination of these features is possible, it seemed (at least to me, but some disagreed!) to entail too rich a language for the present proposal. Thus the HOPE treatment is fully adopted here.
Consequently, the first versions of SML drew their algebraic data
types from Hope's implemenation, abandoning the tedious and difficult
DEC-10 ML versions. Later versions did end up drawing on Cardelli's
named products to create records, with a change in notation
from
(| ... |)
to { ... }
, and thus we get the
algebraic data types we all know and love.