Data types

This section lists all Welly's data types. For each one, it specifies:

  • the Welly source code syntax for the type,
  • the Welly value that represents the type,
  • the meaning of the type, and
  • the way values of the type are represented in memory.

There is a design document explaining why we've chosen these particular types.

This documentation is partly aspirational. Most features work as described, but there are some differences.

Every value has...

Before listing the available types, we can say a few things that apply to all values.

  • All values are represented by a contiguous string of bytes in memory, possibly including pointers to other values.
  • Assignment, copying, referencing, dereferencing, and function call and return are all defined as operations on the string of bytes, and cannot be overridden.
  • All values have properties needed to allow them to be included as fields of structs.
  • All values obey disciplines needed to support garbage collection.
  • All values can be "const" (immutable) or "var" (mutable).

Strings of bytes

A "value" is something that can be stored in a variable. Welly values are (conceptually) always stored in memory as a contiguous string of bytes. To be more specific, values will always be stored somewhere inside a block of memory allocated in the heap (there is no stack allocation in Welly). There is a wide variety of different kinds of value, but they all satisfy this universal property.

In practice, values will sometimes be held in virtual machine registers, which may in turn be held in x86 registers. However, that is an optimisation which is invisible to programmers. Similarly, where the specification says that Welly will copy a value, the optimiser might decide not to bother, and if it can prove that no harm will come of sharing the original. For simplicity, this specification describes the unoptimised version.

Some values require other data to be stored elsewhere in memory. For example, an array value is just a short header, including a pointer to another heap block containing the array elements. The other heap block is not considered to be part of the value, because it is not part of the contiguous string of bytes. This distinction matters when a value is copied.

Universal operations

Every value supports a few universal operations, which are defined in terms of the string of bytes that represents the value:

  • When you write the assignment statement a = b, it always means the same thing: find the string of bytes that represents b, and the range of memory addresses that represents a, and copy the former into the latter. No exception is made for any type of value.
  • When you write the expression new a it always means the same thing: find the string of bytes representing a, allocate a new mutable heap block that is just big enough, copy the bytes into it, and return a reference to it. No exception is made for any type of value.
  • When you write the expression &a it always means the same thing: find the range of memory addresses that represents a, and return a reference to them. No exception is made for any type of value.
  • When you write the struct expression (a, b) it always means the same thing: make a string of bytes containing the bytes that represent a and those that represent b. The layout is chosen in a defined way, depending on the sizes and alignments of a and b.
  • When you pass a value to a function, or return a value from a function, it is always done by allocating a fresh immutable heap block and copying the bytes of the value into it. This means that the value is passed by value. To pass a value by reference, you have to take its address and pass that instead. You can only pass one value; to pass multiple values you have to pass a struct instead.

Size and alignment

Every value has a size, being the number of bytes it occupies. The size is not stored in memory (in general) but is instead calculated at compile time.

Every value also has an alignment, which can be 0, 1, 2, 4 or 8. The address at which a value is stored in memory must be a multiple of its alignment. Like the size, the alignment of a value is calculated at compile time.

Values with alignment zero must all be stored at address zero. This alignment is only used for values of zero length, such as the empty struct, of type void. It is used for all such values. It is not possible to read or write bytes from/to such values, so it does not really matter what their address is; giving them an alignment of zero is just a way of saying "don't care".

On most processors, words of size n can (efficiently) only be read from or written to addresses that are multiples of n. For example, if you want to copy a large value eight bytes at a time, it has to be eight-byte-aligned. To allow itself to copy values in a reasonable number of operations, Welly therefore imposes a minimum alignment on values, equal to a quarter of their size, or eight, whichever is smaller. For example, a string of nineteen bytes will have an alignment of eight, and can therefore be copied as two words of eight bytes, one word of two bytes, and one more byte.

In summary, the allowed combinations of size and alignment are as follows:

AlignmentSize
00.
1More than zero and less than 4.
2More than zero and less than 8.
4More than zero and less than 16.
8More than zero.

Heap blocks in Welly always have addresses that are a multiple of eight. Therefore, any value can be stored at the start of a heap block without any further checks. It is to ensure this property that Welly does not support alignments larger than eight.

Pointers

Every value has the ability to enumerate the pointers inside it. It is important that the garbage collector is aware of all pointers inside values, as this is needed to define which heap blocks are reachable and which are garbage. It is also needed to allow the garbage collector to update pointers when it moves heap blocks around in memory.

Like the size and alignment of a value, the locations of the pointers within a value are not stored in memory as part of the value. Instead, they are calculated at compile time, depending only on the value's type. At run time, the program can manipulate the value efficiently, without having to manipulate type information as well.

Having said that, some metadata is stored in a separate area of memory whenever a heap block is allocated. The metadata is not accessible from within Welly, but is only accessed during garbage collection. The metadata is designed to be general enough to support other languages besides Welly, and uses a format that is simpler than Welly's types.

Pointers are eight bytes long. Any value that contains a pointer must have an alignment of eight.

Welly supposes that every non-zero 64-bit value can in principle be a pointer. Pointers always point inside allocated heap blocks, but not always to the start of those blocks, so pointers are not always multiples of 8. Welly does not "steal" any of the bits of a pointer for purposes other than storing the address of the pointee.

Constness

Every value is either const or var. Values that are const cannot be modified by anyone, and can be relied upon not to change over time. Values that are var can be modified by anyone, and the changes will be seen by everyone. Both contnesses are useful.

The constness of a value depends on where a value is stored. Constness is orthogonal to type; i.e. either constness can be combined with any type. The constness is determined when the value is first created, and cannot be changed without making a copy of the value.

In addition to const and var values, one sometimes encounters values of unknown constness. There is no third constness really; every such value is either const or var. Saying that the constness of a value is unknown simply means that your code must work in both cases. Values of unknown constness cannot safely be modified, but nor can they be trusted not to change their value.

Types of value

Every storage location and every value in Welly has a type. A value can be stored in a location if and only if their types are compatible.

The type of a value tells you how it is represented in memory. The representation of a value is chosen to be compact, while still being able to distinguish it from other values of the same type.

We say that a type t is a subtype of a type u if all values that can be stored in a location of type t can also be stored in a location of type u. Subtyping arises in Welly in several ways. Read more about subtypes.

Unlike in many programming languages, types are themselves first-class values, which can be stored in variables and passed to functions. Therefore, types have a type. The type of types is called type, and it has the following definition:

union constness {
    BASE
    CONST
    VAR
}

union type {
    PRIMITIVE(uint8, uint8)
    REF(constness, type)
    ARRAY0
    ARRAY(constness, type)
    STRUCT(array const type)
    UNION(array const struct {name: str, type: array const type})
    FUNCTION(array const type, type)
    DYNAMIC
    MACRO
    WITH(type, module)
    // TODO...
} with { ... }

This means that every type is one of the following:

  • A primitive type.
  • One of three kinds of ref or array type.
  • The type of empty arrays (written array0 in source code).
  • A fn type.
  • A struct type.
  • A union type.
  • One of the special types dynamic and macro.
  • A with type.

Primitive

[Aspirational - the current implementation is simpler.]

The primitive types defined in Welly's standard library are:

Type nameSizeAlignmentDescription
bool11Only values are `true` or `false`.
char44Unicode code point.
byte11Synonym for `uint8`.
uint811Unsigned 8-bit integer.
int811Signed 8-bit integer.
uint1622Unsigned 16-bit integer.
int1622Signed 16-bit integer.
uint3244Unsigned 32-bit integer.
int3244Signed 32-bit integer.
int8864-bit integer.
float324432-bit floating-point number.
float8864-bit floating-point number.

There is no separate 64-bit uint type, because 64-bit integers are never implicitly sign-extended. Read more about signedness.

These types will all be represented as instances of PRIMITIVE(size, alignment). The differences between them will be represented by wrapping the type in a with type. This arrangement allows libraries written by third parties to define new primitive types.

References

It is possible to take the address of any existing Welly value, without copying it into a heap block all of its own. Such an address is called a "reference". The Welly syntax for taking the address of a value x is &x. The Welly syntax for the type of a reference to type t is ref t, ref const t or ref var t.

The difference between the three kinds of reference is as follows:

  • ref neither guarantees that the value won't change, nor allows the value to be modified.
  • ref const guarantees that the value pointed to won't change.
  • ref var allows the value to be modified.

ref t offers the weakest guarantee. It is useful because it is a supertype of the other two; a function that accepts ref t can accept any reference to type t.

References occupy a single word: a pointer to the referee. They have a size of 8 and an alignment of 8. References can refer to locations in the middle of a heap block.

Note the distinction between references and pointers. References are first-class values that are represented as pointers. However, pointers are a low-level concept that is also used for other purposes, and pointers are not always first-class values.

Arrays

An array block is a heap block that contains many values of the same type, packed together as closely as possible. The values are called the "elements" of the array. The size of an array block is the size of each value plus padding, times the number of elements. The number of elements is not stored in the array block; anyone with a reference into the array block is expected to know how many elements they can access there. Array blocks are not first-class values in Welly; you can't stored them in variables, or pass them to functions.

An array value is a pointer into an array block (not necessarily to the first element) along with the number of elements that may be accessed via the pointer. An array value is a first-class value.

The Welly syntax for an array literal is [a, b, c], where a, b and c are values of the same type (there can be any number of values, from 0 upwards). The "array" module in the standard library provides other ways of constructing array values.

The Welly syntax for the type of an array with element type t is array t, array const t or array var t (see the section "References" above for an explanation of what these mean).

Arrays have a size of 16 and an alignment of 8.

Strings

[Aspirational - the current version treats strings as a special case.]

Strings in Welly are represented as immutable arrays of bytes, i.e. strings are assignment-compatible with array const byte. The characters of the string are packed into the array using the UTF-8 encoding. The type of strings is called str.

Structs

Values can be concatenated into a struct. The Welly syntax for writing the concatenation of values a, b and c is (a, b, c). The Welly syntax for writing the type of the struct is struct (t, u, v), in which t, u and v are the types of a, b and c respectively.

The constituent values of a struct are called its "fields". The size and alignment of a struct is calculated from the size and alignment of its fields. The offset of each field within a struct (by which I mean the difference between the address of the field and the address of the struct) is similarly calculated.

Structs are built from left to right. Each value is placed at the smallest offset that is consistent with its alignment and that does not cause it to overlap with any previous value. Note that it is possible for later fields to have a smaller offset than earlier fields, if they have a smaller alignment. When the struct is complete, the gaps are filled with padding bytes. The size of the struct is the offset of the first unused byte, and the alignment is the largest alignment of any member.

Example 1: struct (a: byte, b: int, c: byte)
a has offset 0; b has offset 8; c has offset 1. There are six bytes of padding at offset 2. The size of the struct is 16 and its alignment is 8.
Example 2: struct (struct (a: byte, b: int), c: byte)
a has offset 0; b has offset 8; c has offset 16. There are seven bytes of padding at offset 1. The size of the struct is 17 and its alignment is 8. c cannot be placed at offset 1 because it would overlap with the padding of the inner struct.
Example 3: struct (struct (a: int, b: byte), c: byte)
a has offset 0; b has offset 8; c has offset 9. There is no padding. The size of the struct is 10 and its alignment is 8.
Example 4: struct (a: byte, struct (b: int, c: byte))
a has offset 0; b has offset 8; c has offset 16. There are seven bytes of padding at offset 1. The size of the struct is 17 and its alignment is 8. The inner struct cannot be placed at offset 1 because it has an alignment of 8.
Example 5: struct (a: byte, struct (b: byte, c: int))
a has offset 0; b has offset 8; c has offset 16. There are seven bytes of padding at offset 1, and another seven bytes of padding at offset 9, i.e. at offset 1 of the inner struct. The size of the struct is 24 and its alignment is 8. The inner struct cannot be placed at offset 1 because it has an alignment of 8.

Void

[Implemented, but likely to change.]

The type void is a synonym for struct (). It has only one value: (). Therefore, a function that returns void does not return any information.

Unions

Unions are Welly's algebraic types. Unions are useful when you want a storage location that is able to hold more than one type of value. Each type of value that the location can hold is called a "case". The cases are named.

Case names are written in capital letters; they are lexically distinguished from identifiers, which must always contain a lower-case letter.

A good example of an algebraic type is the type of a binary tree with rings on its branches and bells on its leaves (ring and bell being previously defined types). This type has two cases:

  • A LEAF contains a bell.
  • A BRANCH contains a tree, a ring and another tree.

The Welly syntax for this type is:

union tree {
    LEAF(bell)
    BRANCH(tree, ring, tree)
}

The Welly syntax for constructing a LEAF is LEAF(x) where x is a bell. x is called the "payload" of the union value. (In the case where the payload is empty, the brackets can be omitted, i.e. NONE() can be written NONE).

A union type with more cases than this one is a supertype of it (assuming the payloads are compatible). A union type with fewer cases is a subtype. Read more about subtypes.

A union value always occupies 16 bytes, and always has an alignment of 8. The case name is stored at offset 0, represented as a 64-bit hash. The payload value is stored in a separate heap block. A pointer to it is stored at offset 8.

Note that the word at offset 0 is never a pointer, and the word at offset 8 is always a pointer. That's all the garbage collector needs to know, because the payload is a separate heap block with its own metadata.

Note another restriction on the representation of union values: when a value is cast from a subtype to a supertype, its representation must not change. Thus, it is not possible for any union value to occupy less than 16 bytes.

Never

[Implemented, but likely to change.]

The type never is a synonym for union {}. It does not have any values. Therefore, a function that returns never cannot return (except by throwing an exception).

Functions

Functions are first-class values in Welly. The Welly syntax for a function literal is fn name(a:t, b:u, c:v): w { ... } where a, b and c are the parameter names, t, u and v are their types, and w is the return type (which can often be omitted). The syntax for the type of that function is fn(t, u, v): w.

Function values have a size of 16 and an alignment of 8. A pointer to the code for the function is stored at offset 0. The non-static values captured when the function was defined are packed into a struct and stored in a separate heap block. A pointer to the capture struct is stored at offset 8.

[TODO: Cross-reference a page about the calling convention, and check that it is up to date. Make sure it defines how to throw an exception.]

Dynamic

Dynamic values are useful when you want to store a value whose type is not a compile-time-known constant. The Welly syntax for wrapping up a value x as a dynamic value is box x. If y is a dynamic value, then y.type is the type of the value it contains. If in addition y is known at compile time, then y.value is the value it contains.

Dynamic values have a size of 40 and an alignment of 8. The constness of the value is stored at offset 0. The type of the value is stored at offset 16. The value itself is stored in a separate heap block, and a pointer to it is stored at offset 32.

It is quite wasteful to use 16 bytes for the constness, which only has three possible values. We are likely to optimise the representation at some point.

Macros

[Aspirational and unfinished.]

Macros are first-class values. The representation of macros is implementation-defined. The type of macros is called macro.

Modules

[Implemented, but likely to change.]

Modules are first-class values. The representation of modules is implementation-defined, but they typically use a representation that can be defined and manipulated by Welly source code, i.e. something described elsewhere in this document. The type of modules is called module, and it is an opaque type.

with types

[Principle agreed, mechanism not. See [TODO] "doc/extensions/compact.txt" for an alternative design.]

It is possible to define a type that is based on another type, and assignment compatible with it, but with different meanings for its operators. The Welly syntax for defining a type u based on t is:

const u = t with(wrap) { ... }

The braces have the same syntax as the body of a module, and typically contain macros defining the various operators, and any other API that values of type u support. The way in which this is done is described in "reference/operators.html".

Values of type u have the same representation as values of type t, and are inter-assignable with them. In particular, you can use the cast operator : to convert a value between the two types, in either direction.

Values of type u have exactly the same size and alignment as values of type t.