End-user interface of TiledArray is a Domain-Specific Language (DSL), embedded in C++, for performing basic arithmetic and user-defined operations on DistArray
objects. For example, contraction of order-3 tensors to produce an order-2 tensor can be expressed as
Even such a simple expression is implemented as a sequence of elementary operations, such as permutation and tensor contraction; understanding what happens under the hood is important for writing optimized TiledArray code. Furthermore, TiledArray DSL is extensible: users can write their own expressions to provide custom functions or optimizations. Understanding how expressions are evaluated is needed for such customizations.
This document is intended for users who
- want to understand how TiledArray DSL expressions are evaluated so that they can can control and optimize the evaluation of TiledArray expressions, or
- want to understand how to extend the DSL to suit their needs.
Implementation
TiledArray DSL is built using the expression template idiom. Lazy evaluation of DSL expressions allows to (heuristically) optimize their evaluation by minimizing the number of permutations, minimizing memory accesses, minimizing temporary storage, and combining arithmetic operations; DSL does NOT perform more extensive term rewriting, such as operation reordering, factorization (strength reduction), or common subexpression elimination. Such task can be performed by the user (or, by an optimizing compiler), provided sufficient understanding of how the TiledArray DSL expressions are evaluated.
DSL Overview
DSL expressions are evaluated in a multi-stage process, each step of which can be overridden to suit your needs.
- construct an expression object: expression objects are the nodes of the abstract syntax tree describing a composite expression
- construct expression engine: expression engines compute the metadata needed to evaluate the expression, such as the following properties of the result:
- index list
- structure
Permutation
TiledRange
- shape
- data distribution
- world
- process map
- evaluate the expression: decompose expression into a set of tasks and submit these tasks to the task queue
- tile operation
- construct the distributed evaluator
- schedule tile evaluation tasks
Expression Interface
Expression objects capture the minimum amount of information required to define the operation. Namely, the expression objects capture the syntax of expressions and do not deal with other details (such as how the expression is actually evaluated, sizes of intermediate quantities, etc.). Together the objects represent an Abstract Syntax Tree representation of an expression. Since the expression objects only capture compile-time syntax of an expression, runtime evaluation of the expression involves converting AST to a tree of objects (engines) that contain the information necessary to actually evaluate the expression at runtime (see the Engine Interface section for more details).
There are three basic types of expression objects:
- leaf – no arguments (e.g. an
Array
+ index list) - unary – one argument (e.g. negation)
- binary – two arguments (e.g. addition)
Non-leaf expression objects in TiledArray keep track of the argument expressions and expresson metadata (such as scaling factors). Leaf expression objects keep track of the target DistArray
object to which the expression is bound to and the corresponding Einstein convention annotation (if any).
Expression Example Walkthrough
To understand how expressions are represented and then evaluated let's consider a simple example:
Here we assume that a
, b
, are c
are declared elsewhere as DistArray
objects, and a
and b
have been initialized appropriately.
Since TA DSL is embedded into C++, evaluation of this expression follows C++'s operator precedence rules, which for this case amounts to:
- Function call operator (
TsrArray<float>::operator()(const std::string&)
) - Multiplication operator
- Direct assignment
Step 1: Form TsrExpr Instances
The call operator, TArray<float>::operator()(const std::string&)
, creates an instance of the TsrExpr<TArray<float>>
class. Each instance holds the string indices provided to the call operator as well as a reference to the TArray<float>
instance whose call operator was invoked. If we define some temporaries according to:
then conceptually our example becomes:
This step largely serves as the kicking-off point for creating the abstract syntax tree (AST) and does little else.
Step 2: Multiplication
Step 1 formed leaves for the AST, step 2 creates a branch by resolving the multiplication operator. In this case operator*(const Expr<tensor_expr>&,const Expr<tensor_expr>&)
is selected. The resulting MultExpr<tensor_expr, tensor_expr>
instance contains references to annotated_a
and annotated_b
, but again little else has happened. If we define:
then conceptually our example becomes:
Step 3: Assignment
This is where the magic happens. For our particular example the overload that gets called is annotated_c
's operator=
, which has the signature:
This passes tensor c
to the provided expression's eval_to
member. Inside eval_to
:
- An engine is made
- The engine is initialized
- The engine generates a distributed evaluator
- The distributed evaluator is run
- Data is moved from the distributed evaluator to
c
.
These five steps contain a lot of detail which we are glossing over at this level. More details pertaining to how the engine works can be found here.
Generalizing to More Complicated Expressions
The above three steps can be generalized to:
- Form leaves
- Form branches
- Evaluate resulting tree
Step 1, form leaves, converts DSL-objects into expressions which can enter into the expression layer. With the leaves formed, step 2 creates the branches by combining: leaves with leaves, leaves with branches, and branches with branches. The formation of branches is not associative and the exact structure is dictated by C++'s operator precedence rules. AST formation terminates when the code has been wrapped up into a single AST and that AST is being assigned to a TsrExpr
instance. Once AST formation is complete, step 3 occurs which is to evaluate the AST and assign the result to the TsrExpr
.
Custom Expression Examples
To help you implement your own custom expressions copy and paste the appropriate expression code templates below and fill in the implementation details.
Example: Custom Leaf Expression
Example: Custom Unary Expression
Example: Custom Binary Expression
Work in progress.
Expression Engine Interface
Engine objects result from interpreting the AST using the runtime data (such as annotation strings, tensor ranks, etc.). They form a tree that represents a ready-to-evaluate expression. Evaluation of the engine tree is accomplished by traversing the tree and generating distributed evaluator objects that implement particular operations as DAGs of tasks.
Example Walkthrough
Let's consider the same expression example as earlier to understand how engines are constructed and what they do:
The first engine is created when the AST is assigned to c("i,j")
. Inside the assignment operator, the eval_to
member of the AST is invoked (and provided c
). For our present purposes eval_to
does 3 relevant things:
- It constructs the overall engine
- It initializes the overall engine
- It creates a distributed evaluator
For reference, in the contraction example, the overall engine is of type:
Step 1: Construction
The input to constructor for the overall engine is the overall expression to be evaluated. The constructor itself does little aside from setting the default values and propagating the map from the input expression to the engine.
Step 2: Initialization
The overall engine's init
member is called and provided: the parallel runtime, the process map, and a IndexList
instance containing the final tensor indices (i.e., the indices of the tensor being assigned to). The init
member function ultimately calls:
init_indices
init_struct
init_distribution
in that order.
Step 2a: Variable Initialization
Variable initialization starts in MultEngine::init_indices(const IndexList& target_indices)
. The purpose of the init_indices
is to propagate the target indices (in this case, {"i","j"}
) for each computation step down the tree; the goal is to minimize the number of permutations (layout changes) needed to evaluate the tree (this is complicated by the fact that some operations can permute their arguments and/or result implicitly, namely GEMM). N.B. For some operations (e.g., reductions) the target index list is empty, hence the init_indices()
overload is used in that case.
Calling init_indices(target_indices)
on the top engine in the engine tree calls the init_indices
recursively on its descendants; in this case, init_indices
will be invoked for the engines of the expressions on the left and right sides of the *
operator (since the left and right arguments are leaves of the AST, these will be no-ops). After giving the expressions on the left and right a chance to initialize their indices the indices for the left and right side are used to determine what sort of multiplication this is (a contraction or a Hadamard-product) and the appropriate base class's perm_indices
is then called.
Step 2b: Struct Initialization
This function initializes the details for the resulting tensor, specifically:
- the permutation to go from the result of the expression to the result tensor
- the tiling of the result tensor
- the sparsity of the resulting tensor
Within the MultEngine
instance this is done first for both the left and right expressions. Next, taking the type of multiplication occurring into account, the results of calling struct_init
on the left and right expressions are combined to generate the details of the tensor resulting from the multiplication.
Step 3: Make a Distributed Evaluator
For our MultEngine
the type of the distributed evaluator is DistEval<Tensor<float>, DensePolicy>
. Again this process is repeated recursively for the left and right expressions in MultEngine
resulting in the distributed evaluators for the left and right expressions, which are then forwarded to the PIMPL for the DistEval returned by MultEngine
. The distributed evaluator is what actually runs the operation.
Complete Example: Multiplication by scalar
This example demonstrates how to implement scaling expressions by a scalar. The TiledArray tile operation Scal
is used for demonstration purposes, but you are free to substitute your own tile operation. See the User Defined Tiles section for more details.