In Python’s Numpy library lives an extremely general, but little-known and used, function called
einsum() that performs summation according to Einstein’s summation convention. In this tutorial article, we demystify
The only thing that the reader should need is an understanding of multidimensional Linear Algebra and Python programming. The reader will be better-prepared if he has, in the past, implemented the classic matrix multiplication using three nested loops in the language of his choice.
Why you should learn about it
The Einstein summation convention is the ultimate generalization of products such as matrix multiplication to multiple dimensions. It offers a compact and elegant way of specifying almost any product of scalars/vectors/matrices/tensors. Despite its generality, it can reduce the number of errors made by computer scientists and reduce the time they spend reasoning about linear algebra. It does so by being simultaneously clearer, more explicit, more self-documenting, more declarative in style and less cognitively burdensome to use. Its advantages over such things as matrix multiplication are that it liberates its user from having to think about:
- The correct order in which to supply the argument tensors
- The correct transpositions to apply to the argument tensors
- Ensuring that the correct tensor dimensions are lined up with one another
- The correct transposition to apply to the resulting tensor
The only things it requires are knowledge of:
- Along which dimensions to compute (inner/element-wise/outer) products.
- The desired output shape.
As an aside, Einstein summation is indeed named after the famed physicist and theoretician Albert Einstein. However, Einstein had no part in its development. He merely popularized it, by expressing his entire theory of General Relativity in it. In a letter to Tullio Levi-Civita, co-developer alongside Gregorio Ricci-Curbastro of Ricci calculus (of which this summation notation was only a part), Einstein wrote:
I admire the elegance of your method of computation; it must be nice to ride through these fields upon the horse of true mathematics while the like of us have to make our way laboriously on foot.
High praise indeed coming from Einstein himself!
NB: As a further aside, the most general formulation of Einstein summation involves topics such as covariance and contravariance, indicated by subscript and superscript indices respectively. For our purposes, we will ignore co-/contravariance, since we can and will choose the “basis” we operate in to make the complexities that they introduce disappear.
How it Works
einsum()‘s Numpy documentation may be totally opaque to some, it operates on a simple principle and is enlightening once understood.
The only dependency is Numpy.
It is invoked with a format string and any number of argument Numpy tensors, and returns a result tensor.
This format string contains commas (
,) that separate the specifications of the arguments, as well as an arrow (
->) that separates the specifications of the arguments from that of the resulting tensor. The number of argument specifications and the number of arguments must match, and exactly one arrow followed by a result specification must be present 1.
The specification of the arguments and result tensors are a series of (alphabetical, ASCII) characters. The number of characters in a tensor’s specification is exactly equal to the number of dimensions of this tensor.
Complete examples of
einsum() format strings are first shown below. Try to guess what the result will be! Note that 0-dimensional tensors (scalars) are valid as both argument and result; They are represented by empty strings
"". A reminder, again, that the specifications must be composed exclusively of ASCII lower or upper-case letters. Numpy will throw an error for anything else.
In both Einstein summation and Numpy’s
einsum(), one labels every axis of every tensor with a letter that represents the index that will be used when iterating over that axis.
einsum() is then easily expressed as a deeply-nested set of
for-loops. At the heart of these
for-loops is a summation of products of the arguments. Those arguments are indexed precisely as the format string suggests. An elaborate example is below.
It should be now clear how the format string specifications relate to the dimensionality of the arguments and result, and how indices plop into place in the order given by the format string.
The astute reader will have noticed that in the matrix multiplication example above, there are four axes in total between all input arguments, but only three different indices (
k is reused), and thus only three nested loops. That’s because we march in lockstep along the columns of
A and the rows of
B. We will get back later to the idea of index reuse, but first we must distinguish between two types of indices.
All indices in an Einstein summation or
einsum() format string can be partitioned in two sets: The set of free indices, and the set of summation indices. The difference is quite simple:
- Free indices are the indices used in the output specification. They are associated with the outer
- Summation indices are all other indices: those that appear in the argument specifications but not in the output specification. They are so called because they are summed out when computing the output tensor. They are associated with the inner
In light of this partition, the role of each is clear: Free indices are those used to iterate over every output element, while summation indices are those used to compute and sum over the product terms required.
Of course, the specific summations of products made are in part a function of the free indices. For instance, in the matrix multiplication example below, the output’s element
[i,j] depends on a sum of products along row
A and column
Meanwhile, summation indices are used exclusively within the inner-most loops:
It is time for a little break. To let the above sink in, view also the following four other examples of
einsum() calls, as well as the
for-loops they map to.
2-D Matrix Diagonal Extraction
Note how the index
i is reused twice in the same term. There is nothing about
for-loops or arrays that forbids you from doing this, so
einsum() doesn’t stop you from doing it. Of course, this only works if the matrix is square.
2-D Matrix Trace
Watch how the only difference between the trace and the diagonal is the absence of the output dimension
i. The consequence is that the output is scalar, and thus there are no free indices. The index
i, previously a free index, has become a summation index that collects the diagonal terms.
This is just a display of a 3-argument Einstein summation.
Batched Outer Product
If you had a matrix of
Q, and wanted to compute the outer product of the
B‘th row of
P and the
B‘th row of
Q for all
B, you would be at pains to do this with other functions, and could easily get it wrong. With
einsum(), it is trivial, even intuitive: Merely specify the batch index (here,
'B') in all terms.
If you wanted the outer-products to also be transposed, this is also trivial to accomplish. One only needs to use, instead of
"Bi,Bj->Bij", the format-string
"Bi,Bj->Bji". In no other Numpy function do arbitrary transpositions like this come at 0 additional cost in characters or time spent thinking.
Rules of Use
Almost all rules constraining your use of
einsum() derive directly from its formulation as a set of nested loops indexed by named indices that sum products of its arguments indexed in a given way. As above, we will assume that an axis to which index letter
'x' is assigned has length
'Nx' for the purposes of the code snippets below.
In the first example above, if we know that the two arguments to the matrix multiplication
"ik,kj->ij" are of shape
(Nk,Nj), then we know that the output will have shape
(Ni,Nj), because that is the length of the two surviving axes, one from each argument matrix.
Similarly, when we extract a diagonal, we implicitly assume the matrix to be square along those dimensions. If this were not so, a scalar loop would stop too early or late and possibly crash the program.
Here are examples of
Again, when viewed as a deeply-nested set of loops that compute (scalar) sums of products, many of these rules make sense.
- The number of argument specifications must match the number of arguments, or else you are indexing into air.
- The argument specification string must be as long as the argument’s dimensionality, or else the
totalbeing summed into won’t be a scalar 2.
- Using an output index that doesn’t exist amongst the input indices would be like using an undeclared variable.
- Reusing an index in the output specification would leave uninitialized elements outside a diagonal.
- Using the same index for axes of different lengths doesn’t make sense – which is the right length to use?
For real-world applications of
einsum(), let’s implement an MLP. An MLP has 3 axes of interest to us:
- Input unit axis
- Hidden unit axis
- Output unit axis
We give them the very appropriate labels/indices
'o', and give them lengths
No. We arbitrarily declare the weight matrices
V to have shape
(No,Nh); Their corresponding
einsum() specifications are
"oh". We can now write very elegantly the MLP’s
Note the total absence of any explicit transpositions in the code, and the total lack of care given to ordering of the parameters.
einsum() is resistant to permutations of its arguments, so long as the format string’s argument specifications follow suit.
But, what if we wanted to vectorize this procedure, by processing batches instead? If we had written the MLP using the classic
outer() and other similar functions, we’d be introducing transpositions left and right, and reordering the arguments so that dimensions line up in order to satisfy the requirements of those functions.
Witness instead what happens with an
einsum() implementation. In addition to the axes listed above, we introduce a new axis:
- Batch number axis
and give it the index/label
'B' and length
batch_size. We now make only a few edits to our implementation:
Almost the only thing it took us was to add
'B' to the output specifications of all activations and the gradients with respect to those activations! To obtain the gradient average across all examples, we instruct
einsum() to sum out the batch axis
'B' by not giving it in the output specification, and divide
einsum()‘s output by the length of that axis to normalize.
In general, the following points should be kept in mind when using
- Give each axis that notionally exists within the problem its own label. It is best if memorable ones can be chosen, like in the MLP problem above.
- If you want element-wise multiplication between the axes of two arguments, use the same index for both (
- If you want summation along a given axis, don’t put its index in the output specification (
- If you want an inner product, which is element-wise multiplication between two axes followed by the summing-out of those axes, then do both of the above (
- If a tensor is used as argument to
einsum(), simply copy-paste its specification from the
einsum()that created it. Input transpositions are automagically handled.
- For the output, simply state what is the form of the tensor that you want. The genie in
einsum()will give it to you, and you have infinite wishes.
Useful Format Strings
Here is a list of useful operations, given as format strings. Remember that you can always reorder the argument specifications and the arguments in like manner, and that transpositions of the output are done simply by rearranging the letters in the output specification. Feel free to suggest more clever formats!
- Vector inner product:
"a,a->"(Assumes two vectors of same length)
- Vector element-wise product:
"a,a->a"(Assumes two vectors of same length)
- Vector outer product:
(Vectors not necessarily same length.)
- Matrix transposition:
- Matrix diagonal:
- Matrix trace:
- 1-D Sum:
- 2-D Sum:
- 3-D Sum:
- Matrix inner product
"ab,ab->"(If you pass twice the same argument, it becomes a matrix L2 norm)
- Left-multiplication Matrix-Vector:
- Right-multiplication Vector-Matrix:
- Matrix Multiply:
- Batch Matrix Multiply:
- Quadratic form / Mahalanobis Distance:
What type of argument(s) do
",->" take, and what do they do with them?
- Actually, not quite. I’m presenting to you only the most explicit form of einsum’s format string. It is possible to omit
->and the output specification. If you do so, Numpy expands the format string automatically by making a “reasonable” guess at what the free indices, and thus the output specification, should be. Numpy assumes that all indices that are used only once in the format string are the free indices, and sorts them ASCIIbetically to create the output specification. For instance,
kare used once,
iis ASCIIbetically first),
"ii->"(no indices are used only once, so the output is a 0-dimensional, scalar tensor) and
call used only once and sorted ASCIIbetically).
- That’s another white lie.
einsum()supports broadcastable axes by specifying
...instead of a single letter. But that’s a truly advanced feature.
Pastebin link: http://pastebin.com/9GPvxwUn