ninepoints — C++, graphics, deep learning, multithreading, Vulkan, OpenGL, DirectX, Math, compilers, and more
https://jeremyong.com/
<h1>Dual Quaternions for Mere Mortals</h1>
<p>This article is written for people seeking intuition on dual quaternions (and perhaps even complex numbers and quaternions) beyond what is offered by traditional texts.
The standard way a dual quaternion is defined is by introducing a <em>dual</em> unit <script type="math/tex">\epsilon</script> which satisfies <script type="math/tex">\epsilon\epsilon = 0</script> and slapping a quaternion and a dual-scaled quaternion together.
A whole ton of algebra follows and <em>mechanically</em> at least, everything checks out, as if by magic.
At some point, the writer will point out how dual quaternions are isomorphic to Clifford algebras (or somesuch mumbo-jumbo pertaining to Lie algebras).
If you’ve taken a course in abstract algebra and are intimately comfortable with the notion of homogeneous coordinate systems already, maybe such a treatise was more than adequate.
Chances are though, the concept of dual quaternions (and likely quaternions as well) feels somewhat <em>hollow</em>, and you are left with a lingering suspicion that there is a better way of grasping matters.
Geometric algebra is there (and I recommend you take a look), but personally, having an appreciation of both is still useful, and I don’t actually find quaternions or dual quaternions to be unduly difficult or an inferior formulation in any way.
I can’t lay claim to having a perfect understanding of the subject, but I am documenting my understanding here in the hopes that it may point you, fellow mathematician (graphics programmer, AI researcher, masochist, or what have you), in the right direction.</p>
<h2 id="the-goal">The goal</h2>
<p>The ultimate goal is to have some sense for dealing with both quaternions:</p>
<script type="math/tex; mode=display">q = a + b\mathbf{i} + c\mathbf{j} + d\mathbf{k}</script>
<p>and dual quaternions:</p>
<script type="math/tex; mode=display">p + {\epsilon}q</script>
<p>along with the ways they express rotation:</p>
<script type="math/tex; mode=display">\mathbf{p}' = \mathbf{q}\mathbf{p}\mathbf{q}^{-1}</script>
<script type="math/tex; mode=display">\mathbf{q} = \cos{\frac{\theta}{2}} + (u_x\mathbf{i} + u_y\mathbf{j} + u_z\mathbf{k})\sin{\frac{\theta}{2}}</script>
<p>and both rotation and translation:</p>
<script type="math/tex; mode=display">\mathbf{r} + \frac{\epsilon}{2}(d_x\mathbf{i} + d_y\mathbf{j} + d_z\mathbf{k})\mathbf{r}</script>
<p>There’s a lot of “odd” things here that don’t seem very natural (the half angle, the “extra” dimension, the necessity of the conjugate operator, the mysterious <script type="math/tex">\epsilon</script>, etc),
but hopefully after reading this, you’ll feel like these oddities are actually well-motivated and the only way it could be.
Quick reader beware: this post may feel a little long-winded, mainly because I’m trying to make an effort of demonstrating what <em>doesn’t</em> work. For readers looking for more precise and direct exposition, I’ve included a number of links at the bottom.
Perfect rigor is not a goal here because I do not want to assume that you’ve had any prior background in abstract algebra (although I hope you’ll be interested in it afterwards).
I’m also going to try and stray away from formulae that may be more abstruse for most readers (Rodrigues’ rotation formula, Plücker coordinates, etc.).
If those are familiar and comfortable topics, great!
However, we’re going to try and loosely prove the results we want just the same without them.</p>
<h2 id="back-to-complex-numbers">Back to complex numbers</h2>
<p>Let’s review some things we know (and bear with me if it seems too slow).
Complex numbers provide a means for us to elegantly describe rotation and scaling on a plane.
Multiplication by the imaginary number <script type="math/tex">i</script> is understood to perform a <script type="math/tex">90^\circ</script> rotation.
Multiplication by <script type="math/tex">i^2</script> represents two such rotations, so <script type="math/tex">i^2 = -1</script>.
Note first, there is an asymmetry in this algebra. Compared to a typical cartesian plane, the “action” described by multiplication
by <script type="math/tex">i</script> is different from multiplication by a real number. No amount of multiplying real numbers amongst each other can ever produce
an imaginary number, but imaginary numbers propel real units into imaginary ones without much effort.
Such asymmetry is actually fairly common in algebras, the study of which often amounts to defining various types of entities and the
myriad ways they interact (taking care that they interact “nicely” with properties we know and love like associativity, the distributive
law, and such).
Also, remember that not only can we multiply by <script type="math/tex">i</script>, we can add it too.
We interpret this as a sort of “y” coordinate, and this should surprise you (if it hasn’t before).
How can adding/subtracting units of <script type="math/tex">i</script> be compatible with the rotation action of multiplication by <script type="math/tex">i</script>?</p>
<p>Herein lies the unreasonable efficacy of complex numbers. Algebraically, if we take a complex number like <script type="math/tex">2 + 5i</script> and multiply
by <script type="math/tex">i</script>, we get <script type="math/tex">(2 + 5i)i = 2i + 5i^2 = -5 + 2i</script> which is our original number rotated <script type="math/tex">90^\circ</script>.
This is “obvious” perhaps, but without fully appreciating how this is so, future attempts to generalize the concept to quaternions and dual quaternions will indubitably fail.
For some clarity, we presume that addition is commutative (colloquially, it makes no difference if we go up and over, vs over and up).
By the distributive law, multiplying by <script type="math/tex">i</script> applies the rotation to both components of the number (in this case, <script type="math/tex">2</script> and <script type="math/tex">5i</script>)
leaving us with two rotated components that, summed, give the rotated form of the original number.
Incidentally, thinking in terms of the action as defined on the components is sometimes more helpful (the vector representation of a complex number is illustrative, but perhaps not the most fundamental).
Really, we just let <script type="math/tex">i</script> function as a “y-coordinate” as an algebraic convenience.
We could have left the vector in cartesian form <script type="math/tex">a\hat{x} + b\hat{y}</script> and defined <script type="math/tex">i\hat{x} = \hat{y}</script> and <script type="math/tex">i\hat{y} = -\hat{x}</script> to get the same results,
but this is redundant since replacing <script type="math/tex">\hat{x}</script> with <script type="math/tex">1</script> and <script type="math/tex">\hat{y}</script> with <script type="math/tex">i</script> produces the same effect.</p>
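<p>As a quick sanity check, the rotation really does distribute over the components. Here is a minimal sketch using Python’s built-in complex type (the snippet is mine, not part of the original derivation):</p>

```python
# Multiplying by i (written 1j in Python) rotates a complex number 90 degrees.
z = 2 + 5j
assert z * 1j == -5 + 2j           # (2, 5) rotates to (-5, 2)

# The distributive law applies the rotation to each component separately:
assert (2 * 1j) + (5j * 1j) == -5 + 2j
```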
<p>This should feel pretty gentle so far. Another question to test your understanding is to consider how rotations of angles that aren’t
right-angle multiples are defined. Why does the number system we defined so far permit, say, a <script type="math/tex">15^\circ</script> rotation or a <script type="math/tex">\pi/9</script> radian
rotation? After all, multiplication by <script type="math/tex">i/2</script> doesn’t give us half of a quarter turn (it shouldn’t because we’re scaling <script type="math/tex">i</script> down by a real number).
Intuitively, <em>exponentiation</em> of <script type="math/tex">i</script> should result in fractional turns (or turns greater than a right angle) since this is a way to multiply <script type="math/tex">i</script> by varying degrees.
But what’s, say, <script type="math/tex">\sqrt{i}</script> and how can we get such a quantity if we’ve only imposed that <script type="math/tex">i^2 = -1</script>?
Easy. <script type="math/tex">\sqrt{i}\sqrt{i} = i</script> so multiplying by the square root of <script type="math/tex">i</script> must produce a <script type="math/tex">45^\circ</script> rotation.
Better yet, we can even represent <script type="math/tex">\sqrt{i}</script> as a normal complex number (pretend we only care about the positive root) <script type="math/tex">1/\sqrt{2}(1 + i)</script> which we recognize as a vector stabbing in the <script type="math/tex">45^\circ</script> direction with length unity.
Of course, we can whip out De Moivre’s Theorem here, but I’m trying to keep things tidy (avoiding transcendental functions).
Try to think less about cosines and sines, and more about how from simple definitions and requirements (commutativity, the distributive law, <script type="math/tex">i^2 = -1</script>), we can suss out everything we need.</p>
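<p>The square-root claim is easy to verify numerically (again a Python sketch of my own, using the standard <code>cmath</code> module):</p>

```python
import cmath
import math

root_i = (1 + 1j) / math.sqrt(2)           # unit length, pointing 45 degrees up
assert cmath.isclose(root_i * root_i, 1j)  # two 45-degree turns = one 90-degree turn

# Applying root_i to the real unit rotates it halfway toward i:
p = root_i * 1
assert math.isclose(p.real, 1 / math.sqrt(2))
assert math.isclose(p.imag, 1 / math.sqrt(2))
```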
<h2 id="upgrading-to-quaternions">Upgrading to quaternions</h2>
<p>Great, we have a notion of rotation now in two dimensions, so let’s go to three.
This is where tons of people get tripped up because quaternions behave somewhat differently from complex numbers (at least superficially).
For one thing, a general quaternion has three imaginary parts and one real part (as opposed to the “expected” two imaginary parts).
Furthermore, the rotation formula involves a strange conjugation and rotates by a half-angle. What gives?</p>
<p>The first thing to recognize is that “going from two dimensions to three dimensions” is a somewhat misleading characterization of our transition. Instead, consider that we’ve gone from one axis of rotation to three axes of rotation!
This alone should make the “extra” imaginary component less surprising.
The other “weird” thing happening is that we want to describe rotations in 3D space… but we now have a 4-dimensional algebraic entity.
Let’s see what happens if we try the “natural” thing of representing 3D space with entities of the form <script type="math/tex">a + bi + cj</script>, where <script type="math/tex">i^2 = j^2 = -1</script> and multiplication by each imaginary unit represents a <script type="math/tex">90^\circ</script> rotation.
Suppose we let units of <script type="math/tex">j</script> go “into” and “out of” the graph of the typical complex plane, so that multiplication by <script type="math/tex">i</script> rotates about the <script type="math/tex">j</script>-axis (as before).
We <em>immediately</em> run into a problem of deciding what axis <script type="math/tex">j</script> should rotate a point about.
The real axis? The <script type="math/tex">i</script>-axis? Something else?
Unfortunately, there are an infinite number of choices because we need three axes of rotation in 3D-space, not two.
What we really need, is a third “imaginary” unit that behaves similarly to the other two, all three of which together encode information about rotation.</p>
<p>Ergo, let’s add that fourth dimension.
First, mentally visualize the real number line.
Next, imagine that the imaginary units <script type="math/tex">i</script>, <script type="math/tex">j</script>, and <script type="math/tex">k</script> are all orthogonal to it (we are in 4D now; think of these imaginary planes as extensions to the real number line if you will).
For every real number, each one defines something akin to the complex plane from before (<script type="math/tex">i^2 = j^2 = k^2 = -1</script>).
To be a well-defined system, multiplication by <script type="math/tex">i</script> needs to “work” not just when multiplying a real number, but also units of <script type="math/tex">i</script>, <script type="math/tex">j</script>, and <script type="math/tex">k</script>.
What should <script type="math/tex">ij</script> be for example? Well, multiplication of a real number by <script type="math/tex">i</script> rotates about an axis (or set of axes, remember we are in 4D) orthogonal to both the reals and the <script type="math/tex">i</script>-axis.
It would be extremely odd for multiplication by <script type="math/tex">i</script> to take <script type="math/tex">j</script> into the reals or <script type="math/tex">i</script> units, because subsequent multiplication would never “wrap back around” to the <script type="math/tex">j</script> units.
We need <script type="math/tex">i</script> to be a pure rotation, so that won’t do, not to mention that the operation is no longer invertible.
Similarly, <script type="math/tex">i</script> shouldn’t scale <script type="math/tex">j</script> by some real-valued amount since the reals already do that, and it seems unreasonable that the reals
and the imaginary units of <script type="math/tex">i</script> act the same in the <script type="math/tex">j</script> dimension but differently elsewhere.
Our only choice then, is that the action of multiplying <script type="math/tex">i</script> onto <script type="math/tex">j</script> produces <script type="math/tex">k</script> units (we’ll choose positive <script type="math/tex">k</script> by convention) so that <script type="math/tex">ij = k</script>.
Multiplying both sides into <script type="math/tex">k</script> gives <script type="math/tex">ijk = -1</script> and applying <script type="math/tex">i</script> to this gives <script type="math/tex">i^2jk = -i \Rightarrow jk = i</script>.
So there we have it, all the definitions associated with the canonical quaternion basis units.
This rhymes with the situation with the complex numbers, but is different in the sense that the action defined by multiplying each imaginary unit rotates not just the reals but the other imaginary units as well.
This is a <em>very</em> important distinction that will come into play when appreciating the rotation formula later.
Applying these definitions along with the usual rules of additive commutativity, associativity, and the distributive law gets us a well-defined way of adding and multiplying quaternions together.</p>
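<p>To make the multiplication rules concrete, here is a minimal Hamilton product in Python (my own illustrative helper, with quaternions stored as <code>(real, i, j, k)</code> tuples), verifying the identities we just derived:</p>

```python
def qmul(a, b):
    """Hamilton product of two quaternions stored as (real, i, j, k)."""
    aw, ax, ay, az = a
    bw, bx, by, bz = b
    return (aw*bw - ax*bx - ay*by - az*bz,
            aw*bx + ax*bw + ay*bz - az*by,
            aw*by - ax*bz + ay*bw + az*bx,
            aw*bz + ax*by - ay*bx + az*bw)

i, j, k = (0, 1, 0, 0), (0, 0, 1, 0), (0, 0, 0, 1)

assert qmul(i, j) == k                       # ij = k
assert qmul(j, k) == i                       # jk = i
assert qmul(k, i) == j                       # ki = j
assert qmul(i, i) == (-1, 0, 0, 0)           # i^2 = -1
assert qmul(qmul(i, j), k) == (-1, 0, 0, 0)  # ijk = -1
assert qmul(j, i) == (0, 0, 0, -1)           # ji = -k: the units anti-commute
```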
<p>“Well, that’s all fine and dandy,” you might say, “but I’m interested in rotations in 3D, not 4D!”
Very true, and herein lies the trickier bit to grasp.
We need a way to express the effect of 4D quaternion multiplication in three-space.
What would be great is if we could just read off the imaginary components directly.
After all, they naturally correspond to three orthogonal axes with natural properties for component-wise vector summation.
Should two quaternions with different real parts represent the same 3D point/vector though?
Would such a system work at all?
Let’s try to use our quaternion algebra to do a really simple task first, rotating the point <script type="math/tex">(1, 0, 0)</script> about the <script type="math/tex">z</script> axis <script type="math/tex">90^\circ</script> (we should end up at <script type="math/tex">(0, 1, 0)</script>).
First, let’s represent our point as a quaternion <script type="math/tex">0 + i + 0j + 0k = i</script>.
We know we need this to end up at <script type="math/tex">j</script>, and as it turns out, multiplying by <script type="math/tex">k</script> does exactly that, which more or less mirrors our experience thus far with complex numbers (we don’t multiply by <script type="math/tex">j</script> to rotate into <script type="math/tex">j</script> because we are rotating from a different imaginary unit, not the reals). So far so good!
Spoiler alert, though, this approach doesn’t work generally as we’ll see soon enough.
Now let’s try something harder.
Let’s rotate the point <script type="math/tex">(1, 0, 1)</script> about the <script type="math/tex">z</script> axis (we’re expecting <script type="math/tex">(0, 1, 1)</script>).
Representing it as the quaternion <script type="math/tex">i + k</script> and multiplying by <script type="math/tex">k</script>, we get <script type="math/tex">ki + kk = j - 1</script>.
Ah, now we see the problem.</p>
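<p>We can reproduce this failed attempt numerically (a Python sketch of mine, with quaternions stored as <code>(real, i, j, k)</code> tuples):</p>

```python
def qmul(a, b):
    """Hamilton product of two quaternions stored as (real, i, j, k)."""
    aw, ax, ay, az = a
    bw, bx, by, bz = b
    return (aw*bw - ax*bx - ay*by - az*bz,
            aw*bx + ax*bw + ay*bz - az*by,
            aw*by - ax*bz + ay*bw + az*bx,
            aw*bz + ax*by - ay*bx + az*bw)

k = (0, 0, 0, 1)
point = (0, 1, 0, 1)  # the point (1, 0, 1) encoded as i + k

# A single multiplication by k corrupts the k component of the point:
assert qmul(k, point) == (-1, 0, 1, 0)  # j - 1, not the hoped-for j + k
```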
<p>We <em>wanted</em> multiplication by <script type="math/tex">k</script> to rotate us by the <script type="math/tex">z</script> axis, but in actuality, this operation does multiple things, rotating not just units of <script type="math/tex">i</script>, but also <script type="math/tex">k</script> (<script type="math/tex">j</script> too although we didn’t see it in the example).
As a brief recap, we’re looking for an operation that can express a rotation about the <script type="math/tex">z</script> axis for <em>all</em> points.
Let’s try to brute force something for all points in the <script type="math/tex">xz</script>-plane.
We know that application of <script type="math/tex">k</script> makes the following “movements”:</p>
<script type="math/tex; mode=display">\mathbb{R} \overset{k}{\rightarrow} \mathbf{k} \overset{k}{\rightarrow} \mathbb{R}</script>
<script type="math/tex; mode=display">\mathbf{i} \overset{k}{\rightarrow} \mathbf{j} \overset{k}{\rightarrow} \mathbf{i}</script>
<p>Our spider senses here should be tingling.
A single application of <script type="math/tex">k</script> is going to generate the wrong types of units, but <em>two</em> applications will always get us back to where we started.
From our previous example, we want to keep the action of <script type="math/tex">k</script> that brought <script type="math/tex">i</script> to <script type="math/tex">j</script>, but don’t want the extra effect of bringing <script type="math/tex">k</script> to the reals.
We have one trick up our sleeve though, which is the judicious use of left vs right multiplication.
When we move from the reals to an imaginary unit and back, it makes no difference whether we use left or right multiplication.
In contrast, rotating among the imaginary units will flip signs depending on whether we pre- or post-multiply.
Ergo, we have a hope of finding a pair of quantities such that upon pre <em>and</em> post multiplication to the vector we wish to rotate, we can cancel out rotation we don’t care about, and preserve rotation we do care about.</p>
<p>This is where the conjugation operation <script type="math/tex">pqp^{-1}</script> comes into play.
Let’s choose our rotator <script type="math/tex">p</script> to be of the form <script type="math/tex">a + b\mathbf{k}</script> (again, focusing on just rotation about the <script type="math/tex">z</script> axis)
and take <script type="math/tex">p^{-1} = a - b\mathbf{k}</script> to be the inverse (analogous to complex conjugates).
We can enforce that <script type="math/tex">a^2 + b^2 = 1</script> to ensure we aren’t changing the length of <script type="math/tex">q</script>, and intuitively, this makes sense because our operation will only have a single degree of freedom (two variables, one constraint).
Applying the conjugation to <script type="math/tex">i</script> gives:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{align}(a + b\mathbf{k})\mathbf{i}(a - b\mathbf{k}) &= (a + b\mathbf{k})(a\mathbf{i} - b\mathbf{i}\mathbf{k})\\
&= (a + b\mathbf{k})(a\mathbf{i} + b\mathbf{j})\\
&= a^2\mathbf{i} + ab\mathbf{j} + ab\mathbf{k}\mathbf{i} + b^2\mathbf{k}\mathbf{j}\\
&= (a^2 - b^2)\mathbf{i} + 2ab\mathbf{j}\end{align} %]]></script>
<p>In order to get the result we want, we require <script type="math/tex">a^2 = b^2</script> and <script type="math/tex">2ab = 1</script> which works if <script type="math/tex">a = b = 1/\sqrt{2}</script>.
Notice how the inclusion of the reals and the sign flip allow us to get a quantity we can cancel as much or as little as we like by choosing <script type="math/tex">a</script> and <script type="math/tex">b</script>, along with a secondary amount we can control (in this case, pointing along <script type="math/tex">j</script>).
We were able to do this by leveraging the anti-commutative aspects of the imaginary units compared to the reals.
Let’s try doing the same computation to our point offset by a unit of <script type="math/tex">k</script> (goal after rotation is <script type="math/tex">j + k</script>).</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{align}(a + b\mathbf{k})(\mathbf{i} + \mathbf{k})(a - b\mathbf{k}) &= (a + b\mathbf{k})(a\mathbf{i} - b\mathbf{i}\mathbf{k} + a\mathbf{k} - b\mathbf{k}\mathbf{k})\\
&= ab + a^2\mathbf{i} + ab\mathbf{j} + a^2 \mathbf{k}\\
&\phantom{= ab}+ b^2 \mathbf{k} + ab \mathbf{k}\mathbf{i} + b^2 \mathbf{k}\mathbf{j} + ab \mathbf{k}\mathbf{k}\\
&= (a^2 - b^2) \mathbf{i} + 2ab \mathbf{j} + (a^2 + b^2) \mathbf{k}\end{align} %]]></script>
<p>This is where something mindblowing happens.
The same choice of <script type="math/tex">a</script> and <script type="math/tex">b</script> as before (<script type="math/tex">1/\sqrt{2}</script>) produces the desired rotation of <script type="math/tex">i + k</script>!
If you follow the algebra closely, the <script type="math/tex">k</script> component of the vector we were rotating moved into the reals and back again twice.
That is, for our choice of <script type="math/tex">p</script>, we’ve arranged it so that application of the conjugation to any vector with a component along <script type="math/tex">k</script> will preserve that <script type="math/tex">k</script> provided that <script type="math/tex">a^2 + b^2 = 1</script>.
Now, this resembles the Pythagorean identity <script type="math/tex">\cos^2\theta + \sin^2\theta = 1</script>.
The other term that shows up, <script type="math/tex">2ab</script>, looks a lot like the sine double-angle formula <script type="math/tex">\sin{2\theta} = 2\sin{\theta}\cos{\theta}</script>.
This leaves our last term, <script type="math/tex">a^2 - b^2</script>, which looks a lot like the cosine double-angle formula <script type="math/tex">\cos{2\theta} = \cos^2{\theta} - \sin^2{\theta}</script>.
In reality, this shouldn’t be too surprising, since we’ve taken what amounted to a linear combination of <script type="math/tex">90^\circ</script> rotations (some arbitrary rotation in 4-space) and applied it twice with a sign adjustment on the second application.</p>
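<p>Both conjugations above are easy to check numerically (a Python sketch with my own helper names; the arithmetic is exactly the derivation just shown):</p>

```python
import math

def qmul(a, b):
    """Hamilton product of two quaternions stored as (real, i, j, k)."""
    aw, ax, ay, az = a
    bw, bx, by, bz = b
    return (aw*bw - ax*bx - ay*by - az*bz,
            aw*bx + ax*bw + ay*bz - az*by,
            aw*by - ax*bz + ay*bw + az*bx,
            aw*bz + ax*by - ay*bx + az*bw)

a = b = 1 / math.sqrt(2)
p     = (a, 0, 0,  b)   # a + bk
p_inv = (a, 0, 0, -b)   # a - bk

def sandwich(q):
    return qmul(qmul(p, q), p_inv)

def close(u, v):
    return all(abs(x - y) < 1e-12 for x, y in zip(u, v))

assert close(sandwich((0, 1, 0, 0)), (0, 0, 1, 0))  # i rotates to j
assert close(sandwich((0, 1, 0, 1)), (0, 0, 1, 1))  # i + k rotates to j + k
```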
<p>Recapping, we were seeking a simple operation that, applied to any imaginary vector representing a direction in 3-space, would perform a rotation about the <script type="math/tex">z</script> axis.
We needed to ensure that any <script type="math/tex">k</script>-component in the rotating vector needed to be preserved, and to do this, we leveraged the reals by
applying two multiplications (moving to the reals and back).
We still needed a rotation effect to linger, so we took advantage of combining pre and post multiplication along with a sign flip so that
rotation of the imaginary components doesn’t cancel, but the rotations into and out of the reals does.
What we ended up with was a conjugation for which all the half angle formulas popped out, which intuitively, is great because we’re doing two applications after all.
Without going through the full derivation, it should make sense in a handwavey fashion that this reasoning applies equally well when rotating about a different axis, and due to the nice linear properties of our algebra, the general formula for rotation about an axis <em>should</em> be correct.</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{align}\mathbf{q}' &= \mathbf{p}\mathbf{q}\mathbf{p}^{-1}\\
\mathbf{p} &= \cos{\frac{\theta}{2}} + \sin{\frac{\theta}{2}} (u_x \mathbf{i} + u_y \mathbf{j} + u_z \mathbf{k})\end{align} %]]></script>
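<p>Packaged up as a general axis–angle rotation, the formula reads off directly (a hypothetical Python sketch; the function names are mine):</p>

```python
import math

def qmul(a, b):
    """Hamilton product of two quaternions stored as (real, i, j, k)."""
    aw, ax, ay, az = a
    bw, bx, by, bz = b
    return (aw*bw - ax*bx - ay*by - az*bz,
            aw*bx + ax*bw + ay*bz - az*by,
            aw*by - ax*bz + ay*bw + az*bx,
            aw*bz + ax*by - ay*bx + az*bw)

def rotor(axis, theta):
    """cos(theta/2) + sin(theta/2)(ux*i + uy*j + uz*k) for a unit-length axis."""
    ux, uy, uz = axis
    s = math.sin(theta / 2)
    return (math.cos(theta / 2), s * ux, s * uy, s * uz)

def rotate(v, axis, theta):
    p = rotor(axis, theta)
    p_inv = (p[0], -p[1], -p[2], -p[3])  # conjugate = inverse for unit quaternions
    q = (0, v[0], v[1], v[2])            # embed the vector as a pure quaternion
    return qmul(qmul(p, q), p_inv)[1:]

x, y, z = rotate((1, 0, 0), (0, 0, 1), math.pi / 2)  # 90 degrees about z
assert abs(x) < 1e-12 and abs(y - 1) < 1e-12 and abs(z) < 1e-12
```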
<h3 id="why-did-we-go-through-all-that-trouble">Why did we go through all that trouble</h3>
<p>Let’s talk about some properties of quaternion rotation.
First, it’s extremely easy to describe the rotation of a vector about another one (in contrast, try writing the rotation matrix by hand).
Applying the rotation just relies on the standard rules of algebra.
Also, we can compose multiple rotations with algebra just as well.
Suppose we want to rotate <script type="math/tex">q</script> by <script type="math/tex">p_1</script> and <script type="math/tex">p_2</script>.</p>
<script type="math/tex; mode=display">\mathbf{q}' = (\mathbf{p_2}\mathbf{p_1})\mathbf{q}(\mathbf{p_1}^{-1}\mathbf{p_2}^{-1})</script>
<p>We have an identity <script type="math/tex">(\mathbf{p_2}\mathbf{p_1})^{-1} = \mathbf{p_1}^{-1}\mathbf{p_2}^{-1}</script> so we can work out the product <script type="math/tex">p_2p_1</script> in advance and take its inverse later.
Thus, quaternion rotation is just as composable as rotation using matrices (in fact, we can derive a matrix representation of the conjugation operation without too much difficulty).</p>
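<p>That composition identity can be checked numerically as well — applying two rotations one after the other matches applying their precomposed product (a sketch with my own helper names):</p>

```python
import math

def qmul(a, b):
    """Hamilton product of two quaternions stored as (real, i, j, k)."""
    aw, ax, ay, az = a
    bw, bx, by, bz = b
    return (aw*bw - ax*bx - ay*by - az*bz,
            aw*bx + ax*bw + ay*bz - az*by,
            aw*by - ax*bz + ay*bw + az*bx,
            aw*bz + ax*by - ay*bx + az*bw)

def qconj(p):
    return (p[0], -p[1], -p[2], -p[3])  # conjugate = inverse for unit quaternions

def rotor(axis, theta):
    ux, uy, uz = axis
    s = math.sin(theta / 2)
    return (math.cos(theta / 2), s * ux, s * uy, s * uz)

p1 = rotor((0, 0, 1), math.pi / 2)  # 90 degrees about z
p2 = rotor((1, 0, 0), math.pi / 3)  # 60 degrees about x
p21 = qmul(p2, p1)                  # the rotations composed up front

q = (0, 1, 0, 0)  # the point (1, 0, 0)
two_steps = qmul(qmul(p2, qmul(qmul(p1, q), qconj(p1))), qconj(p2))
one_step  = qmul(qmul(p21, q), qconj(p21))
assert all(abs(s - o) < 1e-12 for s, o in zip(two_steps, one_step))
```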
<p>The real kicker is that because the application of quaternion rotation is done through <em>purely linear algebraic manipulation</em>, we can take derivatives of the rotated result <script type="math/tex">\mathbf{q}'</script> with respect to changes in the rotation <script type="math/tex">\mathbf{p}</script>.
This does <em>not work</em> with matrices element by element, because for a general rotation matrix, the elements are actually coupled to each other nonlinearly.
This is why we can efficiently and accurately interpolate between quaternions (either linearly or spherically depending on the velocity desired).</p>
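<p>As a small taste of interpolation (not developed further in this article), here is normalized linear interpolation, a common cheap stand-in for full spherical interpolation; the code and names are my own sketch:</p>

```python
import math

def nlerp(p0, p1, t):
    """Blend two unit quaternions component-wise, then renormalize."""
    blended = tuple((1 - t) * a + t * b for a, b in zip(p0, p1))
    norm = math.sqrt(sum(c * c for c in blended))
    return tuple(c / norm for c in blended)

identity  = (1.0, 0.0, 0.0, 0.0)
quarter_z = (math.cos(math.pi / 4), 0.0, 0.0, math.sin(math.pi / 4))

half = nlerp(identity, quarter_z, 0.5)
# Halfway between no rotation and 90 degrees about z is 45 degrees about z:
assert math.isclose(2 * math.acos(half[0]), math.pi / 4, abs_tol=1e-9)
```

<p>Nlerp does not sweep the angle at a constant rate the way slerp does, but for small steps it is often good enough and is cheaper to evaluate.</p>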
<h2 id="upgrading-to-dual-quaternions">Upgrading to dual quaternions</h2>
<p>Hopefully at this point, we’re ready to make the leap to dual quaternions.
Like before, we want to develop a well-behaved algebra, but in this case, we want to extend quaternions to permit both rotations and translations.
It’s not wrong to wonder why this is difficult; after all, most people are introduced to translations simply as component-wise addition of vectors.
We could, if we wanted, proceed in this way with a bit of bookkeeping, remembering what quantities should be interpreted as translations, and what quantities should be rotations.
Difficulty will ensue, however, if we wish to compress multiple transformations (both rotations and translations) together, since the operations don’t commute in general.
If we chose to represent everything with matrices in projective space, this is possible but we would lose the nice compact representation we just developed with quaternions, as well as the ability to cleanly interpolate or differentiate.
So, we need to <em>extend</em> our currently developed quaternion algebra to encode translations.</p>
<p>Let’s pretend first that we haven’t already seen the formulation of dual numbers and all that.
Left to our own devices, we know that we want our “upgraded quaternion” to apply transformations via the same conjugation operator as before.
That way, the transformations can compose via simple multiplication as before and we can reuse our rotation formulation as well.
To separate the effect of a translation from that of a rotation, we’ll introduce a new unit called <script type="math/tex">\epsilon</script> so that our upgraded quaternion has the form <script type="math/tex">\mathbf{p} + \epsilon\mathbf{q}</script>.
Note that we aren’t so much “requiring” that this is the correct form.
The introduction of a new unit <script type="math/tex">\epsilon</script> imposes this form that now encompasses all possible linear combinations of units that make up our dual quaternion.
Let’s consider how the conjugation operator behaves when acting on the identity (we’ll define the inverse based on the quaternion inverses as well as a sign flip on the dual element).</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{align}
(\mathbf{p} + \epsilon\mathbf{q})(\mathbf{p}^{-1} - \epsilon\mathbf{q}^{-1}) &= \mathbf{p}\mathbf{p}^{-1} - \epsilon\mathbf{p}\mathbf{q}^{-1} + \epsilon\mathbf{q}\mathbf{p}^{-1} - \epsilon^2\mathbf{q}\mathbf{q}^{-1} \\
&= 1 + \epsilon(\mathbf{p}^{-1}\mathbf{q} - \mathbf{p}\mathbf{q}^{-1}) - \epsilon^2
\end{align} %]]></script>
<p>If instead of conjugating the identity, we conjugated a versor (purely imaginary quaternion) scaled by <script type="math/tex">\epsilon</script>, we’d get the following:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{align}
(\mathbf{p} + \epsilon\mathbf{q})\epsilon\mathbf{v}(\mathbf{p}^{-1} - \epsilon\mathbf{q}^{-1}) &= \epsilon\mathbf{p}\mathbf{v}\mathbf{p}^{-1} - \epsilon^2\mathbf{p}\mathbf{v}\mathbf{q}^{-1} + \epsilon^2\mathbf{q}\mathbf{v}\mathbf{p}^{-1} - \epsilon^3\mathbf{q}\mathbf{v}\mathbf{q}^{-1} \\
\end{align} %]]></script>
<p>At this point, let’s make the following observation.
<em>If</em> we let <script type="math/tex">\epsilon^2 = 0</script>, then the conjugation by a dual quaternion of a versor scaled by <script type="math/tex">\epsilon</script> is just <script type="math/tex">\epsilon \mathbf{p}\mathbf{v}\mathbf{p}^{-1}</script>.
This is <em>precisely</em> the rotation operator of the standard quaternion from the previous section.
Conversely, the conjugation operator on the identity reduces down to <script type="math/tex">1 + \epsilon(\mathbf{p}^{-1}\mathbf{q} - \mathbf{p}\mathbf{q}^{-1})</script>.
To proceed then, we need to consider the quantity <script type="math/tex">\mathbf{p}^{-1}\mathbf{q} - \mathbf{p}\mathbf{q}^{-1}</script>.
If this could somehow represent translation, then we’d have both our bases covered. Expanding:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{align}
\mathbf{p}^{-1}\mathbf{q} - \mathbf{p}\mathbf{q}^{-1} &= (p_r - \mathbf{p_i})(q_r + \mathbf{q_i}) - (p_r + \mathbf{p_i})(q_r - \mathbf{q_i}) \\
&= p_rq_r +p_r\mathbf{q_i} -q_r\mathbf{p}_i - \mathbf{p_i}\mathbf{q_i} \\
&\phantom{== } - p_r q_r + p_r\mathbf{q_i} - q_r\mathbf{p_i} + \mathbf{p_i}\mathbf{q_i} \\
&= 2p_r\mathbf{q_i} - 2 q_r\mathbf{p_i}
\end{align} %]]></script>
<p>At this point, to “arrange” that we get a translation, let’s arbitrarily choose <script type="math/tex">p_r = 1</script>, <script type="math/tex">q_r = 0</script> and <script type="math/tex">\mathbf{q_i}</script> to point in some direction <script type="math/tex">\mathbf{d}</script> and have unit length
so that the expression above reduces to <script type="math/tex">2\mathbf{d}</script>.
This corresponds to a motion in the direction <script type="math/tex">\mathbf{d}</script> with a displacement of 2!
This indicates that the conjugation of a dual quaternion of the form <script type="math/tex">1 + \frac{\epsilon}{2}\mathbf{d}</script> performs a pure translation along <script type="math/tex">\mathbf{d}</script>.
Meanwhile, because <script type="math/tex">q_r = 0</script>, <script type="math/tex">\mathbf{p_i}</script> drops out of the expression, so we can continue to use it to represent rotation.</p>
<p>To finalize things, we simply compose the two actions. Given a quaternion representing a rotation <script type="math/tex">\mathbf{q_r}</script> and a vector of desired displacement <script type="math/tex">\mathbf{d}</script> (both unit in length), we combine the effects in the only reasonable way (multiplication).
We have associativity after all, because we have a cleanly defined algebra, so let’s use it!</p>
<script type="math/tex; mode=display">\left(1 + \frac{\epsilon}{2}(d_x\mathbf{i} + d_y\mathbf{j} + d_z\mathbf{k})\right)\mathbf{q_r} = \mathbf{q_r} + \frac{\epsilon}{2}(d_x \mathbf{i} + d_y \mathbf{j} + d_z\mathbf{k})\mathbf{q_r}</script>
<p>Like our quaternion, this quantity has a well defined derivative (not developed here) and can thus be used for rigid transformations, unlike a matrix formulation (again with weird non-linear cross terms).
Remember that when we are trying to transform a point, it now needs to be in the form <script type="math/tex">1 + \epsilon(p_x\mathbf{i} + p_y\mathbf{j} + p_z\mathbf{k})</script> (refer to the development of the equation from the beginning of this section if you forget why).</p>
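<p>Putting the whole section together, here is a sketch of a dual quaternion rigid transform in Python (representation and names are mine; note that the sandwich below uses the standard combined conjugate — quaternion conjugate on both parts plus a dual sign flip — which is tidier bookkeeping than the loose inverse used in the derivation above):</p>

```python
import math

def qmul(a, b):
    """Hamilton product of two quaternions stored as (real, i, j, k)."""
    aw, ax, ay, az = a
    bw, bx, by, bz = b
    return (aw*bw - ax*bx - ay*by - az*bz,
            aw*bx + ax*bw + ay*bz - az*by,
            aw*by - ax*bz + ay*bw + az*bx,
            aw*bz + ax*by - ay*bx + az*bw)

def qconj(p):
    return (p[0], -p[1], -p[2], -p[3])

def dq_mul(A, B):
    """(p1 + eps q1)(p2 + eps q2) = p1 p2 + eps (p1 q2 + q1 p2), since eps^2 = 0."""
    (p1, q1), (p2, q2) = A, B
    return (qmul(p1, p2),
            tuple(u + v for u, v in zip(qmul(p1, q2), qmul(q1, p2))))

def motor(r, d):
    """r + (eps/2)(dx i + dy j + dz k) r: rotate by unit r, then translate by d."""
    half_d = (0, d[0] / 2, d[1] / 2, d[2] / 2)
    return (r, qmul(half_d, r))

def transform(m, point):
    p, q = m
    conj = (qconj(p), tuple(-c for c in qconj(q)))          # combined conjugate
    X = ((1, 0, 0, 0), (0, point[0], point[1], point[2]))   # the point as 1 + eps(x i + y j + z k)
    return dq_mul(dq_mul(m, X), conj)[1][1:]

# Rotate (1, 0, 0) by 90 degrees about z, then translate by (0, 0, 1):
r = (math.cos(math.pi / 4), 0, 0, math.sin(math.pi / 4))
x, y, z = transform(motor(r, (0, 0, 1)), (1, 0, 0))
assert abs(x) < 1e-12 and abs(y - 1) < 1e-12 and abs(z - 1) < 1e-12
```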
<h2 id="review">Review</h2>
<p>We started with a new set of units <script type="math/tex">i</script>, <script type="math/tex">j</script>, and <script type="math/tex">k</script> to encode rotations about 3 axes.
Things needed to be different from our familiar complex numbers because we jumped from a single axis of rotation to three.
This required an additional dimension to keep things clean (in the same way that we needed two dimensions to handle a single axis of rotation).
The action of multiplication by each individual unit performed two effects: rotation into and out of the reals from that unit, and a rotation of the other imaginary units in a manner that anti-commutes (commutes with a sign flip).
Thus, encoding a rotation is done quadratically with a sign flip to cancel the effect we don’t want, and persist the effect we <em>do</em> want.
This doubled the effect of the rotation, hence the presence of half-angles in our final formulae.</p>
<p>Moving to dual quaternions, we wanted to introduce a way of encoding translation by a vector that would not disrupt the mechanics of the conjugation operator.
This way, we could in effect re-use the machinery of the quaternion while maintaining the nice algebraic property of associativity (which is what lets us compose successive transformations with multiplication).
To separate the encoding of the translation, we introduced a new unit <script type="math/tex">\epsilon</script>.
Because our conjugation operation produces quadratic terms, we simply impose that the dual unit has the property <script type="math/tex">\epsilon^2 = 0</script> (in fancy terms, <script type="math/tex">\epsilon</script> is nilpotent).
We then performed the conjugation on a real number and a pure imaginary quaternion to see the effect.
By choosing the non-dual and dual parts carefully, we could easily produce a pure rotation and a pure translation.
Composing the two operations multiplicatively (again exploiting associativity), we were able to arrive at the final expression representing a combined rotation and translation.</p>
<p>Both quaternions and dual quaternions vary continuously and can be differentiated, which comes into play when implementing animation systems or simulations that rely on rigid motion.
We didn’t develop any formalism in this regard, but hopefully, texts that define spherical interpolation and quaternion/dual-quaternion derivatives will now be more accessible.
As a final takeaway, it’s worth stepping back and appreciating the efficacy of abstract algebra as a tool for encoding actions.
Typically, the development of a new algebra stems from identifying a desired behavior, attempting to arrange for it to “be so,” and observing the fallout that results.
To continue your studies, I’ve assembled a number of helpful papers and resources I’ve used below.
Feel free to message or tweet your feedback on the article using the various social media links below.
Thanks for reading and if you managed to get through the whole thing, give yourself a proverbial pat on the back.</p>
<h2 id="references">References</h2>
<ul>
<li><a href="https://en.wikipedia.org/wiki/Quaternion">Wikipedia:Quaternions</a></li>
<li><a href="https://en.wikipedia.org/wiki/Dual_number">Wikipedia:Dual Numbers</a></li>
<li><a href="https://en.wikipedia.org/wiki/Dual_quaternion">Wikipedia:Dual Quaternion</a></li>
<li><a href="https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3576712/">3D kinimatics using dual quaternions: theory and applications in neuroscience</a></li>
<li><a href="https://pdfs.semanticscholar.org/05b1/8ede7f46c29c2722fed3376d277a1d286c55.pdf">Applications of Dual Quaternions in Three Dimensional Transformation and Interpolation</a></li>
<li><a href="https://www.cs.utah.edu/~ladislav/kavan06dual/kavan06dual.pdf">Dual Quaternions for Rigid Transformation Blending</a></li>
<li><a href="https://pdfs.semanticscholar.org/7bf0/ff0c2a6161f25f1f9d669da65ee896a8e99c.pdf">Dual-Quaternions: From Classical Mechanics to Computer Graphics and Beyond</a></li>
</ul>
Mon, 05 Aug 2019 00:00:00 +0000
https://jeremyong.com/math/2019/08/05/dual-quaternions-for-mere-mortals/
https://jeremyong.com/math/2019/08/05/dual-quaternions-for-mere-mortals/Render Graph Optimization Scribbles<p>When I implemented my own render graph, I needed to make a number of decisions about how to proceed and wanted to
record the various considerations that go into authoring such a library.
By way of background, a <em>render graph</em> is an acyclic directed graph of nodes, each of which may consume a set of
resources and produce a set of resources.
Edges in the graph denote an execution dependency (the child node should happen after the parent), with the caveat
that the dependency need not be a full pipeline barrier. For example, a parent node may produce results A and B,
but a child node may only depend on A.
Another caveat is that the dependencies may be relaxed to allow some overlap.
For example, a parent node may produce some result in its fragment stage which is consumed in the fragment stage of a child node.
In such a case, the child node can execute its vertex stage while the parent node’s fragment stage is still running.</p>
<p>In this post, I’m going to focus on the optimization aspects of a render graph as there are many benefits of a render graph in general,
and touching on all of them would result in a much larger post.
Restricting ourselves to the optimization effort, we can then summarize the goals of any render graph processing as the following:</p>
<ol>
<li>Allow overlapping execution as much as possible (maximize chip occupancy)</li>
<li>Express dependencies such that flushes are deferred for as long as possible and span the minimum surface area of memory that needs to be flushed</li>
</ol>
<p>These goals must be accomplished without violating the dependencies encoded in the node edges.
With this, here are some pointers for accomplishing the above without too much fuss.</p>
<h2 id="we-need-a-good-answer-to-the-following-question">We need a good answer to the following question</h2>
<p>Suppose you have a graph that looks like the following</p>
<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>--A---
\
C
/
--B---
</code></pre></div></div>
<p>Here, <code class="highlighter-rouge">A</code> and <code class="highlighter-rouge">B</code> are dependencies of <code class="highlighter-rouge">C</code> and they themselves are dependent on other nodes that are not pictured.
The question is, <em>how should I schedule</em> <code class="highlighter-rouge">A</code> <em>and</em> <code class="highlighter-rouge">B</code>?
It’s important to understand why this is an important question for which the answer isn’t immediately obvious.
The first observation one should make is that scheduling both at the last possible moment (right before <code class="highlighter-rouge">C</code> is submitted) is not necessarily ideal,
as <code class="highlighter-rouge">C</code> will then need to wait for whichever of <code class="highlighter-rouge">A</code> and <code class="highlighter-rouge">B</code> finishes last.
Perhaps there was an opportunity for <code class="highlighter-rouge">A</code> to cleanly slot into available chip space and finish before <code class="highlighter-rouge">B</code> (or vice versa).
Maybe <code class="highlighter-rouge">A</code> depends on a number of caches being flushed from a parent node, and scheduling <code class="highlighter-rouge">B</code> after <code class="highlighter-rouge">A</code> would unnecessarily
delay <code class="highlighter-rouge">B</code> from starting earlier.</p>
<p>OK, so there’s a decision to be made here, but what’s the right way of scheduling <code class="highlighter-rouge">A</code> and <code class="highlighter-rouge">B</code>?
The simple heuristic I’ll offer is the following:</p>
<blockquote>
<p>A node should be scheduled <strong>later</strong> if it consumes many results that are expensive to produce.
A node should be scheduled <strong>earlier</strong> if it is expensive to execute and its products are consumed by many dependent nodes.</p>
</blockquote>
<p>The two statements above are two sides of the same coin, and while imprecise, still concisely summarize the main idea.
If a node relies on data being produced by many preceding jobs or by jobs that take a long time to run, early submission is wasteful
as execution will be stalled by the parent jobs.
By the same token, if a node is relied on by many downstream nodes, it behooves us to submit it as early as possible so that
by the time those dependent nodes are scheduled, the data is ready to be consumed.</p>
<p>Of course, there is a tension here. What if a node relies on a lot of data from upstream nodes and also produces a lot of
data for downstream nodes?
In such a case, one can view this node as a potential bottleneck, and it will be practically difficult to avoid
occupancy bubbles forming either in front of the node, behind it, or both, depending on when we submit it.</p>
<p>Another consideration that makes things even more difficult is that different nodes have different occupancy characteristics.
We would want to avoid scheduling two transfer-heavy nodes simultaneously if possible as the usage of the
same limited resource prevents maximum chip utilization.
The same is true for co-scheduling nodes that compete for other resources (LDS, registers, etc).</p>
<h2 id="implementation-strategy">Implementation strategy</h2>
<p>Keeping track of all the above is a pretty daunting ordeal, and a perfect solution is likely to span thousands of lines
of code and possibly be expensive to run per frame if taken to the extreme.
There is a tradeoff between a “perfect” scheduling algorithm, which becomes hard to maintain, and a simpler heuristic that produces
sufficiently good results that required frame times are honored.</p>
<p>Of course, every situation will be different. For your own implementation, you should consider the different hardware
profiles you’ll need to ship for, how dynamic the graph workloads will be, and also how much performance is really required
for your engine to do what it needs to do.</p>
<p>I don’t want to leave this scribble on a “do what works for you” quip, so I’ll offer a modest starting point for more general
purpose engines. Here was my approach for my engine which you can either take wholesale or use as a starting point for your
design:</p>
<p>First, reshape your render graph into an n-ary tree that looks something like the following:</p>
<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>+---------------+---------------+
| A B C | D E |
+-------+---+---+---+---+-------+
| F | G | H | I | J | K |
+---+---+---+---+---+---+---+---+
| L | M | | N | O |
+---+---+ +---+---+
</code></pre></div></div>
<p>In the above example, each “box” can be thought of as a bucket that contains nodes that can be ordered anywhere among the
nodes in the boxes below it and each other. Here, <code class="highlighter-rouge">A</code>, <code class="highlighter-rouge">B</code>, and <code class="highlighter-rouge">C</code> can happen in any order.
Furthermore, they can happen in any order relative to the nodes in any box below.
Boxes that are adjacent to
each other must happen one after the other. So, <code class="highlighter-rouge">F</code> needs to happen before <code class="highlighter-rouge">G</code>, and <code class="highlighter-rouge">G</code> needs to happen before <code class="highlighter-rouge">H</code> (but
any of <code class="highlighter-rouge">A</code>, <code class="highlighter-rouge">B</code>, or <code class="highlighter-rouge">C</code> could happen before, after, or in between <code class="highlighter-rouge">F</code> and <code class="highlighter-rouge">G</code>).
Each vertical line (<code class="highlighter-rouge">|</code>) denotes a memory dependency that must be inserted in the final submission.
An alternative way of viewing this tree is to realize that any node can be scheduled as early as the <code class="highlighter-rouge">|</code> to the left and as
late as the <code class="highlighter-rouge">|</code> to the right. As a result, this data structure effectively encodes the state space of all permissible submission orders that we can select from.</p>
<p>There’s another important invariant of this tree to discuss. Let us define the cardinality of each
box as the number of constituent nodes (i.e. the cardinality of the upper-right box is 2 in this example).
Note that we can enforce that the cardinality of a box at a lower depth (higher up pictorially) must be greater than or equal to the cardinality of any of its children.
To see why this is so, consider if we had a tree that looked like the following:</p>
<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>+-----+
| A |
+-----+
| B C |
+-----+
</code></pre></div></div>
<p>This tree indicates that <code class="highlighter-rouge">A</code> can go anywhere between <code class="highlighter-rouge">B</code> and <code class="highlighter-rouge">C</code>, but also that <code class="highlighter-rouge">B</code> and <code class="highlighter-rouge">C</code> can be reordered.
Thus, this tree is equivalent to the following:</p>
<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>+-------+
| A B C |
+-------+
</code></pre></div></div>
<p>Q.E.D.</p>
<p>After we have reshaped the tree as above (or possibly built it in this fashion from the get-go),
we can proceed by assigning each job a weight to eventually arrive at a total ordering.
Let us define the weight, loosely, as the cost to execute the job (more on this later). For this simple implementation, we will assume that jobs
occupy the chip uniformly. If we didn’t want to make this assumption, we could adapt this algorithm to use parametric weights,
but I will avoid doing this here for simplicity’s sake. From here, we can collapse the box-tree into a single flat array recursively as follows:</p>
<ol>
<li>Starting from a box <script type="math/tex">X</script>, look at its descendants <script type="math/tex">X_d</script></li>
<li>If the descendants are leaves, collapse <script type="math/tex">X</script> and <script type="math/tex">X_d</script> into a single list sorted by descending execution weight.
Then proceed to the next box (sibling)</li>
<li>If the descendants aren’t leaves, look at each descendant individually and recursively start from step 1 of this algorithm. Once they have been
flattened, continue as before.</li>
</ol>
<p>At the end of this procedure, we’ll have collapsed all our boxes into a single depth of submission-ordered nodes.
While doing this, of course, we need to remember to inject memory barriers corresponding to the box separators.</p>
<p>Following this heuristic, we have a rough way of ensuring that prior to each barrier, the worst offenders in terms of execution time have
had the most time to complete their tasks. In addition, by bucketing things the way we did, items that are dependencies for many downstream
nodes get bubbled to the front of the queue naturally (they would end up down and to the left in the original reshaped tree).</p>
<p>We still need to take care of the <em>execution weight</em>, which we so far assumed was simply given. One option is to let the user estimate it,
possibly with benchmarked values. A more flexible approach is to use actual timings from a previous frame to decide
the ordering. This has the benefit of making the submission list adaptable to changing scene rendering conditions.</p>
<h2 id="conclusion">Conclusion</h2>
<p>We have, as stated, a few high-level aspirations for maximizing chip occupancy by scheduling tasks so that they can overlap, using
permissive barriers between them (as necessary). In addition, we looked at one option (of many) for accomplishing our goals.
I should mention at this point that many (most?) engines will not necessarily benefit from this abstraction, depending on the
anatomy of a frame. If there aren’t many opportunities to reorder work or many independently describable nodes, trying to come
up with this overly generalized approach is likely not worth it. For engines that have deep pipelines and highly variable workloads,
however, the abstraction is well worth the investment.</p>
Fri, 28 Jun 2019 00:00:00 +0000
https://jeremyong.com/rendering/2019/06/28/render-graph-optimization-scribbles/
https://jeremyong.com/rendering/2019/06/28/render-graph-optimization-scribbles/Optimizing C++ by Avoiding Moves<p>This is a quick post, but I wanted to document an evolution in my thinking with respect to move operators and move construction.
Back when C++11 was released and we were getting used to the concepts, C++ moves were groundbreaking in their ability to greatly
accelerate STL containers, which were often forced to invoke copy constructors wholesale due to reallocation (e.g. a <code class="highlighter-rouge">std::vector</code> grows
in size and copies <code class="highlighter-rouge">N</code> elements as a result). A move constructor allowed the programmer to create a “shallow copy” so to speak which
is much faster than the default (presumably deep) copy. Ergo, to think that avoiding moves entirely might be a performance win is
somewhat paradoxical. Of course, it isn’t without its caveats, but for me, it’s been well worth it to go all in on possibly never
writing a move constructor again.</p>
<p>Let’s dive in to just a few of my gripes with move constructors.</p>
<h2 id="verbosity">Verbosity</h2>
<p>This should go without saying, but transitively applying <code class="highlighter-rouge">std::move</code> to all constituent members of a class or struct is a huge drag
for something that could very nearly be automated. There are also common repetitive patterns (in a destructor, check whether a thing is
<code class="highlighter-rouge">nullptr</code> and reclaim it if not; <code class="highlighter-rouge">std::move</code> other things) and idioms. Generally, repetition in code carries a nasty code-smell, and
this is no different in my opinion.</p>
<h2 id="performance">Performance</h2>
<p>Invoking a move constructor can often be a deoptimization relative to what you could do yourself. Here’s a simple motivating example
to show what I mean.</p>
<div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">struct</span> <span class="n">foo</span> <span class="p">{</span>
<span class="n">foo</span><span class="p">()</span> <span class="o">=</span> <span class="k">default</span><span class="p">;</span>
<span class="n">foo</span><span class="o">&</span> <span class="k">operator</span><span class="o">=</span><span class="p">(</span><span class="n">foo</span><span class="o">&&</span> <span class="n">f</span><span class="p">)</span> <span class="p">{</span>
<span class="k">if</span> <span class="p">(</span><span class="k">this</span> <span class="o">==</span> <span class="o">&</span><span class="n">f</span><span class="p">)</span> <span class="k">return</span> <span class="o">*</span><span class="k">this</span><span class="p">;</span>
<span class="k">auto</span> <span class="n">tmp</span> <span class="o">=</span> <span class="n">x</span><span class="p">;</span>
<span class="n">x</span> <span class="o">=</span> <span class="n">f</span><span class="p">.</span><span class="n">x</span><span class="p">;</span>
<span class="c1">// Ensure we don't double free
</span> <span class="n">f</span><span class="p">.</span><span class="n">x</span> <span class="o">=</span> <span class="n">tmp</span><span class="p">;</span>
<span class="k">return</span> <span class="o">*</span><span class="k">this</span><span class="p">;</span>
<span class="p">}</span>
<span class="o">~</span><span class="n">foo</span><span class="p">()</span> <span class="p">{</span>
<span class="k">if</span> <span class="p">(</span><span class="n">x</span><span class="p">)</span> <span class="k">delete</span> <span class="n">x</span><span class="p">;</span>
<span class="p">}</span>
<span class="kt">int</span><span class="o">*</span> <span class="n">x</span> <span class="o">=</span> <span class="nb">nullptr</span><span class="p">;</span>
<span class="p">};</span>
<span class="kt">void</span> <span class="nf">moves</span><span class="p">(</span><span class="n">foo</span><span class="o">*</span> <span class="n">f</span><span class="p">,</span> <span class="kt">size_t</span> <span class="n">count</span><span class="p">,</span> <span class="n">foo</span><span class="o">*</span> <span class="n">dest</span><span class="p">)</span> <span class="p">{</span>
<span class="k">for</span> <span class="p">(</span><span class="kt">size_t</span> <span class="n">i</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span> <span class="n">i</span> <span class="o">!=</span> <span class="n">count</span><span class="p">;</span> <span class="o">++</span><span class="n">i</span><span class="p">)</span> <span class="p">{</span>
<span class="n">dest</span><span class="p">[</span><span class="n">i</span><span class="p">]</span> <span class="o">=</span> <span class="n">std</span><span class="o">::</span><span class="n">move</span><span class="p">(</span><span class="n">f</span><span class="p">[</span><span class="n">i</span><span class="p">]);</span>
<span class="p">}</span>
<span class="p">}</span>
</code></pre></div></div>
<p>If you compile this with <code class="highlighter-rouge">O2</code>, you’ll probably see what you expect:</p>
<pre><code class="language-asm">moves(foo*, unsigned long, foo*): # @moves(foo*, unsigned long, foo*)
test rsi, rsi
je .LBB0_6
mov r8d, esi
and r8d, 1
cmp rsi, 1
jne .LBB0_7
xor eax, eax
test r8, r8
jne .LBB0_4
jmp .LBB0_6
.LBB0_7:
sub rsi, r8
xor eax, eax
.LBB0_8: # =>This Inner Loop Header: Depth=1
cmp rdx, rdi
je .LBB0_10
mov r9, qword ptr [rdx + 8*rax]
mov rcx, qword ptr [rdi + 8*rax]
mov qword ptr [rdx + 8*rax], rcx
mov qword ptr [rdi + 8*rax], r9
.LBB0_10: # in Loop: Header=BB0_8 Depth=1
cmp rdx, rdi
je .LBB0_12
mov r9, qword ptr [rdx + 8*rax + 8]
mov rcx, qword ptr [rdi + 8*rax + 8]
mov qword ptr [rdx + 8*rax + 8], rcx
mov qword ptr [rdi + 8*rax + 8], r9
.LBB0_12: # in Loop: Header=BB0_8 Depth=1
add rax, 2
cmp rsi, rax
jne .LBB0_8
test r8, r8
je .LBB0_6
.LBB0_4:
cmp rdx, rdi
je .LBB0_6
mov rcx, qword ptr [rdx + 8*rax]
mov rsi, qword ptr [rdi + 8*rax]
mov qword ptr [rdx + 8*rax], rsi
mov qword ptr [rdi + 8*rax], rcx
.LBB0_6:
ret
</code></pre>
<p>This looks pretty bad to my eye for such a simple snippet of code. Yes, the loop is unrolled and all, but it’s forced to act
element by element and perform a lot of operations that aren’t necessary if we’re just relocating an object from place to place.
Of course, for <em>safety</em> purposes, this is how it must be written. The code in the <code class="highlighter-rouge">moves</code> function mirrors what you might expect when a <code class="highlighter-rouge">std::vector</code> resizes;
that is, new memory is allocated, and each element is moved to the other side before its destructor is ultimately invoked. Here’s a bit
of equivalent code (with <code class="highlighter-rouge">foo</code> renamed to <code class="highlighter-rouge">bar</code>) if we <em>only care about supporting relocation</em>.</p>
<div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">struct</span> <span class="n">bar</span> <span class="p">{</span>
<span class="n">bar</span><span class="p">()</span> <span class="o">=</span> <span class="k">default</span><span class="p">;</span>
<span class="o">~</span><span class="n">bar</span><span class="p">()</span> <span class="p">{</span>
<span class="k">if</span> <span class="p">(</span><span class="n">x</span><span class="p">)</span> <span class="k">delete</span> <span class="n">x</span><span class="p">;</span>
<span class="p">}</span>
<span class="kt">int</span><span class="o">*</span> <span class="n">x</span> <span class="o">=</span> <span class="nb">nullptr</span><span class="p">;</span>
<span class="p">};</span>
<span class="kt">void</span> <span class="nf">copies</span><span class="p">(</span><span class="n">bar</span><span class="o">*</span> <span class="n">b</span><span class="p">,</span> <span class="kt">size_t</span> <span class="n">count</span><span class="p">,</span> <span class="n">bar</span><span class="o">*</span> <span class="n">dest</span><span class="p">)</span> <span class="p">{</span>
<span class="n">std</span><span class="o">::</span><span class="n">memcpy</span><span class="p">(</span><span class="n">dest</span><span class="p">,</span> <span class="n">b</span><span class="p">,</span> <span class="k">sizeof</span><span class="p">(</span><span class="n">bar</span><span class="p">)</span> <span class="o">*</span> <span class="n">count</span><span class="p">);</span>
<span class="p">}</span>
</code></pre></div></div>
<p>In contrast to the first example, this compiles to just a call to <code class="highlighter-rouge">memcpy</code> (we’ll touch on <code class="highlighter-rouge">memcpy</code> soon). What happened to the original memory we copied from? Note that we aren’t invoking any destructors, and semantically, for this code anyway, <em>nothing breaks</em>, because those
destructors are guaranteed to be noops for the moved-from elements of this struct. For completeness, here’s the assembly.</p>
<pre><code class="language-asm">copies(bar*, unsigned long, bar*): # @copies(bar*, unsigned long, bar*)
mov rax, rdi
lea rcx, [8*rsi]
mov rdi, rdx
mov rsi, rax
mov rdx, rcx
jmp memcpy # TAILCALL
</code></pre>
<p>The important intuition here is that <strong>if your move turns the destructor into a noop, you’re paying for instructions you do not need</strong>.
This is something that’s bothered me since C++11 came out, but has only recently gotten more attention with the proposal regarding
<a href="http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2019/p1144r2.html">trivial relocation</a>.</p>
<h2 id="lets-talk-about-memcpy">Let’s talk about <em>memcpy</em></h2>
<p>The point of this post isn’t to say that <code class="highlighter-rouge">memcpy</code> is necessarily the answer (although it often will be), but that there’s a swath of
optimization which C++ move semantics don’t allow you to express, and that moves impose this limitation at the cost of
additional code (and a new class of bugs). If a <code class="highlighter-rouge">memcpy</code> is possible though, we can also consider further optimizations (loop-unrolled
SIMD copy, if not already provided). Also, we can make use of non-temporal moves to avoid cache pollution. Suffice it to say, we
skip two function calls (to a move constructor/assignment and a destructor) and can vectorize the operation.</p>
<h2 id="what-to-do-about-it">What to do about it?</h2>
<p>What I’ve done is avoid using move constructors entirely. None of my classes support moves, and they only support a copy if the copy
is trivial or in the <em>very rare</em> case that the destructor is well-defined on copy. Of course, this rules out the bulk of the STL
containers, which is why this approach forces you to avoid them. Provided your containers have iterators that are API compatible,
you can still use other parts of the STL without issue (e.g. <code class="highlighter-rouge"><algorithm></code>).</p>
<p>Here’s a simple example of a mostly drop-in vector replacement I’ve been using from one of my own codebases. The main differences are that
it also implements the small-buffer optimization and its growth factor interpolates from 2 initially to 1.5 as it grows. API-wise, it
avoids the initializer-list shenanigans and doesn’t support <code class="highlighter-rouge">push_back</code> (I’ve never needed it since copyable things can accept an object
via <code class="highlighter-rouge">emplace_back</code> and use the copy constructor just the same).</p>
<div class="language-cpp highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="cp">#pragma once
</span>
<span class="cp">#include <cstdlib>
#include <cstring>
</span>
<span class="k">namespace</span> <span class="n">alloy</span>
<span class="p">{</span>
<span class="c1">// SBS: Small Buffer Size
</span><span class="k">template</span> <span class="o"><</span><span class="k">typename</span> <span class="n">T</span><span class="p">,</span> <span class="kt">size_t</span> <span class="n">SBS</span> <span class="o">=</span> <span class="mi">8</span><span class="o">></span> <span class="k">class</span> <span class="nc">vector</span>
<span class="p">{</span>
<span class="k">public</span><span class="o">:</span>
<span class="n">vector</span><span class="p">()</span>
<span class="p">{</span>
<span class="k">static_assert</span><span class="p">(</span><span class="n">SBS</span> <span class="o">></span> <span class="mi">1</span><span class="p">,</span> <span class="s">"Small buffer size must be at least 1"</span><span class="p">);</span>
<span class="n">data_</span> <span class="o">=</span> <span class="n">sb_</span><span class="p">;</span>
<span class="p">}</span>
<span class="o">~</span><span class="n">vector</span><span class="p">()</span>
<span class="p">{</span>
<span class="n">clear</span><span class="p">();</span>
<span class="k">if</span> <span class="p">(</span><span class="n">capacity_</span> <span class="o">></span> <span class="n">SBS</span><span class="p">)</span> <span class="p">{</span>
<span class="n">std</span><span class="o">::</span><span class="n">free</span><span class="p">(</span><span class="n">data_</span><span class="p">);</span>
<span class="p">}</span>
<span class="p">}</span>
<span class="p">[[</span><span class="n">nodiscard</span><span class="p">]]</span> <span class="k">constexpr</span> <span class="n">T</span><span class="o">*</span> <span class="n">begin</span><span class="p">()</span> <span class="k">noexcept</span>
<span class="p">{</span>
<span class="k">return</span> <span class="n">data_</span><span class="p">;</span>
<span class="p">}</span>
<span class="p">[[</span><span class="n">nodiscard</span><span class="p">]]</span> <span class="k">constexpr</span> <span class="n">T</span><span class="o">*</span> <span class="n">end</span><span class="p">()</span> <span class="k">noexcept</span>
<span class="p">{</span>
<span class="k">return</span> <span class="n">data_</span> <span class="o">+</span> <span class="n">size_</span><span class="p">;</span>
<span class="p">}</span>
<span class="p">[[</span><span class="n">nodiscard</span><span class="p">]]</span> <span class="k">constexpr</span> <span class="k">const</span> <span class="n">T</span><span class="o">*</span> <span class="n">cbegin</span><span class="p">()</span> <span class="k">const</span> <span class="k">noexcept</span>
<span class="p">{</span>
<span class="k">return</span> <span class="n">data_</span><span class="p">;</span>
<span class="p">}</span>
<span class="p">[[</span><span class="n">nodiscard</span><span class="p">]]</span> <span class="k">constexpr</span> <span class="k">const</span> <span class="n">T</span><span class="o">*</span> <span class="n">cend</span><span class="p">()</span> <span class="k">noexcept</span>
<span class="p">{</span>
<span class="k">return</span> <span class="n">data_</span> <span class="o">+</span> <span class="n">size_</span><span class="p">;</span>
<span class="p">}</span>
<span class="p">[[</span><span class="n">nodiscard</span><span class="p">]]</span> <span class="k">constexpr</span> <span class="n">T</span><span class="o">*</span> <span class="n">rbegin</span><span class="p">()</span> <span class="k">noexcept</span>
<span class="p">{</span>
<span class="k">return</span> <span class="n">end</span><span class="p">()</span> <span class="o">-</span> <span class="mi">1</span><span class="p">;</span>
<span class="p">}</span>
<span class="p">[[</span><span class="n">nodiscard</span><span class="p">]]</span> <span class="k">constexpr</span> <span class="n">T</span><span class="o">*</span> <span class="n">rend</span><span class="p">()</span> <span class="k">noexcept</span>
<span class="p">{</span>
<span class="k">return</span> <span class="n">begin</span><span class="p">()</span> <span class="o">-</span> <span class="mi">1</span><span class="p">;</span>
<span class="p">}</span>
<span class="p">[[</span><span class="n">nodiscard</span><span class="p">]]</span> <span class="k">constexpr</span> <span class="k">const</span> <span class="n">T</span><span class="o">*</span> <span class="n">crbegin</span><span class="p">()</span> <span class="k">const</span> <span class="k">noexcept</span>
<span class="p">{</span>
<span class="k">return</span> <span class="n">cend</span><span class="p">()</span> <span class="o">-</span> <span class="mi">1</span><span class="p">;</span>
<span class="p">}</span>
<span class="p">[[</span><span class="n">nodiscard</span><span class="p">]]</span> <span class="k">constexpr</span> <span class="k">const</span> <span class="n">T</span><span class="o">*</span> <span class="n">crend</span><span class="p">()</span> <span class="k">const</span> <span class="k">noexcept</span>
<span class="p">{</span>
<span class="k">return</span> <span class="n">cbegin</span><span class="p">()</span> <span class="o">-</span> <span class="mi">1</span><span class="p">;</span>
<span class="p">}</span>
<span class="p">[[</span><span class="n">nodiscard</span><span class="p">]]</span> <span class="k">constexpr</span> <span class="kt">bool</span> <span class="n">empty</span><span class="p">()</span> <span class="k">const</span> <span class="k">noexcept</span>
<span class="p">{</span>
<span class="k">return</span> <span class="n">size_</span> <span class="o">==</span> <span class="mi">0</span><span class="p">;</span>
<span class="p">}</span>
<span class="p">[[</span><span class="n">nodiscard</span><span class="p">]]</span> <span class="k">constexpr</span> <span class="kt">int</span> <span class="n">size</span><span class="p">()</span> <span class="k">const</span> <span class="k">noexcept</span>
<span class="p">{</span>
<span class="k">return</span> <span class="n">size_</span><span class="p">;</span>
<span class="p">}</span>
<span class="p">[[</span><span class="n">nodiscard</span><span class="p">]]</span> <span class="k">constexpr</span> <span class="kt">int</span> <span class="n">capacity</span><span class="p">()</span> <span class="k">const</span> <span class="k">noexcept</span>
<span class="p">{</span>
<span class="k">return</span> <span class="n">capacity_</span><span class="p">;</span>
<span class="p">}</span>
<span class="n">T</span><span class="o">&</span> <span class="n">front</span><span class="p">()</span>
<span class="p">{</span>
<span class="k">return</span> <span class="n">data_</span><span class="p">[</span><span class="mi">0</span><span class="p">];</span>
<span class="p">}</span>
<span class="k">const</span> <span class="n">T</span><span class="o">&</span> <span class="n">front</span><span class="p">()</span> <span class="k">const</span>
<span class="p">{</span>
<span class="k">return</span> <span class="n">data_</span><span class="p">[</span><span class="mi">0</span><span class="p">];</span>
<span class="p">}</span>
<span class="n">T</span><span class="o">&</span> <span class="n">back</span><span class="p">()</span>
<span class="p">{</span>
<span class="k">return</span> <span class="n">data_</span><span class="p">[</span><span class="n">size_</span> <span class="o">-</span> <span class="mi">1</span><span class="p">];</span>
<span class="p">}</span>
<span class="k">const</span> <span class="n">T</span><span class="o">&</span> <span class="n">back</span><span class="p">()</span> <span class="k">const</span>
<span class="p">{</span>
<span class="k">return</span> <span class="n">data_</span><span class="p">[</span><span class="n">size_</span> <span class="o">-</span> <span class="mi">1</span><span class="p">];</span>
<span class="p">}</span>
<span class="n">T</span><span class="o">&</span> <span class="k">operator</span><span class="p">[](</span><span class="kt">size_t</span> <span class="n">index</span><span class="p">)</span>
<span class="p">{</span>
<span class="k">return</span> <span class="n">data_</span><span class="p">[</span><span class="n">index</span><span class="p">];</span>
<span class="p">}</span>
<span class="k">const</span> <span class="n">T</span><span class="o">&</span> <span class="k">operator</span><span class="p">[](</span><span class="kt">size_t</span> <span class="n">index</span><span class="p">)</span> <span class="k">const</span>
<span class="p">{</span>
<span class="k">return</span> <span class="n">data_</span><span class="p">[</span><span class="n">index</span><span class="p">];</span>
<span class="p">}</span>
<span class="kt">void</span> <span class="n">reserve</span><span class="p">(</span><span class="kt">int</span> <span class="n">next_capacity</span><span class="p">)</span>
<span class="p">{</span>
<span class="k">if</span> <span class="p">(</span><span class="n">next_capacity</span> <span class="o"><=</span> <span class="n">capacity_</span><span class="p">)</span> <span class="p">{</span>
<span class="k">return</span><span class="p">;</span>
<span class="p">}</span>
<span class="kt">void</span><span class="o">*</span> <span class="n">next</span> <span class="o">=</span> <span class="n">std</span><span class="o">::</span><span class="n">malloc</span><span class="p">(</span><span class="n">next_capacity</span> <span class="o">*</span> <span class="k">sizeof</span><span class="p">(</span><span class="n">T</span><span class="p">));</span>
<span class="n">std</span><span class="o">::</span><span class="n">memcpy</span><span class="p">(</span><span class="n">next</span><span class="p">,</span> <span class="n">data_</span><span class="p">,</span> <span class="n">size_</span> <span class="o">*</span> <span class="k">sizeof</span><span class="p">(</span><span class="n">T</span><span class="p">));</span>
<span class="k">if</span> <span class="p">(</span><span class="n">capacity_</span> <span class="o">></span> <span class="n">SBS</span><span class="p">)</span> <span class="p">{</span>
<span class="n">std</span><span class="o">::</span><span class="n">free</span><span class="p">(</span><span class="n">data_</span><span class="p">);</span>
<span class="p">}</span>
<span class="n">data_</span> <span class="o">=</span> <span class="k">reinterpret_cast</span><span class="o"><</span><span class="n">T</span><span class="o">*></span><span class="p">(</span><span class="n">next</span><span class="p">);</span>
<span class="n">capacity_</span> <span class="o">=</span> <span class="n">next_capacity</span><span class="p">;</span>
<span class="p">}</span>
<span class="kt">void</span> <span class="n">clear</span><span class="p">()</span>
<span class="p">{</span>
<span class="k">for</span> <span class="p">(</span><span class="k">const</span> <span class="n">T</span><span class="o">&</span> <span class="n">elem</span> <span class="o">:</span> <span class="o">*</span><span class="k">this</span><span class="p">)</span> <span class="p">{</span>
<span class="n">elem</span><span class="p">.</span><span class="o">~</span><span class="n">T</span><span class="p">();</span>
<span class="p">}</span>
<span class="n">size_</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span>
<span class="p">}</span>
<span class="kt">void</span> <span class="n">resize</span><span class="p">(</span><span class="kt">int</span> <span class="n">next_size</span><span class="p">)</span>
<span class="p">{</span>
<span class="k">if</span> <span class="p">(</span><span class="n">next_size</span> <span class="o">></span> <span class="n">capacity_</span><span class="p">)</span> <span class="p">{</span>
<span class="n">reserve</span><span class="p">(</span><span class="n">next_size</span><span class="p">);</span>
<span class="p">}</span>
<span class="k">if</span> <span class="p">(</span><span class="n">next_size</span> <span class="o"><</span> <span class="n">size_</span><span class="p">)</span> <span class="p">{</span>
<span class="k">for</span> <span class="p">(</span><span class="kt">int</span> <span class="n">i</span> <span class="o">=</span> <span class="n">next_size</span><span class="p">;</span> <span class="n">i</span> <span class="o">!=</span> <span class="n">size_</span><span class="p">;</span> <span class="o">++</span><span class="n">i</span><span class="p">)</span> <span class="p">{</span>
<span class="n">data_</span><span class="p">[</span><span class="n">i</span><span class="p">].</span><span class="o">~</span><span class="n">T</span><span class="p">();</span>
<span class="p">}</span>
<span class="n">size_</span> <span class="o">=</span> <span class="n">next_size</span><span class="p">;</span>
<span class="p">}</span> <span class="k">else</span> <span class="p">{</span>
<span class="kt">int</span> <span class="n">cursor</span> <span class="o">=</span> <span class="n">size_</span><span class="p">;</span>
<span class="n">size_</span> <span class="o">=</span> <span class="n">next_size</span><span class="p">;</span>
<span class="k">for</span> <span class="p">(;</span> <span class="n">cursor</span> <span class="o">!=</span> <span class="n">size_</span><span class="p">;</span> <span class="o">++</span><span class="n">cursor</span><span class="p">)</span> <span class="p">{</span>
<span class="k">new</span> <span class="p">(</span><span class="n">data_</span> <span class="o">+</span> <span class="n">cursor</span><span class="p">)</span> <span class="n">T</span><span class="p">();</span>
<span class="p">}</span>
<span class="p">}</span>
<span class="p">}</span>
<span class="k">template</span> <span class="o"><</span><span class="k">typename</span><span class="p">...</span> <span class="n">Args</span><span class="o">></span> <span class="n">T</span><span class="o">&</span> <span class="n">emplace_back</span><span class="p">(</span><span class="n">Args</span><span class="o">&&</span><span class="p">...</span> <span class="n">args</span><span class="p">)</span>
<span class="p">{</span>
<span class="k">if</span> <span class="p">(</span><span class="n">size_</span> <span class="o">>=</span> <span class="n">capacity_</span><span class="p">)</span> <span class="p">{</span>
<span class="c1">// Interpolate the growth factor from 2 initially to 1.5
</span> <span class="k">auto</span> <span class="n">growth_factor</span> <span class="o">=</span> <span class="n">capacity_</span> <span class="o">></span> <span class="mi">1024</span>
<span class="o">?</span> <span class="mi">3</span> <span class="o">*</span> <span class="mi">1024</span>
<span class="o">:</span> <span class="mi">3</span> <span class="o">*</span> <span class="n">capacity_</span> <span class="o">+</span> <span class="mi">4</span> <span class="o">*</span> <span class="p">(</span><span class="mi">1024</span> <span class="o">-</span> <span class="n">capacity_</span><span class="p">);</span>
<span class="n">reserve</span><span class="p">(</span><span class="n">capacity_</span> <span class="o">*</span> <span class="n">growth_factor</span> <span class="o">/</span> <span class="mi">2048</span> <span class="o">+</span> <span class="mi">1</span><span class="p">);</span>
<span class="p">}</span>
<span class="k">new</span> <span class="p">(</span><span class="o">&</span><span class="n">data_</span><span class="p">[</span><span class="n">size_</span><span class="o">++</span><span class="p">])</span> <span class="n">T</span><span class="p">(</span><span class="n">std</span><span class="o">::</span><span class="n">forward</span><span class="o"><</span><span class="n">Args</span><span class="o">></span><span class="p">(</span><span class="n">args</span><span class="p">)...);</span>
<span class="k">return</span> <span class="n">back</span><span class="p">();</span>
<span class="p">}</span>
<span class="k">private</span><span class="o">:</span>
<span class="n">T</span> <span class="n">sb_</span><span class="p">[</span><span class="n">SBS</span><span class="p">];</span>
<span class="n">T</span><span class="o">*</span> <span class="n">data_</span> <span class="o">=</span> <span class="n">sb_</span><span class="p">;</span>
<span class="kt">int</span> <span class="n">size_</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span>
<span class="kt">int</span> <span class="n">capacity_</span> <span class="o">=</span> <span class="n">SBS</span><span class="p">;</span>
<span class="p">};</span>
<span class="p">}</span> <span class="c1">// namespace alloy
</span></code></pre></div></div>
<p>Of course, the same approach can apply to many other container types as well. Empirically, I have
not observed lifetime bugs related to them and my code is a hell of a lot cleaner and less verbose.
Part of me sometimes feels like this is the way C++ was meant to be written, and there’s a feeling
of liberation to just write a constructor, destructor, and leave it at that.</p>
Tue, 12 Mar 2019 00:00:00 +0000
https://jeremyong.com/c++17/metaprogramming/2019/03/12/optimizing-cpp-by-avoiding-moves/
Vulkan Synchronization Primer - Part II<p>This is part II (and possibly the final part) on this series titled the <em>Vulkan Synchronization Primer</em>. For the first
part, click <a href="/c++/graphics/gpu/vulkan/2018/11/22/vulkan-synchronization-primer.html">here</a>.</p>
<p>In the last part, we introduced the Flurble Factory and the difficulties encountered when trying to compute fine-grained
dependencies asynchronously. We noted that memory barriers are configured across two orthogonal dimensions: pipeline
masks, and access masks. In other words, we could specify that <em>prior</em> to certain types of memory access from certain
stages, we require that certain types of memory writes from certain stages were made visible or flushed.</p>
<p>This part expands on the last part and also introduces a few new concepts into the mix, occasionally referring again
to the Flurble Factory (if you skipped to this part, you can get the gist from the first couple paragraphs of my
last post linked above).</p>
<h2 id="memory-vs-execution-barrier">Memory vs Execution Barrier</h2>
<p>I glossed over this point a bit in the last part, and before continuing, I really need to address it, as the distinction
between an execution barrier and a memory barrier is an important concept. An execution barrier specifies that
instructions submitted prior to the execution barrier occur before instructions submitted after. To newer programmers,
it might be a surprise that such a concept is even important. To illustrate things, take a look at this snippet of
C++ code:</p>
<div class="language-c++ highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">int</span> <span class="nf">reordering_happens</span><span class="p">(</span><span class="kt">int</span> <span class="n">num</span><span class="p">)</span> <span class="p">{</span>
<span class="kt">int</span> <span class="n">numx10</span> <span class="o">=</span> <span class="mi">10</span> <span class="o">*</span> <span class="n">num</span><span class="p">;</span>
<span class="k">if</span> <span class="p">(</span><span class="n">num</span> <span class="o"><</span> <span class="mi">0</span><span class="p">)</span>
<span class="p">{</span>
<span class="k">return</span> <span class="n">num</span><span class="p">;</span>
<span class="p">}</span>
<span class="k">return</span> <span class="n">numx10</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>
<p>Here, we have a simple function which appears to initialize <code class="highlighter-rouge">numx10</code> to an integer <code class="highlighter-rouge">10 * num</code>, check if <code class="highlighter-rouge">num</code> is negative, and
return either <code class="highlighter-rouge">num</code> or <code class="highlighter-rouge">numx10</code> depending on the result. Here’s the assembly generated (from godbolt.org gcc 8.2 with -std=c++17 and -O2):</p>
<pre><code class="language-asm">reordering_happens(int):
mov eax, edi
test edi, edi
js .L1
lea eax, [rdi+rdi*4]
add eax, eax
.L1:
ret
</code></pre>
<p>If you don’t grok assembly, the easy way to read this snippet is:</p>
<ol>
<li>First, move the integer argument in register <code class="highlighter-rouge">edi</code> to the return register <code class="highlighter-rouge">eax</code></li>
<li>Next, <code class="highlighter-rouge">test</code> <code class="highlighter-rouge">edi</code> against itself (<code class="highlighter-rouge">test</code> does a bitwise and between the two operands) and set some CPU flags, including the sign flag, to encode the result</li>
<li>Jump to <code class="highlighter-rouge">.L1</code> if negative (by checking the sign flag <code class="highlighter-rouge">SF</code>), effectively just returning the original argument if true</li>
<li>If the argument in <code class="highlighter-rouge">edi</code> (aka <code class="highlighter-rouge">num</code>) wasn’t negative, compute <code class="highlighter-rouge">(num * 4 + num)</code> into <code class="highlighter-rouge">eax</code> with <code class="highlighter-rouge">lea</code> and then add the result to itself (assembly shorthand for multiplying by ten)</li>
</ol>
<p>In this case, the variable <code class="highlighter-rouge">numx10</code> isn’t used at all (it’s been elided), but the point is that the compiler was free to
move the operation of multiplying <code class="highlighter-rouge">num</code> by 10 behind the branch. In general, the compiler has a good amount of freedom
to move loads and stores of variables around to improve performance. If however, we were to change the type of <code class="highlighter-rouge">numx10</code> from
an <code class="highlighter-rouge">int</code> to a <code class="highlighter-rouge">std::atomic&lt;int&gt;</code>, the compiler would not be able to make this optimization, and the generated code becomes the
following (same flags, -std=c++17 and -O2):</p>
<pre><code class="language-asm">reordering_happens(int):
lea eax, [rdi+rdi*4]
add eax, eax
mov DWORD PTR [rsp-4], eax
mfence
test edi, edi
js .L1
mov edi, DWORD PTR [rsp-4]
.L1:
mov eax, edi
ret
</code></pre>
<p>Here, we see that with the usage of an atomic variable (which imposes a memory dependency between stores before it and loads after it),
the compiler can no longer move the multiply-by-10 behind the branch. Note that if you try this in the compiler explorer yourself,
you won’t see the <code class="highlighter-rouge">mfence</code> instruction unless you pass the atomic by reference to the function. Even if you use a local
atomic though, you should still see the multiply occur before the branch.</p>
<p>The reason this is relevant is that just as the compiler can reorder code with restrictions on the CPU, the GPU dispatcher
also has a good amount of flexibility. If I submit a draw command and a compute command in the same command buffer, there
is nothing there that informs the driver one must happen before the other or vice versa. If I needed to enforce an ordering
between them, I would need an execution barrier. To sequence the compute command after the draw command, for example, we
could invoke <code class="highlighter-rouge">vkCmdPipelineBarrier</code> with the source stage mask set to <code class="highlighter-rouge">VK_PIPELINE_STAGE_ALL_GRAPHICS_BIT</code> and the
destination stage mask set to <code class="highlighter-rouge">VK_PIPELINE_STAGE_COMPUTE_SHADER_BIT</code> and no memory barriers at all.</p>
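<p>Concretely, such an execution-only barrier might be recorded like this (a sketch; <code class="highlighter-rouge">cmd</code> is assumed to be a command buffer in the recording state):</p>

```cpp
// Execution barrier only: all graphics stages before the barrier must
// complete before any compute shader work after it may begin. No memory
// barriers are attached.
vkCmdPipelineBarrier(
    cmd,
    VK_PIPELINE_STAGE_ALL_GRAPHICS_BIT,   // srcStageMask
    VK_PIPELINE_STAGE_COMPUTE_SHADER_BIT, // dstStageMask
    0,                                    // dependencyFlags
    0, nullptr,                           // no global memory barriers
    0, nullptr,                           // no buffer memory barriers
    0, nullptr);                          // no image memory barriers
```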
<h2 id="wait-why-do-we-need-memory-barriers-again">Wait, why do we need memory barriers again?</h2>
<p>If we can sequence commands one after another in the execution pipeline, why do we need these complicated memory barriers
at all? The answer is that memory writes are not “instantaneous” in the sense that they don’t immediately become visible
everywhere. In general, memory is hierarchical in nature, and to keep things fast, caches of the various memory types are
not flushed every time a write happens. On the CPU, for example, if one thread writes to a variable on a core, the write won't
(without additional barriers/fences) immediately be visible on a different core (possibly spanning millimeters of distance
in silicon!). Memory barriers are thus a necessary abstraction to ensure reads are consistent <em>globally</em> in the context
of a CPU, and in the context of the GPU, we specify which stages of the pipeline need to provide visibility where.</p>
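<p>On the CPU side, this two-part contract (the writer flushes, the reader is guaranteed visibility) is exactly what release/acquire atomics express. A minimal sketch (the helper name is mine, not a standard API):</p>

```cpp
#include <atomic>
#include <thread>

// A release store "publishes" a plain write made before it; an acquire load
// on another thread that observes the flag guarantees that write is visible.
int publish_and_read()
{
    int payload = 0;
    std::atomic<bool> ready{false};

    std::thread producer([&] {
        payload = 42;                                 // plain write
        ready.store(true, std::memory_order_release); // publish it
    });
    std::thread consumer([&] {
        // Spin until the flag is visible; acquire pairs with the release
        while (!ready.load(std::memory_order_acquire)) {}
    });

    producer.join();
    consumer.join();
    return payload; // guaranteed to be 42 once the consumer observed `ready`
}
```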
<h2 id="wait-why-do-we-need-execution-barriers-again">Wait, why do we need execution barriers again?</h2>
<p>:) So tying it all together, let’s remind ourselves once and for all, why we need <em>both</em> execution and memory barriers.
The only insight you need here really is that <em>when</em> the memory barrier happens is just as important as what the contents
of the barrier are. The barrier is, after all, just another command that gets passed down the queue. A barrier submitted
too late won’t be very useful at all, since everything that may have tripped a memory hazard has already been dispatched
into the void. If the memory barrier is submitted <em>without</em> an execution barrier, there would be a chance that a command
submitted earlier slips past the memory barrier, or vice versa. Thus, we need <em>both</em> the memory and the execution barrier
to be submitted together for everything to operate coherently. Fortunately, this is reflected exactly in the API; to wit,
memory barriers are passed as optional arguments to the pipeline barrier (which acts as the execution barrier).</p>
<p>Note that the API allows us to submit a pipeline (i.e. execution) barrier <em>without</em> any memory barriers supplied. There
is at least one common use case where this is useful. If I submit a command that simply reads a resource that must later
be modified, this is known as a Write-After-Read (aka WAR) hazard. For example, suppose I have a compute job that downsamples
the framebuffer to draw a bloom effect later. I also submit a render pass containing UI that will write into the same
framebuffer, blending on top of it. This is an example of the WAR pattern because the downsampling doesn’t modify the
framebuffer directly (it will likely output to other render targets), but the UI pass does. Here, no memory barrier is
needed between the compute job and the UI pass (indeed, there are no prior writes to make visible), but we do need
to ensure that the UI pass does not start until the bloom job is done. This can be done by executing the pipeline barrier
with 0 memory barriers attached. It should not be too difficult to come up with other examples where such a technique
is useful.</p>
<h2 id="render-passes-and-subpasses">Render Passes and Subpasses</h2>
<p>“Wait,” you might interject, “what about render passes?” Yup, getting to that. Let’s consider a simple sequence of draws
that draw 3 objects on the screen. Each of these objects are transparent and happen to overlap one another. Thus, we need
to employ a back-to-front drawing algorithm (i.e. the Painter’s algorithm) with blending enabled in order to draw the desired result.
At this point, hopefully, your spidey-senses are tingling and you should wonder, “how are these draw calls synchronized?”
That is to say, if you just submit them one after another, how are you guaranteed that the first draw will complete before
the second, etc? This is an excellent question. Based on everything we’ve learned so far, the only way we know how to do
this is to inject a memory barrier between each draw call, ensuring that we wait for color attachment writes from the blend
pipeline stage to complete each time.</p>
<p>Obviously, this would be a huge pain to do for every draw call, and this is where a render pass comes in. All draw calls
are issued inside render passes which provide a set of implicit synchronization guarantees. If you heard other reasons
for render passes, my guess is that they are related somehow to these guarantees, but providing synchronization between
draw calls over a shared set of render targets is the <em>primary reason render passes exist</em>. Render passes give you the
following:</p>
<ol>
<li>A way to describe load and store operations at the start and end of each subpass (which normally would require an image
memory barrier)</li>
<li>A way of describing barriers between subpass dependencies</li>
<li>An implicit guarantee for draw calls within a subpass to be executed possibly concurrently, but not in a way that
violates the pipeline order</li>
</ol>
<p>Unpacking this a bit, at the start of every subpass and render pass, there is an image barrier (much like the ones you
saw in part I) that ensures that all relevant memory that has been modified prior to the start of the pass has been
made visible. Once inside the pass, draws can be invoked and scheduled by the GPU without needing overly gratuitous
barriers because the sequencing is guaranteed not to violate pipeline order. That is to say, if we start fragment shading
for object <script type="math/tex">A</script>, we can start vertex processing for object <script type="math/tex">B</script>, but we can’t start fragment shading for <script type="math/tex">B</script> until
the fragment shading for <script type="math/tex">A</script> has finished (at least, within the same render pass). That way, we know that results
within a render pass (or subpass) will be correct, assuming we set up our draws correctly. Finally, the pass executes
the store command on any of its render targets as necessary, discarding temporary buffers/memory, and readying the
framebuffers up for the next render pass.</p>
<p>If you squint slightly at the definition of a <a href="https://www.khronos.org/registry/vulkan/specs/1.1-extensions/man/html/VkSubpassDependency.html"><code class="highlighter-rouge">VkSubpassDependency</code></a>, you should see that
it mirrors the memory barrier very closely. In fact, subpass dependencies and memory barriers are one and the same,
except subpass dependencies provide a nicer syntactic sugar for most common operations you might encounter in applications
that use multiple render targets (e.g. deferred shading).</p>
<p>The last remaining point on render passes worth touching on is the set of dependency flag bits which can be supplied to a
<code class="highlighter-rouge">VkSubpassDependency</code>. See <a href="https://www.khronos.org/registry/vulkan/specs/1.1-extensions/man/html/VkDependencyFlagBits.html">here</a>
for the official scoop. The <code class="highlighter-rouge">VK_DEPENDENCY_BY_REGION_BIT</code> flag is the primary one I want to touch on (although you are
encouraged to read about the others), as this one comes up the most frequently. Essentially, when issuing draw
calls, we generally don’t want to violate pipeline order specifically in the case when the later draw would affect the
results of the earlier draw (for example in blending). Astute readers would be correct in thinking that a full barrier
per pipeline stage of the <em>entire</em> framebuffer is a bit heavy-handed. If I draw a teapot in the top left corner of the
screen after all, surely that shouldn’t interfere with the teapot draw in the bottom right corner. Well, if you thought
this on your own, you should give yourself a pat on the back for paying attention. Indeed, this sort of optimization is
highly relevant in modern GPU architectures that do tiled-rendering. Without getting too far into the weeds, there is a class
of hardware that rasterizes triangles in two passes. The first pass bins triangles into small screen-space tile bins.
The second pass executes the full graphics pipeline on a tile-by-tile basis for each of the triangles binned to a given
tile. Without the <code class="highlighter-rouge">VK_DEPENDENCY_BY_REGION_BIT</code> set, intermediate render-targets must be fully synchronized and flushed
for each draw, regardless of the tiles they affect. Thankfully, with the bit set, we can inform the driver that our
interaction for a given framebuffer dependency is localized entirely based on where the fragment is being drawn. This
allows (as you might imagine), all sorts of optimizations to come about, including reducing memory requirements for
transient framebuffer storage. Of course, don’t set this flag if your usage of the framebuffer dependency is not, in fact,
localized. For example, if you sample the framebuffer inside your fragment shader to do screen space reflections, you
could be accessing arbitrary pixels of your framebuffer regardless of where you are rendering.</p>
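<p>To make the mirroring concrete, here is a sketch of a subpass dependency as it might appear in a deferred-shading setup (the subpass indices and masks are illustrative):</p>

```cpp
// Fragment writes to the G-buffer in subpass 0 must be visible to input
// attachment reads in subpass 1, and only per-region (per-tile) ordering
// is required.
VkSubpassDependency dep{};
dep.srcSubpass      = 0;
dep.dstSubpass      = 1;
dep.srcStageMask    = VK_PIPELINE_STAGE_COLOR_ATTACHMENT_OUTPUT_BIT;
dep.dstStageMask    = VK_PIPELINE_STAGE_FRAGMENT_SHADER_BIT;
dep.srcAccessMask   = VK_ACCESS_COLOR_ATTACHMENT_WRITE_BIT;
dep.dstAccessMask   = VK_ACCESS_INPUT_ATTACHMENT_READ_BIT;
dep.dependencyFlags = VK_DEPENDENCY_BY_REGION_BIT;
```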
<h2 id="gpu-to-cpu-synchronization">GPU-to-CPU synchronization</h2>
<p>Moving forward from here, you should be comfortable reasoning about the various dependencies we can describe to the GPU
that exist between all the various commands that can execute on the GPU. We can describe buffer dependencies, image
dependencies, and framebuffer dependencies (which really are just a special case of memory). We also learned about the
various implicit guarantees and simpler (but similar) API we get with render passes to save a ton of boilerplate.
From this point on then, the remaining concepts (semaphores, fences, and events) should hopefully be straightforward to explain.</p>
<p>The first observation to make to dive into the concepts is that pipeline barriers and memory barriers are submitted
asynchronously (just like any other command we submit to the GPU). This means, for example, that if you record a series
of commands to a buffer, submit the buffer, and then submit a pipeline barrier as well, you aren’t actually free to
<code class="highlighter-rouge">free</code> (heh) that buffer just yet. After all, just because you submitted the pipeline barrier, there’s no guarantee
that the barrier itself has made its way through the GPU’s command queue. The pipeline barrier is <em>only</em> useful for
sequencing actions that are submitted on the same queue, asynchronously to the GPU.</p>
<p>The easiest (and most heavy-handed) way to solve this problem is by using a wait command to wait until the queue you
submitted the commands to is idle, and you will often see this in tutorial code as it’s a bit easier to wrap your
head around. Unfortunately, this approach does not scale well in the real world; waiting for a queue to be idle
means you cannot submit more work to that queue in the meantime (lest you wind up waiting forever).</p>
<p>The better way of dealing with GPU-to-CPU synchronization is with a fence. Fences can be submitted along with a command
buffer as part of the <code class="highlighter-rouge">vkQueueSubmit</code> function, and this fence will later be signaled by the <em>GPU</em> in a way that is
visible and waitable on the CPU via <code class="highlighter-rouge">vkWaitForFences</code>.</p>
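<p>In code, the flow looks roughly like this (a sketch; <code class="highlighter-rouge">device</code>, <code class="highlighter-rouge">queue</code>, and <code class="highlighter-rouge">cmd</code> are assumed to exist already):</p>

```cpp
VkFenceCreateInfo fence_info{VK_STRUCTURE_TYPE_FENCE_CREATE_INFO};
VkFence fence;
vkCreateFence(device, &fence_info, nullptr, &fence);

// The fence is signaled when all work in this submission completes.
VkSubmitInfo submit{VK_STRUCTURE_TYPE_SUBMIT_INFO};
submit.commandBufferCount = 1;
submit.pCommandBuffers    = &cmd;
vkQueueSubmit(queue, 1, &submit, fence);

// Later, on the CPU: block until the GPU is done (timeout in nanoseconds).
vkWaitForFences(device, 1, &fence, VK_TRUE, UINT64_MAX);
// Now it is safe to free or reuse resources referenced by `cmd`.
```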
<h2 id="gpu-to-gpu-synchronization">GPU-to-GPU synchronization</h2>
<p>We covered this briefly in part I, specifically when we needed to address describing a memory dependency across different
queues. However, sometimes, we need synchronization between different queues on the GPU for something that is not
described easily as a memory dependency. The most common example of this is managing the rendering of a frame, and its
presentation on a different queue (if the graphics and present queues wind up being different). The noun used to describe
this is a <code class="highlighter-rouge">VkSemaphore</code> and it works very similarly to the <code class="highlighter-rouge">VkFence</code>. It is submitted to the queue as part of the
<code class="highlighter-rouge">VkSubmitInfo</code> struct, but instead of waiting for it on the CPU, it is waited on by a different queue (specified in a
different part of the <code class="highlighter-rouge">VkSubmitInfo</code> struct).</p>
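<p>For example, a rendering submission typically waits on and signals semaphores like so (a sketch; the semaphore and queue variables are illustrative):</p>

```cpp
// Wait for the swapchain image (signaled by vkAcquireNextImageKHR) before
// color attachment output, and signal render_done for the present queue.
VkPipelineStageFlags wait_stage = VK_PIPELINE_STAGE_COLOR_ATTACHMENT_OUTPUT_BIT;
VkSubmitInfo submit{VK_STRUCTURE_TYPE_SUBMIT_INFO};
submit.waitSemaphoreCount   = 1;
submit.pWaitSemaphores      = &image_acquired;
submit.pWaitDstStageMask    = &wait_stage;
submit.commandBufferCount   = 1;
submit.pCommandBuffers      = &cmd;
submit.signalSemaphoreCount = 1;
submit.pSignalSemaphores    = &render_done; // waited on by vkQueuePresentKHR
vkQueueSubmit(graphics_queue, 1, &submit, VK_NULL_HANDLE);
```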
<h2 id="events">Events</h2>
<p>Vulkan events resemble the “split barrier” which is also available in D3D12. Imagine if you could split the pipeline
barrier into two parts. The first part emits a signal, and the second part waits for the signal with the appropriate
masks to indicate which pipeline stages need to flush what memory. This can be more performant than a normal barrier
in the event that the barrier includes too much. For example, imagine you had 3 commands you wanted to submit, <code class="highlighter-rouge">A</code>, <code class="highlighter-rouge">B</code>,
and <code class="highlighter-rouge">C</code>, and <code class="highlighter-rouge">C</code> is dependent on <code class="highlighter-rouge">A</code>. Suppose further that <code class="highlighter-rouge">A</code> and <code class="highlighter-rouge">B</code> were recorded on a separate thread into a
secondary command buffer. How should you synchronize <code class="highlighter-rouge">A</code> and <code class="highlighter-rouge">C</code>? One thing you could do is insert a pipeline barrier
between <code class="highlighter-rouge">B</code> and <code class="highlighter-rouge">C</code> to flush the memory written by <code class="highlighter-rouge">A</code>. However, if <code class="highlighter-rouge">B</code> also writes memory in this pipeline stage,
you’ll end up waiting longer to flush more than you might have otherwise. You like the fact that you could record
<code class="highlighter-rouge">A</code>/<code class="highlighter-rouge">B</code> and <code class="highlighter-rouge">C</code> independently on two separate threads on the host CPU, but you don’t like that the barrier between
<code class="highlighter-rouge">B</code> and <code class="highlighter-rouge">C</code> synchronizes too much. Alternatively, putting a barrier between <code class="highlighter-rouge">A</code> and <code class="highlighter-rouge">B</code> may prevent parallelism between
<code class="highlighter-rouge">A</code> and <code class="highlighter-rouge">B</code> since they don’t technically have any hard dependency between them. The solution is the <code class="highlighter-rouge">VkEvent</code> which
is more granular. The thread recording <code class="highlighter-rouge">A</code> and <code class="highlighter-rouge">B</code> can insert an event signal using <code class="highlighter-rouge">vkCmdSetEvent</code> between <code class="highlighter-rouge">A</code> and
<code class="highlighter-rouge">B</code>, while the thread recording <code class="highlighter-rouge">C</code> can insert a <code class="highlighter-rouge">vkCmdWaitEvents</code> just before <code class="highlighter-rouge">C</code>, solving the problem as stated.</p>
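<p>As a sketch (hypothetical names throughout: <code class="highlighter-rouge">cmd_ab</code> and <code class="highlighter-rouge">cmd_c</code> are the two command buffers, <code class="highlighter-rouge">event</code> is an already-created <code class="highlighter-rouge">vk::Event</code>, and for concreteness we assume <code class="highlighter-rouge">A</code> performs a transfer write that <code class="highlighter-rouge">C</code> later reads as vertex attributes), the two threads might record:</p>

```cpp
// Thread 1: record A, signal the event, then record B.
record_a(cmd_ab);  // hypothetical helper that records command A
cmd_ab.setEvent(event, vk::PipelineStageFlagBits::eTransfer);
record_b(cmd_ab);  // B proceeds freely; it never waits on the event

// Thread 2: wait on the event just before recording C.
vk::MemoryBarrier barrier{
    vk::AccessFlagBits::eTransferWrite,      // what A wrote
    vk::AccessFlagBits::eVertexAttributeRead // what C reads
};
cmd_c.waitEvents(
    event,
    vk::PipelineStageFlagBits::eTransfer,    // src stage mask
    vk::PipelineStageFlagBits::eVertexInput, // dst stage mask
    barrier, nullptr, nullptr);
record_c(cmd_c);
```

<p>Note that events only order work within a single queue; a cross-queue dependency still requires a semaphore.</p>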
<p>Events are very powerful, but I strongly encourage leveraging the render pass and subpass abstractions first,
as these map well to hardware. In addition, fine grained events can occasionally perform worse than a single barrier,
so consider a simpler architecture at first. Remember that if you guard every resource with barriers and fences out
the wazoo, you’re slinking back to pre-OpenGL performance and will likely not get much of what Vulkan can offer.</p>
<h2 id="conclusion">Conclusion</h2>
<p>This concludes part II and likely the final part for now on Vulkan synchronization. It is not a <em>hard</em> topic, in my
opinion, but it is certainly opaque and broad. Unlike the CPU, you have to contend with asynchronous dispatch, multiple
programmable pipeline stages (with a distinct order), multiple dispatch queues, and multiple types of memory writes.
As a result, the surface area for describing synchronization dependencies is vast and the API is relatively large
also. It has been my experience though that once you get the <em>gist</em> of what the problem is and why the API needs to be
what it is, the documentation and usage of the API becomes much more approachable. As always, if you have any feedback
or corrections, feel free to email or message me using the links below.</p>
Fri, 23 Nov 2018 00:00:00 +0000
https://jeremyong.com/vulkan/graphics/rendering/2018/11/23/vulkan-synchonization-primer-part-ii/
https://jeremyong.com/vulkan/graphics/rendering/2018/11/23/vulkan-synchonization-primer-part-ii/Vulkan Synchronization Primer - Part I<p>The intent of this post is to provide a mental model for understanding the various synchronization
nouns, verbs, and adjectives Vulkan offers. In particular, after reading this series, hopefully, you’ll
have a good understanding of what problems exist, when you should use which synchronization feature, and
what is likely to perform better or worse. There are no real prerequisites to reading this, except that
you’ve at least encountered a few barriers in tutorial or sample code. I would guess that many readers
might have also tried to read the standard on synchronization (with varying degrees of success).</p>
<p>Part I of this series will primarily cover memory barriers (<code class="highlighter-rouge">VkMemoryBarrier</code>,
<code class="highlighter-rouge">VkBufferMemoryBarrier</code> and <code class="highlighter-rouge">VkImageMemoryBarrier</code>) as well as some of the fundamentals that will be
important later on. This is a bird’s eye view approach and not intended to ultimately replace careful
reading of the documentation and standard. I definitely won’t be able to summarize everything the
standard offers on this topic, but hopefully a good chunk of it should be immediately more accessible
to you.</p>
<p>(Part II is <a href="/c++/graphics/gpu/vulkan/2018/11/23/vulkan-synchonization-primer-part-ii.html">out!</a>)</p>
<h2 id="the-factory">The “Factory”</h2>
<p>Suppose you own a factory that makes a complicated item, the flurble. There are many types of flurbles,
each type with different constituent components and unique assembly instructions. Furthermore, market
demand for a specific type of flurbles can fluctuate greatly, so you often have to change up the type
of flurbles being generated from the factory. To make matters worse, the flurble factory is located in
a remote region, and instructions between you and the factory are delivered via email. The workers at
the flurble factory are precise workers, and will follow your instructions to the letter.</p>
<p>As the principal flurble designer, you need to deliver your directions precisely. Otherwise, you risk
requiring a lot of back-and-forth. Furthermore, if you don’t deliver an <em>efficient</em> set of instructions,
you risk the factory churning out fewer flurbles than you might like. Let’s look at a hypothetical
example to see what I mean. Suppose you design a flurble that has three parts, <script type="math/tex">F_a</script>, <script type="math/tex">F_b</script>, and <script type="math/tex">F_c</script>.
To combine them, <script type="math/tex">F_a</script> plugs into <script type="math/tex">F_b</script>, and the combination <script type="math/tex">F_{ab}</script> plugs into <script type="math/tex">F_c</script> to make the
finished product, <script type="math/tex">F_{abc}</script>.</p>
<p>A very poor way to create this product might look like the following.</p>
<script type="math/tex; mode=display">F_a \rightarrow F_b \rightarrow F_{ab} \rightarrow F_c \rightarrow F_{abc}</script>
<p>In my made up notation, the equation above prescribes a purely sequential set of instructions, and combined
subscripts like <script type="math/tex">F_{ab}</script> imply a combination effect between two or more dependent objects. Creating
<script type="math/tex">{F_b}</script> doesn’t happen until <script type="math/tex">F_a</script> is finished, for example, and <script type="math/tex">F_c</script> isn’t made until just before it’s
needed at the very end. It should be pretty clear that we can do a lot better than this. For example,
we could have different people at the factory create the separate flurble parts <script type="math/tex">F_a</script> through <script type="math/tex">F_c</script>
in <em>parallel</em>, and do the assembly afterwards. To annotate this, we might describe this variant like so:</p>
<script type="math/tex; mode=display">\begin{bmatrix} F_a \\ F_b \\ F_c \end{bmatrix} \rightarrow F_{ab} \rightarrow F_{abc}</script>
<p>This seems better in the sense that <script type="math/tex">F_a</script>, <script type="math/tex">F_b</script>, and <script type="math/tex">F_c</script> can all be built in parallel, but
if you stop and stare for a bit, you should be able to find problems with this description as well.</p>
<p>The issue is, of course, that we don’t want to wait for <script type="math/tex">F_c</script> to finish construction before
starting the assembly of <script type="math/tex">F_{ab}</script>. We like that the creation of <script type="math/tex">F_c</script> starts in parallel with the
other components, but unfortunately, our description is still overconstrained, and we’ve introduced
a potential <em>pipeline stall</em> in our program.</p>
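<p>In the same made-up notation, a description that only gates the final assembly step on <script type="math/tex">F_c</script> might look like:</p>
<script type="math/tex; mode=display">\begin{bmatrix} \begin{bmatrix} F_a \\ F_b \end{bmatrix} \rightarrow F_{ab} \\ F_c \end{bmatrix} \rightarrow F_{abc}</script>
<p>Here, the assembly of <script type="math/tex">F_{ab}</script> can begin as soon as <script type="math/tex">F_a</script> and <script type="math/tex">F_b</script> are done, even while <script type="math/tex">F_c</script> is still under construction.</p>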
<h2 id="from-flurble-land-to-gpu-land">From Flurble-land to GPU-land</h2>
<p>I won’t carry the analogy any further, because hopefully, the simple example above should clarify
the types of issues you might encounter. First, I’ll summarize the GPU’s execution and memory model
a bit:</p>
<ul>
<li>Every command we issue to the GPU may read or write memory in some way, shape, or form</li>
<li>Every command is submitted in a <code class="highlighter-rouge">command buffer</code> which determines a dispatch order between all
the commands inside the buffer</li>
<li>Each command buffer is submitted to a particular queue, and different GPUs may or may not have
multiple queues</li>
<li>Different queues might only support a subset of the available commands (e.g. a dedicated transfer
queue supports image transfers and buffer transfers)</li>
<li>The GPU operates in a well-defined pipeline, and any given command may or may not have work for
each stage of the pipeline</li>
</ul>
<p>And things you need to worry about:</p>
<ul>
<li>Just because a command was submitted early in a buffer doesn’t mean it has finished all of its
instructions before a command submitted later begins</li>
<li>Just because a command buffer was submitted to a queue earlier than other command buffers on the
same queue, doesn’t mean that all of its commands (or indeed, any of them) are guaranteed to be
finished before a later command buffer starts</li>
<li>Command buffers submitted on different queues similarly do not provide any guarantees</li>
</ul>
<p>Why does the world operate this way? Well, this is where having an understanding of the CPU’s memory
model can help. On the CPU, we also need to be worried about read and write hazards. If a thread
writes data to a particular memory address, how can we be sure that a thread (that may be running on
a different core) can read that same memory address in a well-defined way? The answer is that we need
some form of synchronization primitive. If you’re familiar already with <code class="highlighter-rouge">std::memory_order</code>, you’ll
know that most processors provide different types of memory barriers. The memory barriers on the CPU
impose restrictions of varying levels of strictness. These barriers, when executed,
essentially say something akin to “wait until all writes executed prior to this point are visible”
before continuing (in reality, there are different types of barriers that impose different levels of
consistency).</p>
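<p>To make this concrete on the CPU side, here is a minimal (hypothetical) example of the release/acquire pairing described above:</p>

```cpp
#include <atomic>
#include <thread>

// The producer "publishes" a plain write with a release store; the consumer's
// acquire load guarantees that write is visible once `ready` reads true.
int publish_and_consume() {
    int payload = 0;
    std::atomic<bool> ready{false};

    std::thread producer([&] {
        payload = 42;                                  // plain (non-atomic) write
        ready.store(true, std::memory_order_release);  // publish
    });

    int observed = 0;
    std::thread consumer([&] {
        while (!ready.load(std::memory_order_acquire)) {}  // spin until published
        observed = payload;  // well-defined: happens-after the release store
    });

    producer.join();
    consumer.join();
    return observed;
}
```

<p>Without the release/acquire pairing (e.g. with relaxed ordering on both sides), the consumer would not be guaranteed to observe <code class="highlighter-rouge">payload == 42</code>.</p>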
<p>On the GPU though, things are a bit more complicated than on the CPU. The CPU is also deeply pipelined,
but in general, the programmer doesn’t think about the different pipeline stages. The entire thing is
more or less treated as a black box. Also, full memory/execution barriers on the GPU are <em>very</em>
expensive. The GPU’s pipelines are deep and costly to run compared to the CPU’s
pipeline. For example, just rasterization of triangles alone is a boatload of instructions and occupies
its own stage in the GPU’s pipeline. This is another way of saying that the GPU is optimized for
throughput; at least, relative to the CPU. The final difference we’ll consider (there are more, but
arguably less important differences) is that the GPU has many different <em>types</em> of memory writes.
For example, a fragment shader might write to a framebuffer. Alternatively, a command might transition
an image buffer from one memory layout to another (for more optimized sampling, or swap chain
presentation). Maybe a vertex shader writes out transform feedback data to a storage buffer. Thus,
when we issue a barrier that says “make <em>all</em> writes prior to this point visible,” this could be a
very expensive barrier indeed, since all the various buffers now need to perform all necessary
operations (some of which are fairly costly) before continuing.</p>
<p>The way we have to approach things then, is a much more <em>explicit</em> barrier. We need a <code class="highlighter-rouge">barrier</code> that says:
given a pipeline stage and all the types of memory we’re about to care about, make sure we’re
finished with those writes before proceeding to access this type of memory at these specific
stages. A bit of a mouthful? Let’s look at an example:</p>
<div class="language-c++ highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">vk</span><span class="o">::</span><span class="n">MemoryBarrier</span> <span class="n">barrier</span><span class="p">{</span>
<span class="n">vk</span><span class="o">::</span><span class="n">AccessFlagBits</span><span class="o">::</span><span class="n">eTransferWrite</span><span class="p">,</span> <span class="c1">// Src access mask
</span> <span class="n">vk</span><span class="o">::</span><span class="n">AccessFlagBits</span><span class="o">::</span><span class="n">eVertexAttributeRead</span> <span class="c1">// Dst access mask
</span><span class="p">};</span>
<span class="n">command_buffer</span><span class="p">.</span><span class="n">pipelineBarrier</span><span class="p">(</span>
<span class="n">vk</span><span class="o">::</span><span class="n">PipelineStageFlagBits</span><span class="o">::</span><span class="n">eTransfer</span><span class="p">,</span> <span class="c1">// Src pipeline stage mask
</span>  <span class="n">vk</span><span class="o">::</span><span class="n">PipelineStageFlagBits</span><span class="o">::</span><span class="n">eVertexInput</span><span class="p">,</span> <span class="c1">// Dst pipeline stage mask
</span>  <span class="p">{},</span> <span class="c1">// Dependency flags
</span>  <span class="n">barrier</span><span class="p">,</span> <span class="c1">// Memory barriers
</span>  <span class="nb">nullptr</span><span class="p">,</span> <span class="c1">// Buffer memory barriers
</span>  <span class="nb">nullptr</span> <span class="c1">// Image memory barriers
</span><span class="p">);</span>
</code></pre></div></div>
<p>We can read this code in two parts. First, the memory barrier defines what memory must be
visible (here, transferred memory from something like a <code class="highlighter-rouge">vkCmdCopyBuffer</code>) to where (subsequent
commands that rely on reading vertex attribute memory). The second part, the <code class="highlighter-rouge">vkCmdPipelineBarrier</code>,
informs the driver that the memory barrier kicks in when we reach the vertex input stage of the
pipeline, and the relevant memory written by the transfer stage must have been published at
this point in time. This barrier applies to <em>all</em> commands submitted earlier in the same
command buffer, and <em>all</em> commands submitted in a different command buffer earlier on the same queue.
Remember also that each command may or may not invoke each pipeline stage. In this example,
if commands submitted before the barrier did not have a <code class="highlighter-rouge">transfer</code> stage, they would not
factor into the execution of the barrier. Similarly, commands submitted after the barrier
that do not have the <code class="highlighter-rouge">VertexInput</code> stage enabled will execute as though this barrier didn’t
exist. The memory and pipeline barrier together then, can be thought of as a specification of
“filters” that define dependencies between a subset of commands submitted after the barrier and a subset
of commands submitted before it.</p>
<p>We should now be able to simply <em>read</em> arbitrary barriers and understand what information
they encode (regardless of whether or not they make sense). For example:</p>
<div class="language-c++ highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">vk</span><span class="o">::</span><span class="n">MemoryBarrier</span> <span class="n">barrier</span><span class="p">{</span>
<span class="n">vk</span><span class="o">::</span><span class="n">AccessFlagBits</span><span class="o">::</span><span class="n">eUniformWrite</span> <span class="o">|</span>
<span class="n">vk</span><span class="o">::</span><span class="n">AccessFlagBits</span><span class="o">::</span><span class="n">eDepthStencilAttachmentWrite</span><span class="p">,</span> <span class="c1">// Src access mask
</span> <span class="n">vk</span><span class="o">::</span><span class="n">AccessFlagBits</span><span class="o">::</span><span class="n">eIndexRead</span> <span class="c1">// Dst access mask
</span><span class="p">};</span>
<span class="n">command_buffer</span><span class="p">.</span><span class="n">pipelineBarrier</span><span class="p">(</span>
<span class="n">vk</span><span class="o">::</span><span class="n">PipelineStageFlagBits</span><span class="o">::</span><span class="n">eEarlyFragmentTests</span><span class="p">,</span> <span class="c1">// Src pipeline stage mask
</span> <span class="n">vk</span><span class="o">::</span><span class="n">PipelineStageFlagBits</span><span class="o">::</span><span class="n">eGeometryShader</span> <span class="o">|</span>
    <span class="n">vk</span><span class="o">::</span><span class="n">PipelineStageFlagBits</span><span class="o">::</span><span class="n">eTransfer</span><span class="p">,</span> <span class="c1">// Dst pipeline stage mask
</span>  <span class="p">{},</span> <span class="c1">// Dependency flags
</span>  <span class="n">barrier</span><span class="p">,</span> <span class="c1">// Memory barriers
</span>  <span class="nb">nullptr</span><span class="p">,</span> <span class="c1">// Buffer memory barriers
</span>  <span class="nb">nullptr</span> <span class="c1">// Image memory barriers
</span><span class="p">);</span>
</code></pre></div></div>
<p>This “nonsense” barrier basically reads like “before trying to read any index buffers from
either the transfer or geometry shading stage (or later), make sure that writes to any uniforms and
depth-stencil attachments from the early fragment tests stage (or earlier) have completed.”</p>
<p>We should include a few caveats. First, specifying memory access bits for a kind of access
that a stage doesn’t actually perform doesn’t make a lot of sense. For example, defining a memory
barrier that waits for all shader writes from the late fragment tests stage is odd because
no shader invocation happens in that pipeline stage whatsoever. Furthermore, it doesn’t make
sense to place a barrier between two stages, where the source stage happens <em>after</em> the
destination stage in the pipeline. Third, it doesn’t make sense to mask access bits
that correspond to reads (e.g. <code class="highlighter-rouge">eShaderRead</code>) in the source access mask. Reading the same
memory from different stages without a write issued in between is well-defined and requires
no barrier. Last, the scopes defined by a pipeline barrier refer to commands submitted
<em>prior</em> to the barrier and commands submitted after. Thus, in the above example, if you
were to submit the pipeline barrier after a draw command that uses the geometry shader,
the barrier won’t apply to it (and you may be introducing a memory hazard if that shader
accessed a uniform).</p>
<h2 id="multiple-queue-submission">Multiple Queue Submission</h2>
<p>The definitions above applied to submissions that occurred on the same queue. As we
mentioned earlier though, GPUs have multiple queues. Most generally, they will have
some number (zero or more) of graphics, compute, and transfer queues. Submitting to multiple
queues unlocks parallel submission on one hand, but on the other hand, there is now an
entirely separate class of possible memory hazards we need to watch out for. Going back
to the Flurble factory, imagine if there were multiple people emailing the factory
workers with interdependent instructions. Of course, we’d like to not have to deal
with this complexity at all, but in general, applications can be hugely bandwidth-bound,
and this abstraction simply offers too much performance to leave off the table.</p>
<p>There are two main options for synchronizing work between queues. First, there is
the broad-phase synchronization known as the <code class="highlighter-rouge">VkSemaphore</code>. The way it works is much
like the standard OS-provided semaphore. Semaphores to be signaled are submitted
to one queue at submission time, and semaphores are submitted to a different queue
to be waited on (also at submission time). I call this “broad-phase” for pretty obvious
reasons; it’s heavy-weight and blocks all execution on the second queue until all
operations of the first queue finish. Sometimes, this is exactly what you want.
For example, finishing all rendering on a graphics queue before attempting to send
the framebuffer to the presentation queue.</p>
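<p>Concretely, the graphics submission signals a semaphore that the presentation then waits on. In this sketch (C API; <code class="highlighter-rouge">render_done</code>, the queues, command buffer, swapchain, and <code class="highlighter-rouge">image_index</code> are all assumed to have been created elsewhere):</p>

```cpp
// Graphics queue: signal `render_done` when the submitted work completes.
VkSubmitInfo submit_info = {};
submit_info.sType                = VK_STRUCTURE_TYPE_SUBMIT_INFO;
submit_info.commandBufferCount   = 1;
submit_info.pCommandBuffers      = &graphics_cmd_buffer;
submit_info.signalSemaphoreCount = 1;
submit_info.pSignalSemaphores    = &render_done;
vkQueueSubmit(graphics_queue, 1, &submit_info, VK_NULL_HANDLE);

// Present queue: wait on `render_done` before presenting the image.
VkPresentInfoKHR present_info = {};
present_info.sType              = VK_STRUCTURE_TYPE_PRESENT_INFO_KHR;
present_info.waitSemaphoreCount = 1;
present_info.pWaitSemaphores    = &render_done;
present_info.swapchainCount     = 1;
present_info.pSwapchains        = &swapchain;
present_info.pImageIndices      = &image_index;
vkQueuePresentKHR(present_queue, &present_info);
```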
<p>Other times, we need a finer-grained form of synchronization. The most common
example of this is a transfer job that uploads buffers or images to the GPU, which
are consumed later by the graphics queue. Alternatively, we might have interdependencies
between the graphics queue and the compute queue or vice-versa. Luckily, encoding
this information is not so difficult as we have two memory barrier types for dealing
with these cases specifically: the <code class="highlighter-rouge">VkBufferMemoryBarrier</code> and <code class="highlighter-rouge">VkImageMemoryBarrier</code>.
Both of these structures contain member fields to encode the source and destination
queue families. These barriers are also useful on a single queue, since they
let you encode a barrier on a sub-region of either the buffer or image, or (in the
case of an image barrier) an image layout transition in addition to everything else.
When used to describe a queue transfer however, these barriers need to be submitted
to <em>both queues</em> with the source and destination queue reversed. Depending on the
queue they are submitted to, the barriers define a release or acquire operation
between the queues. Another difference is that when these barriers describe a queue
ownership transfer and a <em>release</em> is defined, the <code class="highlighter-rouge">dstStageMask</code> is ignored -
after all, the commands submitted afterwards in the <em>same queue</em> do not care about
the barrier. Similarly, the <code class="highlighter-rouge">srcStageMask</code> is ignored in the acquire operation on the
other side for an analogous reason.</p>
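<p>As a sketch, a buffer ownership transfer from a dedicated transfer queue family to the graphics queue family might be encoded as follows (C API; the queue family indices, command buffers, and <code class="highlighter-rouge">vertex_buffer</code> are assumed to exist):</p>

```cpp
// The same barrier contents are recorded on both queues.
VkBufferMemoryBarrier ownership = {};
ownership.sType               = VK_STRUCTURE_TYPE_BUFFER_MEMORY_BARRIER;
ownership.srcAccessMask       = VK_ACCESS_TRANSFER_WRITE_BIT;
ownership.dstAccessMask       = VK_ACCESS_VERTEX_ATTRIBUTE_READ_BIT;
ownership.srcQueueFamilyIndex = transfer_family_index;
ownership.dstQueueFamilyIndex = graphics_family_index;
ownership.buffer              = vertex_buffer;
ownership.offset              = 0;
ownership.size                = VK_WHOLE_SIZE;

// Release operation, recorded on the transfer queue.
vkCmdPipelineBarrier(transfer_cmd_buffer,
    VK_PIPELINE_STAGE_TRANSFER_BIT, VK_PIPELINE_STAGE_BOTTOM_OF_PIPE_BIT,
    0, 0, nullptr, 1, &ownership, 0, nullptr);

// Acquire operation, recorded on the graphics queue.
vkCmdPipelineBarrier(graphics_cmd_buffer,
    VK_PIPELINE_STAGE_TOP_OF_PIPE_BIT, VK_PIPELINE_STAGE_VERTEX_INPUT_BIT,
    0, 0, nullptr, 1, &ownership, 0, nullptr);
```

<p>Note that the two submissions themselves must still be ordered with a semaphore; the barrier pair only transfers ownership of the buffer’s contents.</p>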
<h2 id="render-passes-and-subpasses-fences-and-events">Render Passes and Subpasses, Fences, and Events</h2>
<p>Render passes operate similarly to the other barriers but are much
more convenient to use in the context of draw commands and optimizations for tiled
renderers. Fences and events will have to wait until later as well, but in all, I expect
any subsequent discussion to be a bit easier once the contents in this post are firmly
grasped (these topics are covered in <a href="/c++/graphics/gpu/vulkan/2018/11/23/vulkan-synchonization-primer-part-ii.html">part ii</a>).</p>
<h2 id="conclusion">Conclusion</h2>
<p>And that’s it for this part of the “primer.” Hopefully, this much is enough that you can reason about
when and why dependencies occur in Vulkan (or any other modern graphics API), and how
to encode them. To actually use them effectively in the wild, remember not to encode
redundant dependencies. For example, if <script type="math/tex">C</script> depends on both <script type="math/tex">A</script> and <script type="math/tex">B</script>, but
<script type="math/tex">B</script> depends on <script type="math/tex">A</script>, you can encode this using two dependencies <script type="math/tex">A \rightarrow B</script>
and <script type="math/tex">B \rightarrow C</script>. The dependency <script type="math/tex">A \rightarrow C</script> is redundant. Also, trying
to get a perfect representation of all your resources in the application is often
counterproductive. It’s better to think of synchronization as necessary for making
your application correct, but not in and of itself free. After all, it takes
some amount of computational work to evaluate the dependency graph for the driver as
well. How granular your dependencies should be is definitely outside the scope of this
article, but experimentation is encouraged, and personally, I would opt for less granularity
upfront, with additional changes once profiling has identified a bubble. As a final
point, if you’ve tried reading the spec before and perhaps got discouraged or disinterested,
you may want to try giving it another go :). As always, feedback is welcome and you
can reach me via email or twitter at the links below.</p>
Thu, 22 Nov 2018 00:00:00 +0000
https://jeremyong.com/vulkan/graphics/rendering/2018/11/22/vulkan-synchronization-primer/
https://jeremyong.com/vulkan/graphics/rendering/2018/11/22/vulkan-synchronization-primer/Best Practices for Authoring Generic Data Structures<p>This is a collection of ideas I’ve developed over the years that have resulted in higher
quality and more ergonomic code. In this article, I’m going to say the caveat once (right now)
that you should always code and architect for your particular workflow, and these ideas may or may not
apply. Henceforth, I’m going to be prescriptive about what I think makes a good set of patterns,
and do my best to provide the rationale. I’m not going to talk about actual data structures
themselves, but instead about design principles and coding practices that I think apply to
all data structures as it relates to C++. In the code examples, pretend I did all the <code class="highlighter-rouge">constexpr</code>,
<code class="highlighter-rouge">[[nodiscard]]</code>, <code class="highlighter-rouge">noexcept</code>, and any other aspects of the attribute and modifier zoo properly
(omitted for brevity).</p>
<h2 id="moving-away-from-the-standard-way">Moving away from the “standard” way</h2>
<p>For the sake of example throughout this article, we’ll talk about authoring the most primitive
of useful data structures (the quadratically resizing array I’ll just call <code class="highlighter-rouge">vector</code> here).
If I asked most people to sketch the implementation, they’d probably come up with something that
rhymes with the following:</p>
<div class="language-c++ highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">template</span> <span class="o"><</span><span class="k">typename</span> <span class="n">T</span><span class="o">></span>
<span class="k">class</span> <span class="nc">vector</span>
<span class="p">{</span>
<span class="k">public</span><span class="o">:</span>
<span class="c1">// A bunch of constructors handling various types of initialization, moves, and copies
</span> <span class="c1">// A destructor that invokes `delete` on `data_` if non-null
</span> <span class="c1">// An iterator type, and `begin`, `end`, `rbegin`, `rend`, and their const equivalents
</span> <span class="c1">// Methods for mutating the `vector` like `push_back`, `emplace_back`, `clear`, etc.
</span> <span class="c1">// Operators and accessors for reading and writing to constituent data
</span>
<span class="k">private</span><span class="o">:</span>
<span class="kt">void</span> <span class="n">grow</span><span class="p">();</span> <span class="c1">// Invoked when the size_ is about to grow beyond the capacity_
</span>
<span class="n">T</span><span class="o">*</span> <span class="n">data_</span> <span class="o">=</span> <span class="nb">nullptr</span><span class="p">;</span>
<span class="kt">size_t</span> <span class="n">size_</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span>
<span class="kt">size_t</span> <span class="n">capacity_</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span>
<span class="p">};</span>
</code></pre></div></div>
<p>This is not a bad place to start; after all, most standard containers you use will resemble
this style of implementation. Still, I would contend that oftentimes, we can improve on this
a good deal.</p>
<h2 id="interface-weaknesses-of-the-vanilla-approach">Interface weaknesses of the vanilla approach</h2>
<p>The primary gripe I have with data structures authored as above is the template parameter <code class="highlighter-rouge">T</code>.
On the surface, this seems necessary. The data structure should know how <code class="highlighter-rouge">T</code> is constructed,
moved, and destructed, right? After all, such operations on <code class="highlighter-rouge">T</code> all need to happen over the
lifetime of the data structure. Furthermore, <code class="highlighter-rouge">T</code> comes with information about the size and
alignment requirements for allocating memory to store it.</p>
<p>To understand why this might be a problem, consider the issue of writing a common interface
to, say, a rendering backend. You want to provide a class that contains data structures that
hold information like shader handles, pipeline objects, buffer handles, and texture data.
This interface may be implemented numerous times for this project, to provide an OpenGL, Vulkan,
Metal, and D3D backend (possibly across multiple versions of each). Furthermore, for various
types of data, the alignment or size requirements may not even be known until runtime. Many
buffer types have extended alignment restrictions that must be queried from the GPU at runtime.</p>
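<p>For instance, the minimum alignment for uniform buffer offsets is a per-device limit that is only discoverable at runtime (C API sketch; <code class="highlighter-rouge">physical_device</code> is assumed to exist):</p>

```cpp
VkPhysicalDeviceProperties properties;
vkGetPhysicalDeviceProperties(physical_device, &properties);

// Only known at runtime; varies by GPU (often 16, 64, or 256 bytes).
VkDeviceSize ubo_alignment =
    properties.limits.minUniformBufferOffsetAlignment;
```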
<p>For this type of interface, we cannot house the naive <code class="highlighter-rouge">vector</code> class implemented above
in the common interface layer. It needs to be duplicated within each backend implementation
of the renderer. Furthermore, we may need to choose a very pessimistic alignment requirement
to ensure that things work portably. All this means that we both waste memory, and also will
have a tough time sharing common code that needs to operate on these data structures. For
example, we’d love to have a common set of functions for managing the lifetimes of each resource
type, and tracking memory budgets and usage.</p>
<h2 id="an-alternative-design">An alternative design</h2>
<p>The problem, in my opinion, is that there is too much internal coupling with this design. The
data structure simultaneously manages both the memory and data structure algorithms <em>and</em> the
lifecycle operations of the type <code class="highlighter-rouge">T</code>.</p>
<p>This is where my own code takes a sharp left turn from other generic code I’ve seen. I believe
that data structures should in fact be provided as two separate classes, the <em>storage</em> class,
and the <em>view</em> class. The <em>storage</em> class should only be concerned with the size and alignment
restrictions of its internals, as well as its policy for when data needs to move around or
allocate. The <em>view</em> class should be a thin type-safe adaptor that can access the data without
the need for excessive casting.</p>
<p>Here’s an example of what this might look like:</p>
<div class="language-c++ highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">class</span> <span class="nc">vector_storage</span><span class="p">;</span> <span class="c1">// Forward declaration of non-templatized vector
</span>
<span class="k">template</span> <span class="o"><</span><span class="k">typename</span> <span class="n">T</span><span class="o">></span>
<span class="k">class</span> <span class="nc">vector_view</span>
<span class="p">{</span>
<span class="k">public</span><span class="o">:</span>
<span class="n">vector_view</span><span class="p">(</span><span class="n">vector_storage</span><span class="o">&</span> <span class="n">v</span><span class="p">)</span> <span class="o">:</span> <span class="n">vector_storage_</span><span class="p">{</span><span class="n">v</span><span class="p">}</span> <span class="p">{}</span>
<span class="c1">// An iterator type, and `begin`, `end`, `rbegin`, `rend`, and their const equivalents
</span> <span class="c1">// Methods for mutating the `vector` like `push_back`, `emplace_back`, `clear`, etc.
</span> <span class="c1">// Operators and accessors for reading and writing to constituent data
</span><span class="nl">private:</span>
  <span class="n">vector_storage</span><span class="o">&</span> <span class="n">vector_storage_</span><span class="p">;</span> <span class="c1">// held by reference; the view does not own the storage</span>
<span class="p">};</span>
<span class="k">class</span> <span class="nc">vector_storage</span>
<span class="p">{</span>
<span class="k">public</span><span class="o">:</span>
  <span class="n">vector_storage</span><span class="p">(</span><span class="kt">size_t</span> <span class="n">element_size</span><span class="p">,</span> <span class="kt">size_t</span> <span class="n">alignment</span><span class="p">);</span>
<span class="c1">// template member functions for push_back, iterators, etc.
</span>
<span class="c1">// For example:
</span> <span class="k">template</span> <span class="o"><</span><span class="k">typename</span> <span class="n">T</span><span class="o">></span>
<span class="kt">void</span> <span class="n">push_back</span><span class="p">(</span><span class="n">T</span><span class="o">&&</span> <span class="n">value</span><span class="p">)</span>
<span class="p">{</span>
<span class="k">static_assert</span><span class="p">(</span><span class="k">sizeof</span><span class="p">(</span><span class="n">T</span><span class="p">)</span> <span class="o">==</span> <span class="n">element_size_</span><span class="p">,</span> <span class="s">"Type size mismatch"</span><span class="p">);</span>
<span class="k">static_assert</span><span class="p">(</span><span class="k">alignof</span><span class="p">(</span><span class="n">T</span><span class="p">)</span> <span class="o"><=</span> <span class="n">alignment_</span><span class="p">,</span> <span class="s">"Alignment restriction violation"</span><span class="p">);</span>
<span class="k">if</span> <span class="p">(</span><span class="n">size_</span> <span class="o">==</span> <span class="n">capacity_</span><span class="p">)</span> <span class="n">grow</span><span class="o"><</span><span class="n">T</span><span class="o">></span><span class="p">();</span>
<span class="o">*</span><span class="p">(</span><span class="n">begin</span><span class="o"><</span><span class="n">T</span><span class="o">></span><span class="p">()</span> <span class="o">+</span> <span class="n">size_</span><span class="p">)</span> <span class="o">=</span> <span class="n">std</span><span class="o">::</span><span class="n">forward</span><span class="o"><</span><span class="n">T</span><span class="o">></span><span class="p">(</span><span class="n">value</span><span class="p">);</span>
<span class="o">++</span><span class="n">size_</span><span class="p">;</span>
<span class="p">}</span>
<span class="k">private</span><span class="o">:</span>
<span class="k">template</span> <span class="o"><</span><span class="k">typename</span> <span class="n">T</span><span class="o">></span>
<span class="kt">void</span> <span class="n">grow</span><span class="p">()</span>
<span class="p">{</span>
<span class="n">capacity_</span> <span class="o">*=</span> <span class="mi">2</span><span class="p">;</span>
<span class="kt">void</span><span class="o">*</span> <span class="n">temp</span> <span class="o">=</span> <span class="n">data_</span><span class="p">;</span>
<span class="n">data_</span> <span class="o">=</span> <span class="n">std</span><span class="o">::</span><span class="n">malloc</span><span class="p">(</span><span class="n">capacity_</span> <span class="o">*</span> <span class="n">element_size_</span><span class="p">);</span>
<span class="k">if</span> <span class="nf">constexpr</span> <span class="p">(</span><span class="n">std</span><span class="o">::</span><span class="n">is_trivially_copyable</span><span class="o"><</span><span class="n">T</span><span class="o">>::</span><span class="n">value</span><span class="p">)</span>
<span class="p">{</span>
<span class="n">std</span><span class="o">::</span><span class="n">memcpy</span><span class="p">(</span><span class="n">data_</span><span class="p">,</span> <span class="n">temp</span><span class="p">,</span> <span class="n">size_</span> <span class="o">*</span> <span class="n">element_size_</span><span class="p">);</span>
<span class="p">}</span>
<span class="k">else</span>
<span class="p">{</span>
<span class="c1">// Loop through old memory and for each element_size_ block,
</span> <span class="c1">// perform a cast and move to the new location
</span> <span class="p">}</span>
<span class="n">std</span><span class="o">::</span><span class="n">free</span><span class="p">(</span><span class="n">temp</span><span class="p">);</span> <span class="c1">// Release the old allocation</span>
<span class="p">}</span>
<span class="kt">void</span><span class="o">*</span> <span class="n">data_</span><span class="p">;</span>
<span class="kt">size_t</span> <span class="n">size_</span><span class="p">;</span>
<span class="kt">size_t</span> <span class="n">capacity_</span><span class="p">;</span>
<span class="kt">size_t</span> <span class="n">element_size_</span><span class="p">;</span>
<span class="kt">size_t</span> <span class="n">alignment_</span><span class="p">;</span>
<span class="p">};</span>
</code></pre></div></div>
<p>Hopefully, this stub code is enough to convey the gist of the idea. We now have two classes, one
templatized, one not. The one that isn’t templatized is concerned only with the actual
storage algorithm, but defines templatized member functions so it can essentially transmute
itself as necessary. With this interface, a common class can define
the structure directly, along with common functions that operate on it. The various platform-specific
implementations can then create a “view” to operate on the structure in a type-safe manner.</p>
<p>Note that we can define another class like so:</p>
<div class="language-c++ highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">template</span> <span class="o"><</span><span class="k">typename</span> <span class="n">T</span><span class="o">></span>
<span class="k">class</span> <span class="nc">vector</span> <span class="o">:</span> <span class="k">public</span> <span class="n">vector_view</span><span class="o"><</span><span class="n">T</span><span class="o">></span>
<span class="p">{</span>
<span class="k">public</span><span class="o">:</span>
<span class="n">vector</span><span class="p">()</span>
<span class="o">:</span> <span class="n">vector_view</span><span class="o"><</span><span class="n">T</span><span class="o">></span><span class="p">(</span><span class="n">vector_storage_</span><span class="p">)</span> <span class="c1">// Bases are initialized first, regardless of listing order</span>
<span class="p">,</span> <span class="n">vector_storage_</span><span class="p">(</span><span class="k">sizeof</span><span class="p">(</span><span class="n">T</span><span class="p">),</span> <span class="k">alignof</span><span class="p">(</span><span class="n">T</span><span class="p">))</span>
<span class="p">{}</span>
<span class="k">private</span><span class="o">:</span>
<span class="n">vector_storage</span> <span class="n">vector_storage_</span><span class="p">;</span>
<span class="p">};</span>
</code></pre></div></div>
<p>and this completely recovers the original semantics of the vector! However, now, we have
several significant advantages. For the price of the storage space for the element size
and alignment requirement, we have an adaptor type we can use to alias memory if necessary.
We can handle runtime element sizing and alignment requirements. We can reuse more generic
code (even if different platform requirements store different types of data in the structure).
And we always have the fully type-safe version to fall back on if necessary.</p>
<h2 id="views-for-days">Views for days</h2>
<p>There is another strength to this approach which I haven’t touched on yet.
The type-safe <code class="highlighter-rouge">view</code> adaptor can be templatized on the <code class="highlighter-rouge">storage</code> class itself. This means
that we can have a single <code class="highlighter-rouge">view</code> adaptor for an entire class of data structures. For example,
all traversable, random access data structures could be wrapped by the same view template.
Similarly, we can provide a single uniform view for all associative containers. Looking
forward to C++ concepts, each view template could accept a data structure constrained to a
particular concept.</p>
<p>The possibilities of this are myriad. Here are some ideas for view-types you could provide:</p>
<ul>
<li>Want your container to be thread-safe? Keep the storage abstraction, but tack on a thread-safe view</li>
<li>Have a version of the view that is fully instrumented with debugging and tracking that you enable as needed.</li>
<li>Provide a view that takes a <code class="highlighter-rouge">thread_id</code> as a parameter and enforces that all access to the data
structure originates from that thread.</li>
</ul>
<p>With this approach, you can write such views <em>once</em> for a particular data structure concept and it’s there forever.
That’s generic code at work!</p>
<h2 id="providing-the-allocatorarena">Providing the allocator/arena</h2>
<p>Another point worth making is that all my data structures accept an allocator of some
sort as an argument. Exposing the allocator as a template parameter really doesn’t make
sense, since you usually want to allocate the memory from a specific <em>instance</em> of the
allocator. As I generally work with pretty performance-sensitive code, this is a must,
and I rarely ever rely on the system malloc (although I have an allocator that passes
through to malloc occasionally for convenience in testing). This completely bypasses
the need for polymorphic allocators, which exist to permit type compatibility between
data structures that would otherwise be the same except for the allocator type. To me,
this is a hot mess that I think is worth avoiding entirely in your own code.</p>
<h2 id="yet-more-opinions">Yet more opinions</h2>
<p>Consider making your default views and type-safe wrappers fail when passed a non-trivially-copyable
template parameter if performance is crucial. Usually, if you’re writing a custom
data structure anyway, it’s because you have particular needs that aren’t met
by the out-of-the-box containers. As such, if you’re storing rich objects that have
fancy moves and resource-management semantics, you’re likely not doing yourself a
favor. Enforcing various type traits, size/alignment restrictions, etc. at the view
level at compile time is <em>great</em> for forcing the user to design the data layout
correctly upfront.</p>
<h2 id="conclusion">Conclusion</h2>
<p>TL;DR: For writing high-performance generic data structures:</p>
<ul>
<li>Separate type-specific access and mutation code into a view class that forwards the type to
a storage class. The storage class has templatized member functions, but is itself non-templatized.</li>
<li>Provide a single view for each broad class of data structures that behave with similar semantics.</li>
<li>Provide different view types to handle differences in access patterns (thread safety, logging, etc).</li>
<li>Always accept an allocator as an argument with an interface exposed in your code somewhere. Don’t
bake it into your type signature.</li>
<li>Enforce better defaults for encouraging data-oriented SoA design and hide less-performant patterns
behind more verbose interfaces.</li>
</ul>
Sat, 17 Nov 2018 00:00:00 +0000
https://jeremyong.com/c++/graphics/2018/11/17/best-practices-for-authoring-generic-data-structures/
https://jeremyong.com/c++/graphics/2018/11/17/best-practices-for-authoring-generic-data-structures/Thoughts on the Cpp Graphics Proposal<p>Earlier this year (February 2018), I sent <a href="https://groups.google.com/a/isocpp.org/forum/?fromgroups#!topic/sg13/gUr98RZMU7M">an email</a> to the ISO
SG13 C++ group to the effect of why I felt the C++ graphics proposal was, in short, <em>not a good idea</em>. You’re welcome to read it if you want,
but this post is an attempt at presenting a more complete and better-organized argument.</p>
<h2 id="brief-graphics-history-lesson">Brief Graphics History Lesson</h2>
<p>First, it is important to understand the evolution of graphics APIs as they have evolved over the years. In the last two decades, we’ve gone through
multiple paradigm shifts over several generations of architecture. As such, reviewing the changes and sheer velocity will give us a good idea of
the challenges any C++ graphics proposal will need to keep in mind.</p>
<h3 id="fixed-function-programming">Fixed-function Programming</h3>
<p>In the very beginning of course, we had a straightforward fixed-function, immediate mode of rendering. I’m not going to talk about this mode much,
because it is such a violent departure from how modern GPUs (integrated or otherwise) would like to consume data. In short, in this era, programmers
had access to an API that would directly mutate state on the GPU, send data to the GPU, and issue draw commands.</p>
<p>The issue of course is that this style of communication necessitates an absurd amount of round-trip stalls between the GPU and the CPU. Operating
completely in lockstep, bubbles of idle waiting are inevitable on one device or the other.</p>
<h3 id="naïve-programmable-pipelines">Naïve Programmable Pipelines</h3>
<p>Next we had what I will call “naïve” programmable pipelines. Not that they aren’t productive or a huge improvement over their ancestors; they are. To
wit, the mere existence of a “vertex shader” and a “fragment shader” opened up massive opportunities for exploring new shading techniques and
increased productivity. Now, instead of prescribing all operations for each primitive, programmers could create and link a “shader” program which
would later operate on the GPU. Effectively, the innermost loop was offloaded from the CPU to the GPU, removing many round-trip cycles and resultant
latency. This type of evolution would prove to be a common theme for future evolutions of graphics APIs. I would lump into this category of pipeline
OpenGL 2, OpenGLES 2, and DirectX 9.</p>
<h3 id="fatter-and-more-robust-programmable-pipelines">“Fatter” and more Robust Programmable Pipelines</h3>
<p>The next evolution of graphics APIs brought about yet more expressiveness, primarily in the form of new ways of reading and writing data between
the GPU and CPU (reads and writes both ways). The main observation here is that graphically impressive content consumed heavily <em>variable</em> data. To
compose a single frame of a film, game, or even application, a program would need to conjure up and stream hundreds of megabytes of vertex data,
texture data, and raw data (for transforming, animating, or blending data). Programmers started to get used to the idea of needing to pack memory
upfront and think about hardware compression availability. We got vertex array objects, texture arrays, more binding points, and more. All this in an
effort to keep the GPU fed. Also, graphics engineers were blessed with far more mechanisms for actually authoring shaders and reading back data from
the GPU. Examples of APIs that allowed this style of code include OpenGL 4, OpenGLES 3, and DirectX 11.</p>
<h3 id="approaching-zero-driver-overhead-style-engines">“Approaching Zero Driver Overhead” Style Engines</h3>
<p>Of course, that wasn’t enough :). Engines and games routinely struggled to hit their frame-time targets, partially owing to driver overhead. As materials and
techniques evolved, graphics engineers soon found themselves at the limits of the APIs again, resulting in yet another paradigm of programming. In
the “Approaching Zero Driver Overhead” or AZDO approach, yet more synchronization points between the CPU and GPU were removed. The
previous approach would often access data or mutate a buffer that was not yet streamed to the GPU, or that was currently in-use by the GPU
respectively. Simplifying the problem a bit, protection against such data-hazard violations was provided by the driver. As a result, the driver
had to do a fair bit of work through inserted fences, reference counts, hidden copy-on-write accesses, locks, etc. The manner of this overhead also
meant that drivers had a very difficult time scaling to multicore workloads. Trying to optimize draw call submission with an engine architected in the
previous paradigm was generally unproductive. To combat this, AZDO-style engines maximize GPU throughput by queueing work in ways that require as few
touchpoints between the CPU and GPU as possible. For example, dynamic buffer data would be triple buffered to ensure no hidden fences or
copy-on-write behavior would kick in. Furthermore, such buffers would be coalesced into much larger single allocations with offsets passed to the
shader for proper per-object data consumption. For texture data, engines began to turn to either “mega-textures” that were virtually addressed or
arrays of texture-arrays to essentially “bind once” and never require changing texture state bindings again. Finally, draw calls themselves could be
reduced from a draw per-object or per-instance-group to just per-material through usage of “indirect draws,” wherein GPU draw commands were codified
in a flat buffer of command data instead of individual function calls.</p>
<h3 id="mobile-and-desktop-divergence">Mobile and Desktop Divergence</h3>
<p>Meanwhile, mobile hardware and mobile graphics were getting more and more important. Mobile chipsets continued to develop, but were faced with many
challenges to deliver performant hardware accelerated graphics. A primary problem, for example, was that of memory bandwidth. For modern phones, the
pixel density is surprisingly high (especially for high-DPI devices), and drawing the full screen every frame in the traditional “immediate mode
rendering” (IMR) style that discrete graphics cards do would have required more bandwidth than was reasonable for the phone’s heat and energy
requirements. As such, phones moved to a new style of rendering known as TBDR (tile-based deferred rendering). In this rendering mode, triangle
fragments were first binned to tiles in screen-space (often small tiles, say 16x16 pixels). In a second pass, all the fragments of each tile would
be shaded, one tile at a time. This effectively cuts down the amount of memory in the framebuffer that needs to be “hot” at the same time,
especially compared to immediate rendering which may as well invoke random access of framebuffer memory. While this was a great optimization though,
graphics engineers needed to adapt their practices to take this form of rendering into account. It’s worth mentioning that prior to mobile, similar
architectures and problems existed in the console space too. Consoles, like mobile, use a unified memory architecture (UMA), and in some cases,
encouraged a tile based rendering approach (e.g. Xbox One’s ESRAM was fast but limited in size). In a TBDR world, accidentally introducing a
command such as a readback or data-dependency that could flush the entire pipeline was extremely easy, and often non-trivial to address. In addition,
rendering algorithms began to diverge. Some techniques that performed well on desktop performed poorly on mobile, and vice versa. For example, the
availability of HSR (hidden surface removal) on PowerVR chips (or similar) meant that sorting draws by depth was strictly worse for performance on
mobile. In contrast, IMRs usually benefited greatly from a depth prepass to enable fast depth-based culling. Additional considerations were the
usage of the <code class="highlighter-rouge">discard</code> instruction in a shader program or alpha blending invalidating the HSR optimization.</p>
<h3 id="modern-apis">“Modern” APIs</h3>
<p>I say “modern” with quotes because what is modern today, may not be modern tomorrow. Already we have hardware samplers that are getting ever closer
to a ray-tracing paradigm (and by some definitions, already there depending on your performance target). For the time being, with APIs such as Vulkan, Direct3D 12, and Metal 2, all the implicit memory dependencies and instruction dependencies are more or less in the hands of the developer
(for better or worse). This is, again, a huge paradigm shift which opens up another class of potential rendering architectures. When specifying
fences, memory dependencies, etc. becomes the responsibility of the programmer, a large swath of optimizations opens up, particularly that of
multithreaded rendering. Locks, fences, and barriers no longer happen implicitly, so scaling up rendering across multiple cores is a possibility and
drivers more or less get out of the way (in principle). Of course, this brings about new challenges as engines still often need some form of
backwards compatibility, and a careless approach to injecting the new API may result in <em>worse</em> performance on both fronts. In addition, games
and apps continue to need to account for differences in TBDR and IMR architectures, as well as UMA and non-UMA architectures. For example, render
pass “subpass” dependencies can be used to enable user-land tiled deferred rendering where a gbuffer may have been prohibitively expensive before.
Also, allocation of dynamic buffers needs to account for the available memory heap types and allocate from the correct one for the job.</p>
<h2 id="history-summary">History Summary</h2>
<p>TL;DR All that is to say that graphics is a volatile space and much has changed in just a couple decades. The progression does not resemble a
guacamole dip where new layers are layered on top. Here, we have entire paradigm shifts that have often caused entire rewrites of games, graphics
engines, UI frameworks, browser backends, and more. Developers in every one of these disciplines have needed to pour enormous amounts of time and
resources to advance their efforts, while simultaneously needing to support innumerable fallback paths and branching paths for every type of
device under the sun. But back to the subject at hand, we’re talking about just a 2D graphics API right? Why does any of the above matter?</p>
<h2 id="99-problems">99 Problems</h2>
<p>Let’s consider, for just a 2D API, what the C++ standard library now needs to consider for <strong>all</strong> platforms:</p>
<ul>
<li>Window surface creation and its configuration</li>
<li>Creation of a hardware accelerated context of some sort (Assuming you want hardware acceleration? Hopefully?)</li>
<li>Handling of an event loop so the developer can decide what happens when the user moves the mouse, clicks/taps a pixel, performs a gesture, resizes the window, etc.</li>
<li>Actual rasterization of the data, animation, etc.</li>
</ul>
<p>I think not doing all the above properly will likely result in what I consider a “toy” library, in the sense that I would not ship something to
production relying on it. Handling all the above essentially means replicating software like Skia (but now it’s the compiler author’s responsibility
to make sure it compiles, runs, and performs well on all supported hardware).</p>
<p>It gets worse though! As the folks that made OpenGL (Khronos) have learned over the past several decades, things get dicey when software
specifications run afoul of hardware limitations. That is to say, having a spec is all well and good, but what if the platform you run on can’t
support a feature? Here, we get into the murky territory of checking capabilities at runtime and selecting the appropriate code path thereafter.
This is <em>strange</em> for a standardized C++ library, where even SIMD which is nearly ubiquitous is not easily standardizable because of instruction
set availability and variability (ARM NEON vs Intel intrinsics, and then you have the embedded devices that don’t support it at all).</p>
<p>Suppose we sat down and scoped it. Minimal windowing functionality, barebones event handling, and software rasterization. This could be the
barest lowest common denominator for something we might deem useful. The question is, <em>do I really want this to bloat every executable I make
and link to the standard library?</em> Even this minimal cross-section is a fair bit of code, as anyone who has written a software rasterizer will
tell you. And this “small” cross-section actually has, disproportionately, the <em>largest</em> overlap with OS-level behavior compared to any
other section of the standard. Furthermore, this also introduces a strict <em>runtime</em> dependency on its execution that goes well beyond
any other runtime functionality switching that I’m aware of in the standard.</p>
<p>Now, for a “real” 2D graphics API, it’s worth thinking about what your web browser does. To render a typical HTML/CSS page, your browser needs to
know how to handle:</p>
<ul>
<li>SVG rendering (for fonts and SVG figures)</li>
<li>Hardware accelerated animations and transforms</li>
<li>“Dirty rectangles” layout optimizations (and knowing when applying them is worse)</li>
<li>Full-fledged event handling for supporting scroll events, gestures, window focus states, backgrounding, etc</li>
<li>Hit-box detection</li>
<li>Depth-aware layering so things on top draw on top</li>
<li>etc.</li>
</ul>
<p>And this isn’t even accounting for the browser renderer’s many mixed-media capabilities (image support out the wazoo, video codecs, an entire
embeddable WebGL-capable canvas, etc). The people that author browser render backends are numerous, skilled, and specialized. They follow the trends I highlighted above, despite not doing
“hardcore 3D rendering” because they have a different set of performance benchmarks and needs. They manage compatibility on an impressive array of
hardware and platforms, and likely maintain hundreds of branching code paths to make any individual feature fast and correct for your device. They
have giant testing suites and compliance tests to make sure they do not break existing functionality. But as great as they are, <em>they are not your
compiler engineers.</em></p>
<h2 id="conclusion">Conclusion</h2>
<p>I think someone should make an “easy” (not necessarily simple, just easy) 2D library that someone can pick up and make buttons, shapes and lines on.
I think that library can introduce concepts that may be useful for a budding graphics engineer down the road. Maybe throw in bezier curves. Maybe
teach something about winding orders. Eventually maybe you kick it up a notch and add some time-dependent blending there. In a computational physics
class you learn something about stable numerical integration, what have you.</p>
<p>This thing would be great as a library. And my contention here is just that. Frankly, spending the committee’s time trying to even contemplate the
inclusion of graphics into the core of the standard library before fixing other glaring issues feels ridiculous to the point of embarrassment. The greatest strength
of the language is its ability to compile everywhere, while providing zero-cost abstractions in some cases, and no abstractions at all for the rest.
Even in this respect, however, the language will eventually lose ground if other languages (read: Rust) manage to provide parity in this regard while
improving ergonomics for other things (read: build systems, package management, etc.). If the committee really wants to pursue graphics, I think a
better starting point is to first consider the following:</p>
<ul>
<li>How should the C++ memory model interoperate with external hardware (possibly not just GPUs, but network cards, audio cards, and more)? This is the
subject of the “Elsewhere Memory” proposal</li>
<li>How much “OS” should the C++ standard library actually care about?</li>
<li>Why is “graphics” in the proposed state so important, and who will maintain it into perpetuity?</li>
<li>Is C++ better served with less bloat within the standard itself, and an easier way to access code to integrate with?</li>
</ul>
<p>As a long-time user of C++, I’m curious as always to see how it develops and have been a longtime proponent of the language in spite of its flaws.
The “spirit of the committee” in this regard though, will be a great tell as to how the language will choose to evolve in the future, and I suspect
play a big part in shaping the language’s future demographic as well.</p>
Mon, 05 Nov 2018 00:00:00 +0000
https://jeremyong.com/c++/graphics/2018/11/05/thoughs-on-the-cpp-graphics-proposal/
https://jeremyong.com/c++/graphics/2018/11/05/thoughs-on-the-cpp-graphics-proposal/My Engineering Manifesto<p>The Engineering Manifesto. There are many like it, but this is mine; the one I strive to abide by regardless of where I am, whether at someone else’s company or my own.</p>
<ol>
<li><strong>As an engineer, my first and foremost obligation is to my employer and/or client (barring, of course, violations of state or federal law).</strong>
<ul>
<li>Many potential objections to this pertain to situations in which the client/employer’s agenda is immoral or, at best, morally ambiguous. The important corollary to this statement is that alignment with the company’s vision is critical, and misalignment is more than enough grounds to seek partnership elsewhere.</li>
<li>A weaker formulation of this statement (but still true, nonetheless), is that my own preferences and wants are secondary to fulfilling my duties.</li>
</ul>
</li>
<li><strong>The software I create should account for all reasonable requirements, mentioned and unmentioned, balancing time to delivery, risk, performance, suitability, maintainability, cost, and more. During the assessment of an approach, approaches should be appraised for each of these qualities in order of most importance, which will often change depending on the nature of the assignment.</strong>
<ul>
<li>It is easy to fall into the trap of caring only about, say, performance, at the expense of other qualities. Having an intuition for what is important when is an important skill all engineers should develop.</li>
<li>In situations where a client/manager/lead/etc disagrees about prioritization, a reasonable effort should be made to communicate why something warrants additional time or consideration.</li>
</ul>
</li>
<li><strong>My goal for every project is <em>timely delivery</em> of something which functions <em>to specification</em>, packaged in a way that is <em>usable and maintainable</em> as per the project requirements.</strong>
<ul>
<li>Engineering is especially prone to volatile timelines, occasionally due to changing product requirements, but more often than not due to “missing a beat” from underestimated complexity, especially as projects get more difficult. Even if time estimation is hard, good estimates are something I should aspire to.</li>
<li>My credibility as an engineer and employee depends on my execution, as I am a critical component in a company’s executive arm.</li>
</ul>
</li>
<li><strong>My value as an engineer is dependent on not just my technical ability, but also my ability to communicate and educate.</strong>
<ul>
<li>At no point should I expect others to blindly assume I am doing the right thing or on the right track.</li>
<li>My coworkers, managers, and clients have a right to know what I’m doing or trying, how long I expect it to take, and my rationale for taking the approach that I did.</li>
<li>Pursuing wanton endeavors without communication is both toxic, and dangerous, to team and company objectives (although there is a time and place for experimentation that I should strive to recognize also).</li>
</ul>
</li>
<li><strong>My personal mistakes should be communicated and addressed transparently and without fear.</strong>
<ul>
<li>Failure to acknowledge fault stymies forward progress. How can a team collectively understand why a project deadline slipped without transparency? How can I improve as an engineer without being honest with myself?</li>
<li>Mutual transparency within an organization prevents costly, systemic error. I should not expect transparency from others, if I cannot offer my own.</li>
</ul>
</li>
<li><strong>I am an engineer, but not a martyr, victim, or patient.</strong>
<ul>
<li>Extreme measures of self-imposed obligation, expectations, or rebukes compromise my well-being and effectiveness as a whole.</li>
<li>I should work hard, but keep in mind that working in an unsustainable manner is ultimately not in the company’s best interest.</li>
</ul>
</li>
<li><strong>I should respect time.</strong>
<ul>
<li>My own time.</li>
<li>The time of my coworkers.</li>
<li>The time of the company.</li>
</ul>
</li>
<li><strong>My respect of time applies equivalently to other resources such as money, materials, equipment, and more.</strong>
<ul>
<li>The engineer constantly labors against waste and entropy of all forms: dead code, wasted CPU cycles, excess usage of storage, and more.</li>
</ul>
</li>
<li><strong>The standard I abide by applies equally to things seen and unseen.</strong>
<ul>
<li>Lines of code that I write are my responsibility, with or without a code review.</li>
<li>The time I spend in the absence of my manager is equally valuable.</li>
<li>Test cases I write that run silently, without pomp and circumstance are virtuous.</li>
</ul>
</li>
</ol>
Wed, 13 Jun 2018 00:00:00 +0000
https://jeremyong.com/engineering/philosophy/2018/06/13/my-engineering-manifesto/
https://jeremyong.com/engineering/philosophy/2018/06/13/my-engineering-manifesto/Putting C++ to work - Making Abstractions for Vulkan Specialization Info<p>This post will explore some real-world usage of the more exotic template metaprogramming features of C++11, 14, and 17. I think the resulting interface is quite nice and would not have been as convenient to provide without modern language features. The specific application in this example is an abstraction around the Vulkan <code class="highlighter-rouge">vk::SpecializationInfo</code> struct but the techniques should transfer well to other domains outside graphics as well.
All the code described here is free to use as a simple single header <a href="https://github.com/jeremyong/ninepoints/blob/master/vulkan/ShaderSpecialization.h">here</a>.</p>
<h1 id="problem-statement">Problem Statement</h1>
<p>In Vulkan, we often want to provide what are known as specialization constants to a shader. You can think of these as preprocessor definitions that will modify the layout and execution of the shader bytecode on the GPU. This data must be fed to the Vulkan shader module (an object that encapsulates the shader state) from the CPU.</p>
<p>All of this data should be held in a contiguous block of typeless memory.
To specify the shape of the data, we create an array of specialization map entries (<code class="highlighter-rouge">vk::SpecializationMapEntry</code>), each of which contains three pieces of information:</p>
<ol>
<li>The constant identifier that associates the map entry with the constant declared in the shader</li>
<li>The size of the data this entry refers to</li>
<li>The offset of the entry in the block</li>
</ol>
<p>For example, one might have a simplified shader below which contains a single specialization constant for specifying the number of instances rendered in a draw call:</p>
<div class="language-glsl highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="cp">#version 450
</span>
<span class="c1">// Number of instances specified at runtime. Defaults to 1
</span><span class="k">layout</span> <span class="p">(</span><span class="n">constant_id</span> <span class="o">=</span> <span class="mi">0</span><span class="p">)</span> <span class="k">const</span> <span class="kt">uint</span> <span class="n">k_instance_count</span> <span class="o">=</span> <span class="mi">1</span>
<span class="k">layout</span> <span class="p">(</span><span class="n">binding</span> <span class="o">=</span> <span class="mi">0</span><span class="p">,</span> <span class="n">set</span> <span class="o">=</span> <span class="mi">1</span><span class="p">)</span> <span class="k">uniform</span> <span class="n">Camera</span>
<span class="p">{</span>
<span class="kt">mat4</span> <span class="n">projection</span><span class="p">;</span>
<span class="kt">mat4</span> <span class="n">view</span><span class="p">;</span>
<span class="p">};</span>
<span class="k">layout</span> <span class="p">(</span><span class="n">binding</span> <span class="o">=</span> <span class="mi">1</span><span class="p">,</span> <span class="n">set</span> <span class="o">=</span> <span class="mi">0</span><span class="p">)</span> <span class="k">uniform</span> <span class="n">Model</span>
<span class="p">{</span>
<span class="c1">// We use the specialization constant here to determine how many model transform matrices should be
</span> <span class="c1">// provided at draw time
</span> <span class="kt">mat4</span> <span class="n">model</span><span class="p">[</span><span class="n">k_instance_count</span><span class="p">];</span>
<span class="p">};</span>
<span class="c1">// Rest of the vertex shader code here
</span></code></pre></div></div>
<p>In C++, we’d provide the value of <code class="highlighter-rouge">k_instance_count</code> at pipeline creation time. Note that I’m using the C++ bindings to Vulkan (and you should too!) but C users will follow along just fine.</p>
<div class="language-c++ highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1">// Set the number of instances to 100
</span><span class="kt">unsigned</span> <span class="kt">int</span> <span class="n">instance_count</span> <span class="o">=</span> <span class="mi">100</span><span class="p">;</span>
<span class="c1">// We're just using one entry for now, but we present it as an array since
// we may have many entries in the general case (all of which are typed differently)
</span><span class="n">std</span><span class="o">::</span><span class="n">array</span><span class="o"><</span><span class="n">vk</span><span class="o">::</span><span class="n">SpecializationMapEntry</span><span class="p">,</span> <span class="mi">1</span><span class="o">></span> <span class="n">entries</span> <span class="o">=</span> <span class="p">{</span>
<span class="n">vk</span><span class="o">::</span><span class="n">SpecializationMapEntry</span><span class="p">{</span>
<span class="mi">0</span><span class="p">,</span> <span class="c1">// Constant ID
</span> <span class="mi">0</span><span class="p">,</span> <span class="c1">// Offset
</span> <span class="k">sizeof</span><span class="p">(</span><span class="kt">unsigned</span> <span class="kt">int</span><span class="p">),</span> <span class="c1">// Size
</span> <span class="p">}</span>
<span class="p">};</span>
<span class="n">vk</span><span class="o">::</span><span class="n">SpecializationInfo</span> <span class="n">info</span><span class="p">{</span>
<span class="n">entries</span><span class="p">.</span><span class="n">size</span><span class="p">(),</span> <span class="c1">// Entry count
</span> <span class="n">entries</span><span class="p">.</span><span class="n">data</span><span class="p">(),</span> <span class="c1">// Entry data
</span> <span class="k">sizeof</span><span class="p">(</span><span class="kt">unsigned</span> <span class="kt">int</span><span class="p">),</span> <span class="c1">// Total size of ALL the entry data
</span> <span class="o">&</span><span class="n">instance_count</span> <span class="c1">// Pointer to the start of the data block
</span><span class="p">};</span>
</code></pre></div></div>
<p>If we had more data to provide than just the instance count, we would need to ensure that all the data is packed appropriately.
Obviously, there are a lot of pitfalls in this interface. We have to manage all the offsets and size bookkeeping ourselves, and
with this much boilerplate, it’s easy to introduce bugs. This isn’t a fault of Vulkan itself, but just a natural consequence of
dealing with type erasure boundaries.</p>
<h1 id="imagining-a-solution">Imagining a Solution</h1>
<p>People who work with me know that I like to think about ideal interfaces upfront. Even if they aren’t always ultimately practical (whether due to language constraints or time constraints), the mental exercise of imagining an ideal solution gives you a target definition of “good” that you can pursue.</p>
<p>What I initially came up with was something like the following:</p>
<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>// Will construct map entries for an int and a float
// The int and float will be packed contiguously in memory
ShaderSpecialization<unsigned int, float> sp;
sp.set<0>(4);
sp.set<1>(2.5f);
// Returns the value of the unsigned int at index 0
unsigned int x = sp.get<0>();
// Returns a vk::SpecializationInfo object
// Offsets are computed at compile time
vk::SpecializationInfo info = sp.info();
// Use the info above to construct your pipeline
</code></pre></div></div>
<p>With an interface like this, we can very easily create type-safe specializations!</p>
<h1 id="the-code">The Code</h1>
<div class="language-c++ highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">template</span> <span class="o"><</span><span class="k">typename</span><span class="p">...</span> <span class="n">Ts</span><span class="o">></span>
<span class="k">class</span> <span class="nc">ShaderSpecialization</span>
<span class="p">{</span>
<span class="c1">// Write me please
</span><span class="p">};</span>
</code></pre></div></div>
<p>Let’s start first with the data this will need to hold. The first order of business is to compute the total size of the data.
That is, we wish to compute the sum of all the sizes of the types in the parameter pack <code class="highlighter-rouge">Ts</code>.</p>
<p>To do this we write a couple helper functions:</p>
<div class="language-c++ highlighter-rouge"><div class="highlight"><pre class="highlight"><code> <span class="k">template</span> <span class="o"><</span><span class="k">typename</span> <span class="n">T</span><span class="o">></span>
<span class="k">static</span> <span class="k">constexpr</span> <span class="kt">size_t</span> <span class="n">size_helper</span><span class="p">()</span>
<span class="p">{</span>
<span class="k">static_assert</span><span class="p">(</span><span class="n">std</span><span class="o">::</span><span class="n">is_scalar</span><span class="o"><</span><span class="n">T</span><span class="o">>::</span><span class="n">value</span><span class="p">,</span>
<span class="s">"Data put into specialization maps must be scalars"</span><span class="p">);</span>
<span class="k">return</span> <span class="k">sizeof</span><span class="p">(</span><span class="n">T</span><span class="p">);</span>
<span class="p">}</span>
<span class="k">template</span> <span class="o"><</span><span class="k">typename</span> <span class="n">T1</span><span class="p">,</span> <span class="k">typename</span> <span class="n">T2</span><span class="p">,</span> <span class="k">typename</span><span class="p">...</span> <span class="n">More</span><span class="o">></span>
<span class="k">static</span> <span class="k">constexpr</span> <span class="kt">size_t</span> <span class="n">size_helper</span><span class="p">()</span>
<span class="p">{</span>
<span class="k">static_assert</span><span class="p">(</span><span class="n">std</span><span class="o">::</span><span class="n">is_scalar</span><span class="o"><</span><span class="n">T1</span><span class="o">>::</span><span class="n">value</span><span class="p">,</span>
<span class="s">"Data put into specialization maps must be scalars"</span><span class="p">);</span>
<span class="k">return</span> <span class="k">sizeof</span><span class="p">(</span><span class="n">T1</span><span class="p">)</span> <span class="o">+</span> <span class="n">size_helper</span><span class="o"><</span><span class="n">T2</span><span class="p">,</span> <span class="n">More</span><span class="p">...</span><span class="o">></span><span class="p">();</span>
<span class="p">}</span>
</code></pre></div></div>
<p>The static asserts ensure that we don’t try to place anything funky in the specialization map as these are not supported by the GL_KHR_vulkan_glsl extension.
Next, we use these functions to initialize a static variable and our data:</p>
<div class="language-c++ highlighter-rouge"><div class="highlight"><pre class="highlight"><code> <span class="k">static</span> <span class="k">constexpr</span> <span class="kt">size_t</span> <span class="n">size</span> <span class="o">=</span> <span class="n">size_helper</span><span class="o"><</span><span class="n">Ts</span><span class="p">...</span><span class="o">></span><span class="p">();</span>
<span class="n">std</span><span class="o">::</span><span class="n">array</span><span class="o"><</span><span class="kt">uint8_t</span><span class="p">,</span> <span class="n">size</span><span class="o">></span> <span class="n">m_data</span><span class="p">;</span>
</code></pre></div></div>
<p>It should be relatively clear that calling <code class="highlighter-rouge">size_helper</code> will recursively fold over the types and reduce them down to the total byte size.
Because the call is a constant expression, we can use it to determine the size of <code class="highlighter-rouge">m_data</code> at compile time.</p>
<p>The next problem we need to solve is how to retrieve data from this array given an index. Because the array is untyped, we need to compute
the byte location of a given entry. This is obviously the sum of all the byte sizes of the types that precede it. Since all the types are known
at compile time, it’s clear that we should be able to compute the offsets at compile time as well.</p>
<p>Let’s start first with something that will give us the type at the <code class="highlighter-rouge">N</code>th index. We can do this easily with:</p>
<div class="language-c++ highlighter-rouge"><div class="highlight"><pre class="highlight"><code> <span class="k">template</span> <span class="o"><</span><span class="kt">size_t</span> <span class="n">N</span><span class="o">></span>
<span class="k">using</span> <span class="n">type</span> <span class="o">=</span> <span class="k">typename</span> <span class="n">std</span><span class="o">::</span><span class="n">tuple_element</span><span class="o"><</span><span class="n">N</span><span class="p">,</span> <span class="n">std</span><span class="o">::</span><span class="n">tuple</span><span class="o"><</span><span class="n">Ts</span><span class="p">...</span><span class="o">>>::</span><span class="n">type</span><span class="p">;</span>
</code></pre></div></div>
<p>If you haven’t seen this before, it might be a little shocking as the type alias itself is templatized. This is totally legal :). With this, given some index, we can access the type in the parameter pack. Let’s use this to build our <code class="highlighter-rouge">offset_of</code> variable.</p>
<div class="language-c++ highlighter-rouge"><div class="highlight"><pre class="highlight"><code> <span class="k">static</span> <span class="k">constexpr</span> <span class="kt">size_t</span> <span class="n">count</span> <span class="o">=</span> <span class="k">sizeof</span><span class="p">...(</span><span class="n">Ts</span><span class="p">);</span>
<span class="k">template</span> <span class="o"><</span><span class="kt">size_t</span> <span class="n">N</span><span class="o">></span>
<span class="k">static</span> <span class="k">constexpr</span> <span class="kt">size_t</span> <span class="n">offset_of</span> <span class="o">=</span> <span class="n">offset_of_helper</span><span class="o"><</span><span class="mi">0</span><span class="p">,</span> <span class="n">N</span><span class="p">,</span> <span class="mi">0</span><span class="o">></span><span class="p">();</span>
<span class="k">template</span> <span class="o"><</span><span class="kt">size_t</span> <span class="n">Index</span><span class="p">,</span> <span class="kt">size_t</span> <span class="n">Max</span><span class="p">,</span> <span class="kt">size_t</span> <span class="n">Offset</span><span class="o">></span>
<span class="k">static</span> <span class="k">constexpr</span> <span class="kt">size_t</span> <span class="n">offset_of_helper</span><span class="p">()</span>
<span class="p">{</span>
<span class="k">if</span> <span class="k">constexpr</span> <span class="p">(</span><span class="n">Index</span> <span class="o">==</span> <span class="n">Max</span><span class="p">)</span>
<span class="p">{</span>
<span class="k">return</span> <span class="n">Offset</span><span class="p">;</span>
<span class="p">}</span>
<span class="k">else</span>
<span class="p">{</span>
<span class="k">return</span> <span class="n">offset_of_helper</span><span class="o"><</span><span class="n">Index</span> <span class="o">+</span> <span class="mi">1</span><span class="p">,</span> <span class="n">Offset</span> <span class="o">+</span> <span class="k">sizeof</span><span class="p">(</span><span class="n">type</span><span class="o"><</span><span class="n">Index</span><span class="o">></span><span class="p">)</span><span class="o">></span><span class="p">();</span>
<span class="p">}</span>
<span class="p">}</span>
</code></pre></div></div>
<p>Here, <code class="highlighter-rouge">offset_of</code> is another template variable (available since C++14), templatized on the index in the parameter pack.
The <code class="highlighter-rouge">offset_of_helper</code> function is yet another recursive function that increments the index and accumulates type sizes to return the final offset.
The termination condition uses an <code class="highlighter-rouge">if constexpr</code> to stop accumulating into the return value once we reach the desired index.</p>
<p>We can use <code class="highlighter-rouge">offset_of</code> to write a bunch of getters and setters:</p>
<div class="language-c++ highlighter-rouge"><div class="highlight"><pre class="highlight"><code> <span class="k">template</span> <span class="o"><</span><span class="kt">size_t</span> <span class="n">N</span><span class="o">></span>
<span class="k">auto</span> <span class="n">get</span><span class="p">()</span> <span class="k">const</span>
<span class="p">{</span>
<span class="n">type</span><span class="o"><</span><span class="n">N</span><span class="o">></span> <span class="n">value</span><span class="p">;</span>
<span class="n">std</span><span class="o">::</span><span class="n">memcpy</span><span class="p">(</span><span class="o">&</span><span class="n">value</span><span class="p">,</span> <span class="n">m_data</span><span class="p">.</span><span class="n">data</span><span class="p">()</span> <span class="o">+</span> <span class="n">offset_of</span><span class="o"><</span><span class="n">N</span><span class="o">></span><span class="p">,</span> <span class="k">sizeof</span><span class="p">(</span><span class="n">type</span><span class="o"><</span><span class="n">N</span><span class="o">></span><span class="p">));</span>
<span class="k">return</span> <span class="n">value</span><span class="p">;</span>
<span class="p">}</span>
<span class="k">template</span> <span class="o"><</span><span class="kt">size_t</span> <span class="n">N</span><span class="o">></span>
<span class="kt">void</span> <span class="n">set</span><span class="p">(</span><span class="n">type</span><span class="o"><</span><span class="n">N</span><span class="o">></span> <span class="n">value</span><span class="p">)</span>
<span class="p">{</span>
<span class="n">std</span><span class="o">::</span><span class="n">memcpy</span><span class="p">(</span><span class="n">m_data</span><span class="p">.</span><span class="n">data</span><span class="p">()</span> <span class="o">+</span> <span class="n">offset_of</span><span class="o"><</span><span class="n">N</span><span class="o">></span><span class="p">,</span> <span class="o">&</span><span class="n">value</span><span class="p">,</span> <span class="k">sizeof</span><span class="p">(</span><span class="n">type</span><span class="o"><</span><span class="n">N</span><span class="o">></span><span class="p">));</span>
<span class="p">}</span>
</code></pre></div></div>
<p>The purpose of the <code class="highlighter-rouge">std::memcpy</code> here is to prevent undefined behavior when aliasing memory.</p>
<p>Because of our type magic, these accesses are nice and type-safe. The last thing we need to do now is provide a constructor.
This constructor needs to initialize the <code class="highlighter-rouge">vk::SpecializationInfo</code> and <code class="highlighter-rouge">vk::SpecializationMapEntry</code> objects directly.
Remember that because our <code class="highlighter-rouge">offset_of</code> quantity is a template variable, we cannot instantiate it with a runtime value. Thus, our
constructor also needs to rely on compile-time code.</p>
<div class="language-c++ highlighter-rouge"><div class="highlight"><pre class="highlight"><code> <span class="c1">// Data members
</span> <span class="n">vk</span><span class="o">::</span><span class="n">SpecializationInfo</span> <span class="n">m_info</span><span class="p">;</span>
<span class="n">std</span><span class="o">::</span><span class="n">array</span><span class="o"><</span><span class="n">vk</span><span class="o">::</span><span class="n">SpecializationMapEntry</span><span class="p">,</span> <span class="k">sizeof</span><span class="p">...(</span><span class="n">Ts</span><span class="p">)</span><span class="o">></span> <span class="n">m_entries</span><span class="p">;</span>
<span class="n">ShaderSpecialization</span><span class="p">()</span>
<span class="p">{</span>
<span class="n">m_info</span> <span class="o">=</span> <span class="n">vk</span><span class="o">::</span><span class="n">SpecializationInfo</span><span class="p">{</span>
<span class="n">count</span><span class="p">,</span> <span class="c1">// Map entry count
</span> <span class="n">m_entries</span><span class="p">.</span><span class="n">data</span><span class="p">(),</span> <span class="c1">// Map entries
</span> <span class="k">static_cast</span><span class="o"><</span><span class="n">vk</span><span class="o">::</span><span class="n">DeviceSize</span><span class="o">></span><span class="p">(</span><span class="n">size</span><span class="p">),</span> <span class="c1">// Data size
</span> <span class="k">reinterpret_cast</span><span class="o"><</span><span class="kt">void</span><span class="o">*></span><span class="p">(</span><span class="n">m_data</span><span class="p">)</span> <span class="c1">// Data
</span> <span class="p">};</span>
<span class="n">construct_helper</span><span class="o"><</span><span class="mi">0</span><span class="o">></span><span class="p">();</span>
<span class="p">}</span>
<span class="k">template</span> <span class="o"><</span><span class="kt">size_t</span> <span class="n">Index</span><span class="o">></span>
<span class="kt">size_t</span> <span class="n">construct_helper</span><span class="p">()</span>
<span class="p">{</span>
<span class="k">if</span> <span class="k">constexpr</span> <span class="p">(</span><span class="n">Index</span> <span class="o">==</span> <span class="n">count</span><span class="p">)</span>
<span class="p">{</span>
<span class="k">return</span><span class="p">;</span>
<span class="p">}</span>
<span class="k">else</span>
<span class="p">{</span>
<span class="n">m_entries</span><span class="p">[</span><span class="n">Index</span><span class="p">]</span> <span class="o">=</span> <span class="n">vk</span><span class="o">::</span><span class="n">SpecializationMapEntry</span><span class="p">{</span>
<span class="n">Index</span><span class="p">,</span> <span class="c1">// Constant ID
</span> <span class="n">offset_of</span><span class="o"><</span><span class="n">Index</span><span class="o">></span><span class="p">,</span> <span class="c1">// Offset
</span> <span class="k">sizeof</span><span class="p">(</span><span class="n">type</span><span class="o"><</span><span class="n">Index</span><span class="o">></span><span class="p">)</span> <span class="c1">// Size
</span> <span class="p">};</span>
<span class="n">construct_helper</span><span class="o"><</span><span class="n">Index</span> <span class="o">+</span> <span class="mi">1</span><span class="o">></span><span class="p">();</span>
<span class="p">}</span>
<span class="p">}</span>
</code></pre></div></div>
<p>It is <em>very</em> important to fix the copy and move constructors for this class. Because the specialization info takes a pointer to the start of the data block, this must be corrected when the class is moved and copied. These constructors are shown below:</p>
<div class="language-c++ highlighter-rouge"><div class="highlight"><pre class="highlight"><code> <span class="n">ShaderSpecialization</span><span class="p">(</span><span class="n">ShaderSpecialization</span><span class="o">&&</span> <span class="n">other</span><span class="p">)</span>
<span class="o">:</span> <span class="n">m_entries</span><span class="p">{</span><span class="n">std</span><span class="o">::</span><span class="n">move</span><span class="p">(</span><span class="n">other</span><span class="p">.</span><span class="n">m_entries</span><span class="p">)}</span>
<span class="p">,</span> <span class="n">m_data</span><span class="p">{</span><span class="n">std</span><span class="o">::</span><span class="n">move</span><span class="p">(</span><span class="n">other</span><span class="p">.</span><span class="n">m_data</span><span class="p">)}</span>
<span class="p">{</span>
<span class="n">m_info</span> <span class="o">=</span> <span class="n">vk</span><span class="o">::</span><span class="n">SpecializationInfo</span><span class="p">{</span>
<span class="n">count</span><span class="p">,</span> <span class="c1">// Map entry count
</span> <span class="n">m_entries</span><span class="p">.</span><span class="n">data</span><span class="p">(),</span> <span class="c1">// Map entries
</span> <span class="k">static_cast</span><span class="o"><</span><span class="n">vk</span><span class="o">::</span><span class="n">DeviceSize</span><span class="o">></span><span class="p">(</span><span class="n">size</span><span class="p">),</span> <span class="c1">// Data size
</span> <span class="k">reinterpret_cast</span><span class="o"><</span><span class="kt">void</span><span class="o">*></span><span class="p">(</span><span class="n">m_data</span><span class="p">.</span><span class="n">data</span><span class="p">())</span> <span class="c1">// Data
</span> <span class="p">};</span>
<span class="p">}</span>
<span class="n">ShaderSpecialization</span><span class="p">(</span><span class="k">const</span> <span class="n">ShaderSpecialization</span><span class="o">&</span> <span class="n">other</span><span class="p">)</span>
<span class="o">:</span> <span class="n">m_entries</span><span class="p">{</span><span class="n">other</span><span class="p">.</span><span class="n">m_entries</span><span class="p">}</span>
<span class="p">,</span> <span class="n">m_data</span><span class="p">{</span><span class="n">other</span><span class="p">.</span><span class="n">m_data</span><span class="p">}</span>
<span class="p">{</span>
<span class="n">m_info</span> <span class="o">=</span> <span class="n">vk</span><span class="o">::</span><span class="n">SpecializationInfo</span><span class="p">{</span>
<span class="n">count</span><span class="p">,</span> <span class="c1">// Map entry count
</span> <span class="n">m_entries</span><span class="p">.</span><span class="n">data</span><span class="p">(),</span> <span class="c1">// Map entries
</span> <span class="k">static_cast</span><span class="o"><</span><span class="n">vk</span><span class="o">::</span><span class="n">DeviceSize</span><span class="o">></span><span class="p">(</span><span class="n">size</span><span class="p">),</span> <span class="c1">// Data size
</span> <span class="k">reinterpret_cast</span><span class="o"><</span><span class="kt">void</span><span class="o">*></span><span class="p">(</span><span class="n">m_data</span><span class="p">.</span><span class="n">data</span><span class="p">())</span> <span class="c1">// Data
</span> <span class="p">};</span>
<span class="p">}</span>
<span class="n">ShaderSpecialization</span><span class="o">&</span> <span class="k">operator</span><span class="o">=</span><span class="p">(</span><span class="k">const</span> <span class="n">ShaderSpecialization</span><span class="o">&</span> <span class="n">other</span><span class="p">)</span>
<span class="p">{</span>
<span class="k">if</span> <span class="p">(</span><span class="k">this</span> <span class="o">!=</span> <span class="o">&</span><span class="n">other</span><span class="p">)</span>
<span class="p">{</span>
<span class="c1">// We are careful not to copy the info and entry data
</span> <span class="n">m_data</span> <span class="o">=</span> <span class="n">other</span><span class="p">.</span><span class="n">m_data</span><span class="p">;</span>
<span class="p">}</span>
<span class="k">return</span> <span class="o">*</span><span class="k">this</span><span class="p">;</span>
<span class="p">}</span>
<span class="n">ShaderSpecialization</span><span class="o">&</span> <span class="k">operator</span><span class="o">=</span><span class="p">(</span><span class="n">ShaderSpecialization</span><span class="o">&&</span> <span class="n">other</span><span class="p">)</span>
<span class="p">{</span>
<span class="k">if</span> <span class="p">(</span><span class="k">this</span> <span class="o">!=</span> <span class="o">&</span><span class="n">other</span><span class="p">)</span>
<span class="p">{</span>
<span class="c1">// We are careful not to move the info and entry data
</span> <span class="n">m_data</span> <span class="o">=</span> <span class="n">std</span><span class="o">::</span><span class="n">move</span><span class="p">(</span><span class="n">other</span><span class="p">.</span><span class="n">m_data</span><span class="p">);</span>
<span class="p">}</span>
<span class="k">return</span> <span class="o">*</span><span class="k">this</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>
<p>In the case of the assignment operators, we can be sure that one of the other constructors ran so <code class="highlighter-rouge">m_info</code> and <code class="highlighter-rouge">m_entries</code> will contain the correct data and will not need modification.</p>
<p>Just like when we wrote <code class="highlighter-rouge">offset_of</code>, using <code class="highlighter-rouge">if constexpr</code> here lets us make a nice and simple compile-time loop. We use recursion as before on <code class="highlighter-rouge">construct_helper</code> to initialize all the map entries to have the correct offsets and sizes using all the facilities we’ve written before. Finally, we slap on an accessor for <code class="highlighter-rouge">m_info</code> and we’re done!</p>
<p>The code in its entirety, reformatted and with correct scoping is reproduced below:</p>
<div class="language-c++ highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="cp">#pragma once
</span>
<span class="cp">#include <array>
#include <tuple>
#include <vulkan/vulkan.hpp>
</span>
<span class="k">template</span> <span class="o"><</span><span class="k">typename</span><span class="p">...</span> <span class="n">Ts</span><span class="o">></span>
<span class="k">class</span> <span class="nc">ShaderSpecialization</span>
<span class="p">{</span>
<span class="k">private</span><span class="o">:</span>
<span class="k">template</span> <span class="o"><</span><span class="k">typename</span> <span class="n">T</span><span class="o">></span>
<span class="k">static</span> <span class="k">constexpr</span> <span class="kt">size_t</span> <span class="n">size_helper</span><span class="p">()</span>
<span class="p">{</span>
<span class="k">static_assert</span><span class="p">(</span><span class="n">std</span><span class="o">::</span><span class="n">is_scalar</span><span class="o"><</span><span class="n">T</span><span class="o">>::</span><span class="n">value</span><span class="p">,</span>
<span class="s">"Data put into specialization maps must be scalars"</span><span class="p">);</span>
<span class="k">return</span> <span class="k">sizeof</span><span class="p">(</span><span class="n">T</span><span class="p">);</span>
<span class="p">}</span>
<span class="k">template</span> <span class="o"><</span><span class="k">typename</span> <span class="n">T1</span><span class="p">,</span> <span class="k">typename</span> <span class="n">T2</span><span class="p">,</span> <span class="k">typename</span><span class="p">...</span> <span class="n">More</span><span class="o">></span>
<span class="k">static</span> <span class="k">constexpr</span> <span class="kt">size_t</span> <span class="n">size_helper</span><span class="p">()</span>
<span class="p">{</span>
<span class="k">static_assert</span><span class="p">(</span><span class="n">std</span><span class="o">::</span><span class="n">is_scalar</span><span class="o"><</span><span class="n">T1</span><span class="o">>::</span><span class="n">value</span><span class="p">,</span>
<span class="s">"Data put into specialization maps must be scalars"</span><span class="p">);</span>
<span class="k">return</span> <span class="k">sizeof</span><span class="p">(</span><span class="n">T1</span><span class="p">)</span> <span class="o">+</span> <span class="n">size_helper</span><span class="o"><</span><span class="n">T2</span><span class="p">,</span> <span class="n">More</span><span class="p">...</span><span class="o">></span><span class="p">();</span>
<span class="p">}</span>
<span class="k">template</span> <span class="o"><</span><span class="kt">size_t</span> <span class="n">Index</span><span class="p">,</span> <span class="kt">size_t</span> <span class="n">Max</span><span class="p">,</span> <span class="kt">size_t</span> <span class="n">Offset</span><span class="o">></span>
<span class="k">static</span> <span class="k">constexpr</span> <span class="kt">size_t</span> <span class="n">offset_of_helper</span><span class="p">()</span>
<span class="p">{</span>
<span class="k">if</span> <span class="k">constexpr</span> <span class="p">(</span><span class="n">Index</span> <span class="o">==</span> <span class="n">Max</span><span class="p">)</span>
<span class="p">{</span>
<span class="k">return</span> <span class="n">Offset</span><span class="p">;</span>
<span class="p">}</span>
<span class="k">else</span>
<span class="p">{</span>
<span class="k">return</span> <span class="n">offset_of_helper</span><span class="o"><</span><span class="n">Index</span> <span class="o">+</span> <span class="mi">1</span><span class="p">,</span> <span class="n">Offset</span> <span class="o">+</span> <span class="k">sizeof</span><span class="p">(</span><span class="n">type</span><span class="o"><</span><span class="n">Index</span><span class="o">></span><span class="p">)</span><span class="o">></span><span class="p">();</span>
<span class="p">}</span>
<span class="p">}</span>
<span class="k">public</span><span class="o">:</span>
<span class="k">static</span> <span class="k">constexpr</span> <span class="kt">size_t</span> <span class="n">count</span> <span class="o">=</span> <span class="k">sizeof</span><span class="p">...(</span><span class="n">Ts</span><span class="p">);</span>
<span class="k">static</span> <span class="k">constexpr</span> <span class="kt">size_t</span> <span class="n">size</span> <span class="o">=</span> <span class="n">size_helper</span><span class="o"><</span><span class="n">Ts</span><span class="p">...</span><span class="o">></span><span class="p">();</span>
<span class="k">template</span> <span class="o"><</span><span class="kt">size_t</span> <span class="n">N</span><span class="o">></span>
<span class="k">using</span> <span class="n">type</span> <span class="o">=</span> <span class="k">typename</span> <span class="n">std</span><span class="o">::</span><span class="n">tuple_element</span><span class="o"><</span><span class="n">N</span><span class="p">,</span> <span class="n">std</span><span class="o">::</span><span class="n">tuple</span><span class="o"><</span><span class="n">Ts</span><span class="p">...</span><span class="o">>>::</span><span class="n">type</span><span class="p">;</span>
<span class="k">template</span> <span class="o"><</span><span class="kt">size_t</span> <span class="n">N</span><span class="o">></span>
<span class="k">static</span> <span class="k">constexpr</span> <span class="kt">size_t</span> <span class="n">offset_of</span> <span class="o">=</span> <span class="n">offset_of_helper</span><span class="o"><</span><span class="mi">0</span><span class="p">,</span> <span class="n">N</span><span class="p">,</span> <span class="mi">0</span><span class="o">></span><span class="p">();</span>
<span class="n">vk</span><span class="o">::</span><span class="n">SpecializationInfo</span><span class="o">&</span> <span class="n">info</span><span class="p">()</span>
<span class="p">{</span>
<span class="k">return</span> <span class="n">m_info</span><span class="p">;</span>
<span class="p">}</span>
<span class="k">template</span> <span class="o"><</span><span class="kt">size_t</span> <span class="n">N</span><span class="o">></span>
<span class="k">auto</span> <span class="n">get</span><span class="p">()</span> <span class="k">const</span>
<span class="p">{</span>
<span class="n">type</span><span class="o"><</span><span class="n">N</span><span class="o">></span> <span class="n">value</span><span class="p">;</span>
<span class="n">std</span><span class="o">::</span><span class="n">memcpy</span><span class="p">(</span><span class="o">&</span><span class="n">value</span><span class="p">,</span> <span class="n">m_data</span><span class="p">.</span><span class="n">data</span><span class="p">()</span> <span class="o">+</span> <span class="n">offset_of</span><span class="o"><</span><span class="n">N</span><span class="o">></span><span class="p">,</span> <span class="k">sizeof</span><span class="p">(</span><span class="n">type</span><span class="o"><</span><span class="n">N</span><span class="o">></span><span class="p">));</span>
<span class="k">return</span> <span class="n">value</span><span class="p">;</span>
<span class="p">}</span>
<span class="k">template</span> <span class="o"><</span><span class="kt">size_t</span> <span class="n">N</span><span class="o">></span>
<span class="kt">void</span> <span class="n">set</span><span class="p">(</span><span class="n">type</span><span class="o"><</span><span class="n">N</span><span class="o">></span> <span class="n">value</span><span class="p">)</span>
<span class="p">{</span>
<span class="n">std</span><span class="o">::</span><span class="n">memcpy</span><span class="p">(</span><span class="n">m_data</span><span class="p">.</span><span class="n">data</span><span class="p">()</span> <span class="o">+</span> <span class="n">offset_of</span><span class="o"><</span><span class="n">N</span><span class="o">></span><span class="p">,</span> <span class="o">&</span><span class="n">value</span><span class="p">,</span> <span class="k">sizeof</span><span class="p">(</span><span class="n">type</span><span class="o"><</span><span class="n">N</span><span class="o">></span><span class="p">));</span>
<span class="p">}</span>
<span class="n">ShaderSpecialization</span><span class="p">()</span>
<span class="p">{</span>
<span class="n">construct_helper</span><span class="o"><</span><span class="mi">0</span><span class="o">></span><span class="p">();</span>
<span class="n">m_info</span> <span class="o">=</span> <span class="n">vk</span><span class="o">::</span><span class="n">SpecializationInfo</span><span class="p">{</span>
<span class="n">count</span><span class="p">,</span> <span class="c1">// Map entry count
</span> <span class="n">m_entries</span><span class="p">.</span><span class="n">data</span><span class="p">(),</span> <span class="c1">// Map entries
</span> <span class="k">static_cast</span><span class="o"><</span><span class="n">vk</span><span class="o">::</span><span class="n">DeviceSize</span><span class="o">></span><span class="p">(</span><span class="n">size</span><span class="p">),</span> <span class="c1">// Data size
</span> <span class="k">reinterpret_cast</span><span class="o"><</span><span class="kt">void</span><span class="o">*></span><span class="p">(</span><span class="n">m_data</span><span class="p">.</span><span class="n">data</span><span class="p">())</span> <span class="c1">// Data
</span> <span class="p">};</span>
<span class="p">}</span>
<span class="n">ShaderSpecialization</span><span class="p">(</span><span class="n">ShaderSpecialization</span><span class="o">&&</span> <span class="n">other</span><span class="p">)</span>
<span class="o">:</span> <span class="n">m_entries</span><span class="p">{</span><span class="n">std</span><span class="o">::</span><span class="n">move</span><span class="p">(</span><span class="n">other</span><span class="p">.</span><span class="n">m_entries</span><span class="p">)}</span>
<span class="p">,</span> <span class="n">m_data</span><span class="p">{</span><span class="n">std</span><span class="o">::</span><span class="n">move</span><span class="p">(</span><span class="n">other</span><span class="p">.</span><span class="n">m_data</span><span class="p">)}</span>
<span class="p">{</span>
<span class="n">m_info</span> <span class="o">=</span> <span class="n">vk</span><span class="o">::</span><span class="n">SpecializationInfo</span><span class="p">{</span>
<span class="n">count</span><span class="p">,</span> <span class="c1">// Map entry count
</span> <span class="n">m_entries</span><span class="p">.</span><span class="n">data</span><span class="p">(),</span> <span class="c1">// Map entries
</span> <span class="k">static_cast</span><span class="o"><</span><span class="n">vk</span><span class="o">::</span><span class="n">DeviceSize</span><span class="o">></span><span class="p">(</span><span class="n">size</span><span class="p">),</span> <span class="c1">// Data size
</span> <span class="k">reinterpret_cast</span><span class="o"><</span><span class="kt">void</span><span class="o">*></span><span class="p">(</span><span class="n">m_data</span><span class="p">.</span><span class="n">data</span><span class="p">())</span> <span class="c1">// Data
</span> <span class="p">};</span>
<span class="p">}</span>
<span class="n">ShaderSpecialization</span><span class="p">(</span><span class="k">const</span> <span class="n">ShaderSpecialization</span><span class="o">&</span> <span class="n">other</span><span class="p">)</span>
<span class="o">:</span> <span class="n">m_entries</span><span class="p">{</span><span class="n">other</span><span class="p">.</span><span class="n">m_entries</span><span class="p">}</span>
<span class="p">,</span> <span class="n">m_data</span><span class="p">{</span><span class="n">other</span><span class="p">.</span><span class="n">m_data</span><span class="p">}</span>
<span class="p">{</span>
<span class="n">m_info</span> <span class="o">=</span> <span class="n">vk</span><span class="o">::</span><span class="n">SpecializationInfo</span><span class="p">{</span>
<span class="n">count</span><span class="p">,</span> <span class="c1">// Map entry count
</span> <span class="n">m_entries</span><span class="p">.</span><span class="n">data</span><span class="p">(),</span> <span class="c1">// Map entries
</span> <span class="k">static_cast</span><span class="o"><</span><span class="n">vk</span><span class="o">::</span><span class="n">DeviceSize</span><span class="o">></span><span class="p">(</span><span class="n">size</span><span class="p">),</span> <span class="c1">// Data size
</span> <span class="k">reinterpret_cast</span><span class="o"><</span><span class="kt">void</span><span class="o">*></span><span class="p">(</span><span class="n">m_data</span><span class="p">.</span><span class="n">data</span><span class="p">())</span> <span class="c1">// Data
</span> <span class="p">};</span>
<span class="p">}</span>
<span class="n">ShaderSpecialization</span><span class="o">&</span> <span class="k">operator</span><span class="o">=</span><span class="p">(</span><span class="k">const</span> <span class="n">ShaderSpecialization</span><span class="o">&</span> <span class="n">other</span><span class="p">)</span>
<span class="p">{</span>
<span class="k">if</span> <span class="p">(</span><span class="k">this</span> <span class="o">!=</span> <span class="o">&</span><span class="n">other</span><span class="p">)</span>
<span class="p">{</span>
<span class="c1">// We are careful not to copy the info and entry data
</span> <span class="n">m_data</span> <span class="o">=</span> <span class="n">other</span><span class="p">.</span><span class="n">m_data</span><span class="p">;</span>
<span class="p">}</span>
<span class="k">return</span> <span class="o">*</span><span class="k">this</span><span class="p">;</span>
<span class="p">}</span>
<span class="n">ShaderSpecialization</span><span class="o">&</span> <span class="k">operator</span><span class="o">=</span><span class="p">(</span><span class="n">ShaderSpecialization</span><span class="o">&&</span> <span class="n">other</span><span class="p">)</span>
<span class="p">{</span>
<span class="k">if</span> <span class="p">(</span><span class="k">this</span> <span class="o">!=</span> <span class="o">&</span><span class="n">other</span><span class="p">)</span>
<span class="p">{</span>
<span class="c1">// We are careful not to move the info and entry data
</span> <span class="n">m_data</span> <span class="o">=</span> <span class="n">std</span><span class="o">::</span><span class="n">move</span><span class="p">(</span><span class="n">other</span><span class="p">.</span><span class="n">m_data</span><span class="p">);</span>
<span class="p">}</span>
<span class="k">return</span> <span class="o">*</span><span class="k">this</span><span class="p">;</span>
<span class="p">}</span>
<span class="k">private</span><span class="o">:</span>
<span class="k">template</span> <span class="o"><</span><span class="kt">size_t</span> <span class="n">Index</span><span class="o">></span>
<span class="kt">void</span> <span class="n">construct_helper</span><span class="p">()</span>
<span class="p">{</span>
<span class="k">if</span> <span class="k">constexpr</span> <span class="p">(</span><span class="n">Index</span> <span class="o">==</span> <span class="n">count</span><span class="p">)</span>
<span class="p">{</span>
<span class="k">return</span><span class="p">;</span>
<span class="p">}</span>
<span class="k">else</span>
<span class="p">{</span>
<span class="n">m_entries</span><span class="p">[</span><span class="n">Index</span><span class="p">]</span> <span class="o">=</span> <span class="n">vk</span><span class="o">::</span><span class="n">SpecializationMapEntry</span><span class="p">{</span>
<span class="n">Index</span><span class="p">,</span> <span class="c1">// Constant ID
</span> <span class="n">offset_of</span><span class="o"><</span><span class="n">Index</span><span class="o">></span><span class="p">,</span> <span class="c1">// Offset
</span> <span class="k">sizeof</span><span class="p">(</span><span class="n">type</span><span class="o"><</span><span class="n">Index</span><span class="o">></span><span class="p">)</span> <span class="c1">// Size
</span> <span class="p">};</span>
<span class="n">construct_helper</span><span class="o"><</span><span class="n">Index</span> <span class="o">+</span> <span class="mi">1</span><span class="o">></span><span class="p">();</span>
<span class="p">}</span>
<span class="p">}</span>
<span class="n">vk</span><span class="o">::</span><span class="n">SpecializationInfo</span> <span class="n">m_info</span><span class="p">;</span>
<span class="n">std</span><span class="o">::</span><span class="n">array</span><span class="o"><</span><span class="n">vk</span><span class="o">::</span><span class="n">SpecializationMapEntry</span><span class="p">,</span> <span class="n">count</span><span class="o">></span> <span class="n">m_entries</span><span class="p">;</span>
<span class="n">std</span><span class="o">::</span><span class="n">array</span><span class="o"><</span><span class="kt">uint8_t</span><span class="p">,</span> <span class="n">size</span><span class="o">></span> <span class="n">m_data</span><span class="p">;</span>
<span class="p">};</span>
</code></pre></div></div>
<p>To use the class, do something like the following:</p>
<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>// Define a specialization map containing 2 integers and a float
ShaderSpecialization<int, int, float> sp;
// Assign some values
sp.set<0>(4);
sp.set<1>(1);
sp.set<2>(93.2f);
// Access them if you want
std::cout << sp.get<0>() << std::endl;
// Use this to create your graphics or compute pipeline
// The data will be mapped to constant ids 0, 1, and 2 respectively
vk::SpecializationInfo& info = sp.info();
</code></pre></div></div>
Fri, 20 Apr 2018 08:11:00 +0000
https://jeremyong.com/c++/vulkan/graphics/rendering/2018/04/20/putting-cpp-to-work-making-abstractions-for-vulkan-specialization-info/
Understanding Backpropagation My Way<p>Note to the reader: by “my way,” I don’t intend to purport that any of the thoughts in this post are original or unique. I did, however, write it off the top of my head, so any similarity to other resources is incidental. Also, I’ve obviously read other books, papers, and articles on the subject of modern neural network architectures, so I don’t wish to lay claim to any of those ideas either. I also believe that each individual should search for his or her own intuition of any concept, so what worked for me may not work for you (but it may aid you in that search).</p>
<p>Disclaimer #2: This post is currently a draft and I would like to potentially write more about the nitty gritty bits of implementing a neural network (proper usage of storage buffers, GPU-warp shared memory, compute dispatch patterns, etc). It’s sort of a loose jumble of thoughts intended to eventually be something organized, so reader beware!</p>
<h2 id="motivation">Motivation</h2>
<p>So why am I writing this? It’s because I think most of the explanations and derivations of the standard backpropagation algorithm aren’t very good. Now you are free to disagree, but at least for me, when I was teaching myself the algorithm for the purposes of actually finishing a toy GPU-accelerated implementation, I found myself getting lost in a sea of indices and notation. Also, when trying to wrap my head around how to actually <em>implement</em> something like a convolutional neural network, it felt like the explosion in pure bookkeeping put a lot of strain on my intuition. And so I found a new intuition and I thought I would jot it down here.</p>
<p>If you aren’t trying to build a neural network framework or implementation from scratch, you may not need to delve into this topic much (frameworks like TensorFlow, Keras, Theano, etc do a fair bit for you), but I think it’s still helpful to know, if only to get an intuition about algorithm performance characteristics (assuming you at least have a rough conceptual grasp of what’s happening).</p>
<h2 id="my-previous-thought-model">My previous thought model</h2>
<p>Most explanations you’ll find about neural networks and backpropagation are extremely index and vector heavy. This influenced my intuition pretty heavily as well. Obviously, this might seem like a natural approach but I think it is fraught with danger. The weakness is that the underlying concepts get lost (in other words, the act of bookkeeping and the algorithmic structure are conflated). There is also a huge burden in keeping track of specific functions (sigmoids, weighted sums, derivatives, etc). Generally, there is a very exact structure that backpropagation is taught in terms of (multi-layer fully connected networks with linear weighted combinations followed by some non-linear activation function). All of these things can be independently motivated easily enough (maybe I’ll write about this later), but the act of <em>learning</em> itself does not require any of this baggage. Furthermore, when an actual <em>implementation</em> is required, many networks don’t follow this canned network topology and add more “exotic” elements like different activation functions, pooling layers, normalization, and more. Without a good basis for understanding, it’s easy to get lost in the details when trying to implement it (which should be your goal after all).</p>
<h2 id="rethinking-things-with-recursion">Rethinking things… with recursion</h2>
<p>As computer scientists (and mathematicians), we like simple base cases, and simple inductions. I think when I was starting to learn neural network architectures, it was tempting to try a “two-neuron network”, or “two-layer network” to simplify the understanding. However, even this is <em>too complicated</em>.
When coming up with a recursive algorithm, it’s important to reframe things in terms of a base case, and an inductive step.
For the inductive step, <em>a single abstract layer</em> is sufficient.</p>
<script type="math/tex; mode=display">f(p_0, p_1, \ldots, p_n, x_0, x_1, \ldots, x_m)</script>
<p>Conceptually, this simple function represents a layer with <script type="math/tex">n</script> parameters and <script type="math/tex">m</script> inputs (I am deliberately avoiding words like “weights” and “activations”). Let’s build our inductive step now. Remembering that the ultimate goal is to find <script type="math/tex">\mathbf{\nabla}C</script> for some cost function <script type="math/tex">C</script>, suppose someone told the specific layer we’re looking at what its differential contribution to that cost function is. In other words, we are given</p>
<script type="math/tex; mode=display">\frac{\partial C}{\partial f}</script>
<p>as an actual numerical quantity. This is pretty useful! We can pretty much directly compute <script type="math/tex">\frac{\partial C}{\partial p_i}</script> for any of the <script type="math/tex">n</script> parameters like so</p>
<script type="math/tex; mode=display">\frac{\partial C}{\partial p_i} = \frac{\partial C}{\partial f}\frac{\partial f}{\partial p_i}</script>
<p>It’s <em>really important</em> to remember, looking at these equations, that they are evaluated at a particular point in hyperspace (the coordinates of that point corresponding to the parameters of the model and the input vector). Thus, <script type="math/tex">\frac{\partial f}{\partial p_i}</script> should be a simple numerical computation, and we end up with a number expressing the component of the gradient for <script type="math/tex">p_i</script>.</p>
<p>For example, if our layer function was something silly like</p>
<script type="math/tex; mode=display">f(x_0, p_0) = x_0 \cdot p_0</script>
<p>and an input of <script type="math/tex">x_0 = 4</script> was given with <script type="math/tex">p_0</script> initialized to <script type="math/tex">3</script>, and we were given <script type="math/tex">\frac{\partial C}{\partial f} = -2</script>, then from the equation above, we can compute</p>
<script type="math/tex; mode=display">\frac{\partial C}{\partial p_0} = \frac{\partial C}{\partial f} \frac{\partial f}{\partial p_0} = -2 \cdot 4 = -8</script>
<p>Conveniently, we can treat the input <script type="math/tex">x_0</script> of this dumb multiplication layer as a parameter if we want (aka treat it like another variable).</p>
<script type="math/tex; mode=display">\frac{\partial C}{\partial x_0} = \frac{\partial C}{\partial f} \frac{\partial f}{\partial x_0} = -2 \cdot 3 = -6</script>
<p>Intuitively, the relative sizes of the partial derivatives indicate that <script type="math/tex">p_0</script> needed to change more to correct the naughty behavior of <script type="math/tex">x_0</script>. Simultaneously, <script type="math/tex">x_0</script> should also be corrected somehow, but not by as large an amount (since <script type="math/tex">p_0</script> misbehaved less).</p>
<p>An important observation: it’s easy to think this result feels backwards. After all, the <script type="math/tex">x_0</script> initialized to <script type="math/tex">4</script> was more incorrect. Shouldn’t <script type="math/tex">x_0</script> be the one that needs to change more? Remember though, that a partial derivative essentially means we are <em>fixing</em> other variables and computing a function which expresses a change with respect to a single particular variable. If <script type="math/tex">p_0</script> were to remain fixed as opposed to <script type="math/tex">x_0</script>, we can see that <script type="math/tex">x_0</script> can change less than if the situation were reversed (with <script type="math/tex">x_0</script> fixed and <script type="math/tex">p_0</script> variable).</p>
<p>Unlike the parameters though, we can’t directly use the partial derivatives with respect to layer inputs since they aren’t directly incorporated in the model (they are either intermediate values of the model, or input features).</p>
<p>Suppose however, that there was another function <script type="math/tex">g</script> which was also parameterized and produced <script type="math/tex">x_0</script> in this example. It should be clear now that <script type="math/tex">\frac{\partial C}{\partial x_0}</script> which we just computed is analogous to the <script type="math/tex">\frac{\partial C}{\partial f}</script> derived from this step, so we now have our inductive step.</p>
<p>In practice, the choice of <script type="math/tex">f</script> is a combination of weights or filters, functions, and all sorts of gizmos. Regardless of how complicated it gets though, you should be able to implement it successfully if you keep revisiting the primary induction you are trying to achieve.</p>
<p>At this point, we have an <em>embarrassingly general</em> inductive step encapsulating the execution of an entire layer. However, this hypothetical layer only produced one output (although it successfully consumed multiple inputs). We can make it even more general by assuming (the inductive hypothesis) that we are given a set of partial derivatives for each of its outputs. If the outputs of the function are labelled <script type="math/tex">y_0, y_1, \ldots, y_s</script>, this might be written as the tuple</p>
<script type="math/tex; mode=display">\left\{\frac{\partial C}{\partial y_0}, \frac{\partial C}{\partial y_1}, \ldots, \frac{\partial C}{\partial y_s}\right\}</script>
<p>Now, the computation of a parameter like <script type="math/tex">\frac{\partial C}{\partial p_i}</script> must be treated carefully since <script type="math/tex">p_i</script> may have had contributions to many of the <script type="math/tex">y</script> outputs.
But wait, this is where things start to get murky. After all, since we are given a vector quantity, we might expect <script type="math/tex">\frac{\partial C}{\partial p_i}</script> to also be a vector.
If we rush into things we might be tempted to write some weird gradient-projection-vector-quantity-thing like</p>
<script type="math/tex; mode=display">\left\{ \frac{\partial C}{\partial y_0}\frac{\partial y_0}{\partial p_0}, \frac{\partial C}{\partial y_1}\frac{\partial y_1}{\partial p_0}, \ldots, \frac{\partial C}{\partial y_s}\frac{\partial y_s}{\partial p_0} \right\}</script>
<p>to be some sort of weird intermediate computation to try and recover <script type="math/tex">\frac{\partial C}{\partial p_0}</script> as in our first example.
How can we recover a single scalar from this? Should we average these partial derivatives? Take the square magnitude? Something else entirely? What do these quantities even mean?</p>
<p>The answer, of course, is to add them. Remember the <em>meaning</em> of a partial derivative, which means that we are measuring the rate of change of some output with everything fixed except for a single variable. The fact that a particular parameter is a dependency of some of these values just means that if we perturb the parameter, from a differential perspective, all of the <script type="math/tex">y</script> quantities will be perturbed as well, and we can combine them.
Thus, we can write</p>
<script type="math/tex; mode=display">\frac{\partial C}{\partial p_i} = \frac{\partial C}{\partial y_0}\frac{\partial y_0}{\partial p_i} + \cdots + \frac{\partial C}{\partial y_s}\frac{\partial y_s}{\partial p_i}</script>
<p>with a clear conscience. As another point, if changing <script type="math/tex">p_i</script> perturbs <script type="math/tex">y_4</script> more, say, this is already captured in the partial <script type="math/tex">\frac{\partial y_4}{\partial p_i}</script>.
Anecdotally, I always find that the hardest thing to do when developing an inductive algorithm is to trust the induction hypothesis :smile:.</p>
<h2 id="quick-recap">Quick Recap</h2>
<p>Let’s quickly summarize what we have so far:</p>
<ol>
<li>We have some gradient we are trying to compute</li>
<li>We have a function with inputs and parameters that we’re trying to optimize</li>
<li>The inputs might have been outputs from some other function/layer/whatever</li>
<li>We treat all inputs and parameters the same, the only difference being that partial derivatives with respect to variables internal to a particular layer do not participate in the recurrence relation</li>
<li>The recurrence relation is built on the chain rule</li>
<li>We can build all this intuition without indices except for the inputs, outputs, and parameters</li>
</ol>
<p>From an implementation standpoint then, we can expect to do a bit of work changing the <script type="math/tex">\frac{\partial y_i}{\partial p_j}</script> partials for various algorithms (activation functions, weighted sums, etc).</p>
<p>However, there are still a few holes in this understanding.</p>
<ol>
<li>Why is there no “averaging” going on when “distributing” the error across the various parameters?</li>
<li>Why are the previous outputs of any given layer important when performing backprop?</li>
</ol>
<p>The first question is based on the intuition that the chain rule acts sort of as an “assigner of responsibility” to each parameter of the model. This is a pretty good way of thinking about it, but note that there is no division by the sum of weights or anything in the above formulation.
The reason is that, as we compute the gradient numerically, the direction of this gradient already encodes the “relative responsibility” of each individual parameter.
That is to say that if we trust the calculus to do its thing, the relative sizes of the components will match the distribution, with or without some sort of normalization.
Furthermore, when we apply an optimizer to the gradient, the magnitude of this gradient actually gets modulated anyways to control the learning rate (either a fixed multiple, annealing over time, momentum based, etc).</p>
<p>The second question leads to a continuation of this formulation for a more exact form of the dependencies of <script type="math/tex">y</script> as a function of the layer parameters. Suppose we impose the following restriction on some given <script type="math/tex">y_i</script> as follows:</p>
<script type="math/tex; mode=display">y_i = f\left(w_0p_0 + w_1p_1 + \cdots + w_np_n\right)</script>
<p>where <script type="math/tex">f</script> is any nonlinear function. Now, we can write</p>
<script type="math/tex; mode=display">\frac{\partial y_i}{\partial p_j} = w_j\,f'\!\left(w_0p_0 + w_1p_1 + \cdots + w_np_n\right)</script>
<p>The weight <script type="math/tex">w_j</script> is known for each particular step of the algorithm, but <script type="math/tex">f'</script>, evaluated at the weighted sum, needs to be supplied. For most choices of the activation function in a neural network, this derivative depends on the activation value itself, which is why the results of the feed-forward operation need to be saved.</p>
<h2 id="random-intuition">Random Intuition</h2>
<p>Here are a few more random things to consider:</p>
<ol>
<li>When implementing a neural network from scratch, imagine that the activations from the previous layer are just weights and compute the partials with respect to those values like anything else. Of course, instead of modifying them in the update step, you’ll want to propagate them recursively.</li>
<li>With all these partials of the cost function, we could compute the Hessian matrix as well (second order derivatives). This would require a fair bit more memory, but considering how you would implement this is a great test of understanding.</li>
<li>You might ask, how do we know we’re actually finding a minimum of the cost function and not a maximum? The answer is that cost functions are generally very well-behaved (like mean squared error or the KL divergence). The fact that the way we produce the output of the final layer is hyper-parameterized and traces out a weird-looking surface is irrelevant once you plug it into one of these “bowl-shaped” loss functions.</li>
<li>We are doing <em>very crude</em> numerical derivatives here using just the first moments. This seems to work well enough for people, but maybe you can do better!</li>
<li>Between the first and last layer, there’s a ton of floating point math, and when there are that many computations, you should be sensitive to floating point error accumulation. We get around this with regularization, keeping weight sizes and activations all relatively small. Inputs that are unreasonably sized should probably be normalized though. Similarly, exploding error functions should be measured on a log scale.</li>
<li>Functions like <script type="math/tex">\textbf{ReLU}</script> and max-pooling don’t have well-defined continuous derivatives everywhere. Backpropagation still works because all these derivatives are being evaluated numerically, so if you want, imagine that there’s some sort of hidden “smoothing” going on when certain partial derivatives just go to zero because a unit wasn’t activated and whatnot. After all, we’re just computing the first moments anyways…</li>
</ol>
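<p>One practical way to validate everything in the list above (especially given the floating point and non-smoothness caveats) is a finite-difference gradient check: perturb each parameter slightly, re-evaluate the cost, and compare the result against the backpropagated partial. Here is a minimal sketch in Python; the function names are illustrative, not from any particular library:</p>

```python
def finite_difference_grad(cost, params, eps=1e-6):
    """Approximate dC/dp for each parameter p via central differences.

    cost:   a function mapping a list of parameters to a scalar cost
    params: the current parameter values
    """
    grads = []
    for i in range(len(params)):
        bumped_up = params[:i] + [params[i] + eps] + params[i + 1:]
        bumped_dn = params[:i] + [params[i] - eps] + params[i + 1:]
        grads.append((cost(bumped_up) - cost(bumped_dn)) / (2 * eps))
    return grads

# Toy cost C(w) = w0^2 + 3*w1, whose exact gradient is [2*w0, 3].
approx = finite_difference_grad(lambda w: w[0] ** 2 + 3 * w[1], [1.0, 5.0])
# approx is close to [2.0, 3.0]
```

<p>Any backpropagated partial that disagrees with the finite-difference estimate by more than a small tolerance usually points to a bug in the recursion.</p>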
<h2 id="application">Application</h2>
<p>Let’s “derive” backpropagation formulae now for a simple fully connected network.</p>
<p>First, the base case. Given <script type="math/tex">n</script> outputs labeled <script type="math/tex">y_0, y_1, \ldots, y_n</script>, we can compute the partial derivatives with respect to some cost function <script type="math/tex">C</script>:</p>
<script type="math/tex; mode=display">\left\{\frac{\partial C}{\partial y_0}, \ldots, \frac{\partial C}{\partial y_n}\right\}</script>
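<p>To make the base case concrete, suppose the cost is mean squared error over the outputs, <script type="math/tex">C = \frac{1}{n}\sum_j (y_j - t_j)^2</script> for targets <script type="math/tex">t_j</script>. Then each partial is just <script type="math/tex">2(y_j - t_j)/n</script>, which a small sketch like the following computes (the helper name is hypothetical, not from any library):</p>

```python
# Base-case partials dC/dy_j for a mean-squared-error cost:
# C = (1/n) * sum_j (y_j - t_j)^2  =>  dC/dy_j = 2 * (y_j - t_j) / n
def cost_gradient(outputs, targets):
    n = len(outputs)
    return [2.0 * (y - t) / n for y, t in zip(outputs, targets)]

grads = cost_gradient([0.5, 0.9], [1.0, 0.0])  # → [-0.5, 0.9]
```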
<p>The specific value of these derivatives will depend on your choice of cost function and how the layer outputs were fed into it. Assuming that each output <script type="math/tex">y_i</script> is the affine combination <script type="math/tex">\mathbf{w} \cdot \mathbf{x} + b</script>, the previous layer can now compute for any weight <script type="math/tex">w_i</script>, the component of the gradient <script type="math/tex">\mathbf{\nabla}C</script> projected on <script type="math/tex">w_i</script>:</p>
<script type="math/tex; mode=display">\frac{\partial C}{\partial w_i} = \sum_{j=0}^n\frac{\partial C}{\partial y_j}a_i</script>
<p>Here, <script type="math/tex">a_i</script> is the input to the <script type="math/tex">i</script>th node (and we are assuming a linear activation function, aka a dumb one). This should match our intuition since the output of this particular <script type="math/tex">i</script>th node contributed to the error of each of the outputs (it being fully connected). So we should expect a sum over all output errors. The contribution is also proportional to <script type="math/tex">a_i</script> (if the input to this node is greater, the node needs to adjust more to compensate for any error).</p>
<p>The equation for the gradient projected on the bias is similar but without the <script type="math/tex">a_i</script> factor.</p>
<script type="math/tex; mode=display">\frac{\partial C}{\partial b_i} = \sum_{j=0}^n \frac{\partial C}{\partial y_j}</script>
<p>This formula probably looks odd to beginners, but don’t be put off by it. Imagine, for example, that the bias functioned just like a weight whose input activation always happened to be <script type="math/tex">1</script>. If the weight’s activation was also <script type="math/tex">1</script>, the error contribution would be split 50-50 between the bias and the weight. If the weight activation was more or less, we’d get the correct response accordingly.</p>
<p>Note that these equations look different from other texts’ like <a href="http://neuralnetworksanddeeplearning.com/chap2.html">Nielsen’s</a>! That’s because in this wonky formulation I developed (just to build my own intuition), I built my recurrence relation on the partial derivatives of the cost with respect to the layer’s outputs, not the node’s internal value prior to applying the weights. They are completely equivalent, however, and it should be easy to move from this notation to that one and vice versa. Personally, I found it easier to reason about this formulation (framing things in terms of layer outputs) because it adapts well to weird layers with other exotic functions, weight sharing (as in the case of a convolutional neural network), or network recurrence (GRU, LSTM, standard RNN, etc.). Skip connections fit well with this framework too.</p>
<p>The last step of course is to compute the partials with respect to the inputs <script type="math/tex">x_i</script> to any given layer. Since these inputs get multiplied by the weights, the result is identical to the weight equation by symmetry (although they will evaluate differently). And that’s it! Everything necessary to recurse back to the starting layer is done.</p>
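<p>The whole recursion for one fully connected linear layer can be sketched in a few lines. Note that this sketch uses the conventional per-connection indexing <code>w[i][j]</code> (input <script type="math/tex">i</script> to output <script type="math/tex">j</script>), so each individual weight’s partial is a single product, while the input partials sum over all outputs; all names here are illustrative:</p>

```python
def backward_linear(a, w, dC_dy):
    """Backward pass for the layer y_j = sum_i w[i][j] * a[i] + b[j].

    a:      inputs to this layer (activations from the previous layer)
    w:      weights, where w[i][j] connects input i to output j
    dC_dy:  partials of the cost with respect to this layer's outputs
    Returns (dC_dw, dC_db, dC_da); dC_da continues the recursion backwards.
    """
    n_in, n_out = len(a), len(dC_dy)
    # Each weight feeds exactly one output, so its partial is one product.
    dC_dw = [[dC_dy[j] * a[i] for j in range(n_out)] for i in range(n_in)]
    # The bias acts like a weight whose input activation is always 1.
    dC_db = list(dC_dy)
    # Each input feeds every output, so its partial sums over all of them.
    dC_da = [sum(dC_dy[j] * w[i][j] for j in range(n_out)) for i in range(n_in)]
    return dC_dw, dC_db, dC_da

# Two inputs, two outputs, identity weights:
dw, db, da = backward_linear([1.0, 2.0], [[1.0, 0.0], [0.0, 1.0]], [0.5, -1.0])
```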
<p>As a side note, remember that by definition, <script type="math/tex">\frac{\partial C}{\partial v} = 0</script> (here <script type="math/tex">v</script> is just one of the components of the input vector to the first layer), since the actual training data is fixed and shouldn’t be adjusted in any way.</p>
<h2 id="conclusion">Conclusion</h2>
<p>The primary benefit of intuition regarding neural networks and the math powering them is that it enables you to experiment with wonky ideas and structures without being tied down to a specific formulation.
Using weights nonlinearly, adding weird connections, and more should all feel relatively straightforward to reason about.
Furthermore, by decoupling your understanding from exact index-heavy notation, you can more easily find a formulation that performs better for the architecture you are targeting.
Finally, gaining a deeper understanding of a topic, even in a hand-wavy way, is liberating: the intuition is now your own, and you can do whatever you like with it!</p>
Mon, 02 Apr 2018 08:11:00 +0000
https://jeremyong.com/deeplearning/2018/04/02/understanding-backpropagation-my-way/