Jekyll2024-01-14T16:04:39+00:00https://www.atomicincrement.com/feed.xmlAtomic Increment - delivering High Efficiency ComputingAtomic Increment are a group of highly experienced coders who are passionate about code performance, Rust and saving you time and money when doing hard sums.
We teach Rust in association with Ferrous Systems, teach High Efficiency Computing, and develop the "extendr" R extension system with help from the R ISC Committee.
Andy Thomasonandy@atomicincrement.comGame development in Rust2024-01-14T14:12:06+00:002024-01-14T14:12:06+00:00https://www.atomicincrement.com/rust/2024/01/14/rust-game-development<p>C++ has long been the language of choice in professional game
engine development.</p>
<p>This was not always the case: C and assembler were dominant for
some time before C++, with games like Quake being among the last
vestiges of that era.</p>
<p>Of course, some games are written in interpreted languages, originally
BASIC and more recently JavaScript, and some game engines such as Unity
use languages like C# for scripting. C# is a middle ground between
scripting languages and C-like languages, using an “everything’s a pointer”
model and garbage collection. It is easier to learn than C++ and less
likely to run into undefined behaviour leading to crashes.</p>
<p>It should be noted that the runtimes of Unity and other game engines
that use C# are actually written in C++.</p>
<p>So why C++ and more recently, why Rust?</p>
<ul>
<li>Part 1 - The case for Rust in games.</li>
<li>Part 2 - An example: breakout in Bevy</li>
<li>Links</li>
</ul>
<p><em>Andy Thomason has worked in the game industry since the 1970s, developing
Namco console games and AI chess players in Z80 assembler as a teenager.</em></p>
<p><em>He has worked for Sony twice (Psygnosis and SN Systems) doing research
in game technology such as the PS3 and Vita compilers.</em></p>
<h1 id="part-1---the-case-for-rust-in-games">Part 1 - The case for Rust in games.</h1>
<h2 id="the-cc-programming-model---stack-and-heap">The C/C++ programming model - Stack and Heap</h2>
<p>C++ is based on C, and in fact the original C++ compiler, CFront, transcoded
C++ into C. C uses a “Stack and Heap” model to handle dynamically created objects.</p>
<p>If we write a C function with a variable</p>
<pre><code class="language-C">void my_function() {
int x = 1;
printf("%d", x);
}
</code></pre>
<p>then the variable <code class="language-plaintext highlighter-rouge">x</code> is stored in a <em>stack</em> frame which is reserved
when we call <code class="language-plaintext highlighter-rouge">my_function</code> and removed when we return from the function.</p>
<p>We can also allocate objects that live longer using <code class="language-plaintext highlighter-rouge">malloc</code> to get a
pointer to the <em>heap</em> so that when we return from a function, we still
have our object.</p>
<p>This all looks like this:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>+---------------+ top
+ Stack +
+---------------+ SP
+ +
+ unallocated +
+ +
+---------------+ BRK
+ Heap +
+---------------+ bottom
</code></pre></div></div>
<p>When you call a function, the stack pointer (SP) moves down,
creating more space; when you return, SP moves up, freeing the space.</p>
<p>The heap, by comparison, grows up from the bottom, but never shrinks.
Instead we divide the heap into <em>chunks</em> which are allocated by <code class="language-plaintext highlighter-rouge">malloc</code>
and freed up by <code class="language-plaintext highlighter-rouge">free</code>.</p>
<p>Thus the <em>lifetime</em> of objects on the stack is limited to the call
being made, but the <em>lifetime</em> of an object allocated on the heap
can be much longer.</p>
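<p>For comparison, and jumping ahead a little, Rust draws the same
distinction: values live on the stack by default, and
<code class="language-plaintext highlighter-rouge">Box::new</code> allocates on the heap. A minimal sketch:</p>
<pre><code class="language-Rust">fn make_on_heap() -> Box<i32> {
    let x = Box::new(1); // allocated on the heap
    x // ownership moves to the caller, so the allocation outlives the call
}
</code></pre>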
<h2 id="advantages-or-c-and-c">Advantages or C and C++</h2>
<p>Engines written in C++ are much faster than those written in Java, Go
and C# because you get much closer to the machine. You can go a lot faster
still if you write in assembler, but those skills are in decline.</p>
<p>Writing in C++ does not make the code go faster on its own, but it gives
you a larger toolbox to work with and lets you talk to the hardware
more directly. On game consoles, for example, the C++ code is used
to write to hardware registers directly, bypassing bulky APIs.</p>
<p>C++ also lets you use multithreaded code, and most modern C++ game
engines let you create huge numbers of tasks and events which
will be handled during the frame or over the course of many frames.</p>
<p>In garbage collected languages like C#, objects are <em>only</em> allocated
on the heap which is much costlier than allocating on the stack.</p>
<h2 id="problems-with-c-and-c">Problems with C and C++</h2>
<p>If an untrained driver sits in a Formula One car and tries to drive
it, they will likely crash immediately, and it is the same for
C and C++ - the program will crash.</p>
<p>Consider the following code:</p>
<pre><code class="language-C">int *fred() {
int x = 0;
return &x;
}
</code></pre>
<p>This function returns the address of the variable <code class="language-plaintext highlighter-rouge">x</code>, but after
returning from this function, x is no longer there, and reading or writing
through the pointer will likely cause a crash. This is a <em>dangling pointer</em>.</p>
<p>Finding this kind of fault is very hard in C++ and this puts a lot
of people off writing games in C++.</p>
<p>Another problem with multi-threaded code is the <em>race condition</em>.</p>
<p>Consider this code:</p>
<pre><code class="language-C"> // Thread 1
x = 1;
y = 2;
// Thread 2
x = 3;
y = 4;
// Thread 3
X = x;
Y = y;
</code></pre>
<p>What is the value of (X, Y)? It could be (1, 2), (3, 4),
(1, 4), (undefined, 4) and so on. Many, many faults in game
engines exist because of this.</p>
<h2 id="rust-is-the-successor-to-c">Rust is the successor to C++</h2>
<p>Rust was designed to get the benefits of C++ without the
pain of having to worry about race conditions and dangling
pointers.</p>
<p>It is a complete redesign of C++ with only the modern bits
and a safety-orientated checking system. It encourages
a common coding style through warnings about
variable naming and has extensive security checks to
avoid some of the nasty network attacks that can disable
games and steal user information.</p>
<p>Rust has two modes, <em>safe</em> and <em>unsafe</em>. Most code
is written in <em>safe</em> mode, which gives you guarantees that
avoid the problems of C++, but some code, such as
interactions with hardware, must be <em>unsafe</em>.</p>
<p>For example, the dangling pointer example we gave
is not possible in Rust:</p>
<pre><code class="language-Rust">fn my_function() -> &i32 {
let x = 1;
&x
}
</code></pre>
<p>This will generate a compile-time error.</p>
<p>Likewise with the race condition example, passing writable
references to variables to other threads is not allowed
in safe Rust, so you don’t have to worry about it.</p>
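<p>To make this concrete, here is a minimal sketch of the race condition
example in safe Rust - the compiler rejects the second closure because
it would take a second mutable borrow:</p>
<pre><code class="language-Rust">use std::thread;

fn main() {
    let mut x = 0;
    thread::scope(|s| {
        s.spawn(|| x = 1); // first mutable borrow of `x`
        s.spawn(|| x = 3); // error: cannot borrow `x` as mutable more than once
    });
}
</code></pre>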
<h2 id="so-why-not-just-stick-with-c">So why not just stick with C#?</h2>
<p>C# also gives these guarantees and indeed if you are writing
small games with low performance requirements, then C#
may be exactly what you are looking for.</p>
<p>But for large games, such as the Disney Engine, which is over 5M
lines of code, using C# is just not going to be possible,
and if you want to create effects that Unity is not pre-wired
to support, then good luck.</p>
<p>Rust makes it much easier to write large, multi-threaded games,
do networking, build servers to host thousands of players
and many more things.</p>
<p>Rust has a <em>huge</em> collection of libraries which you can use
by adding a single line to the manifest file, much as you would with NPM,
the JavaScript package manager. The <em>Cargo</em> build tool will download
the source code of any of the hundreds of thousands of libraries
available and compile it on the spot.</p>
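<p>For example, pulling in the Bevy engine is a single line in
<code class="language-plaintext highlighter-rouge">Cargo.toml</code> (the version number here is illustrative):</p>
<pre><code class="language-plaintext">[dependencies]
bevy = "0.13"
</code></pre>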
<p>In many ways, it is this ease of using libraries that makes
Rust the number one choice in new technologies such as blockchain
and fintech.</p>
<h2 id="who-uses-rust-in-the-game-industry">Who uses Rust in the game industry?</h2>
<p>Some studios, like Embark in Sweden, have adopted Rust
and are pushing the ecosystem forward. We are on the verge
of seeing a new generation of Rust game engines
become stable enough for large scale development.</p>
<p>For example, there are libraries for:</p>
<ul>
<li>Windowing</li>
<li>Audio</li>
<li>Shaders</li>
<li>3D rendering</li>
<li>Text rendering</li>
<li>AI</li>
<li>ECS (Entity-component-system model)</li>
<li>VR</li>
<li>3D format loaders</li>
<li>Maths</li>
<li>Mesh tools.</li>
</ul>
<p>etc.</p>
<p>As well as a stack of fully formed game engines:</p>
<ul>
<li>Bevy</li>
<li>Fyrox</li>
<li>Amethyst</li>
<li>ggez</li>
<li>macroquad</li>
<li>Piston</li>
</ul>
<p>I’ve been using the Bevy engine to do shader experiments with molecular
modelling, for example. Bevy uses WebGPU to make games that
can run on desktops, phones, browsers and many more platforms.</p>
<p><strong>Bevy</strong> has VR support, networking and many more things, but it
is still an “expert” level tool: it doesn’t have the easy
GUI that Unity has, but it suits my way of working.</p>
<p><strong>Fyrox</strong> has a GUI-driven scene generator and is orientated at
scripting, like Unity.</p>
<p><strong>Amethyst</strong> is also quite programmer-orientated, as is <strong>Piston</strong>.</p>
<h2 id="what-needs-to-happen">What needs to happen</h2>
<p>Most Rust game engines are very much orientated towards
programmers. For example, building a large open-world survival
strategy game like Factorio in Bevy would be quite easy,
but would require some programming skill.</p>
<p>To become more mainstream, these game engines need to
develop GUI interfaces to allow non-programmers to build
games. Some have started in that direction, but we will
see technically orientated games long before we see
artist-lead FPS games, for example.</p>
<p>Still, if it were a choice between starting a new game
engine in C++ or in Rust, the smart money would go
on Rust, as it is hugely popular and makes it much easier
to build large projects without breaking the bank.</p>
<p>If you were to start learning a low level language
now, Rust would be the choice, especially as most
Rust jobs are work-from-home, with Europe developing
as a centre for Rust digital nomads.</p>
<p>As a lifestyle, the open source world of Rust is much
preferable to being stuck in a room of hundreds of C++
programmers on an industrial estate in the middle of nowhere,
not to mention any game studios in particular!</p>
<h1 id="part-2---an-example-breakout-in-bevy">Part 2 - An example: breakout in Bevy</h1>
<p>To illustrate what it is like to write a game in Rust, let’s start
with one of the examples from the Bevy game engine.</p>
<p>Like in C and C++, the entry point to a Rust program is <code class="language-plaintext highlighter-rouge">main()</code></p>
<pre><code class="language-Rust">fn main() {
App::new()
.add_plugins(DefaultPlugins)
.run();
}
</code></pre>
<p>If this is all we did, then we would get a blank window.</p>
<p>So what we need to do is add data and code to make breakout run.</p>
<p>This adds two data <em>resources</em>: a scoreboard, which we will define,
and a clear colour, which is a system-defined resource.
Resources are not <em>assets</em> and do not draw themselves; we need
<em>entities</em> for the rendering plugin to draw anything, for example.
Resources are just bits of data which we will use.</p>
<pre><code class="language-Rust"> .insert_resource(Scoreboard { score: 0 })
.insert_resource(ClearColor(BACKGROUND_COLOR))
</code></pre>
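<p>The scoreboard itself is just a plain struct marked as a resource
with a derive macro, as in the Bevy breakout example:</p>
<pre><code class="language-Rust">#[derive(Resource)]
struct Scoreboard {
    score: usize,
}
</code></pre>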
<p>Next we add an event, which we will use to signal collisions
between the ball and other entities.</p>
<pre><code class="language-Rust"> .add_event::<CollisionEvent>()
</code></pre>
<p>And to make the game work, we have some systems, which are
functions that get called to update things.</p>
<pre><code class="language-Rust"> .add_systems(Startup, setup)
.add_systems(
FixedUpdate,
(
apply_velocity,
move_paddle,
check_for_collisions,
play_collision_sound,
).chain(),
)
.add_systems(Update, (update_scoreboard, bevy::window::close_on_esc))
</code></pre>
<p>The <code class="language-plaintext highlighter-rouge">.chain()</code> makes these functions run in sequence. Bevy
is a multi-threaded game engine and may run systems in any order
on different threads if need be.</p>
<p>The first of these systems is <code class="language-plaintext highlighter-rouge">setup</code>, which is called once at the start
of the game.</p>
<pre><code class="language-Rust">// Add the game's entities to our world
fn setup(
mut commands: Commands,
mut meshes: ResMut<Assets<Mesh>>,
mut materials: ResMut<Assets<ColorMaterial>>,
asset_server: Res<AssetServer>,
) {
// ...
}
</code></pre>
<p>The parameters to setup can come in any order and
use Rust’s flexible type system to make Assets and
other components accessible to the function.</p>
<p>The <code class="language-plaintext highlighter-rouge">commands</code> parameter is an interface that lets you
change the state of the game. For example:</p>
<pre><code class="language-Rust"> commands.spawn(Camera2dBundle::default());
</code></pre>
<p>sets up a 2D camera for the game world.</p>
<pre><code class="language-Rust"> // Sound
let ball_collision_sound = asset_server.load("sounds/breakout_collision.ogg");
commands.insert_resource(CollisionSound(ball_collision_sound));
</code></pre>
<p>adds a sound resource to the game.</p>
<pre><code class="language-Rust"> commands.spawn((
SpriteBundle {
// ...
},
Paddle,
Collider,
));
</code></pre>
<p>adds a <em>bundle</em> of components to an entity (the paddle). A bundle is
an easy way of deploying a number of components at a time.</p>
<p>The component system is similar to Unity. Each object in the game
world has a number of components such as <code class="language-plaintext highlighter-rouge">Transform</code> and <code class="language-plaintext highlighter-rouge">Sprite</code> as
well as some user-defined components.</p>
<p>The <code class="language-plaintext highlighter-rouge">Transform</code> component, for example, specifies the location
of a sprite and the <code class="language-plaintext highlighter-rouge">Sprite</code> component describes the colour, image
and other properties.</p>
<p>Here <code class="language-plaintext highlighter-rouge">Paddle</code> and <code class="language-plaintext highlighter-rouge">Collider</code> are user defined components.</p>
<p>Likewise, we spawn entities such as the ball, the bricks, the walls
and so on.</p>
<h3 id="making-custom-components">Making custom components</h3>
<p>Making custom components is easy in Bevy. We use a <code class="language-plaintext highlighter-rouge">derive</code> macro
to generate extra code needed for the component. In these two cases
there is no extra data needed, so the structs don’t need curly braces:</p>
<pre><code class="language-Rust">#[derive(Component)]
struct Paddle;
#[derive(Component)]
struct Ball;
</code></pre>
<p>The types, however, are used to make a distinction between <code class="language-plaintext highlighter-rouge">Paddle</code>
and <code class="language-plaintext highlighter-rouge">Ball</code> and will be used to select the components when we run the systems.</p>
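<p>Components can also carry data. The ball’s velocity, for instance,
can be a tuple struct wrapping a vector (a sketch; the breakout example
also derives <code class="language-plaintext highlighter-rouge">Deref</code> and
<code class="language-plaintext highlighter-rouge">DerefMut</code> for convenience):</p>
<pre><code class="language-Rust">#[derive(Component)]
struct Velocity(Vec2);
</code></pre>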
<h3 id="moving-the-paddle">Moving the paddle</h3>
<p>To move the paddle, we need a system which takes user input and
all the entities which have <code class="language-plaintext highlighter-rouge">Transform</code> and <code class="language-plaintext highlighter-rouge">Paddle</code> components like this:</p>
<pre><code class="language-Rust">fn move_paddle(
keyboard_input: Res<ButtonInput<KeyCode>>,
mut query: Query<&mut Transform, With<Paddle>>,
time: Res<Time>,
) {
let mut paddle_transform = query.single_mut();
}
</code></pre>
<p>There is only one paddle, so <code class="language-plaintext highlighter-rouge">query.single_mut()</code> will do,
and it also enables us to write to the transform (move the paddle).</p>
<p>By default in Rust, references like <code class="language-plaintext highlighter-rouge">&Transform</code> are read-only
and we need to use <code class="language-plaintext highlighter-rouge">&mut Transform</code> and <code class="language-plaintext highlighter-rouge">single_mut</code> to allow
us to change the transform.</p>
<p>The rest of the function reads the keyboard and moves the paddle.</p>
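<p>In sketch form it looks something like this, where <code class="language-plaintext highlighter-rouge">PADDLE_SPEED</code>
is a constant defined elsewhere in the example (key names vary slightly
between Bevy versions):</p>
<pre><code class="language-Rust">    let mut direction = 0.0;
    if keyboard_input.pressed(KeyCode::ArrowLeft) {
        direction -= 1.0;
    }
    if keyboard_input.pressed(KeyCode::ArrowRight) {
        direction += 1.0;
    }
    // Scale by the frame time so the paddle speed is framerate-independent.
    paddle_transform.translation.x += direction * PADDLE_SPEED * time.delta_seconds();
</code></pre>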
<h3 id="checking-for-collisions">Checking for collisions</h3>
<pre><code class="language-Rust">fn check_for_collisions(
mut commands: Commands,
mut scoreboard: ResMut<Scoreboard>,
mut ball_query: Query<(&mut Velocity, &Transform), With<Ball>>,
collider_query: Query<(Entity, &Transform, Option<&Brick>), With<Collider>>,
mut collision_events: EventWriter<CollisionEvent>,
) {
// ...
}
</code></pre>
<p>This system has:</p>
<ul>
<li>An interface to change the engine state.</li>
<li>A writeable <code class="language-plaintext highlighter-rouge">Scoreboard</code> resource.</li>
<li>A query to find the Velocity and Transform of the ball.</li>
<li>A query to find anything with a Transform and a collider, which may be a brick.</li>
<li>An <code class="language-plaintext highlighter-rouge">EventWriter</code> to signal collisions to other systems.</li>
</ul>
<p>We spin round checking the ball position against the bricks and walls, updating the scoreboard
and sending events if anything collides.</p>
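<p>The core of that loop looks roughly like this (a sketch, with the
actual overlap test elided):</p>
<pre><code class="language-Rust">    // For each collider the ball overlaps...
    if maybe_brick.is_some() {
        scoreboard.score += 1;             // bricks score a point...
        commands.entity(entity).despawn(); // ...and disappear
    }
    collision_events.send_default();       // tell other systems about the hit
    // ...then reflect the ball's velocity depending on the side of the collision.
</code></pre>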
<h3 id="sounds">Sounds</h3>
<pre><code class="language-Rust">fn play_collision_sound(
mut commands: Commands,
mut collision_events: EventReader<CollisionEvent>,
sound: Res<CollisionSound>,
) {
// ...
}
</code></pre>
<p>Here we receive collision events and convert them into
sounds.</p>
<p>We create the sound by <em>spawning</em> the collision <code class="language-plaintext highlighter-rouge">sound</code>
in a bundle - yes, sounds are entities too!</p>
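<p>A sketch of that spawn, using Bevy’s audio bundle - the
<code class="language-plaintext highlighter-rouge">DESPAWN</code> setting removes the
entity again when playback finishes:</p>
<pre><code class="language-Rust">    commands.spawn(AudioBundle {
        source: sound.0.clone(),
        settings: PlaybackSettings::DESPAWN,
    });
</code></pre>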
<h2 id="multithreading">Multithreading</h2>
<p>Because of the danger of race conditions, Bevy is careful
not to call two systems at the same time with a mutable
reference to the same component.</p>
<p>Bevy’s use of <code class="language-plaintext highlighter-rouge">#[derive]</code> and the Rust type system
makes for a more C#-like development environment.</p>
<p>With very large games, with hundreds of thousands of entities,
this will make a big difference.</p>
<h1 id="thats-all-folks">That’s all folks</h1>
<p>We talked a little about how Rust, a low level language,
makes it easier to write safe multi-threaded code, stealing
some thunder from C# and giving a significant performance
boost.</p>
<p>We showed you how easy it is to build games using the
Bevy ECS (Entity-component-system) model.</p>
<p>So happy Rusting, and if you get the opportunity,
try writing a game in Bevy. It may take a bit of
getting used to, but you are a champion!</p>
<h1 id="links">Links</h1>
<p><a href="https://github.com/bevyengine/bevy/blob/main/examples/games/breakout.rs">Breakout</a></p>
<p><a href="https://arewegameyet.rs/#ecosystem">Are we game yet Ecosystem</a></p>
<p><a href="https://bevyengine.org/">Bevy Game Engine</a></p>
<p><a href="https://fyrox-book.github.io/beginning/scripting.html">Fyrox Game Engine</a></p>
<p><a href="https://amethyst.rs/">Amethyst Game Engine</a></p>Andy Thomasonandy@atomicincrement.comC++ has long been the language of choice in professional game engine development.Breaking the AI sound barrier with Doctor Syn.2021-11-18T00:00:00+00:002021-11-18T00:00:00+00:00https://www.atomicincrement.com/maths/2021/11/18/polynomial-approximation<h2 id="executive-summary">Executive Summary</h2>
<p>The problem facing the AI industry today is that many of the functions at
the heart of machine learning processes were written over forty years ago
when computer hardware and compiler technology were very different.</p>
<p>Our library, Doctor Syn, addresses this problem by using the three technologies
of SIMD, multithreading and autovectorisation. We achieve 30x or more speedups
over traditional libraries in C, C++, Rust and Fortran, without making the code
platform or language specific.</p>
<p>Doctor Syn’s primary focus at present is to generate accurate polynomial
approximations to key functions important to the execution of many programs. You are probably familiar with many of the functions we are targeting:</p>
<table>
<thead>
<tr>
<th>Rust Function</th>
<th>calculates</th>
</tr>
</thead>
<tbody>
<tr>
<td>f32/f64::sin</td>
<td>\(\sin{x}\)</td>
</tr>
<tr>
<td>f32/f64::cos</td>
<td>\(\cos{x}\)</td>
</tr>
<tr>
<td>f32/f64::atan2</td>
<td>\(\arctan{y/x}\)</td>
</tr>
<tr>
<td>f32/f64::exp</td>
<td>\(e^x\)</td>
</tr>
<tr>
<td>f32/f64::ln</td>
<td>\(\log{x}\)</td>
</tr>
</tbody>
</table>
<p>While improving these functions has a lot of value, we are currently focusing most of our effort on statistical functions such as:</p>
<table>
<thead>
<tr>
<th>R Function</th>
<th>distribution</th>
<th>role</th>
<th>calculates</th>
</tr>
</thead>
<tbody>
<tr>
<td>dnorm</td>
<td>normal</td>
<td>pdf</td>
<td>\(\frac{1}{\sigma\sqrt{2\pi}} e^{-\frac{1}{2}\left(\frac{x - \mu}{\sigma}\right)^2}\)</td>
</tr>
<tr>
<td>pnorm</td>
<td>normal</td>
<td>cdf</td>
<td>\(\frac{1}{2}\left[1 + \operatorname{erf}\left( \frac{x-\mu}{\sigma\sqrt{2}}\right)\right]\)</td>
</tr>
<tr>
<td>qnorm</td>
<td>normal</td>
<td>quantile</td>
<td>\(\mu+\sigma\sqrt{2} \operatorname{erf}^{-1}(2p-1)\)</td>
</tr>
<tr>
<td>rnorm</td>
<td>normal</td>
<td>random</td>
<td>\(\operatorname{qnorm}(\operatorname{runif}(i))\)</td>
</tr>
</tbody>
</table>
<p>These functions are used extensively in finance and bioinformatics to perform statistical
inference, stochastic modelling, AI and machine learning. For example, rnorm is a key part of many
MCMC algorithms and variational techniques, as well as of Monte Carlo simulations, such as those used to solve stochastic differential equations.</p>
<p>Using this library, combined with parallel iterators, we generate more efficient versions of</p>
<ul>
<li>Numpy</li>
<li>R</li>
<li>GNU Octave</li>
</ul>
<p>and many others.</p>
<p>We have also targeted new architectures like Arm SVE which do not fit the X86 model.
We are working with the Isambard A64FX cluster to attempt to improve existing
algorithms.</p>
<p>This approach to function generation should fit perfectly with the A64FX’s SVE
architecture, as SVE has a variable length SIMD architecture which will run
the same binary on machines with many different word lengths. SVE requires
autovectorisation to work effectively.</p>
<h2 id="the-ai-sound-barrier">The AI sound barrier</h2>
<p>While great effort has been expended on key function optimization, current techniques are unable to efficiently utilize modern compiler technology such as auto-vectorisation and thread-based parallelism.
Research in function approximation has focused on squeezing out the last half bit of precision
at the expense of making functions ever more complex and very much slower.</p>
<p>In practice, machine learning algorithms can tolerate a large amount of error, and giving users
the ability to choose the level of accuracy that a function delivers, as well as the domain
of inputs, can make those algorithms orders of magnitude faster.</p>
<p>For example, using 32 bit floating point instead of 64 bit often has a 4:1 performance advantage
in compute and a 2:1 advantage in memory performance. With modern computers, the memory bandwidth
is very often the limiting factor and finding smarter ways to represent data becomes the key
to fast algorithms. If we know that a vector of numbers does not contain NaN values, then we can
skip NaN checks on every calculation.</p>
<p>But primarily, we need to make our functions simple enough to be vectorisable - once we have achieved
this, we get remarkable performance improvements.</p>
<h2 id="the-challenge-of-vectorisation">The challenge of vectorisation.</h2>
<p>Existing functions will not vectorise primarily because:</p>
<ul>
<li>They are in shared or static libraries.</li>
<li>They contain branches and look-up tables.</li>
</ul>
<p>In the 1970s, when many of these functions were written, this was the state of the art. Today, however, it is a problem: with the advances of modern vectorising processors, these implementations are substantially less efficient than they could be.</p>
<p>There is no short and simple fix either: these problems are fundamental ones that preclude any efficient vectorisation. The only viable approach is a substantial and novel change such as the one that Doctor Syn proposes.</p>
<h2 id="autovectorisation">Autovectorisation</h2>
<p>In the past we have implemented fast functions in assembler, or even written code that emits functions directly as machine code. These days, however, we often try to take a more “civilized” approach where possible. Assembler functions are fast, but difficult to read, difficult to improve, and difficult to generalize. Despite these problems, many such functions end up hanging around like a bad smell, and more are being produced by chip vendors for special architectures. Even excellent libraries like Sleef
are done this way, with machine-specific intrinsics.</p>
<p>To solve the problems with these architecture-specific assembly implementations, we try to write portable code; we wish our code to run on both x86 architectures with SIMD and the new ARM SVE with variable sized registers. The way we achieve this is by writing code in such a way that it will be automatically vectorised by modern compilers.</p>
<p>So instead of something like:</p>
<div class="language-rust highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">use</span> <span class="nn">std</span><span class="p">::</span><span class="nn">arch</span><span class="p">::</span><span class="nn">x86_64</span><span class="p">::</span><span class="o">*</span><span class="p">;</span>
<span class="k">pub</span> <span class="k">fn</span> <span class="nf">inc_doubles_simd</span><span class="p">(</span><span class="n">x</span><span class="p">:</span> <span class="o">&</span><span class="k">mut</span> <span class="p">[</span><span class="nb">f64</span><span class="p">])</span> <span class="p">{</span>
<span class="k">unsafe</span> <span class="p">{</span>
<span class="k">let</span> <span class="n">one</span> <span class="o">=</span> <span class="mi">_</span><span class="nf">mm256_broadcast_sd</span><span class="p">(</span><span class="o">&</span><span class="mf">1.0</span><span class="p">);</span>
<span class="k">for</span> <span class="n">x</span> <span class="n">in</span> <span class="n">x</span><span class="nf">.chunks_exact_mut</span><span class="p">(</span><span class="mi">4</span><span class="p">)</span> <span class="p">{</span>
<span class="k">let</span> <span class="n">a</span> <span class="o">=</span> <span class="mi">_</span><span class="nf">mm256_loadu_pd</span><span class="p">(</span><span class="o">&</span><span class="n">x</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span> <span class="k">as</span> <span class="o">*</span><span class="k">const</span> <span class="nb">f64</span><span class="p">);</span>
<span class="k">let</span> <span class="n">b</span> <span class="o">=</span> <span class="mi">_</span><span class="nf">mm256_add_pd</span><span class="p">(</span><span class="n">a</span><span class="p">,</span> <span class="n">one</span><span class="p">);</span>
<span class="mi">_</span><span class="nf">mm256_storeu_pd</span><span class="p">(</span><span class="o">&</span><span class="k">mut</span> <span class="n">x</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span> <span class="k">as</span> <span class="o">*</span><span class="k">mut</span> <span class="nb">f64</span><span class="p">,</span> <span class="n">b</span><span class="p">);</span>
<span class="p">}</span>
<span class="k">for</span> <span class="n">x</span> <span class="n">in</span> <span class="n">x</span><span class="nf">.chunks_exact_mut</span><span class="p">(</span><span class="mi">4</span><span class="p">)</span><span class="nf">.into_remainder</span><span class="p">()</span> <span class="p">{</span>
<span class="o">*</span><span class="n">x</span> <span class="o">+=</span> <span class="mf">1.0</span><span class="p">;</span>
<span class="p">}</span>
<span class="p">}</span>
<span class="p">}</span>
</code></pre></div></div>
<p>we simply write:</p>
<div class="language-rust highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">fn</span> <span class="nf">inc_doubles_scalar</span><span class="p">(</span><span class="n">x</span><span class="p">:</span> <span class="o">&</span><span class="k">mut</span> <span class="p">[</span><span class="nb">f64</span><span class="p">])</span> <span class="p">{</span>
<span class="k">for</span> <span class="n">x</span> <span class="n">in</span> <span class="n">x</span> <span class="p">{</span>
<span class="o">*</span><span class="n">x</span> <span class="o">+=</span> <span class="mi">1</span><span class="p">;</span>
<span class="p">}</span>
<span class="p">}</span>
</code></pre></div></div>
<p>This is much easier to read, works on all known hardware without modifications and
does not specify a vector size, which might be variable.</p>
<p>Vectorisers are fickle beasts, however. If the wind blows in the wrong direction, the compiler
will often fail to vectorise or, worse, vectorise in the IR and then convert the vector
operations into a long series of library calls.</p>
<p>For example, the following rather innocent function, which absolutely should be vectorisable,
converts itself into a series of function calls:</p>
<pre><code class="language-C">#include <math.h>
void vector_sin(double *d, int len) {
while (len--) {
*d = sin(*d);
++d;
}
}
</code></pre>
<p>Clang gives:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>.LBB0_7: # =>This Inner Loop Header: Depth=1
vmovsd xmm0, qword ptr [rbx + 8*rbp] # xmm0 = mem[0],zero
call sin
vmovsd qword ptr [rbx + 8*rbp], xmm0
vmovsd xmm0, qword ptr [rbx + 8*rbp + 8] # xmm0 = mem[0],zero
call sin
vmovsd qword ptr [rbx + 8*rbp + 8], xmm0
vmovsd xmm0, qword ptr [rbx + 8*rbp + 16] # xmm0 = mem[0],zero
call sin
vmovsd qword ptr [rbx + 8*rbp + 16], xmm0
vmovsd xmm0, qword ptr [rbx + 8*rbp + 24] # xmm0 = mem[0],zero
call sin
vmovsd qword ptr [rbx + 8*rbp + 24], xmm0
add rbp, 4
cmp r14d, ebp
jne .LBB0_7
</code></pre></div></div>
<p>Each call will take several hundred cycles.</p>
<h2 id="making-library-functions-that-vectorise">Making library functions that vectorise</h2>
<p>Library functions make things bad for themselves by introducing
branching, so even if we can inline a function, it will not vectorise.
To get better accuracy, they divide the domain
of a function - for example \([-\pi, \pi]\) for \(\sin(x)\) - into many small
parts. This is often done using a <code class="language-plaintext highlighter-rouge">switch</code> statement, which will not vectorise.
Alternatives include using a lookup table of coefficients, but many CPUs have not yet
implemented an efficient <code class="language-plaintext highlighter-rouge">gather</code> operation which can do table lookups
in reasonable time. The exception is GPUs, which commonly do
have efficient <code class="language-plaintext highlighter-rouge">gather</code>, but table lookups are likely to hurt cache performance
unless you use non-temporal loads and stores.</p>
<p><strong>Doctor Syn</strong> generates functions that are free of the complex control flow
which would inhibit vectorisation. The functions are all available as source code,
which greatly increases the chance of them being inlined.</p>
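<p>For flavour, a branch-free function in this style looks like the sketch
below - truncated Taylor coefficients for \(\sin(x)\), purely illustrative
(generated code uses minimax coefficients for far better accuracy):</p>
<pre><code class="language-Rust">/// Illustrative only: sin(x) ≈ x - x³/6 + x⁵/120 - x⁷/5040,
/// evaluated with fused multiply-adds and no branches or tables.
fn sin_poly(x: f64) -> f64 {
    let x2 = x * x;
    (-1.0 / 5040.0_f64)
        .mul_add(x2, 1.0 / 120.0)
        .mul_add(x2, -1.0 / 6.0)
        .mul_add(x2, 1.0)
        * x
}
</code></pre>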
<h2 id="example---sampling-from-the-normal-distribution">Example - sampling from the normal distribution.</h2>
<p>We tested some of our generated functions against one of the best stats distribution
libraries in the Rust world - <code class="language-plaintext highlighter-rouge">rand_distr</code>.</p>
<p>Combined with <code class="language-plaintext highlighter-rouge">rayon</code>, the parallel execution library,
this would have been the best choice for Monte Carlo experiments.</p>
<p>We started with a uniform random number generator, based on a
<code class="language-plaintext highlighter-rouge">xorshift</code>-style hash, and tested it against Rust’s <code class="language-plaintext highlighter-rouge">ThreadRng</code>.</p>
<p>By using a hash of an integer index instead of a sequence, we are able to
parallelise random number generation.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>pub fn runif(index: usize) -> f64 {
let mut z = (index + 1) as u64 * 0x9e3779b97f4a7c15;
z = (z ^ (z >> 30)) * 0xbf58476d1ce4e5b9;
z = (z ^ (z >> 27)) * 0x94d049bb133111eb;
z = z ^ (z >> 31);
from_bits((z >> 2) | 0x3ff0000000000000_u64) - 1.0
}
</code></pre></div></div>
<p>We tested both single and multi-threaded versions of these functions - easy
in Rust, as it is a naturally multi-threaded language - on a four-core x86 laptop.</p>
<table>
<thead>
<tr>
<th>Library</th>
<th>Function</th>
<th>ns per iteration (smaller is better)</th>
</tr>
</thead>
<tbody>
<tr>
<td>Doctor Syn</td>
<td><code class="language-plaintext highlighter-rouge">runif</code></td>
<td>0.8</td>
</tr>
<tr>
<td>Doctor Syn</td>
<td>parallel <code class="language-plaintext highlighter-rouge">runif</code></td>
<td>0.6</td>
</tr>
<tr>
<td>rand</td>
<td><code class="language-plaintext highlighter-rouge">ThreadRng::gen()</code></td>
<td>5.1</td>
</tr>
<tr>
<td>rand</td>
<td>parallel <code class="language-plaintext highlighter-rouge">ThreadRng::gen()</code></td>
<td>2.1</td>
</tr>
<tr>
<td>R</td>
<td><code class="language-plaintext highlighter-rouge">runif</code></td>
<td>35.0</td>
</tr>
<tr>
<td>Numpy</td>
<td><code class="language-plaintext highlighter-rouge">numpy.random.uniform</code></td>
<td>35.0</td>
</tr>
<tr>
<td>C</td>
<td><code class="language-plaintext highlighter-rouge">rand() * (1.0/RAND_MAX)</code> -O3</td>
<td>6.0</td>
</tr>
<tr>
<td>C++</td>
<td><code class="language-plaintext highlighter-rouge">uniform_real_distribution</code> -O3</td>
<td>13.6</td>
</tr>
</tbody>
</table>
<p>So clearly, we do well against even the best Rust version and
much better (over 30 times better) than R and Numpy.</p>
<p>Moving to normal random number generation, we use the quantile (or probit) function
to shape the random variable. This is a very simple version, good to about six
decimal digits; more accurate versions using <code class="language-plaintext highlighter-rouge">log</code> and <code class="language-plaintext highlighter-rouge">sqrt</code> are also available.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>fn qnorm(arg: fty) -> fty {
let scaled: fty = arg - 0.5;
let x = scaled;
let recip: fty = 1.0 / (x * x - 0.5 * 0.5);
let y: fty = (177186111.131545818686411653000483 as fty)
.mul_add(x * x, -219058235.58919835 as fty)
.mul_add(x * x, 117054121.857504129646289572504640 as fty)
.mul_add(x * x, -35345955.68660036 as fty)
.mul_add(x * x, 6623473.609141078534685775398250 as fty)
.mul_add(x * x, -796318.1973069897 as fty)
.mul_add(x * x, 61391.409088151006196662227193 as fty)
.mul_add(x * x, -2938.7971360761 as fty)
.mul_add(x * x, 83.911295471202339471921364 as fty)
.mul_add(x * x, 0.012702493639562371692090 as fty)
.mul_add(x * x, 1.856861340488065073103038 as fty)
.mul_add(x * x, -0.626662948075053 as fty)
* x;
y * recip
}
/// Use qnorm to shape the uniform random number.
pub fn rnorm(index: usize) -> f64 {
qnorm(runif(index) * 0.999 + 0.0005)
}
/// Parallel version in Rust (`do_par` and `ref_to_usize` are helpers defined elsewhere).
#[target_feature(enable = "avx2,fma")]
unsafe fn test_par_rnorm(d: &mut [f64]) {
do_par(d, |d| *d = rnorm(ref_to_usize(d)));
}
</code></pre></div></div>
<table>
<thead>
<tr>
<th>Library</th>
<th>Function</th>
<th>ns per iteration (smaller is better)</th>
</tr>
</thead>
<tbody>
<tr>
<td>Doctor Syn</td>
<td>rnorm</td>
<td>2.4</td>
</tr>
<tr>
<td>Doctor Syn</td>
<td>parallel rnorm</td>
<td>0.9</td>
</tr>
<tr>
<td>rand_distr</td>
<td>Normal::sample()</td>
<td>6.9</td>
</tr>
<tr>
<td>rand_distr</td>
<td>parallel Normal::sample()</td>
<td>1.7</td>
</tr>
<tr>
<td>R</td>
<td>rnorm</td>
<td>65.0</td>
</tr>
<tr>
<td>Numpy</td>
<td>numpy.random.normal</td>
<td>60.4</td>
</tr>
<tr>
<td>C++</td>
<td><code class="language-plaintext highlighter-rouge">normal_distribution<double></code> -O3</td>
<td>31.0</td>
</tr>
</tbody>
</table>
<p>So that is more than a 60x speedup over the R and Python versions
on a four-core laptop, and about 30x over C++.</p>
<h2 id="future-work">Future work</h2>
<p>With an implementation of just two Doctor Syn functions, we have shown a significant performance boost
over even the best-in-class Rust distribution system. This result on a toy example shows great promise for what the generalised <strong>Doctor Syn</strong> system is capable of.</p>
<p>Work has started on support for ARM SVE. The Doctor Syn method also provides flexibility
in the accuracy of the solution it provides, and we are exploring super-accurate
versions of our functions using larger sizes, table lookups or fixed point integer
arithmetic. Finally, we are working on the substantial task of full verification, as
well as generation of R, Python and Octave libraries.</p>
<p>Atomic Increment is developing this technology in partnership with <a href="https://www.embecosm.com">Embecosm</a>. If you want to use this technology, get in touch. We are keen to develop industry
partnerships with companies who require extra performance in their machine learning
and computational processes.</p>
<p>andy@atomicincrement.com
jeremy.bennett@embecosm.com</p>Andy Thomasonandy@atomicincrement.comExecutive SummaryWhat is an atomic increment?2021-11-17T14:12:06+00:002021-11-17T14:12:06+00:00https://www.atomicincrement.com/welcome/2021/11/17/what-is-an-atomic-increment<p>Why do we call ourselves Atomic Increment?
One of the myriad ways of improving code performance
is to use multithreaded code to lower latency. Every modern computer has
many cores which can run one or more threads at the same time.</p>
<p>The “atomic increment” operation allows us to safely share a counter between
two threads. Why is this necessary? This is because two CPUs running the same
code increment a counter using these three operations:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> load from memory
increment
store back to memory
</code></pre></div></div>
<p>If two threads are running, then we may get the following sequences.
The good one:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> Processor 1 Processor 2
|------------------------|------------------------|
| load from memory | |
| increment | |
| store back to memory | |
| | load from memory |
| | increment |
| | store back to memory |
|------------------------|------------------------|
</code></pre></div></div>
<p>This adds 2 to the memory.</p>
<p>And the bad one:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> Processor 1 Processor 2
|------------------------|------------------------|
| load from memory | |
| | load from memory |
| increment | |
| | increment |
| | store back to memory |
| store back to memory | |
|------------------------|------------------------|
</code></pre></div></div>
<p>The second sequence is bad because we only add 1 to the memory,
not 2! This is because Processor 1 overwrites the result of
Processor 2 - a situation known as a “Race Condition”.</p>
<p>To solve this we use special instructions that allow the CPUs
to “lock” the memory while the increment occurs. How this
is implemented depends on the CPU.</p>
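<p>In Rust, for example, this is exposed as <code class="language-plaintext highlighter-rouge">fetch_add</code>
on the atomic integer types. A minimal sketch:</p>
<pre><code class="language-Rust">use std::sync::atomic::{AtomicUsize, Ordering};
use std::thread;

static COUNTER: AtomicUsize = AtomicUsize::new(0);

fn main() {
    thread::scope(|s| {
        for _ in 0..2 {
            s.spawn(|| {
                // A single hardware-level atomic read-modify-write: no other
                // thread can slip in between the load and the store.
                COUNTER.fetch_add(1, Ordering::Relaxed);
            });
        }
    });
    // Both increments are always counted - this never prints 1.
    println!("{}", COUNTER.load(Ordering::Relaxed));
}
</code></pre>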
<p>For more information about concurrent programming, get in touch
with us through andy@atomicincrement.com</p>Andy Thomasonandy@atomicincrement.comWhy do we call ourselves Atomic Increment? One of the myriad ways of improving code performance is to use multithreaded code to lower latency. Every modern computer has many cores which can run one or more threads at the same time.