Jekyll2024-01-14T16:04:39+00:00https://www.atomicincrement.com/feed.xmlAtomic Increment - delivering High Efficiency ComputingAtomic Increment are a group of highly experienced coders who are passionate about code performance, Rust and saving you time and money when doing hard sums.
We teach Rust in association with Ferrous Systems, teach High Efficiency Computing, and develop the "extendr" R extension system with help from the R ISC Committee.
Andy Thomasonandy@atomicincrement.comGame development in Rust2024-01-14T14:12:06+00:002024-01-14T14:12:06+00:00https://www.atomicincrement.com/rust/2024/01/14/rust-game-development<p>C++ has long been the language of choice in professional game
engine development.</p>
<p>This was not always the case: C and assembler were dominant for
some time before C++, with games like Quake being among the last
vestiges of that era.</p>
<p>Of course, some games are written in interpreted languages, originally
BASIC and more recently JavaScript, and some game engines such as Unity
use languages like C# for scripting. C# is a middle ground between
scripting languages and C-like languages, using an “everything’s a pointer”
model and garbage collection. It is easier to learn than C++ and less
likely to run into undefined behaviour leading to crashes.</p>
<p>It should be noted that the runtimes of Unity and other game engines
that use C# are actually written in C++.</p>
<p>So why C++ and more recently, why Rust?</p>
<ul>
<li>Part 1 - The case for Rust in games.</li>
<li>Part 2 - An example: breakout in Bevy</li>
<li>Links</li>
</ul>
<p><em>Andy Thomason has worked in the game industry since the 1970s, developing
Namco console games and AI chess players in Z80 assembler as a teenager.</em></p>
<p><em>He has worked for Sony twice (Psygnosis and SN Systems) doing research
in game technology such as the PS3 and Vita compilers.</em></p>
<h1 id="part-1---the-case-for-rust-in-games">Part 1 - The case for Rust in games.</h1>
<h2 id="the-cc-programming-model---stack-and-heap">The C/C++ programming model - Stack and Heap</h2>
<p>C++ is based on C, and in fact the original C++ compiler, CFront, transcoded
C++ into C. C uses a “Stack and Heap” model to handle dynamically created objects.</p>
<p>If we write a C function with a variable</p>
<pre><code class="language-C">void my_function() {
int x = 1;
printf("%d", x);
}
</code></pre>
<p>then the variable <code class="language-plaintext highlighter-rouge">x</code> is stored in a <em>stack</em> frame which is reserved
when we call <code class="language-plaintext highlighter-rouge">my_function</code> and removed when we return from the function.</p>
<p>We can also allocate objects that live longer using <code class="language-plaintext highlighter-rouge">malloc</code> to get a
pointer to the <em>heap</em> so that when we return from a function, we still
have our object.</p>
<p>This all looks like this:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>+---------------+ top
+ Stack +
+---------------+ SP
+ +
+ unallocated +
+ +
+---------------+ BRK
+ Heap +
+---------------+ bottom
</code></pre></div></div>
<p>When you call a function, the stack pointer (SP) moves down,
creating more space; when you return, SP moves up, freeing the space.</p>
<p>The heap, by comparison, grows up from the bottom, but never shrinks.
Instead we divide the heap into <em>chunks</em> which are allocated by <code class="language-plaintext highlighter-rouge">malloc</code>
and freed up by <code class="language-plaintext highlighter-rouge">free</code>.</p>
<p>Thus the <em>lifetime</em> of objects on the stack is limited to the call
being made, but the <em>lifetime</em> of an object allocated on the heap
can be much longer.</p>
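<p>For comparison, and jumping ahead a little, Rust draws the same
distinction: values live on the stack by default, and
<code class="language-plaintext highlighter-rouge">Box::new</code> allocates on the heap. A minimal sketch:</p>
<pre><code class="language-Rust">fn make_on_heap() -> Box<i32> {
    let x = Box::new(1); // allocated on the heap
    x // ownership moves to the caller, so the allocation outlives the call
}
</code></pre>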
<h2 id="advantages-or-c-and-c">Advantages or C and C++</h2>
<p>Engines written in C++ are much faster than those written in Java, Go
and C# because you get much closer to the machine. You can go a lot faster
still if you write in assembler, but those skills are in decline.</p>
<p>Writing in C++ does not make the code go faster on its own, but it gives
you a larger toolbox to work with and lets you talk to the hardware
more directly. On game consoles, for example, the C++ code is used
to write to hardware registers directly, bypassing bulky APIs.</p>
<p>C++ also lets you use multithreaded code, and most modern C++ game
engines let you create huge numbers of tasks and events which
will be handled during the frame or over the course of many frames.</p>
<p>In garbage collected languages like C#, objects are <em>only</em> allocated
on the heap which is much costlier than allocating on the stack.</p>
<h2 id="problems-with-c-and-c">Problems with C and C++</h2>
<p>If an untrained driver sits in a Formula One car and tries to drive
it, they will likely crash immediately, and it is the same for
C and C++ - the program will crash.</p>
<p>Consider the following code:</p>
<pre><code class="language-C">int *fred() {
int x = 0;
return &x;
}
</code></pre>
<p>This function returns the address of the variable <code class="language-plaintext highlighter-rouge">x</code>, but after
returning from this function, x is no longer there, and reading or writing
through the pointer will likely cause a crash. This is a <em>dangling pointer</em>.</p>
<p>Finding this kind of fault is very hard in C++ and this puts a lot
of people off writing games in C++.</p>
<p>Another problem with multi-threaded code is the <em>race condition</em>.</p>
<p>Consider this code:</p>
<pre><code class="language-C"> // Thread 1
x = 1;
y = 2;
// Thread 2
x = 3;
y = 4;
// Thread 3
X = x;
Y = y;
</code></pre>
<p>What is the value of (X, Y)? It could be (1, 2), (3, 4),
(1, 4), (undefined, 4) and so on. Many, many faults in game
engines exist because of this.</p>
<h2 id="rust-is-the-successor-to-c">Rust is the successor to C++</h2>
<p>Rust was designed to get the benefits of C++ without the
pain of having to worry about race conditions and dangling
pointers.</p>
<p>It is a complete redesign of C++ with only the modern bits
and a safety-orientated checking system. It encourages
a common coding style through warnings about
variable naming and has extensive security checks to
avoid some of the nasty network attacks that can disable
games and steal user information.</p>
<p>Rust has two modes, <em>safe</em> and <em>unsafe</em>. Most code
is written in <em>safe</em> mode, which gives you guarantees that
avoid the problems of C++, but some code, such as
interactions with hardware, must be <em>unsafe</em>.</p>
<p>For example, the dangling pointer example we gave
is not possible in Rust:</p>
<pre><code class="language-Rust">fn my_function() -> &i32 {
let x = 1;
&x
}
</code></pre>
<p>This will generate a compile-time error.</p>
<p>Likewise with the race condition example, passing writable
references to variables to other threads is not allowed
in safe Rust, so you don’t have to worry about it.</p>
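<p>To make this concrete, here is a minimal sketch of the race condition
example in safe Rust - the compiler rejects the second closure because
it would take a second mutable borrow:</p>
<pre><code class="language-Rust">use std::thread;

fn main() {
    let mut x = 0;
    thread::scope(|s| {
        s.spawn(|| x = 1); // first mutable borrow of `x`
        s.spawn(|| x = 3); // error: cannot borrow `x` as mutable more than once
    });
}
</code></pre>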
<h2 id="so-why-not-just-stick-with-c">So why not just stick with C#?</h2>
<p>C# also gives these guarantees and indeed if you are writing
small games with low performance requirements, then C#
may be exactly what you are looking for.</p>
<p>But for large games, such as the Disney Engine, which is over 5M
lines of code, using C# is just not going to be possible,
and if you want to create effects that Unity is not pre-wired
to support, then good luck.</p>
<p>Rust makes it much easier to write large, multi-threaded games,
do networking, build servers to host thousands of players
and many more things.</p>
<p>Rust has a <em>huge</em> collection of libraries which you can use
by adding a single line to the manifest file, much as you would with NPM,
the JavaScript package manager. The <em>Cargo</em> build tool will download
the source code of any of the hundreds of thousands of libraries
available and compile it on the spot.</p>
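<p>For example, pulling in the Bevy engine is a single line in
<code class="language-plaintext highlighter-rouge">Cargo.toml</code> (the version number here is illustrative):</p>
<pre><code class="language-plaintext">[dependencies]
bevy = "0.13"
</code></pre>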
<p>In many ways, it is this ease of using libraries that makes
Rust the number one choice in new technologies such as blockchain
and fintech.</p>
<h2 id="who-uses-rust-in-the-game-industry">Who uses Rust in the game industry?</h2>
<p>Some studios, like Embark in Sweden, have adopted Rust
and are pushing the ecosystem forward. We are on the verge
of seeing a new generation of Rust game engines
become stable enough for large scale development.</p>
<p>For example, there are libraries for:</p>
<ul>
<li>Windowing</li>
<li>Audio</li>
<li>Shaders</li>
<li>3D rendering</li>
<li>Text rendering</li>
<li>AI</li>
<li>ECS (Entity-component-system model)</li>
<li>VR</li>
<li>3D format loaders</li>
<li>Maths</li>
<li>Mesh tools.</li>
</ul>
<p>etc.</p>
<p>As well as a stack of fully formed game engines:</p>
<ul>
<li>Bevy</li>
<li>Fyrox</li>
<li>Amethyst</li>
<li>ggez</li>
<li>macroquad</li>
<li>Piston</li>
</ul>
<p>I’ve been using the Bevy engine to do shader experiments with molecular
modelling, for example. Bevy uses WebGPU to make games that
can run on desktops, phones, browsers and many more platforms.</p>
<p><strong>Bevy</strong> has VR support, networking and many more things, but it
is still an “expert” level tool: it doesn’t have the easy
GUI that Unity has, but it suits my way of working.</p>
<p><strong>Fyrox</strong> has a GUI-driven scene generator and is orientated at
scripting, like Unity.</p>
<p><strong>Amethyst</strong> is also quite programmer-orientated, as is <strong>Piston</strong>.</p>
<h2 id="what-needs-to-happen">What needs to happen</h2>
<p>Most Rust game engines are very much orientated towards
programmers. For example, building a large open-world survival
strategy game like Factorio in Bevy would be quite easy,
but would require some programming skill.</p>
<p>To become more mainstream, these game engines need to
develop GUI interfaces to allow non-programmers to build
games. Some have started in that direction, but we will
see technically orientated games long before we see
artist-lead FPS games, for example.</p>
<p>Still, if it were a choice between starting a new game
engine in C++ or in Rust, the smart money would go
on Rust, as it is hugely popular and makes it much easier
to build large projects without breaking the bank.</p>
<p>If you were to start learning a low level language
now, Rust would be the choice, especially as most
Rust jobs are work-from-home, with Europe developing
as a centre for Rust digital nomads.</p>
<p>As a lifestyle, the open source world of Rust is much
preferable to being stuck in a room of hundreds of C++
programmers on an industrial estate in the middle of nowhere,
not to mention any game studios in particular!</p>
<h1 id="part-2---an-example-breakout-in-bevy">Part 2 - An example: breakout in Bevy</h1>
<p>To illustrate what it is like to write a game in Rust, let’s start
with one of the examples from the Bevy game engine.</p>
<p>Like in C and C++, the entry point to a Rust program is <code class="language-plaintext highlighter-rouge">main()</code></p>
<pre><code class="language-Rust">fn main() {
App::new()
.add_plugins(DefaultPlugins)
.run();
}
</code></pre>
<p>If this is all we did, then we would get a blank window.</p>
<p>So what we need to do is add data and code to make breakout run.</p>
<p>This adds two data <em>resources</em>: a scoreboard, which we will define,
and a clear colour, which is a system-defined resource.
Resources are not <em>assets</em> and do not draw themselves; we need
<em>entities</em> for the rendering plugin to draw anything, for example.
Resources are just bits of data which we will use.</p>
<pre><code class="language-Rust"> .insert_resource(Scoreboard { score: 0 })
.insert_resource(ClearColor(BACKGROUND_COLOR))
</code></pre>
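<p>The scoreboard itself is just a plain struct marked as a resource
with a derive macro, as in the Bevy breakout example:</p>
<pre><code class="language-Rust">#[derive(Resource)]
struct Scoreboard {
    score: usize,
}
</code></pre>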
<p>Next we add an event, which we will use to signal collisions
between the ball and other entities.</p>
<pre><code class="language-Rust"> .add_event::<CollisionEvent>()
</code></pre>
<p>And to make the game work, we have some systems, which are
functions that get called to update things.</p>
<pre><code class="language-Rust"> .add_systems(Startup, setup)
.add_systems(
FixedUpdate,
(
apply_velocity,
move_paddle,
check_for_collisions,
play_collision_sound,
).chain(),
)
.add_systems(Update, (update_scoreboard, bevy::window::close_on_esc))
</code></pre>
<p>The <code class="language-plaintext highlighter-rouge">.chain()</code> makes these functions run in sequence. Bevy
is a multi-threaded game engine and may run systems in any order
on different threads if need be.</p>
<p>The first of these systems is <code class="language-plaintext highlighter-rouge">setup</code>, which is called once at the start
of the game.</p>
<pre><code class="language-Rust">// Add the game's entities to our world
fn setup(
mut commands: Commands,
mut meshes: ResMut<Assets<Mesh>>,
mut materials: ResMut<Assets<ColorMaterial>>,
asset_server: Res<AssetServer>,
) {
// ...
}
</code></pre>
<p>The parameters to setup can come in any order and
use Rust’s flexible type system to make Assets and
other components accessible to the function.</p>
<p>The <code class="language-plaintext highlighter-rouge">commands</code> parameter is an interface that lets you
change the state of the game. For example:</p>
<pre><code class="language-Rust"> commands.spawn(Camera2dBundle::default());
</code></pre>
<p>sets up a 2D camera for the game world.</p>
<pre><code class="language-Rust"> // Sound
let ball_collision_sound = asset_server.load("sounds/breakout_collision.ogg");
commands.insert_resource(CollisionSound(ball_collision_sound));
</code></pre>
<p>adds a sound resource to the game.</p>
<pre><code class="language-Rust"> commands.spawn((
SpriteBundle {
// ...
},
Paddle,
Collider,
));
</code></pre>
<p>adds a <em>bundle</em> of components to an entity (the paddle). A bundle is
an easy way of deploying a number of components at a time.</p>
<p>The component system is similar to Unity. Each object in the game
world has a number of components such as <code class="language-plaintext highlighter-rouge">Transform</code> and <code class="language-plaintext highlighter-rouge">Sprite</code> as
well as some user-defined components.</p>
<p>The <code class="language-plaintext highlighter-rouge">Transform</code> component, for example, specifies the location
of a sprite and the <code class="language-plaintext highlighter-rouge">Sprite</code> component describes the colour, image
and other properties.</p>
<p>Here <code class="language-plaintext highlighter-rouge">Paddle</code> and <code class="language-plaintext highlighter-rouge">Collider</code> are user defined components.</p>
<p>Likewise, we spawn entities such as the ball, the bricks, the walls
and so on.</p>
<h3 id="making-custom-components">Making custom components</h3>
<p>Making custom components is easy in Bevy. We use a <code class="language-plaintext highlighter-rouge">derive</code> macro
to generate extra code needed for the component. In these two cases
there is no extra data needed, so the structs don’t need curly braces:</p>
<pre><code class="language-Rust">#[derive(Component)]
struct Paddle;
#[derive(Component)]
struct Ball;
</code></pre>
<p>The types, however, are used to make a distinction between <code class="language-plaintext highlighter-rouge">Paddle</code>
and <code class="language-plaintext highlighter-rouge">Ball</code> and will be used to select the components when we run the systems.</p>
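<p>Components can also carry data. The ball’s velocity, for instance,
can be a tuple struct wrapping a vector (a sketch; the breakout example
also derives <code class="language-plaintext highlighter-rouge">Deref</code> and
<code class="language-plaintext highlighter-rouge">DerefMut</code> for convenience):</p>
<pre><code class="language-Rust">#[derive(Component)]
struct Velocity(Vec2);
</code></pre>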
<h3 id="moving-the-paddle">Moving the paddle</h3>
<p>To move the paddle, we need a system which takes user input and
all the entities which have <code class="language-plaintext highlighter-rouge">Transform</code> and <code class="language-plaintext highlighter-rouge">Paddle</code> components like this:</p>
<pre><code class="language-Rust">fn move_paddle(
keyboard_input: Res<ButtonInput<KeyCode>>,
mut query: Query<&mut Transform, With<Paddle>>,
time: Res<Time>,
) {
let mut paddle_transform = query.single_mut();
}
</code></pre>
<p>There is only one paddle, so <code class="language-plaintext highlighter-rouge">query.single_mut()</code> will do,
and it also enables us to write to the transform (move the paddle).</p>
<p>By default in Rust, references like <code class="language-plaintext highlighter-rouge">&Transform</code> are read-only
and we need to use <code class="language-plaintext highlighter-rouge">&mut Transform</code> and <code class="language-plaintext highlighter-rouge">single_mut</code> to allow
us to change the transform.</p>
<p>The rest of the function reads the keyboard and moves the paddle.</p>
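<p>In sketch form it looks something like this, where <code class="language-plaintext highlighter-rouge">PADDLE_SPEED</code>
is a constant defined elsewhere in the example (key names vary slightly
between Bevy versions):</p>
<pre><code class="language-Rust">    let mut direction = 0.0;
    if keyboard_input.pressed(KeyCode::ArrowLeft) {
        direction -= 1.0;
    }
    if keyboard_input.pressed(KeyCode::ArrowRight) {
        direction += 1.0;
    }
    // Scale by the frame time so the paddle speed is framerate-independent.
    paddle_transform.translation.x += direction * PADDLE_SPEED * time.delta_seconds();
</code></pre>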
<h3 id="checking-for-collisions">Checking for collisions</h3>
<pre><code class="language-Rust">fn check_for_collisions(
mut commands: Commands,
mut scoreboard: ResMut<Scoreboard>,
mut ball_query: Query<(&mut Velocity, &Transform), With<Ball>>,
collider_query: Query<(Entity, &Transform, Option<&Brick>), With<Collider>>,
mut collision_events: EventWriter<CollisionEvent>,
) {
// ...
}
</code></pre>
<p>This system has:</p>
<ul>
<li>An interface to change the engine state.</li>
<li>A writeable <code class="language-plaintext highlighter-rouge">Scoreboard</code> resource.</li>
<li>A query to find the Velocity and Transform of the ball.</li>
<li>A query to find anything with a Transform and a collider, which may be a brick.</li>
<li>An <code class="language-plaintext highlighter-rouge">EventWriter</code> to signal collisions to other systems.</li>
</ul>
<p>We spin round checking the ball position against the bricks and walls, updating the scoreboard
and sending events if anything collides.</p>
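<p>The core of that loop looks roughly like this (a sketch, with the
actual overlap test elided):</p>
<pre><code class="language-Rust">    // For each collider the ball overlaps...
    if maybe_brick.is_some() {
        scoreboard.score += 1;             // bricks score a point...
        commands.entity(entity).despawn(); // ...and disappear
    }
    collision_events.send_default();       // tell other systems about the hit
    // ...then reflect the ball's velocity depending on the side of the collision.
</code></pre>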
<h3 id="sounds">Sounds</h3>
<pre><code class="language-Rust">fn play_collision_sound(
mut commands: Commands,
mut collision_events: EventReader<CollisionEvent>,
sound: Res<CollisionSound>,
) {
// ...
}
</code></pre>
<p>Here we receive collision events and convert them into
sounds.</p>
<p>We create the sound by <em>spawning</em> the collision <code class="language-plaintext highlighter-rouge">sound</code>
in a bundle - yes, sounds are entities too!</p>
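<p>A sketch of that spawn, using Bevy’s audio bundle - the
<code class="language-plaintext highlighter-rouge">DESPAWN</code> setting removes the
entity again when playback finishes:</p>
<pre><code class="language-Rust">    commands.spawn(AudioBundle {
        source: sound.0.clone(),
        settings: PlaybackSettings::DESPAWN,
    });
</code></pre>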
<h2 id="multithreading">Multithreading</h2>
<p>Because of the danger of race conditions, Bevy is careful
not to call two systems at the same time with a mutable
reference to the same component.</p>
<p>Bevy’s use of <code class="language-plaintext highlighter-rouge">#[derive]</code> and the Rust type system
makes for a more C#-like development environment.</p>
<p>With very large games, with hundreds of thousands of entities,
this will make a big difference.</p>
<h1 id="thats-all-folks">That’s all folks</h1>
<p>We talked a little about how Rust, a low level language,
makes it easier to write safe multi-threaded code, stealing
some thunder from C# and giving a significant performance
boost.</p>
<p>We showed you how easy it is to build games using the
Bevy ECS (Entity-component-system) model.</p>
<p>So happy Rusting, and if you get the opportunity,
try writing a game in Bevy. It may take a bit of
getting used to, but you are a champion!</p>
<h1 id="links">Links</h1>
<p><a href="https://github.com/bevyengine/bevy/blob/main/examples/games/breakout.rs">Breakout</a></p>
<p><a href="https://arewegameyet.rs/#ecosystem">Are we game yet Ecosystem</a></p>
<p><a href="https://bevyengine.org/">Bevy Game Engine</a></p>
<p><a href="https://fyrox-book.github.io/beginning/scripting.html">Fyrox Game Engine</a></p>
<p><a href="https://amethyst.rs/">Amethyst Game Engine</a></p>Andy Thomasonandy@atomicincrement.comC++ has long been the language of choice in professional game engine development.Breaking the AI sound barrier with Doctor Syn.2021-11-18T00:00:00+00:002021-11-18T00:00:00+00:00https://www.atomicincrement.com/maths/2021/11/18/polynomial-approximation<h2 id="executive-summary">Executive Summary</h2>
<p>The problem facing the AI industry today is that many of the functions at
the heart of machine learning processes were written over forty years ago
when computer hardware and compiler technology were very different.</p>
<p>Our library, Doctor Syn, addresses this problem by using the three technologies
of SIMD, multithreading and autovectorisation. We achieve 30x or more speedups
over traditional libraries in C, C++, Rust and Fortran, without making the code
platform or language specific.</p>
<p>Doctor Syn’s primary focus at present is to generate accurate polynomial
approximations to key functions important to the execution of many programs. You are probably familiar with many of the functions we are targeting:</p>
<table>
<thead>
<tr>
<th>Rust Function</th>
<th>calculates</th>
</tr>
</thead>
<tbody>
<tr>
<td>f32/f64::sin</td>
<td>\(\sin{x}\)</td>
</tr>
<tr>
<td>f32/f64::cos</td>
<td>\(\cos{x}\)</td>
</tr>
<tr>
<td>f32/f64::atan2</td>
<td>\(\arctan{y/x}\)</td>
</tr>
<tr>
<td>f32/f64::exp</td>
<td>\(e^x\)</td>
</tr>
<tr>
<td>f32/f64::ln</td>
<td>\(\log{x}\)</td>
</tr>
</tbody>
</table>
<p>While improving these functions has a lot of value, we are currently focusing most of our effort on statistical functions such as:</p>
<table>
<thead>
<tr>
<th>R Function</th>
<th>distribution</th>
<th>role</th>
<th>calculates</th>
</tr>
</thead>
<tbody>
<tr>
<td>dnorm</td>
<td>normal</td>
<td>pdf</td>
<td>\(\frac{1}{\sigma\sqrt{2\pi}} e^{-\frac{1}{2}\left(\frac{x - \mu}{\sigma}\right)^2}\)</td>
</tr>
<tr>
<td>pnorm</td>
<td>normal</td>
<td>cdf</td>
<td>\(\frac{1}{2}\left[1 + \operatorname{erf}\left( \frac{x-\mu}{\sigma\sqrt{2}}\right)\right]\)</td>
</tr>
<tr>
<td>qnorm</td>
<td>normal</td>
<td>quantile</td>
<td>\(\mu+\sigma\sqrt{2} \operatorname{erf}^{-1}(2p-1)\)</td>
</tr>
<tr>
<td>rnorm</td>
<td>normal</td>
<td>random</td>
<td>\(\operatorname{qnorm}(\operatorname{runif}(i))\)</td>
</tr>
</tbody>
</table>
<p>These functions are used extensively in finance and bioinformatics to perform statistical
inference, stochastic modelling, AI and machine learning. For example, rnorm is a key part of many
MCMC algorithms and variational techniques, as well as of Monte Carlo simulations, such as those used to solve stochastic differential equations.</p>
<p>Using this library, combined with parallel iterators, we generate more efficient versions of</p>
<ul>
<li>Numpy</li>
<li>R</li>
<li>GNU Octave</li>
</ul>
<p>and many others.</p>
<p>We have also targeted new architectures like Arm SVE which do not fit the X86 model.
We are working with the Isambard A64FX cluster to attempt to improve existing
algorithms.</p>
<p>This approach to function generation should fit perfectly with the A64FX’s SVE
architecture, as SVE has a variable length SIMD architecture which will run
the same binary on machines with many different word lengths. SVE requires
autovectorisation to work effectively.</p>
<h2 id="the-ai-sound-barrier">The AI sound barrier</h2>
<p>While great effort has been expended on key function optimization, current techniques are unable to efficiently utilize modern compiler technology such as auto-vectorisation and thread-based parallelism.
Research in function approximation has focused on squeezing out the last half bit of precision
at the expense of making functions ever more complex and very much slower.</p>
<p>In practice, machine learning algorithms can tolerate a large amount of error, and giving users
the ability to choose the level of accuracy that a function delivers, as well as the domain
of inputs, can make those algorithms orders of magnitude faster.</p>
<p>For example, using 32 bit floating point instead of 64 bit often has a 4:1 performance advantage
in compute and a 2:1 advantage in memory performance. With modern computers, the memory bandwidth
is very often the limiting factor and finding smarter ways to represent data becomes the key
to fast algorithms. If we know that a vector of numbers does not contain NaN values, then we can
skip NaN checks on every calculation.</p>
<p>But primarily, we need to make our functions simple enough to be vectorisable - once we have achieved
this, we get remarkable performance improvements.</p>
<h2 id="the-challenge-of-vectorisation">The challenge of vectorisation.</h2>
<p>Existing functions will not vectorise primarily because:</p>
<ul>
<li>They are in shared or static libraries.</li>
<li>They contain branches and look-up tables.</li>
</ul>
<p>In the 1970s, when many of these functions were written, this was the state of the art. Today, however, it is a problem: with the advances of modern vectorising processors, these implementations are substantially less efficient than they could be.</p>
<p>There is no short and simple fix either: these problems are fundamental ones that preclude any efficient vectorisation. The only viable approach is a substantial and novel change such as the one that Doctor Syn proposes.</p>
<h2 id="autovectorisation">Autovectorisation</h2>
<p>In the past we have implemented fast functions in assembler, or even written code that emits functions directly as machine code. These days, however, we often try to take a more “civilized” approach where possible. Assembler functions are fast, but difficult to read, difficult to improve, and difficult to generalize. Despite these problems, many such functions end up hanging around like a bad smell, and more are being produced by chip vendors for special architectures. Even excellent libraries like Sleef
are done this way, with machine-specific intrinsics.</p>
<p>To solve the problems with these architecture-specific assembly implementations, we try to write portable code; we wish our code to run on both x86 architectures with SIMD and the new ARM SVE with variable sized registers. The way we achieve this is by writing code in such a way that it will be automatically vectorised by modern compilers.</p>
<p>So instead of something like:</p>
<div class="language-rust highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">use</span> <span class="nn">std</span><span class="p">::</span><span class="nn">arch</span><span class="p">::</span><span class="nn">x86_64</span><span class="p">::</span><span class="o">*</span><span class="p">;</span>
<span class="k">pub</span> <span class="k">fn</span> <span class="nf">inc_doubles_simd</span><span class="p">(</span><span class="n">x</span><span class="p">:</span> <span class="o">&</span><span class="k">mut</span> <span class="p">[</span><span class="nb">f64</span><span class="p">])</span> <span class="p">{</span>
<span class="k">unsafe</span> <span class="p">{</span>
<span class="k">let</span> <span class="n">one</span> <span class="o">=</span> <span class="mi">_</span><span class="nf">mm256_broadcast_sd</span><span class="p">(</span><span class="o">&</span><span class="mf">1.0</span><span class="p">);</span>
<span class="k">for</span> <span class="n">x</span> <span class="n">in</span> <span class="n">x</span><span class="nf">.chunks_exact_mut</span><span class="p">(</span><span class="mi">4</span><span class="p">)</span> <span class="p">{</span>
<span class="k">let</span> <span class="n">a</span> <span class="o">=</span> <span class="mi">_</span><span class="nf">mm256_loadu_pd</span><span class="p">(</span><span class="o">&</span><span class="n">x</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span> <span class="k">as</span> <span class="o">*</span><span class="k">const</span> <span class="nb">f64</span><span class="p">);</span>
<span class="k">let</span> <span class="n">b</span> <span class="o">=</span> <span class="mi">_</span><span class="nf">mm256_add_pd</span><span class="p">(</span><span class="n">a</span><span class="p">,</span> <span class="n">one</span><span class="p">);</span>
<span class="mi">_</span><span class="nf">mm256_storeu_pd</span><span class="p">(</span><span class="o">&</span><span class="k">mut</span> <span class="n">x</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span> <span class="k">as</span> <span class="o">*</span><span class="k">mut</span> <span class="nb">f64</span><span class="p">,</span> <span class="n">b</span><span class="p">);</span>
<span class="p">}</span>
<span class="k">for</span> <span class="n">x</span> <span class="n">in</span> <span class="n">x</span><span class="nf">.chunks_exact_mut</span><span class="p">(</span><span class="mi">4</span><span class="p">)</span><span class="nf">.into_remainder</span><span class="p">()</span> <span class="p">{</span>
<span class="o">*</span><span class="n">x</span> <span class="o">+=</span> <span class="mf">1.0</span><span class="p">;</span>
<span class="p">}</span>
<span class="p">}</span>
<span class="p">}</span>
</code></pre></div></div>
<p>we simply write:</p>
<div class="language-rust highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">fn</span> <span class="nf">inc_doubles_scalar</span><span class="p">(</span><span class="n">x</span><span class="p">:</span> <span class="o">&</span><span class="k">mut</span> <span class="p">[</span><span class="nb">f64</span><span class="p">])</span> <span class="p">{</span>
<span class="k">for</span> <span class="n">x</span> <span class="n">in</span> <span class="n">x</span> <span class="p">{</span>
<span class="o">*</span><span class="n">x</span> <span class="o">+=</span> <span class="mi">1</span><span class="p">;</span>
<span class="p">}</span>
<span class="p">}</span>
</code></pre></div></div>
<p>This is much easier to read, works on all known hardware without modifications and
does not specify a vector size, which might be variable.</p>
<p>Vectorisers are fickle beasts, however. If the wind blows in the wrong direction, the compiler
will often fail to vectorise or, worse, vectorise in the IR and then convert the vector
operations into a long series of library calls.</p>
<p>For example, the following rather innocent function, which absolutely should be vectorisable,
converts itself into a series of function calls:</p>
<pre><code class="language-C">#include <math.h>
void vector_sin(double *d, int len) {
while (len--) {
*d = sin(*d);
++d;
}
}
</code></pre>
<p>Clang gives:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>.LBB0_7: # =>This Inner Loop Header: Depth=1
vmovsd xmm0, qword ptr [rbx + 8*rbp] # xmm0 = mem[0],zero
call sin
vmovsd qword ptr [rbx + 8*rbp], xmm0
vmovsd xmm0, qword ptr [rbx + 8*rbp + 8] # xmm0 = mem[0],zero
call sin
vmovsd qword ptr [rbx + 8*rbp + 8], xmm0
vmovsd xmm0, qword ptr [rbx + 8*rbp + 16] # xmm0 = mem[0],zero
call sin
vmovsd qword ptr [rbx + 8*rbp + 16], xmm0
vmovsd xmm0, qword ptr [rbx + 8*rbp + 24] # xmm0 = mem[0],zero
call sin
vmovsd qword ptr [rbx + 8*rbp + 24], xmm0
add rbp, 4
cmp r14d, ebp
jne .LBB0_7
</code></pre></div></div>
<p>Each call will take several hundred cycles.</p>
<h2 id="making-library-functions-that-vectorise">Making library functions that vectorise</h2>
<p>Library functions make things bad for themselves by introducing
branching, so even if we can inline a function, it will not vectorise.
To get better accuracy, they divide the domain
of a function - for example \([-\pi, \pi]\) for \(\sin(x)\) - into many small
parts. This is often done using a <code class="language-plaintext highlighter-rouge">switch</code> statement, which will not vectorise.
Alternatives include using a lookup table of coefficients, but many CPUs have not yet
implemented an efficient <code class="language-plaintext highlighter-rouge">gather</code> operation which can do table lookups
in reasonable time. The exception is GPUs, which commonly do
have efficient <code class="language-plaintext highlighter-rouge">gather</code>, but table lookups are likely to hurt cache performance
unless you use non-temporal loads and stores.</p>
<p><strong>Doctor Syn</strong> generates functions that are free of the complex control flow
which would inhibit vectorisation. The functions are all available as source code,
which greatly increases the chance of them being inlined.</p>
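<p>For flavour, a branch-free function in this style looks like the sketch
below - truncated Taylor coefficients for \(\sin(x)\), purely illustrative
(generated code uses minimax coefficients for far better accuracy):</p>
<pre><code class="language-Rust">/// Illustrative only: sin(x) ≈ x - x³/6 + x⁵/120 - x⁷/5040,
/// evaluated with fused multiply-adds and no branches or tables.
fn sin_poly(x: f64) -> f64 {
    let x2 = x * x;
    (-1.0 / 5040.0_f64)
        .mul_add(x2, 1.0 / 120.0)
        .mul_add(x2, -1.0 / 6.0)
        .mul_add(x2, 1.0)
        * x
}
</code></pre>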
<h2 id="example---sampling-from-the-normal-distribution">Example - sampling from the normal distribution.</h2>
<p>We tested some of our generated functions against one of the best stats distribution
libraries in the Rust world - <code class="language-plaintext highlighter-rouge">rand_distr</code>.</p>
<p>Combined with <code class="language-plaintext highlighter-rouge">rayon</code>, the parallel execution library,
this would have been the best choice for Monte Carlo experiments.</p>
<p>We started with a uniform random number generator, based on a
<code class="language-plaintext highlighter-rouge">xorshift</code>-style hash, and tested it against Rust’s <code class="language-plaintext highlighter-rouge">ThreadRng</code>.</p>
<p>By using a hash of an integer index instead of a sequence, we are able to
parallelise random number generation.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>pub fn runif(index: usize) -> f64 {
let mut z = (index + 1) as u64 * 0x9e3779b97f4a7c15;
z = (z ^ (z >> 30)) * 0xbf58476d1ce4e5b9;
z = (z ^ (z >> 27)) * 0x94d049bb133111eb;
z = z ^ (z >> 31);
from_bits((z >> 2) | 0x3ff0000000000000_u64) - 1.0
}
</code></pre></div></div>
<p>We tested both single and multi-threaded versions of these functions - easy
in Rust, as it is a naturally multi-threaded language - on a four-core x86 laptop.</p>
<table>
<thead>
<tr>
<th>Library</th>
<th>Function</th>
<th>ns per iteration (smaller is better)</th>
</tr>
</thead>
<tbody>
<tr>
<td>Doctor Syn</td>
<td><code class="language-plaintext highlighter-rouge">runif</code></td>
<td>0.8</td>
</tr>
<tr>
<td>Doctor Syn</td>
<td>parallel <code class="language-plaintext highlighter-rouge">runif</code></td>
<td>0.6</td>
</tr>
<tr>
<td>rand</td>
<td><code class="language-plaintext highlighter-rouge">ThreadRng::gen()</code></td>
<td>5.1</td>
</tr>
<tr>
<td>rand</td>
<td>parallel <code class="language-plaintext highlighter-rouge">ThreadRng::gen()</code></td>
<td>2.1</td>
</tr>
<tr>
<td>R</td>
<td><code class="language-plaintext highlighter-rouge">runif</code></td>
<td>35.0</td>
</tr>
<tr>
<td>Numpy</td>
<td><code class="language-plaintext highlighter-rouge">numpy.random.uniform</code></td>
<td>35.0</td>
</tr>
<tr>
<td>C</td>
<td><code class="language-plaintext highlighter-rouge">rand() * (1.0/RAND_MAX)</code> -O3</td>
<td>6.0</td>
</tr>
<tr>
<td>C++</td>
<td><code class="language-plaintext highlighter-rouge">uniform_real_distribution</code> -O3</td>
<td>13.6</td>
</tr>
</tbody>
</table>
<p>So clearly, we do well against even the best Rust version and
much better (over 30 times better) than R and Numpy.</p>
<p>Moving to normal random number generation, we use the quantile (or probit) function
to shape the random variable. This is a very simple version, good to about six
decimal digits; more accurate versions using <code class="language-plaintext highlighter-rouge">log</code> and <code class="language-plaintext highlighter-rouge">sqrt</code> are also available.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>fn qnorm(arg: fty) -> fty {
let scaled: fty = arg - 0.5;
let x = scaled;
let recip: fty = 1.0 / (x * x - 0.5 * 0.5);
let y: fty = (177186111.131545818686411653000483 as fty)
.mul_add(x * x, -219058235.58919835 as fty)
.mul_add(x * x, 117054121.857504129646289572504640 as fty)
.mul_add(x * x, -35345955.68660036 as fty)
.mul_add(x * x, 6623473.609141078534685775398250 as fty)
.mul_add(x * x, -796318.1973069897 as fty)
.mul_add(x * x, 61391.409088151006196662227193 as fty)
.mul_add(x * x, -2938.7971360761 as fty)
.mul_add(x * x, 83.911295471202339471921364 as fty)
.mul_add(x * x, 0.012702493639562371692090 as fty)
.mul_add(x * x, 1.856861340488065073103038 as fty)
.mul_add(x * x, -0.626662948075053 as fty)
* x;
y * recip
}
/// Use qnorm to shape the uniform random number.
pub fn rnorm(index: usize) -> f64 {
qnorm(runif(index) * 0.999 + 0.0005)
}
/// Parallel version in Rust (`do_par` and `ref_to_usize` are helpers defined elsewhere).
#[target_feature(enable = "avx2,fma")]
unsafe fn test_par_rnorm(d: &mut [f64]) {
do_par(d, |d| *d = rnorm(ref_to_usize(d)));
}
</code></pre></div></div>
<table>
<thead>
<tr>
<th>Library</th>
<th>Function</th>
<th>ns per iteration (smaller is better)</th>
</tr>
</thead>
<tbody>
<tr>
<td>Doctor Syn</td>
<td>rnorm</td>
<td>2.4</td>
</tr>
<tr>
<td>Doctor Syn</td>
<td>parallel rnorm</td>
<td>0.9</td>
</tr>
<tr>
<td>rand_distr</td>
<td>Normal::sample()</td>
<td>6.9</td>
</tr>
<tr>
<td>rand_distr</td>
<td>parallel Normal::sample()</td>
<td>1.7</td>
</tr>
<tr>
<td>R</td>
<td>rnorm</td>
<td>65.0</td>
</tr>
<tr>
<td>Numpy</td>
<td>numpy.random.normal</td>
<td>60.4</td>
</tr>
<tr>
<td>C++</td>
<td><code class="language-plaintext highlighter-rouge">normal_distribution<double></code> -O3</td>
<td>31.0</td>
</tr>
</tbody>
</table>
<p>So that is more than a 60x speedup over the R and Python versions
on a four-core laptop, and about 30x over C++.</p>
<h2 id="future-work">Future work</h2>
<p>With an implementation of just two Doctor Syn functions, we have shown a significant performance boost
over even the best-in-class Rust distribution system. This result on a toy example shows great promise for what the generalised <strong>Doctor Syn</strong> system is capable of.</p>
<p>Work has started on support for ARM SVE. The Doctor Syn method also provides flexibility
in the accuracy of the solution it provides, and we are exploring super-accurate
versions of our functions using larger sizes, table lookups or fixed point integer
arithmetic. Finally, we are working on the substantial task of full verification, as
well as generation of R, Python and Octave libraries.</p>
<p>Atomic Increment is developing this technology in partnership with <a href="https://www.embecosm.com">Embecosm</a>. If you want to use this technology, get in touch. We are keen to develop industry
partnerships with companies who require extra performance in their machine learning
and computational processes.</p>
<p>andy@atomicincrement.com
jeremy.bennett@embecosm.com</p>Andy Thomasonandy@atomicincrement.comExecutive SummaryWhat is an atomic increment?2021-11-17T14:12:06+00:002021-11-17T14:12:06+00:00https://www.atomicincrement.com/welcome/2021/11/17/what-is-an-atomic-increment<p>Why do we call ourselves Atomic Increment?
One of the myriad ways of improving code performance
is to use multithreaded code to lower latency. Every modern computer has
many cores which can run one or more threads at the same time.</p>
<p>The “atomic increment” operation allows us to safely share a counter between
two threads. Why is this necessary? This is because two CPUs running the same
code increment a counter using these three operations:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> load from memory
increment
store back to memory
</code></pre></div></div>
<p>If two threads are running, then we may get the following sequences.
The good one:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> Processor 1 Processor 2
|------------------------|------------------------|
| load from memory | |
| increment | |
| store back to memory | |
| | load from memory |
| | increment |
| | store back to memory |
|------------------------|------------------------|
</code></pre></div></div>
<p>This adds 2 to the memory.</p>
<p>And the bad one:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> Processor 1 Processor 2
|------------------------|------------------------|
| load from memory | |
| | load from memory |
| increment | |
| | increment |
| | store back to memory |
| store back to memory | |
|------------------------|------------------------|
</code></pre></div></div>
<p>The second sequence is bad because we only add 1 to the memory,
not 2! This is because Processor 1 overwrites the result of
Processor 2 - a situation known as a “Race Condition”.</p>
<p>To solve this we use special instructions that allow the CPUs
to “lock” the memory while the increment occurs. How this
is implemented depends on the CPU.</p>
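<p>In Rust, for example, this is exposed as <code class="language-plaintext highlighter-rouge">fetch_add</code>
on the atomic integer types. A minimal sketch:</p>
<pre><code class="language-Rust">use std::sync::atomic::{AtomicUsize, Ordering};
use std::thread;

static COUNTER: AtomicUsize = AtomicUsize::new(0);

fn main() {
    thread::scope(|s| {
        for _ in 0..2 {
            s.spawn(|| {
                // A single hardware-level atomic read-modify-write: no other
                // thread can slip in between the load and the store.
                COUNTER.fetch_add(1, Ordering::Relaxed);
            });
        }
    });
    // Both increments are always counted - this never prints 1.
    println!("{}", COUNTER.load(Ordering::Relaxed));
}
</code></pre>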
<p>For more information about concurrent programming, get in touch
with us through andy@atomicincrement.com</p>Andy Thomasonandy@atomicincrement.comWhy do we call ourselves Atomic Increment? One of the myriad ways of improving code performance is to use multithreaded code to lower latency. Every modern computer has many cores which can run one or more threads at the same time.