Mathieu Ropert

What makes a game tick? Special Issue - Buffy the Performance Slayer

2026-05-05T00:00:00+00:00

Whether one is making an RPG or a strategy game, there usually comes a time when designers want to attach a bunch of stat buffs and debuffs to each and every object in the game. The game actors start small, but eventually we want the stats of our characters, units, countries and monsters to be affected by a mix of equipment, perks, area effects, difficulty settings and whatnot. And if we don’t take care, it might turn our game into buff recalculation simulator 2000.

For those who have never had to design or implement a stats system, this might not sound like a very hard problem at first glance. After all, shouldn’t it just be a simple struct with a bunch of values in it? It could, but it depends a lot on what your system can and cannot handle.

Say that we want our actors to have a bunch of stats that can be affected by various sources as we mentioned in our intro paragraph. Let’s ask a few questions about the extents of our system.

To cache or not to cache?

First let’s get the most basic question out of the way: “do we need to store the sum total of a given stat with all buffs applied?”. After all, we could simply recalculate it on the fly each time. This helps us frame the problem domain as fundamentally a caching issue. It’s often said that cache invalidation is one of the two hard problems in software engineering, so can we dispense with it?

This might sound like a no-brainer but it actually depends a lot on the game, and then possibly on the stat itself. It forces us to ask ourselves (and the designers): how often does this value change compared to how often it is used. This is made trickier by the fact that the answer might change as the game design evolves. Maybe the designer has great plans for a given stat, but it turns out “bonus torpedo speed for submarines in deep waters” isn’t a value that is often used in the game¹, or maybe a stat started as a niche thing, but became ubiquitous over the years.

On a similar note, how expensive is the value itself to compute? Is it just about querying a couple of values and summing them? Does it need to iterate over a collection of objects from possibly dozens or thousands of potential contributors? And finally, does it need to recursively query values from another game object that will likely be using the same system?

Assuming we see a need for caching at least some class of stats, let’s continue with our next question.

Source tracking

Another very important question is “do we need to keep track of the contributors to a given stat?”. In the most basic case it’s not necessary. We could simply have a value, add to it when a buff is activated and subtract from it when the buff is disabled (for example when a character equips or unequips a piece of equipment).
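To make that basic case concrete, here is a minimal sketch of an untracked stat total. The names UntrackedStats, on_enable and on_disable are mine and purely illustrative, not taken from any real codebase:

```cpp
#include <cassert>
#include <cstdint>
#include <unordered_map>

using StatID = std::uint32_t;

// Hypothetical untracked storage: one running total per stat,
// with no record of which buff contributed what.
struct UntrackedStats
{
    std::unordered_map<StatID, double> totals;

    // Called when a buff becomes active (e.g. an item is equipped).
    void on_enable(StatID stat, double amount) { totals[stat] += amount; }

    // Called when the same buff goes away; the caller must pass the same amount.
    void on_disable(StatID stat, double amount)
    {
        auto it = totals.find(stat);
        assert(it != totals.end() && "disabling a buff that was never enabled");
        it->second -= amount;
    }

    double get(StatID stat) const
    {
        auto it = totals.find(stat);
        return it == totals.end() ? 0.0 : it->second;
    }
};
```

Note how nothing in there can answer “where does this number come from?”: the total is all that is left once a buff has been applied.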

I say very important because it can easily be overlooked in the early stages of development. At first glance it looks like it won’t be necessary. The architecture gives us clear OnEnable() and OnDisable() callbacks that we can use to make sure the value is up to date. The designers don’t foresee the need to track where a buff comes from. They even thought about stacking rules and concluded that it’s fine to stack multiple sources so we won’t need to make sure only the highest spell effect should apply if multiple are present.

But then the first beta test feedback comes back, and the most common comment is that the UI doesn’t show the breakdowns and it’s impossible to understand why a given stat sums up to a given value. Or maybe QA complains that it’s hard to test whether or not the buff system actually works and requests at least a debug tooltip explaining how the value was calculated.

So… does it mean we should always keep track of breakdowns in our stats caching system because we’ll always need to at least display it in some form? From my experience, I’d argue mostly yes, although I’ve considered cheating once or twice. Given how expensive the whole system can end up being², one could consider having the UI breakdown code path bypass the caching system and redo the calculation on the fly, since we usually only display a handful of breakdowns per frame. The downside is that the displayed value can end up slightly out of sync with the cached value that is actually used by the tick updates, which can confuse players and testers alike. If you can afford that, not tracking buff sources in the cache will save precious CPU cycles (and memory).

The one thing I definitely don’t recommend is forcing a cache recalculation on the fly (with breakdowns) when the UI needs it. I have fixed several out-of-sync MP bugs in my career because of it. And even for single-player only games, this introduces the ability for the UI to write to the gamestate on the fly, which, if you’ve followed my previous articles, you should know I’m very much against.

Pilot

With that out of the way, let’s talk about how this article got started. I recently had the opportunity to consult for Galactic Starfish on their upcoming title Strategeist. One of the things I noted is that, much like every Grand Strategy Game I know of, there was a performance issue with their modifier system.

They kindly allowed me to use their source code as a “real-world” benchmark to test out various implementations for the purpose of this article³. I wouldn’t take the current numbers as any indication of what the final release might eventually look like; they are only given to illustrate the impact of various implementation choices.

So, let’s start with our baseline implementation:

using StatID = uint32_t;
using BuffSource = uint64_t;

struct Stats
{
    struct Entry
    {
        double value;
        std::unordered_map<BuffSource, double> sources;
    };

    std::unordered_map<StatID, Entry> entries;
};

Pretty straightforward so far. In the case of Strategeist there are at the moment 480 unique stats in the game. Grand Strategy games usually get into the thousands. Those should probably be split up in categories (by which final game object classes they can apply to), but still that’s probably a good representative number.
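For readers who want to picture how such a structure gets used, here is a hedged sketch of adding and removing a contribution while keeping the cached total in sync. The helpers add_source and remove_source are hypothetical, written for this illustration, and are not Strategeist code:

```cpp
#include <cassert>
#include <cstdint>
#include <unordered_map>

using StatID = std::uint32_t;
using BuffSource = std::uint64_t;

struct Stats
{
    struct Entry
    {
        double value = 0.0;
        std::unordered_map<BuffSource, double> sources;
    };

    std::unordered_map<StatID, Entry> entries;
};

// Hypothetical helper: record a contribution and update the cached total.
void add_source(Stats& stats, StatID stat, BuffSource source, double amount)
{
    Stats::Entry& entry = stats.entries[stat];
    const bool inserted = entry.sources.emplace(source, amount).second;
    assert(inserted && "each source contributes once per stat in this sketch");
    (void)inserted;
    entry.value += amount;
}

// Hypothetical helper: drop a contribution, and the entry once it is empty.
void remove_source(Stats& stats, StatID stat, BuffSource source)
{
    auto entry = stats.entries.find(stat);
    if (entry == stats.entries.end())
        return;
    auto it = entry->second.sources.find(source);
    if (it == entry->second.sources.end())
        return;
    entry->second.value -= it->second;
    entry->second.sources.erase(it);
    if (entry->second.sources.empty())
        stats.entries.erase(entry);
}
```

Every emplace and erase on either map is a node allocation or deallocation, which is worth keeping in mind for what follows.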

Now as I mentioned before, this implementation was fairly expensive to run. On my i7-10700, I measured the cache recalculation of each in-game country (the actor the player controls) at around 163.7us per run. Of course this is done for each country in the game, and if you’re familiar with map games you already know that there are hundreds of them in the game.

While there are probably things that can be improved in the calculations themselves, for the rest of this article I want to focus on how much could be achieved by just changing the underlying storage.

So you’re like a good demon? Bringing allocations?

While I first used Unreal Insights as an instrumentation profiler to see the modifier update stand out in the tick profile, switching to a sampling profiler (even one as basic as the one embedded in Visual Studio) was enough to confirm my suspicions: using those containers sparked a lot of allocations to the point that most of the update was spent in new and delete. Unreal’s TMalloc is better than the default malloc provided by Windows’ CRT, but it was still a major issue.

As far as associative containers go, std::unordered_map isn’t great. It’s better on MSVC than on Clang/GCC due to the way it avoids modulo but it’s still not great. Especially because the issue here isn’t lookup time, it’s insertions/removals that trigger a bunch of new and delete. This is due to the fact that std::unordered_map is basically implemented as a std::vector of std::list, with each entry being its own heap-allocated std::pair.

Looking at usage patterns by going through the code I noticed that we barely make use of the inner std::unordered_map in a way that matters. The need for O(1) lookup of a given BuffSource in a given entry is rare, mostly limited to UI. Most entries will only have a handful of sources, with the maximum case being maybe a dozen or two.

As it turns out, modern CPUs are actually quite good at brute-force linear search through small arrays. We could turn the inner std::unordered_map into a std::vector and use std::find_if to find the right pair. If we really needed it, we could even make it two vectors (one for keys, one for values), which could turn our std::find for a given key into a handful of SIMD uint64_t compares.
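As a rough sketch of what the vector-of-pairs variant could look like (the helper names find_source, set_source and remove_source are mine, for illustration only):

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <utility>
#include <vector>

using BuffSource = std::uint64_t;
using Sources = std::vector<std::pair<BuffSource, double>>;

// Linear search; returns sources.end() when the source is absent.
Sources::iterator find_source(Sources& sources, BuffSource source)
{
    return std::find_if(sources.begin(), sources.end(),
        [source](const std::pair<BuffSource, double>& s) { return s.first == source; });
}

// Insert the contribution of a source, or overwrite it if already present.
void set_source(Sources& sources, BuffSource source, double amount)
{
    auto it = find_source(sources, source);
    if (it != sources.end())
        it->second = amount;
    else
        sources.emplace_back(source, amount);
}

// Remove by overwriting with the last element: order is lost, nothing shifts.
bool remove_source(Sources& sources, BuffSource source)
{
    auto it = find_source(sources, source);
    if (it == sources.end())
        return false;
    assert(!sources.empty());
    *it = sources.back();
    sources.pop_back();
    return true;
}
```

The swap-and-pop removal assumes the order of sources doesn’t matter to the game logic; if the UI expects a stable order, a plain erase() is the safer choice.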

Making our sources into a std::vector<std::pair<BuffSource, double>> already yields quite impressive results: 91.8us. It also lowers the memory footprint (vector is more memory efficient than unordered_map for the same size/capacity).

Trying out Unreal types

Since we’re using Unreal Engine 5, we might as well try out their own vector alternative. After all there’s always people in my comments telling me that the STL is bad and every gamedev™️ should rewrite it even on a solo project. I know I’m being cheeky but this is a recurring thing I’ve been dealing with for years. In my book the bare minimum Unreal should do to win me over would be to work as a drop-in replacement of std::vector (like EASTL does), but sadly it doesn’t. Iterators are not copyable for some reason (which makes it unusable with most of <algorithm>), and none of the method names match the STL besides begin() and end() (if they exist at all).

Either way, let’s bite the bullet and try replacing our origins once more with TArray<std::pair<BuffSource, double>>.

The result gives us 86.6us on average. That is 5us better. This is not bad, but I’m not sure it’s worth the hassle especially when the main performance improvement can be gained without it.

See, the main reason why this is faster is that Unreal’s TArray allocates at least 4 elements when it grows from 0 capacity (unless you set a flag asking it to be as conservative as the STL). This avoids the common pitfall with std::vector that it reallocates when going from capacity 0 to 1, then 1 to 2, then 2 to 3, and (if unlucky) from 3 to 4 due to the way the geometric growth factor math works. I agree with Unreal devs that for most cases this is probably a better strategy. In particular, I would love if the STL had a basic heuristic to avoid the 0 -> 1 -> 2 -> 3 -> 4 reallocs under a certain element size. Worst case scenario, it can be added on our side with a simple if (v.capacity() == 0) v.reserve(4); check.
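Here is that last check wrapped in a small helper, as a sketch. The name emplace_back_min4 and the choice of 4 as the minimum are assumptions for illustration:

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical helper: skip the 0 -> 1 -> 2 -> 3 -> 4 growth steps
// by reserving a small block on the first insertion.
template <typename T, typename... Args>
T& emplace_back_min4(std::vector<T>& v, Args&&... args)
{
    constexpr std::size_t MinCapacity = 4;
    if (v.capacity() == 0)
    {
        v.reserve(MinCapacity);
        assert(v.capacity() >= MinCapacity);
    }
    return v.emplace_back(std::forward<Args>(args)...);
}
```

With this, the first four insertions into an empty vector share a single allocation, whatever growth factor the standard library uses afterwards.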

Either way, there’s something even better than 1 allocation for small size. That’s no allocation. This is usually called a small vector (because it does Small Buffer Optimization, like std::string). They aren’t in the STL (sadly) but can be found in common libraries like Abseil and Boost. Or since we’re using Unreal, we can use their TInlineAllocator with our TArray:

using Sources = TArray<std::pair<BuffSource, double>, TInlineAllocator<4>>;
Sources sources;

This brings us down to 69us per update by avoiding allocations entirely for origins for most entries, unless they have a lot of buffs associated with them, in which case the geometric allocation formula kicks in, making sure we only (re)allocate 1 or 2 times as we grow the array. We’re already twice as fast as our baseline!
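For those not on Unreal, the idea behind a small vector can be sketched in standard C++. This is a deliberately simplified illustration (the SmallVector class below is mine, requires a default-constructible and copyable T, and is nowhere near what Abseil, Boost or TInlineAllocator actually do internally):

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <vector>

// Simplified small vector: the first N elements live inside the object,
// and the heap is only touched once that inline buffer overflows.
template <typename T, std::size_t N>
class SmallVector
{
public:
    void push_back(const T& value)
    {
        if (!on_heap_ && size_ < N)
        {
            inline_[size_++] = value;
            return;
        }
        if (!on_heap_)
        {
            // First overflow: copy the inline elements to the heap once.
            heap_.reserve(N * 2);
            heap_.assign(inline_.begin(), inline_.end());
            on_heap_ = true;
        }
        heap_.push_back(value);
        ++size_;
    }

    const T& operator[](std::size_t i) const
    {
        assert(i < size_);
        return on_heap_ ? heap_[i] : inline_[i];
    }

    std::size_t size() const { return size_; }
    bool on_heap() const { return on_heap_; }

private:
    std::array<T, N> inline_{};
    std::vector<T> heap_;
    std::size_t size_ = 0;
    bool on_heap_ = false;
};
```

Production implementations use uninitialized storage and share the bookkeeping between the inline and heap cases, but the trade-off is the same: a bigger object in exchange for zero allocations in the common case.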

Arrays, arrays everywhere!

Now you may be thinking: “well if the inner container is faster as an array, shouldn’t we also do it for the outer container?”. Congrats, I thought the same. Because that’s what’s been done on games like Europa Universalis IV and Hearts of Iron IV. Except there’s a catch.

static constexpr std::size_t MaxStats = 480;
std::array<Entry, MaxStats> entries;

… and our average stats recalculation becomes 1142.8us. One entire millisecond per actor. With several hundred actors we’re looking at a tick that may take up to a whole second. It’s a disaster! What’s gone wrong?

The issue is that the calculations make use of temporary Stats variables before adding them to the main country stats. Each time they allocate about a hundred kilobytes. Even if they carry only a few values with one source each. While this probably indicates an over-reliance on temporary variables that could use a re-architecture, it does illustrate a point. Maybe not all stats need to use a large (but mostly empty) storage. Some systems and game objects with few actual entries would probably still benefit from the sparse storage offered by unordered_map.
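The size problem is easy to reproduce in isolation. The sketch below uses a plain std::vector for the inner container, so the absolute numbers are smaller than with inline storage, and the helper inline_bytes_for_one_value is made up for the demonstration:

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <unordered_map>
#include <utility>
#include <vector>

using StatID = std::uint32_t;
using BuffSource = std::uint64_t;

struct Entry
{
    double value = 0.0;
    std::vector<std::pair<BuffSource, double>> sources;
};

constexpr std::size_t MaxStats = 480;

struct DenseStats  { std::array<Entry, MaxStats> entries; };
struct SparseStats { std::unordered_map<StatID, Entry> entries; };

// Every DenseStats, temporaries included, pays for all 480 slots up front.
static_assert(sizeof(DenseStats) >= MaxStats * sizeof(Entry));

// Hypothetical temporary, as calculation code might build one:
// a single stat carrying a single value.
template <typename StatsT>
std::size_t inline_bytes_for_one_value()
{
    StatsT temp;
    temp.entries[3].value = 1.0;
    return sizeof(temp);
}
```

The dense version has to construct (and later destroy) all 480 entries to carry one value, while the sparse one only pays for the map header plus one node.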

So what if we left the storage strategy up to the user? After all, they probably know better which is best in a given corner of the codebase.

using ArrayEntries = std::vector<Entry>;
using MapEntries = std::unordered_map<StatID, Entry>;
std::variant<MapEntries, ArrayEntries> entries;

That way, the default behaviour is to use the sparse map storage that is optimized for objects that only carry a few active stats, while bigger objects (like countries) can switch to the array version. Note that we switched from array to vector to ensure the size of an empty Stats object remains minimal. We will allocate once for objects we switch to the vector variant, a perfectly acceptable cost that is paid only once upon init/construction.
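A minimal sketch of how access through such a variant could look, using standard containers only. The member functions use_dense_storage, is_dense and entry are hypothetical names for this illustration:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <unordered_map>
#include <utility>
#include <variant>
#include <vector>

using StatID = std::uint32_t;
using BuffSource = std::uint64_t;

struct Entry
{
    double value = 0.0;
    std::vector<std::pair<BuffSource, double>> sources;
};

using MapEntries = std::unordered_map<StatID, Entry>;
using ArrayEntries = std::vector<Entry>;

struct Stats
{
    // Defaults to the first alternative: the sparse map.
    std::variant<MapEntries, ArrayEntries> entries;

    // Opt in to dense storage, typically once at construction for big actors.
    // Note: this sketch discards whatever was stored before the switch.
    void use_dense_storage(std::size_t stat_count)
    {
        entries.emplace<ArrayEntries>(stat_count);
    }

    bool is_dense() const { return std::holds_alternative<ArrayEntries>(entries); }

    // Returns the entry for a stat, creating it in the sparse case.
    Entry& entry(StatID stat)
    {
        if (auto* dense = std::get_if<ArrayEntries>(&entries))
        {
            assert(stat < dense->size());
            return (*dense)[stat];
        }
        return std::get<MapEntries>(entries)[stat];
    }
};
```

The branch in entry() is cheap and very predictable since a given object rarely changes storage after init; a std::visit would work too but felt heavier for two alternatives.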

The results actually show a huge jump in performance: 45.1us. We’re now almost 4 times faster. Not only are lookups/inserts much faster, but also since we now have a vector as base we can make sure we don’t free any memory upon clear. The origins array under each Entry will never need to allocate for most countries after one tick, because we will keep the capacity untouched. This is one of the big advantages of dense arrays as they can easily preserve inner container allocations (it’s not impossible to do with maps but you would need a custom allocator that reuses freed entries and keeps their origins array allocated).
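To illustrate the “don’t free on clear” part, here is a sketch with standard containers. The function reset_for_recalculation is a hypothetical name, and it relies on std::vector::clear() leaving the capacity untouched:

```cpp
#include <cassert>
#include <cstdint>
#include <utility>
#include <vector>

using BuffSource = std::uint64_t;

struct Entry
{
    double value = 0.0;
    std::vector<std::pair<BuffSource, double>> sources;
};

// Reset every entry in place before a recalculation. clear() destroys
// the elements but keeps the allocation, so the next tick can refill
// `sources` without going back to the allocator.
void reset_for_recalculation(std::vector<Entry>& entries)
{
    for (Entry& entry : entries)
    {
        entry.value = 0.0;
        entry.sources.clear();
    }
}
```

Doing entries.clear() on the outer vector instead would destroy every Entry and free all the inner buffers with them, which is exactly what we are trying to avoid.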

Better maps?

We’ve been saying unordered_map isn’t very good, so what if we tried something else? Continuing the Unreal Engine theme, I considered using TMap but sadly I found the API really terrible, especially when trying to replace unordered_map for a quick test. Instead, I decided to use ska::bytell_hash_map, by the author of the previously linked C++Now talk on hash tables. If you’re curious about more options I found this article offers a good overview.

Since those are a drop-in replacement for std::unordered_map (mostly, remember inserts invalidate iterators 😉), it’s easy to try out:

using MapEntries = ska::bytell_hash_map<StatID, Entry>;

This brings us our best timing of the whole experiment: 42.4us. The small scale of the improvement is mostly due to the fact that our country Stats container is using the array variant, so it only improves the temporary variables and friends used in the recalculation code.

If we switch off the array storage for countries and use the bytell_hash_map in all cases, we still get an honorable 53.43us. It also gives us a hint about the potential improvements we could get by improving our calculations’ usage of Stats outside of the ones stored inside the country.

Again all those numbers are given to illustrate the difference container choices can make without changing anything else in our stats/buff system. The relative ratio is probably a bit off because the baseline calculations could be improved⁴.

Final numbers

During this little experiment I’ve tried many combinations. I’ll leave CPU and memory usage (size of Stats per country, including sub allocations) for reference:

V1 (baseline) (unordered_map + unordered_map): 163.7us / 17918 bytes
V2 (unordered_map + vector): 91.8us / 11629 bytes
V3 (unordered_map + TArray): 86.6us / 18645 bytes
V4 (unordered_map + TArray Inline): 69us / 18291 bytes
V5 (vector + TArray Inline): 1142.8us / 994224 bytes
V6 (variant + TArray Inline): 45.1us / 45096 bytes
V7 (variant with bytell_hash_map + TArray Inline): 42.4us / 45096 bytes
V8 (bytell_hash_map + TArray Inline): 53.4us / 25086 bytes
V9 (bytell_hash_map + TArray): 71.4us / 19558 bytes
V10 (variant + TArray): 44.8us / 25314 bytes

By the way, I have not mentioned multithreading so far for a reason. While stats cache update is definitely something that can trivially be thrown at a parallel_for for each actor, I wanted to focus on single-core performance for the purpose of this article. Especially because, for those out there who are still using the default malloc() implementation from MSVC, you will feel the pain of trying to parallelize an operation that is mostly bound by allocations. As I mentioned in my talk last year, the default allocator uses mutexes which will make all your numbers explode. If your application doesn’t already have a custom general purpose memory allocator like Unreal does, consider switching to mimalloc⁵.

And with that being said, see you next time!

¹: This may or may not be inspired by a game I previously worked on

²: Both Stellaris and Victoria 3 were at some point or another bound by how fast they can recalculate stats/modifiers

³: In general it’s way too rare that game companies share any code outside of a few outliers. If you’re reading this and are in a position to change this, please encourage your company to open-source past titles for the purpose of education if anything.

⁴: I’ve improved performance on several live GSG titles over the years by removing temporary copies of Stats, believe me when I say it can make quite the difference.

⁵: Yes I know, mimalloc is made by Microsoft. Ironic, don’t you think?

Can we finally use C++ Modules in 2026?

2026-04-13T00:00:00+00:00

Every 6 to 12 months, I try to use C++ modules, run into a hurdle, maybe rant about it on social media, then move on to something else. Despite watching multiple talks on the topic, there’s always something that gets in the way. My biggest success so far has been managing to use the VulkanHpp module in my renderer library, after which things started breaking down. But after making some progress again last week (and running into new hurdles), I feel like I have enough to make a proper summary.

As a disclaimer, I’d like to mention that I have shared some of my conclusions on the state of modules with fellow C++ programmers and they didn’t all agree with them. However, I believe that modules suffer from a strong “expert bias” problem that makes a lot of counterpoints read like “on my machine it works” to people like me who haven’t had a lot of exposure to them and didn’t follow the standardization closely. I do not presume to be a subject matter expert on the topic, but I know build systems and I believe I have spent much more time trying to fiddle with modules on my projects than the average C++ programmer, so I think this piece can speak for the average enthusiast user (or would-be user, more like).

Oh and I mostly focus on MSVC. I might throw a quick mention of Clang or GCC but my experience is mostly on Windows.

The easy parts

Contrary to what you may have heard, the simple use cases are fairly easy to make work, provided you stay within a strict set of limitations. For example, as I mentioned before I used the module provided by VulkanHpp in my rendering library and it works just fine. Or more precisely, it used to work, until they changed something upstream that ran into the set of limitations I alluded to. We’ll get back to the details later. In the meantime, here’s what it looks like in my CMake:

add_library( VulkanHppModule )
target_sources( VulkanHppModule PRIVATE
  FILE_SET CXX_MODULES
  BASE_DIRS ${Vulkan_INCLUDE_DIR}
  FILES ${Vulkan_INCLUDE_DIR}/vulkan/vulkan.cppm
)
target_compile_definitions( VulkanHppModule PUBLIC
  VULKAN_HPP_NO_SETTERS
  VULKAN_HPP_NO_CONSTRUCTORS
)
target_link_libraries( VulkanHppModule PUBLIC Vulkan::Vulkan )

I didn’t even have to come up with those lines myself, they were given by the project’s documentation. All I really needed to customize was the compile definitions (in this case I disabled setters and constructors to instead rely on C++20 designated initializers).

And there it worked, I could just do import vulkan_hpp in my renderer library and use Vulkan’s C++ bindings. Had I not managed to make it work, I would probably have gone back to Vulkan’s C API with my own custom RAII wrappers, because the compile times with standard #include were atrocious. This also worked recursively (again with limitations to be explained later), meaning my renderer library could have the import of VulkanHpp in its public headers and it would pass on just fine to the projects that #include those headers.

You may have read that CMake takes a bit of hacking to make modules work, that you have to use esoteric flags such as CMAKE_CXX_SCAN_FOR_MODULES, CMAKE_EXPERIMENTAL_CXX_MODULE_DYNDEP or CMAKE_EXPERIMENTAL_CXX_MODULE_CMAKE_API but none of those are needed at the moment, provided that you use a recent version of CMake (ideally 4.x but the defaults should be on starting with 3.28).

So there it was, with little work I had replaced the agonizing 9 seconds it took to include VulkanHpp with a negligible number of milliseconds. I consider this a solid win. Now comes the trouble…

IntelliNonSense

So here’s a fun fact for you: you can find meeting minutes from SG15 dating from 2019 where Microsoft claims that they have modules working just fine internally for the Edge team. And yet if you open a project that uses modules with Visual Studio 2026 you get greeted with this amazing message:

C++ IntelliSense support for C++20 Modules is currently experimental.

Yup. It’s been 7 years since and they still can’t get IntelliSense to properly parse import directives. I know that the language server is based on EDG and not VC++, but frankly I don’t care. This is a company worth almost 3 trillion dollars at the time of writing telling us that they can’t make a feature work a decade after they pushed for modules to be standardized based on their in-house success story. I don’t know if they exaggerated their claims at the time, or if they didn’t properly fund the Visual Studio team since or what, but you can’t tell me 7 years wasn’t enough to make syntax highlighting work with modules. And if it is, then maybe there was something deeply wrong in their proposal and the committee should have asked to see the receipts before voting yes.

Anyways, here’s how you solve it:

#if defined( __INTELLISENSE__ )
#include 
#include 
#else
import vulkan_hpp;
#endif

That keeps your compiler (and iteration time) on the module fast path, and then IntelliSense can chug along parsing header files in the background so you get highlighting and autocompletion. Is it a hack? Absolutely. But it’s a hack I’ve been using for 6 months that allows me to focus on something else.

And with that out of the way, we can talk about the real problem.

Modules are viral all-or-nothing

I have hinted in previous sections that modules work if you stick to some strict limitations. Trouble is, those aren’t small limitations. Mainly, modules are kind of an all or nothing situation. If you start using a library through import directives, you can’t have the same translation unit pull it through #includes. And that quickly becomes a problem.

Here’s the simplest example that explains it:

// Works, obviously
#include 

// Works even if the header above is included before and is part of the std module
import std;

// Error, will yield a million "xxx already declared" failures
#include  

Simply put, a library can both be imported and included as long as #include comes first and import comes second. I’m still not sure if this is mandated by the standard or an implementation limitation, but it’s something I’ve observed directly on MSVC and heard mentioned by others too.

In my previous use case this was fine, because VulkanHpp is only imported by my renderer library, doesn’t import anything itself, and isn’t used anywhere else in my build tree. Sadly, things took a turn for the worse when the recent release started pulling the standard library by doing import std. Because suddenly, there’s a transitive dependency that imports a very common library, so now I have to make sure my import vulkan_hpp directive comes after any other #include of the standard library. And since vulkan_hpp is used publicly in my renderer library, now my renderer library also needs to always be imported last in every translation unit. Else I get a billion redeclaration/redefinition compile errors.

“Just move to modules”

The preferred solution, I’m told, is to move everything to modules. Or at least, if one library starts doing import std, patch every other library I use to only do import std. In the case of my toy project, that would mean at least TBB and fastgltf. Ironically, it doesn’t seem to impact C++ libraries that only rely on the C standard library (I believe it would if I did import std.compat?). It’s a sad affair that this vindicates library authors who refuse to use the STL.

Note that I said patch, not just flip a switch. Because despite C++20 being 6 years old, barely any C++ libraries come with a module definition. Boost only offers modules for a few select libraries. The claims I read of Catch2 providing a module seem to have been nothing but an AI hallucination. The only big one I could find is fmt, which is a nice library but honestly if you have C++20 support you already have <format> available anyway.

And of course, each library that decides to support modules needs to provide some form of dual build because not all their clients use modules yet. And for each of their own dependencies, they need to decide if they pull them through #include, import or let the user configure it (my current opinion is that the module version should always use import and not provide a switch to avoid combinatorial hell).

Supporting dual-build

Next I’ve tried supporting dual build for my renderer lib and it’s not entirely a trivial affair.

First, as suggested before you need to toggle includes to imports when building/parsing in module mode. That usually means adding a define and doing a little dance around each #include directive:

#ifndef RENDERER_MODULE
#include 
#include 
#include 
#else
import std;
#endif

For libraries that are one single header-only implementation this isn’t the worst, but for more complex libraries made of multiple .cpp and .h files it becomes a bit more of an easter-egg hunt. In my current POC branch I ended up ripping all #include directives out and putting them all in one file that I can toggle on/off between the module and the non-module path. This makes the build slower without modules, because now all my translation units are pulling a bunch of headers that they don’t personally need (looking at you, 😠).

Then, we have to handle the fact that module directives cannot be #ifdef‘d out. By design. I’m not certain why that is, but it is a hard error as per the standard. Which means if you have a .cpp implementation file, you cannot use #ifdef and friends to conditionally declare it as a part of a module. That leaves three options: a hack, another hack or always building your library as a module.

Let’s start with the first hack. I don’t like it, but it kind of shows the futility of trying to restrict #ifdef in the spec. Because that restriction doesn’t apply to #include. So we can just bypass it by duplicating every implementation file:

// device_module.cpp
module renderer;
#define RENDERER_MODULE
#include 

This implies using a different set of .cpp files depending on whether you build as a module or not, and having an extra glue file for every implementation file, but it works. Alternatively, a suggestion by Daniela Engert was to entirely discard the separate compilation of all the .cpp files and instead pull them all in the module :private; section of the module definition with #include directives:

export module renderer;
export {
    #include 
}
module :private;
#include 
#include 
#include 
#include 
// ...

Some of my readers may object “but that would put all implementation in the same translation unit, like unity builds”. That would be correct. Which is why I would rather not use that solution either. I have had to deal with unity builds in the past and still consider them a hack that breaks the traditional expectation around static and namespace {}.

Almost Always Modules?

Instead, I’ve opted to always build my library as a module. That way, I can put module declarations in my .cpp files without issues. The trick is to use C++20’s extern "C++". In the same way that names declared with extern "C" will use backward compatible C linkage and name mangling, wrapping export {} declarations with extern "C++" generates symbols using an ABI compatible with #include declarations (the default with modules is to decorate every symbol with its module name, which makes it impossible to find by the linker in non-module contexts).

export module renderer;
// Don't mangle as a module for backward compatibility with non-module includes
extern "C++"
{
    export {
        #include "renderer/renderer.h"
    }
}

That way, the library doesn’t need to build differently for consumers using import vs #include. This is obviously only an issue for libraries that produce exported symbols. Header-only libraries do not need to bother with it.

Having only one build means the library doesn’t exercise its own #include variant anymore. You are advised to keep a few tests around that use the library both through the import and the #include path for as long as you support both (which I suspect is gonna be a while given modules’ adoption rate).

So, should I use modules?

There’s a big upfront cost to switch to modules. Having to switch all your dependencies to modules is some amount of work and there’s sadly little support from library maintainers at the moment. Even the people who report using modules seem to be using forks of their third party libraries at the moment. I do not know if they didn’t feel like contributing/maintaining patches, or if they submitted patches that got rejected, but this isn’t very encouraging. Polls from Meeting C++ do not show a high adoption rate for a 6 year old feature. It might be a chicken and egg problem (no one switches to modules due to lack of library support, library maintainers don’t bother due to lack of modules users).

I am considering contributing patches for the libraries I use, but I admit even after writing this article I still feel a bit of an imposter syndrome and wonder if my contribution would be any good. There’s so little expertise, experience and literature around modules out there that it’s not obvious what is and isn’t good practice. I’ve figured out the point of the new keywords mostly by trial and error, which makes me suspect most projects won’t have a qualified reviewer to see if a proposed patch is good.

In the meantime, the easy way out is to do like I did initially with VulkanHpp and keep module usages to libraries that are heavy to parse but easy to keep last in the #include/import path for a quick win, but sadly it breaks down quickly at scale due to the viral factor.

Addendum: Jens Weller mentioned to me the existence of Are We Modules Yet?, a website that lists which projects provide modules. Funny enough fastgltf provides a module, it’s just not built or installed by vcpkg which means I didn’t see it. I think libraries should always add module definitions to their install list rather than put it behind a build setting so it doesn’t become a package manager problem.

You’re absolutely right, no one can tell if C++ is AI generated

2026-03-30T00:00:00+00:00

A tweet has been making the rounds over the weekend after escaping the C++ community containment. It offers 2 different ways of handling a somewhat classic “insert or return existing” associative container problem. The author claims one was made with AI and the other hand written. They’re both bad, but they make for a good interview question. And also a deeper discussion about AI generated code. Let’s delve (wink, wink) into it!

The two options

Here’s the original post:

Same C++ function.

One is generated with AI.
The other one is written manually.

Guess which one is which. pic.twitter.com/LnyAfmsnJJ
— Dmitrii Kovanikov (@ChShersh) March 29, 2026

I’ll reproduce both options in text for better accessibility:

// Option 1 (left picture)
Node* get_or_create(Nodes& nodes, std::string_view name) {
    auto it = nodes.data.find(name);
    if (it != nodes.data.end()) {
        return it->second.get();
    }

    auto node = std::make_unique<Node>();
    node->name = name;

    Node* node_ptr = node.get();
    nodes.data.try_emplace(name, std::move(node));
    return node_ptr;
}

// Option 2 (right picture)
Node* get_or_create(const string& name) {
    if (!nodes.count(name)) {
        nodes[name] = make_unique<Node>();
        nodes[name]->name = name;
    }

    return nodes[name].get();
}

So which is AI? And which is better? I partially gave up the answer already saying they were both bad, but say you had to pick one, which would it be? And why?

The author didn’t explicitly say at the time of writing which one is AI, but they gave hints pointing at #2. Assuming they are the author of one of them and not just trying to make a shitpost (dangerous assumption in those trying times, I know), that would seem like the reasonable answer.

The second one, after all, is non-idiomatic C++. That may surprise some readers depending on their experience, but use (or should I say, ‘abuse’) of operator[] on associative container types (think map and friends) is usually discouraged. And for good reason. After all, the second version will run about twice as slow as the first one.

Performance analysis

Each use of the square bracket operator on a map (even unordered_map and flat_map) performs a lookup. That’s a logarithmic operation on map and flat_map, and constant “on average” on unordered_map (meaning it’s usually constant but linear in the worst case). count is also a lookup in disguise, usually the equivalent of calling find and returning it == end ? 0 : 1.

That brings us to a total of 2 lookups if a node already exists, and 4 if it needs to be created. That’s obviously very bad.

The first example only does one or two lookups through find() and try_emplace(). That’s twice as good. Also, it doesn’t seem to rely on nodes being a global variable. It also uses string_view over const string&, which is better because APIs taking string tend to generate a ton of temporary heap allocations to convert from const char* and string literals.
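One way to check those lookup counts rather than take them on faith is to plug a counting comparator into std::map. The sketch below is only an illustration: CountingLess, CountedNodes, option1, option2 and comparisons_for are names made up for the occasion, and both options are adapted to take the container explicitly and a const std::string& key so they can share the same map type.

```cpp
#include <cstddef>
#include <map>
#include <memory>
#include <string>
#include <utility>

struct Node
{
    std::string name;
};

// Hypothetical comparator that counts every key comparison the map performs
struct CountingLess
{
    static inline std::size_t calls = 0;

    bool operator()(const std::string& a, const std::string& b) const
    {
        ++calls;
        return a < b;
    }
};

using CountedNodes = std::map<std::string, std::unique_ptr<Node>, CountingLess>;

// Option 1, adapted: find() first, try_emplace() only on a miss
Node* option1(CountedNodes& nodes, const std::string& name)
{
    auto it = nodes.find(name);
    if (it != nodes.end())
    {
        return it->second.get();
    }

    auto node = std::make_unique<Node>();
    node->name = name;

    Node* node_ptr = node.get();
    nodes.try_emplace(name, std::move(node));
    return node_ptr;
}

// Option 2, adapted: count() plus one operator[] per access
Node* option2(CountedNodes& nodes, const std::string& name)
{
    if (!nodes.count(name))
    {
        nodes[name] = std::make_unique<Node>();
        nodes[name]->name = name;
    }

    return nodes[name].get();
}

// Creates 100 nodes, fetches them all again, and reports the comparisons used
std::size_t comparisons_for(Node* (*get_or_create)(CountedNodes&, const std::string&))
{
    CountedNodes nodes;
    CountingLess::calls = 0;
    for (int pass = 0; pass < 2; ++pass)
    {
        for (int i = 0; i < 100; ++i)
        {
            get_or_create(nodes, "node" + std::to_string(i));
        }
    }
    return CountingLess::calls;
}
```

Comparing comparisons_for(option1) with comparisons_for(option2) should show the second one doing noticeably more work. The exact ratio depends on the standard library implementation, so treat the numbers as indicative only.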

So which one is AI? Probably #2, because #1 shows signs of trying to avoid some common junior pitfalls, albeit with a clunky implementation. The second looks more like the work of someone who came from Python or another language and tried to write C++ instead.

So, case closed? The right one is naive AI code and the left one is senior C++ code which is why it looks unreadable to people not already quite familiar with C++ (ironically some replies assumed this is the AI variant because it looks so busy). Or is it?

Please review your own homework, make no mistakes

I do not have access to any premium AI services and I very rarely use them, but I couldn’t resist asking one of them for review. So here’s what ChatGPT has to say about it:

Uh oh…

It guessed that the first one is likely AI generated because it’s clunky and over engineered, and the second one would be made by humans because it’s more readable.

Before you go “Ah-ah, AI thinks the way to tell AI code from human code is to look at which one is bad because it knows AI is bad at code”, I need to do a short digression. AI doesn’t “think”. AI doesn’t “know”. LLMs are text prediction machines that reflect their training data. All we can tell from this is that it’s likely that the majority position on AI generated code is that it’s clunky and over engineered. And that humans like to write inefficient code that does four lookups when only one is necessary. Now I’m wondering how bad the average codebase in its training data is. Or maybe again it’s an assumption stemming from people writing that the average codebase is bad.

Interestingly, it then offers this version:

Again, reproducing the code for ease of use and accessibility:

Node* get_or_create(const std::string& name) {
    auto [it, inserted] = nodes.try_emplace(name, nullptr);

    if (inserted) {
        it->second = std::make_unique<Node>();
        it->second->name = name;
    }

    return it->second.get();
}

I have to say, this is clean C++17, and looks better than both the original versions. It clearly focused on limiting the number of lookups to the optimal one (a single try_emplace) and wrote what I’d consider to be idiomatic modern C++. Almost.

But then I noticed it picked string over string_view. And used a global variable. Two things that made us guess the second snippet was AI generated, while ChatGPT considered it as “not perfectly optimal but clean and readable, very human”. Is it very human to use global variables? To not have switched to string_view despite the fact that it was added to C++ nine years ago?

Now is as good a time as any to remind the reader that using AI to detect AI generated code (or text) is a waste of time and resources. First because it’s extremely unreliable, and second because figuring out which of the two is AI generated is beside the point. The important thing is that both are bad for different reasons and while ChatGPT seems (at least partially) able to point out why, it is too obsequious to challenge our framing device and instead gives us a made-up summary of what makes code “human”, written in the style of a LinkedIn influencer post (bonus question to take home: do LinkedIn posts look like this because they all use AI, or does AI look like this because it’s trained on LinkedIn posts?).

So, AI Generated Code Good?

Since the answer given by a free version of ChatGPT is better than both original snippets, I’m starting to suspect the original poster may have fudged the prompts to farm some engagement. But it still begs the question: “is AI good at writing experienced C++ code?”.

To which my answer is “no”, because by being too accommodating to the user (a common trait and failure of LLMs, as we just mentioned), it failed the first rule of engineering: “always ask ‘why?’”.

In this case: why is the value type in the container a unique_ptr? Because a lot of the clunkiness in both original snippets is due to the allocation and initialization of Node. Elements in maps are individually heap allocated, do we need that indirection? We can see that it being null doesn’t seem to be a valid case, as the first thing we do on every insert is call make_unique, which to me sounds like the assumption that it should always point to a valid Node. Can’t we use Node directly as the value type? And also set the name in the constructor while we’re at it:

struct Node
{
    // Ensure all nodes have a name by construction
    explicit Node(std::string s)
        : name(std::move(s)) {}

    std::string name;
};

using Nodes = std::map<std::string, Node>;

Node* get_or_create(Nodes& nodes, const std::string& name)
{
    return &nodes.try_emplace(name, name).first->second;
}

This is perfectly fine to use as is because std::map guarantees that nodes are stable. Key/value pairs are heap allocated individually, meaning you can keep pointers to them that remain valid even after inserting more elements. That also holds true for std::unordered_map (insertion may invalidate iterators, but not references or pointers to actual elements).
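For readers who want to convince themselves, here is a small sketch of that guarantee. The function pointer_survives_inserts is a made-up name for the illustration. It takes a pointer to one element, inserts a thousand more, then checks that the pointer still designates the same element.

```cpp
#include <map>
#include <string>
#include <utility>

struct Node
{
    explicit Node(std::string s)
        : name(std::move(s)) {}

    std::string name;
};

using Nodes = std::map<std::string, Node>;

// Hypothetical check: does a pointer taken early survive later insertions?
bool pointer_survives_inserts()
{
    Nodes nodes;
    Node* root = &nodes.try_emplace("root", "root").first->second;

    // Insert plenty of other elements; std::map never relocates existing nodes
    for (int i = 0; i < 1000; ++i)
    {
        const std::string key = "node" + std::to_string(i);
        nodes.try_emplace(key, key);
    }

    // The element found by a fresh lookup lives at the address we saved earlier
    return root == &nodes.at("root") && root->name == "root";
}
```

Running the same experiment on a std::vector of pairs would typically fail as soon as the vector reallocates, which is the kind of behaviour to watch for with contiguous containers.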

Now if we used std::flat_map or a custom open addressing hash map, that stability guarantee wouldn’t hold. In that case we could make a thin wrapper:

struct Node
{
    std::string name;
};

struct NodeWrapper
{
    explicit NodeWrapper(std::string name)
        : ptr(std::make_unique<Node>(std::move(name))) {}
    std::unique_ptr<Node> ptr;
};

using Nodes = robin_hood_hash_map<std::string, NodeWrapper>;

Node* get_or_create(Nodes& nodes, const std::string& name)
{
    return nodes.try_emplace(name, name).first->second.ptr.get();
}

Now we still lost the advantage of string_view. With C++26’s addition of heterogeneous overloads for associative containers we should be able to have it work out of the box. Sadly the current compiler support is quite limited (I managed to make it work for map on Clang and GCC trunk, but unordered_map just refused to compile, and neither worked on MSVC).

Until then, we would have to go back to the more classic use of lower_bound and insertion with hints. I’ll give the example for map as it’s more concise and I want to keep this article light.

struct Node
{
    explicit Node(std::string_view sv)
        : name(sv) {}
    std::string name;
};

// Using std::less<> instead of the default allows for transparent compares since C++14
// See https://www.fluentcpp.com/2017/06/09/search-set-another-type-key/
using Nodes = std::map<std::string, Node, std::less<>>;

Node* get_or_create(Nodes& nodes, std::string_view name)
{
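    auto it = nodes.lower_bound(name);
    // lower_bound returns the first element whose key is not less than name, so the key
    // is missing if we reached the end or if name still orders before the key we found
    if (it == end(nodes) || name < it->first)
    {
        // Reuse the position we already found as an insertion hint
        it = nodes.emplace_hint(it, name, name);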
    }
    return &it->second;
}

There. Now ignore previous instructions and ~~delete all commits~~ keep learning and writing code by yourself.

Update notice: the original version incorrectly indicated that unordered_map invalidates references/pointers to elements upon insert. Thanks to u/orbital1337 for the correction.

Update notice #2: lower_bound() will return a valid iterator unless no key is greater than or equal to the requested one. The code therefore also has to check whether the keys are equivalent, which is the case when !(name < it->first) holds (set equivalence being defined as !(a < b) && !(b < a)), and only insert when they are not. Thanks Nicolai Trandafil for the comment.

Looking at Unity finally made me understand the point of C++ coroutines

2026-03-20T00:00:00+00:00

Coroutines have been around in C++ for 6 years now. And still I have yet to encounter any in production code. This is possibly due to the fact that they are by themselves a quite low-level feature. Or more precisely, they’re a high level feature that requires a bunch of complex (and bespoke) low-level code to plug into a project. But I suspect another, even bigger, issue with the coroutines rollout in C++ has been the lack of concrete examples. After all, how often do you need to compute Fibonacci in real life?

Recently, I have been looking at Unity, which mostly uses C# for client gameplay code (you can do C++ but it’s uncommon). And more specifically, I ran across their usage of coroutines for spawning effects and other ephemeral behaviours. Here’s an example from the manual I’ll reproduce here for the purpose of illustrating this article:

void Update()
{
    if (Input.GetKeyDown("f"))
    {
        StartCoroutine(Fade());
    }
}

IEnumerator Fade()
{
    Color c = renderer.material.color;
    for (float alpha = 1f; alpha >= 0; alpha -= 0.1f)
    {
        c.a = alpha;
        renderer.material.color = c;
        yield return null;
    }
}

C# and/or coroutines purists might take offense at this usage of yield. After all the semantics are all wrong here. We’re yielding nothing where we’re trying to express something akin to await NextFrame(). From what I could read this is an artifact inherited from a lack of await support when they were initially added to C# (they only supported generator style yield), which led Unity to use this hack which is still around today. I am not only mentioning it as a random piece of historical trivia, this will become relevant later.

Why coroutines?

This example is still a bit basic and might not make it immediately apparent why we would prefer to write our effects this way. After all, this could be made into a simple lambda with a mutable alpha variable that we would nudge each call. But let’s try with a slightly more complex effect:

IEnumerator TimeWarp()
{
    // It's just a jump to the left
    transform.position += new Vector3(-1f, 0f, 0f);
    yield return null;

    // Then a step to the right
    for (int i = 0; i < 4; ++i)
    {
        transform.position += new Vector3(0.2f, 0f, 0f);
        yield return null;
    }

    // Put your hands on your hips
    // ...

    // Let's do the time warp again!
    for (int i = 0; i < 4; ++i)
    {
        transform.Rotate(0f, 90f * i, 0f);
        yield return null;
    }
}

Now it would become actually painful to turn this into a regular functor or lambda. Writing it in C++ turns it into some sort of ugly state machine like this:

class TimeWarp
{
    enum class State
    {
        Jump,
        StepRight,
        HandsOnHips,
        // ...
        DoAgain
    };

    State _state = State::Jump;
    int _i = 0;
    Transform* _transform;

public:
    TimeWarp(Transform& transform) : _transform(&transform) {}

    bool operator()()
    {
        switch ( _state )
        {
            case State::Jump:
                _transform->position.x -= 1.f;
                _state = State::StepRight;
                break;

            case State::StepRight:
                _transform->position.x += 0.2f;
                if ( ++_i == 4 )
                {
                    _state = State::HandsOnHips;
                    _i = 0;
                }
                break;

            // ...

            case State::DoAgain:
                _transform->Rotate(0.f, 90.f * _i, 0.f);
                if ( ++_i == 4 )
                {
                    // Indicate we're done
                    return true;
                }
                break;
        }
        return false;
    }
};

Pretty ugly, isn’t it? Would you let it pass code review? What else would you suggest instead?

I guess I would perhaps recommend the author split TimeWarp into its component moves and handle state transitions by queueing the next effect as a continuation. But I probably wouldn’t be happy about it.

This, to me, is the kind of no-brainer case I’ve been dying to see to be sold on the value of coroutines. Wrapping one loop might not be worth the hassle of figuring out how to integrate coroutines in your codebase, but wrapping a sequence of operations with state definitely is. It’s all about turning a hard to read state machine into a very simple function.

A C++23 implementation

So, let’s do the time warp again in C++ then.

std::generator<std::monostate> TimeWarp(GameObject& obj)
{
    // It's just a jump to the left
    obj.transform.position.x -= 1.f;
    co_yield {};

    // Then a step to the right
    for (int i = 0; i < 4; ++i)
    {
        obj.transform.position.x += 0.2f;
        co_yield {};
    }

    // Put your hands on your hips
    // ...

    // Let's do the time warp again!
    for (int i = 0; i < 4; ++i)
    {
        obj.transform.Rotate(0.f, 90.f * i, 0.f);
        co_yield {};
    }
}

My readers may object that this is a hack. In fact, this is the same hack Unity came up with a decade and change ago. And that’s precisely the point. For the exact same reasons.

See, the real reason we mostly see Fibonacci generators in slides is because using co_yield is (relatively) easy, especially since C++23 gave us std::generator. But making use of co_await is hard. Yielding from a coroutine is fairly straightforward and generic. The control flow is simple: we suspend and return to the caller, and they decide when we will be awakened next. On the other hand, handling co_await requires answering a lot of questions that don’t have an obvious answer. What are we going to wait on? How will they signal that they are ready to resume? Can we use signals/interrupts instead of polling? Who will check that they are ready to run again? Will they also awaken (run) the coroutine, or will they put them back in an execution queue? Which execution queue? A background thread? A thread pool? Using what implementation? The list goes on.

To misquote Kennedy, “we chose to focus coroutines on generators in C++23, not because it is hard, but because it is easy”.

C++26 should implement execution and give us a framework to be able to use co_await, but I expect it to be an uphill battle. After all, most projects should already have their own concurrency solution and given how little is in the standard besides low level constructs, it means a lot of divergence that will need to be plugged back into the execution model. I expect most projects have their own custom schedulers, thread pools and the like. Or use something like TBB to get one.

Perhaps your codebase already uses boost::asio in which case you already have support for coroutines. If not, you will either need to wait for C++26 and switch/integrate with execution, or implement your own promises and awaitables to fit your threading model.

Or you could use the Unity hack.

Unity-like coroutines runner in C++

It took me less than an hour to implement a simple Unity style coroutine executor in my toy game main thread. Here’s the whole thing:

class effects_manager
{
public:
    void add( std::generator<std::monostate> effect )
    {
        _effects.push_back( std::move( effect ) );
        _iterators.push_back( _effects.back().begin() );
    }

    void run()
    {
        // Remove the ones that are done
        // (tweaked https://en.cppreference.com/w/cpp/algorithm/remove.html#Version_3)
        int first = 0;
        for ( ; first != _effects.size()
                 && _iterators[ first ] != _effects[ first ].end(); ++first );

        if ( first != _effects.size() )
        {
            for ( int i = first; ++i != _effects.size(); )
            {
                if ( _iterators[ i ] != _effects[ i ].end() )
                {
                    _effects[ first ] = std::move( _effects[ i ] );
                    _iterators[ first ] = std::move( _iterators[ i ] );
                    ++first;
                }
            }
            _effects.erase( begin( _effects ) + first, end( _effects ) );
            _iterators.erase( begin( _iterators ) + first, end( _iterators ) );
        }

        // Run the effects
        for ( int i = 0; i < _effects.size(); ++i )
        {
            ++_iterators[ i ];
        }
    }

private:
    std::vector<std::generator<std::monostate>> _effects;
    using effect_iterator = decltype( std::declval<std::generator<std::monostate>>().begin() );
    std::vector<effect_iterator> _iterators;
};

That’s it. The only hard part is the loop that removes the coroutines that have reached the end of their execution by hand-writing a std::remove_if variant that works with 2 zipped arrays. If you already have a utility for it, the whole thing will take less than 20 lines.
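For reference, such a utility could look something like the sketch below. The name erase_if_zipped and its signature are my own invention for the illustration, not something from the standard library, and it assumes both vectors have the same size.

```cpp
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical helper: removes index i from both vectors whenever pred(a[i], b[i])
// is true, keeping the relative order of the survivors.
// Precondition (assumed, not checked): a.size() == b.size()
template <typename A, typename B, typename Pred>
void erase_if_zipped(std::vector<A>& a, std::vector<B>& b, Pred pred)
{
    std::size_t kept = 0;
    for (std::size_t i = 0; i < a.size(); ++i)
    {
        if (!pred(a[i], b[i]))
        {
            // Compact survivors towards the front, skipping self-moves
            if (kept != i)
            {
                a[kept] = std::move(a[i]);
                b[kept] = std::move(b[i]);
            }
            ++kept;
        }
    }

    const auto tail = static_cast<std::ptrdiff_t>(kept);
    a.erase(a.begin() + tail, a.end());
    b.erase(b.begin() + tail, b.end());
}
```

With this in place, the cleanup step in run() would presumably boil down to one call with a predicate along the lines of "the iterator compares equal to the end of its generator".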

Now we can fire effects by writing something like effects.add(TimeWarp(object)) and we just need to remember to call effects.run() in our main loop.

Doing it the “proper” way would require writing a custom next-frame awaiter that inserts our coroutine handle into a next frame queue. While that’s doable, it requires a more in-depth understanding of coroutine internals to implement. And, to be honest, I kind of like the yield approach to mean “yield control until next frame”.

Bonus benefit

As I was writing this, I also realized, it wouldn’t take much to turn our current implementation into a proper generator rather than relying on our coroutine invoking side effects. Instead of monostate we could return a renderable object.

std::generator<Draw> TimeWarp(const Model& model)
{
    // It's just a jump to the left
    vec3 position{ -1.f, 0.f, 0.f };
    co_yield Draw{ .model = model, .transform{ .position = position } };

    // Then a step to the right
    for (int i = 0; i < 4; ++i)
    {
        position.x += 0.2f;
        co_yield Draw{ .model = model, .transform{ .position = position } };
    }

    // Put your hands on your hips
    // ...

    // Let's do the time warp again!
    for (int i = 0; i < 4; ++i)
    {
        co_yield Draw{ .model = model,
                       .transform{ .position = position,
                                   .rotation = Rotate(0.f, 90.f * i, 0.f) } };
    }
}

Now we change our run() method to populate a vector of draws:

std::vector<Draw> run()
{
    // Remove the ones that are done
    // ...

    // Run the effects
    std::vector<Draw> draws;
    draws.reserve( _effects.size() );
    for ( int i = 0; i < _effects.size(); ++i )
    {
        draws.push_back( *_iterators[ i ] );
        ++_iterators[ i ];
    }
    return draws;
}

And while we’re at it, we could even make our loop run in parallel now since we removed the side effects:

// Run the effects
std::vector<Draw> draws( _effects.size() );
tbb::parallel_for( 0zu, _effects.size(), [this, &draws]( size_t i )
                   {
                       draws[ i ] = *_iterators[ i ];
                       ++_iterators[ i ];
                   } );
return draws;

There. A simple and relatively efficient effect system for our game that allows designers to implement all sorts of bespoke funky things as easy to read coroutines, and the entire system took us less than a hundred lines to write.

Now, wouldn’t you say this looks much more interesting to have than if I had shown you yet another Fibonacci generator?

What makes a game tick? Part 9 - Data Driven Multi-Threading Scheduler

2026-02-27T00:00:00+00:00

Back in late 2025 we started implementing data-driven multi threaded ticks by making all game object lookups and dereferences go through a thin accessor. This in turn forced us to describe which types a given tick task would need to read and write. And with that information, we have everything we need to build a parallel scheduler.

Task metadata

If you remember from part 7 we had described a set of tasks that constituted our example tick and built a simple data access table. I’ll repeat it here for easy reference:

Task	Economy	Diplomacy	Modifiers	Provinces	Armies	Navies	AI
UpdateModifiers			🖊️
UpdateProvinces		📖		🖊️	📖
UpdateEconomy	🖊️			📖
UpdateDiplomacy		🖊️		📖
UpdateArmies		📖	📖	📖	🖊️
UpdateNavies		📖	📖	📖		🖊️
UpdateAI	📖	📖	📖	📖	📖	📖	🖊️

First, let’s translate those into C++ function signatures using the accessor type we described before:

namespace tick_tasks
{
void UpdateModifiers(accessor<Modifiers>);
void UpdateProvinces(accessor<const Army, const CountryDiplomacy, Province>);
void UpdateEconomy(accessor<const Province, CountryEconomy>);
void UpdateDiplomacy(accessor<const Province, CountryDiplomacy>);
void UpdateArmies(accessor<const CountryDiplomacy, const Modifiers, const Province, Army>);
void UpdateNavies(accessor<const CountryDiplomacy, const Modifiers, const Province, Navy>);
void UpdateAI(accessor<const Army,
                       const CountryDiplomacy,
                       const CountryEconomy,
                       const Modifiers,
                       const Province,
                       const Navy,
                       CountryAI>);
}

As you can see, the data accesses of each task are part of the function signature (through the type of their first argument). With some simple template metaprogramming we can access it. The obvious place to capture it would be through whatever registry mechanism we use to tell our scheduler which tasks are part of the tick.

class scheduler
{
public:
    template<typename... Types>
    void register_task(void (*task)(accessor<Types...>))
    {
        _tasks.emplace_back(task);
    }

private:
    std::vector<task> _tasks;
};

namespace tick_tasks
{
void RegisterTasks(scheduler& sched)
{
    sched.register_task(UpdateModifiers);
    sched.register_task(UpdateProvinces);
    sched.register_task(UpdateEconomy);
    // ...
}
}

Task polymorphism

To implement our task storage we will need some form of type erasure to turn our registry into a vector of task objects that we can then manipulate in a more traditional fashion. While I enjoy template metaprogramming, I find it simpler to keep things on the low side as soon as it looks like we’ll need to do any kind of iteration or sorting. Some programming languages are really good at making type manipulation easy, but C++ isn’t one of them (we can revisit this assertion once C++26 reflection is available).

This means that some parts of our scheduler will do things at runtime that could possibly be done at compile time (such as building a task dependency graph), but after experimenting with both options I found the runtime version much simpler to use and maintain, and the cost of building the task graph one time at startup was negligible.

So, let’s implement the task type erasure.

namespace details
{
template<typename T>
// Uses std::set to match the _reads/_writes members of task below
void add_type(std::set<std::size_t>& reads, std::set<std::size_t>& writes)
{
    if constexpr(std::is_const_v<T>)
    {
        reads.emplace(typeid(std::remove_const_t<T>).hash_code());
    }
    else
    {
        writes.emplace(typeid(T).hash_code());
    }
}

template <typename Tuple, std::size_t... I>
void add_types_from_tuple(std::set<std::size_t>& reads,
                          std::set<std::size_t>& writes,
                          std::index_sequence<I...>)
{
    // Might be able to skip the need for using tuple with C++26 pack indexing?
    (add_type<typename std::tuple_element<I, Tuple>::type>(reads, writes), ...);
}
}

class task
{
public:
    template<typename... Types>
    task(void (*task_fn)(accessor<Types...>))
        : _fn([task_fn](Gamestate& gamestate)
              {
                  auto accessor = gamestate.make_accessor<Types...>();
                  task_fn(accessor);
              }
          )
    {
        using Tuple = std::tuple<Types...>;
        details::add_types_from_tuple<Tuple>(
            _reads,
            _writes,
            std::make_index_sequence<std::tuple_size_v<Tuple>>{});
    }

    void run(Gamestate& gamestate) const { _fn(gamestate); }
    const auto& get_reads() const { return _reads; }
    const auto& get_writes() const { return _writes; }

private:
    std::function<void(Gamestate&)> _fn;
    std::set<size_t> _reads;
    std::set<size_t> _writes;
};

And with that we should have our basic task. The idea behind the interface is that all tasks can be run as a callable that takes a reference to the Gamestate, and provide a set of which types are being read or written by the task. Assuming the task is only called from a sane environment (like a task graph built upon our constraints), this is a potential place where we can create an accessor. In general you want to make sure gamestate accessors are created at safe points, but the nice thing about this pattern is that those become the only points where you can possibly create a data race. Anywhere else would trigger a compile error.

In this example we use the hash code from the typeid, which allows us to work with any given type. Alternatively we could have our own registry of allowed types which assigns an index to each registered type and use a bitset instead. It would be more intrusive as we need to explicitly register types, but it would simplify some of the graph construction because finding the intersection between 2 bitsets is a simple binary AND.
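To make the bitset alternative a bit more concrete, here is a rough sketch. The registry is simply a std::tuple listing the allowed types, and type_index, access_masks, make_masks and conflicts are names I made up for the illustration. The three game types are empty stand-ins.

```cpp
#include <bitset>
#include <cstddef>
#include <tuple>
#include <type_traits>

// Stand-ins for the real game types
struct Province {};
struct Army {};
struct Modifiers {};

// Hypothetical registry: a type's bit is its position in this list
using registered_types = std::tuple<Province, Army, Modifiers>;

// Position of T in a tuple (compile error if T was never registered)
template <typename T, typename Tuple>
struct type_index;

template <typename T, typename... Rest>
struct type_index<T, std::tuple<T, Rest...>>
    : std::integral_constant<std::size_t, 0> {};

template <typename T, typename First, typename... Rest>
struct type_index<T, std::tuple<First, Rest...>>
    : std::integral_constant<std::size_t, 1 + type_index<T, std::tuple<Rest...>>::value> {};

using type_mask = std::bitset<std::tuple_size_v<registered_types>>;

struct access_masks
{
    type_mask reads;
    type_mask writes;
};

// const types go to the read mask, everything else to the write mask
template <typename T>
void add_type(access_masks& masks)
{
    constexpr std::size_t index = type_index<std::remove_const_t<T>, registered_types>::value;
    if constexpr (std::is_const_v<T>)
    {
        masks.reads.set(index);
    }
    else
    {
        masks.writes.set(index);
    }
}

template <typename... Types>
access_masks make_masks()
{
    access_masks masks;
    (add_type<Types>(masks), ...);
    return masks;
}

// Two tasks conflict if one writes something the other reads or writes
inline bool conflicts(const access_masks& a, const access_masks& b)
{
    return (a.reads & b.writes).any()
        || (a.writes & b.reads).any()
        || (a.writes & b.writes).any();
}
```

The trade-off is the one described above: every type has to appear in registered_types, but the conflict test becomes three ANDs instead of three set intersections.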

The scheduler itself

Once we have turned all our tasks into a vector, it becomes easier for us to create a task graph. Here’s a very basic implementation:

// Utility to keep the rest readable
template<typename T>
bool intersects(const std::set<T>& s1, const std::set<T>& s2)
{
    std::vector<T> intersection;
    std::set_intersection(
        begin(s1), end(s1), begin(s2), end(s2), std::back_inserter(intersection));
    return !intersection.empty();
}

void build_graph(std::span<task> tasks)
{
    for (int i = 0; i < tasks.size(); ++i)
    {
        // Compare against every task registered before i, including i - 1
        for (int j = 0; j < i; ++j)
        {
            if (intersects(tasks[i].get_reads(), tasks[j].get_writes())
                || intersects(tasks[i].get_writes(), tasks[j].get_reads())
                || intersects(tasks[i].get_writes(), tasks[j].get_writes()))
            {
                tasks[j].add_dependency(tasks[i]);
            }
        }
    }
}

And with that, we have built a task graph that we can then feed to our favourite threading library to run in parallel.
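If your threading library prefers batches of independent work over an explicit graph, the same conflict test can be turned into “waves”. The sketch below is a simplification and not the scheduler from this series: assign_levels is a hypothetical helper, and it takes the conflict test as a callback on task indices so that it stays independent from the task class above.

```cpp
#include <algorithm>
#include <cstddef>
#include <functional>
#include <vector>

// Hypothetical helper: gives each task a level such that two tasks on the same
// level never conflict. conflicts(i, j) is only ever called with j < i, that is
// with j registered before i.
std::vector<std::size_t> assign_levels(
    std::size_t task_count,
    const std::function<bool(std::size_t, std::size_t)>& conflicts)
{
    std::vector<std::size_t> levels(task_count, 0);
    for (std::size_t i = 0; i < task_count; ++i)
    {
        for (std::size_t j = 0; j < i; ++j)
        {
            if (conflicts(i, j))
            {
                // Task i has to run after every earlier task it conflicts with
                levels[i] = std::max(levels[i], levels[j] + 1);
            }
        }
    }
    return levels;
}
```

All tasks sharing a level can then be dispatched together, with a barrier between levels. This is coarser than a real dependency graph, since a task may end up waiting for an entire wave rather than just the tasks it depends on, but it is enough to get started.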

Next time, we will look at data storage and how to tie all this together. See you there!

Profiling on Windows: a Short Rant

2026-02-13T00:00:00+00:00

We have to disrupt our scheduled program because I ran into an annoying hurdle and I feel we need to talk about it. Because right now the profiler situation on Windows kind of sucks and it’s an issue given how ubiquitous the platform is. It works alright for basic/medium usage, but when you need more advanced metrics it breaks down. Let me explain.

I have published many talks about performance, and in particular I had one about profiling and one about caches. CPU caches have been critical for performance for the past decade and a half and while sometimes you can ensure good cache hit rate by following existing patterns, sometimes you just need to measure.

There’s a host of solutions when it comes to sampling and instrumentation profiling on Windows. I always keep Optick around (even though I worry the project looks abandoned these days, I tried to reach out to the maintainer but he didn’t get back to me). There’s one that comes free with Visual Studio. I heard good things about Tracy but sadly I cannot get past the imgui feel of the interface. And if you feel like expensing some paid solution, I found Superluminal’s user experience quite good in the past.

But when you suspect you have a micro-architecture related issue, you need more metrics. Especially cache miss/hit rate, cycles per instruction, branch mispredicts, frontend/backend bound ops, that kind of thing. I recently ran into an issue that I couldn’t explain with basic flamegraphs and CPU time metrics. I suspect it’s related to some code that’s bad for hardware, maybe false sharing or cache-unfriendly memory access, but I cannot measure it.

On Linux and friends there’s a few options for this. Most commonly perf and Cachegrind are free and readily available.

But on Windows, there’s mostly one very obvious choice if you’re using an Intel CPU. I even mention it in my talks. And it’s the reason I’m writing this article. It’s vTune.

That’s right, Intel has decided that the major tool for CPU metrics on Windows now requires an 11th gen CPU or more recent. This wouldn’t be such an issue if you could rollback versions, but sadly, you can’t. Older releases are only available through paid support, and even then only for 2 years. For years every release of vTune only required a 5th gen Core, but if like me you hit the “update” button it will brick your profiler with no way back.

Why is it so bad? Well, I got a 10th gen CPU. Sure it’s over 5 years old, but it works just fine. It’s still the recommended spec for recent AAA games like Battlefield 6. Steam hardware survey does not have CPU model data, but we can use AVX512-VNNI instruction set support as a proxy (it was introduced with 10th gen) and that’s only 25% of all users at the time of writing.

Shouldn’t developers have beefier machines, you may ask? Maybe. I considered getting a new laptop when I started my consulting business but so far I haven’t felt the need to change my workstation. And now that RAM prices have tripled and that GPUs are becoming a luxury I’m even less in a rush. I heard from fellow engineers with full-time positions that their requests for upgrades are being delayed because their IT department cannot source components at reasonable prices either.

What’s the alternative? That’s the catch, I found nothing great so far. AMD has a similar tool, uProf (https://www.amd.com/en/developer/uprof.html), but like Intel’s it only works for their CPUs, and as we mentioned this isn’t a great time to buy a new machine. I read that PerfView can collect hardware metrics but so far I found the interface too arcane to be used.

It’s a bit of a sad conclusion to say that I do not have a solution so far. If you happen to have kept a copy of the pre 2025 vTune offline installer, I suggest you hold on to it for dear life (and maybe host it somewhere and throw me a link 😎). And if you work at Intel, consider convincing the PM to bring back support for older CPUs (or at least make the old installers available). There’s a lot of software on Windows that could use better performance, and I don’t think cutting off a sizeable part of the user base from their profiling tool is a great way to improve the situation.

Benchmarking with Vulkan, or the curse of variable GPU clock rates

2026-01-29T00:00:00+00:00

Choosing between two implementations often requires answering the age-old question “which is faster?”. Which means measuring/benchmarking. Now what do you do when your device’s default mode of operation gives you unreliable numbers?

While modern CPUs have dynamic frequency scaling with technologies like TurboBoost, in my experience this hasn’t been a huge deal for comparing two benchmarks (as long as you handle P vs E cores). GPUs on the other hand are a bit more capricious. According to GPU-Z, my RTX 2080 is currently running at 300 MHz on the GPU and 100 MHz on the VRAM while I’m typing this article. This is obviously not its usual frequency under moderate or heavy workload. According to the internet the GPU should run between 1650 and 1815 MHz, and the VRAM at about 1937 MHz. The numbers are off by a factor of 5-6 on the GPU and 19 on the VRAM. That’s quite the discrepancy.

Steady measurements

This mechanism of dynamic frequency scaling is neat because it saves on power draw, puts less stress on the hardware and lowers the decibels created by the cooling fans. But it sucks for benchmarking.

On my current project I was trying to compare 2 ways of rendering meshes by having a simple toggle in the UI that would select which shader is used and keep a basic average of the last 60 frames for each mode. But I kept getting nonsense. More precisely I got the occasional weird jitter. The scene would take 2ms for a while, then suddenly jump to 4 or 6ms, before going back to 2ms or sometimes even less.
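For illustration, a per-mode rolling average like the one described above could look like the following minimal sketch. FrameAverage, Mode and ModeTimings are hypothetical names rather than code from the actual project, and how the frame time gets measured is left out.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

// Hypothetical helper: average of the last N frame times, in milliseconds.
template <std::size_t N>
class FrameAverage
{
public:
    void push( double frame_ms )
    {
        m_sum -= m_samples[m_next]; // drop the oldest sample (0 while filling up)
        m_samples[m_next] = frame_ms;
        m_sum += frame_ms;
        m_next = ( m_next + 1 ) % N;
        if ( m_count < N )
            ++m_count;
    }

    double average() const
    {
        return m_count == 0 ? 0.0 : m_sum / static_cast<double>( m_count );
    }

private:
    std::array<double, N> m_samples{};
    double m_sum = 0.0;
    std::size_t m_next = 0;
    std::size_t m_count = 0;
};

// One average per shader mode, selected by a (hypothetical) UI toggle.
enum class Mode { A, B };

struct ModeTimings
{
    FrameAverage<60> averages[2];

    void record( Mode mode, double frame_ms )
    {
        averages[static_cast<std::size_t>( mode )].push( frame_ms );
    }

    double average( Mode mode ) const
    {
        return averages[static_cast<std::size_t>( mode )].average();
    }
};

inline void frame_average_example()
{
    ModeTimings timings;
    timings.record( Mode::A, 2.0 );
    timings.record( Mode::A, 4.0 );
    assert( timings.average( Mode::A ) == 3.0 );
    assert( timings.average( Mode::B ) == 0.0 ); // nothing recorded yet
}
```

An average like this only smooths out noise. It cannot compensate for a clock change that doubles or triples the frame time for a while, which is why the numbers still looked like nonsense.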

This is not a new problem and it’s somewhat well documented that GPU benchmarks should be done with fixed/steady clocks. But I admit I thought that if I just disabled vsync and used VK_PRESENT_MODE_MAILBOX_KHR it would keep my GPU busy enough to not throttle down much. Sadly this wasn’t what I observed.

The good, the bad and the ugly workaround

A common recommendation I’ve seen online is to run SetStablePowerState.exe, a simple exe you can build from source (or download) that was once provided by Nvidia. What it does is create a DX12 device and call the developer/debug API function ID3D12Device::SetStablePowerState which will fix the clock to a steady rate until you close it.

It works and it does the trick, but it’s been kind of disowned by the company since. The new recommended way is to use the command line tool nvidia-smi to fix the clocks to a desired rate.

I found both to be lacking in some way:

While SetStablePowerState.exe does the job and is simple enough, it is still an exe I have to remember to launch, and close when I’m done doing GPU work (or at least benchmarking). If I forget to run it I’ll get the wrong results. If I forget to close it I’ll leave my GPU running at max clock all night.
nvidia-smi is even worse in my book. First it doesn’t automatically pick a clock speed for me. The recommendation is to run SetStablePowerState.exe, then look up the clock values in GPU-Z or similar, note those down, then invoke nvidia-smi with the right numbers to fix the clocks. Worse, unlike SetStablePowerState.exe it doesn’t stop if you close the window. There is no window to close. You have to invoke it again, once with --reset-gpu-clocks and another time with --reset-memory-clocks to get back to the default behaviour. While I would probably remember to close SetStablePowerState.exe most of the time, I would very likely forget to run nvidia-smi again and eat up my hardware’s lifetime.

And so I went for the third option, make my own utility.

A simple API

All I wanted was simple: the clocks should be fixed while I run my benchmark or comparison scenario, and off again when it’s done. If my renderer library was based on DX12 it would be easy, just call ID3D12Device::SetStablePowerState(), but sadly Vulkan has no such equivalent (there is an extension request but it doesn’t seem to be getting much traction).

But as it turns out, nothing stops you from creating a DX12 device context in a Vulkan app. So I did just that.

My API is quite simple:

#include 

int main()
{
    // Defaults to off
    gpu_stable_power::Context stable_power;

    // Lock clock speeds
    stable_power.set_enabled( true );

    // Do benchmark

    // Optional: manual toggle off
    stable_power.set_enabled( false );

    // Automatically disables itself on destruction
}

The way I use it is I keep it off by default, but I have a toggle in my debug UI to activate it when I need to benchmark. That way it’s always available when I need it, and I cannot forget to turn it off since it’s at worst disabled when my app exits.

Implementation wise, it’s very close to SetStablePowerState.exe. Create a DX12 device for adapter 0 and call ID3D12Device::SetStablePowerState() when toggled on/off. The rest is mostly there to make integration less painful. The implementation is hidden behind a pimpl (so you don’t get DirectX SDK included in a header file), and it turns into a no-op on non-Windows builds (for portability) and release builds (since this API is locked behind Windows 10/11 developer mode). The criteria can be overridden by setting GPU_STABLE_POWER_ENABLED in the library’s build settings. And if you hate CMake, you can just add gpu_stable_power.cpp to your build. Finally I’ve used #pragma comment(lib, ...) to add DXGI.lib and D3D12.lib to the linker when needed and keep the build integration to a minimum.
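To make the idea concrete, here is a rough single-file sketch of such a toggle. It is not the library’s actual source: StablePowerSketch and its members are hypothetical names, the default adapter is selected by passing nullptr rather than by enumerating adapters through DXGI, and the release-build switch is left out. Only D3D12CreateDevice and ID3D12Device::SetStablePowerState are real API calls.

```cpp
#include <cassert>
#include <memory>

#if defined( _WIN32 )
#include <d3d12.h>
#include <wrl/client.h>
#pragma comment( lib, "d3d12.lib" )
#endif

// Hypothetical sketch, not the real gpu_stable_power interface.
class StablePowerSketch
{
public:
    StablePowerSketch();
    ~StablePowerSketch(); // switches stable power off if it is still on
    StablePowerSketch( const StablePowerSketch& ) = delete;
    StablePowerSketch& operator=( const StablePowerSketch& ) = delete;

    bool set_enabled( bool enabled ); // true if the request was applied
    bool is_enabled() const { return m_enabled; }

private:
    struct Impl; // pimpl: keeps D3D12 types out of the class definition
    std::unique_ptr<Impl> m_impl;
    bool m_enabled = false;
};

#if defined( _WIN32 )
struct StablePowerSketch::Impl
{
    Microsoft::WRL::ComPtr<ID3D12Device> device;
};

StablePowerSketch::StablePowerSketch() : m_impl( std::make_unique<Impl>() )
{
    // nullptr asks for the default adapter
    if ( FAILED( D3D12CreateDevice( nullptr, D3D_FEATURE_LEVEL_11_0,
                                    IID_PPV_ARGS( &m_impl->device ) ) ) )
        m_impl->device.Reset();
}

bool StablePowerSketch::set_enabled( bool enabled )
{
    if ( !m_impl->device )
        return false;
    // Fails when Windows developer mode is not enabled
    if ( FAILED( m_impl->device->SetStablePowerState( enabled ? TRUE : FALSE ) ) )
        return false;
    m_enabled = enabled;
    return true;
}
#else
// Other platforms: everything is a no-op
struct StablePowerSketch::Impl {};
StablePowerSketch::StablePowerSketch() = default;
bool StablePowerSketch::set_enabled( bool ) { return false; }
#endif

StablePowerSketch::~StablePowerSketch()
{
    if ( m_enabled )
        set_enabled( false );
}

inline void benchmark_with_stable_clocks()
{
    StablePowerSketch stable_power;
    assert( !stable_power.is_enabled() ); // defaults to off
    stable_power.set_enabled( true );
    // ... run the benchmark here ...
} // clocks are released when stable_power goes out of scope
```

In a real library the class declaration would live in a header and the rest in a .cpp file, which is what keeps the DirectX headers away from client code.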

I have not bothered adding GPU selection because I only have one, and I don’t have the hardware available to test if it should be disabled for other vendors (I assume AMD also has variable clock rates?), but it should be easy to add if the need arises.

That’s it for today, happy benchmarking!

Designated Initializers, the best feature of C++20

2026-01-15T00:00:00+00:00

If you’ve been following my hot takes on C++, you might have noticed that I haven’t been the most enthusiastic person about the recent additions to the language. While some of them were nice additions, I haven’t felt like they had a significant impact on my code unless I had a somewhat niche use case. But for the past months I’ve been using C++20’s designated initializers and it’s been quite the change.

The feature

Originally proposed as P0329, the feature is a port of C99 with some tweaks. For those who have never used it in either language, it allows initializing structure members by name while omitting the ones that should keep their default values.

Here’s an example use from my current project:

Texture::Desc desc { .format = Texture::Format::R16G16B16A16_SFLOAT,
                     .usage = Texture::Usage::COLOR_ATTACHMENT | Texture::Usage::TRANSFER_SRC,
                     .extent = device.get_extent(),
                     .samples = 4 };

And here’s the full structure declaration:

class Texture
{
    // ...
    struct Desc
    {
      Format format = Format::UNDEFINED;
      Usage usage = Usage::NONE;
      Extent2D extent;
      int mips = 1;
      int samples = 1;
    };
};

Note that I am not specifying any value for mips. The compiler will leave it to its default value upon construction, the same way it does for member initializer lists in constructors. There could be 0 or a dozen members between 2 explicitly initialized elements like extent and samples and it will compile and work just fine. But unlike constructor initialization lists, the compiler will emit a hard error if any of those members appear out of order. This departure in design also makes it differ from the C99 version which allows members to appear in any order. I think this is a good design choice, although I know it has its detractors. More on that later.
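The behaviour is easy to check with a reduced, self-contained version of the struct. In the sketch below, Format, Extent2D and Desc are simplified stand-ins for the real library types, and make_desc() is a hypothetical helper that only exists for the example.

```cpp
#include <cassert>

// Simplified stand-ins for the library types (assumptions for illustration)
enum class Format { UNDEFINED, R16G16B16A16_SFLOAT };

struct Extent2D
{
    int width = 0;
    int height = 0;
};

struct Desc
{
    Format format = Format::UNDEFINED;
    Extent2D extent;
    int mips = 1;
    int samples = 1;
};

inline Desc make_desc()
{
    // mips is skipped and keeps its default member initializer
    return Desc{ .format = Format::R16G16B16A16_SFLOAT,
                 .extent = { .width = 1920, .height = 1080 },
                 .samples = 4 };
}

// Rejected by a C++20 compiler, designators must follow declaration order:
// Desc bad{ .samples = 4, .format = Format::UNDEFINED };

inline void check_desc_defaults()
{
    const Desc desc = make_desc();
    assert( desc.mips == 1 );    // untouched default
    assert( desc.samples == 4 ); // explicitly set
}
```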

That’s all?

This feature might not feel like a big deal at first, especially compared to the other large additions to C++ like modules, coroutines, concepts and the like. So why is it so important in my opinion?

A lot of C++ is about catching bugs with the compiler rather than with the debugger (or the QA process, or worse). And one of the big ways to do that is with strong types. Apples and Oranges are two different types that might both just be int under the hood, but if you try to assign one to the other by mistake, you get a compile error.

Going back to my example, before C++20 you could have written it this way:

Texture::Desc desc { Texture::Format::R16G16B16A16_SFLOAT,
                     Texture::Usage::COLOR_ATTACHMENT | Texture::Usage::TRANSFER_SRC,
                     device.get_extent(),
                     1,
                     4 };

This is good old aggregate initialization, and it’s been there forever. But notice how the last 2 parameters are just int. Would you immediately catch that the first is mips and the second is samples without double checking the struct declaration?

It goes even deeper. Both Texture::Format and Texture::Usage are enum class which in turn, you guessed it, are also int. And why did we make enum class in C++11 in the first place? Same reason: to make sure we can’t accidentally mix them up. But you know how else we could avoid mixing them up? Making sure they are used in an expression with a left hand side that has a name.

Compare with old fashioned member-wise assignment:

Texture::Desc desc;
desc.format = Texture::Usage::COLOR_ATTACHMENT | Texture::Usage::TRANSFER_SRC;
desc.usage = Texture::Format::R16G16B16A16_SFLOAT;
desc.extent = device.get_extent();
desc.mips = 4;
desc.samples = 1;

It’s very obvious we mixed things up here, right? Even if format and usage weren’t strong enums, it would be fairly easy to catch during a code review. A compile error is nicer, the IDE adding squiggly red lines under the expression as we type it is even better, but still it’s quite jarring.

C++ Units

But why don’t we go all the way for mips and samples like we did for format and usage? It would be nice if we could express them as different types entirely, no? There are some libraries out there that offer some options.

// With https://github.com/rollbear/strong_type
using Mips = strong::type<uint32_t, struct Mips_>;
using Samples = strong::type<uint32_t, struct Samples_>;

// With https://github.com/joboccara/NamedType
using Mips = NamedType<uint32_t, struct MipsTag>;
using Samples = NamedType<uint32_t, struct SamplesTag>;

// With https://github.com/mpusz/mp-units
// TODO. I gave up after reading too many manual pages

Since C++ has no compiler support for named types, they all use similar meta-programming tricks to create a unique struct with some tags that wraps an integer (or similar scalar type). Which usually means a nontrivial amount of library code dedicated to various operators to bring back all the semantics of int to our named struct. Compilers then have various degrees of success translating that back into assembly (mostly fine with optimizations on, mostly terrible without).

But that doesn’t entirely solve our problem, because usually to avoid mixing and matching we need to disable implicit conversion and assignment from int, else we miss the entire point of guarding against initializer list mismatch. To fix that we usually add user defined literal suffixes:

// Integer literal operators must take unsigned long long
Mips operator ""_mips(unsigned long long);
Samples operator ""_samples(unsigned long long);

const auto format = Texture::Format::R16G16B16A16_SFLOAT;
const auto usage = Texture::Usage::COLOR_ATTACHMENT | Texture::Usage::TRANSFER_SRC;

// Works
Texture::Desc desc1( format, usage, device.get_extent(), 1_mips, 4_samples );

// Compile error
Texture::Desc desc2( format, usage, device.get_extent(), 4_samples, 1_mips );

Finally, we’ve solved it. But don’t you notice something? Don’t 4_samples and 1_mips look awfully close to .samples = 4 and .mips = 1? Except one of them requires an entire strong type library and the other is supported natively by the compiler.

It goes deeper

So far in our example we’ve kept to cases where we specified most if not all the members. Or at the very least we specified the ones that came first in declaration order. But that’s not how every struct is laid out.

Let’s look at another example from my library:

Pipeline::Desc desc { .color_format = draw_image.get_format(),
                      .depth_format = depth_image.get_format(),
                      .push_constants_size = sizeof( push_constants ) };

And here’s the struct definition as of this article’s writing:

class Pipeline
{
    // ...
    struct Desc
    {
      // Graphics pipelines only
      Texture::Format color_format;
      Texture::Format depth_format;
      PrimitiveTopology topology = PrimitiveTopology::TRIANGLE_LIST;
      CullMode cull_mode = CullMode::FRONT;
      FrontFace front_face = FrontFace::CLOCKWISE;
      // Compute & graphics pipelines
      uint32_t push_constants_size = 0;
    };
};

As you may notice, we are skipping a bunch of values here and leaving them to their defaults. Without that we would have needed to repeat a lot of code only to say “keep those values as they would be otherwise”.

But most importantly, this does not only apply to declaring variables on the stack. Now, we can finally do this:

auto tex = device.create_texture(
                Texture::Desc { .format = Texture::Format::D32_SFLOAT,
                                .usage = Texture::Usage::DEPTH_STENCIL_ATTACHMENT,
                                .extent = draw_image_extent,
                                .samples = 4 } );

Or even this:

// The compiler will deduce what type we are initializing from the function declaration
auto tex = device.create_texture( { .format = Texture::Format::D32_SFLOAT,
                                    .usage = Texture::Usage::DEPTH_STENCIL_ATTACHMENT,
                                    .extent = draw_image_extent,
                                    .samples = 4 } );

Python programmers will look at this and exclaim “Look at what they need to mimic a fraction of our power!”. Indeed, Python has had support for named function arguments since the initial 1.0 release in 1994. Meanwhile in C++ the best we have is a combinatorial explosion of overloads with default values that cannot possibly scale.

If we look back at the history of my library the API used to look like this:

class Device
{
    raii::Texture create_texture( Texture::Format format,
                                  Texture::Usage usage,
                                  Extent2D extent,
                                  int samples = 1,
                                  int mips = 1 );
};

This API would not scale well as it grows and we add more and more options to texture creation, and adding overloads wouldn’t really help.

Is passing a struct and using designated initializers to fill it kind of cheating? Maybe. Is it better than hoping that C++ will one day have named function parameters? Absolutely. Especially because it does the job as a byproduct of a feature that is itself already quite neat to use even for initializing local variables (or any variable really).
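The scaling argument can be shown with a toy example. In this sketch TextureDesc and total_samples() are hypothetical stand-ins, and the srgb member plays the role of an option added long after the first call sites were written.

```cpp
// Hypothetical parameter struct standing in for a texture description
struct TextureDesc
{
    int width = 1;
    int height = 1;
    int mips = 1;
    int samples = 1;
    bool srgb = false; // imagined later addition, existing callers are unaffected
};

// Stand-in for a creation function: takes all its options as one struct
constexpr int total_samples( const TextureDesc& desc )
{
    return desc.width * desc.height * desc.samples;
}

// Call sites name only what they care about, in declaration order
static_assert( total_samples( { .width = 4, .height = 2, .samples = 4 } ) == 32 );
static_assert( total_samples( { .width = 4, .height = 2 } ) == 8 );
static_assert( total_samples( {} ) == 1 );
```

With positional parameters and default values, skipping mips to set samples would require either reordering the parameters or adding yet another overload.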

Limitations

The main complaint I have read about this feature is that unlike C99, it doesn’t allow for arbitrary ordering. Worse, if out of order initializers are used in a C header it will fail to compile when included in C++, regardless of whether it’s wrapped in extern "C" or not.

On one hand I can see why one would like the rule to be relaxed for types declared as extern "C" but sadly this isn’t how it works. extern "C" does not revert the language grammar and semantics to C, it only changes linkage. You can still use any manner of C++ features in an extern "C" block and the compiler will be just fine with it (you probably shouldn’t because it will definitely fail to compile with C clients, but technically you can).

The thing is, in C++ I want the order of declaration enforced, the same way I’d like the clang-tidy warning for out-of-order constructor initializer lists to be a hard error. In C++ order of construction matters because constructing a struct member can invoke all manner of side effects and it’s not a good idea to mislead the programmer who might think members will be constructed in the order they are initialized as opposed to the order they are declared. Enforcing both to be the same sidesteps that issue entirely.
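Here is a small sketch of why that matters, using initializers with a visible side effect. Ticket and next_number() are hypothetical names made up for the example.

```cpp
#include <cassert>

struct Ticket
{
    int first;
    int second;
};

// Side effect: every call hands out the next number
inline int next_number()
{
    static int counter = 0;
    return ++counter;
}

inline Ticket make_ticket()
{
    // Initializers run in the order written, which C++20 forces to be the
    // declaration order, so `first` always gets the smaller number.
    return Ticket{ .first = next_number(), .second = next_number() };
}

inline void check_ticket_order()
{
    const Ticket ticket = make_ticket();
    assert( ticket.second == ticket.first + 1 );
}
```

If the designators could be written as .second then .first, the reader would have to know which of the two orders the compiler follows to predict which member receives which number.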

I have pondered the idea of relaxing the ordering rule for trivial types but doing so would likely set a type definition in stone, as adding any nontrivial member (or changing its definition to be nontrivial) would break every client, which is usually not a thing you want in an API.

Another departure from C that’s worth mentioning is that since C++ allows for member initializers in declarations, we can have default values for omitted members that are something other than 0. I found that 1 is also a fairly common default value (like in texture mips and samples for example) and those cannot be expressed in the C99 version.

To me the only pet peeve is that designated initializers cannot be forwarded. This does not compile:

std::vector<Texture::Desc> v;
v.emplace_back( .format = Texture::Format::D32_SFLOAT,
                .usage = Texture::Usage::DEPTH_STENCIL_ATTACHMENT,
                .extent = draw_image_extent,
                .samples = 4 );

You can fix this by wrapping the arguments:

v.emplace_back( Texture::Desc{ .format = Texture::Format::D32_SFLOAT,
                               .usage = Texture::Usage::DEPTH_STENCIL_ATTACHMENT,
                               .extent = draw_image_extent,
                               .samples = 4 } );
// Or simply
v.push_back( { .format = Texture::Format::D32_SFLOAT,
               .usage = Texture::Usage::DEPTH_STENCIL_ATTACHMENT,
               .extent = draw_image_extent,
               .samples = 4 } );

I’ve looked at the assembly generated and it’s basically the same with or without optimizations, at least for trivial types, so it’s only a minor annoyance, but still worth mentioning.

Wrapping up

Of all the language features of late, this is the one that I think has changed (and will change) how I design APIs the most (remember I said language features, because on the library side there have been things like std::span which has had an impact similar to std::string_view).

Only time will tell how this measures up compared to bigger features, but it’s a reminder that language evolution does not have to come with a very big paper to be impactful.

A Year With Graphics

2025-12-30T00:00:00+00:00

I had done some work with graphics while working on various titles at Paradox, but I never felt really confident about it like I would have been about C++ or multithreading or the few other topics I’ve talked about in the past. Sure I had done some work with it, figured out what the point of shaders is (the answer is: they shade) and migrated Hearts of Iron IV from DirectX 9 to 11, but it still felt a bit mystical. So I decided to use my spare time between contracts this year to catch up.

If at first you don’t succeed…

This wasn’t the first attempt I had made. A few years back I had run across a series of articles online claiming to make it “easy to understand”, only to welcome the reader with thousands of lines of Vulkan bootstrap code (all in C, of course). I found the whole thing utterly impossible to digest and moved on with my life.

This year, I started again with raylib after a suggestion from a past coworker. I had more experience with DirectX than OpenGL, but in combination with Learn OpenGL this proved easy enough to catch up. I combined it with the C++ wrapper to avoid manual resource management because that’s definitely not a thing we should be doing 30 years after RAII was invented.

Running into some limitations, I then gave a shot to SDL3_GPU after seeing a presentation from Mike Shah. The main value I found in it was making it obvious that one needs to group buffer updates in big batches because doing small mmap()/memcpy()/munmap() is really really inefficient for GPUs. In exchange however I’d lost all the other features brought by raylib, including the math library and asset loading.

Sadly (and despite being officially released in 2025), SDL3_GPU still enforces antiquated patterns like vertex buffer layout description and other fixed function pipeline relics. In general the API lacks support for bindless resources and it’s unclear when (or if) it will be added.

And so after all those adventures I was back on Vulkan. And this time it feels like I succeeded.

Learning

This whole process brought me back to a topic that is dear to me: learning. More precisely, how does one get into something new without easy access to experts that can point you in the right direction? After years in the C++ conference circuit I had kind of taken for granted that there was always someone a DM away from the answer, or at least a good lead towards it.

There’s a lot of stuff out there, so how does one find the right resource? Assuming we can at least avoid nonsense written by AI (hint: you can ask google for pages published before 2023), that’s still a big haystack. One of the hardest parts, I found, was to figure out what the latest trends and best practices were. C++ is not the only tech topic guilty of using the word “modern” to describe patterns that are now a decade old…

One of the things that helped were the presentations made at ACM SIGGRAPH. While far from perfect (finding anything on their website is near impossible and reposts on Youtube seem to happen on a random schedule) and often hard to get into as a beginner, the slides did come with neat bibliographies which proved very useful to vet sources and articles. If it’s cited in a recent presentation about IDTech or Frostbite, it’s probably solid.

Eventually I found out about the Rendering Engine Architecture Conference, whose talks are consistently uploaded on Youtube. I don’t know the whole story behind it (they claim to be “a reaction to the conferences they used to attend”) but after the hurdles of accessing SIGGRAPH (to say nothing about GDC) I certainly think they might be on to something.

Results

My last rewrite started by following the (unofficial) “Vulkan Guide” which proved useful to start up (although it’s going through a rewrite and the last chapters are still missing), with some extra inspiration found in a hobby project called Kaleidoscope that the social media algorithm randomly threw at me.

I put my own thin Vulkan abstraction out on github although I don’t think there’s much interesting stuff going on in there for now. The only feature I’d say is possibly worth a look is the pipeline manager that implements background shader recompilation if the source changes. If you’re curious about multi-threaded asset loading in general, I made some experiments with various solutions a month ago.

In general there’s a lot of stuff missing or possibly inefficient, as I try to only add features as I need them. If decades of API development have taught me one thing, it’s to never write something you don’t have a client for.

Behold!

If you bothered to check the repository, you might have noticed that I used C++20 modules. It only took 5 years, and I still needed to hack around Intellisense but I finally got to use modules. And yes it’s an amazing quality of life for compile times when including C++ libraries. When it works.

The other C++20 feature I can’t live without now is designated initializers. In my opinion they beat any form of builder pattern or constructor overloads.

What’s next?

My initial plan was to implement mesh shading, but I was at first hesitant given the minimum hardware requirements (RTX 2xxx series and later if I’m not mistaken). A recent post by Sebastian Aaltonen is starting to convince me that this is a reasonable baseline, and that my API isn’t going in a terrible direction. Phew!

My most recent watch has been the Niagara series by Arseny Kapoulkine which I found very knowledgeable, but the streaming format can make the pacing a bit tedious at times. I wish I could find a more edited video series. If anyone has any recommendation, I’m all ears. If not, this might mean there is a gap waiting to be filled.

Speaking of streaming-like content, I could not end this post without mentioning Freya Holmér’s channel for anyone who would like a refresher on either graphics math or shader basics. Again, please note this is also captured from a streaming format and so the editing (or lack thereof) might annoy some.

The search continues

As this year ends, I am left with an interesting question that has been a theme through this whole article: have I managed to catch up? To answer the question requires not only looking at what I’ve learnt, but more importantly figuring out what I don’t know. And here lies the catch: you can never really know what you don’t know. At best you can do what I’ve done in this article: put something out there, and see if someone points you at something you missed.

Happy new year!

What makes a game tick? Part 8 - Data Driven Multi-Threading Implementation

2025-12-11T00:00:00+00:00

In our last episode in this series we presented the concept of task-based parallelism with scheduling driven by data accesses. I recommend going back to it for a quick reminder because today we are gonna talk about implementation. Let’s get coding!

Access denied

To be able to implement data-driven task parallelism, we need to establish two rules:

Tasks must declare what data they read and write
Data accesses within tasks must go through some middle man that will ensure they are only accessing data they declared

This may sound obvious, but this has some pretty important implications. First of all, no pointers. Or references. Unique pointers and containers are OK as long as no other object is allowed to keep a pointer to them.

Here are some examples:

struct Army
{
    // Direct data members, all fine
    float health;
    float morale;
    float attack;
    float defense;
    
    Country* controller; // Bad
    const Province* location; // Still bad
    const Country& owner;  // Also bad

    std::vector<Equipment> equipment;  // Collection of direct members, fine
    std::unique_ptr<Emblem> emblem; // Fine as long as only accessed through the Army
};

So if we cannot have pointers, what can we do? Obviously we can’t just declare that every object should not have relationship to any other object, but we have to express those relationships in a way that does not allow for unchecked pointer following.

A pointer wrapper

A simple solution is to make a thin wrapper around pointers:

template <typename T>
class obj_ptr {
public:
    constexpr obj_ptr() = default;
    constexpr explicit obj_ptr(T* obj) : m_ptr(obj) {}
    constexpr obj_ptr& operator=(T* obj)
    {
        m_ptr = obj;
        return *this;
    }

    constexpr void clear() { m_ptr = nullptr; }
    constexpr explicit operator bool() const { return m_ptr != nullptr; }
    constexpr auto operator<=>(const obj_ptr&) const = default;

private:
    template<typename... Types>
    friend class accessor;

    T* m_ptr = nullptr;
};

This obj_ptr wraps a pointer and offers similar logic except it’s missing the actual de-reference operations. As such it is safe to use because no one outside of the friend class accessor can access the m_ptr member. Then the next step is to actually implement this accessor class that will act as our de-reference proxy.

// Is a type allowed read-only access by a given accessor?
template<typename T, typename... AllowedTypes>
concept ro_access = details::contains_type<std::add_const_t<T>, AllowedTypes...>;

// Is a type allowed read-write access by a given accessor?
template<typename T, typename... AllowedTypes>
concept rw_access = details::contains_type<T, AllowedTypes...>;

template<typename... Types>
class accessor
{
public:
    constexpr accessor(const accessor&) = default;
    constexpr accessor(accessor&&) noexcept = default;

    template <rw_access<Types...> T>
    T* get(obj_ptr<T> ref)
    {
        return ref.m_ptr;
    }

    template <ro_access<Types...> T>
    const T* get(obj_ptr<T> ref)
    {
        return ref.m_ptr;
    }
private:
    constexpr accessor() = default;
    friend class task_executor;
};

Now, how does this work? The concepts ro_access and rw_access act as a barrier, only emitting a get() function for a given T if that type is part of Types, and with the right const qualifier on the returned pointer. For example:

void UpdateProvince(accessor<Province, Army, const CountryDiplomacy> access, obj_ptr<Province> p)
{
    // Ok, Province is part of accessor's types
    Province* province = access.get( p );
    for ( obj_ptr<Army> a : province->m_armies )
    {
        // Also ok, access.get() returns an `Army*` that is implicitly converted to `const Army*`
        const Army* army_on_province = access.get( a );
        // Compute something ...
    }
    // Compile error: only have const access to diplomacy
    CountryDiplomacy* owner_diplo = access.get( province->m_owner_diplo ); 
    // Compile error: no access to navies
    const Navy* navy_in_port = access.get( province->m_navy_in_port ); 
}

As you can see, this way we ensure a given task (like UpdateProvince()) only accesses what it promised it would, in the way it promised it would (read or write) or else the task will fail to compile. With that guarantee in mind, we can now check that two tasks are compatible to run in parallel, and we can even do it at compile time. All it takes is extracting the type list from the first argument’s signature, and checking whether any non-const type in one appears in the other.
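A compile-time compatibility check along those lines could be sketched as follows. This is one possible approach, not the scheduler’s actual code: accessor is reduced to an empty stand-in, and clashes, conflicts, first_arg and can_run_in_parallel are hypothetical names. It also assumes tasks are plain functions that take their accessor as first parameter.

```cpp
#include <type_traits>

// Stand-in for the accessor class, only the type list matters here
template <typename... Types>
struct accessor {};

// Two accesses clash when they name the same type and at least one can write
template <typename T, typename U>
inline constexpr bool clashes =
    std::is_same_v<std::remove_const_t<T>, std::remove_const_t<U>> &&
    !( std::is_const_v<T> && std::is_const_v<U> );

template <typename T, typename... Us>
constexpr bool clashes_with_any()
{
    return ( clashes<T, Us> || ... );
}

// Compare every type of one accessor against every type of the other
template <typename... As, typename... Bs>
constexpr bool conflicts( accessor<As...>, accessor<Bs...> )
{
    return ( clashes_with_any<As, Bs...>() || ... );
}

// Extract the first parameter type from a function pointer type
template <typename F>
struct first_arg;

template <typename R, typename First, typename... Rest>
struct first_arg<R ( * )( First, Rest... )>
{
    using type = First;
};

template <auto TaskA, auto TaskB>
inline constexpr bool can_run_in_parallel =
    !conflicts( typename first_arg<decltype( TaskA )>::type{},
                typename first_arg<decltype( TaskB )>::type{} );

// Placeholder game types and tasks
struct Province {};
struct Army {};
struct Navy {};
struct CountryDiplomacy {};

inline void UpdateProvince( accessor<Province, Army, const CountryDiplomacy> ) {}
inline void UpdateNavy( accessor<Navy, const CountryDiplomacy> ) {}
inline void UpdateDiplomacy( accessor<CountryDiplomacy> ) {}

// Both only read diplomacy: compatible
static_assert( can_run_in_parallel<UpdateProvince, UpdateNavy> );
// One writes what the other reads: must be serialized
static_assert( !can_run_in_parallel<UpdateProvince, UpdateDiplomacy> );
```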

A viral pattern

One important consequence of using this kind of technique is that by necessity every function that a task may call now has to add an accessor parameter to its signature. We should of course add a constructor to accessor that allows for creating sub-accessors with either fewer types or types demoted from non-const to const. Still, it is a thing that will annoy programmers and designers alike when iterating over features and that should be kept in mind.

Especially because one of the effects of such a viral pattern is that when one finds themselves needing an extra data accessor down in a leaf function, they need to edit the signature of all callers recursively to add the required types. On the other hand, I found that this encourages (forces?) programmers to see the performance cost of a given gameplay change, because by its nature it forces them to bubble that change all the way up to the task declaration. It will also make it quite blatant to the pull request reviewer when a new read/write access to a common type is added.

Speaking of, the other advantage of this technique is that it makes it very obvious which game objects are used all the time and would be potential candidates for being split, as we showed in the previous chapter. Of course this does not replace a profiler, sometimes a type is only shared by a few tasks, but they are all very expensive to run and should definitely be parallelized instead of run serially. You should always be profiling your tick.

Going further

There’s of course more to be implemented here. We haven’t written the scheduler, nor talked about how we should handle the game object’s primary storage. We will discuss part of those next time, although the specifics of how to implement a full scheduler might be better done as part of a github repository (no promises though).

Finally, I’d like to remind readers that this is a per type access control solution, not a per object. It is still possible to subdivide the work by turning a loop into a parallel variant, but that may warrant more discussions also. Until then!