Diagnosing game crashes

When you run user testing, there will be surprises. Players will do things you would never believe, and there is a small, but not insignificant, chance that this will end with the game crashing. Not the nice kind of crash with an error message and some default action, but the nasty kind where the OS terminates the process. The good news is that when the OS does that, it leaves a record of what happened at the point of the crash: a minidump.

Crashes and symbols

So, what is inside a minidump? Essentially, the details of the error that was encountered and where it happened. Unfortunately, it is not very human-friendly: the error is a signal number (so far so good), and each entry in the call stack is just an offset into the binary.

signal 11 (SIGSEGV)

Kopi-Linux-Shipping 0x0000000000200000 + 7727680
Kopi-Linux-Shipping 0x0000000000200000 + 7708a09
Kopi-Linux-Shipping 0x0000000000200000 + 3743c3a
Kopi-Linux-Shipping 0x0000000000200000 + 373f8ac
Kopi-Linux-Shipping 0x0000000000200000 + 373f585
Kopi-Linux-Shipping 0x0000000000200000 + 373f8ac
Kopi-Linux-Shipping 0x0000000000200000 + 35eade7
Kopi-Linux-Shipping 0x0000000000200000 + 37407f7
Kopi-Linux-Shipping 0x0000000000200000 + 7139434
Kopi-Linux-Shipping 0x0000000000200000 + 712b9ac
Kopi-Linux-Shipping 0x0000000000200000 + 6e007ca
Kopi-Linux-Shipping 0x0000000000200000 + 6acaf69
Kopi-Linux-Shipping 0x0000000000200000 + 6ac9d1a
Kopi-Linux-Shipping 0x0000000000200000 + 6ac99fc
Kopi-Linux-Shipping 0x0000000000200000 + 6ad06cd
Kopi-Linux-Shipping 0x0000000000200000 + 6dc6357
Kopi-Linux-Shipping 0x0000000000200000 + 32ff799
Kopi-Linux-Shipping 0x0000000000200000 + 32ff02f
Kopi-Linux-Shipping 0x0000000000200000 + 32fe1b1
Kopi-Linux-Shipping 0x0000000000200000 + 6dc2217
Kopi-Linux-Shipping 0x0000000000200000 + 6dbf455
Kopi-Linux-Shipping 0x0000000000200000 + 67ef366
Kopi-Linux-Shipping 0x0000000000200000 + 6615f40
Kopi-Linux-Shipping 0x0000000000200000 + 70fc819
Kopi-Linux-Shipping 0x0000000000200000 + 70fe05a
Kopi-Linux-Shipping 0x0000000000200000 + 70ed22f
            libc.so 0x00007f18f9729000 + 23850
            libc.so 0x00007f18f9729000 + 2390a
Kopi-Linux-Shipping 0x0000000000200000 + 32d3029

Better than nothing, but not particularly helpful. We can get a little more help from the OS and the runtime; on the Steam Deck, for example, the crash report is somewhat more developer-friendly.

OS version Linux 6.1.52-valve9-1-neptune-61 (network name: steamdeck)
Running 4 x86_64 processors (8 logical cores)
Exception was "SIGSEGV: invalid attempt to write memory at address 0x00000005ff3759df"

<SOURCE START>
<SOURCE END>

<CALLSTACK START>
Kopi-Linux-Shipping!UnknownFunction(0x7727680)
Kopi-Linux-Shipping!UnknownFunction(0x7708a08)
Kopi-Linux-Shipping!UnknownFunction(0x3743c39)
Kopi-Linux-Shipping!UnknownFunction(0x373f8ab)
Kopi-Linux-Shipping!UnknownFunction(0x373f584)
Kopi-Linux-Shipping!UnknownFunction(0x373f8ab)
Kopi-Linux-Shipping!UnknownFunction(0x35eade6)
Kopi-Linux-Shipping!UnknownFunction(0x37407f6)
Kopi-Linux-Shipping!UnknownFunction(0x7139433)
Kopi-Linux-Shipping!UnknownFunction(0x712b9ab)
Kopi-Linux-Shipping!UnknownFunction(0x6e007c9)
Kopi-Linux-Shipping!UnknownFunction(0x6acaf68)
Kopi-Linux-Shipping!UnknownFunction(0x6ac9d19)
Kopi-Linux-Shipping!UnknownFunction(0x6ac99fb)
Kopi-Linux-Shipping!UnknownFunction(0x6ad06cc)
Kopi-Linux-Shipping!UnknownFunction(0x6dc6356)
Kopi-Linux-Shipping!UnknownFunction(0x32ff798)
Kopi-Linux-Shipping!UnknownFunction(0x32ff02e)
Kopi-Linux-Shipping!UnknownFunction(0x32fe1b0)
Kopi-Linux-Shipping!UnknownFunction(0x6dc2216)
Kopi-Linux-Shipping!UnknownFunction(0x6dbf454)
Kopi-Linux-Shipping!UnknownFunction(0x67ef365)
Kopi-Linux-Shipping!UnknownFunction(0x6615f3f)
Kopi-Linux-Shipping!UnknownFunction(0x70fc818)
Kopi-Linux-Shipping!UnknownFunction(0x70fe059)
Kopi-Linux-Shipping!UnknownFunction(0x70ed22e)
libc.so.6!UnknownFunction(0x2384f)
libc.so.6!__libc_start_main(+0x89)
Kopi-Linux-Shipping!UnknownFunction(0x32d3028)

<CALLSTACK END>

0 loaded modules

Report end!

This is somewhat more helpful: the message Exception was "SIGSEGV: invalid attempt to write memory at address 0x00000005ff3759df" tells us that we wrote to a memory location that we should not have. To go further, we need the minidump, debugging symbols that match it, and the source files that went into the build. With these three things, we can have a first stab at finding out what happened. We will use the gdb debugger (since we are diagnosing a Linux build crash on the Steam Deck); gdb will make use of the debug symbols that were generated as part of the build process.
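
Before pointing gdb at anything, it is worth checking that the symbols really belong to the crashing binary. The usual way is to compare GNU build IDs; a minimal sketch, using the paths from the gdb session below (readelf prints the build ID as part of the ELF notes):

$ readelf -n Kopi-Linux-Shipping | grep "Build ID"
$ readelf -n /var/ue5_3/Linux/Kopi/Binaries/Linux/Kopi-Linux-Shipping.debug | grep "Build ID"

If the two IDs differ, the symbol file came from a different build and nothing it resolves can be trusted.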

$ gdb Kopi-Linux-Shipping
GNU gdb (Ubuntu 14.0.50.20230907-0ubuntu1) 14.0.50.20230907-git
Copyright (C) 2023 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.
Type "show copying" and "show warranty" for details.
This GDB was configured as "x86_64-linux-gnu".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<https://www.gnu.org/software/gdb/bugs/>.
Find the GDB manual and other documentation resources online at:
    <http://www.gnu.org/software/gdb/documentation/>.

For help, type "help".
Type "apropos word" to search for commands related to "word"...
Reading symbols from Kopi-Linux-Shipping...
Reading symbols from /var/ue5_3/Linux/Kopi/Binaries/Linux/Kopi-Linux-Shipping.debug...

Now that we have the binary and the debug symbols loaded, we can find out what symbol lives at the address 0x0000000000200000 + 0x7727680, which is the load base from the minidump plus the offset of the topmost frame.

(gdb) info symbol 0x0000000000200000 + 0x7727680
UKopiGameInstance::CrashGame() + 16 in section .text of /var/ue5_3/Linux/Kopi/Binaries/Linux/Kopi-Linux-Shipping

Aha! It is the somewhat suspicious-looking method UKopiGameInstance::CrashGame(); the + 16 is the offset of the failing instruction from the start of the method. Let’s keep digging for its precise source location; note the * in front of the address in the info line command.

(gdb) info line *0x0000000000200000 + 0x7727680
Line 151 of "/home/uebuild/source/Source/Kopi/KopiGameInstance.cpp" starts at address 0x7927680 <_ZN17UKopiGameInstance9CrashGameEv+16> and ends at 0x7927686 <_ZN17UKopiGameInstance9CrashGameEv+22>.

Right! The culprit is at line 151 of /home/uebuild/source/Source/Kopi/KopiGameInstance.cpp. Because we kept all the sources alongside the binaries, we can take a look.
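
That path is the one on the build machine, not on our workstation, so we either open our own copy of the file or tell gdb where the kept sources live. A minimal sketch of the latter, assuming the sources were copied to /var/ue5_3/Linux/Kopi/source (a path made up for this example); set substitute-path simply remaps the build-time prefix to the local one:

(gdb) set substitute-path /home/uebuild/source /var/ue5_3/Linux/Kopi/source
(gdb) list KopiGameInstance.cpp:151

Either way, here is what lives at that line: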

void UKopiGameInstance::CrashGame() {
	for (int* X = reinterpret_cast<int*>(0x5f3759df); ; ++X) {
		*X = 0xface0fb0;
	}
}

This will obviously segfault; and I hope you appreciate the nerdy references to the Fast inverse square root algorithm and the Face of Boe. (I’ll see myself out.) Having the failure point is all well and good, but we want the entire trace; we can apply the method above to every frame to resolve the symbols to the appropriate lines.
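
Doing that by hand for thirty-odd frames gets tedious, so it is worth scripting. A minimal sketch, assuming the raw offsets from the minidump have been saved one per line in offsets.txt (a file name made up for this example); building a single command file and running gdb once in batch mode avoids re-reading the symbols for every frame:

$ while read off; do echo "info line *(0x200000 + 0x$off)"; done < offsets.txt > resolve.gdb
$ gdb -batch -x resolve.gdb Kopi-Linux-Shipping

Resolved in this way (and lightly reformatted), the full trace reads: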

... + 0x7727680 UKopiGameInstance::CrashGame() [/home/uebuild/source/Source/Kopi/KopiGameInstance.cpp:151]
... + 0x7708a09 UKopiGameInstance::execCrashGame(UObject*, FFrame&, void*) [/home/uebuild/source/Intermediate/Build/Linux/UnrealGame/Inc/Kopi/UHT/KopiGameInstance.gen.cpp:27]
... + 0x3743c3a UObject::execLetBool(UObject*, FFrame&, void*)()
... + 0x373f8ac ProcessLocalScriptFunction(UObject*, FFrame&, void*)()
... + 0x373f585 void ProcessScriptFunction<void (*)(UObject*, FFrame&, void*)>(UObject*, UFunction*, FFrame&, void*, void (*)(UObject*, FFrame&, void*))()
... + 0x373f8ac ProcessLocalScriptFunction(UObject*, FFrame&, void*)()
... + 0x35eade7 UFunction::Invoke(UObject*, FFrame&, void*)()
... + 0x37407f7 UObject::ProcessEvent(UFunction*, void*)()
... + 0x7139434 AActor::BeginPlay()()
... + 0x712b9ac APlayerController::BeginPlay()()
... + 0x6e007ca AKopiPlayerController::BeginPlay() [/home/uebuild/source/Source/Kopi/KopiPlayerController.cpp:11]
... + 0x6acaf69 AActor::DispatchBeginPlay(bool)()
... + 0x6ac9d1a AWorldSettings::NotifyBeginPlay()()
... + 0x6ac99fc AGameMode::HandleMatchHasStarted()()
... + 0x6ad06cd AGameMode::SetMatchState(FName)()
... + 0x6dc6357 AGameMode::StartPlay()()
... + 0x32ff799 UWorld::BeginPlay()()
... + 0x32ff02f UEngine::LoadMap(FWorldContext&, FURL, UPendingNetGame*, FString&)()
... + 0x32fe1b1 UEngine::Browse(FWorldContext&, FURL, FString&)()
... + 0x6dc2217 UEngine::TickWorldTravel(FWorldContext&, float)()
... + 0x6dbf455 UGameEngine::Tick(float, bool)()
... + 0x67ef366 FEngineLoop::Tick()()

... + 0x32d3029 _start()

Looking at where these frames come from, the notable point is AKopiPlayerController::BeginPlay(), but there is not much interesting code there; in fact, it is little more than a call to the inherited BeginPlay().

void AKopiPlayerController::BeginPlay()
{
	Super::BeginPlay();

	if (IsLocalPlayerController())
		...
}

So, maybe it’s the blueprint.

Well, “maybe” really means “we definitely rigged it up to prove a point”, but it shows the investigation process nicely.

Easy…

All of this actually is easy, as long as we keep all the relevant sources and preserve the entire environment in case we need to run the game for real and attach a debugger. We are of course using VCS, but being able to check out a specific revision of the source code is not enough; the entire build environment also has to be exactly as it was at the point of building. Wait a second: the IT industry has solved this already with immutable builds and containers. We can’t ship containers to Steam, so the next best option is to keep a copy of the VMs that produced the builds that were ultimately released.

And that is exactly what we do: after a release build, we copy the VM that produced it. If we are then unlucky enough to have to do the digging and debugging, we always know the failing game version, and we can easily look up the matching VM.

With this information, we know that the crashing build was version 0.0.1 at VCS revision 1869; we can use that to look up the right VM, copy it once more, and use that copy of the copy for debugging.

We can even run the entire game in the VM (thanks to GPU virtualization), and with a remote or local debugger attached, we can really dig into the problems that we collect from our poor play testers. (Though the perks of being one of our play testers are not to be sniffed at!)
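
For the remote case, the standard tool is gdbserver; a minimal sketch, with the VM host name and the port made up for this example:

# Inside the VM: attach gdbserver to the running game.
$ gdbserver --attach :2345 $(pidof Kopi-Linux-Shipping)

# On the workstation: load the matching binary and symbols, then connect.
$ gdb Kopi-Linux-Shipping
(gdb) target remote kopi-build-vm:2345

From there, it is the same gdb workflow as above, just against a live process instead of a minidump.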
