Several months ago, I switched from the i3 window manager, which uses the X display protocol, to Sway, which uses the newer Wayland protocol. I made the switch because display-specific workspaces were quite buggy in i3 and because I wanted to try something new. At first it went well, really well in fact; workspaces worked perfectly and the tearing I used to have was gone.

All was going well until the last week of April, when I updated Sway and rebooted. Sway froze instantly on startup and held on to keyboard input. After some investigation, I found that Waybar seemed to be the problem. Removing it from the config let Sway start up and work normally… or so I thought. Sway would still freeze completely at random, the same way that it did on startup, so it was debug time.

Investigation

While getting a Sway debug log was quite trivial,1 it didn’t yield much.2 The ideal solution would be to find out where Sway was hanging with a core dump, but this proved difficult because both Sway and keyboard input were now frozen. Thus, I resorted to the SysRq shortcuts. These shortcuts are implemented in the kernel to perform basic, yet important, actions in cases like freezes. Below are some of the most common3 ones:

| Shortcut    | Name      | Description                                                            |
|-------------|-----------|------------------------------------------------------------------------|
| Alt+SysRq+r | Unraw     | Take control of keyboard back from the display server.                 |
| Alt+SysRq+e | Terminate | Send SIGTERM to all processes, allowing them to terminate gracefully.  |
| Alt+SysRq+i | Kill      | Send SIGKILL to all processes, forcing them to terminate immediately.  |
| Alt+SysRq+s | Sync      | Flush data to disk.                                                    |
| Alt+SysRq+u | Unmount   | Unmount and remount all filesystems read-only.                         |
| Alt+SysRq+b | Reboot    | Reboot                                                                 |
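Whether these shortcuts actually do anything depends on the kernel's SysRq setting, which many distributions restrict by default. As a rough sketch (not something from the original debugging session), the C below decodes the value in /proc/sys/kernel/sysrq. The bit values follow the kernel's SysRq documentation; the function and constant names are made up for illustration.

```c
#include <stdio.h>

/* Bit values for /proc/sys/kernel/sysrq (names here are illustrative). */
enum {
    SYSRQ_KEYBOARD = 4,   /* keyboard control, needed for unraw (r) */
    SYSRQ_SYNC     = 16,  /* sync (s) */
    SYSRQ_REMOUNT  = 32,  /* remount read-only (u) */
    SYSRQ_SIGNAL   = 64,  /* signalling processes: terminate (e), kill (i) */
    SYSRQ_REBOOT   = 128  /* reboot (b) */
};

/*
 * Returns 1 if the given function is allowed by the mask, 0 otherwise.
 * A mask of 1 enables everything, 0 disables everything, and any larger
 * value is a bitmask. Negative values (read errors) allow nothing.
 */
int sysrq_allows(int mask, int bit)
{
    if (mask < 0)
        return 0;
    if (mask == 1)
        return 1;
    return (mask & bit) != 0;
}

/* Reads the current mask, or returns -1 if the file cannot be read. */
int read_sysrq_mask(void)
{
    int mask = -1;
    FILE *f = fopen("/proc/sys/kernel/sysrq", "r");
    if (!f)
        return -1;
    if (fscanf(f, "%d", &mask) != 1)
        mask = -1;
    fclose(f);
    return mask;
}
```

For example, a mask of 176 (128 + 32 + 16) allows sync, remount and reboot, but not the unraw or kill shortcuts, so `sysrq_allows(read_sysrq_mask(), SYSRQ_KEYBOARD)` is worth checking before relying on Alt+SysRq+r.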

Everything beyond here was aided by the generous help of Xyene. Thank you, if you ever read this.

So, to get a core dump of the sway process, I used the unraw shortcut, switched to a tty, and dumped the running process:4

gcore <sway pid>
gdb /usr/bin/sway <core file>
bt full

Here’s the important part, really just the first line:

0  0x00007fcbb1012c6f in json_c_get_random_seed () at /usr/lib/libjson-c.so.5
1  0x00007fcbb1011fd6 in  () at /usr/lib/libjson-c.so.5
2  0x00007fcbb100c713 in json_object_object_add_ex () at /usr/lib/libjson-c.so.5
3  0x0000561dc53e42ff in ipc_json_describe_bar_config (bar=bar@entry=0x561dc6f0cbb0) at ../sway/sway/ipc-json.c:1013
        __PRETTY_FUNCTION__ = "ipc_json_describe_bar_config"
        json = 0x561dc74da8b0
        gaps = <optimized out>
        colors = <optimized out>
        tray_bindings = <optimized out>
        tray_bind = <optimized out>
# Truncated...

What this revealed was that Sway was actually freezing inside json-c, a JSON library that Sway uses. The json-c source code calls the json_c_get_random_seed function in a loop that repeats for as long as the result is -1, so if the function only ever returns -1, the loop never exits. And so, we found out where json-c was freezing, but the question of why remained.
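To make the failure mode concrete, here is a minimal sketch of that retry pattern. It is not json-c's code: the seed sources are stand-ins, every name is hypothetical, and it assumes -1 is treated as a "no seed yet" marker, as the loop condition suggests. The one deliberate difference is a retry cap, so the sketch terminates where the real loop would spin forever.

```c
/* Hypothetical type for anything that hands out an int seed. */
typedef int (*seed_fn)(void);

/* Stand-in for a seed source that is stuck on -1. */
int stuck_seed(void) { return -1; }

/* Stand-in for a seed source that works. */
int healthy_seed(void) { return 42; }

/*
 * Calls next_seed until it returns something other than -1, giving up
 * after max_tries calls. Returns the number of calls made on success
 * (storing the seed in *seed_out), or -1 if every call returned -1.
 * Without the max_tries cap, a stuck source would loop forever.
 */
int tries_until_seeded(seed_fn next_seed, int max_tries, int *seed_out)
{
    for (int i = 1; i <= max_tries; i++) {
        int seed = next_seed();
        if (seed != -1) {
            *seed_out = seed;
            return i;
        }
    }
    return -1;
}
```

With `healthy_seed` the function returns after a single call; with `stuck_seed` it burns through every attempt, which is exactly what an uncapped loop would keep doing indefinitely.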

Delving into json_c_get_random_seed, another function called get_rdrand_seed is run to try to get a random number using the RDRAND CPU instruction. This seems fine,5 except when you take into account the fact that my CPU is an AMD Ryzen 5 3600X… which sometimes has a horribly malfunctioning RDRAND instruction that always returns 0xFFFFFFFFFFFFFFFF (all bits set, which is -1 as a signed integer). This isn’t normally an issue because very few programs attempt to use RDRAND without checking whether it failed, and many rely on /dev/urandom instead. Both the Linux kernel and systemd check to make sure RDRAND returns a sane random number.
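For comparison, a seed can be read from /dev/urandom in a few lines. This is only a sketch of that alternative, not what json-c does; the function name is made up. The success flag is reported separately from the seed, so a seed that happens to equal -1 is not mistaken for a failure.

```c
#include <stdio.h>

/*
 * Fills *seed with random bytes from /dev/urandom.
 * Returns 0 on success and -1 if the device could not be opened or read.
 */
int urandom_seed(int *seed)
{
    FILE *f = fopen("/dev/urandom", "rb");
    if (!f)
        return -1;
    size_t got = fread(seed, sizeof *seed, 1, f);
    fclose(f);
    return got == 1 ? 0 : -1;
}
```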

Conclusion

So, to finally fix this glorious bug, Xyene introduced a check into the has_rdrand function (which checks whether to use the RDRAND instruction later on) that disables RDRAND if it returns the same value 10 times in a row. The important section can be seen below:

// Some CPUs advertise RDRAND in CPUID, but return 0xFFFFFFFF
// unconditionally. To avoid locking up later, test RDRAND here. If over
// 10 trials RDRAND has returned the same value, declare it broken.
_has_rdrand = 0;
int prev = get_rdrand_seed();
for (int i = 0; i < 10; i++) {
	int temp = get_rdrand_seed();
	if (temp != prev) {
		_has_rdrand = 1;
		break;
	}

	prev = temp;
}
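The same idea can be tried out without a broken CPU by passing the random source in as a function pointer. The sketch below is an illustrative rewrite, not the patch itself: the names are hypothetical, and the two sources merely simulate a stuck and a working RDRAND.

```c
/* Hypothetical type for a source of supposedly random ints. */
typedef int (*rand_source_fn)(void);

/* Simulates a broken RDRAND that always returns all bits set. */
int stuck_source(void) { return -1; }

/* Simulates a working source by returning a different value each call. */
int counting_source(void)
{
    static int n = 0;
    return n++;
}

/*
 * Draws one value and then up to `trials` more, comparing each with the
 * one before it. Returns 1 as soon as two consecutive values differ and
 * 0 if every draw matched, in which case the source is declared broken.
 */
int source_looks_alive(rand_source_fn next, int trials)
{
    int prev = next();
    for (int i = 0; i < trials; i++) {
        int temp = next();
        if (temp != prev)
            return 1;
        prev = temp;
    }
    return 0;
}
```

Calling `source_looks_alive(stuck_source, 10)` gives 0, while a source that changes at all passes on the first comparison, so the check costs a healthy CPU almost nothing.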

With this check in place, the chance of disabling a correctly functioning RDRAND instruction is a whopping 0.00000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000004% (4.68e-95%).
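That figure can be reproduced under two assumptions that are not stated in the patch: the seed is a 32-bit int, and a healthy RDRAND yields independent, uniformly distributed values. A working RDRAND is then only disabled if all 10 comparisons find a match:

```latex
P(\text{false positive})
  = \left(2^{-32}\right)^{10}
  = 2^{-320}
  \approx 4.68 \times 10^{-97}
  = 4.68 \times 10^{-95}\,\%
```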

A follow-up patch was also required because of some inline assembly that tried to get the cpuid bit, which can be seen here.
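For context on what "the cpuid bit" is: RDRAND support is advertised in bit 30 of ECX for CPUID leaf 1. The sketch below reads it with the `__get_cpuid` helper from GCC and Clang's `<cpuid.h>` rather than inline assembly. It is not the follow-up patch, the function names are made up, and on non-x86 builds it simply reports no support.

```c
#if defined(__x86_64__) || defined(__i386__)
#include <cpuid.h>
#endif

/* CPUID leaf 1 reports RDRAND support in bit 30 of ECX. */
#define RDRAND_ECX_BIT (1u << 30)

/* Returns 1 if the given ECX value has the RDRAND bit set, else 0. */
int ecx_advertises_rdrand(unsigned int ecx)
{
    return (ecx & RDRAND_ECX_BIT) != 0;
}

/*
 * Returns 1 if the CPU advertises RDRAND and 0 otherwise, including on
 * non-x86 builds. "Advertised" only means the bit is set; it says
 * nothing about whether the instruction returns sane values.
 */
int cpu_advertises_rdrand(void)
{
#if defined(__x86_64__) || defined(__i386__)
    unsigned int eax, ebx, ecx, edx;
    if (!__get_cpuid(1, &eax, &ebx, &ecx, &edx))
        return 0;
    return ecx_advertises_rdrand(ecx);
#else
    return 0;
#endif
}
```

This is why the runtime test matters: an affected CPU passes this check and still hands back the same value every time.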


The associated Sway issue for this post is swaywm/sway#5290, along with json-c/json-c#489 and json-c/json-c#590 for the json-c issues.


  1. Just use sway -d 2> log

  2. See here.

  3. These are the shortcuts needed for a safe reboot, taken from the Arch wiki.

  4. This also required a build of Sway which didn’t strip symbols. This was done by using the sway-git package off the AUR. Core dump available here.

  5. Now whether using RDRAND here is a good idea, I’m not really sure. All I know is that systemd includes a paragraph explaining why it does.