SWAR UTF-8 Validation with the hidden RISC-V gem xperm4

I was reading through the RISC-V spec again a few days ago, as one does, and found one of the greatest bit-banging instructions I've seen so far: xperm4 and xperm8
These nifty instructions form the Zbkx ("Crossbar permutations") extension, which is part of the Zbkb "Bit-manipulation for Cryptography" extension, which are sadly currently not planned to be part of future application profiles. This is because the expectation is for the vector crypto extensions to replace it, so it shouldn't be mandatory, but I'll argue that these instructions are also very useful in other contexts.

Crossbar permutations: xperm4 and xperm8

The xperm4 and xperm8 instructions implement a "lookup table" for 4 and 8 bit elements inside a general purpose registers. This is quite analogous to the vrgather.vv vector instructions, just in general purpose registers instead.

xperm4 is particularly elegant, as a 64-bit general purpose register can hold 16 nibbles, and a nibble can address 16 indices, so you apply arbitrary bit-permutations in 4-bit lanes.

Here are some simple usage examples:

SWAR UTF-8 Validation

The title promised UTF-8 validation, so here we go!

I explained in the previous article how to implement UTF-8 to UTF-16 conversion, one step of which is validating the UTF-8 input. This was done using three 4-bit lookup tables, as described in "Validating UTF-8 In Less Than One Instruction Per Byte".
Sound familiar?
xperm4 gives us almost exactly what we need, the only difference is that we need to look up 16 8-bit elements, so we can just call xperm4 twice:

#include <riscv_bitmanip.h>

static uint64_t
perm4x8_lo(uint64_t x, uint64_t hi, uint64_t lo) {
	return (__riscv_xperm4_64(hi, x) & 0x0F0F0F0F0F0F0F0F) << 4
	     | (__riscv_xperm4_64(lo, x) & 0x0F0F0F0F0F0F0F0F);
}

static uint64_t
perm4x8_hi(uint64_t x, uint64_t hi, uint64_t lo) {
	return (__riscv_xperm4_64(hi, x) & 0xF0F0F0F0F0F0F0F0)
	     | (__riscv_xperm4_64(lo, x) & 0xF0F0F0F0F0F0F0F0) >> 4;
}

These two helpers implement 4-bit lookup of 8-bit elements, one uses the lower and one the higher nibble as the index. Now we can directly translate the firs part of the RVV code to scalar, however one puzzle piece is still missing.
We need to create a mask of 3 byte (0b111?????) and 4 byte (0b1111????) leading UTF-8 characters and combine this with the lookup table result. Notice how only the upper nibble is important. This calls for our friend xperm4. It can be used to map all matching nibbles to 0xFF and all others to zero, then we mask out the not needed result from the lower nibbles, and we are done.

static uint64_t
check_utf8_bytes(uint64_t in0, uint64_t next) {
	uint64_t in1 = in0>> 8|next<<56;
	uint64_t in2 = in0>>16|next<<48;
	uint64_t in3 = in0>>24|next<<40;
	uint64_t err = perm4x8_hi(in2, 0x4102888800000000, 0x9511000022222222)
	             & perm4x8_lo(in2, 0xCCDCCCCCCCC888AE, 0xBBBBBBBBBBBB3337)
	             & perm4x8_hi(in3, 0x0000BBAE00000000, 0x1111AAE611111111);
	uint64_t is_3 = __riscv_xperm4_64(0xFFull<<56, in1); /* 0b111????? */
	uint64_t is_4 = __riscv_xperm4_64(0xF0ull<<56, in0); /* 0b1111???? */
	return ((is_3|is_4) & 0x8080808080808080) ^ err;
}

There we go, now we have a primitive that we need to apply on aligned 64-bit slices of the input.
I don't expect you to follow this exactly, if you aren't familiar with this method of UTF-8 validation. Hopefully you could see how xpack4 was invaluable in implementing it.

For completeness, here is a function that uses our primitive to validate a byte stream.

static int
utf8_validate(const char *buf, size_t len)
{
	typedef uint64_t __attribute__((__may_alias__)) u64;
	uint64_t prev, cur, i;

	for (prev = 0; len && ((uintptr_t)buf & 7) != 0; --len)
		prev = prev>>8 | *buf++*1ull<<56;

	for (; len >= 8; len -= 8, buf += 8, prev = cur) {
		cur = *(u64*)buf;
		if (!((prev | (cur << 40)) & 0x8080808080808080)) continue;
		if (check_utf8_bytes(prev, cur)) return 0;
	}

	for (i=cur=0; len; --len, i += 8)
		cur |= *buf++*1ull << i;

	return !check_utf8_bytes(prev, cur) && !check_utf8_bytes(cur, 0);
}

We could handle the head and tail with separate scalar code, however it's easiest just use or primitive again, but with partially filled 64-bit numbers. This implementation also fast forwards if a 64-bit register contains only ASCII characters.

Benchmarks

For benchmarking, you usually need hardware, but currently there is no RISC-V chip that implements the scalar crypto instructions, that regular people can buy.

RISC-V isn't a proprietary ISA, though, which allows for open-source implementations. We'll be using the second and third generation of the open-source XiangShan high performance out-of-order cores. As of writing, XiangShanV3 outperforms any available and announced proprietary RISC-V dev board:

6.11 7.01 8.65 9.55 9.90 11.0 15.0 15.73 21.69 XT910 XiangShanV1 SiFive P550 XiangShanV2 ARM Cortex-A76 SiFive P650 XiangShanV3 ARM Cortex-X1 Apple M1 Estimated SPECint 2006/GHz (Proportional to IPC)

Interestingly MILK-V just announced a Laptop featuring XiangShanV2 cores, and there will be a presentation about the XiangShan project at Hot Chips 2024.

As input text, I've used the first lines of languages in the unicode_lipsum/lipsum dataset, and filled up the rest with Emojis to reach a total of 5000 bytes. Using a larger input that that wasn't possible, since simulating RTL is quite slow. For reference, I also included measurements from my Zen1 Desktop, and vectorized implementations from simdutf. I modified all implementations to remove the just ASCII fast path, because the regular scalar implementation can only use it with fast unaligned load/stores.
Anyway, here are the results:

ImplementationXiangShanV2XiangShanV3Ryzen 1600x (Zen1)
scalar54413 cycles20714 cycle24272 cycles
xperm4 SWAR15050 cycles7520 cycles---
LUT38085 cycles31529 cycle27010 cycles
simdutf---7349 cycles4255 cycles

Wow, we are about 2x faster than the regular scalar implementation, and surprisingly also very close to the RVV implementation.
That might sound weird, however, a likely explanation is that the XiangShanV3 RVV implementation is a relatively new addition to the core and still being worked on. Predicting and speculating across vsetvli instructions for example is currently not supported, but on the roadmap. It also has two xperm4 capable execution units, but only one that supports vrgather.