This document is in the Development state

Expect potential changes. This draft specification is likely to evolve before it is accepted as a standard. Implementations based on this draft may not conform to the future standard.

This specification is licensed under the Creative Commons Attribution 4.0 International License (CC-BY 4.0). The full license text is available at creativecommons.org/licenses/by/4.0/.

Copyright 2025 by RISC-V International.

1. Introduction

This document proposes new in-lane vrgather variants, which allow hardware to provide specialized, lower-latency implementations. This is needed to stay competitive with other contemporary SIMD/Vector ISAs, which already provide dedicated in-lane shuffle instructions.

2. Motivation and Rationale

The RISC-V Vector extension provides the vrgather.vv and vrgatherei16.vv instructions for arbitrarily shuffling elements within vector registers. While this type of instruction is very powerful and required for any modern SIMD/Vector ISA, there are a few challenges that complicate hardware implementations.

vrgather fundamentally scales quadratically in complexity with element count, and by extension with VLEN and LMUL. However, many vrgather shuffles don't require this full generality, as they only shuffle elements within power-of-two lanes of elements.

A fast and high-throughput LMUL=1 vrgather implementation is non-negotiable for application-class processors that expect to run general-purpose code. LMUL>1 vrgather is usually implemented by applying an LMUL=1 vrgather primitive LMUL^2 times. This is acceptable because, if an application doesn't need to gather across LMUL=1 vector register boundaries, it can use multiple LMUL=1 vrgathers instead, which scales linearly and also reduces register pressure.
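
As an illustration, the following C sketch (using the standard RVV C intrinsics; treat it as a sketch, not a tuned implementation) decomposes an LMUL=4 gather whose indices never cross LMUL=1 register boundaries into four independent LMUL=1 gathers. The caller is assumed to supply the per-register vl and indices already rebased to be register-relative:

#include <riscv_vector.h>

// When no index crosses an LMUL=1 register boundary, an LMUL=4 gather
// can be replaced by four independent LMUL=1 gathers, which scales
// linearly rather than quadratically.
static vuint8m4_t gather_per_m1_register(vuint8m4_t values, vuint8m4_t indices,
                                         size_t vl_m1) {
    vuint8m4_t res = __riscv_vundefined_u8m4();
    res = __riscv_vset_v_u8m1_u8m4(res, 0, __riscv_vrgather_vv_u8m1(
              __riscv_vget_v_u8m4_u8m1(values, 0),
              __riscv_vget_v_u8m4_u8m1(indices, 0), vl_m1));
    res = __riscv_vset_v_u8m1_u8m4(res, 1, __riscv_vrgather_vv_u8m1(
              __riscv_vget_v_u8m4_u8m1(values, 1),
              __riscv_vget_v_u8m4_u8m1(indices, 1), vl_m1));
    res = __riscv_vset_v_u8m1_u8m4(res, 2, __riscv_vrgather_vv_u8m1(
              __riscv_vget_v_u8m4_u8m1(values, 2),
              __riscv_vget_v_u8m4_u8m1(indices, 2), vl_m1));
    res = __riscv_vset_v_u8m1_u8m4(res, 3, __riscv_vrgather_vv_u8m1(
              __riscv_vget_v_u8m4_u8m1(values, 3),
              __riscv_vget_v_u8m4_u8m1(indices, 3), vl_m1));
    return res;
}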

When increasing VLEN, a fast LMUL=1 vrgather implementation grows quadratically in area. This is a cost application-class processors must pay; however, it also increases instruction latency and potentially reduces the throughput that is affordable within design constraints. Ideally, we'd want to add new vrgather instructions that specify a lane width and only shuffle within those lanes. This allows implementations to execute these vrgather variants at lower latency using smaller shuffle primitives. Additionally, they can be implemented on more execution units than would be affordable if every execution unit had to implement the full vrgather.

AVX-512 and SVE2.1 have precedent for such instructions, specifically for shuffles within lanes of 128 bits. Looking at the different AVX-512 implementations in Table 1, we can see that the general shuffle vpermb indeed has a higher latency and lower throughput than the in-128-bit-lane shuffle vpshufb.

Table 1. Latency/throughput of 256-bit vector register shuffles (Abel & Reineke, 2019)

            icelake   tigerlake   rocketlake   alderlake-P   Zen4   Zen5 (Yee, 2025)
vpshufb     1/2       1/2         1/2          1/2           2/2    2/2
vpermb      3/1       3/1         3/1          3/1           4/2    4/2
vpermw      3/1       3/1         3/1          3/1           4/2    4/2

These latency and throughput improvements are the main motivation for this proposal. For implementations with a native gather primitive smaller than VLEN, adding hardware that ensures the gather primitive is only applied between lanes that are actually read covers more shuffles than the proposed in-lane shuffles do. It also achieves linear scaling for operations like reverse or expanding/compressing shuffles (e.g., base64 en/decoding). While this is a great fit for long-vector DLEN<VLEN architectures, it isn't really applicable to obtaining latency and throughput improvements in high-performance out-of-order application-class processors, which usually implement a VLEN-wide gather primitive.

Figure 1 and Figure 2 visualize how knowing the lane size reduces the amount of work the shuffle has to do.

Figure 1. VLEN=256 SEW=32 4-element vrgather.vv shuffle
Figure 2. VLEN=256 SEW=32 4-element vrgather128.vv shuffle

Figure 3 and Figure 4 show how this applies to the common pattern of using register gather instructions to implement lookup tables.

Figure 3. VLEN=256 SEW=32 4-element vrgather.vv LUT
Figure 4. VLEN=256 SEW=32 4-element vrgather128.vv LUT

To summarize, we propose new in-lane vrgather variants that allow implementations to easily provide specialized implementations of in-lane shuffles.

3. Design considerations

3.1. Defining the lane size

Should the lane size be defined based on the number of bits or the number of elements?

It is important that, for the chosen definition of lane size, shuffles of N-wide lanes of different element widths have similar complexity, because RVV uses the same instruction encoding for different element widths.

Other ISAs define their in-lane shuffle instructions based on a lane size in bits. Most x86 and Arm processors don't report different latencies or throughputs for shuffles of different element widths (see Table 1), so the lane size should be defined in terms of bits.

3.2. Instruction semantics

The usages of in-lane shuffles fall into three categories:

  • lookup-table: The indices are variable, the values are constant, and the same values are in every lane.

  • static shuffle: The indices are constant and apply a constant permutation to the values.

  • dynamic shuffle: The indices and values are both variable, e.g. a different shuffle on every lane.

Appendix A indicates that the static shuffle use case is the most common, followed by the lookup-table use case, with dynamic shuffles seeing the least use.

The static shuffle usage can be nicely covered by a vector-scalar in-lane register gather that encodes the shuffle indices applied to each lane in a single 64-bit GPR. Specifically, one can encode 16 4-bit indices in a 64-bit GPR, so this type of shuffle only works with 16 or fewer elements in each lane. This suggests an instruction that shuffles 16 SEW-wide elements within 16*SEW-bit-wide lanes. As discussed in Section 3.1, however, we want the lane size in bits to be the same for every SEW.
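
For illustration, a small C helper (hypothetical, not part of any proposed API) that packs 16 4-bit in-lane indices into the 64-bit scalar operand might look like this:

#include <stdint.h>

// Pack 16 4-bit in-lane shuffle indices into a 64-bit GPR value.
// idx[0] lands in the least significant nibble, matching the per-element
// index extraction in the vrgather<N>ei4.vx operation below.
static uint64_t pack_indices4(const uint8_t idx[16]) {
    uint64_t v = 0;
    for (int i = 0; i < 16; i++)
        v |= (uint64_t)(idx[i] & 0xF) << (4 * i);
    return v;
}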

To facilitate this, we can instead define an N-bit-wide vector-scalar vrgather variant (vrgather<N>ei4.vx) that ignores SEW and always shuffles N/16-bit-wide elements. So vrgather128ei4.vx shuffles 16 8-bit elements in 128-bit lanes, while vrgather512ei4.vx shuffles 16 32-bit elements in 512-bit lanes, regardless of SEW.
To keep the result usable when N/16 != SEW, the instruction operates under EEW=N/16, EMUL=LMUL and EVL=(VL*SEW+EEW-1)/EEW. This has the useful knock-on effect of allowing type punning without two additional vl- and vtype-changing vsetvli instructions, which can be quite handy in complex SIMD code.
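
In C terms, the derived parameters could be sketched as follows (hypothetical helper names, illustration only):

#include <stddef.h>

// Derived execution parameters for vrgather<N>ei4.vx (sketch): the
// instruction ignores SEW and operates on N/16-bit elements, rounding
// the data size covered by VL up to whole EEW-wide elements.
static size_t eew_bits(size_t n_bits) { return n_bits / 16; }

static size_t evl(size_t vl, size_t sew_bits, size_t n_bits) {
    size_t eew = eew_bits(n_bits);
    return (vl * sew_bits + eew - 1) / eew; // ceil(vl*sew / eew)
}
// e.g. evl(10, 32, 128) == 40: ten 32-bit elements are reinterpreted
// as forty 8-bit elements, with no intervening vsetvli needed.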

We want to address the other usages with vector-vector in-lane register gather variants. These instructions should only shuffle within N-bit lanes and use the lower-order bits of the indices in the corresponding lanes to control the shuffles. The upper index bits should be implicitly masked out to better support the lookup-table use case, which would otherwise often need an additional AND operation.
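
A scalar C model may clarify the intended semantics; this is a non-normative sketch under the assumptions above, shown for SEW=8 and power-of-two lane sizes:

#include <stddef.h>
#include <stdint.h>

// Scalar sketch of the proposed vrgather<N>.vv semantics for SEW=8.
// Indices are confined to their own N-bit lane, and the upper index
// bits are implicitly masked out, so a 16-entry LUT lookup with
// vrgather128.vv needs no preliminary AND on the index vector.
static void vrgatherN_vv_ref_e8(uint8_t *vd, const uint8_t *vs2,
                                const uint8_t *vs1, size_t vl,
                                size_t n_bits) {
    size_t lane_elems = n_bits / 8;                  // elements per lane
    for (size_t i = 0; i < vl; i++) {
        size_t lane_base = (i / lane_elems) * lane_elems;
        size_t idx = vs1[i] & (lane_elems - 1);      // implicit masking
        vd[i] = vs2[lane_base + idx];
    }
}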

One additional thing to consider is how the instructions deal with LMUL>1.

For the lookup-table usage, it would be nice if the instructions always read an EMUL=1 value vector, as this would save vector registers compared to EMUL=LMUL and still produce the same result. Such an instruction could still implement LMUL>1 dynamic shuffles by executing multiple LMUL=1 instructions and temporarily adjusting vl and vtype. The same could be done with an EMUL=LMUL variant, if one wants to use it to implement a lookup table and save vector registers. Considering that the lookup-table usage seems to be more common than the dynamic shuffle one, the EMUL=1 value vector register source could be a good idea.

This does, however, run into problems when the lane size is larger than VLEN, in which case EMUL=1 wouldn't be enough. It would also be a new type of operation with different register dependencies than any currently defined instruction.

So, for simplicity and to cover all usages, a simple vector-vector in-lane vrgather (vrgather<N>.vv) design with EMUL=LMUL was chosen.

4. Proposed Instructions

4.1. vrgather<N>.vv

Synopsis

N-bit in-lane vector register gather (128-bit, 256-bit, 512-bit and 1024-bit)

Mnemonics

vrgather128.vv vd, vs2, vs1, vm
vrgather256.vv vd, vs2, vs1, vm
vrgather512.vv vd, vs2, vs1, vm
vrgather1024.vv vd, vs2, vs1, vm

Encoding
Description

The in-lane vector register gather instructions are vector register gather instructions that read elements from N-bit lanes of a first source vector register group at locations, within the lanes, given by the corresponding lanes of a second source vector register group. The index values are taken from the lowest log2(N/SEW) bits of the elements in the second vector; the remaining index bits are ignored. The source vector can be read at any index within a lane, regardless of vl.

Operation
if (vs1 == vd | vs2 == vd) then return Illegal_Instruction();
let lane_elems = N / SEW;
foreach (i from 0 to vl - 1) {
  let idx = (i / lane_elems) * lane_elems + unsigned(vs1_val[i][log2(lane_elems) - 1 .. 0]);
  vd_val[i] = if idx < VLMAX then vs2_val[idx] else zeros();
}

4.2. vrgather<N>ei4.vx

Synopsis

N-bit in-lane vector-scalar register gather (128-bit, 256-bit, 512-bit and 1024-bit)

Mnemonics

vrgather128ei4.vx vd, vs2, rs1, vm
vrgather256ei4.vx vd, vs2, rs1, vm
vrgather512ei4.vx vd, vs2, rs1, vm
vrgather1024ei4.vx vd, vs2, rs1, vm

Encoding
Description

The in-lane vector-scalar register gather instructions are vector register gather instructions that read elements from N-bit lanes of a source vector register group at locations within the lanes, given by the 16 4-bit indices encoded in the scalar register. The source vector can be read at any index within a lane, regardless of vl. The instructions operate with EEW=N/16, EMUL=LMUL and EVL=(VL*SEW+EEW-1)/EEW, enabling efficient type punning.

Software should prefer the variant with the smallest lane for a given shuffle, even if EEW doesn't match the element width it wants to operate on. For example, duplicating the first 32-bit element in lanes of four 32-bit elements, i.e. 128-bit lanes, can be accomplished with vrgather128ei4.vx and a scalar register with the value 0x3210321032103210.
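
Using the hypothetical pack_indices4() helper sketched in Section 3.2, this constant can be derived mechanically:

#include <stdint.h>

// Byte indices that replicate bytes 0..3 (the first 32-bit element)
// across each 128-bit lane; packed least-significant-nibble-first,
// this yields exactly the constant from the text.
static const uint8_t dup_first_u32[16] = {
    0, 1, 2, 3,  0, 1, 2, 3,  0, 1, 2, 3,  0, 1, 2, 3
};
// pack_indices4(dup_first_u32) == 0x3210321032103210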

Operation
let EEW = N/16;
let EMUL = LMUL;
let EVL = (VL*SEW+EEW-1)/EEW;
if (vs1 == vd) then return Illegal_Instruction();
let idx = i/16*EEW + unsigned(vs1_val[i%16*4+4 .. i%16*4]);
if idx < VLMAX then vs2_val[idx] else zeros();
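
As with the vector-vector variant, a non-normative scalar C sketch of these semantics (shown for N=128, so EEW=8) may help:

#include <stddef.h>
#include <stdint.h>

// Scalar sketch of vrgather128ei4.vx: each 128-bit lane holds 16 8-bit
// elements, and every lane is shuffled by the same 16 4-bit indices
// packed into the 64-bit scalar operand.
static void vrgather128ei4_vx_ref(uint8_t *vd, const uint8_t *vs2,
                                  uint64_t rs1, size_t evl) {
    for (size_t i = 0; i < evl; i++) {
        size_t lane_base = (i / 16) * 16;           // first element of lane
        size_t idx = (rs1 >> ((i % 16) * 4)) & 0xF; // i-th 4-bit index
        vd[i] = vs2[lane_base + idx];
    }
}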

Bibliography

Abel, A., & Reineke, J. (2019). uops.info: Characterizing Latency, Throughput, and Port Usage of Instructions on Intel Microarchitectures. Proceedings of the Twenty-Fourth International Conference on Architectural Support for Programming Languages and Operating Systems, 673–686. doi.org/10.1145/3297858.3304062

Yee, A. J. (2025). Zen5's AVX512 Teardown + More… www.numberworld.org/blogs/2024_8_7_zen5_avx512_teardown/

Appendix A: Usage survey

The goal of this survey is to quantify the usage of in-lane shuffles in existing SIMD code. It was conducted by inspecting various open-source codebases by hand.

Table 2. vpermb/vpshufb occurrences in open-source codebases

usage     dav1d   x264   ffmpeg   hyperscan   simdutf   simdjson   hyperscan
vpermb    623     0      37       26          21        0          21
vpshufb   1347    184    182      104         91        10         104

Table 2 shows that in-lane shuffle operations are very common.

Table 3. Categorization of vpshufb usage in open-source codebases

vpshufb usage   LUT   constant-shuffle   dynamic-shuffle
ffmpeg          14    168                0
simdutf         20    64                 7
simdjson        8     0                  2 (implements vcompress.vm)

Table 3 shows how common the different types of shuffles are. The large amount of constant-shuffle usage motivates the separate vrgather<N>ei4.vx instructions.