SpacemiT X60 (Banana Pi BPI-F3, SpacemiT K1) uarch-tool benchmarks

VLEN: 256

Detects whether the tail/mask agnostic policy writes all 1s, using a simple code snippet:
Tail agnostic policy: undisturbed
Mask agnostic policy: undisturbed
Is vl always set to min(AVL,VLMAX): yes
    Note: spec allows ceil(AVL/2)<=vl<=VLMAX for VLMAX<AVL<2*VLMAX
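
The vl rule in the note above can be written out as a small model. This is only a sketch of the constraint as stated in the RVV spec, not code from the benchmark tool. The helper names (`compute_vlmax`, `allowed_vl`, `observed_vl`) are hypothetical, and the SEW=32, LMUL=1, AVL=12 values are picked purely for illustration.

```python
def compute_vlmax(vlen_bits, sew_bits, lmul):
    """VLMAX = VLEN / SEW * LMUL (integer LMUL only, for simplicity)."""
    return vlen_bits * lmul // sew_bits


def allowed_vl(avl, vlmax):
    """Range of vl values the spec permits for a given AVL and VLMAX."""
    if avl <= vlmax:
        return range(avl, avl + 1)                # vl must equal AVL
    if avl < 2 * vlmax:
        return range((avl + 1) // 2, vlmax + 1)   # ceil(AVL/2) <= vl <= VLMAX
    return range(vlmax, vlmax + 1)                # vl must equal VLMAX


def observed_vl(avl, vlmax):
    """Behaviour reported above: vl is always min(AVL, VLMAX)."""
    return min(avl, vlmax)


vlmax = compute_vlmax(256, 32, 1)   # VLEN=256 from this report; SEW/LMUL assumed
permitted = allowed_vl(12, vlmax)   # AVL=12 falls in VLMAX < AVL < 2*VLMAX
print(vlmax, list(permitted), observed_vl(12, vlmax))
```

With these assumed parameters the spec would permit any vl from ceil(12/2) up to VLMAX, while an implementation that always picks min(AVL, VLMAX) takes the top of that range.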

Measures how LMUL scheduling impacts when results are ready:
A) LMUL=8 v0 overlap with LMUL=1 v0:     424.8934125 cycles/iter
B) LMUL=8 v0 overlap with LMUL=1 v3:     422.8972702 cycles/iter
C) LMUL=8 v0 overlap with LMUL=1 v7:     436.9277095 cycles/iter
D) LMUL=8 v0 overlap with LMUL=1 v8:     425.8244981 cycles/iter
E) LMUL=8 v0 overlap with LMUL=1 v0..v8: 372.6737289 cycles/iter

The difference between A and C indicates that results for the upper part of a vector register group are completed later than those for the lower part, even when long dependencies are present.
(E) is an unexpected outlier: it should have performed similarly to (D), and certainly not better than (A). Presumably the result has to do with worse chaining for instructions that read from and write to the same register.
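
To make the comparison concrete, the sketch below lists which registers an LMUL=8 group based at v0 covers and computes each measurement's offset from (A). The cycle counts are copied from the results above. The `group_registers` helper is hypothetical. Reading (D) as a non-overlapping baseline is an assumption based only on v8 lying outside the v0..v7 group.

```python
def group_registers(base, lmul):
    """Registers covered by a vector register group (base must be LMUL-aligned)."""
    if base % lmul != 0:
        raise ValueError("register group base must be a multiple of LMUL")
    return list(range(base, base + lmul))


# cycles/iter copied from the measurements above
cycles = {
    "A": 424.8934125,
    "B": 422.8972702,
    "C": 436.9277095,
    "D": 425.8244981,
    "E": 372.6737289,
}
# LMUL=1 register used by each single-register case
lmul1_reg = {"A": 0, "B": 3, "C": 7, "D": 8}

group = group_registers(0, 8)
inside = {label: reg in group for label, reg in lmul1_reg.items()}
delta = {label: value - cycles["A"] for label, value in cycles.items()}

for label in sorted(cycles):
    note = ""
    if label in inside:
        note = "inside group" if inside[label] else "outside group"
    print(f"{label}: {delta[label]:+.2f} cycles vs A {note}")
```

In this view (C) is about 12 cycles slower than (A), and (E) is about 52 cycles faster, which is the outlier discussed above.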

Measures overhead of reinterpreting a mask as a vector:
A) reinterpret:       16.0153167 cycles/iter
B) don't reinterpret: 15.0142419 cycles/iter

While B is slightly faster than A, the difference is only about a single cycle, which doesn't match what would be expected if the mask needed to be repacked. The difference is presumably due to scheduling.
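
For context on why no repacking is needed architecturally: RVV 1.0 defines the mask layout so that the mask bit for element i is bit i of the mask register, independent of SEW and LMUL. The sketch below models that layout in plain Python. `pack_mask` and `reinterpret_as_u8` are hypothetical helpers, and nothing here says how the X60 stores masks internally.

```python
def pack_mask(bits):
    """Pack mask bits so that element i lands in bit i (LSB first within each byte)."""
    out = bytearray((len(bits) + 7) // 8)
    for i, bit in enumerate(bits):
        if bit:
            out[i // 8] |= 1 << (i % 8)
    return bytes(out)


def reinterpret_as_u8(mask_bytes):
    """Viewing the mask register as a vector of bytes: same storage, no data movement."""
    return list(mask_bytes)


mask = pack_mask([1, 0, 1, 1, 0, 0, 0, 0, 1])   # 9 mask elements, arbitrary pattern
as_vector = reinterpret_as_u8(mask)

# cycles/iter copied from the measurements above
overhead = 16.0153167 - 15.0142419
print(as_vector, round(overhead, 3))
```

Under this layout a reinterpret is a pure relabelling of the same bits, which is consistent with the roughly one-cycle gap measured above.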