SIMD abstraction module for MoonBit. Automatically selects optimized implementations per target.
```
// Scalar implementations (all targets)
scalar_sum(arr : FixedArray[Int]) -> Int
scalar_dot_product(a : FixedArray[Int], b : FixedArray[Int]) -> Int
scalar_add(a : FixedArray[Int], b : FixedArray[Int], out : FixedArray[Int]) -> Unit
scalar_adler32(data : FixedArray[Byte]) -> UInt

// SIMD implementations (target-optimized)
simd_sum(arr : FixedArray[Int]) -> Int
simd_dot_product(a : FixedArray[Int], b : FixedArray[Int]) -> Int
simd_add(a : FixedArray[Int], b : FixedArray[Int], out : FixedArray[Int]) -> Unit
simd_adler32(data : FixedArray[Byte]) -> UInt
```

| Target | `simd_*` implementation | Speedup |
|---|---|---|
| native | C FFI (NEON/SSE intrinsics) | 13–33x |
| wasm | scalar fallback | none |
| wasm-gc | scalar fallback | none |
| js | scalar fallback | none |
```
scalar_sum_1024          1.60 µs
simd_sum_1024            0.12 µs (13x)
scalar_dot_product_1024  2.07 µs
simd_dot_product_1024    0.15 µs (14x)
scalar_add_1024          2.66 µs
simd_add_1024            0.08 µs (33x)
scalar_adler32_4096     18.33 µs
simd_adler32_4096        1.09 µs (17x)
```
Comparing against standalone C compiled with clang:
| Implementation | sum (1024) | Notes |
|---|---|---|
| MoonBit scalar (TCC) | 1600 ns | baseline |
| clang -O0 scalar | 654 ns | TCC ≈ clang -O0 |
| clang -O2 -fno-vectorize scalar | 251 ns | clang optimization quality |
| clang -O2 -fno-vectorize + NEON | 102 ns | explicit NEON gives 2.5x |
| clang -O2 scalar (auto-vectorization) | 69 ns | clang auto-vectorizes to NEON |
| MoonBit C FFI (native-stub) | 120 ns | ≈ clang -O2 NEON |
Conclusion: The 13–33x speedup is primarily due to TCC vs clang compiler quality, not SIMD instructions alone. NEON SIMD itself contributes only ~2–2.5x. clang -O2 auto-vectorizes scalar loops, making hand-written NEON intrinsics of limited additional value.
Scalar fallback only — no SIMD benefit. However, wasm scalar is ~2x faster than native scalar (TCC) thanks to JIT optimization:
```
wasm   scalar_sum_1024: 0.65 µs (JIT outperforms TCC)
native scalar_sum_1024: 1.60 µs (TCC codegen is slower)
```
MoonBit's inline WAT parser (Dwarfsm) does not fully support v128 SIMD instructions, preventing wasm SIMD optimization.
Dwarfsm v128 support status (verified by reading source / testing):
| Feature | Status | Notes |
|---|---|---|
| v128 as valtype (param/result) | ✅ Works | |
| v128.load / v128.store | ✅ Works | Memory access |
| f32x4.add, f32x4.mul, f64x2.add, f64x2.mul | ✅ Works | Float SIMD ops |
| v128.const | ❌ Parse error | Constant initialization |
| i32x4.add, i32x4.mul, etc. | ❌ Parse error | Integer SIMD ops |
| i32x4.extract_lane, i8x16.shuffle | ❌ Parse error | Lane manipulation |
Two remaining blockers:
- Missing instruction table entries: `i32x4.*`, `v128.const`, etc. are not registered in `dwarfsm_parse.ml`. These can be added by extending the pattern table.
- `#external` type maps to `externref`: `#external type V128` compiles to `externref` in wasm, causing a type mismatch with inline WAT's `v128`. A language-level `v128` type, or control over the `#external` wasm representation, is needed.
Related: moonbitlang/core#2844
GC-heap arrays cannot be addressed with v128.load — SIMD requires linear memory access.
MoonBit currently uses TCC as the default native compiler. If a clang/gcc backend is supported in the future, auto-vectorization would make C FFI SIMD largely unnecessary. The native-stub C files are compiled separately with the system compiler (clang on macOS), which is why C FFI already benefits from optimization.
- `simd_min` / `simd_max` (array min/max)
- `simd_memcmp` (byte array comparison)
- Float support (`FixedArray[Double]` → `f64x2`)
```
just test         # All targets
just test-native  # native only
just bench-native # native benchmark
just bench-wasm   # wasm benchmark
```

```
src/
├── moon.pkg.json    # Target-specific files + native-stub config
├── scalar.mbt       # Scalar implementations (all targets)
├── simd_wasm.mbt    # wasm: scalar fallback
├── simd_wasm_gc.mbt # wasm-gc: scalar fallback
├── simd_native.mbt  # native: extern "C" FFI declarations
├── simd_native.c    # C SIMD (NEON/SSE + scalar fallback)
├── simd_js.mbt      # js: scalar fallback
├── lib_test.mbt     # Correctness tests (17 tests)
└── lib_bench.mbt    # Benchmarks
```
MIT