[Mono AOT] Improve `dn_simdhash` lookup on arm64 #113074

## Description

The current arm64 implementation of the `dn_simdhash` lookup in the .NET runtime isn't optimized. arm64 has no direct equivalent of the `_mm_movemask_epi8` intrinsic, so the lookup emulates it, which hurts performance: https://github.com/dotnet/runtime/blob/367cf39652b1193d04ce3ac345a6384ecba53382/src/native/containers/dn-simdhash-arch.h#L93-L124

Optimizing this lookup can reduce AOT compilation (build) time for the MAUI template app in the Debug configuration on a macOS arm64 host by ~80%:
 - SIMD emulation implementation:
   - AOT compilation of the dedup assembly: 247,423 ms
   - Isolated lookup for 1,000,000 iterations: 172 ms
 - Software lookup implementation (`g_hash_table_lookup`):
   - AOT compilation of the dedup assembly: 47,692 ms
   - Isolated lookup for 1,000,000 iterations: 66 ms

## Alternative implementations
 - https://github.com/f4exb/cm256cc/blob/master/sse2neon.h
 - https://community.arm.com/arm-community-blogs/b/servers-and-cloud-computing-blog/posts/porting-x86-vector-bitmask-optimizations-to-arm-neon
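
As a rough illustration of the shift-right-narrow approach discussed in the Arm blog post above, the sketch below builds a 64-bit mask with one nibble per input byte instead of emulating `_mm_movemask_epi8` bit by bit. The helper name `first_match_index_sketch` is hypothetical and not part of `dn_simdhash`. The sketch assumes a GCC/Clang-style `__builtin_ctzll` and a little-endian target, and it includes a scalar fallback so the same logic can be checked on non-arm64 hosts.

```c
#include <stdint.h>

#if defined(__ARM_NEON) && defined(__aarch64__)
#include <arm_neon.h>
#endif

// Hypothetical helper (not from dn_simdhash): returns the index (0-15) of the
// first byte in haystack equal to needle, or 16 if no byte matches.
static inline uint32_t
first_match_index_sketch (const uint8_t haystack[16], uint8_t needle)
{
	uint64_t nibble_mask;
#if defined(__ARM_NEON) && defined(__aarch64__)
	// Each lane becomes 0xFF on match and 0x00 otherwise.
	uint8x16_t eq = vceqq_u8(vld1q_u8(haystack), vdupq_n_u8(needle));
	// Shift every 16-bit lane right by 4 and narrow to 8 bits, so each
	// input byte contributes exactly one nibble to the 64-bit result.
	uint8x8_t narrowed = vshrn_n_u16(vreinterpretq_u16_u8(eq), 4);
	nibble_mask = vget_lane_u64(vreinterpret_u64_u8(narrowed), 0);
#else
	// Scalar model of the same mask: nibble i is 0xF when byte i matches.
	nibble_mask = 0;
	for (int i = 0; i < 16; i++) {
		if (haystack[i] == needle)
			nibble_mask |= (uint64_t)0xF << (4 * i);
	}
#endif
	if (nibble_mask == 0)
		return 16;
	// Four mask bits per byte, so divide the bit index by 4.
	return (uint32_t)__builtin_ctzll(nibble_mask) >> 2;
}
```

Whether this is faster than the existing emulation for `dn_simdhash` would need to be measured; the sketch only shows the shape of the technique.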

## Tasks
 - [x] https://github.com/dotnet/runtime/pull/113316
 - [x] https://github.com/dotnet/runtime/pull/113287
 - [x] https://github.com/dotnet/runtime/pull/113274

	// returns an index in range 0-13 on match, 14-32 if no match
	static DN_FORCEINLINE(uint32_t)
	find_first_matching_suffix_simd (
		dn_simdhash_search_vector needle,
		// Only used by the vectorized implementations; discarded by scalar.
		dn_simdhash_suffixes haystack
	) {
	#if defined(__wasm_simd128__)
		return ctz(wasm_i8x16_bitmask(wasm_i8x16_eq(needle.vec, haystack.vec)));
	#elif defined(_M_AMD64) || defined(_M_X64) || (_M_IX86_FP == 2) || defined(__SSE2__)
		return ctz(_mm_movemask_epi8(_mm_cmpeq_epi8(needle.m128, haystack.m128)));
	#elif defined(__ARM_NEON)
		dn_simdhash_suffixes match_vector;
		// Completely untested.
		static const dn_simdhash_suffixes byte_mask = {
			{ 1, 2, 4, 8, 16, 32, 64, 128, 1, 2, 4, 8, 16, 32, 64, 128 }
		};
		union {
			uint8_t b[4];
			uint32_t u;
		} msb;
		match_vector.vec = vceqq_u8(needle.vec, haystack.vec);
		dn_simdhash_suffixes masked;
		masked.vec = vandq_u8(match_vector.vec, byte_mask.vec);
		msb.b[0] = vaddv_u8(vget_low_u8(masked.vec));
		msb.b[1] = vaddv_u8(vget_high_u8(masked.vec));
		return ctz(msb.u);
	#else
		dn_simdhash_assert(!"Scalar fallback should be in use here");
		return 32;
	#endif
	}
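
To make the NEON branch above easier to follow, here is a portable scalar model of what it appears to compute: a per-lane equality test, an AND with a per-lane bit, and a horizontal add over each 8-lane half to form a 16-bit match mask. The function name `movemask_model_first_match` is hypothetical. Returning 32 for "no match" is an assumption based on the comment at the top of the snippet, and `__builtin_ctz` assumes GCC or Clang.

```c
#include <stdint.h>

// Hypothetical scalar model of the NEON emulation path; not part of dn_simdhash.
// Returns the index of the first lane where needle and haystack are equal,
// or 32 when no lane matches (assumed sentinel).
static inline uint32_t
movemask_model_first_match (const uint8_t needle[16], const uint8_t haystack[16])
{
	static const uint8_t byte_mask[16] = {
		1, 2, 4, 8, 16, 32, 64, 128, 1, 2, 4, 8, 16, 32, 64, 128
	};
	uint32_t low = 0, high = 0;
	for (int i = 0; i < 8; i++) {
		// Models vceqq_u8 + vandq_u8: a matching lane keeps only its own bit.
		if (needle[i] == haystack[i])
			low += byte_mask[i];
		if (needle[i + 8] == haystack[i + 8])
			high += byte_mask[i + 8];
	}
	// Models the two vaddv_u8 results stored into bytes 0 and 1 of the mask.
	uint32_t mask = low | (high << 8);
	return mask ? (uint32_t)__builtin_ctz(mask) : 32;
}
```

The model explicitly zeroes the upper 16 bits of the mask, which the union in the snippet does not visibly do, so the two may differ if those bytes are left uninitialized.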
