fix(parquet): align dictionary fallback with parquet-mr by twuebi · Pull Request #786 · apache/arrow-go

twuebi · 2026-04-28T13:18:56Z

Rationale for this change

On dictionary overflow, arrow-go always flushed the dictionary page and any buffered dict-encoded data pages before switching to PLAIN, even when no dict-encoded data page had been cut. On mid-cardinality columns the result was a 4-encoding chunk layout (PLAIN_DICTIONARY, PLAIN, RLE, PLAIN) that bloated output by 20-30% versus parquet-mr.

This was noticed while testing iceberg-go's recently added compaction feature, where some tables with particular high-cardinality columns saw a 30% size increase after compaction.

What changes are included in this PR?

Mirror parquet-mr's FallbackValuesWriter:

- Discard the dictionary and re-encode buffered indices as PLAIN when no dict-encoded data page has been flushed yet; only emit the dictionary page once a dict-encoded page is committed.
- Before the first dict-encoded page, fall back to PLAIN if dict + indices >= raw input bytes.
- Size dict-encoded pages by raw input bytes (not the RLE indices' encoded size) so the page cadence matches PLAIN.

Adds DictEncoder.FallBackTo / ObservedRawSize and exposes BinaryMemoTable.Value for the fallback translation.
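The fallback rules listed above can be read as a small decision function. The following is only a sketch of that reading: `chunkState`, `decide`, and the `action` labels are hypothetical names for illustration and are not part of arrow-go or parquet-mr.

```go
package main

import "fmt"

// action is a hypothetical label for what the column writer does next.
type action string

const (
	keepDict       action = "keep dictionary encoding"
	discardDict    action = "discard dictionary, re-encode buffered values as PLAIN"
	flushDictFirst action = "emit dictionary page, then switch to PLAIN"
)

// chunkState holds the hypothetical inputs to the decision.
type chunkState struct {
	rawSize          int64 // raw input bytes observed so far
	dictSize         int64 // encoded size of the dictionary
	indicesSize      int64 // estimated encoded size of the buffered indices
	dictPagesFlushed int   // dict-encoded data pages already committed
	overflow         bool  // dictionary exceeded its size limit
}

// decide applies the rules from the PR description (as understood here).
func decide(s chunkState) action {
	if s.dictPagesFlushed == 0 {
		// Nothing dict-encoded has been committed, so the dictionary
		// can still be dropped without leaving dangling indices.
		if s.overflow {
			return discardDict
		}
		if s.rawSize > 0 && s.dictSize+s.indicesSize >= s.rawSize {
			return discardDict
		}
		return keepDict
	}
	// A dict-encoded page is already committed; the dictionary page
	// must be emitted so that page stays readable.
	if s.overflow {
		return flushDictFirst
	}
	return keepDict
}

func main() {
	fmt.Println(decide(chunkState{rawSize: 1000, dictSize: 700, indicesSize: 400}))
	fmt.Println(decide(chunkState{overflow: true, dictPagesFlushed: 1}))
}
```

The sketch leaves out the third rule (sizing dict-encoded pages by raw input bytes), since that affects when pages are cut rather than which encoding is chosen.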

Are these changes tested?

Yes, as part of the PR and also e2e while testing compaction in iceberg-go.

Are there any user-facing changes?

No public API changes; the only observable difference should be the dropped double encoding.


zeroshade · 2026-04-29T16:14:19Z

+		rawSize := dictEnc.ObservedRawSize()
+		encodedSize := dictEnc.EstimatedDataEncodedSize()
+		dictSize := int64(dictEnc.DictEncodedSize())
+		if rawSize > 0 && dictSize+encodedSize >= rawSize {


do we actually need the rawSize > 0 check?
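One way to look at the question: without the guard, the comparison holds trivially when all three sizes are zero. The sketch below only demonstrates that arithmetic; `shouldFallBack` and its `guard` parameter are hypothetical, and whether the real encoder can reach an all-zero state at this point is an assumption the thread does not confirm.

```go
package main

import "fmt"

// shouldFallBack mirrors the condition from the diff. The guard flag
// (hypothetical) toggles the rawSize > 0 check under discussion.
func shouldFallBack(rawSize, dictSize, encodedSize int64, guard bool) bool {
	if guard && rawSize <= 0 {
		return false
	}
	return dictSize+encodedSize >= rawSize
}

func main() {
	// All sizes zero: 0+0 >= 0 is true, so only the guard prevents fallback.
	fmt.Println(shouldFallBack(0, 0, 0, true))
	fmt.Println(shouldFallBack(0, 0, 0, false))
}
```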

zeroshade · 2026-04-29T16:16:30Z

-			// To keep pages in consistent state,
-			// remove the pages that will be released using above defer call.


why remove the comment?

zeroshade · 2026-04-29T16:18:21Z

+	if err == nil {
+		w.dictPageWritten = true
+	}


Suggested change

-	if err == nil {
-		w.dictPageWritten = true
-	}
+	w.dictPageWritten = err == nil

zeroshade · 2026-04-29T16:25:29Z

+	// fallbackFn is set by each typed column writer at construction to its
+	// own FallbackToPlain. It lets the base FlushCurrentPage trigger
+	// fallback without needing to know the concrete value type.
+	fallbackFn func()


FallbackToPlain is already part of the ColumnChunkWriter interface, could we just modify logic in checkDictionarySizeLimit etc. instead of needing to pass the function callback like this?
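For context on the trade-off: in Go, a method on an embedded base struct cannot dispatch to the embedding type's methods on its own, so the base needs either a callback or an interface reference to reach the typed `FallbackToPlain`. The sketch below contrasts the two; `baseWriter`, `int32Writer`, `chunkWriter`, and the `self` field are hypothetical and do not reflect arrow-go's actual types.

```go
package main

import "fmt"

// chunkWriter is a hypothetical stand-in for the relevant part of
// the ColumnChunkWriter interface.
type chunkWriter interface{ FallbackToPlain() }

// baseWriter is a hypothetical shared base embedded by typed writers.
type baseWriter struct {
	fallbackFn func()      // option 1: callback set at construction
	self       chunkWriter // option 2: interface reference to the outer writer
	log        []string
}

func (b *baseWriter) flushViaCallback() {
	if b.fallbackFn != nil {
		b.fallbackFn()
	}
}

func (b *baseWriter) flushViaInterface() {
	if b.self != nil {
		b.self.FallbackToPlain()
	}
}

// int32Writer is a hypothetical typed writer embedding the base.
type int32Writer struct{ baseWriter }

func (w *int32Writer) FallbackToPlain() {
	w.log = append(w.log, "int32 fallback")
}

func newInt32Writer() *int32Writer {
	w := &int32Writer{}
	w.fallbackFn = w.FallbackToPlain // method value bound to w
	w.self = w
	return w
}

func main() {
	w := newInt32Writer()
	w.flushViaCallback()
	w.flushViaInterface()
	fmt.Println(len(w.log), w.log)
}
```

A third route, closer to what the comment proposes, would be to keep the fallback check inside each typed writer so the base never needs to call back at all.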

zeroshade · 2026-04-29T16:31:28Z

+func (m *binaryMemoTableImpl) Value(i int) []byte {
+	return m.builder.Value(i)
+}


this is the legacy map-based implementation. Luckily this function already exists in internal/hashing/xxh3_memo_table.go for the binary memo table that is actually being used.


twuebi requested a review from zeroshade as a code owner April 28, 2026 13:18

twuebi added 2 commits April 29, 2026 17:52

int64

ad82eab

no more wrapping?

d2a4950

zeroshade reviewed Apr 29, 2026


twuebi added 2 commits April 30, 2026 10:52

review comments

c9b5a68


update magic numbers to reflect improved file sizes

84ecf26

zeroshade merged commit 2b2aa6b into apache:main Apr 30, 2026
23 checks passed

zeroshade merged 5 commits into apache:main from twuebi:tp/parquet-dict-fallback-parity
