
PERF: Optimize CSV categorical parsing when categories are known #65018

Draft
jbrockmendel wants to merge 1 commit into pandas-dev:main from jbrockmendel:perf-17743

@jbrockmendel (Member) commented Apr 2, 2026

As I mentioned in #17743, I'm on the fence as to whether this is actually worth doing.

Summary

  • When read_csv receives a CategoricalDtype with pre-specified categories, map parsed values directly to category codes in a single pass using a pre-built hash table, skipping the factorize + recode_for_categories steps.
  • For non-string category types (datetime, float edge cases, bool), the optimization is attempted first via str() conversion and falls back gracefully to the existing _from_inferred_categories path if the string representations don't match the raw CSV tokens.
  • Adds ASV benchmark time_convert_known_categories to ReadCSVCategorical.
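
The difference between the two paths in the first bullet can be illustrated in plain Python. This is a minimal sketch, not the code in this PR: `factorize_then_recode` and `direct_codes` are hypothetical names, a `dict` stands in for the pre-built hash table, and -1 is assumed as the NA code (as in pandas categorical codes).

```python
def factorize_then_recode(tokens, categories):
    """Two-step path: infer categories from the data, then recode them."""
    # Step 1: factorize, assigning codes in order of first appearance.
    inferred = {}
    raw_codes = []
    for tok in tokens:
        raw_codes.append(inferred.setdefault(tok, len(inferred)))
    # Step 2: map each inferred category to its position in the requested
    # categories (-1 if absent). dicts preserve insertion order, so the
    # i-th key of `inferred` is the category with raw code i.
    target = {cat: i for i, cat in enumerate(categories)}
    remap = [target.get(cat, -1) for cat in inferred]
    return [remap[code] for code in raw_codes]


def direct_codes(tokens, categories):
    """Single-pass path: look each token up in a pre-built table."""
    table = {cat: i for i, cat in enumerate(categories)}
    return [table.get(tok, -1) for tok in tokens]


tokens = ["b", "a", "c", "a", "z"]  # "z" is not a known category
categories = ["a", "b", "c"]
print(factorize_then_recode(tokens, categories))
print(direct_codes(tokens, categories))
```

Both functions produce the same codes; the second avoids building the intermediate inferred categories and the second pass over the codes.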

closes #17743

Test plan

  • Existing test_categorical.py tests pass (113 passed, 7 xfailed)
  • String, integer, float, datetime, timedelta, and boolean category types all produce correct results
  • Unexpected categories still emit Pandas4Warning and map to NA
  • Non-string types that fail string matching fall back correctly

🤖 Generated with Claude Code

When read_csv receives a CategoricalDtype with pre-specified categories,
map parsed values directly to category codes in a single pass using a
pre-built hash table, avoiding the factorize-then-recode steps.

For non-string category types (datetime, float edge cases, bool), the
optimization is attempted first and falls back to the existing path if
str() representations don't match the raw CSV tokens.
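
The fallback described above could look roughly like the following plain-Python sketch. It is an illustration under stated assumptions, not the PR's implementation: `try_fast_path` and its `na_values` parameter are hypothetical, and returning `None` on any unmatched token is an assumed way of signalling "use the existing path".

```python
def try_fast_path(tokens, categories, na_values=("", "NA", "NaN")):
    """Return category codes, or None if the caller should fall back.

    Hypothetical helper: builds the lookup table from str(category) and
    gives up as soon as a raw CSV token has no matching string form.
    """
    table = {str(cat): i for i, cat in enumerate(categories)}
    codes = []
    for tok in tokens:
        if tok in na_values:
            codes.append(-1)  # assumed NA code
        elif tok in table:
            codes.append(table[tok])
        else:
            return None  # str() form does not match the token: fall back
    return codes


# str(1.5) == "1.5", so the tokens match and the fast path succeeds.
print(try_fast_path(["1.5", "2.5", "1.5"], [1.5, 2.5]))
# str(1.0) == "1.0", which does not match the token "1": fall back.
print(try_fast_path(["1", "2.5"], [1.0, 2.5]))
```

The second call shows the kind of float edge case the description refers to: the value is a valid category, but its string form differs from the text in the file.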

closes pandas-dev#17743

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

@jbrockmendel added the Performance (Memory or execution speed performance) label Apr 2, 2026

Development

Successfully merging this pull request may close these issues.

PERF: Optimize _categorical_convert CSV parser when categories are known ahead of time
