PERF: Optimize CSV categorical parsing when categories are known #65018
Draft
jbrockmendel wants to merge 1 commit into pandas-dev:main from
When read_csv receives a CategoricalDtype with pre-specified categories, map parsed values directly to category codes in a single pass using a pre-built hash table, avoiding the factorize-then-recode steps.

For non-string category types (datetime, float edge cases, bool), the optimization is attempted first and falls back to the existing path if str() representations don't match the raw CSV tokens.

closes pandas-dev#17743

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
As I mentioned in #17743, I'm on the fence as to whether this is actually worth doing.
Summary
- When `read_csv` receives a `CategoricalDtype` with pre-specified categories, map parsed values directly to category codes in a single pass using a pre-built hash table, skipping the factorize + `recode_for_categories` steps.
- For non-string category types, attempts `str()` conversion and falls back gracefully to the existing `_from_inferred_categories` path if the string representations don't match the raw CSV tokens.
- Adds the `time_convert_known_categories` benchmark to `ReadCSVCategorical`.

closes #17743
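For readers who want the idea without the Cython details, here is a minimal pure-Python sketch of the single-pass mapping and its fallback. It is not the code in this PR: the function name `codes_from_known_categories`, the `na_values` default, and the choice to return `None` as the fallback signal are all hypothetical. It also assumes that, for string categories, tokens outside the category list map to the NA code `-1`.

```python
from typing import Hashable, List, Optional, Sequence


def codes_from_known_categories(
    tokens: Sequence[str],
    categories: Sequence[Hashable],
    na_values: frozenset = frozenset({"", "NA", "NaN"}),  # hypothetical default
) -> Optional[List[int]]:
    """Map raw CSV tokens to category codes in one pass.

    Returns None to tell the caller to use the slower factorize-then-recode path.
    """
    # Build the lookup table once: str(category) -> integer code.
    table = {str(cat): code for code, cat in enumerate(categories)}
    if len(table) != len(categories):
        # Two categories share a string form, so the mapping is ambiguous.
        return None

    all_str = all(isinstance(cat, str) for cat in categories)
    codes: List[int] = []
    for tok in tokens:
        if tok in na_values:
            codes.append(-1)  # -1 is the code for a missing value
            continue
        code = table.get(tok)
        if code is None:
            if not all_str:
                # The token may be another spelling of a non-string category
                # (e.g. "1" vs. str(1.0) == "1.0"), so give up and fall back.
                return None
            code = -1  # assumption: unknown string tokens become NA
        codes.append(code)
    return codes
```

With string categories, `codes_from_known_categories(["b", "a", "", "c"], ["a", "b"])` yields `[1, 0, -1, -1]`. With float categories `[1.0, 2.5]`, the token `"1"` does not equal `str(1.0)`, so the function returns `None` and the caller would take the existing path.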
Test plan
- `test_categorical.py` tests pass (113 passed, 7 xfailed)
- [...] `Pandas4Warning` and map to NA

🤖 Generated with Claude Code