This repository was archived by the owner on Jun 24, 2024. It is now read-only.

Add HuggingFace's Tokenizer #271

Merged
philpax merged 22 commits into rustformers:main from RedBoxing:hf-tokenizer on May 29, 2023

Conversation

@RedBoxing

Closes #35 and #212

Collaborator

@philpax left a comment

Looks good! Just a few nitpicks around error handling/naming

Contributor

@danforbes left a comment

Outstanding work 💪🏻 Unbelievable first contribution 🚀

@philpax
Collaborator

philpax commented May 24, 2023

I've updated this PR and made a few code quality changes, but the no-space issue means it can't be made the default option. I've asked for help with Tokenizers to determine why that's happening.

@philpax mentioned this pull request on May 25, 2023
@philpax
Collaborator

philpax commented May 27, 2023

We just need to implement this huggingface/tokenizers#1141 (comment) and we should be able to take this across the line.

In addition to this, we should update to the latest non-HF tokenizer from llama.cpp (seeing as we're supporting both), but that can be done in a separate PR.

@RedBoxing requested a review from philpax on May 28, 2023 09:25
@philpax merged commit 0725865 into rustformers:main on May 29, 2023
@philpax
Collaborator

philpax commented May 29, 2023

Great work, @RedBoxing, and thanks for all of the help testing, everyone! Glad to have this in :)

@hhamud mentioned this pull request on Aug 7, 2023

Development

Successfully merging this pull request may close these issues.

Use the HuggingFace llama Tokenizer

3 participants