release GVL while encoding / decoding tokens #90
Conversation
Encoding / decoding tokens is a CPU-intensive task. Since tiktoken
doesn't call back into Ruby while encoding / decoding, it should be safe
for us to release the GVL while doing it. This lets us process
tokens in parallel in Ruby.
For example:
```ruby
require "tiktoken_ruby"
require "benchmark"
# Generate a large text to tokenize
LARGE_TEXT = ("Hello, world! This is a test of the tiktoken library. " * 1000).freeze
THREAD_COUNT = 4
ITERATIONS = 10
encoder = Tiktoken.encoding_for_model("gpt-4")
# Single-threaded baseline
single_threaded_time = Benchmark.realtime do
  (THREAD_COUNT * ITERATIONS).times do
    encoder.encode(LARGE_TEXT)
  end
end

# Multi-threaded with GVL release
multi_threaded_time = Benchmark.realtime do
  threads = THREAD_COUNT.times.map do
    Thread.new do
      ITERATIONS.times do
        encoder.encode(LARGE_TEXT)
      end
    end
  end
  threads.each(&:join)
end

puts "Speedup: #{single_threaded_time / multi_threaded_time}"
```
On the main branch we see no speedup (the output is 1.0). On this
branch, I see a 3.4x speedup, though I think as the input grows in size
we'll see numbers even closer to the theoretical max, which would be
THREAD_COUNT.
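The gap between 3.4x and the theoretical max of THREAD_COUNT can be framed with Amdahl's law: whatever fraction of each `encode` call still holds the GVL (argument and result conversion, for instance) caps the achievable speedup. A quick sketch with made-up serial fractions (illustrative numbers, not measurements from this PR):

```ruby
# Amdahl's-law estimate: if a fraction `serial` of each encode still holds
# the GVL, the best possible speedup on `n` threads is bounded as below.
def max_speedup(serial, n)
  1.0 / (serial + (1.0 - serial) / n)
end

puts max_speedup(0.05, 4).round(2) # ~3.48 with 5% of the work serialized
puts max_speedup(0.01, 4).round(2) # ~3.88 as inputs grow and conversion shrinks
```

As the input text grows, the fixed conversion cost becomes a smaller fraction of each call, which is consistent with the expectation that larger inputs approach the THREAD_COUNT ceiling.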
👋 I've had bad luck with the GC sweeping things away in the past (e.g. gjtorikian/selma#81), so I added a pretty basic stress test to this PR. Would you mind reviewing it and confirming that it's somewhat useful in verifying that nothing terrible is going to happen with this change? I also ended up exposing …
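For reference, the general shape of such a stress test is below. This is a sketch, not the PR's actual test: it uses a hypothetical pure-Ruby `StubEncoder` in place of the real native encoder, and `GC.stress` to force collections as aggressively as possible while threads are encoding:

```ruby
# Hypothetical stand-in for the native encoder; the PR's real stress test
# would exercise Tiktoken.encoding_for_model("gpt-4") instead.
class StubEncoder
  def encode(text)
    text.bytes
  end
end

encoder = StubEncoder.new
expected = "Hello, world!".bytes

GC.stress = true # run the GC at every allocation opportunity
begin
  results = 2.times.map do
    Thread.new do
      25.times.map { encoder.encode("Hello, world!") }
    end
  end.flat_map(&:value) # Thread#value joins and returns each block's result
ensure
  GC.stress = false
end

# If the GC collected something it shouldn't have, token arrays would be corrupted.
corrupted = results.reject { |tokens| tokens == expected }
puts "stress test: #{results.size} encodes, #{corrupted.size} corrupted"
```

The idea is simply to maximize the chance that a collection runs while encoded data is in flight, so that a missing mark or a prematurely freed buffer shows up as corrupted output or a crash rather than passing silently.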
👋 Good to see you!
The tests make sense, and should work. I'm personally not worried about GC in this case. It looks like we're converting all Ruby types into Rust types before encoding them in the context that's passed to the GVL release routines. Before we merge this though, give me a bit of time to review exactly how …

The other thing I'm slightly worried about is that it looks like these CoreBPE contexts are typically singleton objects. Do we know if the underlying data structures are threadsafe? Is it legit to call …
No problem! Thank you!
I think this isn't a problem, since the …

Otherwise, there are some comments in the source that seem to hint at multi-threadedness working in Python, where they release the GIL. I may be misreading this though.
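The safety argument being made here is that the shared CoreBPE state is built once and then only ever read. As a Ruby-level analogy (the real data structures live in Rust, so this only illustrates the access pattern), concurrent readers of a frozen lookup table never race, so every thread sees the same answers:

```ruby
# Ruby-level analogy for a read-only singleton shared across threads:
# the table is built once, frozen, then only read.
RANKS = { "He" => 1, "ll" => 2, "o," => 3, " w" => 4, "or" => 5 }.freeze

def lookup_all(table, keys)
  keys.map { |k| table.fetch(k, -1) }
end

keys = RANKS.keys * 100
serial = lookup_all(RANKS, keys)

# Four concurrent readers of the shared frozen table.
concurrent = 4.times.map do
  Thread.new { lookup_all(RANKS, keys) }
end.map(&:value)

puts concurrent.all? { |r| r == serial } # concurrent reads match the serial result
```

If the underlying structures were mutated during encoding, this reasoning would not hold; the read-only property is what makes calling into the shared singleton from multiple GVL-released threads legitimate.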
Ah, great. Yes, this should work totally fine then.
Ok, I think everything is good. All of the context data types (…).

Thanks for your patience!!
Thanks for your contribution! Release should be going out later today. I'm going to use this update as an excuse to finally sort out #79.
The main thing is that we don't want to allocate Ruby objects (or call out to the VM) inside our GVL callback. I think since Magnus is handling the Ruby object conversion after we've returned, this should be safe.