Refactor Prometheus and Add Request Level Metrics#2316
simon-mo merged 52 commits into vllm-project:main from robertgshaw2-redhat:rs/feature/metrics
Conversation
…asing values. Computing averages on the client is an anti-pattern for Prometheus metrics and should be done on the Prometheus server.
… Grafana dashboard
…rything a lot simpler, so squished everything back into a single file.
Hi @rib-2, thank you for your contribution. This PR is definitely in the right direction; a few things to start:
@simon-mo Sounds good. Thanks for the feedback.
For the existing logging message --> are you referring to the
Yes!
I like the idea of doing it in another PR
…s to be compatible with prior versions (and adds back the gauges that compute avg tput for backwards compatibility)
The only outstanding item, I think, is the
@simon-mo requesting re-review
NikolaBorisov left a comment
@simon-mo I think this is good, and should be merged
simon-mo left a comment
Thank you for the great work here. And thanks @NikolaBorisov for the review.
Summary
This PR does three things:
A) Addresses open feature request (#1870) by refactoring and extending the initial implementation of metrics (#1890) to:
B) Creates an end-to-end example of how to monitor vLLM with Prometheus and Grafana
C) Updates the existing metric implementations to follow Prometheus best practices, namely:
- `vllm:num_requests_running` should be `vllm_num_requests_running_total`
- `Counter`s + PromQL `rate()` rather than `Gauge`s (Prom Docs) --> `vllm:avg_generation_throughput_toks_per_sec` should be a `Counter` called `vllm_generation_tokens_total`, using PromQL `rate(vllm_generation_tokens_total[5s])` to calc tokens / second during dashboarding.

A) Implementation
Created / updated the following classes:
- `SequenceGroup`: added a `last_token_time` variable and `get_last_latency` / `get_e2e_latency` methods, which enable us to capture request-level latencies if logging is enabled.
- `LLMEngine`: added a `PrometheusLogger` and logic to create `Stats`, making a cleaner interface between the `LLMEngine` and logging-related functionality. In `_process_model_outputs`, we call `LLMEngine._get_stats` to generate `Stats` that are passed to `PrometheusLogger.log`.
- `PrometheusLogger`: holds a list of `PrometheusMetric`s and passes the `Stats` generated by the `LLMEngine` to each.
- `PrometheusMetric`: holds a metric (an `aioprometheus` collector: `Counter`, `Gauge`, or `Histogram`) and a function to extract the appropriate data from `Stats`. A minimal sketch of how these pieces fit together is shown below.
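For concreteness, here is a rough sketch of that interface. The exact `Stats` fields, method signatures, and dispatch logic below are illustrative assumptions made for this example, not the precise code in the PR.

```python
# Illustrative sketch only: the Stats fields and the wiring here are
# assumptions for this example, not the exact implementation in the PR.
from dataclasses import dataclass, field
from typing import Callable, List

from aioprometheus import Counter, Gauge, Histogram


@dataclass
class Stats:
    """Snapshot of engine state, produced by LLMEngine._get_stats()."""
    num_running: int = 0
    num_swapped: int = 0
    num_waiting: int = 0
    gpu_cache_usage: float = 0.0
    cpu_cache_usage: float = 0.0
    # Token counts accumulated since the previous snapshot.
    num_prompt_tokens: int = 0
    num_generation_tokens: int = 0
    # Latencies observed since the previous snapshot.
    time_to_first_tokens: List[float] = field(default_factory=list)
    inter_token_latencies: List[float] = field(default_factory=list)
    e2e_request_latencies: List[float] = field(default_factory=list)


class PrometheusMetric:
    """Pairs an aioprometheus collector with a function that extracts
    the relevant value(s) from a Stats snapshot."""

    def __init__(self, collector, extract: Callable[[Stats], object]) -> None:
        self.collector = collector
        self.extract = extract

    def log(self, stats: Stats) -> None:
        value = self.extract(stats)
        if isinstance(self.collector, Counter):
            self.collector.add({}, value)        # monotonically increasing total
        elif isinstance(self.collector, Gauge):
            self.collector.set({}, value)        # point-in-time value
        elif isinstance(self.collector, Histogram):
            for observation in value:            # list of latency observations
                self.collector.observe({}, observation)


class PrometheusLogger:
    """Holds the metric registry and fans each Stats snapshot out to it."""

    def __init__(self, metrics: List[PrometheusMetric]) -> None:
        self.metrics = metrics

    def log(self, stats: Stats) -> None:
        for metric in self.metrics:
            metric.log(stats)
```

With something like this in place, the engine only needs to build a `Stats` object each step and hand it to `PrometheusLogger.log`, keeping all collector-specific logic out of `LLMEngine`.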
PrometheusMetrics:Currently Supported Include:
counter_prompt_tokens--> used with rate() to calculate prompt token throughputcounter_generation_tokens--> used with rate() to calculate generation token throughputgauge_scheduler_runninggauge_scheduler_swappedgauge_scheduler_waitinggauge_gpu_cache_usagegauge_cpu_cache_usagehistogram_time_to_first_token--> exposes counters needed to calculate avg ttft, P50, P90, P95, P99histogram_inter_token_latency--> exposes counters needed to calculate avg itl, P50, P90, P95, P99histogram_e2e_request_latency--> exposes counters needed to calculate e2e request latency, P50, P90, P95, P99See the Example for a dashboard that shows how these exposed metrics should be monitored
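As a rough illustration of what that registry could look like (reusing the `PrometheusMetric` / `PrometheusLogger` sketch above): the two counter names come from this PR's description, while the gauge/histogram names and the bucket boundaries below are placeholder assumptions.

```python
# Sketch of a metrics registry. Gauge/histogram names and bucket
# boundaries are placeholders, not necessarily what the PR uses.
from aioprometheus import Counter, Gauge, Histogram

counter_prompt_tokens = Counter(
    "vllm_prompt_tokens_total", "Total prefill tokens processed.")
counter_generation_tokens = Counter(
    "vllm_generation_tokens_total", "Total generation tokens processed.")
gauge_scheduler_running = Gauge(
    "vllm_num_requests_running_total", "Number of requests currently running.")
gauge_gpu_cache_usage = Gauge(
    "vllm_gpu_cache_usage_perc", "GPU KV-cache usage (1.0 = fully used).")
histogram_time_to_first_token = Histogram(
    "vllm_time_to_first_token_seconds", "Time to first token in seconds.",
    buckets=[0.05, 0.1, 0.25, 0.5, 1.0, 2.5, 5.0, 10.0])

# Each entry pairs a collector with the Stats field it should publish.
metrics = [
    PrometheusMetric(counter_prompt_tokens, lambda s: s.num_prompt_tokens),
    PrometheusMetric(counter_generation_tokens,
                     lambda s: s.num_generation_tokens),
    PrometheusMetric(gauge_scheduler_running, lambda s: s.num_running),
    PrometheusMetric(gauge_gpu_cache_usage, lambda s: s.gpu_cache_usage),
    PrometheusMetric(histogram_time_to_first_token,
                     lambda s: s.time_to_first_tokens),
]
prometheus_logger = PrometheusLogger(metrics)
```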
B) Example
See examples/production_monitoring for an end-to-end example. I included a Grafana dashboard configuration which shows how these metrics should be monitored.
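Once the server is up and scraped by Prometheus, a quick way to sanity-check the exposed metrics is to hit the metrics endpoint directly. This snippet assumes a locally running vLLM API server on port 8000 that serves the Prometheus text format at `/metrics`; adjust the host/port for your deployment.

```python
# Quick sanity check of the exposed metrics; assumes a vLLM server
# running locally on port 8000 that serves Prometheus text format
# at /metrics.
import requests

response = requests.get("http://localhost:8000/metrics", timeout=5)
response.raise_for_status()

# Print only the vllm-specific series to keep the output readable.
for line in response.text.splitlines():
    if line.startswith("vllm"):
        print(line)
```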
C) Best Practices
I recognize these changes have breaking impacts on the metrics exposed to users.
Key changes include:
- Renames metrics (e.g. `vllm:num_requests_swapped` --> `vllm_requests_stopped_total`)
- Updates the average throughput gauges (`vllm:avg_prompt_throughput_toks_per_s` / `vllm:avg_generation_throughput_toks_per_s`) to be total-tokens-processed counters (`vllm_prompt_tokens_total` / `vllm_generation_tokens_total`), so throughput is computed on the Prometheus server with e.g. `rate(vllm_prompt_tokens_total[30s])` (example queries below)

My sense is that this is a very new feature, so I'm not sure how much user impact there is. However, I think the changes I am suggesting are justified. I am happy to revert these if requested.
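To illustrate how these counters and histograms are meant to be consumed, here are a few PromQL expressions (shown as plain strings) of the kind a Grafana panel could use. The window sizes and the histogram bucket metric name are illustrative choices, not values prescribed by this PR.

```python
# Example PromQL expressions for dashboard panels. The 30s/5m windows
# and the histogram bucket metric name are illustrative assumptions.
dashboard_queries = {
    # Token throughput, computed server-side from the counters.
    "prompt_tokens_per_second": "rate(vllm_prompt_tokens_total[30s])",
    "generation_tokens_per_second": "rate(vllm_generation_tokens_total[30s])",
    # P95 time-to-first-token, derived from the histogram buckets.
    "p95_time_to_first_token_seconds": (
        "histogram_quantile(0.95, "
        "sum by (le) (rate(vllm_time_to_first_token_seconds_bucket[5m])))"
    ),
}
```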
Overhead
I used the benchmarking scripts to test performance with and without the logger on an L4 GPU. There is only a very minor latency impact.
Benchmarked with `benchmark_serving.py`.

Client:
`python3 benchmark_serving.py --backend vllm --tokenizer mistralai/Mistral-7B-v0.1 --dataset /home/robertgshaw/vllm/benchmarks/ShareGPT_V3_unfiltered_cleaned_split.json --request-rate 1.0 --num-prompts 200`

Launch with System Logging:
`python3 -m vllm.entrypoints.api_server --model mistralai/Mistral-7B-v0.1 --max-model-len 4096 --swap-space 16 --disable-log-requests`

Launch without System Logging:
`python3 -m vllm.entrypoints.api_server --model mistralai/Mistral-7B-v0.1 --max-model-len 4096 --swap-space 16 --disable-log-stats --disable-log-requests`

Next Steps
Next steps to finalize the PR are:
Questions
Are there any other things I need to do?