```
(APIServer pid=17207) INFO 03-24 13:28:31 [api_server.py:580] Starting vLLM server on http://0.0.0.0:6006
(APIServer pid=17207) INFO 03-24 13:28:31 [launcher.py:37] Available routes are:
(APIServer pid=17207) INFO 03-24 13:28:31 [launcher.py:46] Route: /openapi.json, Methods: HEAD, GET
(APIServer pid=17207) INFO 03-24 13:28:31 [launcher.py:46] Route: /docs, Methods: HEAD, GET
(APIServer pid=17207) INFO 03-24 13:28:31 [launcher.py:46] Route: /docs/oauth2-redirect, Methods: HEAD, GET
(APIServer pid=17207) INFO 03-24 13:28:31 [launcher.py:46] Route: /redoc, Methods: HEAD, GET
(APIServer pid=17207) INFO 03-24 13:28:31 [launcher.py:46] Route: /tokenize, Methods: POST
(APIServer pid=17207) INFO 03-24 13:28:31 [launcher.py:46] Route: /detokenize, Methods: POST
(APIServer pid=17207) INFO 03-24 13:28:31 [launcher.py:46] Route: /load, Methods: GET
(APIServer pid=17207) INFO 03-24 13:28:31 [launcher.py:46] Route: /version, Methods: GET
(APIServer pid=17207) INFO 03-24 13:28:31 [launcher.py:46] Route: /health, Methods: GET
(APIServer pid=17207) INFO 03-24 13:28:31 [launcher.py:46] Route: /metrics, Methods: GET
(APIServer pid=17207) INFO 03-24 13:28:31 [launcher.py:46] Route: /v1/models, Methods: GET
(APIServer pid=17207) INFO 03-24 13:28:31 [launcher.py:46] Route: /ping, Methods: GET
(APIServer pid=17207) INFO 03-24 13:28:31 [launcher.py:46] Route: /ping, Methods: POST
(APIServer pid=17207) INFO 03-24 13:28:31 [launcher.py:46] Route: /invocations, Methods: POST
(APIServer pid=17207) INFO 03-24 13:28:31 [launcher.py:46] Route: /v1/chat/completions, Methods: POST
(APIServer pid=17207) INFO 03-24 13:28:31 [launcher.py:46] Route: /v1/responses, Methods: POST
(APIServer pid=17207) INFO 03-24 13:28:31 [launcher.py:46] Route: /v1/responses/{response_id}, Methods: GET
(APIServer pid=17207) INFO 03-24 13:28:31 [launcher.py:46] Route: /v1/responses/{response_id}/cancel, Methods: POST
(APIServer pid=17207) INFO 03-24 13:28:31 [launcher.py:46] Route: /v1/completions, Methods: POST
(APIServer pid=17207) INFO 03-24 13:28:31 [launcher.py:46] Route: /v1/messages, Methods: POST
(APIServer pid=17207) INFO 03-24 13:28:31 [launcher.py:46] Route: /v1/messages/count_tokens, Methods: POST
(APIServer pid=17207) INFO 03-24 13:28:31 [launcher.py:46] Route: /inference/v1/generate, Methods: POST
(APIServer pid=17207) INFO 03-24 13:28:31 [launcher.py:46] Route: /scale_elastic_ep, Methods: POST
(APIServer pid=17207) INFO 03-24 13:28:31 [launcher.py:46] Route: /is_scaling_elastic_ep, Methods: POST
(APIServer pid=17207) INFO 03-24 13:28:31 [launcher.py:46] Route: /v1/chat/completions/render, Methods: POST
(APIServer pid=17207) INFO 03-24 13:28:31 [launcher.py:46] Route: /v1/completions/render, Methods: POST
(APIServer pid=17207) INFO: Started server process [17207]
(APIServer pid=17207) INFO: Waiting for application startup.
(APIServer pid=17207) INFO: Application startup complete.
```

# I. Basic information at the top

## 1) Version and model path

```bash
version 0.18.0
model /root/autodl-tmp/DeepSeek-R1-32B
```

This means:

- Current vLLM version: `0.18.0`
- Model directory being loaded: `/root/autodl-tmp/DeepSeek-R1-32B`

This is the most basic information to confirm first.

---

## 2) Non-default launch arguments

```bash
non-default args: {
  'port': 6006,
  'model': '/root/autodl-tmp/DeepSeek-R1-32B',
  'max_model_len': 8192,
  'served_model_name': ['deepseek-r1'],
  'gpu_memory_utilization': 0.95
}
```

This lists the arguments you set explicitly to values that differ from the defaults.

What your key arguments mean:

- `port: 6006`
  The server listens on port 6006.

- `model: /root/autodl-tmp/DeepSeek-R1-32B`
  The model directory.

- `max_model_len: 8192`
  The maximum context length is 8192 tokens.

- `served_model_name: deepseek-r1`
  The model name exposed to clients at the API layer.

- `gpu_memory_utilization: 0.95`
  vLLM may use up to 95% of GPU memory for the model and cache allocation.
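
To make the mapping between the logged dictionary and the command line concrete, here is a minimal sketch that rebuilds one command that would plausibly produce these arguments. It assumes the server was started with `vllm serve` and that each key maps to a flag by swapping underscores for hyphens. The log does not show the original command, and `build_serve_command` is a hypothetical helper, not part of vLLM.

```python
import shlex

# Non-default arguments exactly as printed in the log above.
NON_DEFAULT_ARGS = {
    "port": 6006,
    "model": "/root/autodl-tmp/DeepSeek-R1-32B",
    "max_model_len": 8192,
    "served_model_name": ["deepseek-r1"],
    "gpu_memory_utilization": 0.95,
}


def build_serve_command(args: dict) -> str:
    """Rebuild a plausible `vllm serve` command from the logged arguments.

    ASSUMPTION: the model path is passed positionally and every other key
    becomes a `--flag` with underscores replaced by hyphens.
    """
    parts = ["vllm", "serve", args["model"]]
    for key, value in args.items():
        if key == "model":
            continue  # already passed as the positional argument
        parts.append("--" + key.replace("_", "-"))
        values = value if isinstance(value, list) else [value]
        parts.extend(str(v) for v in values)
    return " ".join(shlex.quote(p) for p in parts)


print(build_serve_command(NON_DEFAULT_ARGS))
# vllm serve /root/autodl-tmp/DeepSeek-R1-32B --port 6006 --max-model-len 8192 --served-model-name deepseek-r1 --gpu-memory-utilization 0.95
```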

---

# II. Model recognition and scheduling

## 3) Architecture resolution

```bash
Resolved architecture: Qwen2ForCausalLM
```

vLLM resolved the model's underlying architecture as:

- `Qwen2ForCausalLM`

Although your directory is named DeepSeek-R1-32B, this is expected: the 32B DeepSeek-R1 checkpoint is presumably the distilled variant built on a Qwen base, so it loads with the Qwen2 architecture.

---

## 4) Maximum length

```bash
Using max model len 8192
```

The maximum sequence length finally adopted is 8192.

This value directly affects:

- KV cache usage
- Maximum concurrency
- GPU memory usage

---

## 5) Chunked Prefill

```bash
Chunked prefill is enabled with max_num_batched_tokens=8192.
```

**Chunked prefill** is enabled, with a budget of 8192 batched tokens per scheduling step.

What it does:

- When a prompt is long, prefill is not done in one pass; the prompt is processed in chunks
- It can improve throughput and GPU memory usage
- It helps considerably with long-context models
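
The chunking idea can be sketched in a few lines. This is a simplified illustration, not vLLM's scheduler: it only splits one prompt against a per-step token budget and ignores decode tokens from other requests that would share the same step. `split_prefill` is a hypothetical helper, and the 2048 budget in the second call is an illustrative value rather than one from the log.

```python
def split_prefill(prompt_len: int, max_num_batched_tokens: int) -> list[int]:
    """Split a prompt into per-step chunk sizes that respect the token budget."""
    if prompt_len < 0 or max_num_batched_tokens <= 0:
        raise ValueError("prompt_len must be >= 0 and the budget must be > 0")
    chunks = []
    remaining = prompt_len
    while remaining > 0:
        step = min(remaining, max_num_batched_tokens)
        chunks.append(step)
        remaining -= step
    return chunks


# With the logged budget, an 8000-token prompt fits in a single step.
print(split_prefill(8000, 8192))  # [8000]

# With a smaller, hypothetical budget the same prompt takes four steps.
print(split_prefill(8000, 2048))  # [2048, 2048, 2048, 1856]
```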
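
---

## 6) Asynchronous scheduling

```bash
Asynchronous scheduling is enabled.
```

The **asynchronous scheduler** is enabled.

What it does:

- Schedules multiple requests more effectively, generally by overlapping scheduling work with GPU execution
- Usually improves throughput
- Suits deployment as a service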

---

# III. Engine initialization config

This part is important:

```bash
Initializing a V1 LLM engine (v0.18.0) with config: ...
```

This is the full configuration summary for the inference engine. The key items in it are:

---

## 7) dtype

```bash
dtype=torch.bfloat16
```

The model runs in `bfloat16` precision.

Notes:

- Weights take half the GPU memory of fp32 (2 bytes per parameter instead of 4)
- Usually faster
- Requires a GPU that supports bf16

---

## 8) Parallelism config

```bash
tensor_parallel_size=1
pipeline_parallel_size=1
data_parallel_size=1
```

Everything is single-GPU, single-replica:

- Tensor parallel TP = 1
- Pipeline parallel PP = 1
- Data parallel DP = 1

In other words: **you are effectively running a single-GPU deployment**.

---

## 9) Prefix caching

```bash
enable_prefix_caching=True
```

**Prefix caching** is enabled.

What it does:

- When multiple requests share the same prompt prefix
- the results already computed for that prefix can be reused
- which saves time and improves throughput
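
As a rough illustration of what "sharing a prefix" buys, the sketch below counts how many leading tokens of a new request could be reused from an earlier one. It assumes the cache is managed in fixed-size blocks and that only full blocks are reusable. The block size of 16 is an assumption, not a value from the log, and a real implementation would typically match blocks by hash rather than comparing tokens one by one. `reusable_prefix_tokens` is a hypothetical helper.

```python
def reusable_prefix_tokens(
    cached: list[int], incoming: list[int], block_size: int = 16
) -> int:
    """Count leading tokens of `incoming` that lie in full blocks shared with `cached`."""
    shared = 0
    for cached_token, incoming_token in zip(cached, incoming):
        if cached_token != incoming_token:
            break
        shared += 1
    # Only whole blocks can be reused; a partially matching block is recomputed.
    return (shared // block_size) * block_size


# Two requests share a 40-token system prompt and then diverge.
system_prompt = list(range(40))
first_request = system_prompt + [100, 101]
second_request = system_prompt + [200, 201, 202]

print(reusable_prefix_tokens(first_request, second_request))  # 32
```

Under these assumptions, 32 of the 40 shared tokens (two full blocks) skip prefill, and the remaining 8 are recomputed along with the new tokens.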

---

## 10) Compilation config

```bash
compilation_config=...
backend='inductor'
cudagraph_mode=FULL_AND_PIECEWISE
```

vLLM has compilation optimizations and CUDA Graphs enabled:

- `torch.compile`
- `inductor`
- CUDA Graph

This improves inference performance, but startup takes somewhat longer because of warmup and graph capture.

---

# IV. Distributed setup and rank info

## 11) world size / rank

```bash
world_size=1 rank=0 local_rank=0 ... backend=nccl
```

Notes:

- Total number of processes: 1
- Current rank: 0
- Local rank: 0
- Communication backend: NCCL

Even though you have a single GPU, vLLM still goes through its unified distributed initialization path, so this is normal.

---

## 12) Parallel role assignment

```bash
rank 0 in world size 1 is assigned as DP rank 0, PP rank 0, PCP rank 0, TP rank 0
```

The single process takes on every parallel role.

---

# V. Model loading

## 13) Loading starts

```bash
Starting to load model /root/autodl-tmp/DeepSeek-R1-32B...
```

vLLM starts reading the model weights.

---

## 14) Attention backend

```bash
Using FLASH_ATTN attention backend ...
Using FlashAttention version 2
```

vLLM ended up selecting:

- `FLASH_ATTN`
- Version: `FlashAttention v2`

This is a high-performance attention implementation and is usually a good outcome.

The candidate backends also included:

- FLASHINFER
- TRITON_ATTN
- FLEX_ATTENTION

but FlashAttention was the one selected.

---

## 15) Checkpoint shard loading progress

```bash
Loading safetensors checkpoint shards: ...
```

The model weights are stored in shards, `8` in total.

The log shows:

- 0/8
- 1/8
- …
- 8/8

which means all model weights loaded successfully.

---

## 16) Weight loading time

```bash
Loading weights took 13.38 seconds
```

Reading the weight files and loading them took 13.38 seconds on its own.

---

## 17) GPU memory used by model loading

```bash
Model loading took 61.06 GiB memory and 14.101032 seconds
```

This is a key piece of information:

- After loading, the model occupies `61.06 GiB` of GPU memory
- Total time was about `14.1 seconds`

It helps you judge:

- How large the model itself is
- How much GPU memory is left for the KV cache and CUDA Graphs
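
A quick sanity check on the `61.06 GiB` figure is to convert it back into a parameter count. The sketch below assumes that essentially all of the loaded memory is bf16 weights at 2 bytes per parameter, which is an approximation.

```python
GIB = 1024 ** 3
BYTES_PER_BF16 = 2  # bfloat16 stores each parameter in 2 bytes

loaded_gib = 61.06  # "Model loading took 61.06 GiB memory" from the log

# ASSUMPTION: the loaded memory is almost entirely bf16 weights.
approx_params = loaded_gib * GIB / BYTES_PER_BF16
print(f"{approx_params / 1e9:.1f}B parameters")  # 32.8B parameters
```

The result is in line with a model of the 32B class, which suggests the weights were loaded in full and without quantization.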

---

# VI. Compilation and caching

## 18) torch.compile cache directory

```bash
Using cache directory: /root/.cache/vllm/torch_compile_cache/...
```

vLLM uses this directory to cache the results of `torch.compile`.

Benefits:

- On the next restart, if the configuration is unchanged, the cache may be reused directly
- Startup is faster

---

## 19) Compilation time

```bash
Dynamo bytecode transform time: 4.52 s
torch.compile took 7.13 s in total
```

This means:

- The Dynamo transform took 4.52 seconds
- Compilation as a whole took 7.13 seconds

---

## 20) Loading compiled graphs directly from cache

```bash
Directly load the compiled graph(s) ... from the cache
Directly load AOT compilation from path ...
```

This run did not recompile everything from scratch; it **hit the existing cache**.

That makes startup much faster.

---

# VII. Warmup and profiling

## 21) Initial profiling / warmup

```bash
Initial profiling/warmup run took 2.69 s
```

The model performed one warmup run, which:

- Exercises the operators
- Builds the execution graph
- Prepares for real inference afterward

---

# VIII. KV cache information

This part matters most for serving performance and concurrency.

## 22) KV cache block override

```bash
Overriding num_gpu_blocks=0 with num_gpu_blocks_override=512
```

The number of GPU KV cache blocks is overridden to `512` at this point.

The block count determines how many tokens can be cached. For the final cache size, rely on the capacity reported later in the log (`48,704 tokens`, item 25).

---

## 23) Estimated CUDA graph memory

```bash
Estimated CUDA graph memory: 0.93 GiB total
```

The CUDA graphs themselves take roughly `0.93 GiB` of GPU memory.

---

## 24) GPU memory available for the KV cache

```bash
Available KV cache memory: 11.89 GiB
```

This is a key line:

- After model weights, compilation, and graphs have taken their share
- the GPU memory left for the KV cache is `11.89 GiB`

---

## 25) GPU KV cache size

```bash
GPU KV cache size: 48,704 tokens
```

The GPU can cache at most about:

- `48,704 tokens`

This is the total token capacity shared by all requests, not a per-request limit.
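
The two KV cache figures in the log can be cross-checked against each other: dividing the available memory by the token capacity gives the cost of caching one token. The second half of the sketch compares that with the usual formula for a grouped-query-attention model. The layer count, KV head count, and head dimension are assumptions based on the commonly published Qwen2.5-32B configuration; they do not appear in the log, so check them against the `config.json` in your model directory.

```python
GIB = 1024 ** 3

kv_cache_gib = 11.89      # "Available KV cache memory" from the log
kv_cache_tokens = 48_704  # "GPU KV cache size" from the log

# Memory cost of caching a single token, derived purely from the log.
bytes_per_token = kv_cache_gib * GIB / kv_cache_tokens
print(f"{bytes_per_token / 1024:.0f} KiB per token (from the log)")

# ASSUMPTION: model shape values, not taken from the log.
num_layers = 64
num_kv_heads = 8
head_dim = 128
dtype_bytes = 2  # bfloat16

# Keys and values (factor 2) are stored for every layer and KV head.
expected = 2 * num_layers * num_kv_heads * head_dim * dtype_bytes
print(f"{expected // 1024} KiB per token (from the assumed model shape)")
```

Under these assumptions both lines come out at 256 KiB per token, so the reported capacity is consistent with the memory that was set aside.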
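
---

## 26) Maximum concurrency estimate

```bash
Maximum concurrency for 8,192 tokens per request: 5.95x
```

Meaning:

- If every request uses the full `8192 tokens`
- in theory close to `5.95` such requests can be resident at once

That is, roughly **5 to 6 full-length requests concurrently** (5 fit entirely, and a sixth almost fits).

This is a theoretical upper-bound estimate; in practice it also depends on generation length, scheduling, fragmentation, and so on.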
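
The `5.95x` figure appears to be simply the KV cache capacity divided by the per-request length, which the sketch below reproduces. The same division gives a rough estimate for shorter requests; the 2,048-token case is an illustrative value, not one from the log, and `max_concurrency` is a hypothetical helper.

```python
def max_concurrency(kv_cache_tokens: int, tokens_per_request: int) -> float:
    """Theoretical number of requests whose KV cache fits at the same time."""
    if tokens_per_request <= 0:
        raise ValueError("tokens_per_request must be positive")
    return kv_cache_tokens / tokens_per_request


KV_CACHE_TOKENS = 48_704  # "GPU KV cache size" from the log

# Every request uses the full 8192-token context, as in the log line.
print(f"{max_concurrency(KV_CACHE_TOKENS, 8_192):.2f}x")  # 5.95x

# Hypothetical workload where requests average 2048 tokens.
print(f"{max_concurrency(KV_CACHE_TOKENS, 2_048):.2f}x")  # 23.78x
```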

---

# IX. CUDA Graph capture

## 27) Capture process

```bash
Capturing CUDA graphs (mixed prefill-decode, PIECEWISE): ...
Capturing CUDA graphs (decode, FULL): ...
```

vLLM is capturing CUDA Graphs for different batch sizes and scenarios.

What this does:

- Reduces kernel launch overhead
- Speeds up inference
- Is especially useful for online inference with stable batch sizes

---

## 28) Capture result

```bash
Graph capturing finished in 13 secs, took 0.93 GiB
CUDA graph pool memory: 0.93 GiB (actual), 0.93 GiB (estimated)
```

Notes:

- Graph capture took 13 seconds in total
- It used 0.93 GiB of GPU memory
- The measured and estimated values match to two decimal places

This is a normal and fairly ideal result.

---

## 29) Total engine initialization time

```bash
init engine (profile, create kv cache, warmup model) took 25.79 seconds
```

Total engine initialization took about:

- `25.79 seconds`

This includes:

- profiling
- creating the KV cache
- warmup

---

# X. API server layer

## 30) Supported tasks

```bash
Supported tasks: ['generate']
```

The current model/server supports this task:

- `generate`

That is, text generation.

---

## 31) generation_config override notice

```bash
Default vLLM sampling parameters have been overridden by the model's `generation_config.json`: {'temperature': 0.6, 'top_p': 0.95}
```

The `generation_config.json` in the model directory overrides vLLM's default sampling parameters.

The default sampling parameters are now:

- `temperature = 0.6`
- `top_p = 0.95`

If you do not want the model's bundled config to take effect, add:

```bash
--generation-config vllm
```

---

## 32) Chat template format detection

```bash
Detected the chat template content format to be 'string'
```

vLLM detected that the chat template content format is `string`.

You can usually ignore this unless you use a custom chat template.

---

## 33) Server address

```bash
Starting vLLM server on http://0.0.0.0:6006
```

This means:

- Listening on all network interfaces: `0.0.0.0`
- Port: `6006`

To access it from the same machine, use:

```bash
http://127.0.0.1:6006
```

For external access, use the host machine's IP instead.

---

# XI. Available endpoints

The log lists all routes. The most commonly used are:

## OpenAI-compatible endpoints

- `/v1/chat/completions`
- `/v1/completions`
- `/v1/models`

## Other endpoints

- `/health`: health check
- `/metrics`: monitoring metrics
- `/version`: version info
- `/tokenize`: tokenization
- `/detokenize`: detokenization
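
For a scripted check of these routes, a minimal sketch using only the Python standard library could look like the following. It assumes the server is reachable at the local address from the log and that `/v1/models` returns the usual OpenAI-style list with a `data` array. The helper names are hypothetical.

```python
import json
import urllib.request

BASE_URL = "http://127.0.0.1:6006"  # local address taken from the log


def endpoint(path: str, base_url: str = BASE_URL) -> str:
    """Join the base URL and a route path with exactly one slash between them."""
    return base_url.rstrip("/") + "/" + path.lstrip("/")


def is_healthy(base_url: str = BASE_URL, timeout: float = 5.0) -> bool:
    """Return True when GET /health answers with HTTP 200."""
    try:
        with urllib.request.urlopen(endpoint("/health", base_url), timeout=timeout) as resp:
            return resp.status == 200
    except OSError:  # connection refused, timeout, or HTTP error status
        return False


def list_model_ids(base_url: str = BASE_URL, timeout: float = 5.0) -> list[str]:
    """Return the model names the server exposes through GET /v1/models."""
    with urllib.request.urlopen(endpoint("/v1/models", base_url), timeout=timeout) as resp:
        body = json.load(resp)
    # ASSUMPTION: OpenAI-style response of the form {"data": [{"id": ...}, ...]}.
    return [model["id"] for model in body["data"]]


# Example usage against a running server:
#   is_healthy()
#   list_model_ids()
```

With the launch arguments shown earlier, the model list would be expected to contain the served name `deepseek-r1`.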

---

# XII. Uvicorn/FastAPI startup complete

The last few lines:

```bash
Started server process [17207]
Waiting for application startup.
Application startup complete.
```

This means:

- The API server process has started
- Application initialization is complete
- The server can now accept requests

In other words: **your vLLM server has started successfully.**

---

# XIII. The core indicators most worth watching in this log

If you only want a quick read on the server's state in future, focus on these:

---

## 1) Was the model recognized

```bash
Resolved architecture: Qwen2ForCausalLM
```

---

## 2) Did the model load

```bash
Loading weights took 13.38 seconds
Model loading took 61.06 GiB memory
```

---

## 3) How much KV cache there is

```bash
Available KV cache memory: 11.89 GiB
GPU KV cache size: 48,704 tokens
```

---

## 4) Theoretical maximum concurrency

```bash
# 32B
Maximum concurrency for 8,192 tokens per request: 5.95x
# 1.5B
Maximum concurrency for 8,192 tokens per request: 318.54x

# 1.5B, 128k
Maximum concurrency for 131,072 tokens per request: 19.91x
```

---

## 5) Did the server actually start

```bash
Starting vLLM server on http://0.0.0.0:6006
Application startup complete.
```
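
If you check these indicators often, pulling them out of the log with a few regular expressions saves scrolling. The sketch below is one possible approach. The patterns are written against the message texts quoted in this document and may need adjusting for other vLLM versions. `extract_metrics` is a hypothetical helper.

```python
import re

# One pattern per indicator, matching the log messages quoted above.
PATTERNS = {
    "architecture": re.compile(r"Resolved architecture: (\w+)"),
    "model_memory_gib": re.compile(r"Model loading took ([\d.]+) GiB"),
    "kv_cache_gib": re.compile(r"Available KV cache memory: ([\d.]+) GiB"),
    "kv_cache_tokens": re.compile(r"GPU KV cache size: ([\d,]+) tokens"),
    "max_concurrency": re.compile(
        r"Maximum concurrency for [\d,]+ tokens per request: ([\d.]+)x"
    ),
}


def extract_metrics(log_text: str) -> dict:
    """Collect the key startup indicators from a vLLM log as strings."""
    metrics = {}
    for name, pattern in PATTERNS.items():
        match = pattern.search(log_text)
        if match:
            metrics[name] = match.group(1).replace(",", "")
    # The final Uvicorn line signals that the server accepts requests.
    metrics["started"] = "Application startup complete." in log_text
    return metrics


SAMPLE = """\
Resolved architecture: Qwen2ForCausalLM
Model loading took 61.06 GiB memory and 14.101032 seconds
Available KV cache memory: 11.89 GiB
GPU KV cache size: 48,704 tokens
Maximum concurrency for 8,192 tokens per request: 5.95x
Application startup complete.
"""

print(extract_metrics(SAMPLE))
```

Values are returned as strings so the caller can decide how to convert them. A missing key means the corresponding line never appeared, which is itself a useful signal that startup stopped early.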

---

# XIV. Summary of your log

The result of this startup:

- vLLM version: `0.18.0`
- Model: `DeepSeek-R1-32B`
- Underlying architecture: `Qwen2ForCausalLM`
- Precision: `bfloat16`
- Single-GPU deployment
- GPU memory used by model weights: `61.06 GiB`
- GPU memory available for the KV cache: `11.89 GiB`
- Total KV cache capacity: `48,704 tokens`
- Theoretical concurrency at a context length of `8192`: about `5.95`
- Server listen address: `http://0.0.0.0:6006`
- OpenAI-compatible endpoint available: `/v1/chat/completions`

---

# XV. How to test the server right away

For example, with curl:

```bash
curl http://127.0.0.1:6006/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "deepseek-r1",
    "messages": [
      {"role": "user", "content": "Hello, please introduce yourself"}
    ],
    "temperature": 0.6,
    "top_p": 0.95,
    "max_tokens": 128
  }'
```
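
The same request can be sent from Python without extra dependencies. This is a minimal sketch that mirrors the curl call above. It assumes the response follows the OpenAI chat completions shape, with the reply under `choices[0].message.content`. `build_payload` and `chat` are hypothetical helper names.

```python
import json
import urllib.request

URL = "http://127.0.0.1:6006/v1/chat/completions"


def build_payload(prompt: str, max_tokens: int = 128) -> dict:
    """Build the same JSON body as the curl example."""
    return {
        "model": "deepseek-r1",  # must match served_model_name
        "messages": [{"role": "user", "content": prompt}],
        "temperature": 0.6,
        "top_p": 0.95,
        "max_tokens": max_tokens,
    }


def chat(prompt: str, url: str = URL, timeout: float = 120.0) -> str:
    """POST one chat request and return the assistant's reply text."""
    request = urllib.request.Request(
        url,
        data=json.dumps(build_payload(prompt)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        body = json.load(resp)
    # ASSUMPTION: OpenAI-style response layout.
    return body["choices"][0]["message"]["content"]


# Example usage against a running server:
#   print(chat("Hello, please introduce yourself"))
```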

```
Value error, User-specified max_model_len (200000) is greater than the derived max_model_len (max_position_embeddings=131072.0 or model_max_length=None in model's config.json
```
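
The error above is raised when the requested `max_model_len` exceeds the limit derived from the model's `config.json` (here `max_position_embeddings=131072`). The sketch below mirrors that comparison so you can validate a value before launching. It is an illustration of the check, not vLLM's actual code, and `check_max_model_len` is a hypothetical helper.

```python
def check_max_model_len(requested: int, derived_limit: int) -> int:
    """Return `requested` if it fits within the model's derived limit, else raise."""
    if requested > derived_limit:
        raise ValueError(
            f"User-specified max_model_len ({requested}) is greater than "
            f"the derived max_model_len ({derived_limit})"
        )
    return requested


DERIVED_LIMIT = 131_072  # max_position_embeddings from the error message

print(check_max_model_len(8_192, DERIVED_LIMIT))  # 8192

try:
    check_max_model_len(200_000, DERIVED_LIMIT)
except ValueError as err:
    print(err)
```

The practical fix is to pass a `max_model_len` at or below the derived limit.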