2026-03-27 15:43:03 +08:00
parent ab0cbad418
commit e4a339bd77
43 changed files with 2973 additions and 179 deletions
@@ -0,0 +1,730 @@
+```
+(APIServer pid=17207) INFO 03-24 13:28:31 [api_server.py:580] Starting vLLM server on http://0.0.0.0:6006
+(APIServer pid=17207) INFO 03-24 13:28:31 [launcher.py:37] Available routes are:
+(APIServer pid=17207) INFO 03-24 13:28:31 [launcher.py:46] Route: /openapi.json, Methods: HEAD, GET
+(APIServer pid=17207) INFO 03-24 13:28:31 [launcher.py:46] Route: /docs, Methods: HEAD, GET
+(APIServer pid=17207) INFO 03-24 13:28:31 [launcher.py:46] Route: /docs/oauth2-redirect, Methods: HEAD, GET
+(APIServer pid=17207) INFO 03-24 13:28:31 [launcher.py:46] Route: /redoc, Methods: HEAD, GET
+(APIServer pid=17207) INFO 03-24 13:28:31 [launcher.py:46] Route: /tokenize, Methods: POST
+(APIServer pid=17207) INFO 03-24 13:28:31 [launcher.py:46] Route: /detokenize, Methods: POST
+(APIServer pid=17207) INFO 03-24 13:28:31 [launcher.py:46] Route: /load, Methods: GET
+(APIServer pid=17207) INFO 03-24 13:28:31 [launcher.py:46] Route: /version, Methods: GET
+(APIServer pid=17207) INFO 03-24 13:28:31 [launcher.py:46] Route: /health, Methods: GET
+(APIServer pid=17207) INFO 03-24 13:28:31 [launcher.py:46] Route: /metrics, Methods: GET
+(APIServer pid=17207) INFO 03-24 13:28:31 [launcher.py:46] Route: /v1/models, Methods: GET
+(APIServer pid=17207) INFO 03-24 13:28:31 [launcher.py:46] Route: /ping, Methods: GET
+(APIServer pid=17207) INFO 03-24 13:28:31 [launcher.py:46] Route: /ping, Methods: POST
+(APIServer pid=17207) INFO 03-24 13:28:31 [launcher.py:46] Route: /invocations, Methods: POST
+(APIServer pid=17207) INFO 03-24 13:28:31 [launcher.py:46] Route: /v1/chat/completions, Methods: POST
+(APIServer pid=17207) INFO 03-24 13:28:31 [launcher.py:46] Route: /v1/responses, Methods: POST
+(APIServer pid=17207) INFO 03-24 13:28:31 [launcher.py:46] Route: /v1/responses/{response_id}, Methods: GET
+(APIServer pid=17207) INFO 03-24 13:28:31 [launcher.py:46] Route: /v1/responses/{response_id}/cancel, Methods: POST
+(APIServer pid=17207) INFO 03-24 13:28:31 [launcher.py:46] Route: /v1/completions, Methods: POST
+(APIServer pid=17207) INFO 03-24 13:28:31 [launcher.py:46] Route: /v1/messages, Methods: POST
+(APIServer pid=17207) INFO 03-24 13:28:31 [launcher.py:46] Route: /v1/messages/count_tokens, Methods: POST
+(APIServer pid=17207) INFO 03-24 13:28:31 [launcher.py:46] Route: /inference/v1/generate, Methods: POST
+(APIServer pid=17207) INFO 03-24 13:28:31 [launcher.py:46] Route: /scale_elastic_ep, Methods: POST
+(APIServer pid=17207) INFO 03-24 13:28:31 [launcher.py:46] Route: /is_scaling_elastic_ep, Methods: POST
+(APIServer pid=17207) INFO 03-24 13:28:31 [launcher.py:46] Route: /v1/chat/completions/render, Methods: POST
+(APIServer pid=17207) INFO 03-24 13:28:31 [launcher.py:46] Route: /v1/completions/render, Methods: POST
+(APIServer pid=17207) INFO:     Started server process [17207]
+(APIServer pid=17207) INFO:     Waiting for application startup.
+(APIServer pid=17207) INFO:     Application startup complete.
+
+```
+
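+# I. Basic information at the top
+
+## 1) Version and model path
+
+```bash
+version 0.18.0
+model   /root/autodl-tmp/DeepSeek-R1-32B
+```
+
+This means:
+
+- Current vLLM version: `0.18.0`
+- Model directory being loaded: `/root/autodl-tmp/DeepSeek-R1-32B`
+
+These are the first two things to confirm.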
+
+---
+
+## 2) Non-default launch arguments
+
+```bash
+non-default args: {
+  'port': 6006,
+  'model': '/root/autodl-tmp/DeepSeek-R1-32B',
+  'max_model_len': 8192,
+  'served_model_name': ['deepseek-r1'],
+  'gpu_memory_utilization': 0.95
+}
+```
+
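+This lists the arguments you set explicitly that differ from the defaults.
+
+What your key arguments mean:
+
+- `port: 6006`
+  The server listens on port 6006.
+
+- `model: /root/autodl-tmp/DeepSeek-R1-32B`
+  The model directory.
+
+- `max_model_len: 8192`
+  Maximum context length of 8192 tokens.
+
+- `served_model_name: deepseek-r1`
+  The model name exposed to clients at the API layer.
+
+- `gpu_memory_utilization: 0.95`
+  vLLM may use up to 95% of GPU memory for the model and cache allocations.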
+
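The log prints the non-default arguments as a dict, not the command that produced them. As a rough sketch, the dict can be turned back into a `vllm serve` command line. The helper name `to_cli` is hypothetical, and the result is a reconstruction, not the command actually used for this run.

```python
import shlex

# Non-default args copied from the log above.
non_default_args = {
    "port": 6006,
    "model": "/root/autodl-tmp/DeepSeek-R1-32B",
    "max_model_len": 8192,
    "served_model_name": ["deepseek-r1"],
    "gpu_memory_utilization": 0.95,
}


def to_cli(args: dict) -> str:
    """Rebuild a plausible `vllm serve` command from the logged dict (hypothetical helper)."""
    parts = ["vllm", "serve", args["model"]]
    for key, value in args.items():
        if key == "model":
            continue  # the model path is passed positionally
        parts.append("--" + key.replace("_", "-"))
        values = value if isinstance(value, list) else [value]
        parts.extend(str(v) for v in values)
    return shlex.join(parts)


command = to_cli(non_default_args)
print(command)
```

Comparing the reconstructed command with your own launch script is a quick way to spot an argument that was silently dropped or mistyped.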
+---
+
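+# II. Model detection and scheduling
+
+## 3) Architecture detection
+
+```bash
+Resolved architecture: Qwen2ForCausalLM
+```
+
+vLLM identified the model's underlying architecture as:
+
+- `Qwen2ForCausalLM`
+
+This is expected even though the directory is named DeepSeek-R1-32B: the 32B R1 checkpoint is most likely a distilled model built on a Qwen2.5 base (for example DeepSeek-R1-Distill-Qwen-32B), and such models load as `Qwen2ForCausalLM`.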
+
+---
+
+## 4) Maximum length
+
+```bash
+Using max model len 8192
+```
+
+The maximum sequence length finally in effect is 8192.
+
+This value directly affects:
+
+- KV cache usage
+- Maximum concurrency
+- GPU memory usage
+
+---
+
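+## 5) Chunked prefill
+
+```bash
+Chunked prefill is enabled with max_num_batched_tokens=8192.
+```
+
+**Chunked prefill** is enabled.
+
+What it does:
+
+- A long input prompt is not prefilled in one pass; it is processed in chunks, with at most `max_num_batched_tokens` tokens scheduled per step
+- Prefill chunks can share a step with decode requests, which improves throughput and memory use
+- It helps most with long-context workloads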
+
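To make the chunking idea concrete, here is a minimal sketch that splits one prompt into prefill chunks under a per-step token budget. It is a simplification: the function name `split_prefill` is hypothetical, and a real scheduler also shares the budget with decode requests that are running at the same time.

```python
def split_prefill(prompt_len: int, token_budget: int) -> list[int]:
    """Return the chunk sizes used to prefill a prompt of `prompt_len` tokens."""
    if token_budget <= 0:
        raise ValueError("token_budget must be positive")
    chunks = []
    remaining = prompt_len
    while remaining > 0:
        step = min(remaining, token_budget)  # never exceed the per-step budget
        chunks.append(step)
        remaining -= step
    return chunks


# With the logged budget of 8192, a full-length 8192-token prompt fits in one chunk.
print(split_prefill(8192, 8192))
# A longer prompt (hypothetical, beyond this server's max_model_len) needs several.
print(split_prefill(20000, 8192))
```

In this deployment `max_model_len` equals `max_num_batched_tokens`, so a lone prompt is never split; chunking only matters once several requests compete for the same step.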
+---
+
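+## 6) Asynchronous scheduling
+
+```bash
+Asynchronous scheduling is enabled.
+```
+
+The **asynchronous scheduler** is enabled.
+
+What it does:
+
+- Schedules multiple requests more efficiently by overlapping scheduling work with GPU execution
+- Usually improves throughput
+- Suits serving deployments well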
+
+---
+
+# III. Engine initialization config
+
+This part is important:
+
+```bash
+Initializing a V1 LLM engine (v0.18.0) with config: ...
+```
+
+This line is the full configuration dump for the inference engine. The key fields are:
+
+---
+
+## 7）dtype
+
+```bash
+dtype=torch.bfloat16
+```
+
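+The model runs in `bfloat16` precision.
+
+Notes:
+
+- Weights take half the memory of fp32 (2 bytes per parameter instead of 4)
+- Usually faster
+- Requires a GPU with bf16 support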
+
+---
+
+## 8) Parallelism
+
+```bash
+tensor_parallel_size=1
+pipeline_parallel_size=1
+data_parallel_size=1
+```
+
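+Everything is single-GPU and single-replica:
+
+- Tensor parallel TP = 1
+- Pipeline parallel PP = 1
+- Data parallel DP = 1
+
+In other words: **this is effectively a single-GPU deployment**.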
+
+---
+
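+## 9) Prefix caching
+
+```bash
+enable_prefix_caching=True
+```
+
+**Prefix caching** is enabled.
+
+What it does:
+
+- When several requests share the same prompt prefix
+- the KV cache already computed for that prefix can be reused
+- which saves time and raises throughput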
+
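The reuse described above can be sketched with chained block hashes: each full block of tokens is hashed together with the hash of everything before it, so two requests share a cached block only if their entire prefix up to that block matches. This is an illustration of the idea, not vLLM's implementation; the block size of 16 and both function names are assumptions.

```python
import hashlib

BLOCK_SIZE = 16  # assumed tokens per block, for illustration only


def block_hashes(token_ids: list[int], block_size: int = BLOCK_SIZE) -> list[str]:
    """Hash every full block, chaining in the hash of the preceding blocks."""
    hashes = []
    parent = ""
    full = len(token_ids) - len(token_ids) % block_size  # ignore the partial tail block
    for start in range(0, full, block_size):
        block = token_ids[start:start + block_size]
        text = parent + "|" + ",".join(map(str, block))
        parent = hashlib.sha256(text.encode()).hexdigest()
        hashes.append(parent)
    return hashes


def shared_prefix_blocks(a: list[int], b: list[int]) -> int:
    """Count leading blocks whose chained hashes match, i.e. blocks that could be reused."""
    count = 0
    for x, y in zip(block_hashes(a), block_hashes(b)):
        if x != y:
            break
        count += 1
    return count


system_prompt = list(range(40))              # 40 tokens shared by both requests
req_a = system_prompt + [1001, 1002, 1003]
req_b = system_prompt + [2001, 2002]
print(shared_prefix_blocks(req_a, req_b))    # full blocks covered by the shared prefix
```

Only whole blocks are reusable in this sketch, so 40 shared tokens yield 2 reusable blocks (32 tokens); the last 8 shared tokens sit in a partial block and are recomputed.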
+---
+
+## 10) Compilation config
+
+```bash
+compilation_config=...
+backend='inductor'
+cudagraph_mode=FULL_AND_PIECEWISE
+```
+
+vLLM has compilation optimizations and CUDA Graphs enabled:
+
+- `torch.compile`
+- `inductor`
+- CUDA Graph
+
+This improves inference performance, at the cost of extra startup time for warmup and graph capture.
+
+---
+
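+# IV. Distributed setup and rank info
+
+## 11) world size / rank
+
+```bash
+world_size=1 rank=0 local_rank=0 ... backend=nccl
+```
+
+Meaning:
+
+- Total number of processes: 1
+- Current rank: 0
+- Local rank: 0
+- Communication backend: NCCL
+
+Even with a single GPU, vLLM still goes through its common distributed initialization path, so this is normal.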
+
+---
+
+## 12) Parallel role assignment
+
+```bash
+rank 0 in world size 1 is assigned as DP rank 0, PP rank 0, PCP rank 0, TP rank 0
+```
+
+The single process holds every parallel role.
+
+---
+
+# V. Model loading
+
+## 13) Loading starts
+
+```bash
+Starting to load model /root/autodl-tmp/DeepSeek-R1-32B...
+```
+
+vLLM starts reading the model weights.
+
+---
+
+## 14) Attention backend
+
+```bash
+Using FLASH_ATTN attention backend ...
+Using FlashAttention version 2
+```
+
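+vLLM ended up selecting:
+
+- `FLASH_ATTN`
+- version `FlashAttention v2`
+
+This is a high-performance attention implementation and usually a good outcome.
+
+The other candidate backends were:
+
+- FLASHINFER
+- TRITON_ATTN
+- FLEX_ATTENTION
+
+but FlashAttention was the one chosen.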
+
+---
+
+## 15) Checkpoint shard loading progress
+
+```bash
+Loading safetensors checkpoint shards: ...
+```
+
+The model weights are stored in shards, `8` in total.
+
+The log shows:
+
+- 0/8
+- 1/8
+- …
+- 8/8
+
+so all weight shards were loaded successfully.
+
+---
+
+## 16) Weight loading time
+
+```bash
+Loading weights took 13.38 seconds
+```
+
+Reading the weight files and placing them in memory took 13.38 seconds on its own.
+
+---
+
+## 17) Memory used by model loading
+
+```bash
+Model loading took 61.06 GiB memory and 14.101032 seconds
+```
+
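+This is a key line:
+
+- After loading, the model occupies `61.06 GiB` of GPU memory
+- Total time was about `14.1 seconds`
+
+It helps you judge:
+
+- How large the model itself is
+- How much GPU memory is left for the KV cache and CUDA Graphs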
+
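The 61.06 GiB figure is close to what parameter count times bytes per parameter predicts. The sketch below assumes roughly 32.8 billion parameters, which is an assumption about this checkpoint and is not printed anywhere in the log; the helper name `weight_gib` is hypothetical.

```python
GIB = 1024 ** 3


def weight_gib(num_params: float, bytes_per_param: int) -> float:
    """Estimate weight memory in GiB from parameter count and dtype width."""
    return num_params * bytes_per_param / GIB


assumed_params = 32.8e9  # assumption: approximate parameter count of a 32B-class model

print(f"bf16 estimate: {weight_gib(assumed_params, 2):.2f} GiB")
print(f"fp32 estimate: {weight_gib(assumed_params, 4):.2f} GiB")
print("logged:        61.06 GiB")
```

The bf16 estimate lands within a fraction of a GiB of the logged value, while fp32 would need roughly twice as much and would not fit alongside a KV cache on the same card.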
+---
+
+# VI. Compilation and cache
+
+## 18) torch.compile cache directory
+
+```bash
+Using cache directory: /root/.cache/vllm/torch_compile_cache/...
+```
+
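+vLLM uses this directory to cache `torch.compile` results.
+
+Benefits:
+
+- On the next restart, if the configuration is unchanged, the cache may be reused directly
+- Startup is faster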
+
+---
+
+## 19) Compilation time
+
+```bash
+Dynamo bytecode transform time: 4.52 s
+torch.compile took 7.13 s in total
+```
+
+Meaning:
+
+- The Dynamo transform took 4.52 seconds
+- Compilation took 7.13 seconds overall
+
+---
+
+## 20) Compiled graphs loaded directly from cache
+
+```bash
+Directly load the compiled graph(s) ... from the cache
+Directly load AOT compilation from path ...
+```
+
+This run did not recompile everything from scratch; it **hit the existing cache**.
+
+That is why startup is much faster.
+
+---
+
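+# VII. Warmup and profiling
+
+## 21) Initial profiling / warmup
+
+```bash
+Initial profiling/warmup run took 2.69 s
+```
+
+The model did one warmup pass:
+
+- Exercises the kernels once
+- Builds the execution graph
+- Prepares for real inference; this profiling pass is also where peak memory use is measured before the KV cache is sized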
+
+---
+
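+# VIII. KV cache
+
+This part matters most for serving performance and concurrency.
+
+## 22) KV cache block override
+
+```bash
+Overriding num_gpu_blocks=0 with num_gpu_blocks_override=512
+```
+
+At this point the GPU KV cache block count is overridden from 0 to `512`.
+
+The block count determines how many tokens can be cached. Note that 512 blocks does not match the final capacity reported further down (48,704 tokens, which would be 3,044 blocks at a block size of 16), so this is most likely a temporary value used during profiling, not the final block count.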
+
+---
+
+## 23) Estimated CUDA graph memory
+
+```bash
+Estimated CUDA graph memory: 0.93 GiB total
+```
+
+CUDA graphs themselves are expected to take about `0.93 GiB` of GPU memory.
+
+---
+
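+## 24) Memory available for the KV cache
+
+```bash
+Available KV cache memory: 11.89 GiB
+```
+
+This is a key line:
+
+- After model weights, compilation, and CUDA graphs have taken their share
+- the GPU memory left for the KV cache is `11.89 GiB`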
+
+---
+
+## 25）GPU KV cache size
+
+```bash
+GPU KV cache size: 48,704 tokens
+```
+
+The GPU can cache at most about:
+
+- `48,704 tokens`
+
+This is the total token capacity across all requests, not a per-request limit.
+
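The 48,704-token figure and the 11.89 GiB figure can be cross-checked from the per-token KV cache size. The sketch below assumes a Qwen2.5-32B-style shape (64 layers, 8 KV heads, head dimension 128) and a block size of 16; none of these values appear in the log, so treat them as assumptions to verify against the model's `config.json`.

```python
GIB = 1024 ** 3

# Assumed model shape; check against config.json before relying on it.
num_layers = 64
num_kv_heads = 8
head_dim = 128
dtype_bytes = 2   # bfloat16, from the log
block_size = 16   # assumed tokens per KV cache block

# Each token stores one key and one value vector per layer and per KV head.
bytes_per_token = 2 * num_layers * num_kv_heads * head_dim * dtype_bytes

kv_tokens_logged = 48_704  # "GPU KV cache size" from the log
kv_gib = kv_tokens_logged * bytes_per_token / GIB
num_blocks = kv_tokens_logged // block_size

print(f"{bytes_per_token // 1024} KiB per token")
print(f"{kv_gib:.2f} GiB for {kv_tokens_logged:,} tokens ({num_blocks:,} blocks)")
```

Under these assumptions the computed size matches the logged `Available KV cache memory: 11.89 GiB`, which suggests the assumed shape is right for this checkpoint.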
+---
+
+## 26) Maximum concurrency estimate
+
+```bash
+Maximum concurrency for 8,192 tokens per request: 5.95x
+```
+
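+Meaning:
+
+- If every request uses the full `8192 tokens`
+- then in theory close to `5.95` such requests can be resident at once
+
+That is roughly **5 to 6 full-length requests in parallel**.
+
+This is a theoretical upper bound; in practice it also depends on generation length, scheduling, fragmentation, and so on.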
+
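The `5.95x` value is the KV cache capacity divided by the maximum request length, which a couple of lines reproduce. The function name `max_concurrency` is hypothetical; the two inputs are taken from the log.

```python
def max_concurrency(kv_cache_tokens: int, max_model_len: int) -> float:
    """How many full-length requests fit in the KV cache at the same time."""
    return kv_cache_tokens / max_model_len


ratio = max_concurrency(48_704, 8_192)
print(f"{ratio:.2f}x")
```

Because the capacity is fixed by the available memory, halving `max_model_len` would double this ratio, and shorter real-world requests raise the effective concurrency well above it.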
+---
+
+# IX. CUDA Graph capture
+
+## 27) Capture process
+
+```bash
+Capturing CUDA graphs (mixed prefill-decode, PIECEWISE): ...
+Capturing CUDA graphs (decode, FULL): ...
+```
+
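+vLLM is capturing CUDA Graphs for different batch sizes and scenarios.
+
+What this does:
+
+- Reduces kernel launch overhead
+- Speeds up inference
+- Especially useful for online serving with stable batch sizes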
+
+---
+
+## 28) Capture result
+
+```bash
+Graph capturing finished in 13 secs, took 0.93 GiB
+CUDA graph pool memory: 0.93 GiB (actual), 0.93 GiB (estimated)
+```
+
+Meaning:
+
+- Graph capture took 13 seconds in total
+- It uses 0.93 GiB of GPU memory
+- The measured and estimated values agree
+
+This is a normal and fairly ideal result.
+
+---
+
+## 29) Total engine initialization time
+
+```bash
+init engine (profile, create kv cache, warmup model) took 25.79 seconds
+```
+
+Engine initialization took about:
+
+- `25.79 seconds`
+
+This includes:
+
+- profiling
+- creating the KV cache
+- warmup
+
+---
+
+# X. API server layer
+
+## 30) Supported tasks
+
+```bash
+Supported tasks: ['generate']
+```
+
+The task supported by this model and server is:
+
+- `generate`
+
+that is, text generation.
+
+---
+
+## 31) generation_config override notice
+
+```bash
+Default vLLM sampling parameters have been overridden by the model's `generation_config.json`: {'temperature': 0.6, 'top_p': 0.95}
+```
+
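+The `generation_config.json` in the model directory overrides vLLM's default sampling parameters.
+
+The defaults are now:
+
+- `temperature = 0.6`
+- `top_p = 0.95`
+
+If you do not want the model's bundled config to take effect, add: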
+
+```bash
+--generation-config vllm
+```
+
+---
+
+## 32) Chat template format detection
+
+```bash
+Detected the chat template content format to be 'string'
+```
+
+vLLM detected that the chat template's content format is `string`.
+
+You can usually ignore this unless you use a custom chat template.
+
+---
+
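+## 33) Server address
+
+```bash
+Starting vLLM server on http://0.0.0.0:6006
+```
+
+Meaning:
+
+- Listening on all network interfaces: `0.0.0.0`
+- Port: `6006`
+
+From the same machine, the address is:
+
+```bash
+http://127.0.0.1:6006
+```
+
+For external access, replace it with the host machine's IP.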
+
+---
+
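+# XI. Available routes
+
+The log lists every route. The most commonly used are:
+
+## OpenAI-compatible endpoints
+
+- `/v1/chat/completions`
+- `/v1/completions`
+- `/v1/models`
+
+## Other endpoints
+
+- `/health`: health check
+- `/metrics`: monitoring metrics
+- `/version`: version info
+- `/tokenize`: tokenization
+- `/detokenize`: detokenization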
+
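A small probe for the `/health` route listed above might look like the following sketch. It uses only the standard library; the helper names `route_url` and `is_healthy` are hypothetical, and the base URL assumes you are on the same machine as the server.

```python
import urllib.error
import urllib.request

BASE_URL = "http://127.0.0.1:6006"  # address from the log; change it for remote access


def route_url(route: str, base_url: str = BASE_URL) -> str:
    """Join the base URL and a route without doubling or dropping slashes."""
    return base_url.rstrip("/") + "/" + route.lstrip("/")


def is_healthy(base_url: str = BASE_URL, timeout: float = 3.0) -> bool:
    """Return True if GET /health answers with HTTP 200, False on any connection error."""
    try:
        with urllib.request.urlopen(route_url("/health", base_url), timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        return False


print(route_url("/health"))
# With the server running, probe it like this:
# print("healthy:", is_healthy())
```

A probe like this is handy in a startup script that has to wait for `Application startup complete.` before sending traffic.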
+---
+
+# XII. Uvicorn/FastAPI startup complete
+
+The last few lines:
+
+```bash
+Started server process [17207]
+Waiting for application startup.
+Application startup complete.
+```
+
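+This means:
+
+- The API server process has started
+- Application initialization is complete
+- The server can now accept requests
+
+In other words: **your vLLM server has started successfully.**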
+
+---
+
+# XIII. The core metrics worth watching in this log
+
+If you only want a quick read on server status in the future, focus on these:
+
+---
+
+## 1) Was the model recognized
+
+```bash
+Resolved architecture: Qwen2ForCausalLM
+```
+
+---
+
+## 2) Did the model load
+
+```bash
+Loading weights took 13.38 seconds
+Model loading took 61.06 GiB memory
+```
+
+---
+
+## 3) How much KV cache is there
+
+```bash
+Available KV cache memory: 11.89 GiB
+GPU KV cache size: 48,704 tokens
+```
+
+---
+
+## 4) Theoretical maximum concurrency
+
+```bash
+# 32B model (this log)
+Maximum concurrency for 8,192 tokens per request: 5.95x
+# 1.5B model
+Maximum concurrency for 8,192 tokens per request: 318.54x
+
+# 1.5B model, 128k context
+Maximum concurrency for 131,072 tokens per request: 19.91x
+```
+
+---
+
+## 5) Did the server actually start
+
+```bash
+Starting vLLM server on http://0.0.0.0:6006
+Application startup complete.
+```
+
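The five checks above can be pulled out of a log automatically. The sketch below matches the exact log phrases quoted in this document with regular expressions; the function name `summarize` and the dictionary keys are hypothetical, and the patterns may need adjusting for other vLLM versions.

```python
import re

# One pattern per metric, written against the log lines quoted in this document.
PATTERNS = {
    "architecture": r"Resolved architecture: (\S+)",
    "model_mem_gib": r"Model loading took ([\d.]+) GiB",
    "kv_mem_gib": r"Available KV cache memory: ([\d.]+) GiB",
    "kv_tokens": r"GPU KV cache size: ([\d,]+) tokens",
    "max_concurrency": r"Maximum concurrency for [\d,]+ tokens per request: ([\d.]+)x",
    "started": r"(Application startup complete)",
}


def summarize(log_text: str) -> dict:
    """Return the first match for each metric, or None if the line is missing."""
    summary = {}
    for name, pattern in PATTERNS.items():
        match = re.search(pattern, log_text)
        summary[name] = match.group(1) if match else None
    return summary


sample = """\
Resolved architecture: Qwen2ForCausalLM
Model loading took 61.06 GiB memory and 14.101032 seconds
Available KV cache memory: 11.89 GiB
GPU KV cache size: 48,704 tokens
Maximum concurrency for 8,192 tokens per request: 5.95x
Application startup complete.
"""

for key, value in summarize(sample).items():
    print(f"{key}: {value}")
```

A `None` for `started` means the server never finished starting, which is usually the first thing to check when a deployment hangs.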
+---
+
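+# XIV. Your log in short
+
+The result of this startup is:
+
+- vLLM version: `0.18.0`
+- Model: `DeepSeek-R1-32B`
+- Underlying architecture: `Qwen2ForCausalLM`
+- Precision: `bfloat16`
+- Single-GPU deployment
+- GPU memory used by model weights: `61.06 GiB`
+- GPU memory available for the KV cache: `11.89 GiB`
+- Total KV cache capacity: `48,704 tokens`
+- Theoretical concurrency at a context length of `8192`: about `5.95`
+- Server address: `http://0.0.0.0:6006`
+- OpenAI-compatible endpoint available: `/v1/chat/completions`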
+
+---
+
+# XV. Testing the server right away
+
+For example, with curl:
+
+```bash
+curl http://127.0.0.1:6006/v1/chat/completions \
+  -H "Content-Type: application/json" \
+  -d '{
+    "model": "deepseek-r1",
+    "messages": [
+      {"role": "user", "content": "Hello, please introduce yourself"}
+    ],
+    "temperature": 0.6,
+    "top_p": 0.95,
+    "max_tokens": 128
+  }'
+```
+
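The same request as the curl call can be sent from Python with the standard library only. This is a minimal sketch: the function names `build_chat_request` and `chat` are hypothetical, and the response parsing assumes the usual OpenAI-style shape with `choices[0].message.content`.

```python
import json
import urllib.request


def build_chat_request(prompt: str, base_url: str = "http://127.0.0.1:6006") -> urllib.request.Request:
    """Build the POST request for /v1/chat/completions with the same payload as the curl call."""
    payload = {
        "model": "deepseek-r1",  # must match served_model_name from the log
        "messages": [{"role": "user", "content": prompt}],
        "temperature": 0.6,
        "top_p": 0.95,
        "max_tokens": 128,
    }
    return urllib.request.Request(
        base_url + "/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(prompt: str) -> str:
    """Send the request and return the assistant's reply text (needs a running server)."""
    with urllib.request.urlopen(build_chat_request(prompt), timeout=60) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]


request = build_chat_request("Hello, please introduce yourself")
print(request.get_method(), request.full_url)
# With the server running:
# print(chat("Hello, please introduce yourself"))
```

Building the request separately from sending it keeps the payload easy to inspect when a call fails with a 4xx error, for example because the `model` field does not match the served name.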
+```
+ Value error, User-specified max_model_len (200000) is greater than the derived max_model_len (max_position_embeddings=131072.0 or model_max_length=None in model's config.json
+```
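The error above appears when the requested `max_model_len` (200000) exceeds the limit derived from the model's `config.json` (`max_position_embeddings=131072`). The check behind it can be modelled roughly as follows; this is a simplified illustration under that reading of the message, not vLLM's actual code, and the function name is hypothetical.

```python
def resolve_max_model_len(requested, derived: int) -> int:
    """Pick the effective max_model_len, rejecting values above the model's own limit."""
    if requested is None:
        return derived  # nothing specified: fall back to the model's limit
    if requested > derived:
        raise ValueError(
            f"User-specified max_model_len ({requested}) is greater than "
            f"the derived max_model_len ({derived})"
        )
    return requested


print(resolve_max_model_len(8192, 131072))    # accepted, as in the main log
try:
    resolve_max_model_len(200000, 131072)     # rejected, as in the error above
except ValueError as err:
    print("rejected:", err)
```

Staying at or below 131072 avoids the error; a smaller value such as the 8192 used in this run also leaves more room for concurrent requests in the KV cache.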