vllm.entrypoints.openai.cli_args ¶
This file contains the command-line arguments for vLLM's OpenAI-compatible server. It is kept in a separate file for documentation purposes.
Classes:
- BaseFrontendArgs – Base arguments for the OpenAI-compatible frontend server.
- FrontendArgs – Arguments for the OpenAI-compatible frontend server.
Functions:
- make_arg_parser – Create the CLI argument parser used by the OpenAI API server.
- validate_parsed_serve_args – Quick checks for model serve args that raise prior to loading.
BaseFrontendArgs ¶
Base arguments for the OpenAI-compatible frontend server.
This base class does not include host, port, and server-specific arguments like SSL, CORS, and HTTP server settings. Those arguments are added by the subclasses.
Methods:
- add_cli_args – Register CLI arguments for this frontend class.
Attributes:
- chat_template (str | None) – The file path to the chat template, or the template in single-line form for the specified model.
- chat_template_content_format (ChatTemplateContentFormatOption) – The format to render message content within a chat template.
- default_chat_template_kwargs (dict[str, Any] | None) – Default keyword arguments to pass to the chat template renderer.
- enable_auto_tool_choice (bool) – Enable auto tool choice for supported models. Use --tool-call-parser to specify which parser to use.
- enable_force_include_usage (bool) – If set to True, include usage on every request.
- enable_log_deltas (bool) – If set to False, output deltas will not be logged. Relevant only if --enable-log-outputs is set.
- enable_log_outputs (bool) – If set to True, log model outputs (generations).
- enable_prompt_tokens_details (bool) – If set to True, enable prompt_tokens_details in usage.
- enable_server_load_tracking (bool) – If set to True, enable tracking server_load_metrics in the app state.
- enable_tokenizer_info_endpoint (bool) – Enable the /tokenizer_info endpoint. May expose chat templates and other tokenizer configuration.
- exclude_tools_when_tool_choice_none (bool) – If specified, exclude tool definitions in prompts when tool_choice='none'.
- fingerprint_mode (Literal['full', 'hash', 'custom', 'none']) – Controls the system_fingerprint field on responses.
- fingerprint_value (str | None) – Literal fingerprint string used when --fingerprint-mode=custom.
- log_config_file (str | None) – Path to logging config JSON file for both vllm and uvicorn.
- log_error_stack (bool) – If set to True, log the stack trace of error responses.
- lora_modules (list[LoRAModulePath] | None) – LoRA module configurations in 'name=path' format, JSON format, or JSON list format.
- max_log_len (int | None) – Max number of prompt characters or prompt ID numbers printed in the log.
- response_role (str) – The role name to return if request.add_generation_prompt=true.
- return_tokens_as_token_ids (bool) – When --max-logprobs is specified, represents single tokens as strings of the form 'token_id:{token_id}'.
- tokens_only (bool) – If set to True, only enable the Tokens In<>Out endpoint.
- tool_call_parser (str | None) – Select the tool call parser depending on the model that you're using.
- tool_parser_plugin (str) – Specify the tool parser plugin written to parse model-generated tool calls into OpenAI API format.
- tool_server (str | None) – Comma-separated list of host:port pairs (IPv4, IPv6, or hostname).
- trust_request_chat_template (bool) – Whether to trust the chat template provided in the request. If False, the server will always use the chat template specified by --chat-template or the one from the tokenizer.
Source code in vllm/entrypoints/openai/cli_args.py
chat_template = None class-attribute instance-attribute ¶
The file path to the chat template, or the template in single-line form for the specified model.
chat_template_content_format = 'auto' class-attribute instance-attribute ¶
The format to render message content within a chat template.
- "string" will render the content as a string. Example: "Hello World"
- "openai" will render the content as a list of dictionaries, similar to OpenAI schema. Example: [{"type": "text", "text": "Hello world!"}]
default_chat_template_kwargs = None class-attribute instance-attribute ¶
Default keyword arguments to pass to the chat template renderer. These will be merged with request-level chat_template_kwargs, with request values taking precedence. Useful for setting default behavior for reasoning models. Example: '{"enable_thinking": false}' to disable thinking mode by default for Qwen3/DeepSeek models.
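The merge rule described above (server defaults first, request values winning on conflict) can be sketched in a few lines. This is only an illustration of the stated precedence; `merge_chat_template_kwargs` is a hypothetical helper name, not a function from vLLM.

```python
from __future__ import annotations

from typing import Any


def merge_chat_template_kwargs(
    server_defaults: dict[str, Any] | None,
    request_kwargs: dict[str, Any] | None,
) -> dict[str, Any]:
    """Combine server-level defaults with per-request values.

    Hypothetical helper: request values override defaults on key conflicts.
    """
    merged = dict(server_defaults or {})  # copy so the defaults stay untouched
    merged.update(request_kwargs or {})   # request-level values take precedence
    return merged


# Server started with defaults equivalent to '{"enable_thinking": false}'.
defaults = {"enable_thinking": False}

# A request that sends no chat_template_kwargs keeps the default.
print(merge_chat_template_kwargs(defaults, None))

# A request that sets the key explicitly overrides the default.
print(merge_chat_template_kwargs(defaults, {"enable_thinking": True}))
```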
enable_auto_tool_choice = False class-attribute instance-attribute ¶
Enable auto tool choice for supported models. Use --tool-call-parser to specify which parser to use.
enable_force_include_usage = False class-attribute instance-attribute ¶
If set to True, include usage on every request.
enable_log_deltas = True class-attribute instance-attribute ¶
If set to False, output deltas will not be logged. Relevant only if --enable-log-outputs is set.
enable_log_outputs = False class-attribute instance-attribute ¶
If set to True, log model outputs (generations). Requires --enable-log-requests. As with --enable-log-requests, information is only logged at INFO level at maximum.
enable_prompt_tokens_details = False class-attribute instance-attribute ¶
If set to True, enable prompt_tokens_details in usage.
enable_server_load_tracking = False class-attribute instance-attribute ¶
If set to True, enable tracking server_load_metrics in the app state.
enable_tokenizer_info_endpoint = False class-attribute instance-attribute ¶
Enable the /tokenizer_info endpoint. May expose chat templates and other tokenizer configuration.
exclude_tools_when_tool_choice_none = False class-attribute instance-attribute ¶
If specified, exclude tool definitions in prompts when tool_choice='none'.
fingerprint_mode = 'full' class-attribute instance-attribute ¶
Controls the system_fingerprint field on responses.
- full (default): vllm-<version>[-<parallelism>]-<hash8>. Encodes server version, non-trivial parallelism degrees (tp/pp/dp/ep), and an 8-char config hash.
- hash: vllm-<version>-<hash8>. Parallelism stripped.
- custom: emits the literal string from --fingerprint-value.
- none: the field is omitted (serialized as null).
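To make the four modes concrete, here is a minimal sketch that assembles a fingerprint string from already-computed parts. The function name, the placeholder version "0.0.0", the parallelism token "tp2", and the hash value are all assumptions for illustration; how vLLM actually derives the parallelism segment and the config hash is not shown here.

```python
from __future__ import annotations


def format_fingerprint(
    mode: str,
    version: str,
    hash8: str,
    parallelism: str = "",
    custom_value: str | None = None,
) -> str | None:
    """Assemble a system_fingerprint value for the given mode (illustrative only)."""
    if mode == "full":
        # The parallelism segment appears only when it is non-trivial (non-empty here).
        middle = f"-{parallelism}" if parallelism else ""
        return f"vllm-{version}{middle}-{hash8}"
    if mode == "hash":
        # Same as "full" but with the parallelism segment stripped.
        return f"vllm-{version}-{hash8}"
    if mode == "custom":
        # The literal string supplied via --fingerprint-value.
        return custom_value
    if mode == "none":
        # The field is omitted, which serializes as null.
        return None
    raise ValueError(f"unknown fingerprint mode: {mode!r}")


# Placeholder inputs, not real vLLM values.
print(format_fingerprint("full", "0.0.0", "abcd1234", parallelism="tp2"))
print(format_fingerprint("hash", "0.0.0", "abcd1234", parallelism="tp2"))
```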
fingerprint_value = None class-attribute instance-attribute ¶
Literal fingerprint string used when --fingerprint-mode=custom.
log_config_file = envs.VLLM_LOGGING_CONFIG_PATH class-attribute instance-attribute ¶
Path to logging config JSON file for both vllm and uvicorn
log_error_stack = envs.VLLM_SERVER_DEV_MODE class-attribute instance-attribute ¶
If set to True, log the stack trace of error responses
lora_modules = None class-attribute instance-attribute ¶
LoRA module configurations in 'name=path' format, JSON format, or JSON list format. Example (old format): 'name=path'. Example (new format): {"name": "name", "path": "lora_path", "base_model_name": "id"}
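The two single-value formats can be told apart by their first character. The sketch below parses either one into a small record; `LoRAModuleSpec` and `parse_lora_module` are hypothetical stand-ins written for this example, not vLLM's `LoRAModulePath` or its actual parser, and the JSON list format is left out for brevity.

```python
from __future__ import annotations

import json
from dataclasses import dataclass


@dataclass
class LoRAModuleSpec:
    """Hypothetical stand-in for a parsed LoRA module entry."""

    name: str
    path: str
    base_model_name: str | None = None


def parse_lora_module(value: str) -> LoRAModuleSpec:
    """Parse one --lora-modules value in 'name=path' or JSON object form."""
    value = value.strip()
    if value.startswith("{"):
        # New format: a JSON object with name, path, and optional base_model_name.
        data = json.loads(value)
        return LoRAModuleSpec(
            name=data["name"],
            path=data["path"],
            base_model_name=data.get("base_model_name"),
        )
    # Old format: split on the first '=' only, so paths may contain '='.
    name, sep, path = value.partition("=")
    if not sep or not name or not path:
        raise ValueError(f"expected 'name=path' or a JSON object, got {value!r}")
    return LoRAModuleSpec(name=name, path=path)


print(parse_lora_module("sql-lora=/models/sql-lora"))
print(parse_lora_module('{"name": "name", "path": "lora_path", "base_model_name": "id"}'))
```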
max_log_len = None class-attribute instance-attribute ¶
Max number of prompt characters or prompt ID numbers printed in the log. The default of None means unlimited.
response_role = 'assistant' class-attribute instance-attribute ¶
The role name to return if request.add_generation_prompt=true.
return_tokens_as_token_ids = False class-attribute instance-attribute ¶
When --max-logprobs is specified, represents single tokens as strings of the form 'token_id:{token_id}' so that tokens that are not JSON-encodable can be identified.
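A client consuming such responses needs to recognize the 'token_id:{token_id}' form and recover the integer. A minimal sketch, with hypothetical helper names that are not part of vLLM:

```python
TOKEN_ID_PREFIX = "token_id:"


def token_as_id_string(token_id: int) -> str:
    """Render a token in the 'token_id:{token_id}' form described above."""
    return f"{TOKEN_ID_PREFIX}{token_id}"


def parse_token_id_string(token: str) -> int:
    """Recover the integer ID from a 'token_id:{token_id}' string."""
    if not token.startswith(TOKEN_ID_PREFIX):
        raise ValueError(f"not a token-id string: {token!r}")
    return int(token[len(TOKEN_ID_PREFIX):])


encoded = token_as_id_string(128)
print(encoded)
print(parse_token_id_string(encoded))
```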
tokens_only = False class-attribute instance-attribute ¶
If set to True, only enable the Tokens In<>Out endpoint. This is intended for use in a Disaggregated Everything setup.
tool_call_parser = None class-attribute instance-attribute ¶
Select the tool call parser depending on the model that you're using. This is used to parse the model-generated tool call into OpenAI API format. Required for --enable-auto-tool-choice. You can choose any option from the built-in parsers or register a plugin via --tool-parser-plugin.
tool_parser_plugin = '' class-attribute instance-attribute ¶
Specify the tool parser plugin written to parse model-generated tool calls into OpenAI API format. The names registered in this plugin can be used in --tool-call-parser.
tool_server = None class-attribute instance-attribute ¶
Comma-separated list of host:port pairs (IPv4, IPv6, or hostname). Examples: 127.0.0.1:8000, [::1]:8000, localhost:1234. Or demo for built-in demo tools (browser and Python code interpreter). WARNING: The demo Python tool executes model-generated code in Docker without network isolation by default. See the security guide for more information.
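The three address shapes in the examples (IPv4, bracketed IPv6, hostname) can all be split on the last colon. The sketch below shows one way to do that; `parse_tool_servers` is a hypothetical helper, not vLLM's parser, and it does not handle the special value demo.

```python
from __future__ import annotations


def parse_tool_servers(value: str) -> list[tuple[str, int]]:
    """Split a comma-separated host:port list into (host, port) tuples."""
    servers: list[tuple[str, int]] = []
    for item in value.split(","):
        item = item.strip()
        if not item:
            continue
        # Split on the LAST colon so the colons inside an IPv6 literal survive.
        host, sep, port = item.rpartition(":")
        if not sep or not host or not port.isdigit():
            raise ValueError(f"expected host:port, got {item!r}")
        # IPv6 literals are written in brackets, e.g. [::1]:8000.
        if host.startswith("[") and host.endswith("]"):
            host = host[1:-1]
        servers.append((host, int(port)))
    return servers


print(parse_tool_servers("127.0.0.1:8000, [::1]:8000, localhost:1234"))
```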
trust_request_chat_template = False class-attribute instance-attribute ¶
Whether to trust the chat template provided in the request. If False, the server will always use the chat template specified by --chat-template or the one from the tokenizer.
_customize_cli_kwargs(frontend_kwargs) classmethod ¶
Customize argparse kwargs before arguments are registered.
Subclasses should override this and call super()._customize_cli_kwargs(frontend_kwargs) first.
Source code in vllm/entrypoints/openai/cli_args.py
add_cli_args(parser) classmethod ¶
Register CLI arguments for this frontend class.
Subclasses should override _customize_cli_kwargs instead of this method so that base-class postprocessing is always applied.
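The division of labor between the two methods can be sketched with plain dataclasses and argparse. Everything below is a simplified, hypothetical stand-in (class names, the shape of the kwargs dict, and the return value of the hook are assumptions); it is not vLLM's implementation, only the override-the-hook pattern the docstrings describe.

```python
from __future__ import annotations

import argparse
from dataclasses import dataclass, fields
from typing import Any


@dataclass
class BaseArgs:
    """Hypothetical base class standing in for BaseFrontendArgs."""

    response_role: str = "assistant"
    max_log_len: int | None = None

    @classmethod
    def _customize_cli_kwargs(cls, kwargs: dict[str, dict[str, Any]]) -> dict[str, dict[str, Any]]:
        # Hook: adjust argparse kwargs before any argument is registered.
        kwargs["max_log_len"]["type"] = int
        return kwargs

    @classmethod
    def add_cli_args(cls, parser: argparse.ArgumentParser) -> argparse.ArgumentParser:
        # Build default kwargs from the dataclass fields, then let the hook chain refine them.
        kwargs = {f.name: {"default": f.default, "type": str} for f in fields(cls)}
        kwargs = cls._customize_cli_kwargs(kwargs)
        for name, kw in kwargs.items():
            parser.add_argument("--" + name.replace("_", "-"), **kw)
        return parser


@dataclass
class ServerArgs(BaseArgs):
    """Hypothetical subclass: overrides only the hook, never add_cli_args."""

    port: int = 8000

    @classmethod
    def _customize_cli_kwargs(cls, kwargs: dict[str, dict[str, Any]]) -> dict[str, dict[str, Any]]:
        kwargs = super()._customize_cli_kwargs(kwargs)  # base customizations first
        kwargs["port"]["type"] = int
        return kwargs


parser = ServerArgs.add_cli_args(argparse.ArgumentParser())
args = parser.parse_args(["--port", "9000", "--max-log-len", "100"])
print(args.port, args.max_log_len, args.response_role)
```

Because the subclass leaves add_cli_args alone, any postprocessing the base class performs there still runs for every subclass.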
Source code in vllm/entrypoints/openai/cli_args.py
FrontendArgs ¶
Bases: BaseFrontendArgs
Arguments for the OpenAI-compatible frontend server.
Attributes:
- allow_credentials (bool) – Allow credentials.
- allowed_headers (list[str]) – Allowed headers.
- allowed_methods (list[str]) – Allowed methods.
- allowed_origins (list[str]) – Allowed origins.
- api_key (list[str] | None) – If provided, the server will require one of these keys to be presented in the header.
- data_parallel_supervisor_port (int) – HTTP port for aggregated health endpoints in multi-port external LB mode.
- disable_access_log_for_endpoints (str | None) – Comma-separated list of endpoint paths to exclude from uvicorn access logs.
- disable_fastapi_docs (bool) – Disable FastAPI's OpenAPI schema, Swagger UI, and ReDoc endpoint.
- disable_uvicorn_access_log (bool) – Disable uvicorn access log.
- dp_supervisor_probe_failure_threshold (int) – Number of consecutive connection-error retries before a child health probe is declared failed in multi-port external LB mode.
- dp_supervisor_probe_interval_s (float) – Seconds between aggregated health probes in multi-port external LB mode.
- dp_supervisor_probe_timeout_s (float) – Seconds to wait between retries when a child health probe fails with a connection error in multi-port external LB mode.
- enable_flash_late_interaction (bool) – If set, run pooling score MaxSim on GPU in the API server process.
- enable_offline_docs (bool) – Enable offline FastAPI documentation for air-gapped environments.
- enable_request_id_headers (bool) – If specified, API server will add X-Request-Id header to responses.
- enable_ssl_refresh (bool) – Refresh the SSL context when SSL certificate files change.
- h11_max_header_count (int) – Maximum number of HTTP headers allowed in a request for h11 parser.
- h11_max_incomplete_event_size (int) – Maximum size (bytes) of an incomplete HTTP event (header or body) for h11 parser.
- host (str | None) – Host name.
- middleware (list[str]) – Additional ASGI middleware to apply to the app. We accept multiple --middleware arguments.
- port (int) – Port number.
- root_path (str | None) – FastAPI root_path when app is behind a path-based routing proxy.
- ssl_ca_certs (str | None) – The CA certificates file.
- ssl_cert_reqs (int) – Whether a client certificate is required (see the stdlib ssl module's CERT_* constants).
- ssl_certfile (str | None) – The file path to the SSL cert file.
- ssl_ciphers (str | None) – SSL cipher suites for HTTPS (TLS 1.2 and below only).
- ssl_keyfile (str | None) – The file path to the SSL key file.
- uds (str | None) – Unix domain socket path. If set, host and port arguments are ignored.
- uvicorn_log_level (Literal['critical', 'error', 'warning', 'info', 'debug', 'trace']) – Log level for uvicorn.
Source code in vllm/entrypoints/openai/cli_args.py
allow_credentials = False class-attribute instance-attribute ¶
Allow credentials.
allowed_headers = field(default_factory=(lambda: ['*'])) class-attribute instance-attribute ¶
Allowed headers.
allowed_methods = field(default_factory=(lambda: ['*'])) class-attribute instance-attribute ¶
Allowed methods.
allowed_origins = field(default_factory=(lambda: ['*'])) class-attribute instance-attribute ¶
Allowed origins.
api_key = None class-attribute instance-attribute ¶
If provided, the server will require one of these keys to be presented in the header.
data_parallel_supervisor_port = 9256 class-attribute instance-attribute ¶
HTTP port for aggregated health endpoints in multi-port external LB mode.
disable_access_log_for_endpoints = None class-attribute instance-attribute ¶
Comma-separated list of endpoint paths to exclude from uvicorn access logs. This is useful to reduce log noise from high-frequency endpoints like health checks. Example: "/health,/metrics,/ping". When set, access logs for requests to these paths will be suppressed while keeping logs for other endpoints.
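One common way to get this effect is a logging filter attached to the access logger. The sketch below is an illustration of that general technique, not vLLM's implementation; it assumes that uvicorn access records carry (client address, method, full path, HTTP version, status code) in record.args, and `EndpointAccessLogFilter` is a hypothetical name.

```python
import logging


class EndpointAccessLogFilter(logging.Filter):
    """Drop access-log records whose request path is in an excluded set."""

    def __init__(self, excluded_paths):
        super().__init__()
        self.excluded = set(excluded_paths)

    def filter(self, record):
        # Assumption: args is (client_addr, method, full_path, http_version, status_code).
        args = record.args
        if isinstance(args, tuple) and len(args) >= 3:
            path = str(args[2]).split("?", 1)[0]  # ignore any query string
            return path not in self.excluded
        return True  # keep records that do not look like access-log entries


# Parse the comma-separated flag value into individual paths.
flag_value = "/health,/metrics,/ping"
paths = [p.strip() for p in flag_value.split(",") if p.strip()]
access_filter = EndpointAccessLogFilter(paths)

# Attaching the filter suppresses matching records on this logger only.
logging.getLogger("uvicorn.access").addFilter(access_filter)
```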
disable_fastapi_docs = False class-attribute instance-attribute ¶
Disable FastAPI's OpenAPI schema, Swagger UI, and ReDoc endpoint.
disable_uvicorn_access_log = False class-attribute instance-attribute ¶
Disable uvicorn access log.
dp_supervisor_probe_failure_threshold = 3 class-attribute instance-attribute ¶
Number of consecutive connection-error retries before a child health probe is declared failed in multi-port external LB mode.
dp_supervisor_probe_interval_s = 5.0 class-attribute instance-attribute ¶
Seconds between aggregated health probes in multi-port external LB mode.
dp_supervisor_probe_timeout_s = 5.0 class-attribute instance-attribute ¶
Seconds to wait between retries when a child health probe fails with a connection error in multi-port external LB mode.
enable_flash_late_interaction = True class-attribute instance-attribute ¶
If set, run pooling score MaxSim on GPU in the API server process. Can significantly improve late-interaction scoring performance.
enable_offline_docs = False class-attribute instance-attribute ¶
Enable offline FastAPI documentation for air-gapped environments. Uses vendored static assets bundled with vLLM.
enable_request_id_headers = False class-attribute instance-attribute ¶
If specified, API server will add X-Request-Id header to responses.
enable_ssl_refresh = False class-attribute instance-attribute ¶
Refresh the SSL context when SSL certificate files change.
h11_max_header_count = H11_MAX_HEADER_COUNT_DEFAULT class-attribute instance-attribute ¶
Maximum number of HTTP headers allowed in a request for h11 parser. Helps mitigate header abuse. Default: 256.
h11_max_incomplete_event_size = H11_MAX_INCOMPLETE_EVENT_SIZE_DEFAULT class-attribute instance-attribute ¶
Maximum size (bytes) of an incomplete HTTP event (header or body) for h11 parser. Helps mitigate header abuse. Default: 4194304 (4 MB).
host = None class-attribute instance-attribute ¶
Host name.
middleware = field(default_factory=(lambda: [])) class-attribute instance-attribute ¶
Additional ASGI middleware to apply to the app. We accept multiple --middleware arguments. The value should be an import path. If a function is provided, vLLM will add it to the server using @app.middleware('http'). If a class is provided, vLLM will add it to the server using app.add_middleware().
port = 8000 class-attribute instance-attribute ¶
Port number.
root_path = None class-attribute instance-attribute ¶
FastAPI root_path when the app is behind a path-based routing proxy.
ssl_ca_certs = None class-attribute instance-attribute ¶
The CA certificates file.
ssl_cert_reqs = int(ssl.CERT_NONE) class-attribute instance-attribute ¶
Whether a client certificate is required (see the stdlib ssl module's CERT_NONE, CERT_OPTIONAL, and CERT_REQUIRED constants).
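Since the attribute is typed as an int and defaults to int(ssl.CERT_NONE), it helps to see the integer behind each stdlib constant. The snippet below only prints values from Python's ssl module; the dictionary keys are labels chosen for this example.

```python
import ssl

# Integer values of the stdlib verification modes.
CERT_REQS = {
    "none": int(ssl.CERT_NONE),          # client certificate not requested
    "optional": int(ssl.CERT_OPTIONAL),  # requested, but not required
    "required": int(ssl.CERT_REQUIRED),  # handshake fails without a valid certificate
}

print(CERT_REQS)  # {'none': 0, 'optional': 1, 'required': 2}
```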
ssl_certfile = None class-attribute instance-attribute ¶
The file path to the SSL cert file.
ssl_ciphers = None class-attribute instance-attribute ¶
SSL cipher suites for HTTPS (TLS 1.2 and below only). Example: 'ECDHE-RSA-AES256-GCM-SHA384:ECDHE-RSA-CHACHA20-POLY1305'
ssl_keyfile = None class-attribute instance-attribute ¶
The file path to the SSL key file.
uds = None class-attribute instance-attribute ¶
Unix domain socket path. If set, host and port arguments are ignored.
uvicorn_log_level = 'info' class-attribute instance-attribute ¶
Log level for uvicorn.
make_arg_parser(parser) ¶
Create the CLI argument parser used by the OpenAI API server.
We rely on the helper methods of FrontendArgs and AsyncEngineArgs to register all arguments instead of manually enumerating them here. This avoids code duplication and keeps the argument definitions in one place.
Source code in vllm/entrypoints/openai/cli_args.py
validate_parsed_serve_args(args) ¶
Quick checks for model serve args that raise prior to loading.