`vllm.entrypoints.openai.cli_args` ¶

This file contains the command-line arguments for vLLM's OpenAI-compatible server. It is kept in a separate file for documentation purposes.

Classes:

BaseFrontendArgs –

Base arguments for the OpenAI-compatible frontend server.
FrontendArgs –

Arguments for the OpenAI-compatible frontend server.

Functions:

make_arg_parser –

Create the CLI argument parser used by the OpenAI API server.
validate_parsed_serve_args –

Quick checks for model serve args that raise prior to loading.

`BaseFrontendArgs` ¶

Base arguments for the OpenAI-compatible frontend server.

This base class does not include host, port, and server-specific arguments like SSL, CORS, and HTTP server settings. Those arguments are added by the subclasses.

Methods:

add_cli_args –

Register CLI arguments for this frontend class.

Attributes:

chat_template (str | None) –

The file path to the chat template, or the template in single-line form for the specified model.
chat_template_content_format (ChatTemplateContentFormatOption) –

The format to render message content within a chat template.
default_chat_template_kwargs (dict[str, Any] | None) –

Default keyword arguments to pass to the chat template renderer.
enable_auto_tool_choice (bool) –

Enable auto tool choice for supported models. Use --tool-call-parser to specify which parser to use.
enable_force_include_usage (bool) –

If set to True, include usage on every request.
enable_log_deltas (bool) –

If set to False, output deltas will not be logged. Relevant only if --enable-log-outputs is set.
enable_log_outputs (bool) –

If set to True, log model outputs (generations).
enable_prompt_tokens_details (bool) –

If set to True, enable prompt_tokens_details in usage.
enable_server_load_tracking (bool) –

If set to True, enable tracking of server_load_metrics in the app state.
enable_tokenizer_info_endpoint (bool) –

Enable the /tokenizer_info endpoint. May expose chat templates and other tokenizer configuration.
exclude_tools_when_tool_choice_none (bool) –

If specified, exclude tool definitions in prompts when tool_choice='none'.
fingerprint_mode (Literal['full', 'hash', 'custom', 'none']) –

Controls the system_fingerprint field on responses.
fingerprint_value (str | None) –

Literal fingerprint string used when --fingerprint-mode=custom.
log_config_file (str | None) –

Path to a logging configuration JSON file for both vLLM and uvicorn.
log_error_stack (bool) –

If set to True, log the stack trace of error responses.
lora_modules (list[LoRAModulePath] | None) –

LoRA module configurations in 'name=path' format, JSON format, or JSON list format.
max_log_len (int | None) –

Maximum number of prompt characters or prompt token IDs printed in the log. The default of None means unlimited.
response_role (str) –

The role name to return if request.add_generation_prompt=true.
return_tokens_as_token_ids (bool) –

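When --max-logprobs is specified, represents single tokens as strings of the form 'token_id:{token_id}' so that tokens that are not JSON-encodable can be identified.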
tokens_only (bool) –

If set to True, only enable the Tokens In<>Out endpoint.
tool_call_parser (str | None) –

Select the tool call parser depending on the model that you're using.
tool_parser_plugin (str) –

Specify the tool parser plugin written to parse model-generated tool calls into OpenAI API format.
tool_server (str | None) –

Comma-separated list of host:port pairs (IPv4, IPv6, or hostname).
trust_request_chat_template (bool) –

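Whether to trust the chat template provided in the request. If False, the server will always use the chat template specified by --chat-template or the one from the tokenizer.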

Source code in vllm/entrypoints/openai/cli_args.py

@config
class BaseFrontendArgs:
    """Base arguments for the OpenAI-compatible frontend server.

    This base class does not include host, port, and server-specific arguments
    like SSL, CORS, and HTTP server settings. Those arguments are added by
    the subclasses.
    """

    lora_modules: list[LoRAModulePath] | None = None
    """LoRA module configurations in 'name=path' format, JSON format, or JSON
    list format. Example (old format): `'name=path'` Example (new
    format): `{\"name\": \"name\", \"path\": \"lora_path\",
    \"base_model_name\": \"id\"}`"""
    chat_template: str | None = None
    """The file path to the chat template, or the template in single-line form
    for the specified model."""
    chat_template_content_format: ChatTemplateContentFormatOption = "auto"
    """The format to render message content within a chat template.

    * "string" will render the content as a string. Example: `"Hello World"`
    * "openai" will render the content as a list of dictionaries, similar to
      OpenAI schema. Example: `[{"type": "text", "text": "Hello world!"}]`"""
    trust_request_chat_template: bool = False
    """Whether to trust the chat template provided in the request. If False,
    the server will always use the chat template specified by `--chat-template`
    or the one from the tokenizer."""
    default_chat_template_kwargs: dict[str, Any] | None = None
    """Default keyword arguments to pass to the chat template renderer.
    These will be merged with request-level chat_template_kwargs,
    with request values taking precedence. Useful for setting default
    behavior for reasoning models. Example: '{"enable_thinking": false}'
    to disable thinking mode by default for Qwen3/DeepSeek models."""
    response_role: str = "assistant"
    """The role name to return if `request.add_generation_prompt=true`."""
    return_tokens_as_token_ids: bool = False
    """When `--max-logprobs` is specified, represents single tokens as
    strings of the form 'token_id:{token_id}' so that tokens that are not
    JSON-encodable can be identified."""
    enable_auto_tool_choice: bool = False
    """Enable auto tool choice for supported models. Use `--tool-call-parser`
    to specify which parser to use."""
    exclude_tools_when_tool_choice_none: bool = False
    """If specified, exclude tool definitions in prompts when
    tool_choice='none'."""
    tool_call_parser: str | None = None
    """Select the tool call parser depending on the model that you're using.
    This is used to parse the model-generated tool call into OpenAI API format.
    Required for `--enable-auto-tool-choice`. You can choose any option from
    the built-in parsers or register a plugin via `--tool-parser-plugin`."""
    tool_parser_plugin: str = ""
    """Specify the tool parser plugin written to parse model-generated tool
    calls into OpenAI API format. Names registered in this plugin can be used
    in `--tool-call-parser`."""
    tool_server: str | None = None
    """Comma-separated list of host:port pairs (IPv4, IPv6, or hostname).
    Examples: 127.0.0.1:8000, [::1]:8000, localhost:1234. Or `demo` for
    built-in demo tools (browser and Python code interpreter). WARNING:
    The `demo` Python tool executes model-generated code in Docker without
    network isolation by default. See the security guide for more
    information."""
    log_config_file: str | None = envs.VLLM_LOGGING_CONFIG_PATH
    """Path to a logging configuration JSON file for both vLLM and uvicorn."""
    max_log_len: int | None = None
    """Maximum number of prompt characters or prompt token IDs printed in the
    log. The default of None means unlimited."""
    enable_prompt_tokens_details: bool = False
    """If set to True, enable prompt_tokens_details in usage."""
    enable_server_load_tracking: bool = False
    """If set to True, enable tracking of server_load_metrics in the app state."""
    enable_force_include_usage: bool = False
    """If set to True, include usage on every request."""
    enable_tokenizer_info_endpoint: bool = False
    """Enable the `/tokenizer_info` endpoint. May expose chat
    templates and other tokenizer configuration."""
    enable_log_outputs: bool = False
    """If set to True, log model outputs (generations).
    Requires `--enable-log-requests`. As with `--enable-log-requests`,
    information is only logged at INFO level at maximum."""
    enable_log_deltas: bool = True
    """If set to False, output deltas will not be logged. Relevant only if 
    --enable-log-outputs is set.
    """
    log_error_stack: bool = envs.VLLM_SERVER_DEV_MODE
    """If set to True, log the stack trace of error responses."""
    tokens_only: bool = False
    """
    If set to True, only enable the Tokens In<>Out endpoint.
    This is intended for use in a Disaggregated Everything setup.
    """
    fingerprint_mode: Literal["full", "hash", "custom", "none"] = "full"
    """Controls the ``system_fingerprint`` field on responses.

    - ``full`` (default): ``vllm-<version>[-<parallelism>]-<hash8>``. Encodes
      server version, non-trivial parallelism degrees (tp/pp/dp/ep), and an
      8-char config hash.
    - ``hash``: ``vllm-<version>-<hash8>``. Parallelism stripped.
    - ``custom``: emits the literal string from ``--fingerprint-value``.
    - ``none``: the field is omitted (serialized as ``null``).
    """
    fingerprint_value: str | None = None
    """Literal fingerprint string used when ``--fingerprint-mode=custom``."""

    @classmethod
    def _customize_cli_kwargs(
        cls,
        frontend_kwargs: dict[str, Any],
    ) -> dict[str, Any]:
        """Customize argparse kwargs before arguments are registered.

        Subclasses should override this and call
        ``super()._customize_cli_kwargs(frontend_kwargs)`` first.
        """
        # Special case: default_chat_template_kwargs needs json.loads type
        frontend_kwargs["default_chat_template_kwargs"]["type"] = json.loads

        # Special case: LoRA modules need custom parser action and
        # optional_type(str)
        frontend_kwargs["lora_modules"]["type"] = optional_type(str)
        frontend_kwargs["lora_modules"]["action"] = LoRAParserAction

        # Special case: Tool call parser shows built-in options.
        valid_tool_parsers = list(ToolParserManager.list_registered())
        parsers_str = ",".join(valid_tool_parsers)
        frontend_kwargs["tool_call_parser"]["metavar"] = (
            f"{{{parsers_str}}} or name registered in --tool-parser-plugin"
        )
        return frontend_kwargs

    @classmethod
    def add_cli_args(cls, parser: FlexibleArgumentParser) -> FlexibleArgumentParser:
        """Register CLI arguments for this frontend class.

        Subclasses should override ``_customize_cli_kwargs`` instead of
        this method so that base-class postprocessing is always applied.
        """
        from vllm.engine.arg_utils import get_kwargs

        frontend_kwargs = get_kwargs(cls)
        frontend_kwargs = cls._customize_cli_kwargs(frontend_kwargs)

        group_name = cls.__name__.replace("Args", "")
        frontend_group = parser.add_argument_group(
            title=group_name,
            description=cls.__doc__,
        )
        for key, value in frontend_kwargs.items():
            extra_flags = value.pop("flags", [])
            frontend_group.add_argument(
                *extra_flags, f"--{key.replace('_', '-')}", **value
            )

        return parser

`chat_template = None` `class-attribute` `instance-attribute` ¶

The file path to the chat template, or the template in single-line form for the specified model.

`chat_template_content_format = 'auto'` `class-attribute` `instance-attribute` ¶

The format to render message content within a chat template.

- "string" will render the content as a string. Example: "Hello World"
- "openai" will render the content as a list of dictionaries, similar to the OpenAI schema. Example: [{"type": "text", "text": "Hello world!"}]
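
To make the difference between the two formats concrete, the sketch below shapes one message's text the way each option describes. The helper name `render_content` is hypothetical; vLLM's actual rendering path and its "auto" detection are not reproduced here.

```python
from typing import Literal, Union

# Hypothetical alias for the two content shapes described above.
Content = Union[str, list[dict[str, str]]]


def render_content(text: str, fmt: Literal["string", "openai"]) -> Content:
    """Hypothetical helper: shape message text the way each format describes."""
    if fmt == "string":
        # "string": the template sees the content as plain text.
        return text
    if fmt == "openai":
        # "openai": the template sees a list of typed parts, like the OpenAI schema.
        return [{"type": "text", "text": text}]
    raise ValueError(f"unsupported format: {fmt!r}")


print(render_content("Hello world!", "string"))  # Hello world!
print(render_content("Hello world!", "openai"))  # [{'type': 'text', 'text': 'Hello world!'}]
```

A template written for the "openai" shape has to iterate over the parts, which is why the two formats are not interchangeable for a given template.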

`default_chat_template_kwargs = None` `class-attribute` `instance-attribute` ¶

Default keyword arguments to pass to the chat template renderer. These will be merged with request-level chat_template_kwargs, with request values taking precedence. Useful for setting default behavior for reasoning models. Example: '{"enable_thinking": false}' to disable thinking mode by default for Qwen3/DeepSeek models.
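
The precedence rule above behaves like a dictionary merge in which request keys win. A minimal sketch, assuming a shallow (top-level) merge; the helper name `merge_chat_template_kwargs` is hypothetical. The CLI value itself is a JSON string, which the source parses with `json.loads`.

```python
import json
from typing import Any, Optional


def merge_chat_template_kwargs(
    default_kwargs: Optional[dict[str, Any]],
    request_kwargs: Optional[dict[str, Any]],
) -> dict[str, Any]:
    """Hypothetical helper: server defaults first, request values override."""
    merged: dict[str, Any] = dict(default_kwargs or {})
    merged.update(request_kwargs or {})  # assumed shallow merge
    return merged


# --default-chat-template-kwargs '{"enable_thinking": false}'
defaults = json.loads('{"enable_thinking": false}')

print(merge_chat_template_kwargs(defaults, None))                       # {'enable_thinking': False}
print(merge_chat_template_kwargs(defaults, {"enable_thinking": True}))  # {'enable_thinking': True}
```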

`enable_auto_tool_choice = False` `class-attribute` `instance-attribute` ¶

Enable auto tool choice for supported models. Use --tool-call-parser to specify which parser to use.

`enable_force_include_usage = False` `class-attribute` `instance-attribute` ¶

If set to True, include usage on every request.

`enable_log_deltas = True` `class-attribute` `instance-attribute` ¶

If set to False, output deltas will not be logged. Relevant only if --enable-log-outputs is set.

`enable_log_outputs = False` `class-attribute` `instance-attribute` ¶

If set to True, log model outputs (generations). Requires --enable-log-requests. As with --enable-log-requests, this information is logged at INFO level at most.

`enable_prompt_tokens_details = False` `class-attribute` `instance-attribute` ¶

If set to True, enable prompt_tokens_details in usage.

`enable_server_load_tracking = False` `class-attribute` `instance-attribute` ¶

If set to True, enable tracking of server_load_metrics in the app state.

`enable_tokenizer_info_endpoint = False` `class-attribute` `instance-attribute` ¶

Enable the /tokenizer_info endpoint. May expose chat templates and other tokenizer configuration.

`exclude_tools_when_tool_choice_none = False` `class-attribute` `instance-attribute` ¶

If specified, exclude tool definitions in prompts when tool_choice='none'.

`fingerprint_mode = 'full'` `class-attribute` `instance-attribute` ¶

Controls the system_fingerprint field on responses.

- full (default): vllm-<version>[-<parallelism>]-<hash8>. Encodes server version, non-trivial parallelism degrees (tp/pp/dp/ep), and an 8-char config hash.
- hash: vllm-<version>-<hash8>. Parallelism stripped.
- custom: emits the literal string from --fingerprint-value.
- none: the field is omitted (serialized as null).
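
The four modes can be sketched as a single function. This is an illustration only: the exact `<parallelism>` rendering (here assumed to be `tp2-dp4` style, keeping only degrees greater than 1) and how the 8-character hash is computed are assumptions, so `make_fingerprint` is a hypothetical helper rather than vLLM's implementation.

```python
from typing import Literal, Optional

FingerprintMode = Literal["full", "hash", "custom", "none"]


def make_fingerprint(
    mode: FingerprintMode,
    version: str,
    parallelism: dict[str, int],
    config_hash8: str,
    custom_value: Optional[str] = None,
) -> Optional[str]:
    """Hypothetical sketch of the fingerprint modes listed above."""
    if mode == "none":
        return None  # field omitted, i.e. serialized as null
    if mode == "custom":
        return custom_value  # literal --fingerprint-value
    if mode == "hash":
        return f"vllm-{version}-{config_hash8}"  # parallelism stripped
    if mode == "full":
        # Assumption: "non-trivial" means degree > 1, rendered as e.g. "tp2".
        parts = [f"{name}{size}" for name, size in parallelism.items() if size > 1]
        middle = "-" + "-".join(parts) if parts else ""
        return f"vllm-{version}{middle}-{config_hash8}"
    raise ValueError(f"unknown fingerprint mode: {mode!r}")


print(make_fingerprint("full", "0.0.0", {"tp": 2, "pp": 1, "dp": 4}, "1a2b3c4d"))
# vllm-0.0.0-tp2-dp4-1a2b3c4d
```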

`fingerprint_value = None` `class-attribute` `instance-attribute` ¶

Literal fingerprint string used when --fingerprint-mode=custom.

`log_config_file = envs.VLLM_LOGGING_CONFIG_PATH` `class-attribute` `instance-attribute` ¶

Path to a logging configuration JSON file for both vLLM and uvicorn.

`log_error_stack = envs.VLLM_SERVER_DEV_MODE` `class-attribute` `instance-attribute` ¶

If set to True, log the stack trace of error responses.

`lora_modules = None` `class-attribute` `instance-attribute` ¶

LoRA module configurations in 'name=path' format, JSON format, or JSON list format. Example (old format): 'name=path'. Example (new format): {"name": "name", "path": "lora_path", "base_model_name": "id"}
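
The accepted shapes can be illustrated with a small standalone parser. This is a hypothetical sketch, not vLLM's `LoRAParserAction`; it assumes the "JSON list format" means a JSON array of objects that each carry at least "name" and "path".

```python
import json
from typing import Any


def parse_lora_modules(value: str) -> list[dict[str, Any]]:
    """Hypothetical parser for one --lora-modules value."""
    value = value.strip()
    if value.startswith("["):
        # Assumed list form: a JSON array of objects.
        specs = json.loads(value)
    elif value.startswith("{"):
        # New format: a single JSON object.
        specs = [json.loads(value)]
    else:
        # Old format: name=path, split on the first '='.
        name, sep, path = value.partition("=")
        if not (sep and name and path):
            raise ValueError(f"expected 'name=path', got {value!r}")
        specs = [{"name": name, "path": path}]
    for spec in specs:
        if not isinstance(spec, dict) or "name" not in spec or "path" not in spec:
            raise ValueError(f"LoRA spec needs 'name' and 'path': {spec!r}")
    return specs


print(parse_lora_modules("sql-lora=/models/sql-lora"))
print(parse_lora_modules('{"name": "sql-lora", "path": "/models/sql-lora", "base_model_name": "id"}'))
```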

`max_log_len = None` `class-attribute` `instance-attribute` ¶

Maximum number of prompt characters or prompt token IDs printed in the log. The default of None means unlimited.

`response_role = 'assistant'` `class-attribute` `instance-attribute` ¶

The role name to return if request.add_generation_prompt=true.

`return_tokens_as_token_ids = False` `class-attribute` `instance-attribute` ¶

When --max-logprobs is specified, represents single tokens as strings of the form 'token_id:{token_id}' so that tokens that are not JSON-encodable can be identified.
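
The 'token_id:{token_id}' form is plain ASCII, so it always survives JSON encoding, and a client can recover the integer ID by stripping the prefix. Both helpers below are hypothetical illustrations of that string format, not vLLM functions.

```python
TOKEN_ID_PREFIX = "token_id:"


def token_repr(token_id: int, token_text: str, as_token_ids: bool) -> str:
    """Hypothetical helper: choose how a token appears in logprobs output."""
    if as_token_ids:
        # Always ASCII, so it is safe to embed in a JSON response.
        return f"{TOKEN_ID_PREFIX}{token_id}"
    return token_text


def token_id_from_repr(text: str) -> int:
    """Hypothetical client-side helper: parse 'token_id:{token_id}'."""
    if not text.startswith(TOKEN_ID_PREFIX):
        raise ValueError(f"not a token-id string: {text!r}")
    return int(text[len(TOKEN_ID_PREFIX):])


print(token_repr(1234, "Hello", as_token_ids=True))  # token_id:1234
print(token_id_from_repr("token_id:1234"))           # 1234
```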

`tokens_only = False` `class-attribute` `instance-attribute` ¶

If set to True, only enable the Tokens In<>Out endpoint. This is intended for use in a Disaggregated Everything setup.

`tool_call_parser = None` `class-attribute` `instance-attribute` ¶

Select the tool call parser depending on the model that you're using. This is used to parse the model-generated tool call into OpenAI API format. Required for --enable-auto-tool-choice. You can choose any option from the built-in parsers or register a plugin via --tool-parser-plugin.

`tool_parser_plugin = ''` `class-attribute` `instance-attribute` ¶

Specify the tool parser plugin written to parse model-generated tool calls into OpenAI API format. Names registered in this plugin can be used in --tool-call-parser.

`tool_server = None` `class-attribute` `instance-attribute` ¶

Comma-separated list of host:port pairs (IPv4, IPv6, or hostname). Examples: 127.0.0.1:8000, [::1]:8000, localhost:1234. Or demo for built-in demo tools (browser and Python code interpreter). WARNING: The demo Python tool executes model-generated code in Docker without network isolation by default. See the security guide for more information.
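
Splitting such a value needs a little care for bracketed IPv6 literals, whose addresses contain colons themselves. The sketch below is a hypothetical parser for the documented formats; it assumes "demo" is given on its own rather than mixed with host:port entries.

```python
from typing import List, Tuple, Union


def parse_tool_servers(value: str) -> Union[str, List[Tuple[str, int]]]:
    """Hypothetical parser for --tool-server values."""
    value = value.strip()
    if value == "demo":
        return "demo"  # built-in demo tools (assumed to be used alone)
    servers: List[Tuple[str, int]] = []
    for item in value.split(","):
        item = item.strip()
        if item.startswith("["):
            # Bracketed IPv6 literal, e.g. [::1]:8000
            host, sep, port = item[1:].partition("]:")
        else:
            # IPv4 or hostname: the port follows the last colon.
            host, sep, port = item.rpartition(":")
        if not sep or not host:
            raise ValueError(f"expected host:port, got {item!r}")
        servers.append((host, int(port)))
    return servers


print(parse_tool_servers("127.0.0.1:8000,[::1]:8000,localhost:1234"))
# [('127.0.0.1', 8000), ('::1', 8000), ('localhost', 1234)]
```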

`trust_request_chat_template = False` `class-attribute` `instance-attribute` ¶

Whether to trust the chat template provided in the request. If False, the server will always use the chat template specified by --chat-template or the one from the tokenizer.

`_customize_cli_kwargs(frontend_kwargs)` `classmethod` ¶

Customize argparse kwargs before arguments are registered.

Subclasses should override this and call super()._customize_cli_kwargs(frontend_kwargs) first.
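
The override contract can be illustrated with two stand-in classes. `BaseArgsSketch` and `ServerArgsSketch` are hypothetical; the point is that the subclass calls `super()._customize_cli_kwargs` first, so base-class tweaks (here the `json.loads` type) are applied before subclass-specific ones (here dropping `nargs`, as FrontendArgs does for its CORS lists).

```python
import json
from typing import Any


class BaseArgsSketch:
    """Hypothetical stand-in for BaseFrontendArgs' customization hook."""

    @classmethod
    def _customize_cli_kwargs(cls, kwargs: dict[str, Any]) -> dict[str, Any]:
        kwargs["default_chat_template_kwargs"]["type"] = json.loads
        return kwargs


class ServerArgsSketch(BaseArgsSketch):
    """Hypothetical subclass following the documented contract."""

    @classmethod
    def _customize_cli_kwargs(cls, kwargs: dict[str, Any]) -> dict[str, Any]:
        # Parent first, so base-class postprocessing is never skipped.
        kwargs = super()._customize_cli_kwargs(kwargs)
        kwargs["allowed_origins"]["type"] = json.loads
        kwargs["allowed_origins"].pop("nargs", None)  # one JSON value, not many
        return kwargs


kwargs = {
    "default_chat_template_kwargs": {"type": str},
    "allowed_origins": {"type": str, "nargs": "+"},
}
customized = ServerArgsSketch._customize_cli_kwargs(kwargs)
```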

`add_cli_args(parser)` `classmethod` ¶

Register CLI arguments for this frontend class.

Subclasses should override _customize_cli_kwargs instead of this method so that base-class postprocessing is always applied.
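
The registration loop can be reproduced with the standard library alone. In this sketch `DemoArgs` and `get_kwargs_sketch` are hypothetical, and `get_kwargs_sketch` is a greatly simplified stand-in for `vllm.engine.arg_utils.get_kwargs` (it only maps `bool` fields to `store_true` and passes other field types through as `type=`). The loop body mirrors the source: the group title is the class name without "Args", and each field becomes a `--dashed-name` flag.

```python
import argparse
from dataclasses import dataclass, fields
from typing import Any


@dataclass
class DemoArgs:
    """Hypothetical args class, used only for this sketch."""

    enable_request_id_headers: bool = False
    max_log_len: int = 0


def get_kwargs_sketch(cls: type) -> dict[str, dict[str, Any]]:
    """Greatly simplified stand-in for get_kwargs (assumption)."""
    kwargs: dict[str, dict[str, Any]] = {}
    for f in fields(cls):
        if f.type is bool:
            kwargs[f.name] = {"action": "store_true", "default": f.default}
        else:
            kwargs[f.name] = {"type": f.type, "default": f.default}
    return kwargs


def add_cli_args_sketch(cls: type, parser: argparse.ArgumentParser) -> argparse.ArgumentParser:
    kwargs = get_kwargs_sketch(cls)
    # Group title is the class name with "Args" stripped, e.g. "Demo".
    group = parser.add_argument_group(title=cls.__name__.replace("Args", ""), description=cls.__doc__)
    for key, value in kwargs.items():
        extra_flags = value.pop("flags", [])
        # Field names become flags: max_log_len -> --max-log-len.
        group.add_argument(*extra_flags, f"--{key.replace('_', '-')}", **value)
    return parser


parser = add_cli_args_sketch(DemoArgs, argparse.ArgumentParser())
args = parser.parse_args(["--enable-request-id-headers", "--max-log-len", "100"])
print(args.enable_request_id_headers, args.max_log_len)  # True 100
```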

`FrontendArgs` ¶

Bases: BaseFrontendArgs

Arguments for the OpenAI-compatible frontend server.

Attributes:

allow_credentials (bool) –

Allow credentials in cross-origin (CORS) requests.
allowed_headers (list[str]) –

Allowed headers for cross-origin (CORS) requests.
allowed_methods (list[str]) –

Allowed HTTP methods for cross-origin (CORS) requests.
allowed_origins (list[str]) –

Allowed origins for cross-origin (CORS) requests.
api_key (list[str] | None) –

If provided, the server will require one of these keys to be presented in the header.
data_parallel_supervisor_port (int) –

HTTP port for aggregated health endpoints in multi-port external LB mode.
disable_access_log_for_endpoints (str | None) –

Comma-separated list of endpoint paths to exclude from uvicorn access logs.
disable_fastapi_docs (bool) –

Disable FastAPI's OpenAPI schema, Swagger UI, and ReDoc endpoints.
disable_uvicorn_access_log (bool) –

Disable uvicorn access log.
dp_supervisor_probe_failure_threshold (int) –

Number of consecutive connection-error retries before a child health probe is declared failed in multi-port external LB mode.
dp_supervisor_probe_interval_s (float) –

Seconds between aggregated health probes in multi-port external LB mode.
dp_supervisor_probe_timeout_s (float) –

Seconds to wait between retries when a child health probe fails with a connection error in multi-port external LB mode.
enable_flash_late_interaction (bool) –

If set, run MaxSim scoring for pooling models on the GPU in the API server process.
enable_offline_docs (bool) –

Enable offline FastAPI documentation for air-gapped environments.
enable_request_id_headers (bool) –

If specified, the API server will add an X-Request-Id header to responses.
enable_ssl_refresh (bool) –

Refresh the SSL context when the SSL certificate files change.
h11_max_header_count (int) –

Maximum number of HTTP headers allowed in a request for the h11 parser.
h11_max_incomplete_event_size (int) –

Maximum size (bytes) of an incomplete HTTP event (header or body) for the h11 parser.
host (str | None) –

Host name.
middleware (list[str]) –

Additional ASGI middleware to apply to the app. We accept multiple --middleware arguments.
port (int) –

Port number.
root_path (str | None) –

FastAPI root_path when the app is behind a path-based routing proxy.
ssl_ca_certs (str | None) –

The CA certificates file.
ssl_cert_reqs (int) –

Whether a client certificate is required (an ssl.CERT_* value from the stdlib ssl module).
ssl_certfile (str | None) –

The file path to the SSL cert file.
ssl_ciphers (str | None) –

SSL cipher suites for HTTPS (TLS 1.2 and below only).
ssl_keyfile (str | None) –

The file path to the SSL key file.
uds (str | None) –

Unix domain socket path. If set, host and port arguments are ignored.
uvicorn_log_level (Literal['critical', 'error', 'warning', 'info', 'debug', 'trace']) –

Log level for uvicorn.

Source code in vllm/entrypoints/openai/cli_args.py

@config
class FrontendArgs(BaseFrontendArgs):
    """Arguments for the OpenAI-compatible frontend server."""

    host: str | None = None
    """Host name."""
    port: int = 8000
    """Port number."""
    data_parallel_supervisor_port: int = 9256
    """HTTP port for aggregated health endpoints in multi-port external LB
    mode."""
    dp_supervisor_probe_interval_s: float = 5.0
    """Seconds between aggregated health probes in multi-port external LB mode."""
    dp_supervisor_probe_timeout_s: float = 5.0
    """Seconds to wait between retries when a child health probe fails with a
    connection error in multi-port external LB mode."""
    dp_supervisor_probe_failure_threshold: int = 3
    """Number of consecutive connection-error retries before a child health
    probe is declared failed in multi-port external LB mode."""
    uds: str | None = None
    """Unix domain socket path. If set, host and port arguments are ignored."""
    uvicorn_log_level: Literal[
        "critical", "error", "warning", "info", "debug", "trace"
    ] = "info"
    """Log level for uvicorn."""
    disable_uvicorn_access_log: bool = False
    """Disable uvicorn access log."""
    disable_access_log_for_endpoints: str | None = None
    """Comma-separated list of endpoint paths to exclude from uvicorn access
    logs. This is useful to reduce log noise from high-frequency endpoints
    like health checks. Example: "/health,/metrics,/ping".
    When set, access logs for requests to these paths will be suppressed
    while keeping logs for other endpoints."""
    allow_credentials: bool = False
    """Allow credentials."""
    allowed_origins: list[str] = field(default_factory=lambda: ["*"])
    """Allowed origins."""
    allowed_methods: list[str] = field(default_factory=lambda: ["*"])
    """Allowed methods."""
    allowed_headers: list[str] = field(default_factory=lambda: ["*"])
    """Allowed headers."""
    api_key: list[str] | None = None
    """If provided, the server will require one of these keys to be presented in
    the header."""
    ssl_keyfile: str | None = None
    """The file path to the SSL key file."""
    ssl_certfile: str | None = None
    """The file path to the SSL cert file."""
    ssl_ca_certs: str | None = None
    """The CA certificates file."""
    enable_ssl_refresh: bool = False
    """Refresh the SSL context when the SSL certificate files change."""
    ssl_cert_reqs: int = int(ssl.CERT_NONE)
    """Whether a client certificate is required (an ssl.CERT_* value from the
    stdlib ssl module)."""
    ssl_ciphers: str | None = None
    """SSL cipher suites for HTTPS (TLS 1.2 and below only).
    Example: 'ECDHE-RSA-AES256-GCM-SHA384:ECDHE-RSA-CHACHA20-POLY1305'"""
    root_path: str | None = None
    """FastAPI root_path when the app is behind a path-based routing proxy."""
    middleware: list[str] = field(default_factory=lambda: [])
    """Additional ASGI middleware to apply to the app. We accept multiple
    --middleware arguments. The value should be an import path. If a function
    is provided, vLLM will add it to the server using
    `@app.middleware('http')`. If a class is provided, vLLM will
    add it to the server using `app.add_middleware()`."""
    enable_request_id_headers: bool = False
    """If specified, the API server will add an X-Request-Id header to
    responses."""
    disable_fastapi_docs: bool = False
    """Disable FastAPI's OpenAPI schema, Swagger UI, and ReDoc endpoints."""
    h11_max_incomplete_event_size: int = H11_MAX_INCOMPLETE_EVENT_SIZE_DEFAULT
    """Maximum size (bytes) of an incomplete HTTP event (header or body) for
    h11 parser. Helps mitigate header abuse. Default: 4194304 (4 MB)."""
    h11_max_header_count: int = H11_MAX_HEADER_COUNT_DEFAULT
    """Maximum number of HTTP headers allowed in a request for h11 parser.
    Helps mitigate header abuse. Default: 256."""
    enable_offline_docs: bool = False
    """
    Enable offline FastAPI documentation for air-gapped environments.
    Uses vendored static assets bundled with vLLM.
    """
    enable_flash_late_interaction: bool = True
    """If set, run MaxSim scoring for pooling models on the GPU in the API
    server process. Can significantly improve late-interaction scoring
    performance."""

    @classmethod
    def _customize_cli_kwargs(
        cls,
        frontend_kwargs: dict[str, Any],
    ) -> dict[str, Any]:
        frontend_kwargs = super()._customize_cli_kwargs(frontend_kwargs)

        # Special case: allowed_origins, allowed_methods, allowed_headers all
        # need json.loads type
        # Should also remove nargs
        frontend_kwargs["allowed_origins"]["type"] = json.loads
        frontend_kwargs["allowed_methods"]["type"] = json.loads
        frontend_kwargs["allowed_headers"]["type"] = json.loads
        del frontend_kwargs["allowed_origins"]["nargs"]
        del frontend_kwargs["allowed_methods"]["nargs"]
        del frontend_kwargs["allowed_headers"]["nargs"]

        # Special case: Middleware needs to append action
        frontend_kwargs["middleware"]["action"] = "append"
        frontend_kwargs["middleware"]["type"] = str
        if "nargs" in frontend_kwargs["middleware"]:
            del frontend_kwargs["middleware"]["nargs"]
        frontend_kwargs["middleware"]["default"] = []

        # Special case: disable_access_log_for_endpoints is a single
        # comma-separated string, not a list
        if "nargs" in frontend_kwargs["disable_access_log_for_endpoints"]:
            del frontend_kwargs["disable_access_log_for_endpoints"]["nargs"]

        return frontend_kwargs

`allow_credentials = False` `class-attribute` `instance-attribute` ¶

Allow credentials in cross-origin (CORS) requests.

`allowed_headers = field(default_factory=(lambda: ['*']))` `class-attribute` `instance-attribute` ¶

Allowed headers for cross-origin (CORS) requests.

`allowed_methods = field(default_factory=(lambda: ['*']))` `class-attribute` `instance-attribute` ¶

Allowed HTTP methods for cross-origin (CORS) requests.

`allowed_origins = field(default_factory=(lambda: ['*']))` `class-attribute` `instance-attribute` ¶

Allowed origins for cross-origin (CORS) requests.
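
Because `FrontendArgs._customize_cli_kwargs` sets `type=json.loads` on the three `allowed_*` fields and deletes their `nargs`, each flag takes a single JSON-encoded list rather than several space-separated values. The plain-argparse sketch below reproduces only that parsing behavior; how the lists are then wired into the app's CORS handling is not shown.

```python
import argparse
import json

parser = argparse.ArgumentParser()
# Mirrors the customization: type=json.loads and no nargs.
parser.add_argument("--allowed-origins", type=json.loads, default=["*"])
parser.add_argument("--allowed-methods", type=json.loads, default=["*"])

args = parser.parse_args(["--allowed-origins", '["https://app.example.com"]'])
print(args.allowed_origins)  # ['https://app.example.com']
print(args.allowed_methods)  # ['*']  (non-string defaults are not passed to type)
```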

`api_key = None` `class-attribute` `instance-attribute` ¶

If provided, the server will require one of these keys to be presented in the header.
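
A minimal sketch of such a key check, assuming the key is sent as a bearer token in the `Authorization` header (as OpenAI client libraries do) and that header names are lowercase, as in ASGI. The helper `is_authorized` is hypothetical and is not vLLM's middleware.

```python
import hmac
from collections.abc import Mapping, Sequence
from typing import Optional


def is_authorized(headers: Mapping[str, str], api_keys: Optional[Sequence[str]]) -> bool:
    """Hypothetical check: accept the request if any configured key matches."""
    if not api_keys:
        return True  # no --api-key configured: authentication disabled
    scheme, _, token = headers.get("authorization", "").partition(" ")
    if scheme.lower() != "bearer" or not token:
        return False
    # compare_digest avoids leaking key contents through timing differences.
    return any(hmac.compare_digest(token.encode(), key.encode()) for key in api_keys)


print(is_authorized({"authorization": "Bearer k2"}, ["k1", "k2"]))  # True
print(is_authorized({}, ["k1"]))                                    # False
```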

`data_parallel_supervisor_port = 9256` `class-attribute` `instance-attribute` ¶

HTTP port for aggregated health endpoints in multi-port external LB mode.

`disable_access_log_for_endpoints = None` `class-attribute` `instance-attribute` ¶

Comma-separated list of endpoint paths to exclude from uvicorn access logs. This is useful to reduce log noise from high-frequency endpoints like health checks. Example: "/health,/metrics,/ping". When set, access logs for requests to these paths will be suppressed while keeping logs for other endpoints.
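
One way to achieve this kind of suppression is a `logging.Filter` on the `uvicorn.access` logger. This is a hypothetical sketch, not vLLM's implementation; it assumes uvicorn-style access records whose `args` tuple is `(client_addr, method, path_with_query, http_version, status_code)`, and it ignores the query string when matching.

```python
import logging
from typing import Iterable


class EndpointAccessLogFilter(logging.Filter):
    """Hypothetical filter that drops access-log records for given paths."""

    def __init__(self, paths: Iterable[str]) -> None:
        super().__init__()
        self.paths = {p.strip() for p in paths if p.strip()}

    def filter(self, record: logging.LogRecord) -> bool:
        args = record.args
        if isinstance(args, tuple) and len(args) >= 3:
            path = str(args[2]).split("?", 1)[0]  # assumed position of the path
            return path not in self.paths  # False suppresses the record
        return True


endpoint_filter = EndpointAccessLogFilter("/health,/metrics,/ping".split(","))
logging.getLogger("uvicorn.access").addFilter(endpoint_filter)
```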

`disable_fastapi_docs = False` `class-attribute` `instance-attribute` ¶

Disable FastAPI's OpenAPI schema, Swagger UI, and ReDoc endpoints.

`disable_uvicorn_access_log = False` `class-attribute` `instance-attribute` ¶

Disable uvicorn access log.

`dp_supervisor_probe_failure_threshold = 3` `class-attribute` `instance-attribute` ¶

Number of consecutive connection-error retries before a child health probe is declared failed in multi-port external LB mode.

`dp_supervisor_probe_interval_s = 5.0` `class-attribute` `instance-attribute` ¶

Seconds between aggregated health probes in multi-port external LB mode.

`dp_supervisor_probe_timeout_s = 5.0` `class-attribute` `instance-attribute` ¶

Seconds to wait between retries when a child health probe fails with a connection error in multi-port external LB mode.
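
Taken together, the two retry settings describe a loop: retry on connection errors, waiting between attempts, and give up after a fixed number of consecutive failures. The sketch below is a hypothetical reading of that behavior (it assumes only `ConnectionError` counts toward the threshold and that any returned result is final); it is not the supervisor's code. The interval setting would govern how often the whole routine runs.

```python
import time
from typing import Callable


def probe_with_retries(
    probe: Callable[[], bool],
    failure_threshold: int = 3,
    retry_wait_s: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Hypothetical sketch: True if healthy, False once the threshold is hit."""
    consecutive_errors = 0
    while True:
        try:
            return probe()  # any returned result is taken as the verdict
        except ConnectionError:
            consecutive_errors += 1
            if consecutive_errors >= failure_threshold:
                return False  # declared failed
            sleep(retry_wait_s)  # wait before the next retry
```

Injecting `sleep` keeps the routine testable without real delays.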

`enable_flash_late_interaction = True` `class-attribute` `instance-attribute` ¶

If set, run MaxSim scoring for pooling models on the GPU in the API server process. Can significantly improve late-interaction scoring performance.

`enable_offline_docs = False` `class-attribute` `instance-attribute` ¶

Enable offline FastAPI documentation for air-gapped environments. Uses vendored static assets bundled with vLLM.

`enable_request_id_headers = False` `class-attribute` `instance-attribute` ¶

If specified, the API server will add an X-Request-Id header to responses.

`enable_ssl_refresh = False` `class-attribute` `instance-attribute` ¶

Refresh the SSL context when the SSL certificate files change.

`h11_max_header_count = H11_MAX_HEADER_COUNT_DEFAULT` `class-attribute` `instance-attribute` ¶

Maximum number of HTTP headers allowed in a request for the h11 parser. Helps mitigate header abuse. Default: 256.

`h11_max_incomplete_event_size = H11_MAX_INCOMPLETE_EVENT_SIZE_DEFAULT` `class-attribute` `instance-attribute` ¶

Maximum size (bytes) of an incomplete HTTP event (header or body) for the h11 parser. Helps mitigate header abuse. Default: 4194304 (4 MB).

`host = None` `class-attribute` `instance-attribute` ¶

Host name.

`middleware = field(default_factory=(lambda: []))` `class-attribute` `instance-attribute` ¶

Additional ASGI middleware to apply to the app. We accept multiple --middleware arguments. The value should be an import path. If a function is provided, vLLM will add it to the server using @app.middleware('http'). If a class is provided, vLLM will add it to the server using app.add_middleware().
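
The function-versus-class distinction can be sketched with `importlib` and `inspect`. The helper `resolve_middleware` is hypothetical; it only imports a dotted path and reports which registration route (`app.add_middleware()` for classes, `@app.middleware('http')` for functions) would apply. It checks for a class first, because classes are also callable.

```python
import importlib
import inspect
from typing import Any


def resolve_middleware(import_path: str) -> tuple[str, Any]:
    """Hypothetical helper: import 'pkg.module.name' and classify it."""
    module_name, _, attr = import_path.rpartition(".")
    if not module_name:
        raise ValueError(f"expected a dotted import path, got {import_path!r}")
    obj = getattr(importlib.import_module(module_name), attr)
    if inspect.isclass(obj):
        return "class", obj  # would go through app.add_middleware()
    if callable(obj):
        return "function", obj  # would go through @app.middleware('http')
    raise TypeError(f"{import_path} is neither a class nor a callable")


# Stdlib objects stand in for real middleware here.
print(resolve_middleware("collections.OrderedDict")[0])  # class
print(resolve_middleware("json.dumps")[0])               # function
```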

`port = 8000` `class-attribute` `instance-attribute` ¶

Port number.

`root_path = None` `class-attribute` `instance-attribute` ¶

FastAPI root_path when the app is behind a path-based routing proxy.

`ssl_ca_certs = None` `class-attribute` `instance-attribute` ¶

The CA certificates file.

`ssl_cert_reqs = int(ssl.CERT_NONE)` `class-attribute` `instance-attribute` ¶

Whether a client certificate is required, as an integer ssl.CERT_* value (0 = CERT_NONE, 1 = CERT_OPTIONAL, 2 = CERT_REQUIRED; see the stdlib ssl module).

`ssl_certfile = None` `class-attribute` `instance-attribute` ¶

The file path to the SSL cert file.

`ssl_ciphers = None` `class-attribute` `instance-attribute` ¶

SSL cipher suites for HTTPS (TLS 1.2 and below only). Example: 'ECDHE-RSA-AES256-GCM-SHA384:ECDHE-RSA-CHACHA20-POLY1305'

`ssl_keyfile = None` `class-attribute` `instance-attribute` ¶

The file path to the SSL key file.
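
The SSL options above correspond naturally to stdlib `ssl.SSLContext` settings. The sketch below shows one plausible mapping; it is an assumption about how a server could wire them, not the exact code path vLLM or uvicorn uses, and `build_server_ssl_context` is a hypothetical helper.

```python
import ssl
from typing import Optional


def build_server_ssl_context(
    certfile: Optional[str] = None,
    keyfile: Optional[str] = None,
    ca_certs: Optional[str] = None,
    cert_reqs: int = int(ssl.CERT_NONE),
    ciphers: Optional[str] = None,
) -> ssl.SSLContext:
    """Hypothetical mapping of the SSL options onto a stdlib SSLContext."""
    ctx = ssl.SSLContext(ssl.PROTOCOL_TLS_SERVER)
    if certfile:
        ctx.load_cert_chain(certfile, keyfile)
    # 0 = CERT_NONE, 1 = CERT_OPTIONAL, 2 = CERT_REQUIRED
    ctx.verify_mode = ssl.VerifyMode(cert_reqs)
    if ca_certs:
        ctx.load_verify_locations(ca_certs)  # CAs used to verify client certs
    if ciphers:
        # Only affects TLS 1.2 and below; TLS 1.3 suites are not set here.
        ctx.set_ciphers(ciphers)
    return ctx


ctx = build_server_ssl_context(
    cert_reqs=int(ssl.CERT_REQUIRED),
    ciphers="ECDHE-RSA-AES256-GCM-SHA384:ECDHE-RSA-CHACHA20-POLY1305",
)
```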

`uds = None` `class-attribute` `instance-attribute` ¶

Unix domain socket path. If set, host and port arguments are ignored.

`uvicorn_log_level = 'info'` `class-attribute` `instance-attribute` ¶

Log level for uvicorn.

`make_arg_parser(parser)` ¶

Create the CLI argument parser used by the OpenAI API server.

We rely on the helper methods of FrontendArgs and AsyncEngineArgs to register all arguments instead of manually enumerating them here. This avoids code duplication and keeps the argument definitions in one place.
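
The top-level arguments that make_arg_parser registers by hand can be tried out with plain `argparse`, used here only as a stand-in for vLLM's `FlexibleArgumentParser`. The sketch omits the FrontendArgs and AsyncEngineArgs groups and does not reproduce the YAML handling of `--config`; the model name is a placeholder.

```python
import argparse

parser = argparse.ArgumentParser(prog="serve-sketch")
# Optional positional: the model may instead come from --config.
parser.add_argument("model_tag", type=str, nargs="?")
parser.add_argument("--headless", action="store_true", default=False)
# A long flag plus a multi-character single-dash alias.
parser.add_argument("--api-server-count", "-asc", type=int, default=None)
parser.add_argument("--config")

args = parser.parse_args(["my-org/my-model", "-asc", "4"])
print(args.model_tag, args.api_server_count)  # my-org/my-model 4
```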

Source code in vllm/entrypoints/openai/cli_args.py

```python
def make_arg_parser(parser: FlexibleArgumentParser) -> FlexibleArgumentParser:
    """Create the CLI argument parser used by the OpenAI API server.

    We rely on the helper methods of `FrontendArgs` and `AsyncEngineArgs` to
    register all arguments instead of manually enumerating them here. This
    avoids code duplication and keeps the argument definitions in one place.
    """
    parser.add_argument(
        "model_tag",
        type=str,
        nargs="?",
        help="The model tag to serve (optional if specified in config)",
    )
    parser.add_argument(
        "--headless",
        action="store_true",
        default=False,
        help="Run in headless mode. See multi-node data parallel "
        "documentation for more details.",
    )
    parser.add_argument(
        "--api-server-count",
        "-asc",
        type=int,
        default=None,
        help="How many API server processes to run. "
        "Defaults to data_parallel_size if not specified.",
    )
    parser.add_argument(
        "--config",
        help="Read CLI options from a config file. "
        "Must be a YAML with the following options: "
        "https://docs.vllm.ai/en/latest/configuration/serve_args.html",
    )
    parser.add_argument(
        "--grpc",
        action="store_true",
        default=False,
        help="Launch a gRPC server instead of the HTTP OpenAI-compatible "
        "server. Requires: pip install vllm[grpc].",
    )
    parser = FrontendArgs.add_cli_args(parser)
    parser = AsyncEngineArgs.add_cli_args(parser)

    return parser
```
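The docstring's point, that argument definitions live in one place and the parser is derived from them, can be illustrated with a dataclass whose fields generate `--kebab-case` flags. `DemoFrontendArgs` and `make_demo_parser` are hypothetical and far simpler than vLLM's real `FrontendArgs.add_cli_args`, which this sketch does not reproduce; only the optional positional `model_tag` (`nargs="?"`) mirrors the source shown for `make_arg_parser`.

```python
import argparse
from dataclasses import MISSING, dataclass, field, fields


@dataclass
class DemoFrontendArgs:
    """Hypothetical, heavily simplified frontend-args dataclass."""

    port: int = 8000
    uvicorn_log_level: str = "info"
    enable_auto_tool_choice: bool = False
    middleware: list[str] = field(default_factory=list)

    @classmethod
    def add_cli_args(cls, parser: argparse.ArgumentParser) -> argparse.ArgumentParser:
        # One --kebab-case flag per field: the dataclass is the single source of truth.
        for f in fields(cls):
            flag = "--" + f.name.replace("_", "-")
            if f.default_factory is not MISSING:
                default = f.default_factory()
            else:
                default = f.default
            if f.type is bool:
                parser.add_argument(flag, action="store_true", default=default)
            elif f.type == list[str]:
                # Repeatable flag, e.g. --middleware a.B --middleware c.D
                parser.add_argument(flag, action="append", default=default)
            else:
                parser.add_argument(flag, type=f.type, default=default)
        return parser


def make_demo_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="demo-serve")
    # Optional positional, as with model_tag in make_arg_parser.
    parser.add_argument("model_tag", type=str, nargs="?")
    return DemoFrontendArgs.add_cli_args(parser)


args = make_demo_parser().parse_args(
    ["my-model", "--port", "9000", "--middleware", "a.B", "--middleware", "c.D"]
)
```

Adding a field to the dataclass is then enough to expose a new flag, which is the duplication the real helper methods are meant to avoid.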

`validate_parsed_serve_args(args)` ¶

Run quick sanity checks on the parsed `serve` arguments, raising an error before the model is loaded.

Source code in vllm/entrypoints/openai/cli_args.py

```python
def validate_parsed_serve_args(args: argparse.Namespace):
    """Quick checks for model serve args that raise prior to loading."""
    if hasattr(args, "subparser") and args.subparser != "serve":
        return

    # Ensure that the chat template is valid; raises if it likely isn't
    validate_chat_template(args.chat_template)

    # Enable auto tool needs a tool call parser to be valid
    if args.enable_auto_tool_choice and not args.tool_call_parser:
        raise TypeError("Error: --enable-auto-tool-choice requires --tool-call-parser")
    if args.enable_log_outputs and not args.enable_log_requests:
        raise TypeError("Error: --enable-log-outputs requires --enable-log-requests")

    if args.data_parallel_multi_port_external_lb:
        from vllm.entrypoints.openai.dp_supervisor import (
            validate_multi_port_external_lb_args,
        )

        validate_multi_port_external_lb_args(args)
```
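The two flag-dependency checks in `validate_parsed_serve_args` follow an "option A requires option B" pattern. A hypothetical table-driven variant, using only `argparse.Namespace` and no vLLM imports, might look like this; `REQUIRED_COMPANIONS`, `flag`, and `check_flag_dependencies` are illustrative names, and the error type and message format are copied from the source rather than taken from any other vLLM API.

```python
import argparse

# Hypothetical table: enabling the key option requires the value option to be set.
REQUIRED_COMPANIONS = {
    "enable_auto_tool_choice": "tool_call_parser",
    "enable_log_outputs": "enable_log_requests",
}


def flag(dest: str) -> str:
    """Convert an argparse dest name back into its --kebab-case flag."""
    return "--" + dest.replace("_", "-")


def check_flag_dependencies(args: argparse.Namespace) -> None:
    """Raise before any expensive work if a flag is missing its prerequisite."""
    for option, companion in REQUIRED_COMPANIONS.items():
        if getattr(args, option, False) and not getattr(args, companion, None):
            raise TypeError(f"Error: {flag(option)} requires {flag(companion)}")


ok = argparse.Namespace(
    enable_auto_tool_choice=True,
    tool_call_parser="my_parser",  # hypothetical parser name
    enable_log_outputs=False,
    enable_log_requests=False,
)
check_flag_dependencies(ok)  # passes silently

bad = argparse.Namespace(
    enable_auto_tool_choice=True,
    tool_call_parser=None,
    enable_log_outputs=False,
    enable_log_requests=False,
)
error_message = ""
try:
    check_flag_dependencies(bad)
except TypeError as exc:
    error_message = str(exc)
```

Keeping the pairs in data makes a new dependency a one-line change; the source spells each check out explicitly, which is arguably easier to read when there are only two rules.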
