Hosted inference rollout is invite-first. Abuse-resistant keys, egress controls, and model allowlists ship with enterprise workspaces.

Models live today · Confidential path in development

Confidential model compute

The audit-ready AI layer your CISO, security team, and CFO keep asking for—not another “trust us” GPU rental.

Today we host frontier and open-weight models on standard, high-throughput infrastructure—OpenAI-compatible APIs, catalog pricing, and per-organizationId usage—without requiring confidential VMs. In development, we are building the full stack: AMD SEV-SNP and Intel TDX, measured boot and dm-verity host roots, Kata-class isolation, open attestation flows with Sigstore Rekor transparency logs, and NVIDIA GPU attestation—so every cold start can ship receipts your risk team can verify before a token leaves the enclave.

See attestation stack · Documentation
TEE-backed VMs · Attestation logs API · GPU device quotes · Per-org usage
GET /v1/attestation/evidence — illustrative
GET https://inf.vocifer.com/v1/attestation/evidence
200 OK
{
  "tee": "SEV-SNP | TDX",
  "guest_policy_hash": "sha256:…",
  "host_dm_verity": "sha512:…",
  "gpu": { "device": "H100", "quote": "…" },
  "rekor": { "log_index": "…", "uuid": "…" },
  "issued_at": "2026-05-09T12:00:00Z"
}

Illustrative attestation envelope—exact schema ships with preview access; production evidence chains to your verifier policies.

Our moat

A full-stack attestation chain—not a marketing checkbox

Most inference clouds stop at “we use encryption.” The confidential track is designed so each instance proves what it is before it serves: CPU TEE quote, firmware and host integrity, guest OS baseline, Kata-style isolation boundary, NVIDIA GPU attestation, and attestation artifacts anchored in a Sigstore Rekor transparency log for tamper-evident audit. If measurements diverge from your allowlist—new binary, unexpected driver, tampered init—the control plane recycles the node instead of silently continuing. Again: standard model hosting is available today without this stack; this section describes the roadmap moat. The HTTPS APIs for fetching evidence and log entries will evolve during the preview (names and schema subject to design-partner feedback).
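
A minimal verifier sketch against the illustrative envelope above; the endpoint, field names, and allowlist format are placeholders until the preview schema ships:

import requests

# Measurements your risk team approved; shapes mirror the
# illustrative envelope, not a final schema.
ALLOWLIST = {
    "guest_policy_hash": {"sha256:…"},
    "host_dm_verity": {"sha512:…"},
}

def evidence_is_trusted(base_url: str, api_key: str) -> bool:
    resp = requests.get(
        f"{base_url}/v1/attestation/evidence",
        headers={"Authorization": f"Bearer {api_key}"},
        timeout=10,
    )
    resp.raise_for_status()
    evidence = resp.json()
    # Any divergence from the allowlist means the node gets
    # recycled, not silently served.
    for field, allowed in ALLOWLIST.items():
        if evidence.get(field) not in allowed:
            return False
    return evidence.get("tee") in {"SEV-SNP", "TDX"}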

  1. Layer 1

    Hardware TEE — AMD SEV-SNP · Intel TDX

    Confidential VMs anchor trust in the CPU silicon: encrypted guest memory, remote attestation quotes, and reduced exposure to the virtualization stack in the vendor-defined threat model.

  2. Layer 2

    UEFI & measured boot

    Firmware and early boot are part of the measurement chain so the machine attests a known anchor before your policy even reaches userland.

  3. Layer 3

    Hypervisor, host OS & image integrity

    Hypervisor and host images ship with locked-down roots of trust—think dm-verity and related read-only, hash-chained roots plus signed update channels—so binaries and critical config cannot quietly diverge from what you approved.

  4. Layer 4

    Guest OS verification

    Kernel and userspace baselines are pinned; unexpected modules, init changes, or compromised drivers fail verification and trigger a controlled reprovision instead of silent service.

  5. Layer 5

    Kata Containers & workload boundary

    Inference runtimes sit behind a Kata-style lightweight-VM boundary—stronger isolation than namespaces alone—so each customer slice keeps a hardware-backed fence around model weights and KV state.

  6. Layer 6

    Transparency log — Sigstore Rekor

    Attestation artifacts and release events are designed to land in an append-only, publicly verifiable transparency log—Sigstore Rekor—so your security and finance stakeholders can trace what was proven, when, and that the log was not rewritten after the fact (see the lookup sketch after this list).

  7. Layer 7

    GPU attestation

    NVIDIA confidential-compute GPUs expose device quotes that pair with the CPU TEE evidence; NVML / attestation SDK flows validate the accelerator’s configuration before vLLM-class workers accept traffic.
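
For Layer 6, Rekor’s public REST API makes log entries fetchable by index. A minimal lookup sketch (the log_index would come from the attestation envelope; full verification also checks the inclusion proof and signed entry timestamp):

import requests

REKOR = "https://rekor.sigstore.dev"

def fetch_rekor_entry(log_index: int) -> dict:
    # GET /api/v1/log/entries?logIndex=N returns a map of
    # entry UUID -> {body, integratedTime, logIndex, ...}.
    resp = requests.get(
        f"{REKOR}/api/v1/log/entries",
        params={"logIndex": log_index},
        timeout=10,
    )
    resp.raise_for_status()
    return resp.json()

for uuid, record in fetch_rekor_entry(123456).items():
    print(uuid, record["integratedTime"])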

Design partners & regulated teams

Request early access to the confidential attestation path. Production model hosting without this stack is available separately—use Request API access in the main CTAs. We prioritize design partners based on waitlist notes for security review and joint architecture work.

ATTESTED

Integrity you can query

Evidence flows from silicon to container: hardware TEE quotes, firmware measurements, dm-verity-backed host roots, guest policy, and accelerator attestation converge into signed reports you can poll over HTTPS.

FAST

Sub-second orchestration paths

Optimized scheduler + hardware-aware kernels keep prefill bounded and decode streaming smooth even when aggregate load spikes.

SIMPLE

Familiar HTTPS surface area

One auth scheme, idiomatic headers, deterministic error surfaces, and OpenAI-shaped payloads developers already know.

RELIABLE

Production incident muscle

Runbooks exercised weekly, granular health checks, graceful degradation tiers, and clear status semantics for routers.

LOW-COST

Token economics you can spreadsheet

List prices are enumerable from the catalog with no surprises—ideal when your finance stack reconciles usage against published SKUs.

Model library

Hosted SKUs spanning chat, embeddings, rerankers, and voice

Showcase SKUs preview the economics you can expose publicly. Availability follows your allowlist—we keep reserved pools for latency-sensitive fleets and carve noisy research traffic into separate concurrency lanes.

Open-weight · verified checkpoint

meta-llama/llama-3.3-70b-instruct

General chat & agents

Instruction-tuned Llama family

128k context-class

Input $0.10 · Output $0.32 per 1M tokens

Open-weight · verified checkpoint

qwen/qwen3.5-122b-a10b

Code & long context

Qwen3.5 MoE flagship

Limits in live catalog

Input $0.26 · Output $2.08 per 1M tokens

Open-weight · verified checkpoint

mistralai/mistral-large

High reasoning budgets

Mistral frontier instruct

Limits in live catalog

Input $2.00 · Output $6.00 per 1M tokens (cache read $0.20)

Open-weight · verified checkpoint

deepseek/deepseek-v3.2

MoE · tool-friendly

DeepSeek V3.2

Limits in live catalog

Input $0.252 · Output $0.378 per 1M tokens (cache read $0.0252)

Open-weight · verified checkpoint

google/gemma-3-27b-it

Vision + agentic workloads

Gemma 3 multimodal instruct

Limits in live catalog

Input $0.08 · Output $0.16 per 1M tokens

Open-weight · verified checkpoint

Custom fine-tune

Private adapters & SLAs

Your weights, our stack

Provisioned GPUs

Private price list

Card prices are illustrative list rates (per-token USD × 10⁶) for each model id. Your live endpoint should still publish the same fields via GET /v1/models so automation never drifts from marketing copy.
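
A sketch of that reconciliation from the client side. The exact field names ("pricing" and its keys) are assumptions; the copy only promises string USD-per-token values via GET /v1/models:

import os
import requests

resp = requests.get(
    "https://inf.vocifer.com/v1/models",
    headers={"Authorization": f"Bearer {os.environ['VOCIFER_API_KEY']}"},
    timeout=10,
)
resp.raise_for_status()

# OpenAI-shaped list response: {"object": "list", "data": [...]}.
for model in resp.json()["data"]:
    # Convert per-token USD strings into the per-1M-token card rates.
    for kind, per_token in model.get("pricing", {}).items():
        print(model["id"], kind, float(per_token) * 1_000_000)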

Platform

Inference tailored to routers, SaaS dashboards, and internal agents alike

Borrowing the playbook from specialist clouds like DeepInfra and Inceptron, we unify hardware procurement, KV-cache-aware placement, telemetry, and customer-isolated metering so your product teams iterate on prompts—not rack logistics.

Continuous batching & autoscaling

Scheduler coalesces compatible requests, scales GPU worker pools on SLI metrics, and pins hot models into flash-friendly memory tiers.

Usage graph & export

Per-organizationId usage streams into your finance stack (Snowflake, BigQuery, Metronome) with line-item attribution for chargeback.

Latency SLO monitors

Track TTFT, inter-token gaps, and saturation per fleet. Alert on tail shifts before customer-facing SLAs are breached.

Security hardening

TLS everywhere, optional mTLS, tenant-scoped API keys—and on the confidential track, TEE-backed nodes with attested boot and GPU evidence before traffic lands.

How you call Vocifer

Same paths you already automate: models catalog + chat completions

Production inference is served from a dedicated host such as inf.vocifer.com under the /v1 prefix. List SKUs, then POST chat completions with your API key—optionally including X-Vocifer-Organization-Id so usage stays tied to organizationId.

  • `GET https://inf.vocifer.com/v1/models` — catalog ids, context bounds, string pricing fields per modality.
  • `POST https://inf.vocifer.com/v1/chat/completions` — OpenAI-shaped messages, streaming SSE optional.
  • Bearer auth on every request; regional hostnames may differ per contract.
  • Drop-in with OpenAI SDKs by setting `base_url` to `https://inf.vocifer.com/v1`.
Documentation (curl, Python, Node, Go)

List models — curl

curl -sS "https://inf.vocifer.com/v1/models" \
  -H "Authorization: Bearer $VOCIFER_API_KEY"

Chat completion — curl

curl -sS "https://inf.vocifer.com/v1/chat/completions" \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $VOCIFER_API_KEY" \
  -d '{
    "model": "meta-llama/llama-3.3-70b-instruct",
    "messages": [
      {
        "role": "user",
        "content": "Summarize this incident timeline for execs."
      }
    ],
    "stream": true
  }'

Swap the hostname for the one issued to your workspace. Streaming follows the same SSE framing OpenAI-compatible clients expect (Vercel AI SDK, LangChain, LiteLLM, etc.).
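
The same call through the OpenAI Python SDK; only base_url changes. A sketch using the standard openai client (v1+):

import os
from openai import OpenAI

client = OpenAI(
    base_url="https://inf.vocifer.com/v1",
    api_key=os.environ["VOCIFER_API_KEY"],
)

stream = client.chat.completions.create(
    model="meta-llama/llama-3.3-70b-instruct",
    messages=[{"role": "user", "content": "Summarize this incident timeline for execs."}],
    stream=True,
)

# Streamed chunks carry content deltas, same framing as OpenAI.
for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)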

Pricing

Transparent token pricing per 1M tokens

We publish list-style token rates: input, output, and cache read (when listed) per 1M tokens, expressed as per-token USD × 10⁶ for easy spreadsheet math. Figures below are representative SKUs; your contract and live GET /v1/models response are authoritative.

MiniMax M2.7

minimax/minimax-m2.7

Input $0.299 · Output $1.20 per 1M tokens

  • Catalog id: minimax/minimax-m2.7 (no input_cache_read in list)
  • High-capability general reasoning
  • OpenAI-compatible chat API

MiniMax M2.5

minimax/minimax-m2.5

Input $0.15 · Output $1.15 per 1M tokens

  • Cache read: $0.03 per 1M tokens
  • Balanced quality/latency profile
  • Usage export by organizationId

DeepSeek V3.2

deepseek/deepseek-v3.2

Input $0.252 · Output $0.378 per 1M tokens

  • Cache read: $0.0252 per 1M tokens
  • Catalog id: deepseek/deepseek-v3.2
  • Stable throughput under batch load

Qwen 3.5 122B A10B

qwen/qwen3.5-122b-a10b

Input $0.26 · Output $2.08 per 1M tokens

  • Catalog id: qwen/qwen3.5-122b-a10b (no cache-read price in list)
  • Strong multilingual + tool usage
  • Good fit for agent pipelines

Canonical pricing is always served via GET /v1/models with USD string values per token unit so your routers and finance exports stay aligned. Refresh the catalog as SKUs and list prices evolve.
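
The “× 10⁶” convention keeps the math linear per token class. A worked example using the DeepSeek V3.2 rates above:

# USD per 1M tokens, from the DeepSeek V3.2 card above.
INPUT_RATE, OUTPUT_RATE, CACHE_READ_RATE = 0.252, 0.378, 0.0252

input_tokens = 3_000_000     # fresh prompt tokens
cached_tokens = 1_000_000    # prompt tokens served from cache
output_tokens = 500_000

cost = (
    input_tokens / 1e6 * INPUT_RATE
    + cached_tokens / 1e6 * CACHE_READ_RATE
    + output_tokens / 1e6 * OUTPUT_RATE
)
print(f"${cost:.4f}")  # $0.9702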

Engineering FAQ

Answers for platform teams routing production traffic and wiring OpenAI-compatible clients to Vocifer-hosted inference.

Confidential inference, optional standard tier

Metered APIs on top, hardware attestation underneath—apply for the confidential preview or ship on the same OpenAI-compatible surface today.

Talk to solutions