ATTESTED
Integrity you can query
Evidence flows from silicon to container: hardware TEE quotes, firmware measurements, dm-verity-backed host roots, guest policy, and accelerator attestation converge into signed reports you can poll over HTTPS.
Hosted inference rollout is invite-first. Abuse-resistant keys, egress controls, and model allowlists ship with enterprise workspaces.
Models live today · Confidential path in development
Today we host frontier and open-weight models on standard, high-throughput infrastructure—OpenAI-compatible APIs, catalog pricing, and per-organizationId usage—without requiring confidential VMs. In development we are building the full stack: AMD SEV-SNP and Intel TDX, measured boot and dm-verity host roots, Kata-class isolation, open attestation flows with Sigstore Rekor transparency logs, and NVIDIA GPU attestation—so every cold start can ship receipts your risk team can verify before a token leaves the enclave.
GET https://inf.vocifer.com/v1/attestation/evidence → 200 OK
{
  "tee": "SEV-SNP | TDX",
  "guest_policy_hash": "sha256:…",
  "host_dm_verity": "sha512:…",
  "gpu": { "device": "H100", "quote": "…" },
  "rekor": { "log_index": "…", "uuid": "…" },
  "issued_at": "2026-05-09T12:00:00Z" 
}
Illustrative attestation envelope—exact schema ships with preview access; production evidence chains to your verifier policies.
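A minimal polling sketch against the illustrative endpoint above, assuming jq is available; the field names mirror the sample envelope and will track the final schema:

# Poll the evidence report and surface the fields worth pinning.
curl -sS "https://inf.vocifer.com/v1/attestation/evidence" \
  -H "Authorization: Bearer $VOCIFER_API_KEY" \
  | jq '{tee, guest_policy_hash, gpu: .gpu.device, rekor: .rekor.log_index}'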
Our moat
Most inference clouds stop at “we use encryption.” The confidential track is designed so each instance proves what it is before it serves: CPU TEE quote, firmware and host integrity, guest OS baseline, Kata-style isolation boundary, NVIDIA GPU attestation, and attestation artifacts anchored in a Sigstore Rekor transparency log for tamper-evident audit. If measurements diverge from your allowlist—new binary, unexpected driver, tampered init—the control plane recycles the node instead of silently continuing. To be clear: standard model hosting is available today without this stack; this section describes the roadmap moat. The HTTPS APIs for fetching evidence and log entries will evolve during the preview (names and schema are subject to design-partner feedback).
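As a sketch of that fail-closed posture: the endpoint is the illustrative one from the envelope above, and allowlist.txt is a hypothetical local file holding one approved sha256:… digest per line.

# Refuse traffic when the attested guest policy is not on the local allowlist.
GUEST_POLICY=$(curl -sS "https://inf.vocifer.com/v1/attestation/evidence" \
  -H "Authorization: Bearer $VOCIFER_API_KEY" | jq -r '.guest_policy_hash')

if ! grep -qxF "$GUEST_POLICY" allowlist.txt; then
  echo "guest policy $GUEST_POLICY not in allowlist; refusing traffic" >&2
  exit 1
fi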
Layer 1
Confidential VMs anchor trust in the CPU silicon: encrypted guest memory, remote attestation quotes, and reduced exposure to the virtualization stack in the vendor-defined threat model.
Layer 2
Firmware and early boot are part of the measurement chain so the machine attests a known anchor before your policy even reaches userland.
Layer 3
Hypervisor and host images ship with locked-down roots of trust—think dm-verity and related read-only, hash-chained roots plus signed update channels—so binaries and critical config cannot quietly diverge from what you approved.
Layer 4
Kernel and userspace baselines are pinned; unexpected modules, init changes, or compromised drivers fail verification and trigger a controlled reprovision instead of silent service.
Layer 5
Inference runtimes sit behind a Kata-style lightweight-VM boundary—stronger isolation than namespaces alone—so each customer slice keeps a hardware-backed fence around model weights and KV state.
Layer 6
Attestation artifacts and release events are designed to land in an append-only, publicly verifiable transparency log—Sigstore Rekor—so your security and finance stakeholders can trace what was proven, when, and that the log was not rewritten after the fact; a lookup sketch follows this list.
Layer 7
NVIDIA confidential-compute GPUs expose device quotes that pair with the CPU TEE evidence; NVML / attestation SDK flows validate the accelerator’s configuration before vLLM-class workers accept traffic.
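The Layer 6 anchoring can be spot-checked with public Sigstore tooling today. A sketch, assuming entries land in the public Rekor instance (a private log deployment would swap the hostname) and reusing the rekor.log_index field from the illustrative envelope:

# Retrieve the transparency-log entry referenced by the evidence report.
# GET /api/v1/log/entries?logIndex=N is part of Rekor's public API.
LOG_INDEX=$(curl -sS "https://inf.vocifer.com/v1/attestation/evidence" \
  -H "Authorization: Bearer $VOCIFER_API_KEY" | jq -r '.rekor.log_index')

curl -sS "https://rekor.sigstore.dev/api/v1/log/entries?logIndex=${LOG_INDEX}" | jq '.'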
Design partners & regulated teams
Request early access to the confidential attestation path. Production model hosting without this stack is available separately—use Request API access on the main CTAs. We prioritize design partners, drawn from waitlist notes, for security review and joint architecture work.
ATTESTED
Signed evidence from silicon to container: hardware TEE quotes, measured boot, host integrity, and GPU attestation, all pollable over HTTPS.
FAST
Optimized scheduler + hardware-aware kernels keep prefill bounded and decode streaming smooth even when aggregate load spikes.
SIMPLE
One auth scheme, idiomatic headers, deterministic error surfaces, and OpenAI-shaped payloads developers already know.
RELIABLE
Runbooks exercised weekly, granular health checks, graceful degradation tiers, and clear status semantics for routers.
LOW-COST
List prices are enumerable from the catalog with no surprises—ideal when your finance stack reconciles usage against published SKUs.
Model library
Showcase SKUs preview the economics you expose publicly. Availability follows your allowlist—we keep reserved pools for latency-sensitive fleets and carve noisy research traffic into separate concurrency lanes.
Open-weight · verified checkpoint
General chat & agents
Instruction-tuned Llama family
128k-class context
Input $0.10 · Output $0.32 per 1M tokens
Open-weight · verified checkpoint
Code & long context
Qwen3.5 MoE flagship
Limits in live catalog
Input $0.26 · Output $2.08 per 1M tokens
Open-weight · verified checkpoint
High reasoning budgets
Mistral frontier instruct
Limits in live catalog
Input $2.00 · Output $6.00 per 1M tokens (cache read $0.20)
Open-weight · verified checkpoint
MoE · tool-friendly
DeepSeek V3.2
Limits in live catalog
Input $0.252 · Output $0.378 per 1M tokens (cache read $0.0252)
Open-weight · verified checkpoint
Vision + agentic workloads
Gemma 3 multimodal instruct
Limits in live catalog
Input $0.08 · Output $0.16 per 1M tokens
Bring-your-own weights · private checkpoint
Private adapters & SLAs
Your weights, our stack
Provisioned GPUs
Private price list
Card prices are illustrative list rates (per-token USD × 10⁶) for each model id. Your live endpoint should still publish the same fields via GET /v1/models so automation never drifts from marketing copy.
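A sketch of that reconciliation, assuming an OpenAI-shaped list response and per-token USD strings under a pricing object (those field names are hypothetical until the live schema ships):

# Convert per-token USD strings into per-1M-token list rates for comparison
# against the card copy above.
curl -sS "https://inf.vocifer.com/v1/models" \
  -H "Authorization: Bearer $VOCIFER_API_KEY" \
  | jq -r '.data[] | "\(.id): input $\((.pricing.input | tonumber) * 1e6) · output $\((.pricing.output | tonumber) * 1e6) per 1M tokens"'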
Platform
Borrowing the playbook from specialist clouds like DeepInfra and Inceptron, we unify hardware procurement, KV-cache-aware placement, telemetry, and customer-isolated metering so your product teams iterate on prompts—not rack logistics.
Scheduler coalesces compatible requests, scales GPU worker pools on SLI metrics, and pins hot models into flash-friendly memory tiers.
Per-organizationId usage streams into your finance stack (Snowflake, BigQuery, Metronome) with line-item attribution for chargeback.
Track TTFT, inter-token gaps, and saturation per fleet. Alert on tail shifts before customer-facing SLAs breach.
TLS everywhere, optional mTLS, tenant-scoped API keys—and on the confidential track, TEE-backed nodes with attested boot and GPU evidence before traffic lands.
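Where optional mTLS is enabled for a workspace, the client side stays plain curl. A sketch, with placeholder certificate paths:

# Present a workspace client certificate alongside the bearer token.
curl -sS "https://inf.vocifer.com/v1/models" \
  --cert client.pem --key client-key.pem \
  -H "Authorization: Bearer $VOCIFER_API_KEY"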
How you call Vocifer
Production inference is served from a dedicated host such as inf.vocifer.com under the /v1 prefix. List SKUs, then POST chat completions with your API key—optionally including the X-Vocifer-Organization-Id header so usage stays tied to your organizationId.
List models — curl
curl -sS "https://inf.vocifer.com/v1/models" \ -H "Authorization: Bearer $VOCIFER_API_KEY"
Chat completion — curl
curl -sS "https://inf.vocifer.com/v1/chat/completions" \
-H "Content-Type: application/json" \
-H "Authorization: Bearer $VOCIFER_API_KEY" \
-d '{
"model": "meta-llama/llama-3.3-70b-instruct",
"messages": [
{
"role": "user",
"content": "Summarize this incident timeline for execs."
}
],
"stream": true
}'

Swap the hostname for the one issued to your workspace. Streaming follows the same SSE framing OpenAI-compatible clients expect (Vercel AI SDK, LangChain, LiteLLM, etc.).
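To keep usage tied to a specific organization, add the header mentioned above; org_example123 is a placeholder organizationId:

# Same call, with per-organization attribution for metering.
curl -sS "https://inf.vocifer.com/v1/chat/completions" \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $VOCIFER_API_KEY" \
  -H "X-Vocifer-Organization-Id: org_example123" \
  -d '{"model": "meta-llama/llama-3.3-70b-instruct", "messages": [{"role": "user", "content": "ping"}]}'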
Pricing
We publish list-style token rates: input, output, and cache read (when listed) per 1M tokens, expressed as per-token USD × 10⁶ for easy spreadsheet math. Figures below are representative SKUs; your contract and live GET /v1/models response are authoritative.
minimax/minimax-m2.7
Input $0.299 · Output $1.20 per 1M tokens
minimax/minimax-m2.5
Input $0.15 · Output $1.15 per 1M tokens
deepseek/deepseek-v3.2
Input $0.252 · Output $0.378 per 1M tokens
qwen/qwen3.5-122b-a10b
Input $0.26 · Output $2.08 per 1M tokens
Canonical pricing is always served via GET /v1/models with USD string values per token unit so your routers and finance exports stay aligned. Refresh the catalog as SKUs and list prices evolve.
Answers for platform teams routing production traffic and wiring OpenAI-compatible clients to Vocifer-hosted inference.
Confidential inference in preview, standard tier live today
Metered APIs on top, hardware attestation underneath—apply for the confidential preview or ship on the same OpenAI-compatible surface today.