Headline numbers
Measured on Intel Xeon Platinum (AVX-512F), --release build:
| Setup | Latency | Throughput |
|---|---|---|
| CPU fast_128, single thread | 14 ms/gate | ~70 gates/s |
| CPU standard_128, single thread | 32 ms/gate | ~30 gates/s |
| CPU conservative_128, single thread | 48 ms/gate | ~20 gates/s |
| GPU batch-128 (BYO GPU, GTX 1660 Ti) | 5.2 ms/gate | ~190 gates/s |
| GPU batch-128 (BYO GPU, RTX 4090, projected) | ~1.5 ms/gate | ~660 gates/s |
| GPU batch-128 (BYO GPU, H100, projected) | ~0.5 ms/gate | ~2000 gates/s |
api.wavis.xyz runs CPU-only at the latencies above.
GPU figures are for local SDK execution on customer hardware (BYO GPU);
GPU-accelerated cloud compute is on the Enterprise tier.
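As a quick sanity check, the two columns in the table above are consistent: throughput is just the inverse of per-gate latency. A small sketch:

```python
# Throughput is the inverse of per-gate latency; the published columns
# should agree to within rounding.

def gates_per_second(latency_ms: float) -> float:
    """Convert a per-gate latency in milliseconds to gates/s."""
    return 1000.0 / latency_ms

# (latency_ms, published throughput) pairs from the table above
published = [(14, 70), (32, 30), (48, 20), (5.2, 190)]

for latency_ms, claimed in published:
    derived = gates_per_second(latency_ms)
    # allow ~10% slack for rounding in the published figures
    assert abs(derived - claimed) / claimed < 0.10
```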
Where the time goes
A single gate bootstrap does four things:

- External product (CMux) with the bootstrapping key — ~40% of time.
- Blind rotation of the test polynomial — ~35%.
- Sample extraction — ~5%.
- Key switching — ~20%.
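Applied to the standard_128 preset (32 ms/gate), that breakdown works out to rough absolute phase times, a back-of-envelope sketch:

```python
# Rough absolute per-phase times for one bootstrapped gate, derived from
# the percentage breakdown above. Assumes the standard_128 preset (32 ms).
GATE_MS = 32.0

phase_share = {
    "external_product":  0.40,  # CMux with the bootstrapping key
    "blind_rotation":    0.35,  # rotating the test polynomial
    "sample_extraction": 0.05,
    "key_switching":     0.20,
}

phase_ms = {name: share * GATE_MS for name, share in phase_share.items()}
assert abs(sum(phase_share.values()) - 1.0) < 1e-9  # shares cover the whole gate

for name, ms in phase_ms.items():
    print(f"{name:>17}: {ms:4.1f} ms")
```

The external product alone is ~12.8 ms, which is why batching (below) focuses on amortizing the bootstrapping-key reads it performs.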
Choosing a preset
WAVIS exposes three named TFHE parameter presets. They differ in noise budget, not security — all are 128-bit secure.

| Preset | TLWE n | Latency | Noise budget | Best for |
|---|---|---|---|---|
| fast_128 | 636 | 14 ms | Standard | Demos, dev, <1000-gate circuits |
| standard_128 | 1024 | 32 ms | Higher | Default for production |
| conservative_128 | 1024 | 48 ms | Highest | Long-running circuits, max safety margin |
Start with fast_128. If your circuit is more than ~1000
gates deep, switch to standard_128. Use conservative_128 only for
high-stakes one-off computations where the extra 16 ms is worth the safety.
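That guidance can be condensed into a small chooser. The preset names are real; the ~1000-gate threshold comes from the text, and the high_stakes flag is a hypothetical knob for this sketch:

```python
# Minimal preset chooser following the guidance above. The preset names
# are WAVIS's own; "high_stakes" is a hypothetical flag for this sketch.

def choose_preset(circuit_depth_gates: int, high_stakes: bool = False) -> str:
    if high_stakes:
        return "conservative_128"   # maximum noise margin, 48 ms/gate
    if circuit_depth_gates > 1000:
        return "standard_128"       # production default, 32 ms/gate
    return "fast_128"               # demos and shallow circuits, 14 ms/gate

assert choose_preset(200) == "fast_128"
assert choose_preset(5000) == "standard_128"
assert choose_preset(5000, high_stakes=True) == "conservative_128"
```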
Batch evaluation (huge speedup on GPU)
Single gates are overhead-dominated on a GPU. Batches amortize the bootstrapping-key DRAM read across many gates simultaneously. Calls to /tfhe/batch get a 30% price discount automatically — see billing.
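A hedged sketch of a batched call: the /tfhe/batch path and api.wavis.xyz host appear on this page, but the payload field names ("gates", "op", "inputs") and the auth header are assumptions — check the API reference for the real schema.

```python
# Sketch of packing many gate evaluations into one /tfhe/batch request.
# Payload field names below are hypothetical placeholders.
import json
import urllib.request

API = "https://api.wavis.xyz/tfhe/batch"

def build_batch(ct_pairs):
    """Pack many NAND evaluations into one request body."""
    return {
        "gates": [
            {"op": "nand", "inputs": [a, b]}   # field names are assumptions
            for a, b in ct_pairs
        ]
    }

def submit(batch, api_key):
    req = urllib.request.Request(
        API,
        data=json.dumps(batch).encode(),
        headers={"Authorization": f"Bearer {api_key}",
                 "Content-Type": "application/json"},
    )
    return urllib.request.urlopen(req)   # network call; not exercised here

batch = build_batch([("ct_a0", "ct_b0"), ("ct_a1", "ct_b1")])
assert len(batch["gates"]) == 2
```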
CKKS for arithmetic workloads
For ML inference, statistics, and other “real number” workloads, use CKKS instead of TFHE. CKKS encrypts vectors of floats, supports add/multiply/rotate, and is ~100× faster than TFHE for the same arithmetic depth (because it processes thousands of slots in parallel via SIMD packing).

| Operation | CKKS (SIMD-packed) | TFHE (per-bit) |
|---|---|---|
| Add 1024 floats | 1 ms | 14 s (1024 gates) |
| Multiply by constant | 5 ms | — |
| Bootstrap | ~500 ms | 14 ms (per gate) |
| Polynomial eval (deg 5) | 30 ms | — |
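The SIMD advantage is easiest to see amortized per element. From the addition row above, assuming 1024 slots per CKKS ciphertext:

```python
# Amortized per-element cost from the table above: CKKS packs 1024 slots
# into one ciphertext, so a 1 ms addition costs ~1 us per float, while
# TFHE pays a full bootstrapped gate per bit.
SLOTS = 1024

ckks_add_ms = 1.0          # add 1024 floats (table above)
tfhe_add_ms = 14_000.0     # 1024 gates at ~14 ms each (fast_128)

ckks_per_elem_us = ckks_add_ms / SLOTS * 1000
tfhe_per_elem_us = tfhe_add_ms / SLOTS * 1000

print(f"CKKS: {ckks_per_elem_us:.2f} us/element")
print(f"TFHE: {tfhe_per_elem_us:.0f} us/element")
# Addition is CKKS's best case, far beyond the overall ~100x figure.
assert tfhe_per_elem_us / ckks_per_elem_us > 100
```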
When FHE is the wrong tool
Be honest about throughput requirements. FHE is ~10⁵× slower than plaintext computation. Some workloads are simply not feasible:

- Real-time video encoding. ~100 Mb/s plaintext → ~100 b/s FHE. No.
- Large transformer inference (70B params). Days per token. Not yet.
- Database scans over millions of rows. Use trusted hardware (SGX/SEV) or PIR.
By contrast, FHE is a reasonable fit for:

- Small ML inference (≤7B params, ≤1K input dim)
- Boolean circuits with up to ~10K gates
- Privacy-sensitive aggregation/scoring
- Multi-party computation primitives
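A quick way to place your own workload on this spectrum is to apply the ~10⁵× slowdown to its plaintext runtime, a back-of-envelope sketch:

```python
# Back-of-envelope feasibility check: apply the ~1e5x FHE slowdown from
# the text to a plaintext runtime and see if the result is tolerable.
FHE_SLOWDOWN = 1e5

def fhe_seconds(plaintext_seconds: float) -> float:
    return plaintext_seconds * FHE_SLOWDOWN

# 1 s of plaintext video-encoding work becomes ~28 hours: infeasible.
assert fhe_seconds(1.0) / 3600 > 24

# A small scoring model taking 50 us in plaintext becomes ~5 s: slow,
# but workable for privacy-sensitive one-shot inference.
assert 1 <= fhe_seconds(50e-6) <= 10
```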
Profiling your circuit
Every /compute and /tfhe/gate response includes compute_time_ms — use that for production monitoring.
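A minimal sketch of pulling that field into a local histogram. The compute_time_ms name comes from this page; the surrounding response shape is an assumption:

```python
# Extract server-side compute time from a response body for monitoring.
# The compute_time_ms field is real; the rest of the shape is assumed.
import json

def record_gate_timing(response_body: str, histogram: list) -> float:
    """Append the server-reported compute time to a local histogram."""
    ms = json.loads(response_body)["compute_time_ms"]
    histogram.append(ms)
    return ms

timings = []
record_gate_timing('{"result": "ct_...", "compute_time_ms": 31.7}', timings)
assert timings == [31.7]
```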
Concurrent operations
The WAVIS API uses an internal thread pool. A single account can issue up to 32 concurrent compute requests before queueing kicks in. Pro tier raises this to 64; Scale to 256. For heavy parallel workloads:

- Use /tfhe/batch to amortize cost across many gates in one call (preferred).
- Or open multiple HTTP connections and POST in parallel (HTTP/2 multiplexing is supported).
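For the parallel-POST route, a bounded thread pool keeps you under the concurrency cap. The worker below is a stub standing in for an HTTP call:

```python
# Stay under the 32-concurrent-request cap with a bounded pool. The
# worker is a placeholder; swap in your real POST to /tfhe/gate, or
# better, fold the work into fewer /tfhe/batch calls.
from concurrent.futures import ThreadPoolExecutor

MAX_INFLIGHT = 32   # free-tier concurrency limit from the text

def evaluate_gate(gate_id: int) -> int:
    # Placeholder for an HTTP POST; returns the id so results stay ordered.
    return gate_id

with ThreadPoolExecutor(max_workers=MAX_INFLIGHT) as pool:
    results = list(pool.map(evaluate_gate, range(100)))

assert results == list(range(100))   # pool.map preserves submission order
```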
Hardware recommendations
If you’re running the SDK locally with GPU:

| Card | Memory bandwidth | Estimated NAND/s |
|---|---|---|
| GTX 1660 Ti | 288 GB/s | ~190 |
| RTX 3060 | 360 GB/s | ~240 |
| RTX 4090 | 1008 GB/s | ~660 |
| H100 SXM5 | 3350 GB/s | ~2000 |
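The NAND/s column tracks memory bandwidth almost linearly, consistent with bootstrapping being DRAM-bound (each gate streams the bootstrapping key). The ~0.6 gates/s per GB/s ratio below is inferred from the table, not an official model:

```python
# The published NAND/s figures scale roughly linearly with memory
# bandwidth: about 0.6-0.67 gates/s per GB/s across all four cards.
cards = {            # name -> (GB/s, published NAND/s)
    "GTX 1660 Ti": (288, 190),
    "RTX 3060":    (360, 240),
    "RTX 4090":    (1008, 660),
    "H100 SXM5":   (3350, 2000),
}

for name, (bw, nand) in cards.items():
    ratio = nand / bw
    assert 0.55 < ratio < 0.70   # roughly constant gates/s per GB/s
```

This gives a quick way to estimate throughput for a card not in the table: multiply its memory bandwidth in GB/s by ~0.6.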
Performance roadmap
- Q3 2026 — WebAssembly target. Browser-side gates, ~2× CPU baseline.
- Q4 2026 — Multi-GPU batching. Linear scaling across CUDA devices.
- 2027 — Algorithmic optimizations. Sub-linear bootstrapping research; targeting 4 ms/gate on standard hardware.
Next Steps
- GPU Batch Example: step-by-step batched evaluation
- Key Management: picking parameters for your circuit depth