FHE has a reputation for being slow. WAVIS is designed to make it as fast as modern hardware allows — but performance is still the dominant practical constraint, so understanding the trade-offs matters.

Headline numbers

Measured on Intel Xeon Platinum (AVX-512F), --release build:
Setup                                           Latency         Throughput
CPU fast_128, single thread                     14 ms/gate      ~70 gates/s
CPU standard_128, single thread                 32 ms/gate      ~30 gates/s
CPU conservative_128, single thread             48 ms/gate      ~20 gates/s
GPU batch-128 (BYO GPU, GTX 1660 Ti)            5.2 ms/gate     ~190 gates/s
GPU batch-128 (BYO GPU, RTX 4090, projected)    ~1.5 ms/gate    ~660 gates/s
GPU batch-128 (BYO GPU, H100, projected)        ~0.5 ms/gate    ~2000 gates/s
Server-side compute on api.wavis.xyz runs CPU-only at the latencies above. GPU figures are for local SDK execution on customer hardware (BYO GPU); GPU-accelerated cloud compute is on the Enterprise tier.

Where the time goes

A single gate bootstrap does four things:
  1. External product (CMux) with the bootstrapping key — ~40% of time.
  2. Blind rotation of the test polynomial — ~35%.
  3. Sample extraction — ~5%.
  4. Key switching — ~20%.
The dominant cost is memory bandwidth: the bootstrapping key (~20 MB) exceeds the L2 cache of every current GPU, so the bottleneck is streaming the key from DRAM. This means GPU performance scales close to linearly with DRAM bandwidth: 288 GB/s (GTX 1660 Ti) → 1.0 TB/s (RTX 4090) → 3.35 TB/s (H100).
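As a rough sanity check on that scaling claim, here is a minimal sketch (plain Python, no SDK needed) that projects batched throughput from DRAM bandwidth. It assumes throughput is purely bandwidth-bound and is anchored to the measured GTX 1660 Ti figure; the other numbers are projections, not measurements:
# Linear-in-bandwidth projection from the measured baseline of
# ~190 gates/s at 288 GB/s (GTX 1660 Ti, batch-128).
BASELINE_BW_GBPS = 288.0
BASELINE_GATES_PER_S = 190.0

def projected_gates_per_s(bandwidth_gbps: float) -> float:
    return BASELINE_GATES_PER_S * bandwidth_gbps / BASELINE_BW_GBPS

for card, bw in [("RTX 3060", 360), ("RTX 4090", 1008), ("H100 SXM5", 3350)]:
    print(f"{card}: ~{projected_gates_per_s(bw):.0f} gates/s")
# RTX 3060 ~238, RTX 4090 ~665, H100 ~2210; the H100 projection in the
# table (~2000) sits a little below strictly linear scaling.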

Choosing a preset

WAVIS exposes three named TFHE parameter presets. They differ in noise budget, not security — all are 128-bit secure.
Preset            TLWE n   Latency   Noise budget   Best for
fast_128          636      14 ms     Standard       Demos, dev, <1000-gate circuits
standard_128      1024     32 ms     Higher         Default for production
conservative_128  1024     48 ms     Highest        Long-running circuits, max safety margin
In code:
import wavis_fhe as wv

keys = wv.keygen("fast_128")          # explicit
keys = wv.keygen()                    # defaults to "fast_128" in SDK
Or via REST:
POST /api/v1/tfhe/session
{ "preset": "standard_128" }
Rule of thumb: start with fast_128. If your circuit is more than ~1000 gates deep, switch to standard_128. Use conservative_128 only for high-stakes one-off computations where the extra 16 ms is worth the safety.
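If you want that rule of thumb in code, a minimal sketch might look like the following; choose_preset is a hypothetical helper for illustration, not part of the SDK:
import wavis_fhe as wv

def choose_preset(gate_depth: int, high_stakes: bool = False) -> str:
    # Hypothetical helper encoding the rule of thumb above; not an SDK function.
    if high_stakes:
        return "conservative_128"
    if gate_depth > 1000:
        return "standard_128"
    return "fast_128"

keys = wv.keygen(choose_preset(gate_depth=2500))  # -> "standard_128"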

Batch evaluation (huge speedup on GPU)

Single gates are overhead-dominated on a GPU. Batches amortize the bootstrapping-key DRAM read across many gates simultaneously.
# CPU — batch is the same as serial
keys = wv.keygen()
results = keys.batch_nand([(a, b) for ... ])

# GPU — batch is dramatically faster than serial
keys_gpu = wv.keygen_gpu()
gates = [(a_i, b_i) for i in range(128)]
results = keys_gpu.batch_nand(gates)  # 5.2 ms/gate (vs. 48 ms single)
Or via REST:
curl -X POST https://api.wavis.xyz/api/v1/tfhe/batch \
  -H "Authorization: Bearer $API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "session_id": "sess_...",
    "gates": [
      {"op": "NAND", "a": "...", "b": "..."},
      {"op": "AND",  "a": "...", "b": "..."}
    ]
  }'
Batch discount: ≥32 ops in a single /tfhe/batch call get a 30% price discount automatically — see billing.
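To take advantage of the discount from client code, one approach is to chunk a long gate list so every call carries at least 32 ops. A minimal sketch with requests, following the request shape above; the results field in the response is an assumption, so check the API reference for the exact schema:
import os
import requests

URL = "https://api.wavis.xyz/api/v1/tfhe/batch"
HEADERS = {"Authorization": f"Bearer {os.environ['API_KEY']}"}

def run_in_batches(session_id, gates, chunk_size=128):
    # Chunks of 128 keep every call comfortably above the 32-op discount threshold.
    results = []
    for i in range(0, len(gates), chunk_size):
        chunk = gates[i:i + chunk_size]
        resp = requests.post(URL, headers=HEADERS,
                             json={"session_id": session_id, "gates": chunk})
        resp.raise_for_status()
        results.extend(resp.json()["results"])  # assumed response field
    return results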

CKKS for arithmetic workloads

For ML inference, statistics, and other “real number” workloads, use CKKS instead of TFHE. CKKS encrypts vectors of floats, supports add/multiply/rotate, and is ~100× faster than TFHE for the same arithmetic depth (because it processes thousands of slots in parallel via SIMD packing).
Operation                  CKKS (SIMD-packed)   TFHE (per-bit)
Add 1024 floats            1 ms                 14 s (1024 gates)
Multiply by constant       5 ms                 -
Bootstrap                  ~500 ms              14 ms (per gate)
Polynomial eval (deg 5)    30 ms                -
CKKS uses different parameters — see Key Management for the parameter knobs.
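To make the packing advantage concrete, the first row of the table works out to roughly the following per-element costs (figures taken straight from the table above):
# Adding 1024 floats, using the figures from the table above.
SLOTS = 1024
ckks_total_ms = 1.0       # one SIMD-packed add covers every slot at once
tfhe_total_ms = 14_000.0  # ~14 s when evaluated gate-by-gate

print(f"CKKS: ~{ckks_total_ms / SLOTS * 1000:.0f} us per element")  # ~1 us
print(f"TFHE: ~{tfhe_total_ms / SLOTS:.1f} ms per element")         # ~13.7 ms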

When FHE is the wrong tool

Be honest about throughput requirements. FHE is ~10⁵× slower than plaintext computation. Some workloads are simply not feasible:
  • Real-time video encoding. ~100 Mb/s plaintext → ~100 b/s FHE. No.
  • Large transformer inference (70B params). Days per token. Not yet.
  • Database scans over millions of rows. Use trusted hardware (SGX/SEV) or PIR.
FHE shines for:
  • Small ML inference (≤7B params, ≤1K input dim)
  • Boolean circuits with up to ~10K gates
  • Privacy-sensitive aggregation/scoring
  • Multi-party computation primitives
Always benchmark your workload before committing.

Profiling your circuit

import wavis_fhe as wv
import time

keys = wv.keygen()
ct_a = keys.encrypt(False)
ct_b = keys.encrypt(True)

# Warm up
for _ in range(10):
    keys.nand(ct_a, ct_b)

# Time
t = time.perf_counter()
for _ in range(100):
    ct = keys.nand(ct_a, ct_b)
elapsed = (time.perf_counter() - t) / 100 * 1000
print(f"NAND: {elapsed:.1f} ms/gate")
For server-side timing, every /compute and /tfhe/gate response includes compute_time_ms — use that for production monitoring.
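A minimal sketch of reading that field from a single gate call with requests; the request body fields here are illustrative rather than the exact /tfhe/gate schema, so check the API reference:
import os
import requests

resp = requests.post(
    "https://api.wavis.xyz/api/v1/tfhe/gate",
    headers={"Authorization": f"Bearer {os.environ['API_KEY']}"},
    json={"session_id": "sess_...", "op": "NAND", "a": "...", "b": "..."},  # illustrative body
)
resp.raise_for_status()
print("server-side compute:", resp.json()["compute_time_ms"], "ms")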

Concurrent operations

The WAVIS API uses an internal thread pool. A single account can issue up to 32 concurrent compute requests before queueing kicks in. Pro tier raises this to 64; Scale to 256. For heavy parallel workloads:
  • Use /tfhe/batch to amortize cost across many gates in one call (preferred).
  • Or open multiple HTTP connections and POST in parallel (HTTP/2 multiplexing supported); see the sketch below.
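For the second option, a minimal sketch using a thread pool over plain HTTP/1.1 connections (an HTTP/2 client such as httpx could multiplex over a single connection instead). The gate payloads are illustrative, and the worker count stays at the default 32-request concurrency limit:
import os
from concurrent.futures import ThreadPoolExecutor

import requests

URL = "https://api.wavis.xyz/api/v1/tfhe/gate"
HEADERS = {"Authorization": f"Bearer {os.environ['API_KEY']}"}

def run_gate(gate):
    resp = requests.post(URL, headers=HEADERS, json={"session_id": "sess_...", **gate})
    resp.raise_for_status()
    return resp.json()

gates = [{"op": "NAND", "a": "...", "b": "..."}] * 100  # illustrative payloads
with ThreadPoolExecutor(max_workers=32) as pool:        # stay at the account's concurrency limit
    results = list(pool.map(run_gate, gates))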

Hardware recommendations

If you’re running the SDK locally with GPU:
Card          Memory bandwidth   Estimated NAND/s
GTX 1660 Ti   288 GB/s           ~190
RTX 3060      360 GB/s           ~240
RTX 4090      1008 GB/s          ~660
H100 SXM5     3350 GB/s          ~2000
For self-hosted Enterprise deployments, A100/H100 in NVLink-connected pairs give the best $/gate.

Performance roadmap

  • Q3 2026 — WebAssembly target. Browser-side gates, ~2× CPU baseline.
  • Q4 2026 — Multi-GPU batching. Linear scaling across CUDA devices.
  • 2027 — Algorithmic optimizations. Sub-linear bootstrapping research; targeting 4 ms/gate on standard hardware.

Next Steps

  • GPU Batch Example: step-by-step batched evaluation
  • Key Management: picking parameters for your circuit depth