FHE has a reputation for being slow. WAVIS is designed to make it as fast as modern hardware allows — but performance is still the dominant practical constraint, so understanding the trade-offs matters.

Headline numbers

Measured on Intel Xeon Platinum (AVX-512F), --release build:
Setup                                           Latency         Throughput
CPU fast_128, single thread                     14 ms/gate      ~70 gates/s
CPU standard_128, single thread                 32 ms/gate      ~30 gates/s
CPU conservative_128, single thread             48 ms/gate      ~20 gates/s
GPU batch-128 (BYO GPU, GTX 1660 Ti)            5.2 ms/gate     ~190 gates/s
GPU batch-128 (BYO GPU, RTX 4090, projected)    ~1.5 ms/gate    ~660 gates/s
GPU batch-128 (BYO GPU, H100, projected)        ~0.5 ms/gate    ~2000 gates/s
Server-side compute on api.wavis.xyz runs CPU-only at the latencies above. GPU figures are for local SDK execution on customer hardware (BYO GPU); GPU-accelerated cloud compute is on the Enterprise tier.

Where the time goes

A single gate bootstrap does four things:
  1. External product (CMux) with the bootstrapping key — ~40% of time.
  2. Blind rotation of the test polynomial — ~35%.
  3. Sample extraction — ~5%.
  4. Key switching — ~20%.
The dominant cost is memory bandwidth: the bootstrapping key (~20 MB) exceeds the L2 cache of every current GPU, so the bottleneck is streaming the key from DRAM. This means GPU performance scales close to linearly with DRAM bandwidth: 288 GB/s (GTX 1660 Ti) → 1.0 TB/s (RTX 4090) → 3.35 TB/s (H100).
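As a rough sanity check on that scaling claim, here is a minimal sketch (plain Python, no SDK needed) that projects batched throughput from DRAM bandwidth. It assumes throughput is purely bandwidth-bound and is anchored to the measured GTX 1660 Ti figure; the other numbers are projections, not measurements:
# Linear-in-bandwidth projection from the measured baseline of
# ~190 gates/s at 288 GB/s (GTX 1660 Ti, batch-128).
BASELINE_BW_GBPS = 288.0
BASELINE_GATES_PER_S = 190.0

def projected_gates_per_s(bandwidth_gbps: float) -> float:
    return BASELINE_GATES_PER_S * bandwidth_gbps / BASELINE_BW_GBPS

for card, bw in [("RTX 3060", 360), ("RTX 4090", 1008), ("H100 SXM5", 3350)]:
    print(f"{card}: ~{projected_gates_per_s(bw):.0f} gates/s")
# RTX 3060 ~238, RTX 4090 ~665, H100 ~2210; the H100 projection in the
# table (~2000) sits a little below strictly linear scaling.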

Choosing a preset

WAVIS exposes three named TFHE parameter presets. They differ in noise budget, not security — all are 128-bit secure.
Preset            TLWE n   Latency   Noise budget   Best for
fast_128          636      14 ms     Standard       Demos, dev, <1000-gate circuits
standard_128      1024     32 ms     Higher         Default for production
conservative_128  1024     48 ms     Highest        Long-running circuits, max safety margin
In code:
import wavis_fhe as wv

keys = wv.keygen("fast_128")          # explicit
keys = wv.keygen()                    # defaults to "fast_128" in SDK
Or via REST:
POST /api/v1/tfhe/session
{ "preset": "standard_128" }
Rule of thumb: start with fast_128. If your circuit is more than ~1000 gates deep, switch to standard_128. Use conservative_128 only for high-stakes one-off computations where the extra 16 ms is worth the safety.
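If you want that rule of thumb in code, a minimal sketch might look like the following; choose_preset is a hypothetical helper for illustration, not part of the SDK:
import wavis_fhe as wv

def choose_preset(gate_depth: int, high_stakes: bool = False) -> str:
    # Hypothetical helper encoding the rule of thumb above; not an SDK function.
    if high_stakes:
        return "conservative_128"
    if gate_depth > 1000:
        return "standard_128"
    return "fast_128"

keys = wv.keygen(choose_preset(gate_depth=2500))  # -> "standard_128"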

Batch evaluation (huge speedup on GPU)

Single gates are overhead-dominated on a GPU. Batches amortize the bootstrapping-key DRAM read across many gates simultaneously.
# CPU — batch is the same as serial
keys = wv.keygen()
results = keys.batch_nand([(a, b) for ... ])

# GPU — batch is dramatically faster than serial
keys_gpu = wv.keygen_gpu()
gates = [(a_i, b_i) for i in range(128)]
results = keys_gpu.batch_nand(gates)  # 5.2 ms/gate (vs. 48 ms single)
Or via REST:
curl -X POST https://api.wavis.xyz/api/v1/tfhe/batch \
  -H "Authorization: Bearer $API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "session_id": "sess_...",
    "gates": [
      {"op": "NAND", "a": "...", "b": "..."},
      {"op": "AND",  "a": "...", "b": "..."}
    ]
  }'
Batch discount: ≥32 ops in a single /tfhe/batch call get a 30% price discount automatically — see billing.
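To take advantage of the discount from client code, one approach is to chunk a long gate list so every call carries at least 32 ops. A minimal sketch with requests, following the request shape above; the results field in the response is an assumption, so check the API reference for the exact schema:
import os
import requests

URL = "https://api.wavis.xyz/api/v1/tfhe/batch"
HEADERS = {"Authorization": f"Bearer {os.environ['API_KEY']}"}

def run_in_batches(session_id, gates, chunk_size=128):
    # Chunks of 128 keep every call comfortably above the 32-op discount threshold.
    results = []
    for i in range(0, len(gates), chunk_size):
        chunk = gates[i:i + chunk_size]
        resp = requests.post(URL, headers=HEADERS,
                             json={"session_id": session_id, "gates": chunk})
        resp.raise_for_status()
        results.extend(resp.json()["results"])  # assumed response field
    return results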

CKKS for arithmetic workloads

For ML inference, statistics, and other “real number” workloads, use CKKS instead of TFHE. CKKS encrypts vectors of floats, supports add/multiply/rotate, and is ~100× faster than TFHE for the same arithmetic depth (because it processes thousands of slots in parallel via SIMD packing).
Operation                  CKKS (SIMD-packed)   TFHE (per-bit)
Add 1024 floats            1 ms                 14 s (1024 gates)
Multiply by constant       5 ms                 -
Bootstrap                  ~500 ms              14 ms (per gate)
Polynomial eval (deg 5)    30 ms                -
CKKS uses different parameters — see Key Management for the parameter knobs.
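To make the packing advantage concrete, the first row of the table works out to roughly the following per-element costs (figures taken straight from the table above):
# Adding 1024 floats, using the figures from the table above.
SLOTS = 1024
ckks_total_ms = 1.0       # one SIMD-packed add covers every slot at once
tfhe_total_ms = 14_000.0  # ~14 s when evaluated gate-by-gate

print(f"CKKS: ~{ckks_total_ms / SLOTS * 1000:.0f} us per element")  # ~1 us
print(f"TFHE: ~{tfhe_total_ms / SLOTS:.1f} ms per element")         # ~13.7 ms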

When FHE is the wrong tool

Be honest about throughput requirements. FHE is ~10⁵× slower than plaintext computation. Some workloads are simply not feasible:
  • Real-time video encoding. ~100 Mb/s plaintext → ~100 b/s FHE. No.
  • Large transformer inference (70B params). Days per token. Not yet.
  • Database scans over millions of rows. Use trusted hardware (SGX/SEV) or PIR.
FHE shines for:
  • Small ML inference (≤7B params, ≤1K input dim)
  • Boolean circuits with up to ~10K gates
  • Privacy-sensitive aggregation/scoring
  • Multi-party computation primitives
Always benchmark your workload before committing.

Profiling your circuit

import wavis_fhe as wv
import time

keys = wv.keygen()
ct_a = keys.encrypt(False)
ct_b = keys.encrypt(True)

# Warm up
for _ in range(10):
    keys.nand(ct_a, ct_b)

# Time
t = time.perf_counter()
for _ in range(100):
    ct = keys.nand(ct_a, ct_b)
elapsed = (time.perf_counter() - t) / 100 * 1000
print(f"NAND: {elapsed:.1f} ms/gate")
For server-side timing, every /compute and /tfhe/gate response includes compute_time_ms — use that for production monitoring.
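A minimal sketch of reading that field from a single gate call with requests; the request body fields here are illustrative rather than the exact /tfhe/gate schema, so check the API reference:
import os
import requests

resp = requests.post(
    "https://api.wavis.xyz/api/v1/tfhe/gate",
    headers={"Authorization": f"Bearer {os.environ['API_KEY']}"},
    json={"session_id": "sess_...", "op": "NAND", "a": "...", "b": "..."},  # illustrative body
)
resp.raise_for_status()
print("server-side compute:", resp.json()["compute_time_ms"], "ms")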

Concurrent operations

The WAVIS API uses an internal thread pool. A single account can issue up to 32 concurrent compute requests before queueing kicks in. Pro tier raises this to 64; Scale to 256. For heavy parallel workloads:
  • Use /tfhe/batch to amortize cost across many gates in one call (preferred).
  • Or open multiple HTTP connections and POST in parallel (HTTP/2 multiplexing supported); see the sketch below.
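For the second option, a minimal sketch using a thread pool over plain HTTP/1.1 connections (an HTTP/2 client such as httpx could multiplex over a single connection instead). The gate payloads are illustrative, and the worker count stays at the default 32-request concurrency limit:
import os
from concurrent.futures import ThreadPoolExecutor

import requests

URL = "https://api.wavis.xyz/api/v1/tfhe/gate"
HEADERS = {"Authorization": f"Bearer {os.environ['API_KEY']}"}

def run_gate(gate):
    resp = requests.post(URL, headers=HEADERS, json={"session_id": "sess_...", **gate})
    resp.raise_for_status()
    return resp.json()

gates = [{"op": "NAND", "a": "...", "b": "..."}] * 100  # illustrative payloads
with ThreadPoolExecutor(max_workers=32) as pool:        # stay at the account's concurrency limit
    results = list(pool.map(run_gate, gates))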

Hardware recommendations

If you’re running the SDK locally with GPU:
Card          Memory bandwidth   Estimated NAND/s
GTX 1660 Ti   288 GB/s           ~190
RTX 3060      360 GB/s           ~240
RTX 4090      1008 GB/s          ~660
H100 SXM5     3350 GB/s          ~2000
For self-hosted Enterprise deployments, A100/H100 in NVLink-connected pairs give the best $/gate.

Performance roadmap

  • Q3 2026 — WebAssembly target. Browser-side gates, ~2× CPU baseline.
  • Q4 2026 — Multi-GPU batching. Linear scaling across CUDA devices.
  • 2027 — Algorithmic optimizations. Sub-linear bootstrapping research; targeting 4 ms/gate on standard hardware.

Next Steps

  • GPU Batch Example: step-by-step batched evaluation
  • Key Management: picking parameters for your circuit depth