## Prerequisites
- NVIDIA GPU with Compute Capability ≥ 7.5 (Turing/RTX 20 series or newer)
- CUDA 11.8+ runtime installed
- Python 3.9+
## Install
`nvcc` is needed at install time. Verify:
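For example (standard CUDA tooling):

```shell
which nvcc       # confirms nvcc is on PATH
nvcc --version   # should report release 11.8 or newer
```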
## The basic batch pattern
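A minimal sketch of the batch semantics, modeled on plaintext bits rather than ciphertexts; the element-wise two-list call shape is an assumption, and the `nand` helper is a stand-in for the encrypted gate:

```python
def nand(a: int, b: int) -> int:
    """Plaintext stand-in for one encrypted NAND gate."""
    return 1 - (a & b)

# batch_nand-style semantics (assumed): element-wise over two equal-length
# operand lists, evaluated in a single GPU launch in the real API.
lhs = [0, 0, 1, 1]
rhs = [0, 1, 0, 1]
out = [nand(a, b) for a, b in zip(lhs, rhs)]
print(out)  # [1, 1, 1, 0]
```

The point of the pattern is that all four gates go to the device in one call, not four.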
## Why batches are faster
The TFHE bootstrapping operation is memory-bandwidth-bound: each gate streams the ~20 MB bootstrapping key from GPU DRAM. Single-gate execution underutilizes the GPU: you pay the kernel-launch overhead and the key-streaming cost for one gate. Batching amortizes both:

| Batch size | Per-gate latency (GTX 1660 Ti) | Speedup vs. single |
|---|---|---|
| 1 | 48 ms | 1.0× |
| 8 | 12 ms | 4.0× |
| 32 | 5.7 ms | 8.4× |
| 128 | 5.2 ms | 9.2× |
| 1024 | 5.1 ms | 9.4× (saturated) |
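The table is consistent with a simple fixed-plus-marginal cost model: a one-time overhead amortized across the batch. A rough fit to the table's two endpoints (illustrative arithmetic, not a WAVIS internal):

```python
# Per-gate latency model: (fixed + n * marginal) / n, fitted to the table's
# endpoints (n=1 -> 48 ms, n=1024 -> 5.1 ms).
fixed = (48.0 - 5.1) / (1 - 1 / 1024)   # ~42.9 ms launch + key-streaming overhead
marginal = 48.0 - fixed                  # ~5.06 ms of true per-gate work

def per_gate_ms(n: int) -> float:
    return fixed / n + marginal

print(round(per_gate_ms(8), 1), round(per_gate_ms(32), 1))  # 10.4 6.4
```

The fitted model lands within roughly 15% of the measured 12 ms and 5.7 ms rows, which is why latency saturates near the marginal cost for large batches.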
## Mixed gate batches
`batch_nand()` evaluates only NAND. For heterogeneous gates, use `batch_gates()`:
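The exact gate-spec format for `batch_gates()` is not shown here; one plausible shape is a list of `(gate_name, a, b)` tuples. A plaintext reference model of that call shape (the tuple format is an assumption, not the documented API):

```python
# Plaintext truth tables standing in for the encrypted gate kernels.
GATES = {
    "nand": lambda a, b: 1 - (a & b),
    "and":  lambda a, b: a & b,
    "xor":  lambda a, b: a ^ b,
    "or":   lambda a, b: a | b,
}

def batch_gates_ref(specs):
    """Reference semantics: evaluate a heterogeneous list of (gate, a, b) specs."""
    return [GATES[g](a, b) for (g, a, b) in specs]

out = batch_gates_ref([("nand", 1, 1), ("xor", 1, 0), ("and", 1, 1)])
print(out)  # [0, 1, 1]
```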
## Streaming pipeline
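In a streaming pipeline, the CPU encrypts the next batch while the GPU evaluates the current one. A minimal double-buffering sketch with stand-in `encrypt_batch`/`eval_batch` functions (placeholder names, not the WAVIS API):

```python
from concurrent.futures import ThreadPoolExecutor

def encrypt_batch(records):          # stand-in for CPU-side encryption
    return [r & 1 for r in records]

def eval_batch(cts):                 # stand-in for the GPU batch-gate call
    return [1 - c for c in cts]

def stream(record_batches):
    """Encrypt batch i+1 on the main thread while a worker runs batch i."""
    results, pending = [], None
    with ThreadPoolExecutor(max_workers=1) as gpu:
        for records in record_batches:
            cts = encrypt_batch(records)          # CPU work for this batch
            if pending is not None:
                results.extend(pending.result())  # collect the previous GPU batch
            pending = gpu.submit(eval_batch, cts)
        if pending is not None:
            results.extend(pending.result())      # drain the last batch
    return results

print(stream([[0, 1], [2, 3]]))  # [1, 0, 1, 0]
```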
For continuous workloads (e.g., processing a stream of encrypted records), overlap encryption and gate evaluation: encrypt the next batch on the CPU while the current batch runs on the GPU.

## Multi-GPU dispatch
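With several GPUs, split a large batch into contiguous shards, one per device. A sketch of the partitioning in pure Python (how a shard is bound to a device depends on the session API, so the dispatch comment is hypothetical):

```python
def shard(batch, n_gpus):
    """Split a batch into n_gpus contiguous, near-equal shards."""
    k, r = divmod(len(batch), n_gpus)
    shards, start = [], 0
    for i in range(n_gpus):
        end = start + k + (1 if i < r else 0)  # first r shards get one extra item
        shards.append(batch[start:end])
        start = end
    return shards

shards = shard(list(range(10)), 3)
print(shards)  # [[0, 1, 2, 3], [4, 5, 6], [7, 8, 9]]
# Each shard would then be submitted to a per-device session
# (device selection syntax depends on the library).
```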
If you have multiple GPUs, partition the batch across devices and evaluate the shards concurrently.

## Memory budget
The bootstrapping key is ~20 MB and lives on the GPU. Each ciphertext in the batch is ~2 KB. A batch of 1024 gates with 2 ciphertexts each consumes ~4 MB of input, ~2 MB of output, plus working memory. Total GPU memory required: ~64 MB for `fast_128`, ~96 MB for `standard_128`. A 6 GB consumer GPU can run ~50 concurrent sessions.
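As a sanity check on those numbers (simple arithmetic; the session count below is the raw 6 GB / 96 MB quotient, and the ~50 figure leaves headroom for working memory):

```python
KB, MB, GB = 1, 1024, 1024 * 1024   # all sizes tracked in KB units
ct_kb = 2 * KB                       # ~2 KB per ciphertext
gates, cts_per_gate = 1024, 2

inputs_mb  = gates * cts_per_gate * ct_kb / MB   # input ciphertexts
outputs_mb = gates * ct_kb / MB                  # one output per gate
sessions   = (6 * GB) // (96 * MB)               # standard_128 sessions, no overhead

print(inputs_mb, outputs_mb, sessions)  # 4.0 2.0 64
```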
Track usage:
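From the shell, standard NVIDIA tooling works for live tracking (this is `nvidia-smi`, not a WAVIS command):

```shell
# Poll used/total GPU memory once per second
nvidia-smi --query-gpu=memory.used,memory.total --format=csv -l 1
```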
## Profiling
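A generic wall-clock harness is enough for latency numbers like the ones in the table below; note that asynchronous GPU APIs need a synchronization call before the timer is read. A sketch (generic Python, not a WAVIS utility):

```python
import time

def time_per_call_ms(fn, warmup=3, iters=20):
    """Average wall-clock time of fn() in ms, after warmup calls."""
    for _ in range(warmup):
        fn()                           # absorb one-time costs (key upload, JIT)
    t0 = time.perf_counter()
    for _ in range(iters):
        fn()                           # for async GPU APIs, synchronize in here
    return (time.perf_counter() - t0) * 1000 / iters

ms = time_per_call_ms(lambda: sum(range(10_000)))
print(ms > 0)  # True
```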
### Comparing CPU vs. GPU
| Workload | CPU fast_128 | GPU fast_128 (GTX 1660 Ti) | Speedup |
|---|---|---|---|
| 1 gate | 14 ms | 48 ms | 0.3× (overhead) |
| 32 gates serial | 448 ms | 1.5 s (wrong call pattern) | 0.3× |
| 32 gates batched | 448 ms | 180 ms | 2.5× |
| 128 gates batched | 1.79 s | 660 ms | 2.7× |
| 1000-gate ripple-carry adder | 14 s | 14 s (serial dependency) | 1.0× |
- Single gates: CPU wins. GPU has launch overhead.
- Independent batches: GPU wins, often by 3×+.
- Serial dependencies (carry chains): GPU offers no help. Use CKKS instead for arithmetic.
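From the table, a rough linear fit gives the breakeven batch size at which the GPU starts to win (fitted numbers, not additional measurements):

```python
import math

# GPU cost model fitted to the comparison table: 1 gate -> 48 ms,
# 32 batched gates -> 180 ms. CPU runs at ~14 ms per gate.
gpu_marginal = (180 - 48) / 31      # ms per extra gate in a batch (~4.3)
gpu_fixed = 48 - gpu_marginal       # launch + key-streaming overhead (~43.7)
cpu_per_gate = 14.0

breakeven = gpu_fixed / (cpu_per_gate - gpu_marginal)
print(math.ceil(breakeven))  # 5
```

So batches of roughly five or more independent gates are where the GPU pays off.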
## Combining with WAVIS Cloud
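WAVIS Cloud can serve as the control plane (keygen, auth, billing, audit) while batches run on a local GPU. A sketch of that split with stand-in functions (every name below is hypothetical, and gate evaluation is modeled on plaintext bits):

```python
def fetch_session_keys(api_token):       # stand-in for the cloud keygen/auth call
    return {"eval_key": b"\x00" * 32}    # placeholder key material

def run_local_batch(keys, gates):        # stand-in for local GPU batch evaluation
    return [1 - (a & b) for (a, b) in gates]   # NAND on plaintext bits

keys = fetch_session_keys(api_token="...")
outputs = run_local_batch(keys, [(0, 0), (1, 1)])
print(outputs)  # [1, 0]
```

The design point is that only control-plane traffic crosses the network; ciphertext-heavy work stays on your hardware.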
Use the cloud API for keygen/auth/billing/audit and your local GPU for gate-heavy compute.

## Troubleshooting
### `RuntimeError: CUDA out of memory`
Reduce the batch size or use a smaller `poly_degree`. For `fast_128`, batches up to ~10K should fit on a 6 GB card.
### `RuntimeError: NVRTC compilation failed`
Your CUDA runtime is too old. WAVIS requires CUDA 11.8+. Update with:
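On Ubuntu with NVIDIA's apt repository already configured, for example (package name varies by distribution and toolkit version):

```shell
sudo apt-get update
sudo apt-get install -y cuda-toolkit-11-8
nvcc --version   # confirm the new toolkit is picked up
```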
### GPU latency is worse than CPU
You’re calling single gates on the GPU. Use the `batch_*` methods, or stick with CPU keys (`wv.keygen()`).
## Next Steps
- Performance: detailed latency/throughput tuning
- API Reference: server-side batch API