For workloads with many independent gate evaluations, batched execution on a local NVIDIA GPU is 3–10× faster than CPU and approaches 200 gates/s on consumer hardware (GTX 1660 Ti) or 2000+ gates/s on H100. This is BYO GPU — the SDK uses your own CUDA-capable card. WAVIS Cloud runs CPU-only at the API tier; for hosted GPU compute, see Enterprise.

Prerequisites

  • NVIDIA GPU with Compute Capability ≥ 7.5 (Turing/RTX 20 series or newer)
  • CUDA 11.8+ runtime installed
  • Python 3.9+

Install

pip install wavis-fhe[cuda]
The CUDA kernel is compiled at runtime via NVRTC — no nvcc needed at install time. Verify:
import wavis_fhe as wv
print(wv.cuda_available())   # True if a usable GPU is present
print(wv.cuda_device_info())
# {'name': 'NVIDIA GeForce GTX 1660 Ti', 'compute_capability': (7, 5), ...}
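If you want code that degrades gracefully on machines without a GPU, gate the keygen call on cuda_available(). A minimal sketch using the calls shown in this guide (wv.keygen() is the CPU variant; that it accepts the same parameter-set string as keygen_gpu() is an assumption here):

import wavis_fhe as wv

# Fall back to CPU keys when no usable GPU is present.
# Assumption: wv.keygen() takes the same parameter-set name as keygen_gpu().
keys = wv.keygen_gpu("fast_128") if wv.cuda_available() else wv.keygen("fast_128")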

The basic batch pattern

import wavis_fhe as wv
import time

# 1. GPU keygen
keys_gpu = wv.keygen_gpu("fast_128")  # ~1 s on GPU vs. ~5 s on CPU

# 2. Encrypt a bunch of input pairs (encryption itself is CPU)
N = 128
pairs = [
    (keys_gpu.encrypt(i % 2 == 0), keys_gpu.encrypt(i % 3 == 0))
    for i in range(N)
]

# 3. Batch NAND on GPU
t = time.perf_counter()
results = keys_gpu.batch_nand(pairs)
elapsed = time.perf_counter() - t

print(f"Computed {N} NAND gates in {elapsed * 1000:.0f} ms")
print(f"Per-gate latency: {elapsed * 1000 / N:.1f} ms")

# 4. Decrypt and verify
for i, (a_ct, b_ct) in enumerate(pairs):
    a = keys_gpu.decrypt(a_ct)
    b = keys_gpu.decrypt(b_ct)
    expected = not (a and b)
    actual = keys_gpu.decrypt(results[i])
    assert expected == actual, f"Mismatch at index {i}"

print("All results verified ✓")
Output (GTX 1660 Ti):
Computed 128 NAND gates in 660 ms
Per-gate latency: 5.2 ms
All results verified ✓

Why batches are faster

The TFHE bootstrapping operation is memory-bandwidth-bound: each gate streams the ~20 MB bootstrapping key from GPU DRAM. Single-gate execution underutilizes the GPU — you pay the kernel-launch overhead and the key-streaming cost for one gate. Batching amortizes both:
| Batch size | Per-gate latency (GTX 1660 Ti) | Speedup vs. single |
|-----------:|-------------------------------:|--------------------|
| 1          | 48 ms                          | 1.0×               |
| 8          | 12 ms                          | 4.0×               |
| 32         | 5.7 ms                         | 8.4×               |
| 128        | 5.2 ms                         | 9.2×               |
| 1024       | 5.1 ms                         | 9.4× (saturated)   |
Diminishing returns past batch-128. If you’re memory-constrained, 32 is the sweet spot.
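To find the sweet spot on your own card, sweep batch sizes with the same batch_nand() pattern as above. A minimal benchmark sketch (it reuses one encrypted pair per batch; that this is representative of fresh ciphertexts is an assumption):

import time
import wavis_fhe as wv

keys_gpu = wv.keygen_gpu("fast_128")
pair = (keys_gpu.encrypt(True), keys_gpu.encrypt(False))

for size in (1, 8, 32, 128, 1024):
    t = time.perf_counter()
    keys_gpu.batch_nand([pair] * size)  # one pair reused; assumed representative
    per_gate_ms = (time.perf_counter() - t) * 1000 / size
    print(f"batch {size:5d}: {per_gate_ms:5.1f} ms/gate")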

Mixed gate batches

batch_nand() only does NAND. For heterogeneous gates, use batch_gates():
ops = [
    ("NAND", a1, b1),
    ("XOR",  a2, b2),
    ("AND",  a3, b3),
    ("OR",   a4, b4),
    # ... up to 1024 ops
]
results = keys_gpu.batch_gates(ops)
Heterogeneous batches are slightly slower (~6 ms/gate) because internally the GPU partitions by gate type and runs each partition as a homogeneous sub-batch.
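To sanity-check a mixed batch, decrypt each result against a plaintext truth table. A short sketch, assuming batch_gates() returns results in the same order as the input ops list:

# Plaintext truth table for the supported gates.
TRUTH = {
    "NAND": lambda a, b: not (a and b),
    "XOR":  lambda a, b: a != b,
    "AND":  lambda a, b: a and b,
    "OR":   lambda a, b: a or b,
}

for (op, a_ct, b_ct), res_ct in zip(ops, results):
    a, b = keys_gpu.decrypt(a_ct), keys_gpu.decrypt(b_ct)
    assert keys_gpu.decrypt(res_ct) == TRUTH[op](a, b)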

Streaming pipeline

For continuous workloads (e.g., processing a stream of encrypted records), overlap encryption and gate evaluation:
import asyncio
from concurrent.futures import ThreadPoolExecutor

keys_gpu = wv.keygen_gpu()
executor = ThreadPoolExecutor(max_workers=2)

async def encrypt_async(bit):
    # Encryption is CPU; offload to thread pool
    return await asyncio.get_running_loop().run_in_executor(
        executor, keys_gpu.encrypt, bit
    )

async def process_stream(input_stream):
    async for bit_a, bit_b in input_stream:
        # While GPU evaluates batch N, CPU encrypts batch N+1
        ct_a, ct_b = await asyncio.gather(
            encrypt_async(bit_a),
            encrypt_async(bit_b),
        )
        yield (ct_a, ct_b)

# Collect into batches and dispatch. Running batch_nand in the thread
# pool lets the event loop keep encrypting while the GPU evaluates.
async def run(input_iter):
    loop = asyncio.get_running_loop()
    batch = []
    async for pair in process_stream(input_iter):
        batch.append(pair)
        if len(batch) == 128:
            results = await loop.run_in_executor(
                executor, keys_gpu.batch_nand, batch
            )
            await emit_results(results)
            batch = []
    if batch:  # flush the final partial batch
        await emit_results(keys_gpu.batch_nand(batch))

Multi-GPU dispatch

If you have multiple GPUs, partition the batch:
import wavis_fhe as wv
from concurrent.futures import ThreadPoolExecutor

keys_g0 = wv.keygen_gpu(device=0)
keys_g1 = wv.keygen_gpu(device=1)

half = len(pairs) // 2

with ThreadPoolExecutor(max_workers=2) as ex:
    fut_a = ex.submit(keys_g0.batch_nand, pairs[:half])
    fut_b = ex.submit(keys_g1.batch_nand, pairs[half:])
    results = fut_a.result() + fut_b.result()
Important: each GPU instance has its own keypair. To move ciphertexts between GPUs, you’d need to re-encrypt under the target’s keys. For most multi-GPU workflows, scale by sharding the input data, not by sharing keys.
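Concretely, sharding means each GPU handles its slice end to end: encrypt, evaluate, and decrypt, all under that device's keys. A sketch under that assumption (plain_pairs is a hypothetical list of plaintext bit pairs):

devices = [0, 1]  # adjust to your machine
keys_per_gpu = [wv.keygen_gpu(device=d) for d in devices]

# Encrypt each shard under the keys of the GPU that will evaluate it;
# ciphertexts never need to cross devices.
shards = [
    [(k.encrypt(a), k.encrypt(b)) for a, b in plain_pairs[i::len(devices)]]
    for i, k in enumerate(keys_per_gpu)
]

with ThreadPoolExecutor(max_workers=len(devices)) as ex:
    futures = [ex.submit(k.batch_nand, s) for k, s in zip(keys_per_gpu, shards)]
    shard_results = [f.result() for f in futures]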

Memory budget

The bootstrapping key is ~20 MB and lives on the GPU. Each ciphertext in the batch is ~2 KB. A batch of 1024 gates with 2 ciphertexts each consumes ~4 MB of input, ~2 MB of output, plus working memory. Total GPU memory required: ~64 MB for fast_128, ~96 MB for standard_128. A 6 GB consumer GPU can run ~50 concurrent sessions. Track usage:
print(wv.cuda_memory_status())
# {'used_mb': 24, 'free_mb': 5912, 'total_mb': 5936}
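You can turn those figures into a rough batch-size ceiling. A back-of-the-envelope sketch using the numbers above (2 KB per ciphertext, two inputs and one output per gate); it deliberately ignores working memory, so treat the result as an upper bound:

status = wv.cuda_memory_status()
ct_kb = 2                      # ~2 KB per ciphertext (from above)
cts_per_gate = 3               # two inputs + one output
max_gates = status["free_mb"] * 1024 // (ct_kb * cts_per_gate)
print(f"Upper bound: ~{max_gates} gates before counting working memory")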

Profiling

import time

# Measure the components
t0 = time.perf_counter()
ct_a = keys_gpu.encrypt(False)
ct_b = keys_gpu.encrypt(True)
t1 = time.perf_counter()

ct_result = keys_gpu.nand(ct_a, ct_b)
t2 = time.perf_counter()

bit = keys_gpu.decrypt(ct_result)
t3 = time.perf_counter()

print(f"Encrypt:  {(t1-t0)*1000:.1f} ms (CPU)")
print(f"Gate:     {(t2-t1)*1000:.1f} ms (GPU)")
print(f"Decrypt:  {(t3-t2)*1000:.1f} ms (CPU)")
For finer-grained kernel timing:
import wavis_fhe as wv
wv.cuda_enable_profiling()

# ... run your workload ...

print(wv.cuda_kernel_stats())
# Per-kernel call counts and total GPU time

Comparing CPU vs. GPU

| Workload                      | CPU fast_128 | GPU fast_128 (GTX 1660 Ti) | Speedup          |
|-------------------------------|--------------|----------------------------|------------------|
| 1 gate                        | 14 ms        | 48 ms                      | 0.3× (overhead)  |
| 32 gates, serial calls        | 448 ms       | 1.5 s (wrong call pattern) | 0.3×             |
| 32 gates, batched             | 448 ms       | 180 ms                     | 2.5×             |
| 128 gates, batched            | 1.79 s       | 660 ms                     | 2.7×             |
| 1000-gate ripple-carry adder  | 14 s         | 14 s (serial dependency)   | 1.0×             |
Lessons:
  • Single gates: CPU wins; the GPU pays launch overhead per call.
  • Independent batches: GPU wins, often by 3× or more (see the dispatch sketch below).
  • Serial dependencies (carry chains): the GPU offers no help. Use CKKS instead for arithmetic.
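A simple dispatch heuristic based on the table above. The threshold of 8 is illustrative, not an SDK constant, and as before it is assumed that wv.keygen() accepts the same parameter-set name:

def make_keys(n_independent_gates):
    # Batch independent gates on GPU; keep small or serial workloads on CPU.
    if wv.cuda_available() and n_independent_gates >= 8:
        return wv.keygen_gpu("fast_128")
    return wv.keygen("fast_128")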

Combining with WAVIS Cloud

Use the cloud API for keygen/auth/billing/audit and your local GPU for gate-heavy compute:
from wavis.sdk.client import WAVISClient
import wavis_fhe as wv

client = WAVISClient(api_key="wvs_live_...")

# Cloud: subscription, audit log, billing
client.billing.subscribe({"plan": "scale", "email": "..."})

# Local GPU: actual compute (free, on your hardware)
keys_gpu = wv.keygen_gpu("standard_128")
results = keys_gpu.batch_nand(pairs)

# Cloud: usage reporting (optional)
client.dashboard.usage()

Troubleshooting

RuntimeError: CUDA out of memory

Reduce batch size or use smaller poly_degree. For fast_128, batches up to ~10K should fit on a 6 GB card.
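If a single batch overflows memory, split it into fixed-size chunks; per-gate latency is already saturated well below 10K (see the batch-size table above), so chunking costs little. A sketch:

def batch_nand_chunked(keys, pairs, chunk_size=1024):
    # Evaluate in GPU-sized chunks; results stay in input order.
    results = []
    for i in range(0, len(pairs), chunk_size):
        results.extend(keys.batch_nand(pairs[i:i + chunk_size]))
    return results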

RuntimeError: NVRTC compilation failed

Your CUDA runtime is too old. WAVIS requires CUDA 11.8+. Check your driver version, then update it:
nvidia-smi    # check driver version
# Update driver per your distro's instructions

GPU latency is worse than CPU

You’re calling single gates on GPU. Use batch_* methods, or stick with CPU keys (wv.keygen()).
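In other words:

# Slow: one kernel launch and one key stream per gate (~48 ms each)
results = [keys_gpu.nand(a, b) for a, b in pairs]

# Fast: one batched launch amortizes both (~5 ms per gate at batch 128)
results = keys_gpu.batch_nand(pairs)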

Next Steps

Performance

Detailed latency/throughput tuning

API Reference

Server-side batch API