Grafting: Improving Performance and Usability of Homomorphic Encryption

TL;DR: Grafting is a new approach for managing the CKKS ciphertext modulus. With so-called sprouts, we dedicate a few machine words to scaling and use word-sized primes for the rest of the ciphertext modulus, improving performance. With universal sprouts, we can represent any bit size up to the word size using powers of two and introduce arbitrary scaling for RNS-CKKS, easing parameter and circuit design.


Homomorphic encryption is one of the most exciting technologies in modern cryptography, enhancing privacy and security in our data-driven world. In recent years, research has made continuous progress on its two biggest challenges: performance and usability. Grafting is a new technique that improves both performance and usability for the homomorphic encryption scheme CKKS. Let’s try to understand the problems Grafting solves.

CKKS Parameters

Encryption schemes base their security on mathematically hard problems. The CKKS scheme bases its security on the Learning with Errors over Rings (RLWE) problem. Sounds complicated? It is. Fortunately, we do not need to understand it; we only need to meet one important RLWE parameter: the number $q$, also called the ciphertext modulus. CKKS needs a huge ciphertext modulus, sometimes over 3000 bits! For comparison, we can store the number of atoms in the universe in only 266 bits. And a single CKKS encryption needs thousands of random numbers from $0$ to $q$. That sounds like a lot of bits (and it is), but modern technology actually handles them with ease; even your smartphone can store many CKKS encryptions. Obviously, there is a catch: We use CKKS not only to store encryptions, but to compute on them. And computing on thousands of numbers from $0$ to $q$ is not cheap, especially when we compute complex mathematical functions as in CKKS.

Despite these costs, CKKS is reasonably fast on modern machines. Modern machines operate on 64 bits at once, a so-called machine word. We want to use as few words as possible for our random numbers: the fewer words we compute on, the faster we are. For performance, we split up $q$ into $\ell$ small numbers, each fitting into 64 bits:

\[\textstyle q = q_1 \cdot q_2 \cdot q_3 \cdot \ldots \cdot q_\ell = \prod_{i = 1}^\ell q_i \text{.}\]

Each number from $0$ to $q$ corresponds to exactly $\ell$ numbers from $0$ to $q_i$, one for each $q_i$ (Figure 1).

Figure 1: A ciphertext modulus $q$ split up into $\ell = 10$ small numbers, each using 64 bits.

This concept is also known as a residue number system (RNS), and thanks to the Chinese Remainder Theorem we can switch between one large number ($0$ to $q$) and $\ell$ small numbers ($0$ to $q_i$) anytime we want. As with RLWE: We do not need to understand the math behind the RNS to understand Grafting, only what we use it for in CKKS. CKKS uses the RNS to improve performance, but it also uses it for the so-called scaling.
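If you want to see this back-and-forth in action, here is a tiny Python sketch (my own toy example with three small primes instead of 64-bit ones, not actual CKKS code):

```python
from math import prod

moduli = [97, 101, 103]                 # pairwise coprime q_1, q_2, q_3
q = prod(moduli)                        # the big modulus q = q_1 * q_2 * q_3

x = 123456                              # one large number from 0 to q
residues = [x % qi for qi in moduli]    # its small numbers, one per q_i

# Chinese Remainder Theorem: reconstruct x from its residues.
x_back = 0
for r, qi in zip(residues, moduli):
    Qi = q // qi                        # product of all the other moduli
    x_back += r * Qi * pow(Qi, -1, qi)  # pow(a, -1, m) is the modular inverse
x_back %= q

assert x_back == x                      # we can switch back and forth anytime
print(residues, x_back)
```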

CKKS Scaling

CKKS encrypts approximate numbers which are, well, approximations of real numbers. An example you are probably familiar with is $\pi$: Teachers often approximate it as $3.14$ for calculations even though it has infinitely many digits. In CKKS, we approximate numbers using integers and a scaling factor. For $3.14$, we need a scaling factor $\Delta = 100$ to store it in an integer:

\[3.14 \cdot \Delta = 3.14 \cdot 100 = 314 \text{.}\]

For $3.14159265$, we need $\Delta = 100000000 = 10^8$:

\[3.14159265 \cdot \Delta = 3.14159265 \cdot 10^8 = 314159265 \text{.}\]

Alternatively, we round the scaled value and accept a worse approximation (for example with $\Delta = 100$):

\[\lceil 3.14159265 \cdot \Delta\rfloor = \lceil 314.159265 \rfloor = 314 \text{.}\]

The larger the scaling factor, the better our approximation—but the larger our integers and the CKKS parameters we will need to encrypt them. We can add two scaled numbers $x$ and $y$ and get their scaled sum:

\[(\Delta \cdot x) + (\Delta \cdot y) = \Delta \cdot (x + y) \text{.}\]

We can also multiply two scaled numbers

\[(\Delta \cdot x) \cdot (\Delta \cdot y) = \Delta \cdot (\Delta \cdot x \cdot y) \text{;}\]

however, now we get an additional factor $\Delta$, which we have to remove again. For unencrypted numbers, we simply divide by $\Delta$ to get the correct result. For encrypted numbers, it is not that simple.
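Here is that unencrypted bookkeeping as a small Python sketch (a toy illustration with hypothetical helper names, not CKKS code): we encode with a scaling factor, add, multiply, and then divide the extra $\Delta$ out again.

```python
DELTA = 2**30                        # an example scaling factor

def encode(value):
    """Scale a real number by DELTA and round to an integer."""
    return round(value * DELTA)

def decode(scaled):
    return scaled / DELTA

x, y = encode(3.14159265), encode(2.71828183)

# Addition keeps the scaling factor: Delta*x + Delta*y = Delta*(x + y).
print(decode(x + y))                 # ~5.8599

# Multiplication picks up an extra Delta: (Delta*x) * (Delta*y) = Delta*(Delta*x*y).
product = x * y                      # scaled by Delta^2
print(decode(product // DELTA))      # divide one Delta out again: ~8.5397
```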

Division in CKKS uses polynomial approximation, polynomial approximation needs multiplications, multiplications need divisions, divisions need polynomial approximations, polynomial approximations need … An endless cycle, I think you see the problem. Fortunately, we have a trick up our sleeves: Instead of dividing the encrypted numbers by $\Delta$, we remove one of the factors of $q = q_1 \cdot \ldots \cdot q_\ell$ with clever (and slightly complex) mathematics. If $q_i \approx \Delta$, removing $q_i$ divides the encrypted numbers by $q_i \approx \Delta$, and we can remove the additional $\Delta$ from a multiplication! However, we cannot remove $q_i$ forever since $q$ only consists of $\ell$ small numbers. After $\ell - 1$ multiplications, we are left with only one small number: $q_1$. Then, we use a process called bootstrapping to go back to our big $q = q_1 \cdot \ldots \cdot q_\ell$.
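For intuition, here is a plain-integer sketch of that removal step (my own toy example with three small primes, where the last one plays the role of $\Delta$): dropping the last factor divides the represented value by it, without ever reconstructing the big number.

```python
moduli = [1009, 1013, 1021]               # toy q_1, q_2, q_3; q_3 plays the role of Delta

def to_rns(x, moduli):
    return [x % qi for qi in moduli]

def rescale(residues, moduli):
    """Drop the last factor q_l and divide the represented value by it (rounded down)."""
    q_l, x_l = moduli[-1], residues[-1]
    new_residues = [
        (r - x_l) * pow(q_l, -1, qi) % qi  # (x - x mod q_l) / q_l, computed per residue
        for r, qi in zip(residues[:-1], moduli[:-1])
    ]
    return new_residues, moduli[:-1]

x = 123 * 1021 + 456                       # some value, roughly 123 times q_3
residues, remaining = rescale(to_rns(x, moduli), moduli)
print(residues == to_rns(x // 1021, remaining))   # True: we divided by q_3 ~ Delta
```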

Scaling imposes another restriction on our $q_i$. Our requirements are now as follows:

  • Each $q_i$ must be close to the scaling factor $\Delta$ for multiplications to work correctly.
  • Each $q_i$ needs to be a prime so we can use the RNS representation for performance.
  • Each $q_i$ should use 64 bits (the machine word size) for best performance.

But what if the scaling factor $\Delta$ uses far fewer than 64 bits? Then we waste computational resources: A modern machine always computes on 64 bits, even if we do not use all of them. And currently, CKKS scaling uses far fewer than 64 bits, wasting precious resources. Grafting with universal sprouts solves this issue: We can scale by anything we want and choose large $q_i$ that use the full machine word.

Grafting

The idea behind Grafting is surprisingly simple. What if we could have one special 64-bit word where we can choose any bit size between 0 and 64? Then we could scale to any bit size we want. Example (Figure 2): We are at $330 = 10 + 5 \cdot 64$ bits and want to go to $300 = 44 + 4 \cdot 64$ bits. We remove one 64-bit prime ($330 - 64 = 266$) and replace the special 10-bit word with a special 44-bit word ($266 - 10 + 44 = 300$).

Figure 2: Removing 30 bits from a ciphertext modulus without Grafting (top) and with Grafting (bottom). We need far fewer machine words for the same modulus size and can still easily scale by 30 bits.
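To see where the savings in Figure 2 come from, here is a back-of-the-envelope sketch (my own illustration with hypothetical helper names) counting machine words for the same modulus size:

```python
import math

def words_without_grafting(total_bits, delta_bits):
    """Every q_i must stay close to Delta, so each 64-bit word only carries delta_bits bits."""
    return math.ceil(total_bits / delta_bits)

def words_with_grafting(total_bits, word_bits=64):
    """Full 64-bit primes plus one sprout for the leftover bits."""
    return math.ceil(total_bits / word_bits)

total_bits, delta_bits = 330, 30
print(words_without_grafting(total_bits, delta_bits))  # 11 words of ~30 bits each
print(words_with_grafting(total_bits))                 # 6 words: five 64-bit primes + a 10-bit sprout
```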

Along these lines, we can choose any bit size for $q$ using our special word. If we remove one bit from it, we scale the encrypted numbers by $\Delta \approx 2$. Removing two bits from the special word scales the encrypted numbers by $\Delta \approx 4$, removing three bits by $\Delta \approx 8$, and so on. In some sense, we decouple the scaling that we need for a correct CKKS multiplication from the individual $q_i$. We call the special machine word a sprout, and we call it a universal sprout if it can have any bit size between 0 and 64. If you just came here to understand the idea behind Grafting, congratulations, you made it! That’s the idea, nothing more, nothing less.

Of course, reality is more complex (it’s always the same, huh): How do we actually realize a universal sprout? While all the mathematical details are in our paper, let’s try to understand it with much less math. Remember all the thousands of random numbers from $0$ to $q$ I talked about? They have a specific meaning: they are the coefficients of a polynomial of degree $N - 1$:

\[a(x) = a_0 + a_1 x + a_2 x^2 + \ldots + a_{N-1} x^{N-1} \text{;}\]

each coefficient $a_i$ is a number from $0$ to $q$, each polynomial needs $N$ coefficients, and each ciphertext needs two of these polynomials. Sometimes, we need to multiply two polynomials, and we do so with the so-called number theoretic transform (NTT). The Wikipedia article is not very helpful if you’re not a mathematician, so let’s stick to our approach: We only aim at understanding the consequences the NTT has for our $q_i$, not the math behind it. For the NTT, we need each $q_i$ to be equal to $1 \bmod 2N$. If you’re not familiar with the modulo operation, don’t worry, I got you: The important thing for us is that each $q_i$ then needs to be larger than $2N$; and usually, $N$ is a 15–20 bit number. Hence, each $q_i$ must also be at least 15–20 bits large. So how can we get a sprout that represents 0 to 20 bits?
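As a quick aside before we answer that: here is how one might search for such an NTT-friendly prime (my own sketch; it uses sympy for the primality test):

```python
from sympy import isprime

def find_ntt_prime(bits, N):
    """Largest prime below 2**bits with q ≡ 1 (mod 2N), the shape the NTT requires."""
    step = 2 * N
    q = ((2**bits - 2) // step) * step + 1   # largest candidate ≡ 1 mod 2N below 2**bits
    while q > 2**(bits - 1):
        if isprime(q):
            return q
        q -= step
    return None

# A roughly 60-bit prime usable with ring degree N = 2**15:
print(find_ntt_prime(60, 2**15))
```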

One approach is as follows: We use the powers of two $1, 2, 4, 8, \ldots, 2^{20}$ even if we then cannot use the NTT. And we cannot, but we can employ another trick: We move the power-of-two polynomials to a polynomial with a helper prime, use the NTT for polynomial multiplication, then go back to the power-of-two. This works as long as the helper prime is larger than 60 bits, which we fortunately have! That’s actually all we need for a universal sprout: We use powers of two for all bit sizes up to 20 with a helper prime, then different primes of up to 64 bits each. For $N = 2^{15}$, we reduce the number of $q_i$ from 20 to 12, a reduction of 40%, and it results in up to two times better performance. But, in my personal opinion, that’s not the best part of Grafting. I’ve said it before, I’ll say it again: We decouple the scaling from the individual $q_i$, which makes CKKS much more usable. Two quick examples: Designing parameter sets becomes much easier, and so does computing with arbitrary scaling factors. And that’s it! You now know how Grafting improves the performance and usability of CKKS.
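Bonus, if you want to see the helper-prime trick in action: the toy sketch below (my own illustration; schoolbook multiplication stands in for the NTT, and a Mersenne prime stands in for the helper prime) multiplies two polynomials with a power-of-two coefficient modulus by detouring through the helper prime, and checks that the result matches the direct computation modulo $2^k$.

```python
import random

def polymul_negacyclic(a, b, mod):
    """Schoolbook product of a and b in Z_mod[x] / (x^N + 1)."""
    N = len(a)
    res = [0] * N
    for i in range(N):
        for j in range(N):
            sign = -1 if i + j >= N else 1            # x^N = -1 in this ring
            res[(i + j) % N] = (res[(i + j) % N] + sign * a[i] * b[j]) % mod
    return res

N = 8                 # toy ring degree (real CKKS uses N = 2**15 or larger)
k = 20                # sprout bit size: coefficients live modulo 2**k
p = 2**61 - 1         # helper prime with more than 60 bits (a Mersenne prime; a real
                      # implementation would pick an NTT-friendly helper prime instead)

a = [random.randrange(2**k) for _ in range(N)]
b = [random.randrange(2**k) for _ in range(N)]

# Direct product modulo 2**k (no NTT available for a power-of-two modulus) ...
direct = polymul_negacyclic(a, b, 2**k)

# ... equals the detour: multiply modulo the helper prime (where the NTT would run),
# lift back to signed integers, and reduce modulo 2**k. This works because every true
# product coefficient is below N * 2**(2k) < p/2 in absolute value.
lifted = [c - p if c > p // 2 else c for c in polymul_negacyclic(a, b, p)]
print(direct == [c % 2**k for c in lifted])           # True
```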