<?xml version="1.0" encoding="utf-8"?><feed xmlns="http://www.w3.org/2005/Atom" xml:lang="en"><generator uri="https://jekyllrb.com/" version="4.4.1">Jekyll</generator><link href="https://ckks.org/feed.xml" rel="self" type="application/atom+xml"/><link href="https://ckks.org/" rel="alternate" type="text/html" hreflang="en"/><updated>2026-05-10T01:21:31+00:00</updated><id>https://ckks.org/feed.xml</id><title type="html">CKKS.org</title><subtitle></subtitle><entry><title type="html">RadixCKKS: A General Framework for Integer Computation over CKKS</title><link href="https://ckks.org/blog/2026/radix-ckks/" rel="alternate" type="text/html" title="RadixCKKS: A General Framework for Integer Computation over CKKS"/><published>2026-04-07T04:01:00+00:00</published><updated>2026-04-07T04:01:00+00:00</updated><id>https://ckks.org/blog/2026/radix-ckks</id><content type="html" xml:base="https://ckks.org/blog/2026/radix-ckks/"><![CDATA[<ul> <li>Written by <a href="https://sites.google.com/view/cau-paiclab/home">Gyeongwon Cha</a> (Chung-ang university)</li> <li>Based on <a href="https://eprint.iacr.org/2025/1740">https://ia.cr/2025/1740</a> (Eurocrypt 2026)</li> </ul> <p><em>TL;DR: To handle large integers in FHE, a common approach is to decompose an integer into several small pieces, called digits, and perform computations based on them. One such approach is the radix-based approach, which decomposes a large integer into digits in base $B$ and carries out arithmetic on those digits via polynomial operations. However, after arithmetic operations, the resulting representation is no longer unique, which makes it difficult to directly perform non-arithmetic operations such as comparison or bitwise operations. The process of restoring such a disturbed digit representation back to its unique form is commonly called digit carry, and this step inherently requires non-arithmetic processing. In this post, we introduce a two-step homomorphic digit carry algorithm over CKKS. 
Our algorithm restores the digit representation to its unique form using $O(\log k)$ bootstrappings.</em></p> <hr/> <p><br/></p> <h2 id="introduction">Introduction</h2> <div style="margin-top: 1.5em;"></div> <p>Recently, the FHE community has shown growing interest in homomorphically evaluating integers of various sizes, ranging from machine-word-sized integers to RSA moduli. Surprisingly, 64-bit arithmetic appears quite simple from the plaintext perspective, much like multiplying two <code class="language-plaintext highlighter-rouge">long long</code> variables. In FHE, however, this is not straightforward, because the noise growth typically scales with the message size. Consequently, most existing approaches either maintain a sufficiently large gap between the message and the noise or decompose the message into smaller pieces and express the overall arithmetic through those pieces.</p> <p>The radix-based approach represents a plaintext integer in base $B$. Once decomposed in this way, an integer can be viewed as a polynomial whose coefficients are its digits, and the original integer is recovered by evaluating that polynomial at $X=B$. As a result, arithmetic on radix-decomposed integers can naturally be carried out through polynomial arithmetic. The key advantage of this approach is that the digit representation still preserves much of the structure of the underlying integer. For example, to compare two integers, one can examine their digits sequentially from the most significant digit to the least significant one. In this sense, radix representations are highly flexible, since they leave room for extending arithmetic to non-arithmetic operations such as comparison.</p> <p>The main challenge is that while polynomial arithmetic preserves the value of the represented integer, it does not preserve the uniqueness of its digit representation. After arithmetic operations, digit values may exceed their normal range, which in turn causes additional noise growth. 
More importantly, once the digit representation is no longer unique, non-arithmetic operations such as comparison can no longer be performed directly. Therefore, radix-based approaches must restore the non-unique digit representation to its unique form through digit carry.</p> <p>This carry procedure, however, has remained a major bottleneck, since it inherently requires remainder computation and must process the digits sequentially. Although [1, 2] achieved homomorphic digit carry using discrete CKKS, the number of required bootstrappings remained linear in the plaintext bit-length $k$. This naturally leads to the following question: can this linear cost be reduced further?</p> <p><strong>Notation.</strong> Throughout this post, we adopt the following notation for clarity. We denote homomorphic arithmetic operations between ciphertexts simply as $+,-,\times$. In addition, we denote the rotation operation by $\rho_r$, which rotates a ciphertext to the left by $r$ positions. If $r$ is negative, this corresponds to a right rotation.</p> <p><br/></p> <h2 id="key-idea-reducing-carry-in-base-b-to-carry-in-base-2">Key idea: reducing carry in base $B$ to carry in base $2$.</h2> <div style="margin-top: 1.5em;"></div> <p>Our key idea for reducing the complexity of the carry algorithm is the following:</p> <blockquote> <p>If every digit is smaller than $2B-1$, then carry propagation on a radix-$B$ digit vector behaves exactly like carry propagation in binary.</p> </blockquote> <p>To see why, recall that in base-$B$ carry, if the $i$-th digit $z_i$ is at least $B$, we propagate \(y_i = Q_B(z_i)\) to the next digit, where $Q_B(z_i)$ denotes the quotient of \(z_i\) divided by $B$. The difficulty is that once \(y_i\) is added to the $(i+1)$-th digit, the carry propagated to the $(i+2)$-th digit becomes $ Q_B(z_{i+1} + y_i), $ so the carry process becomes inherently sequential.</p> <p>However, if every digit is smaller than $2B-1$, then a much simpler structure emerges. 
In this case, $Q_B(z_i)$ can be at most $1$. Consequently, since $z_{i+1}$ is at most $2B-2$, the sum $z_{i+1}+y_i$ is at most $2B-1$, and therefore the carry propagated to the next digit is again at most $1$. Intuitively, this makes radix-$B$ carry behave like binary carry, where each position propagates only a single carry bit.</p> <p>At this point, our problem boils down to the following two tasks:</p> <ol> <li>Homomorphically reducing each digit so that it is smaller than \(2B-1\).</li> <li>Designing a carry algorithm for digit vectors whose entries are all smaller than \(2B-1\).</li> </ol> <p><br/></p> <h2 id="2-step-carry-algorithm">2-step carry algorithm</h2> <div style="margin-top: 1.5em;"></div> <p><strong>Polynomial arithmetic.</strong> Since CKKS operates slot-wise in SIMD fashion, polynomial arithmetic on digit vectors needs a separate mechanism. We simply use the DFT to evaluate polynomials in Fourier form, and then apply the iDFT to recover the result in coefficient form.</p> <p>When the plaintext space is \(\mathbb{Z}_{B^k}\), a single integer is packed into \(2k\) slots: the first \(k\) slots contain its radix-\(B\) digits, while the remaining \(k\) slots are padded with zeros. This zero-padding is introduced to avoid cyclic wrap-around in the product. Accordingly, after the iDFT step, the last \(k\) slots (the padding positions) are masked out and reset to zero.</p> <p>Let the number of CKKS slots be \(N/2\), and assume that \(2k \mid (N/2)\). Then a single ciphertext can store a total of \(N/(4k)\) integers, where the packing is obtained by simply concatenating digit vectors of length \(2k\).</p> <p><strong>Modular reduction in CKKS.</strong> Modular reduction, as discussed in [3, 4], is a key operation in our construction. 
At a high level, this algorithm first transforms a slot-encoded CKKS ciphertext into coefficient encoding, reduces it modulo the base modulus \(q_0\), and then applies bootstrapping to recover a slot-encoded CKKS ciphertext.</p> <p>The interesting point is that if the ciphertext scaling at level \(q_0\) is adjusted to \(q_0/B\), then the resulting ciphertext has exactly the same form as a coefficient-encoded BFV ciphertext with plaintext modulus \(B\). Consequently, the message naturally undergoes the operation \([\cdot]_B\), while requiring only a single bootstrapping. For convenience, we denote the modular reduction algorithm with respect to \(B\) by \(\mathsf{Mod}_B\).</p> <p><strong>Functional bootstrapping in CKKS.</strong> Functional bootstrapping [3, 5] in discrete CKKS enables the evaluation of a look-up table (LUT) function, denoted as $\mathsf{LUT}$, simultaneously with the bootstrapping of a ciphertext.</p> <p>We denote the functional bootstrapping operation in CKKS by $\mathsf{CKKS.FBT}(\cdot,~\mathsf{LUT} = \cdot)$, where the first argument is the ciphertext to be bootstrapped and the second argument specifies the LUT function.</p> <p><br/></p> <h3 id="homomorphic-digit-reduction">Homomorphic digit reduction</h3> <div style="margin-top: 1.5em;"></div> <p>After arithmetic operations, we repeatedly invoke \(\mathsf{Mod}_B\) to reduce the digit size. 
Let the input digit vector be</p> \[\mathsf{ct}\longleftrightarrow \vec z = (z_1,z_2,\dots,z_k)\in\mathbb{Z}^k.\] <p>Here, we omit the last \(k\) slots for simplicity.</p> <p>Applying \(\mathsf{Mod}_B\) yields</p> \[\mathsf{ct}_{\mathsf{mod}}\longleftrightarrow \vec z_{\mathsf{mod}} = ([z_1]_B,[z_2]_B,\dots,[z_k]_B)\in\mathbb{Z}^k.\] <p>Moreover, since \(Q_B = \frac{\mathsf{id}-\mathsf{Mod}_B}{B},\) we can additionally obtain, at the cost of one extra multiplication level,</p> \[\mathsf{ct}_{Q}\longleftrightarrow \vec z_{Q} = (Q_B(z_1),Q_B(z_2),\dots,Q_B(z_k))\in\mathbb{Z}^k.\] <p>Now, by rotating \(\mathsf{ct}_{Q}\) one position to the right and adding it to \(\mathsf{ct}_{\mathsf{mod}}\), we obtain another digit vector representing the same integer. If \(\|\vec z\|_{\infty}=U\), then the infinity norm of the resulting digit vector is bounded by \(B + Q_B(U) = B + O(U/B).\)</p> <p>We call this entire procedure <strong>LazyCarry</strong>.</p> <p>For the unique digit representations of two integers, no digit reduction is needed after addition. In contrast, after multiplication, invoking LazyCarry \(O(\log k)\) times suffices to reduce all digit values below \(2B-1\).</p> <p>In practice, for \(B=16\), multiplying two integers in unique digit representation requires only 2 to 4 bootstrappings to reduce all digit values below \(2B-1\), for bit-lengths ranging from 16 bits to 2048 bits. If a larger base \(B\) is chosen, this number can be reduced even further.</p> <p>Repeated iterations of LazyCarry reduce the digit size rapidly, making it possible to continue arithmetic operations once the digits become sufficiently small. This is in the same spirit as [2].</p> <p><br/></p> <h3 id="homomorphic-digit-carry">Homomorphic digit carry</h3> <div style="margin-top: 1.5em;"></div> <p>We now take a closer look at the carry behavior of digit vectors whose entries are all smaller than \(2B-1\). 
The figure illustrates the case \(B=16\) and \(k=4\): the top shows carry propagation in base \(B\), while the bottom shows the corresponding carry behavior after reduction to binary.</p> <div class="row mt-3"> <div class="col-sm-12 mt-3 mt-md-0 mx-auto d-block"> <figure> <picture> <source class="responsive-img-srcset" srcset="/assets/img/blog/2604_Gyeongwon/c-480.webp 480w,/assets/img/blog/2604_Gyeongwon/c-800.webp 800w,/assets/img/blog/2604_Gyeongwon/c-1400.webp 1400w," type="image/webp" sizes="95vw"/> <img src="/assets/img/blog/2604_Gyeongwon/c.png" class="img-fluid rounded z-depth-1" width="100%" height="auto" data-zoomable="" loading="eager" onerror="this.onerror=null; $('.responsive-img-srcset').remove();"/> </picture> </figure> </div> </div> <p>Under this constraint, whenever the \(i\)-th digit propagates a value to the next digit, the \(i\)-th digit decreases by \(B\) and the \((i+1)\)-th digit increases by \(1\). This is exactly the same behavior observed in the binary-reduced representation.</p> <p>Therefore, if we can obtain a ciphertext encrypting a vector $\vec c$ whose entries indicate whether each digit propagates a carry to the next position (encoded as $1$ for true and $0$ for false), then the unique carried representation of the original digit vector $\vec z$ can be computed as follows:</p> \[\vec z_{\mathsf{carried}} = \vec z - B\cdot\vec c + \rho_{-1}(\vec c) \tag{1}\] <p>For example, in the figure, $\vec c = (1,1,1,0)$.</p> <p>Reduction from a radix-\(B\) representation to a binary representation is performed via a LUT operation. The binary reduction is determined by the range of each digit; more precisely, we define</p> \[\phi(x)= \begin{cases} 0, &amp; \text{if } 0 \le x &lt; B-1,\\ 1, &amp; \text{if } x = B-1,\\ 2, &amp; \text{if } B \le x &lt; 2B-1. 
\end{cases}\] <p>It remains to design a function that determines, in the binary-reduced representation, whether a given digit propagates a value to the next position.</p> <p>We begin with the two-digit case \(z_1,z_2\), and design a function that determines whether \(z_2\) propagates a carry to the next digit.</p> <p>If \(z_2=0\), then even if \(z_1=2\) and propagates a carry into \(z_2\), the resulting value is still smaller than \(2\), and hence the output is \(0\). If \(z_2=2\), then for the same reason the output is always \(1\), regardless of the value of \(z_1\). The only nontrivial case is \(z_2=1\). In that case, the output depends on \(z_1\): if \(z_1=2\), the output should be \(1\); if \(z_1=0\), it should be \(0\); and if \(z_1=1\), the output depends on the previous digit. In the two-digit example, this case evaluates to \(0\).</p> <p>Motivated by this observation, we design a bivariate function \(f(x,y)\) that returns \(0\) if carry propagation is false, \(2\) if it is true, and \(1\) if the result is not yet determined:</p> \[f:\{0,1,2\}^2 \to \{0,1,2\}, \qquad (x,y)\mapsto \begin{cases} 0, &amp; \text{if } y=0,\\ x, &amp; \text{if } y=1,\\ 2, &amp; \text{if } y=2. \end{cases}\] <p>We then extend this to a \(k\)-digit vector \((z_1,\dots,z_k)\). The question of whether \(z_k\) propagates a carry to the next position (the case \(i&lt;k\) will be discussed shortly) can be expressed recursively using rotations via the function \(f_k\):</p> \[f_2 = f, \quad f_k(z_1,\dots,z_k) = f\bigl(z_1, f_{k-1}(z_2,\dots,z_k)\bigr).\] <p>At first glance, evaluating \(f_k\) appears to require \(k\) sequential calls to \(f\). 
However, observing that the structure of \(f\) is determined by its second argument, one can show that</p> \[f_k(z_1,\dots,z_k) = f\bigl(f_{k/2}(z_1,\dots,z_{k/2}),\, f_{k/2}(z_{k/2+1},\dots,z_k)\bigr).\] <p>Since the digits \(z_1,\dots,z_k\) are stored in separate slots and each invocation of \(f\) acts on all slots in parallel, this allows us to evaluate \(f_k\) using only \(\log k\) invocations of \(f\).</p> <p>For the \(i\)-th digit with \(i\le k\), the first \(k-i\) slots correspond to zero-padded slots from the previous integer. Hence it suffices to evaluate</p> \[f_k(0,\dots,0,z_1,\dots,z_i),\] <p>since zeros do not affect the carry behavior at all. The figure below illustrates the evaluation of $f_k$ for $k = 4$.</p> <div class="row mt-3"> <div class="col-sm-12 mt-3 mt-md-0 mx-auto d-block"> <figure> <picture> <source class="responsive-img-srcset" srcset="/assets/img/blog/2604_Gyeongwon/lc2c-480.webp 480w,/assets/img/blog/2604_Gyeongwon/lc2c-800.webp 800w,/assets/img/blog/2604_Gyeongwon/lc2c-1400.webp 1400w," type="image/webp" sizes="95vw"/> <img src="/assets/img/blog/2604_Gyeongwon/lc2c.png" class="img-fluid rounded z-depth-1" width="100%" height="auto" data-zoomable="" loading="eager" onerror="this.onerror=null; $('.responsive-img-srcset').remove();"/> </picture> </figure> </div> </div> <p>After evaluating \(f_k\), the positions that propagate a carry contain the value \(2\), while the remaining positions contain \(0\) or \(1\). We then evaluate a mapping function \(\tau\) that sends \(2\) to \(1\) and both \(0\) and \(1\) to \(0\). Finally, evaluating Equation (1) with the output of \(\tau\) completes the carry procedure. 
We call this overall procedure <strong>LazyCarry-to-Carry</strong>.</p> <p><br/></p> <h3 id="whole-algorithm-description">Whole algorithm description</h3> <div style="margin-top: 1.5em;"></div> <p>The overall procedure of our two-step homomorphic carry algorithm is as follows.<br/> Let $\mathsf{ct}$ be a ciphertext that requires digit carry propagation.<br/> We assume that an upper bound on the corresponding digit vector of $\mathsf{ct}$ can be estimated.<br/> Given this bound, let $t$ denote the number of times the <strong>LazyCarry</strong> algorithm is applied to ensure that each digit becomes smaller than $2B - 1$.</p> <p>For example, consider a 64-bit multiplication with parameters $(B,k) = (2^4, 16)$.<br/> When multiplying two unique digit representations, the resulting upper bound is given by \(k(B-1)^2 = 3600.\) In this case, two iterations of <strong>LazyCarry</strong> are sufficient: since each reduced digit is at most \(B-1 = 15\), the first iteration lowers the bound to \(15 + Q_B(3600) = 15 + 225 = 240\), and the second to \(15 + Q_B(240) = 15 + 15 = 30 &lt; 2B-1 = 31\).</p> <ol> <li> <p>Repeat <strong>LazyCarry</strong> $t$ times: \(\mathsf{ct}_{\mathsf{lazycarry}} \gets \mathsf{LazyCarry}^{(t)}(\mathsf{ct})\)</p> </li> <li> <p>Evaluate the look-up table $\phi$ via functional bootstrapping: \(\mathsf{ct}_{\mathsf{bin}} \gets \mathsf{CKKS.FBT}(\mathsf{ct}_{\mathsf{lazycarry}}, \mathsf{LUT}=\phi)\)</p> </li> <li> <p>Evaluate $f_k$: For $i=0$ to $\log k - 1$ do {\(\mathsf{ct}_{\mathsf{bin}} \gets \text{evaluate } f(\rho_{-2^i}(\mathsf{ct}_{\mathsf{bin}}), \mathsf{ct}_{\mathsf{bin}})\)}</p> </li> <li> <p>Evaluate $\tau$: \(\mathsf{ct}' \gets \tau(\mathsf{ct}_{\mathsf{bin}})\)</p> </li> <li> <p>Return \(\mathsf{ct}_{\mathsf{out}} \gets \mathsf{ct}_{\mathsf{lazycarry}} - B\cdot \mathsf{ct}' + \rho_{-1}(\mathsf{ct}')\)</p> </li> </ol> <p>Our algorithm performs \(t\) bootstrappings in Step 1, one bootstrapping in Step 2, and additional bootstrappings in Step 3. 
The extra bootstrappings in Step 3 arise in two cases: (1) when the remaining multiplicative depth is insufficient to evaluate the function \(f_k\), or (2) when bootstrapping is required for noise management. Despite these additional calls, the total number of bootstrappings remains modest in practice. The exact number of bootstrapping calls is reported in the Conclusion section.</p> <p><br/></p> <h2 id="other-integer-operations">Other integer operations</h2> <div style="margin-top: 1.5em;"></div> <p>Because the digit vector can be restored to its unique representation, one can further design non-arithmetic operations such as comparison and conditional subtraction. Moreover, by using the bit-extraction technique of [5], each digit can be decomposed into bits, thereby enabling bitwise operations as well. These non-arithmetic operations can in turn be used to homomorphically implement reduction techniques such as folding and Montgomery reduction, thereby supporting computation not only over prime-power moduli but also over more general moduli. Although we would have liked to discuss these non-arithmetic operations in this post as well, we refer the interested reader to the paper for further details.</p> <p><br/></p> <h2 id="conclusion">Conclusion</h2> <div style="margin-top: 1.5em;"></div> <p>All experiments were conducted on an AMD Ryzen 9 7900X 12-Core Processor with 125 GB RAM, running Ubuntu 20.04, using a single thread. We used the Lattigo library, a representative RNS-CKKS library, for our implementation.</p> <p>We evaluated multiplication operations for bit-lengths ranging from 16 bits to 2048 bits. 
In the third column, \(a+b\) indicates that \(a\) bootstrappings were used in LazyCarry and \(b\) bootstrappings were used in LazyCarry-to-Carry.</p> <div class="row mt-3"> <div class="col-sm-12 mt-3 mt-md-0 mx-auto d-block"> <figure> <picture> <source class="responsive-img-srcset" srcset="/assets/img/blog/2604_Gyeongwon/2-480.webp 480w,/assets/img/blog/2604_Gyeongwon/2-800.webp 800w,/assets/img/blog/2604_Gyeongwon/2-1400.webp 1400w," type="image/webp" sizes="95vw"/> <img src="/assets/img/blog/2604_Gyeongwon/2.png" class="img-fluid rounded z-depth-1" width="100%" height="auto" data-zoomable="" loading="eager" onerror="this.onerror=null; $('.responsive-img-srcset').remove();"/> </picture> </figure> </div> </div> <p>To be completely honest, a single multiplication still took a considerable amount of time. Even though our method processes many integers at once, the total runtime is by no means small.</p> <p>At first glance, BFV may appear more attractive, as homomorphic multiplication can be performed simply by invoking the scheme’s native multiplication primitive. However, the main bottleneck of BFV lies in bootstrapping, whose runtime may exceed that of our entire multiplication procedure depending on the plaintext modulus. Therefore, a fair comparison should not be based solely on raw multiplication speed, but rather should account for the cost of BFV bootstrapping.</p> <p>For instance, according to Table 4 in [6], BFV bootstrapping with a plaintext modulus of approximately 234 bits at $N = 2^{17}$ takes about 392 seconds. In contrast, our proposed framework enables subsequent computations after a single additional CKKS bootstrapping (approximately 12 seconds).</p> <p>From an amortized perspective, a more careful comparison is also needed, since the number of available BFV slots depends on the plaintext modulus.</p> <p>For DM/CGGI, by contrast, our method consistently shows better amortized performance. 
Interestingly, at the time of writing, our single-thread benchmark on the same machine showed that TFHE-rs required 77.5 seconds for 128-bit multiplication, whereas our method took 57.4 seconds. This indicates that from 128 bits onward, our method outperforms TFHE-rs not only in amortized throughput but also in latency.</p> <p>Finally, we conclude by introducing a paper for readers interested in related work [7]. While our work focuses on algorithms for performing carry homomorphically, the work introduced below is particularly impressive in that it avoids performing carry altogether.</p> <p><br/></p> <h2 id="references">References</h2> <div style="margin-top: 1.5em;"></div> <p>[1] Kim, J. “Efficient Homomorphic Integer Computer from CKKS.” TCHES 2025.</p> <p>[2] Kim, J. “Faster Homomorphic Integer Computer.” Cryptology ePrint Archive, Paper 2025/1440, 2025.</p> <p>[3] Alexandru, A., Kim, A., and Polyakov, Y. “General Functional Bootstrapping using CKKS.” CRYPTO 2025.</p> <p>[4] Kim, J. and Noh, T. “Modular Reduction in CKKS.” CIC 2025.</p> <p>[5] Bae, Y., Kim, J., Stehlé, D., and Suvanto, E. “Bootstrapping Small Integers with CKKS.” ASIACRYPT 2024.</p> <p>[6] Kim, J., Seo, J., and Song, Y. “Simpler and Faster BFV Bootstrapping for Arbitrary Plaintext Modulus from CKKS.” CCS 2024.</p> <p>[7] Brakerski, Z., Friedman, O., Golan, D., Gurny, A., Mutzari, D., and Sheinfeld, O. “REFHE: Fully Homomorphic ALU.” EUROCRYPT 2026.</p>]]></content><author><name>Gyeongwon Cha</name></author><summary type="html"><![CDATA[TL;DR: To handle large integers in FHE, a common approach is to decompose an integer into several small pieces, called digits, and perform computations based on them. One such approach is the radix-based approach, which decomposes a large integer into digits in base $B$ and carries out arithmetic on those digits via polynomial operations. 
However, after arithmetic operations, the resulting representation is no longer unique, which makes it difficult to directly perform non-arithmetic operations such as comparison or bitwise operations. The process of restoring such a disturbed digit representation back to its unique form is commonly called digit carry, and this step inherently requires non-arithmetic processing. In this post, we introduce a two-step homomorphic digit carry algorithm over CKKS. Our algorithm restores the digit representation to its unique form using $O(\log k)$ bootstrappings.]]></summary></entry><entry><title type="html">Verifiable Computation for CKKS</title><link href="https://ckks.org/blog/2026/verifiable-ckks/" rel="alternate" type="text/html" title="Verifiable Computation for CKKS"/><published>2026-03-03T04:01:00+00:00</published><updated>2026-03-03T04:01:00+00:00</updated><id>https://ckks.org/blog/2026/verifiable-ckks</id><content type="html" xml:base="https://ckks.org/blog/2026/verifiable-ckks/"><![CDATA[<style>.math-wide,.math-narrow{display:none}@media(min-width:661px){.math-wide{display:block}}@media(max-width:660px){.math-narrow{display:block}}</style> <ul> <li>Written by Ignacio Cascudo (IMDEA Software Institute), Anamaria Costache (École Polytechnique), Daniele Cozzo (IMDEA Software Institute), Dario Fiore (IMDEA Software Institute), Antonio Guimarães (IMDEA Software Institute), Eduardo Soria-Vazquez (Technology Innovation Institute)</li> <li>Based on <a href="https://ia.cr/2025/286">https://ia.cr/2025/286</a> (Crypto 2025)</li> </ul> <p><em>TL;DR: Homomorphic Encryption (HE) enables computing over encrypted data but, by itself, provides no guarantees that the computation was honestly executed. One can build “Verifiable HE” (vHE) using SNARKs, but efficiently combining HE and SNARKs in practice is a major challenge. This work introduces a blueprint for building verifiable HE schemes and its efficient instantiation for CKKS. 
Our first step is to introduce a “proof-friendly” version of CKKS, which is more amenable to proof systems, while being only slightly slower than typical RNS CKKS implementations. We then show how the problem of proving correctness of computations for such proof-friendly HE schemes can be reduced to just two sets of arithmetic relations (containing equalities and inequalities). We show that if these are satisfied, it implies the correct execution of the HE evaluation. We design Polynomial Interactive Oracle Proofs (PIOPs) for efficiently proving these relations, and we show how they can be instantiated using standard proof components. Our final construction demonstrates the feasibility of building SNARKs for proving computation of full-fledged HE schemes, opening the path for building practical verifiable HE schemes.</em></p> <hr/> <p><br/></p> <h2 id="context">Context</h2> <div style="margin-top: 1.5em;"></div> <p><strong>Homomorphic Encryption.</strong> Homomorphic Encryption (HE) is a type of encryption that allows computing on encrypted data. Among other things, this makes it possible to outsource computations without compromising on privacy. It is a very powerful primitive, which enables many applications such as secure outsourced medical analysis, private set intersection, and so on. In recent years, Machine Learning (ML) applications have become ubiquitous. While this has opened the door to many novel and powerful applications, it has also introduced privacy concerns. Indeed, ML applications inherently rely very heavily on data, but this data can turn out to be sensitive, and users may rightly be wary of sharing their private data with companies. To the rescue comes HE, in the form of Privacy-Preserving Machine Learning (PPML)! With HE, we can now encrypt the data, and evaluate an ML model directly on encrypted data. 
The result can now be recovered in plaintext, but only by whoever holds the secret key.</p> <p><strong>Verifiable Homomorphic Encryption.</strong> A significant limitation of homomorphic encryption (HE) is that it works in the trusted model, where the entity that computes over the ciphertexts is assumed to behave honestly. In particular, we must assume that the computing party (from here on, we refer to this party as the <em>server</em>) performs exactly the operation it says it does, which in practice is a rather strong assumption. In reality, without an integrity mechanism on top of the computation, the server could follow any malicious strategy: it could for example bias the result, or perform a different computation entirely. Even more concerning, without integrity mechanisms in place, the server could take advantage of the malleability of HE ciphertexts and mount a key recovery attack. Therefore, a lack of integrity can potentially lead to serious privacy leaks!</p> <h3 id="snarks">SNARKs</h3> <p><em>Succinct proof systems</em> (commonly referred to as SNARKs) are an important cryptographic primitive that adds integrity to computations. Informally, these are cryptographic proofs that allow a prover to convince a verifier that a statement is true. This is a very rich research field in and of itself, so for now, we can think of SNARKs as tools that allow us to prove a statement of the following kind: “Given a public circuit $C$ and an output $y$, I know an input $x$ such that $y=C(x)$.”</p> <p>What makes SNARKs useful in practice are the following properties:</p> <ul> <li><strong>Correctness</strong>: if the statement is true, then the verifier will accept the proof;</li> <li><strong>Soundness</strong>: a cheating prover is not able to convince the verifier about a false statement, except with negligible probability;</li> <li><strong>Succinctness</strong>: the proof is very short and verifying it should be fast, e.g. 
sublinear in the size of the computation.</li> </ul> <h3 id="verifiable-he">Verifiable HE</h3> <p>It is natural then to combine SNARKs and HE to achieve both privacy and integrity in outsourcing computations. This approach, also known as <em>Verifiable Homomorphic Encryption (vHE)</em>, does work in theory, because SNARKs can prove NP statements. Unfortunately, in practice this approach has several limitations in terms of prover efficiency.</p> <p>1) HE ciphertexts are typically pairs of elements of the polynomial ring $R_q = \mathbb{Z}_q[X]/{(X^{N}+1)}$, whereas SNARKs typically work best on computations over large finite fields;</p> <p>2) Virtually all HE schemes require so-called <em>ciphertext maintenance</em> operations (such as rescaling). These operations entail non-algebraic operations, for instance real division and rounding. In contrast, SNARKs shine at proving algebraic statements. Even worse, operations like rescaling cause the underlying algebraic structure to change during the computation, something that cannot be easily processed by traditional SNARKs.</p> <p>Naively, general-purpose SNARKs can prove ciphertext arithmetic and maintenance operations by emulating them. This is prohibitively expensive for the prover, as shown by a recent survey by Knabenhans, Viand, and Hithnawi [4].</p> <p>This motivates the research line of designing SNARKs <em>specifically tailored to HE operations</em>.</p> <p>In this blogpost, we introduce our recent result [1] that represents the state-of-the-art in this direction. Our work is a blueprint for constructing vHE schemes that can scale with large computations and have practical proving times.</p> <p><br/></p> <h2 id="the-blueprint">The blueprint</h2> <div style="margin-top: 1.5em;"></div> <p>The core contributions of our work are a blueprint for constructing practical vHE schemes in a modular way and its instantiation for the CKKS scheme. 
Such a framework consists of several building blocks that are carefully designed to be combined together to yield the final vHE scheme.</p> <figure class="figure-class"> <div class="row mt-3"> <div class="col-sm-12 mt-3 mt-md-0 mx-auto d-block"> <figure> <picture> <source class="responsive-img-srcset" srcset="/assets/img/blog/2603_DanieleCozzo/image-480.webp 480w,/assets/img/blog/2603_DanieleCozzo/image-800.webp 800w,/assets/img/blog/2603_DanieleCozzo/image-1400.webp 1400w," type="image/webp" sizes="95vw"/> <img src="/assets/img/blog/2603_DanieleCozzo/image.png" class="img-fluid rounded z-depth-1" width="100%" height="auto" data-zoomable="" loading="eager" onerror="this.onerror=null; $('.responsive-img-srcset').remove();"/> </picture> </figure> </div> </div> <figcaption class="figure-caption" style="text-align: center;"> Figure 1: The Blueprint. </figcaption> </figure> <p>The first step in our blueprint is to take an HE scheme, and modify it in such a way that it becomes more “SNARK-friendly”. The aim is to make arithmetization easier – that is to say, we aim to make the operations of the HE scheme easily expressed in a language that can then be processed by the SNARK with small overhead. We call such a scheme “proof-friendly”. In practice, such a scheme must work over a <strong>single</strong> ring with <strong>specific algebraic structures</strong> that are necessary for the SNARKs to achieve soundness. In this work, we instantiate our blueprint using the CKKS scheme, introducing a proof-friendly variant of CKKS.</p> <p>Once we have a proof-friendly HE scheme, a computation over its ciphertexts is processed according to the following steps:</p> <ul> <li><strong>Translation</strong>: The arithmetization process translates the computation into an arithmetic circuit satisfiability relation over the ring, and a finite number of range check relations over the ring. 
This is the language our SNARK “speaks”.</li> <li><strong>Proofs</strong>: These relations will be proved by specialized proof systems, in the form of <em>Polynomial Interactive Oracle Proofs (PIOPs)</em>. PIOPs are proof systems where the interaction between the prover and verifier happens via polynomials: the statements are encoded by long polynomials and the proof consists in evaluating these over points chosen uniformly at random by the verifier. Importantly, PIOPs differ from SNARKs in that they are interactive, are not succinct (meaning that the proof is typically long, of the size of the statement) and soundness holds information-theoretically. A popular and practical way of realizing SNARKs, which we follow, consists in first designing PIOPs, and then compiling them into full-fledged SNARKs through cryptographic tools (we will talk about this in a moment). In our case, we need two PIOPs: one that is specialized to prove arithmetic statements over rings, and one specialized to prove range checks over rings.</li> <li><strong>Cryptographic compiler</strong>: Finally, in order to compile these information-theoretic protocols into a SNARK we need one more tool, called a <em>Polynomial commitment</em> (PC) scheme. This is a primitive that allows the prover to commit to a large polynomial resulting in a short string, and to later prove evaluations of the committed polynomial. 
Polynomial commitment schemes are used to compile PIOPs into SNARKs: the long polynomials sent by the PIOP prover are replaced by short commitments.</li> </ul> <p><br/></p> <h2 id="proof-friendly-ckks">Proof-friendly CKKS</h2> <div style="margin-top: 1.5em;"></div> <p>We start by showing how we can build a proof-friendly version of CKKS which presents the characteristics described above while maintaining near state-of-the-art performance levels.</p> <h3 id="setting-up-the-ring">Setting up the ring</h3> <p>CKKS works over polynomial rings of the form $R_q := \mathbb{Z}_q[X]/(X^N+1)$. The first step is to set up the ring $R_q$ such that it will be easier to construct a SNARK over it. Remember, one of the properties a SNARK must satisfy is soundness: this says that if the statement is false, then the prover is only able to convince the verifier with negligible probability.</p> <p>Naively, a way to achieve this is to just repeat the proof system as many times as needed to drive the soundness error down. However, this makes the proof much larger, and one might need many repetitions to achieve negligible soundness error. The alternative is to choose the ring $R_q$ in an appropriate way, so that no repetitions are needed.</p> <p>The second requirement is, of course, efficiency. We want the prover to be fast while performing the CKKS operations. For that reason, as is typical in HE, we invoke the Chinese Remainder Theorem (CRT). Let $N$ be a power of two and $q:= \prod_{i=0}^L p_i$ for some integer $L$, such that each $p_i$ is a prime of the form $2aN/d + 1$ for some odd integer $a$ and suitable power-of-two value $d$. Then, for each $p_i$, the cyclotomic polynomial $X^N+1$ splits in \(\mathbb{Z}_{p_i}[X]\) into $k = N/d$ irreducible factors \(X^N + 1 = \prod_{j=0}^{k-1}(X^d - \zeta^{(2j+1)}),\) where \(\zeta\in\mathbb{Z}_{p_i}\) is a primitive \(2N/d\)-th root of unity. 
By the CRT, the ring \(R_{q}\) splits as</p> \[R_{q} = \prod_{i=0}^L R_{p_i} = \prod_{i=0}^L \left( \prod_{j=0}^{k-1} R_{i,j} \right), \tag{1}\] <p>with each \(R_{i,j} = \mathbb{Z}_{p_i}[X]/(X^d - \zeta^{(2j+1)})\) being a field of size $p_i^d$.</p> <p>The value $d$ will be chosen appropriately so as to guarantee soundness. For brevity, we omit a detailed discussion on choosing the value $d$, but note that typical choices would be $d=2$ or $d=4$. We refer to [1] for details on efficient arithmetic over the rings these choices entail.</p> <h3 id="algorithms">Algorithms</h3> <p>Now that we have set up the underlying ring $R_q$, we are ready to describe the main algorithms for our proof-friendly CKKS. In more detail, we present a modified version of the CKKS scheme. At a high level, these modifications aim to get around the hurdle that makes CKKS incompatible with SNARKs, namely the fact that the ring changes during the computation. In our version of CKKS, the algorithms will always work on the same underlying ring $R_q$. The trick that makes this possible is a modification of the CRT map underlying the RNS version of CKKS, which we show now.</p> <p>For all $0 \leq j \leq L$ define $q_j = \prod_{i=0}^{j} p_{i}$, and in particular $q_L = q$. Then we define \(R := \mathbb{Z}[X]/(X^N+1)\) and \(R_{q_j} := \mathbb{Z}_{q_j}[X]/(X^N+1)\) for all $0 \leq j \leq L.$ Let $\omega_{j} = (p_0, p_1, \dots, p_{j})$ be a CRT basis for $q_j$. Given $a \in R_{q}$, we define the inverse CRT map as follows:</p> <div class="math-wide"> $$CRT^{-1}_{\omega_{j}}(a) := \left( \left[ a\right]_{p_0}, \left[ a \right]_{p_1}, \dots, \left[ a \right]_{p_{j}} \right) \in R_{q}^{j+1}.$$ </div> <div class="math-narrow"> $$\begin{aligned} &amp;CRT^{-1}_{\omega_{j}}(a) := \\ &amp;~~ \left( \left[ a\right]_{p_0}, \left[ a \right]_{p_1}, \dots, \left[ a \right]_{p_{j}} \right) \in R_{q}^{j+1}. 
\end{aligned}$$ </div> <p>Note that, in general, the inverse CRT for $\omega_j$ would be defined as a map $R_{q} \rightarrow R_{p_0} \times \ldots \times R_{p_j}$. We slightly modify this, by adding an embedding from $R_{p_0} \times \ldots \times R_{p_{j}}$ to $R_{q}^{j+1}$. More precisely, our mapping looks like:</p> <div class="math-wide"> $$\begin{aligned} R_{q_0} ~~ &amp;\longrightarrow \qquad R_{p_0} \times \ldots \times R_{p_j} \qquad \longrightarrow \quad R_{q}^{j+1} \\ a ~~~ &amp;\mapsto ~ \mathbf{a} := \left( \left[ a\right]_{p_0}, \left[ a\right]_{p_1}, \dots,\left[ a\right]_{p_{j}}\right) \rightarrow \left(\mathbf{a}'_0, \ldots, \mathbf{a}'_j \right)\end{aligned}$$ </div> <div class="math-narrow"> $$\begin{aligned} R_{q_0} &amp;\longrightarrow \prod_{i=0}^{j} R_{p_i} \longrightarrow R_{q}^{j+1} \\ a &amp;\mapsto \mathbf{a} := ([a]_{p_i})_{i \in [0,j]} \\ &amp;~~~~~~~~~~~\rightarrow (\mathbf{a}'_0, \ldots, \mathbf{a}'_j) \end{aligned}$$ </div> <p>where, using RNS representation,</p> <div class="math-wide"> $$\mathbf{a}'_i = \left( \left[ \left[ a\right]_{p_i} \right]_{p_0}, \left[ \left[ a\right]_{p_i}\right]_{p_1}, \dots,\left[ \left[ a\right]_{p_i} \right]_{p_{j}}, \underbrace{0 \ldots, 0}_{L-j ~\text{times}}\right) \in R_q.$$ </div> <div class="math-narrow"> $$\begin{aligned} \mathbf{a}'_i = \bigg( &amp;\left[ \left[ a\right]_{p_i} \right]_{p_0}, \left[ \left[ a\right]_{p_i}\right]_{p_1}, \dots,\\ &amp; ~~~~~ \left[ \left[ a\right]_{p_i} \right]_{p_{j}}, \underbrace{0 \ldots, 0}_{L-j ~\text{times}} \bigg) \in R_q. 
\end{aligned}$$ </div> <p>Letting $Q_{i} = q_L/p_i$ and \(\hat{Q}_{i} = \left[(q_L/p_i)^{-1}\right]_{p_i}\) be the usual constants for CRT recomposition, we define</p> <div class="math-wide"> $$z_j := \sum_{i=0}^{j} Q_i\hat{Q}_i = CRT(\underbrace{1, \dots, 1}_{j+1~\text{times}}, 0, \dots, 0) \in R_{q}.$$ </div> <div class="math-narrow"> $$\begin{aligned} z_j :=&amp; \sum_{i=0}^{j} Q_i\hat{Q}_i \\ =&amp; ~CRT(\underbrace{1, \dots, 1}_{j+1~\text{times}}, 0, \dots, 0) \in R_{q}. \end{aligned}$$ </div> <p>In other words, $z_j$ is congruent to $1$ modulo each of $p_0, \dots, p_j$ and to $0$ modulo the remaining primes $p_{j+1}, \dots, p_L$. For each $0 \leq j \leq L$, the CRT recomposition vector for a given $a \in R_{q}$ is:</p> <div class="math-wide"> $$ PW_{\omega_{j}}(a) := \left(\left[ aQ_{0}\hat{Q}_{0} \right]_{q}, \left[ aQ_{1}\hat{Q}_{1} \right]_{q}, \dots, \left[ aQ_{j}\hat{Q}_{j} \right]_{q} \right) \in R_q^{j+1}.$$ </div> <div class="math-narrow"> $$\begin{aligned} PW_{\omega_{j}}(a) := \bigg( &amp;\left[ aQ_{0}\hat{Q}_{0} \right]_{q}, \left[ aQ_{1}\hat{Q}_{1} \right]_{q},\\ &amp;\dots, \left[ aQ_{j}\hat{Q}_{j} \right]_{q} \bigg) \in R_q^{j+1}. \end{aligned}$$ </div> <p>For any $a, b \in R_{q}$ and any $j$, the following holds:</p> <div class="math-wide"> $$ a \cdot b \cdot z_j \equiv \langle PW_{\omega_{j}}(a), CRT^{-1}_{\omega_{j}}(b) \rangle \cdot z_j \pmod{q_0}. $$ </div> <div class="math-narrow"> $$\begin{aligned} a \cdot b \cdot z_j \equiv \langle PW_{\omega_{j}}(a),~&amp; CRT^{-1}_{\omega_{j}}(b) \rangle \cdot z_j \\ &amp;~~~\pmod{q_0}. \end{aligned}$$ </div> <p>This follows from a direct application of the CRT in the ideal defined by $z_j$.</p> <p>We are ready to describe the algorithms defining our proof-friendly version of CKKS. Let $q_{0} &lt; q_{1} &lt; \dots &lt; q_{D-1}$ be a chain of moduli for a circuit of depth $D$ and $(\omega_{0}, \omega_{1}, \dots, \omega_{D-1})$ their respective CRT bases, $\chi_\text{key}$ be the secret key distribution over $R$, and $\chi_\text{err}, \chi_\text{enc}$ be discrete Gaussian distributions over $R_{q}$. 
Below we present our version of the CKKS scheme. For brevity, we only present the algorithms that differ from “regular CKKS”.</p> <ul> <li>$\mathsf{EvalKeyGen}:$ Let the secret key be $s$, and assume we are performing a key switching from $s^2$ to $s$. Sample $a_i \leftarrow R_{q}$ and $e_i \leftarrow \chi_\text{err}$ for $i= 0, \dots, L$. Compute $b_i = - a_i \cdot s + e_i + PW_{\omega_{L}}(s^2)[i] \bmod q$. For each level $l \in \{ 0, \dots, D - 1\}$, compute</li> </ul> <div class="math-wide"> $$\mathfrak{evk}_l := (\mathfrak{evk}_{l,0}, \mathfrak{evk}_{l,1}) \leftarrow \left( (z_l b_i)_{i=0, \dots, l}, (z_l a_i)_{i=0, \dots, l} \right) \in \left( R_{q}^2 \right)^{l+1}.$$ </div> <div class="math-narrow"> $$\begin{aligned} \mathfrak{evk}_l &amp;:= (\mathfrak{evk}_{l,0}, \mathfrak{evk}_{l,1}) \\ \leftarrow&amp; \left( (z_l b_i)_{i=0, \dots, l}, (z_l a_i)_{i=0, \dots, l} \right) \in \left( R_{q}^2 \right)^{l+1}. \end{aligned}$$ </div> <p>Notice that all key switching keys are generated in $R_{q}$ for all levels, but we <em>manually</em> change levels by moving them to the ideals defined by $z_l$. In practice, zeroed RNS components do not need to be processed, yielding performance similar to that of typical RNS implementations.</p> <ul> <li> <p>$\mathsf{Mult}(ct_0, ct_1, l):$ Multiplication of two ciphertexts $ct_0 := (ct_0[0], ct_0[1])$, $ct_1 := (ct_1[0], ct_1[1])$ at the same multiplicative level $l$ goes as follows. First, a <em>pre-multiplication</em> performs a polynomial multiplication</p> <div class="math-wide"> $$ (d_0, d_1, d_2) := (ct_0[0] \cdot ct_1[0], ct_0[0] \cdot ct_1[1] + ct_1[0] \cdot ct_0[1], ct_0[1] \cdot ct_1[1]) \in R_q^3. $$ </div> <div class="math-narrow"> $$\begin{aligned} (d_0, d_1, d_2) := \big(ct_0[0] \cdot ct_1[0], ~~~~~~~~~~~&amp;\\ ct_0[0] \cdot ct_1[1] + ct_1[0] \cdot ct_0[1], &amp;\\ ct_0[1] \cdot ct_1[1] \big) \in R_q^3. 
&amp; \end{aligned}$$ </div> <p>Then, we perform the <em>key switching</em> as follows.</p> <div class="math-wide"> $$ D_i := d_i + \langle CRT_{\omega_l}^{-1}(d_2), \mathfrak{evk}_{l,i} \rangle \in R_{q}, \quad i=0,1. $$ </div> <div class="math-narrow"> $$\begin{aligned} D_i := d_i + \langle CRT_{\omega_l}^{-1}(d_2), &amp;\mathfrak{evk}_{l,i} \rangle \in R_{q},\\ \text{ for } i=0,1.~&amp; \end{aligned}$$ </div> <p>Then we <em>re-scale</em> $D_0, D_1$ as follows. Let \(p_l^{-1} = \left[q_{l+1}/q_{l}\right]_{q_{j+1} } \cdot z_l \in R_{q}\). Then compute and output the final ciphertext</p> <div class="math-wide"> $$ c_i := \left( D_i - \left[D_i\right]_{p_l}\right) p^{-1}_l \in R_{q},\quad i=0,1. $$ </div> <div class="math-narrow"> $$\begin{aligned} c_i := \left( D_i - \left[D_i\right]_{p_l}\right) p^{-1}_l \in R_{q},&amp;\\ \text{ for } i=0,1.~~~~~~~~~~~~~~~&amp; \end{aligned}$$ </div> </li> </ul> <p>A detailed noise analysis can be found in the full version of our paper. We omit it here for brevity, but we show that our CKKS incurs <em>at most</em> one additional bit of noise.</p> <p><br/></p> <h2 id="arithmetization">Arithmetization</h2> <div style="margin-top: 1.5em;"></div> <p>Now that we have a proof-friendly HE scheme, we describe how we can translate it into a finite set of relations over $R_q$, which, if satisfied, imply the correct execution of the scheme. Although we focus specifically on our (proof-friendly) CKKS additions and multiplications, we note that this process could be extended to any operations or schemes satisfying our desirable “proof-friendly” characteristics.</p> <h3 id="additions">Additions</h3> <p>Let two ciphertexts \(a = (a_0, a_1)\), \(b=(b_0, b_1) \in R_q^2\) be at the same level $l$. 
Adding $a$ and $b$ can be readily expressed as an arithmetic circuit satisfiability relation over $R_q$, since the resulting ciphertext \(c = (c_0, c_1) := (a_0 + b_0, a_1 + b_1)\) is simply the component-wise addition of $a$ and $b$ over $R_q^2$.</p> <h3 id="multiplications">Multiplications</h3> <p>Suppose now that we want to multiply the ciphertexts $a$ and $b$, still at level $l$. Being at level $l$ means that the components of $a$ and $b$ are actually elements of $R_{q_l}$, although our CKKS treats them as elements of $R_q$. We know that a CKKS multiplication is composed of a pre-multiplication, followed by a key switching and finally by a rescaling. Let’s analyze each of these operations in turn.</p> <p>Pre-multiplication can be easily expressed as a sequence of arithmetic operations over $R_q$:</p> <div class="math-wide"> $$ (d_0, d_1, d_2):= (a_0 b_0, a_0b_1 + a_1b_0, a_1b_1). \tag{2} $$ </div> <div class="math-narrow"> $$\begin{aligned} (d_0, d_1, d_2&amp;):= \\ (a_0 b_0, &amp;a_0b_1 + a_1b_0, a_1b_1). \end{aligned}\tag{2}$$ </div> <p>For the key switching</p> <div class="math-wide"> $$ D_0 = d_0 + \langle \mathfrak{evk}_0, CRT^{-1}_{\omega_l}(d_2) \rangle, \quad D_1 = d_1 + \langle \mathfrak{evk}_1, CRT^{-1}_{\omega_l}(d_2) \rangle, \tag{3} $$ </div> <div class="math-narrow"> $$\begin{aligned} D_0 &amp;= d_0 + \langle \mathfrak{evk}_0, CRT^{-1}_{\omega_l}(d_2) \rangle, \\ D_1 &amp;= d_1 + \langle \mathfrak{evk}_1, CRT^{-1}_{\omega_l}(d_2) \rangle, \end{aligned}\tag{3}$$ </div> <p>we come across the first obstacle. The term \(\langle \mathfrak{evk}_0, CRT^{-1}_{\omega_l}(d_2) \rangle\) involves expressing $d_2$ with respect to the RNS basis \(\omega_l\), which is not an arithmetic operation. Instead of proving the decomposition itself, we let the prover give the verifier its outcome. 
In other words, the prover sends inputs \(w_{ks, 0}, \dots, w_{ks, l} \in R_q\) satisfying</p> \[d_2 = \sum_{i=0}^l PW_{\omega_l}(1)[i] \cdot w_{ks, i}, \tag{4}\] <p>that is to say, they recompose to $d_2$ under the CRT map, and</p> \[\Vert w_{ks, i} \Vert &lt; p_i, \quad i = 0, \dots, l, \tag{5}\] <p>which means that they are bounded by the RNS primes of the basis $\omega_l$ (remember we are at level $l$). Together, (4) and (5) prove that the new inputs are indeed the CRT decomposition of $d_2$, and thus they can be used in (3) to compute the values $D_i$ and continue the computation. Next is rescaling</p> <div class="math-wide"> $$ c_0 = (D_0 - [D_0]_{p_l})\cdot p_l^{-1}, \quad c_1 = (D_1 - [D_1]_{p_l})\cdot p_l^{-1}. $$ </div> <div class="math-narrow"> $$\begin{aligned} c_0 &amp;= (D_0 - [D_0]_{p_l})\cdot p_l^{-1},\\ c_1 &amp;= (D_1 - [D_1]_{p_l})\cdot p_l^{-1}. \end{aligned}$$ </div> <p>Note that this is just a component-wise Euclidean division of $D_i$ by $p_l$. Using the same strategy as above, we let the prover introduce values $w_{quo, 0}, w_{quo, 1}$ and $w_{rmd, 0}, w_{rmd, 1}$ and prove that these are the quotients and remainders for the above equations. Specifically, the prover shows that</p> <div class="math-wide"> $$ D_i = p_l \cdot w_{quo, i} + w_{rmd, i}, \quad i = 0,1, \tag{6} $$ </div> <div class="math-narrow"> $$\begin{aligned} D_i = p_l \cdot w_{quo, i} &amp;+ w_{rmd, i}, \\ i = 0,&amp; 1, \end{aligned}\tag{6}$$ </div> <p>and</p> \[\Vert w_{quo, i} \Vert &lt; q/p_l, \quad i = 0, 1, \tag{7}\] <p>and</p> \[\Vert w_{rmd,i} \Vert &lt; p_l, \quad i=0,1. 
\tag{8}\] <p>Putting (2), (3), (4) and (6) together yields the following arithmetic circuit satisfiability relation over $R_q$:</p> <div class="math-wide"> $$ \begin{cases} p_l \cdot w_{quo, 0} + w_{rmd, 0} - a_0b_0 - \sum_{i=0}^l \mathfrak{evk}_0[i]\cdot w_{ks, i} = 0,\\ p_l \cdot w_{quo, 1} + w_{rmd, 1} - a_0b_1 - a_1b_0 - \sum_{i=0}^l \mathfrak{evk}_1[i]\cdot w_{ks, i} = 0,\\ d_2 - \sum_{i=0}^l PW_{\omega_l}(1)[i] \cdot w_{ks, i} = 0, \end{cases} $$ </div> <div class="math-narrow"> $$ \begin{cases} \begin{aligned} p_l \cdot w_{quo, 0} &amp;+ w_{rmd, 0} - a_0b_0 \\ &amp;~~ - \sum_{i=0}^l \mathfrak{evk}_0[i]\cdot w_{ks, i} = 0, \end{aligned}\\ \begin{aligned} p_l \cdot w_{quo, 1} &amp;+ w_{rmd, 1} - a_0b_1 - a_1b_0 \\ &amp;~~ -\sum_{i=0}^l \mathfrak{evk}_1[i]\cdot w_{ks, i} = 0, \end{aligned}\\ d_2 - \sum_{i=0}^l PW_{\omega_l}(1)[i] \cdot w_{ks, i} = 0, \end{cases} $$ </div> <p>and the $l+5$ range check relations over $R_q$ coming from (5), (7) and (8):</p> \[\Vert w_{ks, 0} \Vert &lt; p_0, \dots, \Vert w_{ks, l} \Vert &lt; p_l,\] \[\Vert w_{rmd,0} \Vert, \Vert w_{rmd,1} \Vert &lt; p_l,\] \[\Vert w_{quo, 0} \Vert, \Vert w_{quo, 1} \Vert &lt; q/p_l.\] <p>We have shown how to arithmetize a single addition and a single multiplication. A similar but much more involved argument can be made for a general circuit made of CKKS additions and multiplications. The strategy is to organize the CKKS circuit into multiplicative layers, where consecutive additions are grouped together and are followed by a multiplication gate. Now, instead of single $R_q$ values, one has to reason with $R_q$ vectors. For the details, we refer to Sec. 
4 of our paper.</p> <h3 id="to-summarize">To summarize</h3> <p>To recap, our proof-friendly CKKS has the following properties that make it particularly suitable for proof systems:</p> <p>1) A carefully designed underlying ring that gives large enough exceptional sets while keeping arithmetic fast;</p> <p>2) A ring that does not change during the computation, thanks to our re-design of the rescaling algorithm;</p> <p>3) A scheme design whose noise analysis shows that it tolerates looser bounds, which in turn enables batching of range proofs.</p> <p>At this point, one might ask: do these modifications have an impact on efficiency? After all, we do not change rings in our proof-friendly CKKS so, in particular, the ciphertext size does not decrease as we go through homomorphic computations. As it turns out, staying in the same ring <em>does not</em> impact the efficiency of CKKS operations! Intuitively, this is because we re-embed the ciphertexts into the initial ring $R_q$ by zeroing the relevant slots in the RNS representation. The prover sees those zeros and simply ignores them: after all, operations involving zeros do not change the result.</p> <p>We implemented and benchmarked our proof-friendly version of CKKS. Our implementation introduces an overhead of at most 20% for ciphertext multiplications over a “regular” instantiation of the scheme, while still being faster than commonly used libraries such as HElib. This slowdown is mainly due to the use of an incomplete NTT. The source code is available on <a href="https://github.com/vfhe/proof-friendly-CKKS">GitHub</a>.</p> <p><br/></p> <h2 id="a-first-instantiation">A first instantiation</h2> <div style="margin-top: 1.5em;"></div> <p>The framework we propose reduces the problem of proving CKKS operations to the problem of designing PIOPs for two specific relations: arithmetic circuit satisfiability over $R_q$ and range checks for $R_q$ vectors. 
This problem can now be solved with the use of black-box components, making our solution highly modular. In the paper, we make some specific choices to provide a first instantiation that concretely demonstrates practical feasibility. By themselves, some of these components are also independent contributions, as we designed them to exploit the particular characteristics of our arithmetic structures. Ultimately, however, one could pick and choose different instantiations, and in doing so, optimise for different efficiency metrics.</p> <p>Our instantiation of the framework consists of the following components:</p> <ul> <li>Arithmetic circuit relations are proven with a “custom” version of the GKR protocol [2] over rings. This variant crucially takes advantage of the particular structure of the circuit induced by our previous arithmetization, which results in a GKR circuit of constant depth (consisting of only $4$ layers), independent of the size or depth of the HE circuit.</li> <li>Range checks are proven using lookup arguments, i.e., proofs that convince the verifier that a value belongs to a table of values $T_B$. For example, this table may contain all values within a certain bound $B$. However, since CKKS requires checking large bounds (e.g., $B$ can have between $50$ and $300$ bits, depending on the level), and the ring dimension is concretely large (typically \(N \in \{2^{13}, 2^{17}\}\)), the public table $T_B$ would be too large to even represent. To overcome this issue, we rely on the recent decomposition technique of Lasso [5], which consists of splitting a large table into smaller ones, so that one can efficiently perform lookups into them.</li> <li>The last component for building a succinct argument is a polynomial commitment for <em>multilinear</em> polynomials over $R_q$, since those are used to encode the messages sent by the prover in our PIOPs. 
First, we use the splitting (1) of $R_q$ into the product of finite fields to reduce the problem of designing a PC for multilinear polynomials over $R_q$ into that of designing PCs for multilinear polynomials over finite fields. The second contribution here is to use a field-agnostic PC, Brakedown [3], and modify it in such a way that we can use the limited algebraic structure of our fields to gain in efficiency.</li> </ul> <p><br/></p> <h2 id="future-work">Future work</h2> <div style="margin-top: 1.5em;"></div> <p>The relevance of our framework is mainly on the abstraction level. We are able to provide a blueprint for constructing vHE schemes where verification is asymptotically fast and the prover can potentially scale up well with the size of the circuit. Our blueprint is realized in a modular way by combining specific building blocks. We instantiate these by modifying and optimizing recent constructions from proof systems literature. We demonstrate that our building blocks can be practically instantiated. Compared to previous literature, our results for small (depth-1) circuits indicate similar performance levels as [4], which was the state of the art on concrete performance for verifiable RNS HE schemes. However, contrary to [4], we verify full-featured RNS-based leveled HE schemes, including key switching and rescaling operations, which enables our solution to scale to larger circuits. In contrast, the performance in the previous approach [4] would deteriorate exponentially with the circuit depth.</p> <p>The interested reader can also check online the recordings of our talks:</p> <ul> <li>For an FHE introduction check the <a href="https://www.youtube.com/watch?v=nAdAs56TxvE">talk at FHE.org</a></li> <li>For a more SNARK-y perspective check the <a href="https://www.youtube.com/watch?v=QskB0USXFrU">talk at ZKProof</a></li> </ul> <p>The next challenge is of course practical: to show that our framework is really able to scale up with large circuits. 
So stay tuned!</p> <hr/> <p><br/></p> <h2 id="references">References</h2> <div style="margin-top: 1.5em;"></div> <p>[1] I. Cascudo, A. Costache, D. Cozzo, D. Fiore, A. Guimaraes and E. Soria-Vazquez. “Verifiable Computation for Approximate Homomorphic Encryption Schemes.” CRYPTO 2025.</p> <p>[2] S. Goldwasser, Y. T. Kalai, and G. N. Rothblum. “Delegating Computation: Interactive Proofs for Muggles.” Journal of the ACM 2015.</p> <p>[3] A. Golovnev, J. Lee, S. T. V. Setty, J. Thaler, and R. S. Wahby. “Brakedown: Linear-time and Field-agnostic SNARKs for R1CS.” CRYPTO 2023.</p> <p>[4] C. Knabenhans, A. Viand, and A. Hithnawi. “Towards Robust FHE for the Real World.” Real World Crypto 2024.</p> <p>[5] S. T. V. Setty, J. Thaler, and R. S. Wahby. “Unlocking the Lookup Singularity with Lasso.” EUROCRYPT 2024.</p>]]></content><author><name>Ignacio Cascudo, Anamaria Costache, Daniele Cozzo, Dario Fiore, Antonio Guimarães, Eduardo Soria-Vazquez</name></author><summary type="html"><![CDATA[TL;DR: Homomorphic Encryption (HE) enables computing over encrypted data but, by itself, provides no guarantees that the computation was honestly executed. One can build "Verifiable HE" (vHE) using SNARKs, but efficiently combining HE and SNARKs in practice is a major challenge. This work introduces a blueprint for building verifiable HE schemes and its efficient instantiation for CKKS. Our first step is to introduce a "proof-friendly" version of CKKS, which is more amenable to proof systems, while being only slightly slower than typical RNS CKKS implementations. We then show how the problem of proving correctness of computations for such proof-friendly HE schemes can be reduced to just two sets of arithmetic relations (containing equalities and inequalities). We show that if these are satisfied, it implies the correct execution of the HE evaluation. 
We design Polynomial Interactive Oracle Proofs (PIOPs) for efficiently proving these relations, and we show how they can be instantiated using standard proof components. Our final construction demonstrates the feasibility of building SNARKs for proving computation of full-fledged HE schemes, opening the path for building practical verifiable HE schemes.]]></summary></entry><entry><title type="html">Orion: A Fully Homomorphic Encryption Framework for Deep Learning</title><link href="https://ckks.org/blog/2026/orion/" rel="alternate" type="text/html" title="Orion: A Fully Homomorphic Encryption Framework for Deep Learning"/><published>2026-02-02T04:00:00+00:00</published><updated>2026-02-02T04:00:00+00:00</updated><id>https://ckks.org/blog/2026/orion</id><content type="html" xml:base="https://ckks.org/blog/2026/orion/"><![CDATA[<ul> <li>Written by <a href="https://austinebel.net/">Austin Ebel</a>, <a href="https://kvgarimella.github.io/">Karthik Garimella</a>, <a href="https://brandonreagen.com/">Brandon Reagen</a> (New York University)</li> <li>Based on <a href="https://arxiv.org/pdf/2311.03470">https://arxiv.org/pdf/2311.03470</a> (ASPLOS 2025)</li> </ul> <p><em>TL;DR: Orion is a framework that compiles PyTorch neural network models into efficient CKKS FHE programs for encrypted inference. Orion automatically handles low-level FHE details such as data packing, bootstrap placement, and precision management. Orion is open-sourced at: <a href="https://github.com/baahl-nyu/orion">https://github.com/baahl-nyu/orion</a>.</em></p> <hr/> <p><br/></p> <h2 id="1-introduction">1. Introduction</h2> <div style="margin-top: 1.5em;"></div> <p>The CKKS FHE scheme enables cryptographically secure outsourced computing, which has broad implications for areas such as health or finance. However, writing FHE programs, especially FHE neural networks, remains a challenge given the low-level primitives that CKKS exposes: addition, multiplication, and rotation of encrypted vectors. 
Furthermore, auxiliary FHE operations that are necessary for deep computations (such as bootstrapping and scale management) only make this harder.</p> <p>A user of FHE might ask: How and when should bootstraps be placed during inference? What FHE algorithm should be used for computing convolutions? How can we approximate non-linear activations? In this blog, we introduce our framework Orion, which handles each of these issues automatically, making it easier to write FHE neural network programs. The figure below describes three key aspects we had in mind when building Orion.</p> <div style="width: 95%; margin: 0 auto;"> <figure> <picture> <source class="responsive-img-srcset" srcset="/assets/img/blog/2602_Austin/orion_three_pillars.svg" sizes="95vw"/> <img src="/assets/img/blog/2602_Austin/orion_three_pillars.svg" class="img-fluid" width="100%" height="auto" data-zoomable="" loading="eager" onerror="this.onerror=null; $('.responsive-img-srcset').remove();"/> </picture> </figure> <div class="caption" style="text-align: left; margin-top: -0.5em;"> Figure 1: Orion's high-level goals: (left) vector packing for fast linear transforms, (middle) automated bootstrap placement and level management, and (right) an end-user workflow that stays close to PyTorch. </div> </div> <p>First, we automatically handle FHE packing in order to leverage fast CKKS linear transformation algorithms. Our packing strategy ensures that each linear layer consumes <em>only one multiplicative level</em> regardless of the linear layer configuration. Second, we entirely automate both bootstrap and scale management, which enables Orion to run <em>deep</em> neural networks such as ResNet-50 while still maintaining precision. 
Finally, we wanted Orion to be <em>accessible</em>; writing FHE neural networks in Orion requires only knowledge of PyTorch.</p> <p>Below, we’ll cover the details of our convolution algorithm and bootstrap placement strategy.</p> <p><br/></p> <h2 id="2-efficient-linear-transform-algorithms">2. Efficient Linear Transform Algorithms</h2> <div style="margin-top: 1.5em;"></div> <p>Linear transformations are a core building block in neural networks, from convolutional neural networks (convolutions, average pooling, and final head layers) to modern-day transformer architectures (QKV projections, feed-forward MLPs, and even RoPE). Under CKKS, it helps to think about linear transforms in terms of three constraints: how well we use ciphertext slots, how many multiplicative levels we consume, and how many expensive homomorphic operations we require (in particular, key-switches induced by ciphertext rotations).</p> <p>As a concrete example, consider outsourced neural inference where we want to compute a matrix–vector product between a plaintext weight matrix and an encrypted ciphertext vector. Since the weight matrix is unencrypted, it can be <em>packed</em> into CKKS plaintext vectors in different ways, and that packing choice largely determines how many rotations and levels we pay for.</p> <p>A natural first attempt is to pack each row of the matrix into a CKKS vector and compute a dot product against the ciphertext. In practice, row-based packing tends to be a poor fit for CKKS: it requires padding dimensions to powers of two, it computes dot products <em>within</em> ciphertexts (often leading to low slot utilization), and it typically consumes extra multiplicative depth to consolidate partial sums. 
It also scales poorly in rotations for dense matrix–vector products.</p> <div style="width: 85%; margin: 0 auto;"> <figure> <picture> <source class="responsive-img-srcset" srcset="/assets/img/blog/2602_Austin/diagonal_method.svg" sizes="95vw"/> <img src="/assets/img/blog/2602_Austin/diagonal_method.svg" class="img-fluid rounded" width="100%" height="auto" data-zoomable="" loading="eager" onerror="this.onerror=null; $('.responsive-img-srcset').remove();"/> </picture> </figure> <div class="caption" style="text-align: left; margin-top: -0.5em;"> Figure 2: The diagonal method for plaintext-ciphertext matrix–vector products. </div> </div> <p>Orion instead targets <strong>diagonal-based</strong> matrix–vector product algorithms in which the <em>diagonals</em> of the unencrypted matrix are packed into CKKS vectors. These algorithms have been well-optimized by the cryptographic community, and they align well with how CKKS bootstrapping itself is implemented (which relies on large linear transforms). Rather than computing dot products within intermediate ciphertexts, diagonal-based approaches compute dot products <em>across</em> ciphertexts and then aggregate them into a single packed ciphertext output.</p> <p>A standard way to reduce the rotation count further is the baby-step giant-step (BSGS) strategy. Informally, BSGS reuses ciphertext rotations by reorganizing which shifts happen on ciphertexts and which shifts can be absorbed into plaintext preprocessing of packed diagonals. 
This reduces the rotation count from $O(n)$ to $O(\sqrt{n})$ for a matrix with $n$ nonzero diagonals, while still consuming only <strong>one</strong> multiplicative level.</p> <div style="width: 85%; margin: 0 auto;"> <figure> <picture> <source class="responsive-img-srcset" srcset="/assets/img/blog/2602_Austin/bsgs_method.svg" sizes="95vw"/> <img src="/assets/img/blog/2602_Austin/bsgs_method.svg" class="img-fluid rounded" width="100%" height="auto" data-zoomable="" loading="eager" onerror="this.onerror=null; $('.responsive-img-srcset').remove();"/> </picture> </figure> <div class="caption" style="text-align: left; margin-top: -0.5em;"> Figure 3: BSGS reduces ciphertext rotations by leveraging the fact that matrix diagonals can be cheaply rotated before being encoded. </div> </div> <p>In Orion, we combine BSGS with low-level cryptographic optimizations (in particular, <em>double hoisting</em>) to reduce the amortized cost per rotation. Once we treat “fast matrix–vector products” as a first-class compiler target, we can build higher-level neural network layers on top of a kernel that is both level-efficient (one level per matrix–vector product) and rotation-efficient (BSGS + hoisting).</p> <p><br/></p> <h2 id="3-convolutions-as-matrix-vector-products">3. Convolutions as Matrix-Vector Products</h2> <div style="margin-top: 1.5em;"></div> <p>Now that we have a blueprint for fast matrix–vector products, a natural question is: can we apply it to convolutions? If so, then the same principles would extend to the much wider class of <em>convolutional</em> neural nets.</p> <p>The standard approach for this is known as the Toeplitz formulation. Here, the input image is flattened into a vector and the kernel is expanded into a matrix. 
Each row of this Toeplitz matrix corresponds to one position of the filter as it slides over the input image: the row holds the kernel weights, placed at the input entries the filter covers at that position.</p> <div style="width: 80%; margin: 0 auto;"> <figure> <picture> <source class="responsive-img-srcset" srcset="/assets/img/blog/2602_Austin/toeplitz_animation.webp" sizes="95vw"/> <img src="/assets/img/blog/2602_Austin/toeplitz_animation.webp" class="img-fluid rounded" width="100%" height="auto" data-zoomable="" loading="eager" onerror="this.onerror=null; $('.responsive-img-srcset').remove();"/> </picture> </figure> <div class="caption" style="text-align: left; margin-top: -0.5em;"> Figure 4: The Toeplitz formulation of a single-input, single-output (SISO) convolution. </div> </div> <p>The animation in Figure 4 shows this for SISO convolutions, although the approach naturally extends to multi-input, multi-output convolutions. We use the following code block in Orion to perform the initial packing of <strong>arbitrary</strong> convolutions (e.g., any stride, padding, or dilation) directly in PyTorch.</p> <details> <summary>Click to expand code</summary> <div style="font-size: 0.9em;"> <figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="k">def</span> <span class="nf">construct_toeplitz_matrix</span><span class="p">(</span><span class="n">input_shape</span><span class="p">,</span> <span class="n">output_shape</span><span class="p">,</span> <span class="n">conv_layer</span><span class="p">):</span>
    <span class="sh">"""</span><span class="s">
    Construct the Toeplitz representation of a convolutional layer.
    
    The Toeplitz matrix reformulates convolution as matrix-vector multiplication:
        output_flat = toep_matrix @ input_flat
    
    Args:
        input_shape (tuple): Shape of input tensor (N, Ci, Hi, Wi)
        output_shape (tuple): Shape of output tensor (N, Co, Ho, Wo)
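        conv_layer (nn.Conv2d): PyTorch convolutional layer. Only the first entry
            of padding, stride, and dilation is read, so each is assumed to be
            equal along height and width; groups=1 is assumed.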
    
    Returns:
        torch.Tensor: Toeplitz matrix of shape (Co * Ho * Wo, Ci * Hi * Wi)
    </span><span class="sh">"""</span>
    <span class="c1"># Unpack shapes
</span>    <span class="n">N</span><span class="p">,</span> <span class="n">Ci</span><span class="p">,</span> <span class="n">Hi</span><span class="p">,</span> <span class="n">Wi</span> <span class="o">=</span> <span class="n">input_shape</span>
    <span class="n">N</span><span class="p">,</span> <span class="n">Co</span><span class="p">,</span> <span class="n">Ho</span><span class="p">,</span> <span class="n">Wo</span> <span class="o">=</span> <span class="n">output_shape</span>
    
    <span class="c1"># Extract convolution parameters
</span>    <span class="n">kernel_weights</span> <span class="o">=</span> <span class="n">conv_layer</span><span class="p">.</span><span class="n">weight</span><span class="p">.</span><span class="n">data</span>  <span class="c1"># shape: (Co, Ci, kH, kW)
</span>    <span class="n">_</span><span class="p">,</span> <span class="n">_</span><span class="p">,</span> <span class="n">kH</span><span class="p">,</span> <span class="n">kW</span> <span class="o">=</span> <span class="n">kernel_weights</span><span class="p">.</span><span class="n">shape</span>
    <span class="n">padding</span> <span class="o">=</span> <span class="n">conv_layer</span><span class="p">.</span><span class="n">padding</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span>
    <span class="n">stride</span> <span class="o">=</span> <span class="n">conv_layer</span><span class="p">.</span><span class="n">stride</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span>
    <span class="n">dilation</span> <span class="o">=</span> <span class="n">conv_layer</span><span class="p">.</span><span class="n">dilation</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span>
    
    <span class="c1"># Compute padded input dimensions
</span>    <span class="n">Hi_padded</span> <span class="o">=</span> <span class="n">Hi</span> <span class="o">+</span> <span class="mi">2</span> <span class="o">*</span> <span class="n">padding</span>
    <span class="n">Wi_padded</span> <span class="o">=</span> <span class="n">Wi</span> <span class="o">+</span> <span class="mi">2</span> <span class="o">*</span> <span class="n">padding</span>
    
    <span class="c1"># Initialize Toeplitz matrix
</span>    <span class="n">num_output_elements</span> <span class="o">=</span> <span class="n">Co</span> <span class="o">*</span> <span class="n">Ho</span> <span class="o">*</span> <span class="n">Wo</span>
    <span class="n">num_padded_input_elements</span> <span class="o">=</span> <span class="n">Ci</span> <span class="o">*</span> <span class="n">Hi_padded</span> <span class="o">*</span> <span class="n">Wi_padded</span>
    <span class="n">toep_matrix</span> <span class="o">=</span> <span class="n">torch</span><span class="p">.</span><span class="nf">zeros</span><span class="p">(</span><span class="n">num_output_elements</span><span class="p">,</span> <span class="n">num_padded_input_elements</span><span class="p">)</span>
    
    <span class="c1"># Create index grid for the padded input (Ci, Hi_padded, Wi_padded)
</span>    <span class="n">padded_indices</span> <span class="o">=</span> <span class="n">torch</span><span class="p">.</span><span class="nf">arange</span><span class="p">(</span>
        <span class="n">Ci</span> <span class="o">*</span> <span class="n">Hi_padded</span> <span class="o">*</span> <span class="n">Wi_padded</span><span class="p">,</span> <span class="n">dtype</span><span class="o">=</span><span class="n">torch</span><span class="p">.</span><span class="n">int32</span>
    <span class="p">).</span><span class="nf">reshape</span><span class="p">(</span><span class="n">Ci</span><span class="p">,</span> <span class="n">Hi_padded</span><span class="p">,</span> <span class="n">Wi_padded</span><span class="p">)</span>
    
    <span class="c1"># Indices within a single kernel window (accounts for dilation)
</span>    <span class="n">kernel_window_indices</span> <span class="o">=</span> <span class="n">padded_indices</span><span class="p">[</span>
        <span class="p">:</span><span class="n">Ci</span><span class="p">,</span>                      <span class="c1"># all input channels
</span>        <span class="p">:</span><span class="n">kH</span> <span class="o">*</span> <span class="n">dilation</span><span class="p">:</span><span class="n">dilation</span><span class="p">,</span>  <span class="c1"># kernel height with dilation
</span>        <span class="p">:</span><span class="n">kW</span> <span class="o">*</span> <span class="n">dilation</span><span class="p">:</span><span class="n">dilation</span>   <span class="c1"># kernel width with dilation
</span>    <span class="p">].</span><span class="nf">flatten</span><span class="p">()</span>
    
    <span class="c1"># Top-left corner positions where the kernel is applied (accounts for stride)
</span>    <span class="n">kernel_start_positions</span> <span class="o">=</span> <span class="n">padded_indices</span><span class="p">[</span>
        <span class="mi">0</span><span class="p">,</span>                  <span class="c1"># first channel only (others offset by kernel_window_indices)
</span>        <span class="mi">0</span><span class="p">:</span><span class="n">Ho</span> <span class="o">*</span> <span class="n">stride</span><span class="p">:</span><span class="n">stride</span><span class="p">,</span>  <span class="c1"># vertical positions
</span>        <span class="mi">0</span><span class="p">:</span><span class="n">Wo</span> <span class="o">*</span> <span class="n">stride</span><span class="p">:</span><span class="n">stride</span>   <span class="c1"># horizontal positions
</span>    <span class="p">].</span><span class="nf">flatten</span><span class="p">()</span>
    
    <span class="c1"># Fill Toeplitz matrix: each output channel block gets the kernel weights
</span>    <span class="n">output_channel_offsets</span> <span class="o">=</span> <span class="p">(</span><span class="n">torch</span><span class="p">.</span><span class="nf">arange</span><span class="p">(</span><span class="n">Co</span><span class="p">)</span> <span class="o">*</span> <span class="n">Ho</span> <span class="o">*</span> <span class="n">Wo</span><span class="p">).</span><span class="nf">reshape</span><span class="p">(</span><span class="n">Co</span><span class="p">,</span> <span class="mi">1</span><span class="p">)</span>
    
    <span class="k">for</span> <span class="n">spatial_idx</span><span class="p">,</span> <span class="n">start_pos</span> <span class="ow">in</span> <span class="nf">enumerate</span><span class="p">(</span><span class="n">kernel_start_positions</span><span class="p">):</span>
        <span class="n">row_indices</span> <span class="o">=</span> <span class="n">spatial_idx</span> <span class="o">+</span> <span class="n">output_channel_offsets</span>
        <span class="n">col_indices</span> <span class="o">=</span> <span class="n">kernel_window_indices</span> <span class="o">+</span> <span class="n">start_pos</span>
        <span class="n">toep_matrix</span><span class="p">[</span><span class="n">row_indices</span><span class="p">,</span> <span class="n">col_indices</span><span class="p">]</span> <span class="o">=</span> <span class="n">kernel_weights</span><span class="p">.</span><span class="nf">reshape</span><span class="p">(</span><span class="n">Co</span><span class="p">,</span> <span class="o">-</span><span class="mi">1</span><span class="p">)</span>
    
    <span class="c1"># Extract only the columns corresponding to the original (unpadded) input
</span>    <span class="n">original_input_indices</span> <span class="o">=</span> <span class="n">padded_indices</span><span class="p">[</span>
        <span class="p">:,</span>                      <span class="c1"># all channels
</span>        <span class="n">padding</span><span class="p">:</span><span class="n">Hi</span> <span class="o">+</span> <span class="n">padding</span><span class="p">,</span>   <span class="c1"># remove top/bottom padding
</span>        <span class="n">padding</span><span class="p">:</span><span class="n">Wi</span> <span class="o">+</span> <span class="n">padding</span>    <span class="c1"># remove left/right padding
</span>    <span class="p">].</span><span class="nf">flatten</span><span class="p">()</span>
    
    <span class="n">toep_matrix</span> <span class="o">=</span> <span class="n">toep_matrix</span><span class="p">[:,</span> <span class="n">original_input_indices</span><span class="p">]</span>
    
    <span class="k">return</span> <span class="n">toep_matrix</span></code></pre></figure> </div> </details> <div style="margin-top: 1em;"></div> <p>The core issue with this method is that it does not efficiently extend to strided convolutions under FHE. In particular, the nice <em>diagonal</em> pattern we see in the unit-stride case is no longer present. Without that pattern, the number of non-zero diagonals depends on the input’s spatial dimensions, which scales poorly to larger image sizes.</p> <div style="width: 95%; margin: 0 auto;"> <figure> <picture> <source class="responsive-img-srcset" srcset="/assets/img/blog/2602_Austin/toeplitz_strided.svg" sizes="95vw"/> <img src="/assets/img/blog/2602_Austin/toeplitz_strided.svg" class="img-fluid rounded" width="100%" height="auto" data-zoomable="" loading="eager" onerror="this.onerror=null; $('.responsive-img-srcset').remove();"/> </picture> </figure> <div class="caption" style="text-align: left; margin-top: -0.5em;"> Figure 5: Why strided Toeplitz convolutions are harder. With stride $s &gt; 1$, the Toeplitz matrix loses the clean diagonal structure of the unit-stride case. </div> </div> <p>For the Toeplitz representation to be useful, we need better slot utilization for strided convolutions. Our key observation is that we can permute the matrix rows without changing the computation. All that changes is the order of the output feature map. Figure 6 shows this approach for a convolution with stride $s=2$.</p> <div style="width: 95%; margin: 0 auto;"> <figure> <picture> <source class="responsive-img-srcset" srcset="/assets/img/blog/2602_Austin/multiplexed_convs.svg" sizes="95vw"/> <img src="/assets/img/blog/2602_Austin/multiplexed_convs.svg" class="img-fluid rounded" width="100%" height="auto" data-zoomable="" loading="eager" onerror="this.onerror=null; $('.responsive-img-srcset').remove();"/> </picture> </figure> <div class="caption" style="text-align: left; margin-top: -0.5em;"> Figure 6: Single-shot multiplexing. 
Permuting rows of the Toeplitz matrix packs the diagonals more densely, which reduces expensive ciphertext rotations. </div> </div> <p>We call this technique “single-shot multiplexing” as it mirrors the work from Lee et al.<sup id="fnref:1"><a href="#fn:1" class="footnote" rel="footnote" role="doc-noteref">1</a></sup> but consumes just a single multiplicative level. Like their work, this method produces outputs with a gap $g$. Future layers need to account for this gap, but Orion handles it automatically during compilation.</p> <p>The result is fewer rotations, roughly half of which are hoisted. On CIFAR-10, we see 1.65$\times$ fewer rotations on ResNet-20 and up to 6.41$\times$ fewer rotations on AlexNet.</p> <p><br/></p> <h2 id="4-automating-bootstrap-placement">4. Automating Bootstrap Placement</h2> <div style="margin-top: 1.5em;"></div> <p>At this point, it’s worth zooming in on how Orion automates bootstrap placement, because bootstrapping is what makes deep encrypted inference possible in the first place. Unlike data packing, <em>good</em> bootstrap placement often requires a deeper understanding of the underlying cryptography. In our experience, this creates the largest barrier to entry for practitioners, making automated solutions all the more valuable.</p> <p>Our approach reformulates bootstrap placement as a shortest-path problem. We build what we call a “level DAG” that enumerates every possible network state and the transitions between them.</p> <p>Figure 7 <em>(left)</em> visualizes this DAG for a simple 3-layer MLP without intermediate activation functions. Here, the nodes within a row represent the possible level choices for one linear layer, each weighted by the latency of running that layer at the given level. Each row also excludes invalid states. 
For instance, there is no node for <code class="language-plaintext highlighter-rouge">fc2</code> at level <code class="language-plaintext highlighter-rouge">l=0</code>: linear layers consume a level, so running one at level 0 could break decoding correctness.</p> <div style="width: 90%; margin: 0 auto;"> <figure> <picture> <source class="responsive-img-srcset" srcset="/assets/img/blog/2602_Austin/simple_level_dag.svg" sizes="95vw"/> <img src="/assets/img/blog/2602_Austin/simple_level_dag.svg" class="img-fluid rounded" width="100%" height="auto" data-zoomable="" loading="eager" onerror="this.onerror=null; $('.responsive-img-srcset').remove();"/> </picture> </figure> <div class="caption" style="text-align: left; margin-top: -0.5em;"> Figure 7: Bootstrap placement as shortest-path search. Nodes are level choices (weighted by latency), and edges are transitions that may require a bootstrap. The shortest path gives us where to bootstrap and at what level to run each layer. </div> </div> <p>We can then weight the edges between layers by whether a bootstrap operation is required. For instance, the edges highlighted in red in Figure 7 <em>(middle)</em> each increase the level of the ciphertext (after the layer itself is performed), and therefore require a bootstrap.</p> <p>With node and edge weights in place, we can apply a shortest-path algorithm to find the path that minimizes end-to-end latency, shown in Figure 7 <em>(right)</em>. With respect to our heuristics, this shortest path gives us both the optimal <strong>locations</strong> to bootstrap and the optimal <strong>levels</strong> at which to perform each layer.</p> <p><br/></p> <h3 id="4a-beyond-simple-mlps">4a. Beyond Simple MLPs</h3> <div style="margin-top: 1.5em;"></div> <p>This method extends almost directly to more complex networks, especially those with <em>residual</em> connections. The main obstacle is that residual connections create multiple paths through the graph, each sharing a common fork and join node. 
This makes directly applying our shortest path approach challenging.</p> <p>We solve this by creating two level DAGs around each residual connection: one for the backbone, one for the residual itself. Then, we can (i) sum the shortest paths for all possible <em>pairs</em> of input and output nodes, and (ii) insert that black-boxed solution back into the original level DAG. This reduces the problem to our original shortest path approach.</p> <div style="width: 100%; margin: 0 auto;"> <figure> <picture> <source class="responsive-img-srcset" srcset="/assets/img/blog/2602_Austin/bootstrap_placement_animation.webp" sizes="95vw"/> <img src="/assets/img/blog/2602_Austin/bootstrap_placement_animation.webp" class="img-fluid rounded" width="100%" height="auto" data-zoomable="" loading="eager" onerror="this.onerror=null; $('.responsive-img-srcset').remove();"/> </picture> </figure> <div class="caption" style="text-align: left; margin-top: -0.5em;"> Figure 8: Bootstrap placement over residual connections by solving and black-boxing single-entry, single-exit regions (the backbone and residual paths) and inserting the aggregated solution back into the global level DAG. </div> </div> <p>Interestingly, this approach extends recursively to graphs with any number of fork/join pairs. Figure 9 gives the high-level solution for attention blocks within transformers.</p> <div style="width: 100%; margin: 0 auto;"> <figure> <picture> <source class="responsive-img-srcset" srcset="/assets/img/blog/2602_Austin/extending_auto_bootstrap.svg" sizes="95vw"/> <img src="/assets/img/blog/2602_Austin/extending_auto_bootstrap.svg" class="img-fluid rounded" width="100%" height="auto" data-zoomable="" loading="eager" onerror="this.onerror=null; $('.responsive-img-srcset').remove();"/> </picture> </figure> <div class="caption" style="text-align: center; margin-top: -0.5em;"> Figure 9: Bootstrap placement for more complex networks and attention blocks with multiple fork/join pairs. 
</div> </div> <p><br/></p> <h2 id="5-addressing-the-sharp-bits">5. Addressing the Sharp Bits</h2> <div style="margin-top: 1.5em;"></div> <p>There’s still more to running deep FHE inference beyond packing and bootstrap placement. Orion includes several automation layers that remove the “sharp edges” that typically trip up non-experts.</p> <p>First, to conform to the CKKS programming model, all non-linear functions (e.g., ReLU or GELU) must be approximated by high-degree polynomials over the correct input <em>range</em>. Orion automates this with <code class="language-plaintext highlighter-rouge">orion.fit()</code>, which iterates over the training data and (i) replaces non-linearities with Chebyshev polynomials, and (ii) automatically determines the input ranges those approximations need.</p> <p>Second, Orion automates scale management. We leverage the fact that our compilation phase has already determined the level at which to perform any linear layer (call it level $j$). We encode the weights with scale factor $q_j$ (the last RNS modulus at level $j$) rather than $\Delta$. This way, performing a convolution with a ciphertext at scale $\Delta$ produces an output at scale $\Delta \cdot q_j$. Rescaling divides by $q_j$, resetting the scale back to exactly $\Delta$. This is what we call “errorless” neural network evaluation.</p> <p>Finally, Orion handles the large data structures that come with FHE inference. Server-side packed vectors and evaluation keys can reach tens to hundreds of gigabytes. 
Orion optionally reduces memory pressure by dynamically loading per-layer plaintext vectors during linear transforms.</p> <div style="width: 100%; margin: 0 auto;"> <figure> <picture> <source class="responsive-img-srcset" srcset="/assets/img/blog/2602_Austin/orion_sharp_bits.svg" sizes="95vw"/> <img src="/assets/img/blog/2602_Austin/orion_sharp_bits.svg" class="img-fluid rounded" width="100%" height="auto" data-zoomable="" loading="eager" onerror="this.onerror=null; $('.responsive-img-srcset').remove();"/> </picture> </figure> <div class="caption" style="text-align: left; margin-top: -0.5em;"> Figure 10: The sharp bits Orion handles automatically: (left) fitting activation functions, (middle) scale management, (right) large data structures. </div> </div> <p>From the user’s perspective, Orion feels like PyTorch: you write a model as usual, fit/compile it for FHE, and run encrypted inference.</p> <details> <summary>Click to expand: Model definition (ResNet-20)</summary> <div style="font-size: 0.9em;"> <figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="kn">import</span> <span class="n">torch.nn</span> <span class="k">as</span> <span class="n">nn</span>
<span class="kn">import</span> <span class="n">orion.nn</span> <span class="k">as</span> <span class="n">on</span>


<span class="k">class</span> <span class="nc">BasicBlock</span><span class="p">(</span><span class="n">on</span><span class="p">.</span><span class="n">Module</span><span class="p">):</span>
    <span class="n">expansion</span> <span class="o">=</span> <span class="mi">1</span>

    <span class="k">def</span> <span class="nf">__init__</span><span class="p">(</span><span class="n">self</span><span class="p">,</span> <span class="n">Ci</span><span class="p">,</span> <span class="n">Co</span><span class="p">,</span> <span class="n">stride</span><span class="o">=</span><span class="mi">1</span><span class="p">):</span>
        <span class="nf">super</span><span class="p">().</span><span class="nf">__init__</span><span class="p">()</span>
        <span class="n">self</span><span class="p">.</span><span class="n">conv1</span> <span class="o">=</span> <span class="n">on</span><span class="p">.</span><span class="nc">Conv2d</span><span class="p">(</span><span class="n">Ci</span><span class="p">,</span> <span class="n">Co</span><span class="p">,</span> <span class="n">kernel_size</span><span class="o">=</span><span class="mi">3</span><span class="p">,</span> <span class="n">stride</span><span class="o">=</span><span class="n">stride</span><span class="p">,</span> <span class="n">padding</span><span class="o">=</span><span class="mi">1</span><span class="p">,</span> <span class="n">bias</span><span class="o">=</span><span class="bp">False</span><span class="p">)</span>
        <span class="n">self</span><span class="p">.</span><span class="n">bn1</span>   <span class="o">=</span> <span class="n">on</span><span class="p">.</span><span class="nc">BatchNorm2d</span><span class="p">(</span><span class="n">Co</span><span class="p">)</span>
        <span class="n">self</span><span class="p">.</span><span class="n">act1</span>  <span class="o">=</span> <span class="n">on</span><span class="p">.</span><span class="nc">ReLU</span><span class="p">()</span>

        <span class="n">self</span><span class="p">.</span><span class="n">conv2</span> <span class="o">=</span> <span class="n">on</span><span class="p">.</span><span class="nc">Conv2d</span><span class="p">(</span><span class="n">Co</span><span class="p">,</span> <span class="n">Co</span><span class="p">,</span> <span class="n">kernel_size</span><span class="o">=</span><span class="mi">3</span><span class="p">,</span> <span class="n">stride</span><span class="o">=</span><span class="mi">1</span><span class="p">,</span> <span class="n">padding</span><span class="o">=</span><span class="mi">1</span><span class="p">,</span> <span class="n">bias</span><span class="o">=</span><span class="bp">False</span><span class="p">)</span>
        <span class="n">self</span><span class="p">.</span><span class="n">bn2</span>   <span class="o">=</span> <span class="n">on</span><span class="p">.</span><span class="nc">BatchNorm2d</span><span class="p">(</span><span class="n">Co</span><span class="p">)</span>
        <span class="n">self</span><span class="p">.</span><span class="n">act2</span>  <span class="o">=</span> <span class="n">on</span><span class="p">.</span><span class="nc">ReLU</span><span class="p">()</span>
       
        <span class="n">self</span><span class="p">.</span><span class="n">add</span> <span class="o">=</span> <span class="n">on</span><span class="p">.</span><span class="nc">Add</span><span class="p">()</span>
        <span class="n">self</span><span class="p">.</span><span class="n">shortcut</span> <span class="o">=</span> <span class="n">nn</span><span class="p">.</span><span class="nc">Sequential</span><span class="p">()</span>
        <span class="k">if</span> <span class="n">stride</span> <span class="o">!=</span> <span class="mi">1</span> <span class="ow">or</span> <span class="n">Ci</span> <span class="o">!=</span> <span class="n">self</span><span class="p">.</span><span class="n">expansion</span><span class="o">*</span><span class="n">Co</span><span class="p">:</span>
            <span class="n">self</span><span class="p">.</span><span class="n">shortcut</span> <span class="o">=</span> <span class="n">nn</span><span class="p">.</span><span class="nc">Sequential</span><span class="p">(</span>
                <span class="n">on</span><span class="p">.</span><span class="nc">Conv2d</span><span class="p">(</span><span class="n">Ci</span><span class="p">,</span> <span class="n">self</span><span class="p">.</span><span class="n">expansion</span><span class="o">*</span><span class="n">Co</span><span class="p">,</span> <span class="n">kernel_size</span><span class="o">=</span><span class="mi">1</span><span class="p">,</span> <span class="n">stride</span><span class="o">=</span><span class="n">stride</span><span class="p">,</span> <span class="n">bias</span><span class="o">=</span><span class="bp">False</span><span class="p">),</span>
                <span class="n">on</span><span class="p">.</span><span class="nc">BatchNorm2d</span><span class="p">(</span><span class="n">self</span><span class="p">.</span><span class="n">expansion</span><span class="o">*</span><span class="n">Co</span><span class="p">))</span>
  
    <span class="k">def</span> <span class="nf">forward</span><span class="p">(</span><span class="n">self</span><span class="p">,</span> <span class="n">x</span><span class="p">):</span>
        <span class="n">out</span> <span class="o">=</span> <span class="n">self</span><span class="p">.</span><span class="nf">act1</span><span class="p">(</span><span class="n">self</span><span class="p">.</span><span class="nf">bn1</span><span class="p">(</span><span class="n">self</span><span class="p">.</span><span class="nf">conv1</span><span class="p">(</span><span class="n">x</span><span class="p">)))</span>
        <span class="n">out</span> <span class="o">=</span> <span class="n">self</span><span class="p">.</span><span class="nf">bn2</span><span class="p">(</span><span class="n">self</span><span class="p">.</span><span class="nf">conv2</span><span class="p">(</span><span class="n">out</span><span class="p">))</span>
        <span class="n">out</span> <span class="o">=</span> <span class="n">self</span><span class="p">.</span><span class="nf">add</span><span class="p">(</span><span class="n">out</span><span class="p">,</span> <span class="n">self</span><span class="p">.</span><span class="nf">shortcut</span><span class="p">(</span><span class="n">x</span><span class="p">))</span>
        <span class="k">return</span> <span class="n">self</span><span class="p">.</span><span class="nf">act2</span><span class="p">(</span><span class="n">out</span><span class="p">)</span>


<span class="k">class</span> <span class="nc">ResNet</span><span class="p">(</span><span class="n">on</span><span class="p">.</span><span class="n">Module</span><span class="p">):</span>
    <span class="k">def</span> <span class="nf">__init__</span><span class="p">(</span><span class="n">self</span><span class="p">,</span> <span class="n">dataset</span><span class="p">,</span> <span class="n">block</span><span class="p">,</span> <span class="n">num_blocks</span><span class="p">,</span> <span class="n">num_chans</span><span class="p">,</span> <span class="n">conv1_params</span><span class="p">,</span> <span class="n">num_classes</span><span class="p">):</span>
        <span class="nf">super</span><span class="p">().</span><span class="nf">__init__</span><span class="p">()</span>
        <span class="n">self</span><span class="p">.</span><span class="n">in_chans</span> <span class="o">=</span> <span class="n">num_chans</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span>
        <span class="n">self</span><span class="p">.</span><span class="n">last_chans</span> <span class="o">=</span> <span class="n">num_chans</span><span class="p">[</span><span class="o">-</span><span class="mi">1</span><span class="p">]</span>

        <span class="n">self</span><span class="p">.</span><span class="n">conv1</span> <span class="o">=</span> <span class="n">on</span><span class="p">.</span><span class="nc">Conv2d</span><span class="p">(</span><span class="mi">3</span><span class="p">,</span> <span class="n">self</span><span class="p">.</span><span class="n">in_chans</span><span class="p">,</span> <span class="o">**</span><span class="n">conv1_params</span><span class="p">,</span> <span class="n">bias</span><span class="o">=</span><span class="bp">False</span><span class="p">)</span>
        <span class="n">self</span><span class="p">.</span><span class="n">bn1</span> <span class="o">=</span> <span class="n">on</span><span class="p">.</span><span class="nc">BatchNorm2d</span><span class="p">(</span><span class="n">self</span><span class="p">.</span><span class="n">in_chans</span><span class="p">)</span>
        <span class="n">self</span><span class="p">.</span><span class="n">act</span> <span class="o">=</span> <span class="n">on</span><span class="p">.</span><span class="nc">ReLU</span><span class="p">()</span>
        
        <span class="n">self</span><span class="p">.</span><span class="n">layers</span> <span class="o">=</span> <span class="n">nn</span><span class="p">.</span><span class="nc">ModuleList</span><span class="p">()</span>
        <span class="k">for</span> <span class="n">i</span> <span class="ow">in</span> <span class="nf">range</span><span class="p">(</span><span class="nf">len</span><span class="p">(</span><span class="n">num_blocks</span><span class="p">)):</span>
            <span class="n">stride</span> <span class="o">=</span> <span class="mi">1</span> <span class="k">if</span> <span class="n">i</span> <span class="o">==</span> <span class="mi">0</span> <span class="k">else</span> <span class="mi">2</span>
            <span class="n">self</span><span class="p">.</span><span class="n">layers</span><span class="p">.</span><span class="nf">append</span><span class="p">(</span><span class="n">self</span><span class="p">.</span><span class="nf">layer</span><span class="p">(</span><span class="n">block</span><span class="p">,</span> <span class="n">num_chans</span><span class="p">[</span><span class="n">i</span><span class="p">],</span> <span class="n">num_blocks</span><span class="p">[</span><span class="n">i</span><span class="p">],</span> <span class="n">stride</span><span class="p">))</span>

        <span class="n">self</span><span class="p">.</span><span class="n">avgpool</span> <span class="o">=</span> <span class="n">on</span><span class="p">.</span><span class="nc">AdaptiveAvgPool2d</span><span class="p">(</span><span class="n">output_size</span><span class="o">=</span><span class="p">(</span><span class="mi">1</span><span class="p">,</span><span class="mi">1</span><span class="p">))</span> 
        <span class="n">self</span><span class="p">.</span><span class="n">flatten</span> <span class="o">=</span> <span class="n">on</span><span class="p">.</span><span class="nc">Flatten</span><span class="p">()</span>
        <span class="n">self</span><span class="p">.</span><span class="n">linear</span>  <span class="o">=</span> <span class="n">on</span><span class="p">.</span><span class="nc">Linear</span><span class="p">(</span><span class="n">self</span><span class="p">.</span><span class="n">last_chans</span> <span class="o">*</span> <span class="n">block</span><span class="p">.</span><span class="n">expansion</span><span class="p">,</span> <span class="n">num_classes</span><span class="p">)</span>

    <span class="k">def</span> <span class="nf">layer</span><span class="p">(</span><span class="n">self</span><span class="p">,</span> <span class="n">block</span><span class="p">,</span> <span class="n">chans</span><span class="p">,</span> <span class="n">num_blocks</span><span class="p">,</span> <span class="n">stride</span><span class="p">):</span>
        <span class="n">strides</span> <span class="o">=</span> <span class="p">[</span><span class="n">stride</span><span class="p">]</span> <span class="o">+</span> <span class="p">[</span><span class="mi">1</span><span class="p">]</span><span class="o">*</span><span class="p">(</span><span class="n">num_blocks</span><span class="o">-</span><span class="mi">1</span><span class="p">)</span>
        <span class="n">layers</span> <span class="o">=</span> <span class="p">[]</span>
        <span class="k">for</span> <span class="n">stride</span> <span class="ow">in</span> <span class="n">strides</span><span class="p">:</span>
            <span class="n">layers</span><span class="p">.</span><span class="nf">append</span><span class="p">(</span><span class="nf">block</span><span class="p">(</span><span class="n">self</span><span class="p">.</span><span class="n">in_chans</span><span class="p">,</span> <span class="n">chans</span><span class="p">,</span> <span class="n">stride</span><span class="p">))</span>
            <span class="n">self</span><span class="p">.</span><span class="n">in_chans</span> <span class="o">=</span> <span class="n">chans</span> <span class="o">*</span> <span class="n">block</span><span class="p">.</span><span class="n">expansion</span>
        <span class="k">return</span> <span class="n">nn</span><span class="p">.</span><span class="nc">Sequential</span><span class="p">(</span><span class="o">*</span><span class="n">layers</span><span class="p">)</span>
    
    <span class="k">def</span> <span class="nf">forward</span><span class="p">(</span><span class="n">self</span><span class="p">,</span> <span class="n">x</span><span class="p">):</span>
        <span class="n">out</span> <span class="o">=</span> <span class="n">self</span><span class="p">.</span><span class="nf">act</span><span class="p">(</span><span class="n">self</span><span class="p">.</span><span class="nf">bn1</span><span class="p">(</span><span class="n">self</span><span class="p">.</span><span class="nf">conv1</span><span class="p">(</span><span class="n">x</span><span class="p">)))</span>
        <span class="k">for</span> <span class="n">layer</span> <span class="ow">in</span> <span class="n">self</span><span class="p">.</span><span class="n">layers</span><span class="p">:</span>
            <span class="n">out</span> <span class="o">=</span> <span class="nf">layer</span><span class="p">(</span><span class="n">out</span><span class="p">)</span>

        <span class="n">out</span> <span class="o">=</span> <span class="n">self</span><span class="p">.</span><span class="nf">avgpool</span><span class="p">(</span><span class="n">out</span><span class="p">)</span>
        <span class="n">out</span> <span class="o">=</span> <span class="n">self</span><span class="p">.</span><span class="nf">flatten</span><span class="p">(</span><span class="n">out</span><span class="p">)</span>
        <span class="k">return</span> <span class="n">self</span><span class="p">.</span><span class="nf">linear</span><span class="p">(</span><span class="n">out</span><span class="p">)</span>


<span class="k">def</span> <span class="nf">ResNet20</span><span class="p">(</span><span class="n">dataset</span><span class="o">=</span><span class="sh">'</span><span class="s">cifar10</span><span class="sh">'</span><span class="p">):</span>
    <span class="n">configs</span> <span class="o">=</span> <span class="p">{</span>
        <span class="sh">"</span><span class="s">cifar10</span><span class="sh">"</span><span class="p">:</span>  <span class="p">{</span><span class="sh">"</span><span class="s">kernel_size</span><span class="sh">"</span><span class="p">:</span> <span class="mi">3</span><span class="p">,</span> <span class="sh">"</span><span class="s">stride</span><span class="sh">"</span><span class="p">:</span> <span class="mi">1</span><span class="p">,</span> <span class="sh">"</span><span class="s">padding</span><span class="sh">"</span><span class="p">:</span> <span class="mi">1</span><span class="p">,</span> <span class="sh">"</span><span class="s">num_classes</span><span class="sh">"</span><span class="p">:</span> <span class="mi">10</span><span class="p">},</span>
        <span class="sh">"</span><span class="s">cifar100</span><span class="sh">"</span><span class="p">:</span> <span class="p">{</span><span class="sh">"</span><span class="s">kernel_size</span><span class="sh">"</span><span class="p">:</span> <span class="mi">3</span><span class="p">,</span> <span class="sh">"</span><span class="s">stride</span><span class="sh">"</span><span class="p">:</span> <span class="mi">1</span><span class="p">,</span> <span class="sh">"</span><span class="s">padding</span><span class="sh">"</span><span class="p">:</span> <span class="mi">1</span><span class="p">,</span> <span class="sh">"</span><span class="s">num_classes</span><span class="sh">"</span><span class="p">:</span> <span class="mi">100</span><span class="p">},</span>
    <span class="p">}</span>
    <span class="n">config</span> <span class="o">=</span> <span class="n">configs</span><span class="p">[</span><span class="n">dataset</span><span class="p">]</span>
    <span class="n">conv1_params</span> <span class="o">=</span> <span class="p">{</span>
        <span class="sh">'</span><span class="s">kernel_size</span><span class="sh">'</span><span class="p">:</span> <span class="n">config</span><span class="p">[</span><span class="sh">"</span><span class="s">kernel_size</span><span class="sh">"</span><span class="p">],</span>
        <span class="sh">'</span><span class="s">stride</span><span class="sh">'</span><span class="p">:</span> <span class="n">config</span><span class="p">[</span><span class="sh">"</span><span class="s">stride</span><span class="sh">"</span><span class="p">],</span>
        <span class="sh">'</span><span class="s">padding</span><span class="sh">'</span><span class="p">:</span> <span class="n">config</span><span class="p">[</span><span class="sh">"</span><span class="s">padding</span><span class="sh">"</span><span class="p">]</span>
    <span class="p">}</span>
    <span class="k">return</span> <span class="nc">ResNet</span><span class="p">(</span><span class="n">dataset</span><span class="p">,</span> <span class="n">BasicBlock</span><span class="p">,</span> <span class="p">[</span><span class="mi">3</span><span class="p">,</span><span class="mi">3</span><span class="p">,</span><span class="mi">3</span><span class="p">],</span> <span class="p">[</span><span class="mi">16</span><span class="p">,</span><span class="mi">32</span><span class="p">,</span><span class="mi">64</span><span class="p">],</span> <span class="n">conv1_params</span><span class="p">,</span> <span class="n">config</span><span class="p">[</span><span class="sh">"</span><span class="s">num_classes</span><span class="sh">"</span><span class="p">])</span></code></pre></figure> </div> </details> <details> <summary>Click to expand: FHE compilation workflow</summary> <div style="font-size: 0.9em;"> <figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="c1"># High-level workflow (sketch)
</span><span class="n">net</span> <span class="o">=</span> <span class="nc">ResNet20</span><span class="p">()</span>
<span class="n">orion</span><span class="p">.</span><span class="nf">fit</span><span class="p">(</span><span class="n">net</span><span class="p">,</span> <span class="n">trainloader</span><span class="p">)</span>   <span class="c1"># range discovery + polynomial activation replacement
</span><span class="n">orion</span><span class="p">.</span><span class="nf">compile</span><span class="p">(</span><span class="n">net</span><span class="p">)</span>            <span class="c1"># packing + bootstrap placement + key/material generation
</span>
<span class="n">net</span><span class="p">.</span><span class="nf">he</span><span class="p">()</span>                      <span class="c1"># switch model into FHE execution mode
</span><span class="n">ct_out</span> <span class="o">=</span> <span class="nf">net</span><span class="p">(</span><span class="n">ct_in</span><span class="p">)</span>           <span class="c1"># run encrypted inference</span></code></pre></figure> </div> </details> <p><br/></p> <h2 id="6-results">6. Results</h2> <div style="margin-top: 1.5em;"></div> <p>To highlight Orion’s scalability, we ran ResNet-34 and ResNet-50 on ImageNet end-to-end under FHE with single-threaded inference times of 3.98 hours and 8.98 hours, respectively. To show that Orion handles more than classification, we also ran object detection using a 139 million parameter YOLO-v1 model on the PASCAL-VOC dataset (images of size 448 $\times$ 448 $\times$ 3). An encrypted inference took 17.5 hours and produced bounding boxes and confidence scores entirely under FHE, matching the cleartext PyTorch output to 8 bits of precision.</p> <div style="margin-top: 1em;"></div> <div style="width: 80%; margin: 0 auto;"> <figure> <picture> <source class="responsive-img-srcset" srcset="/assets/img/blog/2602_Austin/object_detection.svg" sizes="95vw"/> <img src="/assets/img/blog/2602_Austin/object_detection.svg" class="img-fluid rounded" width="100%" height="auto" data-zoomable="" loading="eager" onerror="this.onerror=null; $('.responsive-img-srcset').remove();"/> </picture> </figure> <div class="caption" style="text-align: left; margin-top: -0.5em;"> Figure 11: Homomorphic object detection and localization results. Labels show predicted class and confidence score. Outputs match cleartext PyTorch to 8 bits of precision. 
</div> </div> <p><em>If you want to reproduce results or try Orion on your own models, the repository is available at <a href="https://github.com/baahl-nyu/orion">https://github.com/baahl-nyu/orion</a>.</em></p> <hr/> <p><br/></p> <h2 id="references">References</h2> <div style="margin-top: 1.5em;"></div> <div class="footnotes" role="doc-endnotes"> <ol> <li id="fn:1"> <p>Eunsang Lee, Joon-Woo Lee, Junghyun Lee, Young-Sik Kim, Yongjune Kim, Jong-Seon No, and Woosuk Choi. “Low-Complexity Deep Convolutional Neural Networks on Fully Homomorphic Encryption Using Multiplexed Parallel Convolutions.” ICML 2022. <a href="https://proceedings.mlr.press/v162/lee22e.html">https://proceedings.mlr.press/v162/lee22e.html</a> <a href="#fnref:1" class="reversefootnote" role="doc-backlink">&#8617;</a></p> </li> </ol> </div>]]></content><author><name>Austin Ebel, Karthik Garimella, Brandon Reagen</name></author><summary type="html"><![CDATA[TL;DR: Orion is a framework that compiles PyTorch neural network models into efficient CKKS FHE programs for encrypted inference. Orion automatically handles low-level FHE details such as data packing, bootstrap placement, and precision management. 
Orion is open-sourced at: https://github.com/baahl-nyu/orion.]]></summary></entry><entry><title type="html">DPHE: Protecting Server Privacy in CKKS-based Protocols</title><link href="https://ckks.org/blog/2026/DPHE/" rel="alternate" type="text/html" title="DPHE: Protecting Server Privacy in CKKS-based Protocols"/><published>2026-01-05T09:12:00+00:00</published><updated>2026-01-05T09:12:00+00:00</updated><id>https://ckks.org/blog/2026/DPHE</id><content type="html" xml:base="https://ckks.org/blog/2026/DPHE/"><![CDATA[<ul> <li>Written by <a href="https://jin-yeong-seo.github.io/">Jinyeong Seo</a> (Seoul National University)</li> <li>Based on <a href="https://ia.cr/2025/382">https://ia.cr/2025/382</a> (Asiacrypt 2025)</li> </ul> <p><em>TL;DR: We investigate methods for protecting server privacy in CKKS-based protocols. Unlike exact homomorphic encryption schemes, formally defining security notions for the server is challenging in CKKS-based protocols due to the approximate nature of CKKS. We address this by introducing a new security notion called Differentially Private Homomorphic Evaluation (DPHE), which is motivated by differential privacy. Based on this notion, we construct a general compiler that transforms CKKS-based protocols into DPHE protocols. We also present the first zero-knowledge argument of knowledge for CKKS ciphertexts to protect server privacy against malicious clients.</em></p> <hr/> <p><br/></p> <h2 id="1introduction">1. Introduction</h2> <div style="margin-top: 1.5em;"></div> <p>In recent years, CKKS has become a popular choice for building privacy-preserving machine learning (PPML) protocols. The primary reason for its popularity is its support for efficient real and complex arithmetic. This feature enables straightforward design of machine learning as a service (MLaaS) protocols that protect user privacy. However, in most CKKS-based protocols, server privacy is not guaranteed. 
Specifically, a client may learn more than just the inference result, potentially gaining access to sensitive information such as model weights or training data. In delegated computation scenarios where the server’s model is public, such as in open-source large language models, this is not a major concern. However, in settings where the service provider aims to keep its model private—due to high training costs, risks of model jailbreaking, or legal and regulatory issues involving sensitive data such as health or legal information—protecting server privacy becomes critical. Thus, in this paper, we aim to address the following question in CKKS-based protocols.</p> <blockquote> <p>How can we protect the server’s privacy in CKKS-based MLaaS protocols?</p> </blockquote> <p><br/></p> <h2 id="2-server-privacy-in-homomorphic-evaluation-protocols">2. Server Privacy in Homomorphic Evaluation Protocols</h2> <div style="margin-top: 1.5em;"></div> <h3 id="circuit-privacy-is-insufficient">Circuit Privacy is Insufficient</h3> <p>For other homomorphic encryption (HE)-based protocols, protecting server privacy has been studied in the context of standard two-party computation (2PC) protocol security, which is often referred to as circuit privacy. The circuit privacy framework is suitable for two-party cryptographic protocols, such as oblivious pseudo-random function (OPRF) protocols <sup id="fnref:1"><a href="#fn:1" class="footnote" rel="footnote" role="doc-noteref">1</a></sup>, where the output reveals no information at all about the server’s input. 
However, for encrypted MLaaS protocols, where the output often contains too much information about the server’s input, circuit privacy does not guarantee server privacy, as it does not prevent leakage from the output itself.</p> <figure class="figure-class"> <div class="row mt-3"> <div class="col-sm-12 mt-3 mt-md-0 mx-auto d-block"> <figure> <picture> <source class="responsive-img-srcset" srcset="/assets/img/blog/2601_Jinyeong/oprf-480.webp 480w,/assets/img/blog/2601_Jinyeong/oprf-800.webp 800w,/assets/img/blog/2601_Jinyeong/oprf-1400.webp 1400w," type="image/webp" sizes="95vw"/> <img src="/assets/img/blog/2601_Jinyeong/oprf.png" class="img-fluid rounded z-depth-1" width="100%" height="auto" data-zoomable="" loading="eager" onerror="this.onerror=null; $('.responsive-img-srcset').remove();"/> </picture> </figure> </div> </div> <figcaption class="figure-caption" style="text-align: center;"> Figure 1: OPRF protocol </figcaption> </figure> <h3 id="estimating-privacy-leakage-via-differential-privacy">Estimating Privacy Leakage via Differential Privacy</h3> <p>To estimate server privacy leakage in CKKS-based MLaaS protocols, we utilize a differential privacy (DP)-based analysis beyond the circuit privacy framework. In plain MLaaS protocols, server privacy leakage is usually measured through the lens of differential privacy. In particular, it can be formalized through the notion of DP-prediction<sup id="fnref:2"><a href="#fn:2" class="footnote" rel="footnote" role="doc-noteref">2</a></sup>, which is defined as follows.</p> <p><strong>Definition (DP Prediction).</strong> Let $M : \Theta \times \mathcal{X} \rightarrow \mathcal{Y}$ be a random algorithm. We say that $M$ is an $\epsilon$-DP prediction algorithm if, for every $x \in \mathcal{X}$, the output $M(\theta, x)$ is $\epsilon$-DP with respect to $\theta \in \Theta$. 
In other words, for all adjacent $\theta, \theta' \in \Theta$ and all PPT algorithms $\mathcal{A}$, the following holds.</p> \[\Pr[ \mathcal{A}( M(\theta, x) ) = 1] \lesssim e^{\epsilon} \cdot \Pr[\mathcal{A}( M(\theta', x) ) = 1]\] <p>The above definition models an MLaaS protocol where the client’s query is $x$ and the server’s model weight or training data is $\theta$. It essentially says that server privacy is maintained regardless of the client’s query. We then model the ideal functionality of encrypted MLaaS protocols as evaluating some DP-prediction algorithm $M$, which can be described as follows.</p> <p><strong>Ideal Functionality.</strong> The ideal functionality $\mathcal{F}$ for encrypted MLaaS protocols is defined as follows.</p> <ul> <li><em>Client’s input</em>: $x \in \mathcal{X}$</li> <li><em>Server’s input</em>: $\theta \in \Theta$</li> <li><em>Client’s output</em>: $M(\theta, x)$</li> <li><em>Server’s output</em>: $\bot$</li> </ul> <figure class="figure-class"> <div class="row mt-3"> <div class="col-sm-10 mt-3 mt-md-0 mx-auto d-block"> <figure> <picture> <source class="responsive-img-srcset" srcset="/assets/img/blog/2601_Jinyeong/ideal-480.webp 480w,/assets/img/blog/2601_Jinyeong/ideal-800.webp 800w,/assets/img/blog/2601_Jinyeong/ideal-1400.webp 1400w," type="image/webp" sizes="95vw"/> <img src="/assets/img/blog/2601_Jinyeong/ideal.png" class="img-fluid rounded z-depth-1" width="100%" height="auto" data-zoomable="" loading="eager" onerror="this.onerror=null; $('.responsive-img-srcset').remove();"/> </picture> </figure> </div> </div> <figcaption class="figure-caption" style="text-align: center;"> Figure 2: Ideal functionality </figcaption> </figure> <p>Based on the above ideal functionality, we define the server’s privacy in encrypted MLaaS protocols as follows.</p> <p><strong>Definition (Server Privacy).</strong> Let $\Pi$ be a two-party protocol that implements the ideal functionality $\mathcal{F}$. 
We say $\Pi$ achieves server privacy with parameter $\epsilon$ if an execution of $\Pi$ is an $\epsilon$-DP prediction with respect to the server’s input $\theta$. In other words, the following holds for all PPT adversaries $\mathcal{A}$ that manipulate the client, and all PPT environments $\mathcal{Z}$.</p> \[\Pr[\mathsf{Exec}[\Pi_{\theta}, \mathcal{A}, \mathcal{Z}] = 1] \lesssim e^{\epsilon} \cdot \Pr[\mathsf{Exec}[\Pi_{\theta'}, \mathcal{A}, \mathcal{Z}] = 1]\] <p>The above definition can be interpreted as a natural extension of DP-prediction in a two-party computation scenario. In other words, the above definition ensures that the protocol $\Pi$ protects the server’s input $\theta$ in terms of differential privacy against all adversarial clients’ inputs $x$.</p> <h3 id="circuit-privacy-implies-server-privacy">Circuit Privacy implies Server Privacy</h3> <p>One interesting corollary is that circuit privacy remains meaningful within our new definition of server privacy. To present more details, we recall the definition of circuit privacy below.</p> <p><strong>Definition (Circuit Privacy).</strong> Let $\Pi$ be a two-party protocol that implements the ideal functionality $\mathcal{F}$. We say $\Pi$ achieves circuit privacy if there exists a PPT simulator $\mathcal{S}$ such that the following holds for all PPT adversaries $\mathcal{A}$ that manipulate the client, and all PPT environments $\mathcal{Z}$.</p> \[\Pr[\mathsf{Exec}[\Pi_{\theta}, \mathcal{A}, \mathcal{Z}] = 1] \approx \Pr[\mathsf{Exec}[\mathcal{F}_{\theta}, \mathcal{S}, \mathcal{Z}] = 1]\] <p>Then, we can derive the following result.</p> <p><strong>Theorem (Circuit Privacy).</strong> Let $\Pi$ be a two-party protocol that implements the ideal functionality $\mathcal{F}$. 
If the target model $M$ in $\mathcal{F}$ is an $\epsilon$-DP prediction and $\Pi$ achieves circuit privacy, then $\Pi$ achieves server privacy with parameter $\epsilon$.</p> <p>The above theorem says that if the target model $M$ is a DP-prediction algorithm and $\Pi$ achieves circuit privacy, then $\Pi$ guarantees server privacy. Thus, we can conclude that achieving circuit privacy is still meaningful in the context of encrypted MLaaS protocols if the target models are set to DP prediction algorithms.</p> <p><br/></p> <h2 id="3-differentially-private-homomrphic-evaluation">3. Differentially Private Homomorphic Evaluation</h2> <div style="margin-top: 1.5em;"></div> <h3 id="motivation">Motivation</h3> <p>Within our new server privacy notion, the problem of achieving server privacy seems to essentially boil down to achieving circuit privacy, which naturally leads to the following question in the case of CKKS-based protocols.</p> <blockquote> <p>Can we achieve circuit privacy in CKKS-based protocols?</p> </blockquote> <p>The answer to the above question is <strong>No</strong> in general due to the peculiar structure of CKKS ciphertexts. To see why, we first compare the ciphertext structure of CKKS with that of exact HE schemes, such as BFV. In a BFV ciphertext, a noise term $e$ and a plaintext $m$ are strictly separated, so they do not interfere with each other unless $e$ exceeds the decryption bound. However, in a CKKS ciphertext, the noise $e$ and the plaintext $m$ exist in a fused state $m + e$, and they cannot be separated once encryption is performed. 
Thus, the size of the noise affects the precision of the plaintext after decryption, as they interfere with each other.</p> <figure class="figure-class"> <div class="row mt-3"> <div class="col-sm-10 mt-3 mt-md-0 mx-auto d-block"> <figure> <picture> <source class="responsive-img-srcset" srcset="/assets/img/blog/2601_Jinyeong/ctxt-480.webp 480w,/assets/img/blog/2601_Jinyeong/ctxt-800.webp 800w,/assets/img/blog/2601_Jinyeong/ctxt-1400.webp 1400w," type="image/webp" sizes="95vw"/> <img src="/assets/img/blog/2601_Jinyeong/ctxt.png" class="img-fluid rounded z-depth-1" width="100%" height="auto" data-zoomable="" loading="eager" onerror="this.onerror=null; $('.responsive-img-srcset').remove();"/> </picture> </figure> </div> </div> <figcaption class="figure-caption" style="text-align: center;"> Figure 3: Ciphertext structure of BFV &amp; CKKS </figcaption> </figure> <p>To achieve circuit privacy, one frequently utilized technique is noise flooding, which introduces additional noise $e'$ to erase any circuit information remaining in the noise part $e$. To achieve the indistinguishability notion in circuit privacy, the size of $e'$ is set to be exponentially larger than $e$. This is acceptable for BFV ciphertexts if $e + e'$ remains below the decryption bound, as the additional noise does not alter the value $m$ of the plaintext. However, for CKKS ciphertexts, excessive noise corrupts the plaintext value because they are fused. To be precise, performing noise flooding results in a decryption result of $m + e + e'$. 
If $e' \gg m$, then the decryption result becomes entirely unusable for the client, even though circuit privacy is achieved.</p> <figure class="figure-class"> <div class="row mt-3"> <div class="col-sm-10 mt-3 mt-md-0 mx-auto d-block"> <figure> <picture> <source class="responsive-img-srcset" srcset="/assets/img/blog/2601_Jinyeong/flooding-480.webp 480w,/assets/img/blog/2601_Jinyeong/flooding-800.webp 800w,/assets/img/blog/2601_Jinyeong/flooding-1400.webp 1400w," type="image/webp" sizes="95vw"/> <img src="/assets/img/blog/2601_Jinyeong/flooding.png" class="img-fluid rounded z-depth-1" width="100%" height="auto" data-zoomable="" loading="eager" onerror="this.onerror=null; $('.responsive-img-srcset').remove();"/> </picture> </figure> </div> </div> <figcaption class="figure-caption" style="text-align: center;"> Figure 4: Ciphertext structure of BFV &amp; CKKS after noise flooding </figcaption> </figure> <h3 id="definition">Definition</h3> <p>We observe that the main difficulty in achieving circuit privacy arises from the computational indistinguishability requirement between the ideal functionality and the real protocol execution. However, if our final goal is to achieve server privacy, which is estimated in terms of differential privacy, requiring computational indistinguishability can be overkill. Thus, we define the concept of <em>differentially private homomorphic evaluation</em><sup id="fnref:3"><a href="#fn:3" class="footnote" rel="footnote" role="doc-noteref">3</a></sup> (DPHE) as follows to resolve this issue.</p> <p><strong>Definition (DPHE).</strong> Let $\Pi$ be a homomorphic evaluation protocol that implements the ideal functionality $\mathcal{F}$. 
We say $\Pi$ is an $\epsilon$-DPHE protocol for $\mathcal{F}$ if there exists a PPT simulator $\mathcal{S}$ such that the following holds for all PPT adversaries $\mathcal{A}$ that manipulate the client, and all PPT environments $\mathcal{Z}$.</p> \[\Pr[\mathsf{Exec}[\Pi_{\theta}, \mathcal{A}, \mathcal{Z}] = 1] \lesssim e^{\epsilon} \cdot \Pr[\mathsf{Exec}[\mathcal{F}_{\theta}, \mathcal{S}, \mathcal{Z}] = 1]\] \[\Pr[\mathsf{Exec}[\mathcal{F}_{\theta}, \mathcal{S}, \mathcal{Z}] = 1] \lesssim e^{\epsilon} \cdot \Pr[\mathsf{Exec}[\Pi_{\theta}, \mathcal{A}, \mathcal{Z}] = 1]\] <p>The above definition can be viewed as a relaxation of the indistinguishability notion in circuit privacy into something analogous to differential privacy. Once a protocol satisfies the DPHE property, we can derive the following result.</p> <p><strong>Theorem (DPHE).</strong> Let $\Pi$ be a two-party protocol that implements the ideal functionality $\mathcal{F}$. If the target model $M$ in $\mathcal{F}$ is an $\epsilon$-DP prediction and $\Pi$ achieves the $\epsilon'$-DPHE property, then $\Pi$ achieves server privacy with parameter $\epsilon+2\epsilon'$.</p> <p>The factor of two arises because the DPHE bound is applied twice: once to pass from $\Pi_{\theta}$ to $\mathcal{F}_{\theta}$, and once to pass from $\mathcal{F}_{\theta'}$ back to $\Pi_{\theta'}$. Therefore, we can conclude that achieving the DPHE property is sufficient for server privacy instead of achieving circuit privacy.</p> <h3 id="instantiation">Instantiation</h3> <p>Once we verify that the DPHE property is sufficient, it naturally leads to the next question.</p> <blockquote> <p>Can we achieve the DPHE property in CKKS-based protocols?</p> </blockquote> <p>The answer is <strong>Yes</strong>, and we show how to instantiate a DPHE protocol by compiling existing CKKS-based protocols. The core idea is to utilize the Laplace mechanism<sup id="fnref:4"><a href="#fn:4" class="footnote" rel="footnote" role="doc-noteref">4</a></sup> to achieve differential privacy. Suppose the ciphertext space is \(R_q = \mathbb{Z}_q[X]/(X^N + 1)\) and the security parameter is given as $\mathbf{1}^{\lambda}$. 
We recall that to achieve circuit privacy, the noise flooding method introduces additional noise $e'$ whose norm is $O(2^{\lambda} \cdot B_e)$, where $B_e$ is an upper bound on the norm $\Vert e \Vert_{\infty}$ of the initial noise $e$. However, to achieve the DPHE property, it suffices to add additional noise $e'$ whose norm is $O(N B_e)$, which is significantly smaller than that of noise flooding.</p> <figure class="figure-class"> <div class="row mt-3"> <div class="col-sm-10 mt-3 mt-md-0 mx-auto d-block"> <figure> <picture> <source class="responsive-img-srcset" srcset="/assets/img/blog/2601_Jinyeong/dphe-480.webp 480w,/assets/img/blog/2601_Jinyeong/dphe-800.webp 800w,/assets/img/blog/2601_Jinyeong/dphe-1400.webp 1400w," type="image/webp" sizes="95vw"/> <img src="/assets/img/blog/2601_Jinyeong/dphe.png" class="img-fluid rounded z-depth-1" width="100%" height="auto" data-zoomable="" loading="eager" onerror="this.onerror=null; $('.responsive-img-srcset').remove();"/> </picture> </figure> </div> </div> <figcaption class="figure-caption" style="text-align: center;"> Figure 5: Noise flooding vs. Laplace mechanism </figcaption> </figure> <p>The detailed procedure is as follows. Suppose a protocol $\Pi$ aims at evaluating a DP mechanism $M$ on the client’s encrypted input $\mathsf{ct}_{in} = \mathsf{Enc}(x)$. Let $\tau &gt; 0$ be an $L^1$-norm bound for the noise when evaluating $M$ in CKKS evaluation algorithms for $\theta \in \Theta$ and $x \in \mathcal{X}$. 
Then, the compilation is achieved as follows.</p> <ol> <li>$(c_0, c_1) \gets \mathsf{Eval} \big( M(\theta, \cdot), \mathsf{ct}_{in} \big) \pmod{q}$</li> <li>$(c'_0, c'_1) \gets (c_0, c_1) + (\lceil t \rfloor, 0) \pmod{q}$ for $t \gets \mathsf{Lap}(N \tau / \epsilon')^N$</li> <li>\(\mathsf{ct}_{out} \gets q' \cdot (c'_0, c'_1) + \mathsf{Enc}_{\mathsf{pk}}(0) \pmod{qq'}\) </li> </ol> <p>The second step achieves the differential privacy property on the client’s output plaintext, and the third step removes any remaining information in the ciphertext components by adding an encryption of zero. As a corollary, our compiler yields the following result.</p> <p><strong>Corollary (DPHE Compiler).</strong> Let $\mathcal{F}$ be the ideal functionality for homomorphic evaluation of an $\epsilon$-DP prediction algorithm $M$. Then, the DPHE compiler produces an $\epsilon'$-DPHE protocol $\Pi$ that implements $\mathcal{F}$, and $\Pi$ achieves server privacy with parameter $\epsilon + 2\epsilon'$.</p> <p>Therefore, by relaxing the security notion for homomorphic evaluation protocols, we succeed in achieving server privacy in CKKS-based protocols.</p> <p><br/></p> <h2 id="4-zkaok-for-ckks-ciphertexts">4. ZKAoK for CKKS Ciphertexts</h2> <div style="margin-top: 1.5em;"></div> <p>Our DPHE compiler is based on the assumption that the input ciphertext is well-formed and the input message lies in the domain $\mathcal{X}$. However, for malicious clients, there is no guarantee that the input ciphertext satisfies these conditions. Thus, for the server to verify these conditions without compromising the client’s privacy, we need a zero-knowledge argument of knowledge (ZKAoK) for CKKS ciphertexts.</p> <p>However, designing a ZKAoK for CKKS is nontrivial, and there have been no previous attempts. The main difficulty arises from the lack of techniques for verifying the validity of the message, i.e., $\vec{x} \in \mathcal{X} \subseteq \mathbb{R}^N$. 
To be precise, a message $\vec{x} \in \mathbb{R}^N$ is encoded, prior to encryption, into a polynomial $m(X) \in R_q$ that lies in the ciphertext space $R_q = \mathbb{Z}_q[X]/(X^N + 1)$. Current ZKAoKs for HE ciphertexts<sup id="fnref:5"><a href="#fn:5" class="footnote" rel="footnote" role="doc-noteref">5</a></sup> support only the verification of arithmetic relations defined over $R_q$, whereas we need to verify the condition $\vec{x} \in \mathcal{X}$, which is defined over $\mathbb{R}^N$.</p> <h3 id="solution">Solution</h3> <p>We address this issue by delegating the encoding procedure to the server, as described below.</p> <ol> <li>For an input message $\vec{x} := (x_0, \dots, x_{N-1}) \in \mathbb{R}^N$, the client generates a plaintext $m’(X) \in R_q$ with scaled coefficient packing as follows. <ul> <li>$m’(X) = \lceil \Delta x_0 \rceil + \lceil \Delta x_1 \rceil X + \cdots + \lceil \Delta x_{N-1} \rceil X^{N-1}$</li> </ul> </li> <li>The client generates a ciphertext $\mathsf{ct}’ = (a’, b’) \in R_q^2$ from $m’(X)$ and proves that the following relations hold through a ZKAoK for HE ciphertexts. <ul> <li>$b’ - a’ s = m’ + e’ \pmod{q}$</li> <li>$\Vert s \Vert_{\infty} \le B_s$ and $\Vert e’ \Vert_{\infty} \le B_e$</li> <li>$\mathsf{Coeff}(m’) \in \lceil \Delta \mathcal{X} \rceil \subseteq \mathbb{Z}_q^N$</li> </ul> </li> <li>The server verifies the ZKAoK for the ciphertext $\mathsf{ct}’$, and obtains the actual input ciphertext $\mathsf{ct}$ as follows. <ul> <li>$\mathsf{ct} \gets \mathsf{CoeffToSlot}(\mathsf{ct}’)$</li> </ul> </li> </ol> <p>The scaled coefficient packing in the first step allows a client to generate a ZKAoK for verifying the input domain in $R_q$. After the server verifies this ZKAoK, it can generate an input ciphertext whose slot values correspond to these coefficient values by applying the coeff-to-slot operation. 
As a result, the server can ensure that input ciphertexts are well-formed, i.e., the input messages lie within the input domain.</p> <h3 id="benchmark">Benchmark</h3> <p>With the above technique, we instantiate the first ZKAoK for CKKS ciphertexts<sup id="fnref:6"><a href="#fn:6" class="footnote" rel="footnote" role="doc-noteref">6</a></sup> that can verify the input domain. Specifically, we construct a ZKAoK for CKKS that proves that the input messages of ciphertexts lie in $[-1, 1]^N$, together with the well-formedness of the public keys, including the encryption, relinearization, rotation, and conjugation keys. In the following table, we provide the concrete benchmark results, where $k$ denotes the number of ciphertexts, $\Delta$ denotes the scaling factor, <strong>PK Size</strong> denotes the total size of public keys, and <strong>CT Size</strong> denotes the total size of ciphertexts. The performance is measured on an Intel Xeon Platinum 8268 CPU with a single thread.</p> <div style="margin-top: 1.5em;"></div> <table> <thead> <tr> <th>$k$</th> <th>$ \log\Delta $</th> <th>PK Size</th> <th>CT Size</th> <th>Proof Size</th> <th>Prover Time (s)</th> <th>Verifier Time (s)</th> </tr> </thead> <tbody> <tr> <td>2</td> <td>16</td> <td>39.5 MB</td> <td>1.57 MB</td> <td>17.9 MB</td> <td>324.35</td> <td>50.88</td> </tr> <tr> <td>4</td> <td>16</td> <td>39.5 MB</td> <td>3.14 MB</td> <td>18.9 MB</td> <td>365.46</td> <td>56.08</td> </tr> <tr> <td>8</td> <td>16</td> <td>39.5 MB</td> <td>6.28 MB</td> <td>21.0 MB</td> <td>442.86</td> <td>67.13</td> </tr> <tr> <td>2</td> <td>32</td> <td>39.5 MB</td> <td>1.57 MB</td> <td>18.7 MB</td> <td>356.90</td> <td>54.65</td> </tr> <tr> <td>4</td> <td>32</td> <td>39.5 MB</td> <td>3.14 MB</td> <td>20.4 MB</td> <td>425.26</td> <td>64.33</td> </tr> <tr> <td>8</td> <td>32</td> <td>39.5 MB</td> <td>6.28 MB</td> <td>24.0 MB</td> <td>561.28</td> <td>83.63</td> </tr> </tbody> </table> <p><br/></p> <h2 id="5-conclusion">5. 
Conclusion</h2> <div style="margin-top: 1.5em;"></div> <p>In summary, we examine the server privacy issues in CKKS-based protocols, particularly for encrypted MLaaS protocols. The key takeaways are as follows.</p> <ul> <li>We formalize the security notion for server privacy based on differential privacy.</li> <li>We achieve server privacy for CKKS without noise flooding.</li> <li>We construct the first zero-knowledge argument of knowledge for CKKS to handle malicious clients.</li> </ul> <hr/> <p><br/></p> <h2 id="references">References</h2> <div style="margin-top: 1.5em;"></div> <div class="footnotes" role="doc-endnotes"> <ol> <li id="fn:1"> <p>Martin R. Albrecht, Alex Davidson, Amit Deo, and Daniel Gardham. “Crypto Dark Matter on the Torus: Oblivious PRFs from shallow PRFs and TFHE.” Eurocrypt 2024. <a href="#fnref:1" class="reversefootnote" role="doc-backlink">&#8617;</a></p> </li> <li id="fn:2"> <p>Cynthia Dwork and Vitaly Feldman. “Privacy-preserving prediction.” COLT 2018. <a href="#fnref:2" class="reversefootnote" role="doc-backlink">&#8617;</a></p> </li> <li id="fn:3"> <p>Intak Hwang, Seonhong Min, Jinyeong Seo, and Yongsoo Song. “On the security and privacy of CKKS-based homomorphic evaluation protocols.” Asiacrypt 2025. <a href="#fnref:3" class="reversefootnote" role="doc-backlink">&#8617;</a></p> </li> <li id="fn:4"> <p>Cynthia Dwork, Frank McSherry, Kobbi Nissim, and Adam Smith. “Calibrating noise to sensitivity in private data analysis.” TCC 2006. <a href="#fnref:4" class="reversefootnote" role="doc-backlink">&#8617;</a></p> </li> <li id="fn:5"> <p>Intak Hwang, Hyeonbum Lee, Jinyeong Seo, and Yongsoo Song. “Practical zero-knowledge PIOP for maliciously secure multiparty homomorphic encryption.” ACM CCS 2025. 
<a href="#fnref:5" class="reversefootnote" role="doc-backlink">&#8617;</a></p> </li> <li id="fn:6"> <p><a href="https://github.com/SNUCP/ckks-piop">https://github.com/SNUCP/ckks-piop</a> <a href="#fnref:6" class="reversefootnote" role="doc-backlink">&#8617;</a></p> </li> </ol> </div>]]></content><author><name>Jinyeong Seo</name></author><summary type="html"><![CDATA[TL;DR: We investigate methods for protecting server privacy in CKKS-based protocols. Unlike exact homomorphic encryption schemes, formally defining security notions for the server is challenging in CKKS-based protocols due to the approximate nature of CKKS. We address this by introducing a new security notion called Differentially Private Homomorphic Encryption, which is motivated by differential privacy. Based on this notion, we construct a general compiler that transforms CKKS-based protocols into DPHE protocols. We also present the first zero-knowledge argument of knowledge for CKKS ciphertexts to protect server privacy against malicious clients.]]></summary></entry><entry><title type="html">Homomorphic Encryption for Data Science</title><link href="https://ckks.org/blog/2025/HE4DS/" rel="alternate" type="text/html" title="Homomorphic Encryption for Data Science"/><published>2025-12-08T09:12:00+00:00</published><updated>2025-12-08T09:12:00+00:00</updated><id>https://ckks.org/blog/2025/HE4DS</id><content type="html" xml:base="https://ckks.org/blog/2025/HE4DS/"><![CDATA[<ul> <li>Written by Allon Adir, Ehud Aharoni, Nir Drucker, Ronen Levy, Hayim Shaul, Omri Soceanu (IBM Research, Israel)</li> <li>Based on <a href="https://link.springer.com/book/10.1007/978-3-031-65494-7">https://link.springer.com/book/10.1007/978-3-031-65494-7</a> (Homomorphic Encryption for Data Science)</li> </ul> <p><em>TL;DR: FHE has advanced significantly since its introduction fifteen years ago, yet it remains challenging to use efficiently. 
We examine methods addressing three of the major challenges that cryptographers and data scientists face when using FHE: data packing, polynomial approximation, and data traversal.</em></p> <hr/> <p><br/></p> <p>More than a decade and a half has passed since the publication of Gentry’s original paper [1] about Fully Homomorphic Encryption (FHE). Since then, real progress has been made transforming FHE from a theoretical tool accessible only to a few into a practical solution with rapidly improving usability, performance, and abstraction. Still, the development of practical applications and data science models remains challenging for both cryptographers and data scientists. The three major challenges are outlined below and discussed in subsequent sections:</p> <ul> <li>Efficient data packing</li> <li>Accurate polynomial approximation of analytical functions</li> <li>Efficient data traversal</li> </ul> <p><br/></p> <h2 id="efficient-data-packing-and-tile-tensors">Efficient Data Packing and Tile Tensors</h2> <p>Several FHE schemes (e.g., BGV [2], BFV [3], CKKS [4]) encode a vector of plaintext values as polynomial coefficients and operate on it element-wise. Thus, encrypted data operations in many FHE schemes can be viewed as Single Instruction, Multiple Data (SIMD). Efficiently mapping complex data to these coefficients (or vector slots) is often a pain point for developers. Tile tensors [5] simplify this packing process by defining the layout of tensors and their tiling using high-level “tile tensor shapes”. Tile tensors extend known packing approaches, allowing researchers to focus on the algorithms rather than on the ciphertext internals.</p> <p>The idea behind tile tensors is to think of a ciphertext as a <em>tile</em> with a configurable shape and cover tensors with these tiles. This implies that the size of a tile equals the number of slots each ciphertext has. Changing the shape of the tiles changes how a tensor is packed into ciphertexts. 
For example, when covering a 2-dimensional matrix with a tile of shape 1x8 (assuming 8 slots in a ciphertext, for simplicity) we get row-based packing. With a tile of shape 8x1 we get column-based packing. Other shapes, such as 2x4 are also possible. See Figure 1 for an example.</p> <p><br/></p> <figure class="figure-class"> <div class="row mt-3"> <div class="col-sm-7 mt-3 mt-md-0 mx-auto d-block"> <figure> <picture> <source class="responsive-img-srcset" srcset="/assets/img/blog/2512_Hayim/image1-480.webp 480w,/assets/img/blog/2512_Hayim/image1-800.webp 800w,/assets/img/blog/2512_Hayim/image1-1400.webp 1400w," type="image/webp" sizes="95vw"/> <img src="/assets/img/blog/2512_Hayim/image1.png" class="img-fluid rounded z-depth-1" width="100%" height="auto" data-zoomable="" loading="eager" onerror="this.onerror=null; $('.responsive-img-srcset').remove();"/> </picture> </figure> </div> </div> <figcaption class="figure-caption"> Figure 1: Packing a 5x6 matrix in two different ways, assuming a ciphertext has 8 slots. On the left, a tile has the shape of 2x4. In this case the matrix is partitioned into 6 ciphertexts (tiles), padding with 0s where needed. On the right, a tile has a shape of 1x8. In this case the matrix is partitioned into 5 ciphertexts. Switching between the packings is done simply by changing the tile shape. The image is taken from a tutorial given in CCS’22 and is available online [6]. </figcaption> </figure> <p><br/></p> <h2 id="efficient-and-accurate-polynomial-approximating">Efficient and Accurate Polynomial Approximating</h2> <p>Another fundamental FHE challenge is function evaluation. Most FHE schemes support a handful of arithmetic primitives, such as element-wise addition, multiplication and vector rotation. Any operation beyond these primitives, e.g. division, comparison, not to mention complex neural network activation functions, requires some approximation. 
One can construct polynomials using only additions and multiplications, and then use them to approximate almost any function to a desired degree of accuracy. However, since every primitive comes with an inherent computational cost, each operation should be scrutinized so that only the number and type of operations needed to meet the accuracy goals are used. The art here is twofold: devising polynomial approximations that are both accurate and of low enough degree to achieve acceptable performance, and representing and evaluating these polynomials in ways optimized for FHE. Efficient polynomial evaluation, packing tricks (that can be implemented with Tile Tensors), and circuit crafting are all essential.</p> <p><br/></p> <h2 id="efficient-data-traversal">Efficient Data Traversal</h2> <p>Since algorithms under FHE deal with encrypted messages, they are unable to make any decision based on the encrypted input. This includes branching dynamically, i.e., taking a code path that depends on the encrypted input; see, for example, [7]. Instead, algorithms need to evaluate all possible branches and multiplex (“select”) their output. Even a simple algorithm such as searching a binary tree becomes exponentially more expensive in FHE because it needs to traverse every path of the tree and not just a single one as in the cleartext case. Algorithms such as those needed in private database queries or machine learning (decision trees) suffer from this limitation and are not efficient under FHE.</p> <p>Different methods have been proposed, among which is “Copy-and-Recurse” [8, 9]. Rather than branching, which is problematic under FHE, this method creates a copy (under FHE) of the branch that the algorithm needs to take and continues with this copy. 
Creating this copy requires a constant number of multiplications and additions for each node, but then the number of expensive computations performed at nodes (e.g., comparisons) is proportional to that of the cleartext algorithm.</p> <p><br/></p> <figure class="figure-class"> <div class="row mt-3"> <div class="col-sm-7 mt-3 mt-md-0 mx-auto d-block"> <figure> <picture> <source class="responsive-img-srcset" srcset="/assets/img/blog/2512_Hayim/image2-480.webp 480w,/assets/img/blog/2512_Hayim/image2-800.webp 800w,/assets/img/blog/2512_Hayim/image2-1400.webp 1400w," type="image/webp" sizes="95vw"/> <img src="/assets/img/blog/2512_Hayim/image2.png" class="img-fluid rounded z-depth-1" width="100%" height="auto" data-zoomable="" loading="eager" onerror="this.onerror=null; $('.responsive-img-srcset').remove();"/> </picture> </figure> </div> </div> <figcaption class="figure-caption"> Figure 2: Traversing a search tree using copy-and-recurse. To reach the black leaf we multiplex both branches to create a single copy which we recurse into. The number of operations is still linear, but the number of (expensive) comparisons is only logarithmic. </figcaption> </figure> <p><br/></p> <h2 id="a-book-for-the-practicing-cryptographer-and-data-scientist">A Book for the Practicing Cryptographer and Data Scientist</h2> <p>These challenges, alongside techniques and practical recipes to solve them, are important for algorithmic researchers attempting to utilize FHE. Each topic is explained further with concrete examples, code templates, and in-depth discussions in the book “<strong>Homomorphic Encryption for Data Science (HE4DS)</strong>” [10]. For cryptographers determined to bridge the theory-practice gap, and for data scientists eager to harness encrypted computation for real-world ML and analytics tasks, this text stands as an invitation: the road is still rugged, but the right tools and abstractions exist. 
The landscape of FHE is shifting from a field of delicate manual hacks to one supported by high-level libraries and reproducible patterns. Understanding these new abstractions will empower the next generation of privacy-preserving applications.</p> <hr/> <p><br/></p> <h2 id="references">References</h2> <p>[1] Craig Gentry. “Fully Homomorphic Encryption Using Ideal Lattices.” Proceedings of the 41st Annual ACM Symposium on Theory of Computing (STOC 2009), pp. 169–178. ACM, 2009. DOI: 10.1145/1536414.1536440</p> <p>[2] Zvika Brakerski, Craig Gentry, and Vinod Vaikuntanathan. “(Leveled) Fully Homomorphic Encryption without Bootstrapping.” Proceedings of the 3rd Innovations in Theoretical Computer Science Conference (ITCS 2012), pp. 309–325. ACM, 2012.</p> <p>[3] J. Fan and F. Vercauteren. Somewhat practical fully homomorphic encryption. IACR Cryptol. ePrint Arch., 2012:144.</p> <p>[4] Jung Hee Cheon, Andrey Kim, Miran Kim, and Yongsoo Song. “Homomorphic Encryption for Arithmetic of Approximate Numbers.” In Advances in Cryptology – ASIACRYPT 2017, Lecture Notes in Computer Science, vol. 10624, pp. 409–437. Springer, 2017.</p> <p>[5] Ehud Aharoni, Allon Adir, Moran Baruch, Nir Drucker, Gilad Ezov, Ariel Farkash, Lev Greenberg, Ramy Masalha, Guy Moshkowich, Dov Murik, Hayim Shaul, and Omri Soceanu. “HeLayers: A Tile Tensors Framework for Large Neural Networks on Encrypted Data.” Proceedings on Privacy Enhancing Technologies (PoPETs), 2023(1): 325–342. DOI: 10.56553/popets-2023-0020</p> <p>[6] Ehud Aharoni, Nir Drucker and Hayim Shaul, “Tutorial: Advanced HE packing methods with applications to ML”, ACM CCS 2022, https://research.ibm.com/haifa/dept/vst/tutorial_ccs2022.html.</p> <p>[7] Sunchul Jung, “Convergent Evolution: Why Secure Homomorphic Encryption Will Resemble High-Performance GPU Computing”, https://ckks.org/blog/2025/convergent-evolution/#21-the-fhe-security-model-and-the-turing-barrier</p> <p>[8] Eyal Kushnir, Guy Moshkowich, and Hayim Shaul. 
“Secure Range-Searching Using Copy-And-Recurse.” Proceedings on Privacy Enhancing Technologies (PoPETs), 2024(3).</p> <p>[9] Eyal Kushnir and Hayim Shaul. “Improved Range Searching and Range Emptiness Under FHE Using Copy-And-Recurse.” Cryptology ePrint Archive, Paper 2025/751, 2025. Available at: https://eprint.iacr.org/2025/751</p> <p>[10] Adir, A., Aharoni, E., Drucker, N., Levy, R., Shaul, H., and Soceanu, O. Homomorphic Encryption for Data Science (HE4DS). Springer Nature Switzerland, 2024. ISBN 9783031654947. Available at: https://link.springer.com/book/10.1007/978-3-031-65494-7</p>]]></content><author><name>Allon Adir, Ehud Aharoni, Nir Drucker, Ronen Levy, Hayim Shaul, Omri Soceanu</name></author><summary type="html"><![CDATA[TL;DR: FHE has advanced significantly since its introduction fifteen years ago, yet it remains challenging to use efficiently. We examine methods addressing three of the major challenges that cryptographers and data scientists face when using FHE: data packing, polynomial approximation, and data traversal.]]></summary></entry><entry><title type="html">A Novel Asymmetric BSGS Polynomial Evaluation Algorithm under Homomorphic Encryption</title><link href="https://ckks.org/blog/2025/asymmetric-BSGS-algorithm/" rel="alternate" type="text/html" title="A Novel Asymmetric BSGS Polynomial Evaluation Algorithm under Homomorphic Encryption"/><published>2025-11-03T00:00:00+00:00</published><updated>2025-11-03T00:00:00+00:00</updated><id>https://ckks.org/blog/2025/asymmetric-BSGS-algorithm</id><content type="html" xml:base="https://ckks.org/blog/2025/asymmetric-BSGS-algorithm/"><![CDATA[<ul> <li>Written by <a href="https://orcid.org/0009-0004-6490-7717">Qingfeng Wang</a> (Institute of Information Engineering, Chinese Academy of Sciences)</li> <li>Based on <a href="https://dl.acm.org/doi/10.1145/3708821.3710822">https://doi.org/10.1145/3708821.3710822</a> (ASIACCS 2025)</li> </ul> <p><em>TL;DR: We introduce a new polynomial evaluation algorithm 
under homomorphic encryption, namely the Asymmetric BSGS Algorithm. It is a generalization and specialization of the original Baby-Step Giant-Step algorithm in the leveled FHE computation model. Leveraging the observation that there is a difference in multiplicative depth between the baby-step set and the giant-step set, this algorithm significantly reduces the number of modulus and key switches required for dense polynomial evaluation from \(O(\sqrt{d})\) to \(O(d^{1/t})\), by adjusting the set decomposition method and relaxing the control of noise growth and ciphertext size in some calculations. Here, \(d\) is the polynomial degree and \(t\) is a small constant which, according to our experiments, is recommended to be chosen as \(4\).</em></p> <hr/> <p><br/></p> <h2 id="1-how-to-compute-polynomial-evaluation">1. How to Compute Polynomial Evaluation?</h2> <p>Polynomial evaluation is one of the most crucial tasks in the field of computing. This is particularly true in leveled FHEs, as they only support homomorphic addition, homomorphic multiplication, and some automorphisms related to SIMD packing. Almost all functions are either interpolated (in BGV/BFV) or approximated (in CKKS) with the use of specific polynomials.</p> <p>So, how can we efficiently compute polynomial evaluation? For polynomials that can be factored into sparse factors, such as \(f(X) = (X^3+1)^{16}\cdot(X^{32}+X-1)\), we can evaluate each sparse factor separately and then multiply the results together. Thus, the computational cost is relatively low, although the multiplicative depth may not be optimal. However, most polynomials are difficult or even impossible to factor into sparse factors. 
One can only use general polynomial evaluation algorithms to deal with them, among which the most well-known is the BSGS algorithm<sup id="fnref:1"><a href="#fn:1" class="footnote" rel="footnote" role="doc-noteref">1</a></sup>.</p> <p>Briefly speaking, for a degree-\(d\) polynomial \(f(X) = \sum_{i}f_{i}X^{i}\), the BSGS algorithm selects two integers, \(k\approx m\approx\sqrt{d}\), such that \(km &gt; d\). Then the coefficients of \(f(X)\) can be partitioned into \(m\) intervals of length at most \(k\), namely</p> \[f(X)=\sum_{j = 0}^{m - 1}f^{(j)}(X)\cdot(X^{k})^{j},\] <p>where each \(f^{(j)}(X)=\sum_{i = 0}^{k-1}f_{jk + i}X^{i}\) is a polynomial of degree less than \(k\).</p> <p>Given a value \(x\), pre-compute \(S_1 := \{1,x,\cdots,x^{k-1}\}\) (which we call the baby-step set) and \(S_2 := \{1,z,\cdots,z^{m-1}\}\) (which we call the giant-step set), where \(z := x^k\). Then the computation of all \(y_j := f^{(j)}(x)\) is merely a linear combination of the elements in \(S_1\), and this can be efficiently computed in leveled FHE. Finally, by multiplying the \(y_j\) with the elements in \(S_2\) and then summing them up, the computation of \(f(x)\) can be completed.</p> <p>The above-mentioned algorithm takes \(3\sqrt{d}\) non-scalar multiplications: computing \(S_1\), computing \(S_2\), and the final multiplication and accumulation each take \(\sqrt{d}\) non-scalar multiplications. If we use Horner’s rule, the computation of \(S_2\) can be omitted, but at the cost of increasing the multiplication depth to the order of \(\sqrt{d}\). 
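</p> <p>As a plain-arithmetic illustration of the grouping described above, the following Python sketch evaluates a polynomial over ordinary integers rather than ciphertexts. The function name <code class="language-plaintext highlighter-rouge">bsgs_eval</code> and the particular choice of \(k\) and \(m\) are assumptions made for this example, and the powers are built sequentially instead of with a depth-optimized method, since depth is irrelevant outside FHE.</p>

```python
import math

def bsgs_eval(coeffs, x):
    """Evaluate sum(coeffs[i] * x**i) with baby-step/giant-step grouping."""
    d = len(coeffs) - 1
    k = math.isqrt(d) + 1              # number of baby steps
    m = d // k + 1                     # number of giant steps, so k * m exceeds d
    baby = [1]                         # S_1 = {1, x, ..., x^(k-1)}
    for _ in range(k - 1):
        baby.append(baby[-1] * x)
    z = baby[-1] * x                   # z = x^k
    giant = [1]                        # S_2 = {1, z, ..., z^(m-1)}
    for _ in range(m - 1):
        giant.append(giant[-1] * z)
    total = 0
    for j in range(m):
        chunk = coeffs[j * k:(j + 1) * k]               # coefficients of f^(j)
        y_j = sum(c * b for c, b in zip(chunk, baby))   # linear combination of S_1
        total += y_j * giant[j]                         # one product with S_2 per chunk
    return total

coeffs = list(range(1, 34))            # an arbitrary degree-32 example
direct = sum(c * 3 ** i for i, c in enumerate(coeffs))
print(bsgs_eval(coeffs, 3) == direct)  # prints True
```

<p>For \(d = 32\) this choice gives \(k = m = 6\). The inner sum only multiplies elements of \(S_1\) by plaintext coefficients, while each chunk contributes a single product with an element of \(S_2\).</p> <p>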
The recursive variant of the BSGS algorithm, the Paterson-Stockmeyer algorithm, only requires \(\sqrt{2d}+O(\log d)\) non-scalar multiplications and has a multiplicative depth of logarithmic order.</p> <p>It has been proven that <strong>a general polynomial evaluation algorithm requires at least \(\sqrt{d}\) non-scalar multiplications</strong><sup id="fnref:1:1"><a href="#fn:1" class="footnote" rel="footnote" role="doc-noteref">1</a></sup>, so it seems that there is no way to further improve the performance.</p> <p><br/></p> <h2 id="2-we-can-do-better">2. We Can Do Better!</h2> <p>We need to emphasize that there are significant differences between the leveled FHE computation model and the classical computation model. In the classical computation model, multiplication is an atomic operation, and there is no concept of “levels” for data.</p> <p>However, in the leveled FHE computation model, homomorphic multiplication consists of three distinct operations with different purposes.</p> <ol> <li><strong>Tensor product</strong>: A ciphertext \(\mathbf{ct} \in \mathcal{R}_q[\mathbf{S}]\) is a polynomial over the cyclotomic ring \(\mathcal{R}_q\) with respect to the variable \(\mathbf{s} \in \mathcal{R}_q\), and its (noisy) plaintext is its evaluation \(\mathbf{ct}(\mathbf{s}) = \textbf{m}+\textbf{e} \in \mathcal{R}_q\). Multiplying two ciphertexts then realizes the multiplication of the two underlying plaintexts. In a typical setting, the asymptotic complexity of the tensor product is linear in terms of the security parameter, and the cost of tensor product amounts to less than 5% of that of homomorphic multiplication in practice.</li> <li><strong>Modulus switching</strong>: When performing the tensor product, the ciphertext noise \(\mathbf{e} \in \mathcal{R}_q\) is also multiplied, and its growth rate is proportional to the norm of \(\mathbf{e}\). 
By dividing by a rescaling factor in the cyclotomic field \(\mathcal{K}\supset\mathcal{R}\) and then rounding to \(\mathcal{R}\) to discard the lower-order bits of the noise \(\mathbf{e}\), the noise growth can be effectively controlled. Modulus switching also endows ciphertexts with levels. The asymptotic complexity of modulus switching is quasi-linear with respect to the security parameter, and modulus switching typically accounts for approximately 20% of the cost of homomorphic multiplication.</li> <li><strong>Relinearization</strong>: Because polynomial multiplication over \(\mathcal{R}_q[\mathbf{S}]\) will increase the degree of the ciphertext, we need to use key switching to linearize the ciphertext polynomial. This is also of quasi-linear complexity with respect to the security parameter, and it is usually the most costly one, especially when the ciphertext size increases.</li> </ol> <p>Now, let’s consider what is special about implementing the BSGS algorithm in leveled FHEs. Horner’s rule would imply depth proportional to the square root of the degree, so it is excluded. Given the ciphertext of \(x\), using a binary-tree approach to pre-compute the ciphertexts of \(S_1\) and \(S_2\) requires \(\sqrt{d}\) standard homomorphic multiplications for each set. In the process of multiplying the \(y_j\) with the elements in \(S_2\) and accumulating, the lazy technique, <em>which takes advantage of the commutativity and associativity between the above-mentioned three operations and addition</em>, can be used to combine some modulus and key switches. 
Altogether, approximately \(2\sqrt{d}\) modulus and key switches are required.</p> <p>The figure below shows the homomorphic evaluation process of a 32-degree polynomial, where \(k = m = 6\).</p> <div class="row mt-3"> <div class="col-sm-10 mt-3 mt-md-0 mx-auto d-block"> <figure> <picture> <source class="responsive-img-srcset" srcset="/assets/img/blog/2511_Qingfeng/img0-480.webp 480w,/assets/img/blog/2511_Qingfeng/img0-800.webp 800w,/assets/img/blog/2511_Qingfeng/img0-1400.webp 1400w," type="image/webp" sizes="95vw"/> <img src="/assets/img/blog/2511_Qingfeng/img0.png" class="img-fluid rounded" width="100%" height="auto" data-zoomable="" loading="eager" onerror="this.onerror=null; $('.responsive-img-srcset').remove();"/> </picture> </figure> </div> </div> <p>We observe that <strong>the ciphertexts of \(S_1\) have a much shallower ciphertext level than those of \(S_2\)</strong>. Specifically, the multiplicative depth of the ciphertexts of \(S_1\) is \(\ell_1=\left\lceil \log (k - 1) \right\rceil\), and the multiplicative depth of the ciphertexts of \(S_2\) is \(\ell_2=\left\lceil \log (km - k) \right\rceil\). The difference between the two is \(\ell_2-\ell_1\approx\frac{1}{2}\cdot\log d\), which is half of the overall multiplicative depth. We note that the noise control for the ciphertexts of set \(S_1\) does not actually need to be so strict. <strong>As long as the multiplicative depth of \(S_1\) is not greater than that of \(S_2\)</strong>, the multiplicative depth of the final calculation result will not change.</p> <p>How can we make full use of this difference in multiplicative depth? During the construction of \(S_1\), given \(t = \ell_2-\ell_1+1\) ciphertexts located at most at the \(\ell_1\)-th layer, it is tempting to <em>evaluate their tensor products without performing modulus switching</em>. 
In this case, the depth of the calculated \(S_1\) will not exceed \(\ell_2\), and the computational efficiency is significantly improved.</p> <p>Since the size of \(S_2\) is \(\sqrt{d}\), the overall number of modulus switches is still \(O(\sqrt{d})\), but this is sufficient to provide us with an optimization direction. To achieve an improvement in asymptotic complexity, we propose to:</p> <ol> <li>Adjust the relative sizes of \(S_1\) and \(S_2\).</li> <li>Further decompose \(S_1\) into several smaller sets.</li> </ol> <p><br/></p> <h2 id="3-asymmetric-bsgs-algorithm">3. Asymmetric BSGS Algorithm</h2> <h3 id="31-preliminary-version">3.1 Preliminary Version</h3> <p>First, we introduce “set multiplication”, which is used to simplify the description of set decomposition in the algorithm. Given two subsets \(S\) and \(S'\) of a multiplicative group \((G,\cdot)\), define \(S \odot S':=\{s \cdot s' \mid s\in S, s'\in S'\} \subseteq G\).</p> <p>For a set \(S := \{1,x,\cdots,x^d\}\), the BSGS algorithm considers two sets \(S_1\) and \(S_2\) of approximately the same size, such that \(S\subseteq S_1\odot S_2\). We generalize it to: \(S \subseteq S_1^{(0)} \odot \cdots \odot S_1^{(t - 1)} \odot S_2,\)</p> <p>such that \(|S_1^{(j)}| = O(d^{1/t})\) and \(|S_2|\leq2^{t - 1}=O(1)\), where \(t\) is a small integer.</p> <p>Most importantly, we ensure that <strong>the multiplicative depth of \(S_2\) is \(t - 1\) levels higher than that of \(S_1^{(j)}\)</strong>. Then,
Then,</p> <ol> <li>First, use the binary-tree method to construct \(S_1^{(j)}\) and \(S_2\).</li> <li>Second, reconstruct \(S_1\) based on \(S_1^{(j)}\) and calculate \(y_j := f^{(j)}(x)\), <strong>without performing modulus switching during this process</strong>.</li> <li>Finally, calculate \(f(x)\) based on \(y_j\) and \(S_2\).</li> </ol> <p>Since the multiplicative depth of \(S_1\) calculated in step 2 does not exceed the multiplicative depth \(\ell_2\) of \(S_2\), the multiplicative depth spent by the above algorithm is \(\ell_2 + 1\), which is the same as that of the original BSGS algorithm. Let’s briefly analyze the number of modulus switches spent by this algorithm: Step 1 requires \(t\cdot O(d^{1/t})+2^{t-1}\) modulus switches, step 2 no longer requires modulus switches, and step 3 requires \(O(1)\) modulus switches. In total, it spends \(O(d^{1/t})\) modulus switches.</p> <p>The decomposition of the set \(S\) is not unique. By generalizing the decomposition method in the original BSGS, we present a simple method based on digit decomposition. Suppose \(\ell=\lceil\log d\rceil\), set \(\ell'=\ell-(t - 1)\), \(\ell''=\ell'/t\) and \(B = 2^{\ell''}\). Define the sets</p> \[S_1^{(j)}:=\{x^e\mid e = i\cdot B^j,\, i = 0,\cdots,B - 1\},\, j = 0,\cdots,t - 1\] <p>and</p> \[S_2:=\{x^e\mid e = i\cdot B^t,\, i = 0,\cdots,\lfloor d/B^t\rfloor\},\] <p>where the multiplicative depth of \(S_1^{(j)}\) is \((j + 1)\cdot\ell''\leq\ell'\), which is \((t - 1)\) levels shallower than the multiplicative depth \(\ell\) of the set \(S_2\). 
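</p> <p>To make the digit decomposition concrete, the sketch below works with exponents only, so that set multiplication \(\odot\) becomes addition of exponents. It assumes that \(t\) divides \(\ell'\) (the text writes \(\ell'' = \ell'/t\) without rounding), and the helper names are hypothetical.</p>

```python
from itertools import product

def exponent_sets(d, t):
    """Return B, the exponent lists of S_1^(0..t-1), and the exponent list of S_2."""
    ell = (d - 1).bit_length()         # ell = ceil(log2(d))
    ell_p = ell - (t - 1)              # ell' = ell - (t - 1)
    if ell_p % t != 0:
        raise ValueError("this sketch assumes that t divides ell'")
    B = 2 ** (ell_p // t)              # B = 2^(ell'')
    small = [[i * B ** j for i in range(B)] for j in range(t)]
    giant = [i * B ** t for i in range(d // B ** t + 1)]
    return B, small, giant

def covers(d, small, giant):
    """Check that every exponent 0..d is a sum with one term taken from each set."""
    reachable = {sum(combo) for combo in product(*small, giant)}
    return all(e in reachable for e in range(d + 1))

B, small, giant = exponent_sets(2047, 4)
print(B, [len(s) for s in small], len(giant), covers(2047, small, giant))
# prints: 4 [4, 4, 4, 4] 8 True
```

<p>With \(d = 2047\) and \(t = 4\), this gives \(B = 4\), four digit sets of size 4, and a giant-step set of size \(8 = 2^{t-1}\), whose sums cover every exponent from 0 to \(d\).</p> <p>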
The figure below shows how the multiplicative depths of the sets relate to one another.</p> <div class="row mt-3"> <div class="col-sm-10 mt-3 mt-md-0 mx-auto d-block"> <figure> <picture> <source class="responsive-img-srcset" srcset="/assets/img/blog/2511_Qingfeng/img1-480.webp 480w,/assets/img/blog/2511_Qingfeng/img1-800.webp 800w,/assets/img/blog/2511_Qingfeng/img1-1400.webp 1400w," type="image/webp" sizes="95vw"/> <img src="/assets/img/blog/2511_Qingfeng/img1.png" class="img-fluid rounded" width="100%" height="auto" data-zoomable="" loading="eager" onerror="this.onerror=null; $('.responsive-img-srcset').remove();"/> </picture> </figure> </div> </div> <h3 id="32-complete-version">3.2 Complete Version</h3> <p>In the above-mentioned algorithm, there are still two directions for optimization:</p> <ul> <li><strong>Reduce the number of key switches</strong>. Since \(t\) is a small integer, even if no relinearization is performed in step 2, the ciphertexts in the obtained \(S_1\) are polynomials in \(\mathcal{R}_q[\mathbf{S}]\) with degree at most \(t\). <em>We postpone all relinearizations to step 3, utilizing the lazy technique along with a larger key-switching key.</em></li> <li><strong>Reduce the number of tensor products</strong>. The size of the set \(S_1\) is \(O(d^{1/t})^t = O(d)\), so step 2 requires the computation of \(O(d)\) tensor products. <em>This can be reduced to \(O(\sqrt{d})\) by re-applying the BSGS algorithm, which we term the “Internal BSGS Subroutine”.</em></li> </ul> <p>Specifically, let \(t'=\lceil t/2\rceil\) and define</p> \[\bar{S}_1 = \{x^e\mid e = 0,\cdots,B^{t'}-1\}\] <p>and</p> \[\hat{S}_1=\{x^e\mid e = i\cdot B^{t'},\, i = 0,\cdots,B^{t - t'}-1\}.\] <p>Then, it can be proven that \(\bar{S}_1\subseteq S_1^{(0)}\odot\cdots\odot S_1^{(t'-1)}\), \(\hat{S}_1\subseteq S_1^{(t')}\odot\cdots\odot S_1^{(t - 1)}\) and \(S_1\subseteq \bar{S}_1\odot\hat{S}_1\), where \(|\bar{S}_1| \approx |\hat{S}_1| = O(\sqrt{d})\). 
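</p> <p>These containments can be checked numerically on exponents for small parameters. The following sketch uses \(B = 4\) and \(t = 4\) as assumed example values, and the function name is hypothetical.</p>

```python
from itertools import product

def internal_split(B, t):
    """Exponent lists of the two halves used by the Internal BSGS Subroutine."""
    t_half = -(-t // 2)                                        # t' = ceil(t / 2)
    bar = list(range(B ** t_half))                             # exponents of S-bar_1
    hat = [i * B ** t_half for i in range(B ** (t - t_half))]  # exponents of S-hat_1
    return t_half, bar, hat

B, t = 4, 4
t_half, bar, hat = internal_split(B, t)
digit_sets = [[i * B ** j for i in range(B)] for j in range(t)]
low = {sum(c) for c in product(*digit_sets[:t_half])}    # first t' digit sets combined
high = {sum(c) for c in product(*digit_sets[t_half:])}   # remaining digit sets combined
full = {a + b for a in bar for b in hat}                 # exponents of S-bar_1 (.) S-hat_1
print(set(bar).issubset(low), set(hat).issubset(high),
      set(range(B ** t)).issubset(full))
```

<p>Here both halves contain \(B^{t'} = 16\) exponents, while together they reach all \(B^t = 256\) exponents of \(S_1\).</p> <p>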
Similar to the original BSGS algorithm, based on the pre-computed \(\bar{S}_1\) and \(\hat{S}_1\), the \(y_j\) in step 2 can be calculated rapidly.</p> <p>Given a polynomial \(f(X)=\sum_j f^{(j)}(X)\cdot(X^{B^t})^j\) and the ciphertext of the value \(x\), the steps of the asymmetric BSGS algorithm are as follows:</p> <ol> <li>First, compute \(S_1^{(j)}\) and \(S_2\), which costs \(O(d^{1/t})\) standard homomorphic multiplications.</li> <li>Second, construct \(\bar{S}_1\) and \(\hat{S}_1\) based on \(S_1^{(j)}\), and then use the Internal BSGS Subroutine to compute \(y_j := f^{(j)}(x)\). During this process, only \(O(\sqrt{d})\) tensor products are performed.</li> <li>Finally, perform modulus switching and relinearization on \(y_j\), and then compute \(f(x)\) based on \(S_2\). This step costs \(O(1)\) standard homomorphic multiplications.</li> </ol> <p>This completes the high-level description of the asymmetric BSGS algorithm.</p> <h3 id="33-specific-implementation">3.3 Specific Implementation</h3> <p>Because <em>different leveled FHEs adopt different encoding strategies</em>, there are subtle differences in the specific implementation of the asymmetric BSGS algorithm.</p> <ul> <li><strong>BGV uses LSB encoding and NTT packing</strong>, with explicit ciphertext levels and modulus switching. The asymmetric BSGS algorithm fully adheres to the description in the previous subsection.</li> <li><strong>CKKS uses fixed-point number encoding and DFT packing</strong>, with explicit ciphertext levels, and modulus switching is replaced by rescaling. The asymmetric BSGS algorithm also fully follows the description in the previous subsection, and its computational accuracy is close to that of the original BSGS algorithm.</li> <li><strong>BFV uses MSB encoding and NTT packing</strong>, and modulus switching is implicitly performed during the tensor product, making it difficult to actively relax the noise control. 
For Step 2, we suggest first converting all the BFV ciphertexts in \(S_1^{(j)}\) to BGV ciphertexts<sup id="fnref:2"><a href="#fn:2" class="footnote" rel="footnote" role="doc-noteref">2</a></sup>, then computing the tensor products, and finally converting \(\bar{S}_1\) and \(\hat{S}_1\) back to BFV ciphertexts.</li> </ul> <h3 id="34-complexity-analysis">3.4 Complexity Analysis</h3> <p>The following table compares the asymptotic complexity of the asymmetric BSGS algorithm with that of the original BSGS algorithm and its recursive variant (Paterson-Stockmeyer algorithm).</p> <p>A recent article<sup id="fnref:3"><a href="#fn:3" class="footnote" rel="footnote" role="doc-noteref">3</a></sup> proposed an automorphism-based polynomial evaluation algorithm with a computational complexity of only \(O(\log d)\). However, this algorithm depends on a specific algebraic structure and can only support polynomial evaluations of limited degrees, so it is excluded here.</p> <table> <thead> <tr> <th>Algorithm</th> <th>Scalar Mult.</th> <th>Addition</th> <th>Tensor Product</th> <th>Key Switching</th> <th>Modulus Switching</th> <th>Multiplicative Depth</th> </tr> </thead> <tbody> <tr> <td>Original BSGS Algorithm</td> <td>\(O(d)\)</td> <td>\(O(d)\)</td> <td>\(\approx 3\sqrt{d}\)</td> <td>\(\approx 2\sqrt{d}\)</td> <td>\(\approx 2\sqrt{d}\)</td> <td>\(\lceil\log d\rceil + 1\)</td> </tr> <tr> <td>Paterson-Stockmeyer Algorithm</td> <td>\(O(d)\)</td> <td>\(O(d)\)</td> <td>\(\approx \sqrt{2d}\)</td> <td>\(\approx \sqrt{2d}\)</td> <td>\(\approx \sqrt{2d}\)</td> <td>\(\lceil\log d\rceil\)</td> </tr> <tr> <td>Asymmetric BSGS Algorithm</td> <td>\(O(d)\)</td> <td>\(O(d)\)</td> <td>\(O(\sqrt{d})\)</td> <td>\(O(d^{1/t})\)</td> <td>\(O(d^{1/t})\)</td> <td>\(\lceil\log d\rceil + 1\)</td> </tr> </tbody> </table> <p><br/></p> <p>To keep the description concise, we ignored the influence of the constant \(t\) in the asymptotic complexity of the asymmetric BSGS algorithm. 
However, we note that as \(t\) increases, there is a factor of \(t\) hidden in the complexity of key switching and modulus switching, and a factor of \(2^{t - 1}\) hidden in the complexity of tensor products. Therefore, for any fixed $d$, there is a turning point at which increasing \(t\) further will instead increase the overall computational cost.</p> <p>To strike a balance between the positive and negative effects, we conducted a series of experiments. The results show that for polynomials of degree less than 65536, setting \(t = 4\) yields the optimal performance in most scenarios. An interesting follow-up direction would be to mitigate the negative impact of the parameter $t$ in this asymptotic complexity, so as to further increase its value.</p> <p>The figure below shows the exact number of modulus switches and key switches in the asymmetric BSGS algorithm with different parameters. Although it performs worse than the Paterson-Stockmeyer algorithm for polynomials of low degrees, as the degree of the polynomial increases, the number of modulus and key switches is only a fraction of that in the Paterson-Stockmeyer algorithm.</p> <div class="row mt-3"> <div class="col-sm-11 mt-3 mt-md-0 mx-auto d-block"> <figure> <picture> <source class="responsive-img-srcset" srcset="/assets/img/blog/2511_Qingfeng/img2-480.webp 480w,/assets/img/blog/2511_Qingfeng/img2-800.webp 800w,/assets/img/blog/2511_Qingfeng/img2-1400.webp 1400w," type="image/webp" sizes="95vw"/> <img src="/assets/img/blog/2511_Qingfeng/img2.png" class="img-fluid rounded" width="100%" height="auto" data-zoomable="" loading="eager" onerror="this.onerror=null; $('.responsive-img-srcset').remove();"/> </picture> </figure> </div> </div> <p><br/></p> <h2 id="4-applications">4. 
Applications</h2> <p>The asymmetric BSGS algorithm is universal and can handle the evaluation tasks of polynomials of any degree in all leveled FHEs.</p> <p>All homomorphic computing tasks that rely on high-degree dense polynomial evaluation will benefit from this, including leveled FHE bootstrapping and LWE amortized bootstrapping. For example, we obtain the following accelerations:</p> <ul> <li>\(2.9\times\) improvement in throughput for the bootstrapping of BGV presented in paper<sup id="fnref:4"><a href="#fn:4" class="footnote" rel="footnote" role="doc-noteref">4</a></sup>.</li> <li>\(3.5\times\) decrease in latency for a recent amortized bootstrapping for LWE ciphertext presented in paper<sup id="fnref:5"><a href="#fn:5" class="footnote" rel="footnote" role="doc-noteref">5</a></sup>.</li> </ul> <p><br/></p> <h2 id="5-references">5. References</h2> <div class="footnotes" role="doc-endnotes"> <ol> <li id="fn:1"> <p>Mike Paterson, Larry J. Stockmeyer. On the number of nonscalar multiplications necessary to evaluate polynomials. SIAM J. Comput. 1973. <a href="#fnref:1" class="reversefootnote" role="doc-backlink">&#8617;</a> <a href="#fnref:1:1" class="reversefootnote" role="doc-backlink">&#8617;<sup>2</sup></a></p> </li> <li id="fn:2"> <p>Jacob Alperin-Sheriff and Chris Peikert. Practical Bootstrapping in Quasi-linear Time. CRYPTO 2013. <a href="#fnref:2" class="reversefootnote" role="doc-backlink">&#8617;</a></p> </li> <li id="fn:3"> <p>Hiroki Okada, Rachel Player, Simon Pohmann. Homomorphic Polynomial Evaluation Using Galois Structure and Applications to BFV Bootstrapping. ASIACRYPT 2023. <a href="#fnref:3" class="reversefootnote" role="doc-backlink">&#8617;</a></p> </li> <li id="fn:4"> <p>Shai Halevi, Victor Shoup. Bootstrapping for HElib. EUROCRYPT 2015. <a href="#fnref:4" class="reversefootnote" role="doc-backlink">&#8617;</a></p> </li> <li id="fn:5"> <p>Zeyu Liu, Yunhao Wang. 
Amortized Functional Bootstrapping in Less than 7 ms, with Õ(1) Polynomial Multiplications. ASIACRYPT 2023. <a href="#fnref:5" class="reversefootnote" role="doc-backlink">&#8617;</a></p> </li> </ol> </div>]]></content><author><name>Qingfeng Wang</name></author><summary type="html"><![CDATA[TL;DR: We introduce a new polynomial evaluation algorithm under homomorphic encryption, namely the Asymmetric BSGS Algorithm. It is a generalization and specialization of the original Baby-Step Giant-Step algorithm in the leveled FHE computation model. Leveraging the observation that there is a difference in multiplicative depth between the baby-step set and the giant-step set, this algorithm significantly reduces the number of modulus and key switches required for dense polynomial evaluation from $O(\sqrt{d})$ to $O(d^{1/t})$, by adjusting the set decomposition method and relaxing the control of noise growth and ciphertext size in some calculations. Here, $d$ is the polynomial degree and $t$ is a small constant which, according to our experiments, is recommended to be chosen as $4$.]]></summary></entry><entry><title type="html">Leveraging Discrete CKKS to Bootstrap in High Precision</title><link href="https://ckks.org/blog/2025/high_prec_bootstrap/" rel="alternate" type="text/html" title="Leveraging Discrete CKKS to Bootstrap in High Precision"/><published>2025-10-06T15:12:00+00:00</published><updated>2025-10-06T15:12:00+00:00</updated><id>https://ckks.org/blog/2025/high_prec_bootstrap</id><content type="html" xml:base="https://ckks.org/blog/2025/high_prec_bootstrap/"><![CDATA[<ul> <li>Written by <a href="https://hmchoe0528.github.io">Hyeongmin Choe</a> (CryptoLab)</li> <li>Based on <a href="https://ia.cr/2025/1786">https://ia.cr/2025/1786</a> (CCS 2025)</li> </ul> <p><em>TL;DR: We introduce a new high-precision CKKS bootstrapping method. It leverages a novel Integer Cleaning strategy inspired by the Discrete CKKS technique and is implemented using the Grafting technique. 
We highlight its main building blocks and discuss its efficiency.</em></p> <hr/> <p><br/></p> <h2 id="why-high-precision-bootstrapping-matters">Why High-Precision Bootstrapping Matters</h2> <p>CKKS bootstrapping aims to refresh ciphertexts by increasing their modulus while preserving the encrypted message, enabling further homomorphic computations. However, bootstrapping introduces an approximation error. Most existing implementations achieve <em>5–25 bits</em> of precision, where the bit-precision can be defined as the negative base-2 logarithm of the worst-case error, measured across many runs.</p> <p>Some advanced applications, however, require much smaller error to support stronger security properties such as <em>Circuit Privacy</em> and <em>IND-CPA-D Security</em>. It is also important for Threshold-FHE (see <a href="https://ckks.org/blog/2025/threshold/">this blogpost</a>). These are often achieved via noise flooding—adding large noise relative to the error before decryption—which blinds the secret-dependent error terms but also the lower bits of the message. To retain precision after flooding, the pre-flooding precision must be higher, typically <em>64–80 bits</em> or more.</p> <div class="row mt-3"> <div class="col-sm-7 mt-3 mt-md-0 mx-auto d-block"> <figure> <picture> <source class="responsive-img-srcset" srcset="/assets/img/blog/2510_Hyeongmin/image0-480.webp 480w,/assets/img/blog/2510_Hyeongmin/image0-800.webp 800w,/assets/img/blog/2510_Hyeongmin/image0-1400.webp 1400w," type="image/webp" sizes="95vw"/> <img src="/assets/img/blog/2510_Hyeongmin/image0.png" class="img-fluid rounded z-depth-1" width="100%" height="auto" data-zoomable="" loading="eager" onerror="this.onerror=null; $('.responsive-img-srcset').remove();"/> </picture> </figure> </div> </div> <p>Supporting such high precision securely and efficiently is a key challenge in CKKS bootstrapping. 
In LLK+22<sup id="fnref:1"><a href="#fn:1" class="footnote" rel="footnote" role="doc-noteref">1</a></sup>, a high-accuracy polynomial approximation for modular reduction \(x \mapsto (x \bmod q)\) was introduced. This approximation enables high-precision CKKS bootstrapping, though it incurs enormous modulus consumption. Note that larger modulus consumption during bootstrapping leaves fewer levels available for subsequent computations, which necessitates more frequent bootstrappings and thus significantly increases the overall cost. BCC+22<sup id="fnref:2"><a href="#fn:2" class="footnote" rel="footnote" role="doc-noteref">2</a></sup> introduced Meta-BTS, which achieves high precision by performing multiple sequential low-precision bootstrappings, at the cost of latency that increases linearly in the target precision. This approach also consumes slightly more modulus.</p> <p><br/></p> <h2 id="evalround-paradigm-for-high-precision">EvalRound+ Paradigm for High Precision</h2> <p>We follow the <strong>EvalRound+</strong> paradigm as the first building block of high-precision CKKS bootstrapping. It differs slightly from traditional CKKS bootstrapping. Let’s first recall the traditional CKKS bootstrapping pipeline:</p> \[\text{ModRaise} \rightarrow \text{CoeffsToSlots} \rightarrow \text{EvalMod} \rightarrow \text{SlotsToCoeffs}.\] <p>Consider a ciphertext encrypting a message \(m\) (or the coefficient vector of a polynomial) under modulus \(q_0\). In the ModRaise step, the ciphertext modulus is extended from \(q_0\) to a larger modulus \(Q\), which introduces an extra integer polynomial \(I\) multiplied by \(q_0\). As a result, the ciphertext encrypts \(m + q_0I\), so an EvalMod step is needed to reduce the message modulo \(q_0\) and recover \(m\). To enable this modular reduction on the encrypted message, the ciphertext is first transformed from the coefficient domain to the slot domain via CoeffsToSlots. 
After EvalMod, the result is mapped back to the coefficient domain using SlotsToCoeffs. Alternatively, one can place SlotsToCoeffs at the beginning. Here, we focus on the ModRaise-first variant, as the technical details are equivalent.</p> <p>The <em>EvalRound+</em> paradigm<sup id="fnref:3"><a href="#fn:3" class="footnote" rel="footnote" role="doc-noteref">3</a></sup> replaces EvalMod with a subroutine called EvalRound and branches into two parallel tracks. EvalRound<sup id="fnref:4"><a href="#fn:4" class="footnote" rel="footnote" role="doc-noteref">4</a></sup> is defined as \(\text{EvalRound} = \text{Id} - \text{EvalMod}\). That is, while EvalMod extracts the message \(m\) from \(m + q_0 I\), EvalRound extracts \(q_0 I\).</p> <p>The EvalRound+ procedure begins with ModRaise and then splits into two tracks:</p> <ul> <li>the first track applies the standard CoeffsToSlots algorithm, and</li> <li>the second track uses a low-precision CoeffsToSlots* followed by EvalRound.</li> </ul> <p>Then it subtracts the two resulting ciphertexts and concludes with SlotsToCoeffs.</p> <p>The two branches produce ciphertexts encrypting \(m + q_0 I\) and \(q_0 I\), respectively; subtracting them yields a ciphertext encrypting \(m\), which is then passed through SlotsToCoeffs to return to coefficient representation. Although the dual-track structure increases latency, it reduces modulus consumption, since only the second track dictates the modulus consumption and CoeffsToSlots* consumes much less modulus than the usual CoeffsToSlots algorithm.</p> <p>To enable high-precision bootstrapping, one can upgrade CoeffsToSlots, EvalRound, and SlotsToCoeffs to their high-precision versions. Each requires larger scale factors (which are used to scale the real/complex messages during encoding), increasing modulus consumption. By contrast, CoeffsToSlots* can remain low-precision and consume significantly less modulus compared to the other bootstrapping components. 
This marks a key distinction from traditional CKKS bootstrapping, where high-precision CoeffsToSlots contributes to overall modulus consumption.</p> <p>The main challenge is that high-precision bootstrapping still requires substantial modulus. In particular, EvalMod and EvalRound rely on polynomial approximations of \((x \bmod q)\) and \((x - (x \bmod q))\), respectively. Achieving higher accuracy requires higher-degree polynomials, whose multiplicative depth grows as \(\Theta(\log t)\), where \(t\) is the target bit-precision. Since scale factors grow as \(\Theta(t)\), overall modulus consumption typically scales as \(\Theta(t \log t)\).</p> <p><br/></p> <h2 id="integer-cleaning-from-discrete-ckks">Integer Cleaning from Discrete CKKS</h2> <p>Let’s take a closer look at the EvalRound procedure. When treating \(q_0\) as a new scale factor, EvalRound maps \((I+m/q_0)\) to \(I\). More precisely, an erroneous integer \(I + \varepsilon\) is mapped to a (much less) erroneous integer \(I + \varepsilon'\) where \(\varepsilon := m/q_0\) and \(\varepsilon' \ll \varepsilon\). As it <em>cleans</em> a noisy integer, we may call this functionality <strong>Integer Cleaning</strong>.</p> <p>To reduce modulus consumption in high-precision bootstrapping, we introduce a novel Integer Cleaning algorithm based on this idea. The process involves:</p> <ol> <li><strong>Digit Extraction</strong>: Decompose the noisy integer \(I+\varepsilon\) into base-\(\beta\) digits: For \(I = \sum_{i=0}^{\ell} I_i \beta^i,\) each noisy digit \(I_i + \varepsilon_i\) is extracted and stored separately. 
This can be done by either: <ul> <li>using direct polynomial approximation, mapping \(I\) into \(I_i \in [0, \beta)\), or</li> <li>mapping \(I\) into \(\exp(2i\pi I / \beta^\ell)\), the complex \(\beta^\ell\)-th roots of unity (as in CKKL24<sup id="fnref:5"><a href="#fn:5" class="footnote" rel="footnote" role="doc-noteref">5</a></sup>), and decomposing the digits via interpolation (as in BKSS24<sup id="fnref:6"><a href="#fn:6" class="footnote" rel="footnote" role="doc-noteref">6</a></sup>).</li> </ul> </li> <li><strong>Iterative Digit Cleaning</strong>: Apply low-degree polynomials iteratively to each noisy digit \(I_i + \varepsilon_i\) to refine its precision: <ul> <li>(\(\beta = 2\)) \(h_1(x) = 3x^2 - 2x^3\) from CKK20<sup id="fnref:7"><a href="#fn:7" class="footnote" rel="footnote" role="doc-noteref">7</a></sup>, or</li> <li>(\(\beta = 3\)) \(\frac{1}{3}(\bar{x}^2 + 4x - 2x^2\bar{x})\) from BKSS24<sup id="fnref:6:1"><a href="#fn:6" class="footnote" rel="footnote" role="doc-noteref">6</a></sup>.</li> </ul> <p>These polynomials quadratically clean the bits or trits (ternary digits), refining a \(t\)-bit precision to around \(2t\)-bit. In the end, this step returns the cleaned digits \(I_i + \varepsilon'_i\) with \(\varepsilon'_i \ll \varepsilon_i\). Note that in each iteration, the scale factor must be large enough to accommodate the precision of the cleaned digit, which roughly doubles after each iteration. This implies that early iterations (and Digit Extraction) can use much smaller scale factors (e.g., 25–35 bits) than the desired integer cleaning precision.</p> </li> <li><strong>Recombination</strong>: Combine the cleaned digits \(I_i + \varepsilon'_i\) to reconstruct the cleaned integer \(I + \varepsilon'\). This requires only integer multiplications and additions, with no extra modulus consumption.</li> </ol> <p>Suppose the input \(I + \varepsilon\) to the Integer Cleaning algorithm has \(t\)-bit precision (i.e., \(|\varepsilon| \leq 2^{-t}\)). 
Then, with \(\text{iter}\) iterations of Digit Cleaning, the algorithm outputs an integer with \(\Theta(2^{\text{iter}} \cdot t)\) bits of precision.</p> <p><br/></p> <h2 id="grafting">Grafting</h2> <p>We leverage Grafting<sup id="fnref:8"><a href="#fn:8" class="footnote" rel="footnote" role="doc-noteref">8</a></sup> to efficiently support heterogeneous (i.e., small) scale factors without incurring additional RNS moduli or performance loss. This allows our implementation to achieve modulus-efficient and high-performance bootstrapping. For further details on RNS-CKKS and Grafting, see <a href="https://ckks.org/blog/2025/grafting">this blogpost</a>.</p> <p><br/></p> <h2 id="putting-it-all-together">Putting It All Together</h2> <p>The new CKKS bootstrapping is built on the EvalRound+ paradigm with the Integer Cleaning strategy and employs the Grafting technique. We note that the Integer Cleaning parts can be further optimized using a so-called <strong>Thrifty</strong> approach detailed in the paper.</p> <div class="row mt-3"> <div class="col-sm-7 mt-3 mt-md-0 mx-auto d-block"> <figure> <picture> <source class="responsive-img-srcset" srcset="/assets/img/blog/2510_Hyeongmin/final_bar-480.webp 480w,/assets/img/blog/2510_Hyeongmin/final_bar-800.webp 800w,/assets/img/blog/2510_Hyeongmin/final_bar-1400.webp 1400w," type="image/webp" sizes="95vw"/> <img src="/assets/img/blog/2510_Hyeongmin/final_bar.jpeg" class="img-fluid rounded z-depth-1" width="100%" height="auto" data-zoomable="" loading="eager" onerror="this.onerror=null; $('.responsive-img-srcset').remove();"/> </picture> </figure> </div> </div> <p><br/> In our proof-of-concept implementation, we used a ring dimension of \(N=2^{16}\) with full-slot messages. With bit decomposition (\(\beta = 2\)), the direct approximation method for bit extraction, and three iterations of bit-cleaning, our bootstrapping achieved 81 bits of precision. 
We compared this with Meta-BTS, which requires four sequential bootstraps to reach similar accuracy. Our bootstrapping achieved a <strong>1.64× speedup</strong>, while still leaving 494 bits available for homomorphic computations.</p> <p>We note that the bootstrapping scales naturally with the desired precision, thanks to its iterative nature. By adjusting the number of digit-cleaning iterations, one can flexibly target either lower or higher bootstrapping precisions. For instance, adding one more iteration yields roughly 150 bits of precision still at \(N=2^{16}\), with only a modest increase in latency.</p> <p>In summary, the new bootstrapping method, combining EvalRound+, Integer Cleaning, and Grafting, enables high-precision CKKS bootstrapping that is modulus-efficient and high-performance, offering a practical solution for advanced homomorphic encryption applications.</p> <p><br/></p> <h2 id="references">References</h2> <div class="footnotes" role="doc-endnotes"> <ol> <li id="fn:1"> <p>Y. Lee, J. Lee, Y. Kim, Y. Kim, J. No, and H. Kang. <a href="https://ia.cr/2020/1549">“High-Precision Bootstrapping for Approximate Homomorphic Encryption by Error Variance Minimization.”</a> Eurocrypt 2022. <a href="#fnref:1" class="reversefootnote" role="doc-backlink">&#8617;</a></p> </li> <li id="fn:2"> <p>Y. Bae, J. H. Cheon, W. Cho, J. Kim, and T. Kim. <a href="https://ia.cr/2022/1167">“META-BTS: Bootstrapping Precision Beyond the Limit.”</a> ACM CCS 2022. <a href="#fnref:2" class="reversefootnote" role="doc-backlink">&#8617;</a></p> </li> <li id="fn:3"> <p>H. Sung, S. Seo, T. Kim, and C. Min. <a href="https://ia.cr/2024/1379">“EvalRound+ Bootstrapping and its Rigorous Analysis for CKKS Scheme.”</a> ePrint Archive 2024. <a href="#fnref:3" class="reversefootnote" role="doc-backlink">&#8617;</a></p> </li> <li id="fn:4"> <p>S. Kim, M. Park, J. Kim, T. Kim, and C. Min. <a href="https://ia.cr/2022/1256">“EvalRound Algorithm in CKKS Bootstrapping.”</a> Asiacrypt 2022. 
<a href="#fnref:4" class="reversefootnote" role="doc-backlink">&#8617;</a></p> </li> <li id="fn:5"> <p>H. Chung, H. Kim, Y. Kim, and Y. Lee. <a href="https://ia.cr/2024/274">“Amortized Large Look-up Table Evaluation with Multivariate Polynomials for Homomorphic Encryption.”</a> ePrint Archive 2024. <a href="#fnref:5" class="reversefootnote" role="doc-backlink">&#8617;</a></p> </li> <li id="fn:6"> <p>Y. Bae, J. Kim, D. Stehlé, and E. Suvanto. <a href="https://ia.cr/2024/1637">“Bootstrapping Small Integers With CKKS.”</a> Asiacrypt 2024. <a href="#fnref:6" class="reversefootnote" role="doc-backlink">&#8617;</a> <a href="#fnref:6:1" class="reversefootnote" role="doc-backlink">&#8617;<sup>2</sup></a></p> </li> <li id="fn:7"> <p>J. H. Cheon, D. Kim, and D. Kim. <a href="https://ia.cr/2019/1234">“Efficient Homomorphic Comparison Methods with Optimal Complexity.”</a> Asiacrypt 2020. <a href="#fnref:7" class="reversefootnote" role="doc-backlink">&#8617;</a></p> </li> <li id="fn:8"> <p>J. H. Cheon, H. Choe, M. Kang, J. Kim, S. Kim, J. Mono, and T. Noh. <a href="https://ia.cr/2024/1014">“Grafting: Decoupled Scale Factors and Modulus in RNS-CKKS.”</a> ACM CCS 2025. <a href="#fnref:8" class="reversefootnote" role="doc-backlink">&#8617;</a></p> </li> </ol> </div>]]></content><author><name>Hyeongmin Choe</name></author><summary type="html"><![CDATA[TL;DR: We introduce a new high-precision CKKS bootstrapping method. It leverages a novel Integer Cleaning strategy inspired by the Discrete CKKS technique and is implemented using the Grafting technique. 
We highlight its main building blocks and discuss its efficiency.]]></summary></entry><entry><title type="html">Convergent Evolution: Why Secure Homomorphic Encryption Will Resemble High-Performance GPU Computing</title><link href="https://ckks.org/blog/2025/convergent-evolution/" rel="alternate" type="text/html" title="Convergent Evolution: Why Secure Homomorphic Encryption Will Resemble High-Performance GPU Computing"/><published>2025-09-08T04:00:00+00:00</published><updated>2025-09-08T04:00:00+00:00</updated><id>https://ckks.org/blog/2025/convergent-evolution</id><content type="html" xml:base="https://ckks.org/blog/2025/convergent-evolution/"><![CDATA[<ul> <li>Written by <a href="https://www.linkedin.com/in/sunchul-jung-78331082/">Sunchul Jung</a> (CryptoLab)</li> </ul> <p><em>TL;DR: Fully Homomorphic Encryption (FHE) programming hits a fundamental Turing Barrier where secure computation forbids the dynamic branching that makes conventional software work, forcing it into a parallel-first paradigm surprisingly similar to the high-performance GPU model. This means the future of FHE isn’t a magic compiler, but a hybrid architecture where a trusted client orchestrates complex logic, while an untrusted server executes simple, branchless secure kernels on encrypted data across a well-defined offloading boundary. 
Ultimately, developers must stop trying to translate old optimization habits and start redefining problems from the ground up, because in the world of FHE, performance isn’t about pruning—it’s about parallelism.</em></p> <hr/> <p><br/></p> <h2 id="abstract">Abstract</h2> <p>Fully Homomorphic Encryption (FHE) and high-performance GPU computing have evolved under fundamentally different pressures—security and performance—yet I argue that <strong>their programming models are converging toward a branch-free, massively parallel paradigm</strong>.</p> <p>This convergence arises from FHE’s <strong>Turing Barrier</strong>: while FHE is <em>theoretically Turing-complete</em> at the gate level, <strong>supporting secret-dependent control flow securely is practically infeasible</strong> without incurring combinatorial overheads. Meanwhile, GPUs evolved under different constraints—avoiding <strong>warp divergence</strong>—but reached similar design instincts: <strong>uniform, parallel-first kernels</strong>.</p> <p>However, FHE introduces strictly stronger constraints than GPUs:</p> <ul> <li>Every kernel must be <strong>data-oblivious</strong>.</li> <li>Memory access patterns must remain uniform.</li> <li>The <strong>Host ↔ Device</strong> (Trusted-Host ↔ Secure-Kernel-Executor in the FHE context) <strong>offloading boundary</strong> must be formally defined.</li> </ul> <p>Using Quicksort and Approximate Nearest Neighbor (ANN) search as case studies, I show why pruning-based optimizations—highly effective in plaintext—become <strong>anti-secure</strong> (i.e., the act of optimization itself introduces security vulnerabilities contrary to the original intention) under encryption. 
Recent works, including the FHE compiler project <strong>HEIR</strong><sup id="fnref:1"><a href="#fn:1" class="footnote" rel="footnote" role="doc-noteref">1</a></sup>, programmable bootstrapping optimizations<sup id="fnref:2"><a href="#fn:2" class="footnote" rel="footnote" role="doc-noteref">2</a></sup>, and private inference frameworks<sup id="fnref:3"><a href="#fn:3" class="footnote" rel="footnote" role="doc-noteref">3</a></sup>, confirm these fundamental constraints, demonstrating incremental improvements within fixed sandboxes but falling short of general solutions.</p> <p>I conclude that the future of practical FHE lies in <strong>hybrid architectures</strong> where trusted clients orchestrate control flow and untrusted servers execute <strong>branchless, parallel secure kernels</strong>. GPU programming models provide valuable blueprints, but advancing secure computation demands an <strong>FHE-native programming model</strong>.</p> <hr/> <p><br/></p> <h2 id="table-of-contents">Table of Contents</h2> <ol> <li><a href="#1-introduction">Introduction</a></li> <li><a href="#2-background-fundamental-constraints">Background: Fundamental Constraints</a><br/> 2.1. <a href="#21-the-fhe-security-model-and-the-turing-barrier">The FHE Security Model and the Turing Barrier</a><br/> 2.2. <a href="#22-how-fhe-compilers-handle-control-flow">How FHE Compilers Handle Control Flow</a><br/> 2.3. <a href="#23-compilation-resource-explosion-warning">Compilation Resource Explosion Warning</a><br/> 2.4. <a href="#24-programmable-bootstrapping-and-its-limits">Programmable Bootstrapping and Its Limits</a><br/> 2.5. 
<a href="#25-memory-access-and-oram-lower-bounds">Memory Access and ORAM Lower Bounds</a></li> <li><a href="#3-convergent-evolution-fhe-and-gpus">Convergent Evolution: FHE and GPUs</a></li> <li><a href="#4-a-practical-architecture-the-hybrid-model">A Practical Architecture: The Hybrid Model</a></li> <li><a href="#5-case-studies-the-parallel-first-imperative">Case Studies: The Parallel-First Imperative</a></li> <li><a href="#6-comparative-analysis-of-recent-research">Comparative Analysis of Recent Research</a></li> <li><a href="#7-implications-for-fhe-systems">Implications for FHE Systems</a></li> <li><a href="#8-conclusion-the-path-to-fhe-native-development">Conclusion: The Path to FHE-Native Development</a></li> <li><a href="#9-references">References</a></li> </ol> <hr/> <p><br/></p> <h2 id="1-introduction">1. Introduction</h2> <p>Fully Homomorphic Encryption (FHE) enables arbitrary computation on encrypted data without exposing plaintexts<sup id="fnref:4"><a href="#fn:4" class="footnote" rel="footnote" role="doc-noteref">4</a></sup>. Meanwhile, GPUs have revolutionized computing by enabling massive parallelism, powering deep learning and high-performance computing<sup id="fnref:5"><a href="#fn:5" class="footnote" rel="footnote" role="doc-noteref">5</a></sup>.</p> <p>Despite different origins, these two paradigms now face <strong>structurally similar challenges</strong>. 
The need for a parallel-first model in FHE arises from a dual mandate:</p> <ul> <li><strong>Security:</strong> Translating secret-dependent branching from conventional programs introduces critical <strong>side-channel leakage</strong> risks, forcing a branchless, data-oblivious model.</li> <li><strong>Performance:</strong> FHE operations are often heavily <strong>memory-bound</strong>, and mitigating the I/O overhead caused by chaining multiple operations requires exploiting <strong>massive parallelism via fused, coarse-grained kernels</strong>.</li> </ul> <p>This blog explores why these dual pressures lead to a convergence with the GPU model, why traditional optimizations fail, and why the future depends on building a new ecosystem of FHE-native, fused kernels managed by a smart, hybrid architecture.</p> <p><br/></p> <h2 id="2-background-fundamental-constraints">2. Background: Fundamental Constraints</h2> <p><br/></p> <h3 id="21-the-fhe-security-model-and-the-turing-barrier">2.1 The FHE Security Model and the Turing Barrier</h3> <p>An FHE scheme evaluates $f(\mathrm{Enc}(x)) \rightarrow \mathrm{Enc}(f(x))$ without revealing $x$. 
However, secure execution forbids revealing <strong>control flow</strong> or <strong>memory access patterns</strong> derived from secret data.</p> <p>Dynamic branches and loops are problematic:</p> <ul> <li><code class="language-plaintext highlighter-rouge">if-else</code> decisions → leak branch outcomes.</li> <li>Secret-dependent loops → leak iteration counts.</li> </ul> <p>Thus, FHE programs must adopt <strong>data-oblivious execution</strong><sup id="fnref:6"><a href="#fn:6" class="footnote" rel="footnote" role="doc-noteref">6</a></sup>:</p> <ul> <li>Evaluate both paths of all branches.</li> <li>Fix loop bounds to public constants.</li> <li>Make memory accesses uniform.</li> </ul> <p>This reality is summarized as the <strong>Turing Barrier</strong>: FHE remains Turing-complete in theory at the gate level, but we use this term to describe the practical infeasibility of securely compiling general-purpose programs that rely on unrestricted control flow.</p> <p><br/></p> <h3 id="22-how-fhe-compilers-handle-control-flow">2.2 How FHE Compilers Handle Control Flow</h3> <p>Many developers assume FHE behaves like a traditional runtime: write an <code class="language-plaintext highlighter-rouge">if</code> statement, and only the chosen branch executes. <strong>But this would leak secrets</strong>—if the server observes which path was taken, it learns private data.</p> <p>In FHE, <strong>both branches must always execute</strong>. 
The compiler produces an arithmetic <code class="language-plaintext highlighter-rouge">select</code>: \(\mathrm{Enc}(r) = \mathrm{Enc}(\text{cond}) \cdot \mathrm{Enc}(r_a) + (1 - \mathrm{Enc}(\text{cond})) \cdot \mathrm{Enc}(r_b)\), where $\text{cond} \in \{0, 1\}$, so $r = r_a$ when the condition holds and $r = r_b$ otherwise.</p> <p><br/></p> <div class="row mt-3"> <div class="col-sm-7 mt-3 mt-md-0 mx-auto d-block"> <figure> <picture> <source class="responsive-img-srcset" srcset="/assets/img/blog/2509_Sunchul/blog1-480.webp 480w,/assets/img/blog/2509_Sunchul/blog1-800.webp 800w,/assets/img/blog/2509_Sunchul/blog1-1400.webp 1400w," type="image/webp" sizes="95vw"/> <img src="/assets/img/blog/2509_Sunchul/blog1.jpg" class="img-fluid rounded" width="100%" height="auto" data-zoomable="" loading="eager" onerror="this.onerror=null; $('.responsive-img-srcset').remove();"/> </picture> </figure> </div> </div> <div class="caption"> Figure 1. FHE Conditional Execution </div> <p><strong>Key takeaway:</strong> Under FHE, writing <code class="language-plaintext highlighter-rouge">if (encrypted(cond)) ...</code> in source code <strong>does not</strong> imply conditional execution; the compiler <strong>always executes both paths</strong> and combines results securely.</p> <p><br/></p> <h3 id="23-compilation-resource-explosion-warning">2.3 Compilation Resource Explosion Warning</h3> <p>The combinatorial blow-up caused by encrypted control flow affects not only runtime performance but also the <strong>compilation process itself</strong>. For every secret-dependent branch, the FHE-aware compiler must <strong>materialize all possible subroutines</strong> into encrypted circuits. In the worst case, $N$ nested conditions yield a circuit size of $O(2^N)$.</p> <p>Recent experiments with <a href="https://llvm.org/">LLVM</a>/MLIR-based FHE Intermediate Representations (IRs) such as HEIR<sup id="fnref:1:1"><a href="#fn:1" class="footnote" rel="footnote" role="doc-noteref">1</a></sup> highlight this challenge. 
While research explores sophisticated optimizations to avoid naive full unrolling (e.g., by hoisting common code from both branches), these techniques can only mitigate the overhead for shared computational paths. For genuinely divergent logic, the compiler must still materialize a circuit whose complexity grows exponentially with the number of secret-dependent branches.</p> <p>It is therefore an expected and observable outcome that beyond a certain branching complexity, this exponential growth will cause the compiler to exhaust system resources, leading to <strong>out-of-memory (OOM) errors</strong>—<em>before execution even begins</em>. This demonstrates that directly translating branch-heavy business logic is infeasible. <strong>Problem redefinition is not an optimization choice — it is a requirement for tractable FHE compilation</strong>.</p> <p><br/></p> <div class="row mt-3"> <div class="col-sm-7 mt-3 mt-md-0 mx-auto d-block"> <figure> <picture> <source class="responsive-img-srcset" srcset="/assets/img/blog/2509_Sunchul/blog2-480.webp 480w,/assets/img/blog/2509_Sunchul/blog2-800.webp 800w,/assets/img/blog/2509_Sunchul/blog2-1400.webp 1400w," type="image/webp" sizes="95vw"/> <img src="/assets/img/blog/2509_Sunchul/blog2.jpg" class="img-fluid rounded" width="100%" height="auto" data-zoomable="" loading="eager" onerror="this.onerror=null; $('.responsive-img-srcset').remove();"/> </picture> </figure> </div> </div> <div class="caption"> Figure 2. 
Branch Explosion at Compile Time </div> <p><br/></p> <h3 id="24-programmable-bootstrapping-and-its-limits">2.4 Programmable Bootstrapping and Its Limits</h3> <p>TFHE’s <strong>Programmable Bootstrapping (PBS)</strong><sup id="fnref:7"><a href="#fn:7" class="footnote" rel="footnote" role="doc-noteref">7</a></sup> and CKKS’s <strong>BB-BTS (Batch-bits Bootstrapping)</strong><sup id="fnref:8"><a href="#fn:8" class="footnote" rel="footnote" role="doc-noteref">8</a></sup> enable efficient Look-Up Table (LUT)-based evaluation of small non-linear functions. Recent works on optimizing these circuits<sup id="fnref:2:1"><a href="#fn:2" class="footnote" rel="footnote" role="doc-noteref">2</a></sup> further reduce cost by compressing LUT evaluation.</p> <p>However, PBS does <strong>not</strong> eliminate the combinatorial blow-up of complex business logic:</p> <ul> <li>Optimizations are <strong>pattern-limited</strong>, applicable to fixed, low-dimensional functions.</li> <li>Decision-tree inference using private inference frameworks like Piranha<sup id="fnref:3:1"><a href="#fn:3" class="footnote" rel="footnote" role="doc-noteref">3</a></sup> works only for <strong>small, fixed-depth trees</strong>; adaptive branching remains insecure.</li> </ul> <p>PBS provides a valuable tool but <strong>not a universal escape hatch</strong>.</p> <p><br/></p> <h3 id="25-memory-access-and-oram-lower-bounds">2.5 Memory Access and ORAM Lower Bounds</h3> <p>Just as control flow must be made data-oblivious, memory access patterns must also be protected. When computations require secret-dependent memory access, <strong>Oblivious RAM (ORAM)</strong> must be used to hide patterns<sup id="fnref:6:1"><a href="#fn:6" class="footnote" rel="footnote" role="doc-noteref">6</a></sup>. 
However:</p> <ul> <li>ORAM has a <strong>provable lower bound</strong>: $\Omega(\log n)$ per access, where <code class="language-plaintext highlighter-rouge">n</code> is the number of blocks in memory.</li> <li>Even optimal constructions like OptORAMa<sup id="fnref:9"><a href="#fn:9" class="footnote" rel="footnote" role="doc-noteref">9</a></sup> cannot break this limit.</li> <li>For large-scale workloads, ORAM costs dominate if memory access depends on secrets.</li> </ul> <p><strong>Implication:</strong> Secret-driven memory access cannot be made “cheap” under FHE. Algorithms must be <strong>redesigned</strong> to avoid adaptive memory patterns.</p> <p><br/></p> <h2 id="3-convergent-evolution-fhe-and-gpus">3. Convergent Evolution: FHE and GPUs</h2> <p>FHE and GPU computing converge structurally, driven by the dual pressures of security and performance.</p> <table> <thead> <tr> <th>Aspect</th> <th>GPU Offloading (<a href="https://developer.nvidia.com/cuda-toolkit">CUDA</a>/<a href="https://www.amd.com/en/products/software/rocm.html">ROCm</a>/<a href="https://www.khronos.org/sycl/">SYCL</a>)</th> <th>FHE Hybrid Model (Proposed)</th> </tr> </thead> <tbody> <tr> <td><strong>Primary Goal</strong></td> <td>Maximize throughput</td> <td>Preserve confidentiality</td> </tr> <tr> <td><strong>Execution Style</strong></td> <td>SIMT (Single Instruction, Multiple Threads), branch-minimized</td> <td>Branchless, data-oblivious</td> </tr> <tr> <td><strong>Offloading Model</strong></td> <td>Implicit, library-driven</td> <td>Explicit, security-driven</td> </tr> <tr> <td><strong>Boundary Definition</strong></td> <td>Kernel APIs, runtime libraries</td> <td>Formally defined security surface</td> </tr> <tr> <td><strong>Workload Assumption</strong></td> <td>Dense, GPU-friendly ops</td> <td>Arbitrary → refactoring needed</td> </tr> <tr> <td><strong>Memory Model</strong></td> <td>Observable, hierarchical</td> <td>Must be oblivious</td> </tr> <tr> <td><strong>Programming Model</strong></td> 
<td>Emergent, pragmatic</td> <td>Intentional, security-first</td> </tr> </tbody> </table> <p><br/> FHE’s constraints are stricter, mandating a more intentional and formally defined programming model than the emergent, pragmatic model of GPUs. We must <strong>explicitly define host-device offloading</strong> and <strong>design kernels intentionally</strong>.</p> <p><br/></p> <div class="row mt-3"> <div class="col-sm-8 mt-3 mt-md-0 mx-auto d-block"> <figure> <picture> <source class="responsive-img-srcset" srcset="/assets/img/blog/2509_Sunchul/blog3-480.webp 480w,/assets/img/blog/2509_Sunchul/blog3-800.webp 800w,/assets/img/blog/2509_Sunchul/blog3-1400.webp 1400w," type="image/webp" sizes="95vw"/> <img src="/assets/img/blog/2509_Sunchul/blog3.jpg" class="img-fluid rounded" width="100%" height="auto" data-zoomable="" loading="eager" onerror="this.onerror=null; $('.responsive-img-srcset').remove();"/> </picture> </figure> </div> </div> <div class="caption"> Figure 3. Conceptual Overlap of FHE and GPU Constraints </div> <p><br/></p> <h2 id="4-a-practical-architecture-the-hybrid-model">4. A Practical Architecture: The Hybrid Model</h2> <p><strong>Terminology Note</strong>: I borrow the <strong>host-device</strong> terminology from GPU programming models for intuitive consistency. In the FHE context:</p> <ul> <li><strong>Host</strong> → <strong>Trusted-Host</strong>: Orchestrates application logic, manages secret-dependent control flow, and launches secure kernels.</li> <li><strong>Device</strong> → <strong>Secure-Kernel-Executor</strong>: Executes branchless, data-oblivious kernels on encrypted inputs in an untrusted environment.</li> </ul> <p><br/></p> <h3 id="41-note-on-gpu-programming-models">4.1 Note on GPU Programming Models</h3> <p>Although I borrow GPU programming terminology, <strong>GPU programming models themselves are not fully standardized</strong>. 
Offloading is often API-driven via libraries like <a href="https://developer.nvidia.com/cublas">cuBLAS</a> or <a href="https://developer.nvidia.com/ko-kr/tensorrt">TensorRT</a>, rather than expressed through formal control-flow abstractions. This implies an <strong>even greater challenge for FHE</strong>: we must <strong>explicitly define the offloading boundary</strong>, because FHE imposes <strong>far stricter constraints</strong> than GPUs.</p> <p><br/></p> <div class="row mt-3"> <div class="col-sm-8 mt-3 mt-md-0 mx-auto d-block"> <figure> <picture> <source class="responsive-img-srcset" srcset="/assets/img/blog/2509_Sunchul/blog4-480.webp 480w,/assets/img/blog/2509_Sunchul/blog4-800.webp 800w,/assets/img/blog/2509_Sunchul/blog4-1400.webp 1400w," type="image/webp" sizes="95vw"/> <img src="/assets/img/blog/2509_Sunchul/blog4.jpg" class="img-fluid rounded" width="100%" height="auto" data-zoomable="" loading="eager" onerror="this.onerror=null; $('.responsive-img-srcset').remove();"/> </picture> </figure> </div> </div> <div class="caption"> Figure 4. Offloading Model Comparison </div> <p><br/></p> <h2 id="5-case-studies-the-parallel-first-imperative">5. Case Studies: The Parallel-First Imperative</h2> <p>A common misconception is that FHE’s Turing completeness at the gate level means pruning-based optimizations can be ported directly. 
However, pruning is <strong>anti-secure</strong>.</p> <p><br/></p> <h3 id="51-sorting">5.1 Sorting</h3> <ul> <li><strong>Quicksort (FHE-unfriendly):</strong> Adaptive pivots and partitions leak structural information.</li> <li><strong>Sorting Networks (FHE-Friendly):</strong> Fixed compare-swap sequences (e.g., Bitonic Sort) are uniform, oblivious, and parallelizable.</li> </ul> <p><br/></p> <div class="row mt-3"> <div class="col-sm-8 mt-3 mt-md-0 mx-auto d-block"> <figure> <picture> <source class="responsive-img-srcset" srcset="/assets/img/blog/2509_Sunchul/blog5-480.webp 480w,/assets/img/blog/2509_Sunchul/blog5-800.webp 800w,/assets/img/blog/2509_Sunchul/blog5-1400.webp 1400w," type="image/webp" sizes="95vw"/> <img src="/assets/img/blog/2509_Sunchul/blog5.jpg" class="img-fluid rounded" width="100%" height="auto" data-zoomable="" loading="eager" onerror="this.onerror=null; $('.responsive-img-srcset').remove();"/> </picture> </figure> </div> </div> <div class="caption"> Figure 5. Algorithmic Paradigms: Adaptive Sorting vs. Data-Oblivious Sorting </div> <p><br/></p> <h3 id="52-approximate-nearest-neighbor-ann">5.2 Approximate Nearest Neighbor (ANN)</h3> <p>In this context, <code class="language-plaintext highlighter-rouge">N</code> represents the total number of vectors in the dataset, <code class="language-plaintext highlighter-rouge">d</code> is the dimension of each vector, <code class="language-plaintext highlighter-rouge">k</code> is the number of clusters in the IVF index, and <code class="language-plaintext highlighter-rouge">n_probe</code> is the number of clusters to probe during a search.</p> <ul> <li><strong>IVF-FLAT (FHE-unfriendly):</strong> This popular method works by first selecting a few (<code class="language-plaintext highlighter-rouge">n_probe</code>) cluster centroids nearest to the query, then searching only within those clusters. 
This adaptive probing leaks which clusters the query is close to.</li> <li><strong>HNSW (FHE-unfriendly):</strong> Adaptive graph traversal leaks query-specific paths<sup id="fnref:10"><a href="#fn:10" class="footnote" rel="footnote" role="doc-noteref">10</a></sup>.</li> <li><strong>Brute-Force Scan (FHE-Compatible):</strong> Evaluate distances for <strong>all</strong> vectors uniformly.</li> </ul> <table> <thead> <tr> <th>Algorithm</th> <th>Plain Complexity</th> <th>FHE-Compatible Complexity</th> <th>Overhead Source</th> </tr> </thead> <tbody> <tr> <td>IVF-FLAT</td> <td>$O(k\cdot d + \frac{n_{\text{probe}}}{k}\cdot N\cdot d)$</td> <td>$O(k\cdot d + N\cdot d)$</td> <td>Removing adaptive probing</td> </tr> <tr> <td>HNSW</td> <td>$O(d \cdot \log N)$</td> <td>$O(N\cdot d)$</td> <td>Eliminating path pruning</td> </tr> </tbody> </table> <p><br/></p> <div class="row mt-3"> <div class="col-sm-8 mt-3 mt-md-0 mx-auto d-block"> <figure> <picture> <source class="responsive-img-srcset" srcset="/assets/img/blog/2509_Sunchul/blog6-480.webp 480w,/assets/img/blog/2509_Sunchul/blog6-800.webp 800w,/assets/img/blog/2509_Sunchul/blog6-1400.webp 1400w," type="image/webp" sizes="95vw"/> <img src="/assets/img/blog/2509_Sunchul/blog6.jpg" class="img-fluid rounded" width="100%" height="auto" data-zoomable="" loading="eager" onerror="this.onerror=null; $('.responsive-img-srcset').remove();"/> </picture> </figure> </div> </div> <div class="caption"> Figure 6. Pruning vs. Parallelism </div> <p><strong>Key Insight:</strong> In FHE, <strong>brute force is the baseline</strong>, not the fallback. Performance gains come from <strong>parallelizing fixed workloads</strong>, not pruning.</p> <p><br/></p> <h2 id="6-comparative-analysis-of-recent-research">6. 
Comparative Analysis of Recent Research</h2> <table> <thead> <tr> <th><strong>Work</strong></th> <th><strong>Core Idea / Contribution</strong></th> <th><strong>Scope / Applicability</strong></th> <th><strong>Limitations &amp; Implications</strong></th> </tr> </thead> <tbody> <tr> <td><strong>HEIR</strong><sup id="fnref:1:2"><a href="#fn:1" class="footnote" rel="footnote" role="doc-noteref">1</a></sup></td> <td>MLIR-based compiler for FHE</td> <td>Expressive across CKKS, BFV, TFHE backends</td> <td>Optimizations can reduce overhead for shared paths, but cannot solve the core combinatorial explosion for divergent logic. Confirms <strong>Turing Barrier</strong></td> </tr> <tr> <td><strong>PBS Optimization</strong><sup id="fnref:2:2"><a href="#fn:2" class="footnote" rel="footnote" role="doc-noteref">2</a></sup></td> <td>PBS circuit optimization for TFHE</td> <td>Efficient for fixed LUT-based non-linearities</td> <td>Limited to <strong>local patterns</strong>; cannot eliminate control-flow explosion</td> </tr> <tr> <td><strong>Private Inference</strong><sup id="fnref:3:2"><a href="#fn:3" class="footnote" rel="footnote" role="doc-noteref">3</a></sup></td> <td>GPU-friendly framework for private inference</td> <td>Works for <strong>fixed-depth, small models</strong></td> <td>Not scalable to adaptive branching; dynamic paths remain insecure</td> </tr> <tr> <td><strong>OptORAMa</strong><sup id="fnref:9:1"><a href="#fn:9" class="footnote" rel="footnote" role="doc-noteref">9</a></sup></td> <td>Near-optimal oblivious memory</td> <td>Provably tight asymptotics</td> <td>$\Omega(\log n)$ overhead unavoidable; secure random access is inherently expensive</td> </tr> </tbody> </table> <p><br/> Recent progress expands FHE’s reach but validates the central thesis: <strong>general-purpose, efficient, secure FHE requires problem redefinition, not direct translation.</strong></p> <p><br/></p> <h2 id="7-implications-for-fhe-systems">7. 
Implications for FHE Systems</h2> <ul> <li><strong>FHE-aware compilers</strong> must act as <strong>policy enforcers</strong>, not magicians: <ul> <li>Convert branches into predicated arithmetic.</li> <li>Enforce loop unrolling to constant bounds.</li> <li>Optimize SIMD-style ciphertext packing.</li> </ul> </li> <li><strong>Hardware accelerators</strong> for FHE will <strong>increasingly resemble GPUs</strong>: <ul> <li>High-throughput execution cores.</li> <li>On-die NTT engines.</li> <li>High-bandwidth memory for ciphertext batching.</li> </ul> </li> <li><strong>Design implication</strong>: The <strong>parallel-first paradigm</strong> isn’t just optimal — it’s necessary for secure performance.</li> </ul> <p><br/></p> <h2 id="8-conclusion-the-path-to-fhe-native-development">8. Conclusion: The Path to FHE-Native Development</h2> <p>The theoretical Turing completeness of Fully Homomorphic Encryption (FHE) may suggest that any software can be privatized with a magical compiler. In reality, secure computation faces a fundamental <strong>Turing Barrier</strong>: the dynamic, branching-based optimizations that make conventional software efficient become the very conduits that leak secrets under encryption. The future of FHE, therefore, depends not on translating legacy code, but on redefining problems from first principles.</p> <p>The most pressing challenge is FHE’s dominant performance bottleneck: memory I/O. Optimizing individual operations in isolation is insufficient; for any non-trivial business logic, the cumulative overhead of sequential IR-level kernels quickly makes applications impractical. 
Inspired by high-performance GPU computing, <strong>the solution is to move from fine-grained primitives toward a hierarchy of fused, coarse-grained kernels—reducing data movement, exploiting massive parallelism, and unlocking practical performance</strong>.</p> <p>Building these kernels also solves the ecosystem’s chicken-and-egg problem: useful, coarse-grained kernels must come first to create immediate value, attract developers, and enable sustainable growth. <strong>A lightweight orchestration framework can manage this efficiently, where the trusted host performs simple runtime scheduling—dispatching fused kernels when available and falling back to basic ones otherwise</strong>. Since scheduling depends only on the public sequence of API calls, it remains inherently secure.</p> <p>Beyond kernels, the next step is to formalize the offloading boundary between the trusted host and the secure executor. Unlike GPU computing, where offloading is pragmatic and API-driven, <strong>FHE’s boundary is a security contract that must be explicit and rigorously defined</strong>. Establishing this foundation transforms FHE development from a set of clever hacks into a true engineering discipline.</p> <p>Ultimately, this leads to a <strong>truly FHE-native programming model</strong>. Compilers will evolve from magical translators into <strong>policy enforcers</strong>, guiding developers toward parallel-first problem formulations. By combining optimized fused kernels with explicit host-device boundaries, we can unlock scalable, secure, and high-performance computation—turning FHE from a theoretical promise into a practical reality.</p> <p>Looking further ahead, inspiration can be drawn from <strong>quantum computing</strong>, where developers also work within strict constraints, composing algorithms from a limited set of gates while carefully managing entanglement. 
Similarly, <strong>tomorrow’s FHE developers will think less like conventional software engineers and more like quantum algorithm designers, crafting solutions</strong> in a fundamentally constrained, parallel-first universe.</p> <p><br/></p> <h2 style="text-align:center; font-size:2em;">Stop translating. Start redefining.</h2> <p><br/></p> <hr/> <p><br/></p> <h2 id="9-references">9. References</h2> <div class="footnotes" role="doc-endnotes"> <ol> <li id="fn:1"> <p>Asra Ali, Jaeho Choi, Bryant Gipson, Shruthi Gorantala, Jeremy Kun, Wouter Legiest, Lawrence Lim, Alexander Viand, Meron Zerihun Demissie &amp; Hongren Zheng, <strong>“HEIR: A Universal Compiler for Homomorphic Encryption”</strong>, arXiv preprint arXiv:2508.11095, 2025. <a href="#fnref:1" class="reversefootnote" role="doc-backlink">&#8617;</a> <a href="#fnref:1:1" class="reversefootnote" role="doc-backlink">&#8617;<sup>2</sup></a> <a href="#fnref:1:2" class="reversefootnote" role="doc-backlink">&#8617;<sup>3</sup></a></p> </li> <li id="fn:2"> <p>Ilaria Chillotti, Damien Ligier, Jean-Baptiste Orfila &amp; Samuel Tap, <strong>“Improved Programmable Bootstrapping with Larger Precision and Efficient Arithmetic Circuits for TFHE”</strong>, ASIACRYPT 2021. <a href="#fnref:2" class="reversefootnote" role="doc-backlink">&#8617;</a> <a href="#fnref:2:1" class="reversefootnote" role="doc-backlink">&#8617;<sup>2</sup></a> <a href="#fnref:2:2" class="reversefootnote" role="doc-backlink">&#8617;<sup>3</sup></a></p> </li> <li id="fn:3"> <p>Jean-Luc Watson, Sameer Wagh &amp; Raluca Ada Popa, <strong>“Piranha: A GPU Platform for Secure Computation”</strong>, USENIX Security 2022. 
<a href="#fnref:3" class="reversefootnote" role="doc-backlink">&#8617;</a> <a href="#fnref:3:1" class="reversefootnote" role="doc-backlink">&#8617;<sup>2</sup></a> <a href="#fnref:3:2" class="reversefootnote" role="doc-backlink">&#8617;<sup>3</sup></a></p> </li> <li id="fn:4"> <p>Craig Gentry, <strong>“A Fully Homomorphic Encryption Scheme”</strong>, STOC ’09. <a href="#fnref:4" class="reversefootnote" role="doc-backlink">&#8617;</a></p> </li> <li id="fn:5"> <p>John D. Owens, Mike Houston, David Luebke, Simon Green, John E. Stone &amp; James C. Phillips, <strong>“GPU Computing”</strong>, <em>Proceedings of the IEEE</em>, vol. 96, no. 5, 2008. <a href="#fnref:5" class="reversefootnote" role="doc-backlink">&#8617;</a></p> </li> <li id="fn:6"> <p>Oded Goldreich &amp; Rafail Ostrovsky, <strong>“Software Protection and Simulation on Oblivious RAMs”</strong>, <em>Journal of the ACM</em>, vol. 43, no. 3, 1996. <a href="#fnref:6" class="reversefootnote" role="doc-backlink">&#8617;</a> <a href="#fnref:6:1" class="reversefootnote" role="doc-backlink">&#8617;<sup>2</sup></a></p> </li> <li id="fn:7"> <p>Ilaria Chillotti, Nicolas Gama, Mariya Georgieva &amp; Malika Izabachène, <strong>“TFHE: Fast Fully Homomorphic Encryption over the Torus”</strong>, <em>Journal of Cryptology</em>, 33(1), 2020. <a href="#fnref:7" class="reversefootnote" role="doc-backlink">&#8617;</a></p> </li> <li id="fn:8"> <p>Youngjin Bae, Jaehyung Kim, Damien Stehlé &amp; Elias Suvanto, <strong>“Bootstrapping Small Integers With CKKS”</strong>, ASIACRYPT (1) 2024, pp. 330-360. <a href="#fnref:8" class="reversefootnote" role="doc-backlink">&#8617;</a></p> </li> <li id="fn:9"> <p>Gilad Asharov, Ilan Komargodski, Wei-Kai Lin, Kartik Nayak, Enoch Peserico &amp; Elaine Shi, <strong>“OptORAMa: Optimal Oblivious RAM”</strong>, EUROCRYPT 2020. 
<a href="#fnref:9" class="reversefootnote" role="doc-backlink">&#8617;</a> <a href="#fnref:9:1" class="reversefootnote" role="doc-backlink">&#8617;<sup>2</sup></a></p> </li> <li id="fn:10"> <p>Yu A. Malkov &amp; D. A. Yashunin, <strong>“Efficient and Robust Approximate Nearest Neighbor Search Using HNSW”</strong>, <em>IEEE TPAMI</em>, 2020. <a href="#fnref:10" class="reversefootnote" role="doc-backlink">&#8617;</a></p> </li> </ol> </div>]]></content><author><name>Sunchul Jung</name></author><summary type="html"><![CDATA[TL;DR: Fully Homomorphic Encryption (FHE) programming hits a fundamental Turing Barrier where secure computation forbids the dynamic branching that makes conventional software work, forcing it into a parallel-first paradigm surprisingly similar to the high-performance GPU model. This means the future of FHE isn't a magic compiler, but a hybrid architecture where a trusted client orchestrates complex logic, while an untrusted server executes simple, branchless secure kernels on encrypted data across a well-defined offloading boundary. 
Ultimately, developers must stop trying to translate old optimization habits and start redefining problems from the ground up, because in the world of FHE, performance isn't about pruning—it's about parallelism.]]></summary></entry><entry><title type="html">NeuJeans: Fast Private CNN Inference by Fusing Convolutions and Bootstrapping in FHE</title><link href="https://ckks.org/blog/2025/NueJeans/" rel="alternate" type="text/html" title="NeuJeans: Fast Private CNN Inference by Fusing Convolutions and Bootstrapping in FHE"/><published>2025-09-01T04:00:00+00:00</published><updated>2025-09-01T04:00:00+00:00</updated><id>https://ckks.org/blog/2025/NueJeans</id><content type="html" xml:base="https://ckks.org/blog/2025/NueJeans/"><![CDATA[<ul> <li>Written by Jaiyoung Park (Seoul National University)</li> <li>Based on <a href="https://arxiv.org/pdf/2312.04356">https://arxiv.org/pdf/2312.04356</a> (CCS 2024)</li> </ul> <p><em>TL;DR: NeuJeans introduces a new “Coefficients-in-Slot” (CinS) encoding for CKKS. It rethinks how convolutions are laid out and fuses them with bootstrapping, cutting latency on big models like ResNet running over ImageNet.</em></p> <hr/> <p>CKKS relies on the Ring Learning With Errors (RLWE) problem to encrypt messages into a single ciphertext. This ring structure is key to CKKS’s efficiency, as a single homomorphic operation can act on all encrypted values in parallel.</p> <p>However, the same structure complicates FHE circuit design. Linear algebra in the RLWE format often forces interactions <em>within</em> a ciphertext, which can dominate runtime by requiring many costly rotations<sup id="fnref:1"><a href="#fn:1" class="footnote" rel="footnote" role="doc-noteref">1</a></sup>. Consequently, an FHE program’s profile can differ significantly from its plaintext counterpart. 
In deep circuits such as neural networks, bootstrapping further amplifies this challenge.</p> <h2 id="encodings-in-ckks">Encodings in CKKS</h2> <p>In CKKS, the data arrangement within a ciphertext’s underlying polynomial is called its encoding. Different encodings are optimized for different types of computations. Choosing the correct ones and switching between them during an algorithm is crucial for overall performance. The two primary encodings are <strong>slot encoding</strong> and <strong>coefficient encoding</strong>. Since CKKS bootstrapping naturally involves transforms between these states, a well-designed application can perform computations in whichever encoding is most efficient at the moment.</p> <p>In <strong>slot encoding</strong>, a ciphertext holds a vector of complex numbers in distinct “slots”, and all arithmetic operations are performed element-wise. This structure is ideal for operations that can be parallelized easily, such as applying activation functions to every element in a vector or multiplying a diagonalized matrix with a vector. However, data aggregation across slots relies on rotations, which are computationally expensive.</p> <p>Conversely, in <strong>coefficient encoding</strong>, message values are placed directly into the coefficients of the ciphertext polynomial. A multiplication in this layout computes a <strong>negacyclic</strong> <strong>convolution</strong>. Formally, for vectors $\mathbf{x}, \mathbf{y} \in \mathbb{R}^\ell$, their product is</p> \[(\mathbf{x} \circledast_\ell \mathbf{y})_k = \sum_{i=0}^{k} x_iy_{k-i}-\sum_{i=k+1}^{\ell-1} x_iy_{\ell+k-i}, \quad 0 \le k &lt; \ell.\] <p>This is equivalent to a cyclic convolution with a “wrap-around and sign flip”, reflecting the polynomial modulus $X^\ell + 1$. Such convolutions align well with the operations fundamental to neural networks and signal processing. 
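</p> <p>As a quick sanity check of the formula above, consider a toy case with $\ell = 2$; the vectors $\mathbf{x} = (1, 2)$ and $\mathbf{y} = (3, 4)$ are illustrative values chosen here, not taken from the paper. Expanding the two sums gives:</p> \[(\mathbf{x} \circledast_2 \mathbf{y})_0 = x_0 y_0 - x_1 y_1 = 3 - 8 = -5, \qquad (\mathbf{x} \circledast_2 \mathbf{y})_1 = x_0 y_1 + x_1 y_0 = 4 + 6 = 10.\] <p>The same result follows from polynomial multiplication modulo $X^2 + 1$: $(1 + 2X)(3 + 4X) = 3 + 10X + 8X^2$, and substituting $X^2 = -1$ yields $-5 + 10X$. The negative term is the wrapped-around product picking up its sign flip.</p> <p>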
However, the RLWE ring degree $\ell$ (commonly $2^{14}$ or $2^{15}$ for 128-bit security) is much larger than the number of pixels in a typical image (e.g., $32 \times 32$ or $16 \times 16$). As a result, a single ciphertext implicitly encodes tens or even hundreds of images. This mismatch complicates efficient implementation, since it forces convolutions to occur across all channels simultaneously.</p> <p>CinS, or Coefficients-in-Slot encoding, is a generalized structure that combines the properties of both slot and coefficient encoding. If the total message vector has size $\ell=m\times n$, CinS encoding treats the vector as $n$ independent sub-vectors (slots), each of length $m$. Operations are then <strong>element-wise</strong> across the $n$ sub-vectors but <strong>convolutional</strong> within each sub-vector of $m$ elements. This hybrid approach lets $n$ convolutions of length $m$ run in parallel inside one ciphertext. When $m=1$, CinS reduces to standard slot encoding, and when $n=1$, it becomes standard coefficient encoding.</p> <h2 id="encoding-transformation">Encoding Transformation</h2> <p>Switching between these encodings is essential for building complex applications. The <strong>Slot-to-Coefficient (S2C) transform</strong> converts a ciphertext from slot encoding to coefficient encoding, while its inverse, Coefficient-to-Slot (C2S), does the reverse. Fundamentally, the S2C transform is a <strong>Discrete Fourier Transform (DFT)</strong>. 
To perform it efficiently, the large \(\ell \times \ell\) DFT matrix is decomposed into a product of sparse matrices using the <strong>Cooley–Tukey Fast Fourier Transform (FFT)</strong> algorithm<sup id="fnref:2"><a href="#fn:2" class="footnote" rel="footnote" role="doc-noteref">2</a></sup>.</p> <p>The decomposition is applied to the <strong>bit-reversed DFT matrix</strong>, \(\mathcal{T}_{\ell}\), which simplifies the FFT structure and is defined recursively:</p> \[\mathcal{T}_\ell = \begin{bmatrix} I_{\ell/2} &amp; D_{\ell/2} \\ I_{\ell/2} &amp; -D_{\ell/2} \end{bmatrix} \cdot \begin{bmatrix} \mathcal{T}_{\ell/2} &amp; \mathbf{0} \\ \mathbf{0} &amp; \mathcal{T}_{\ell/2} \end{bmatrix}\] <p>Here, $I_{\ell/2}$ is the identity matrix, \(\mathcal{T}_{\ell/2}\) is the smaller bit-reversed DFT matrix, and $D_{\ell/2}$ is a diagonal matrix of complex “twiddle factors.” Applying this recursion $\log_2 \ell$ times decomposes $\mathcal{T}_\ell$ into a product of highly sparse, block-diagonal matrices.</p> \[\mathcal{T}_\ell = S_{\ell/2}^{(\ell)} \cdot S_{\ell/4}^{(\ell)} \cdots S_1^{(\ell)}\] <p>Each matrix $S_k^{(\ell)}$ in this product is a sparse, block-diagonal matrix whose structure is derived from the butterfly operations at each stage of the FFT algorithm.</p> <h2 id="neujeans-cins-encoding-and-fused-operations">NeuJeans: CinS Encoding and Fused Operations</h2> <p>NeuJeans defines CinS encoding using a partial S2C transform. Instead of applying the full DFT, it applies only the first $\log_2{m}$ stages, leaving the ciphertext in an intermediate state between slot and coefficient encoding. Formally, for a message vector $\mathbf{x}$ and slot encoding $s(\mathbf{x})$, CinS is:</p> \[\phi(\mathbf{x};m,n) = S_{m/2}^{(\ell)} \cdots S_{1}^{(\ell)}(s(\mathbf{x}))\] <p>The strength of this encoding lies in its multiplication. 
The partial DFT places each sub-vector into coefficient form, so multiplying two CinS ciphertexts directly performs the desired operation: the negacyclic convolution of corresponding sub-vectors. For two message vectors $\mathbf{x}, \mathbf{y}$, we have:</p> \[\phi(\mathbf{x};m,n) \odot \phi(\mathbf{y};m,n) = \phi(\mathbf{z};m,n)\] <p>where each sub-vector of the output, $\mathbf{z}_j$, is the negacyclic convolution of the input sub-vectors:</p> \[\mathbf{z}_j = \mathbf{x}_j \circledast_m \mathbf{y}_j \quad \text{for } j = 0, \dots, n-1\] <p>This multiplication property allows NeuJeans to map the many 2D convolutions of a CNN layer into a single homomorphic multiplication.</p> <div class="row mt-3"> <div class="col-sm-9 mt-3 mt-md-0 mx-auto d-block"> <figure> <picture> <source class="responsive-img-srcset" srcset="/assets/img/blog/2509_Jaiyoung/conv2d-to-conv-480.webp 480w,/assets/img/blog/2509_Jaiyoung/conv2d-to-conv-800.webp 800w,/assets/img/blog/2509_Jaiyoung/conv2d-to-conv-1400.webp 1400w," type="image/webp" sizes="95vw"/> <img src="/assets/img/blog/2509_Jaiyoung/conv2d-to-conv.png" class="img-fluid rounded" width="100%" height="auto" data-zoomable="" loading="eager" onerror="this.onerror=null; $('.responsive-img-srcset').remove();"/> </picture> </figure> </div> </div> <p>The figure above illustrates this mapping: the image and kernel are zero-padded and flattened, with $I_n$ and $K_{m,n}$ converted into the vectors $i_n$ and $k_{m,n}$. As shown, the negacyclic convolution $o_{m,n}=i_n \circledast k_{m,n}$ is equivalent to the 2D convolution result \(O_{m,n}=I_n*_{2D}K_{m,n}\). The extra terms ($x_1$ to $x_5$) that arise from zero-padding are later removed using a mask vector.</p> <h3 id="fusing-with-bootstrapping">Fusing with Bootstrapping</h3> <p>NeuJeans goes further by <strong>fusing</strong> the convolution with bootstrapping. 
Bootstrapping already performs a full S2C transform; NeuJeans inserts the convolution midway, creating a sequence like: <code class="language-plaintext highlighter-rouge">[Part 1 of S2C] -&gt; Convolution -&gt; [Part 2 of S2C]</code></p> <p>Since both the convolution and the second half of S2C are sparse diagonal matrices, the server can precompute their product offline. At runtime, this reduces to: <code class="language-plaintext highlighter-rouge">[Part 1 of S2C] -&gt; [Fused convolution + Part 2 of S2C]</code></p> <p>This fusion removes redundant steps, streamlining the bootstrapping process.</p> <h2 id="application-to-large-scale-cnn-inference">Application to Large-Scale CNN Inference</h2> <p>By combining CinS encoding with fused bootstrapping, NeuJeans enables faster 2D convolutions and a leaner S2C transform. We evaluated the approach on <strong>ResNet-18/50</strong> and <strong>MobileNet-V2</strong> using the HEaaN library. ResNet-18 and ResNet-50 are built primarily from standard convolution layers, whereas MobileNet-V2 adopts depthwise separable convolutions for efficiency.</p> <p>Across all three models, NeuJeans consistently outperforms prior methods<sup id="fnref:3"><a href="#fn:3" class="footnote" rel="footnote" role="doc-noteref">3</a></sup> <sup id="fnref:4"><a href="#fn:4" class="footnote" rel="footnote" role="doc-noteref">4</a></sup> based on slot and coefficient encodings, achieving</p> <ul> <li>7.7$\times$-30.3$\times$ speedup in convolution layers</li> <li>1.47$\times$-1.84$\times$ speedup in total inference time</li> </ul> <p>Overall, NeuJeans demonstrates how a small shift in encoding design (using CinS and fusing operations into bootstrapping) can deliver meaningful efficiency gains for encrypted CNN inference. These results underscore encoding design as an effective path for performance optimization. 
Looking ahead, we aim to extend this approach to emerging architectures, including diffusion models and graph neural networks.</p> <div class="footnotes" role="doc-endnotes"> <ol> <li id="fn:1"> <p>J. Park. “<a href="https://ckks.org/blog/2025/ccmm/">Ciphertext-Ciphertext Matrix Multiplication: Fast for Large Matrices</a>.” CKKS.org 2025. <a href="#fnref:1" class="reversefootnote" role="doc-backlink">&#8617;</a></p> </li> <li id="fn:2"> <p>K. Han, M. Hhan, J. H. Cheon. “Improved Homomorphic Discrete Fourier Transforms and FHE Bootstrapping.” IEEE Access 2019. <a href="#fnref:2" class="reversefootnote" role="doc-backlink">&#8617;</a></p> </li> <li id="fn:3"> <p>D. Kim, C. Guyot. “Optimized Privacy-preserving CNN Inference with Fully Homomorphic Encryption.” <em>IEEE Transactions on Information Forensics and Security</em> 2023. <a href="#fnref:3" class="reversefootnote" role="doc-backlink">&#8617;</a></p> </li> <li id="fn:4"> <p>E. Lee, J. W. Lee, J. Lee, Y. S. Kim, Y. Kim, J. S. No, W. Choi. “Low-Complexity Deep Convolutional Neural Networks on Fully Homomorphic Encryption Using Multiplexed Parallel Convolutions” ICML 2022. <a href="#fnref:4" class="reversefootnote" role="doc-backlink">&#8617;</a></p> </li> </ol> </div>]]></content><author><name>Jaiyoung Park</name></author><summary type="html"><![CDATA[TL;DR: NeuJeans introduces a new “Coefficients-in-Slot” (CinS) encoding for CKKS. 
It rethinks how convolutions are laid out and fuses them with bootstrapping, cutting latency on big models like ResNet running over ImageNet.]]></summary></entry><entry><title type="html">Ciphertext-Ciphertext Matrix Multiplication: Fast for Large Matrices</title><link href="https://ckks.org/blog/2025/ccmm/" rel="alternate" type="text/html" title="Ciphertext-Ciphertext Matrix Multiplication: Fast for Large Matrices"/><published>2025-08-10T04:00:00+00:00</published><updated>2025-08-10T04:00:00+00:00</updated><id>https://ckks.org/blog/2025/ccmm</id><content type="html" xml:base="https://ckks.org/blog/2025/ccmm/"><![CDATA[<ul> <li>Written By <a href="https://jaihyun.com">Jai Hyun Park</a></li> <li>Based on <a href="https://ia.cr/2025/448">https://ia.cr/2025/448</a> (Eurocrypt 2025)</li> </ul> <p><em>TL;DR: We propose fast ciphertext-ciphertext matrix multiplication (CC-MM) algorithms for large matrices. Our algorithms consist of plaintext matrix multiplications (PP-MM) and ciphertext matrix transpose algorithms (C-MT). We introduce and utilize new fast C-MT algorithms for large matrices. By leveraging high-performance BLAS libraries to accelerate PP-MM, we implement large-scale CC-MM with substantial performance improvements.</em></p> <hr/> <h2 id="the-matrix-has-you">The Matrix Has You</h2> <blockquote> <p><em>“The Matrix is everywhere. It is all around us. Even now, in this very room.”</em> — <strong>Morpheus</strong>, <em>The Matrix</em> (1999), written by Lilly &amp; Lana Wachowski</p> </blockquote> <p>Matrix multiplication (PP-MM) is one of the central tasks in machine learning (ML). In particular, for large ML models, accelerating large matrix multiplication is important as its cost grows cubically with the dimension of the matrix. There exist many highly optimized open linear algebra libraries (BLAS libraries). 
On my machine, using a single CPU thread, the OpenBLAS library took 1.47 seconds to multiply two large \(4096\times 4096\) matrices, which was 30x faster than my schoolbook implementation.</p> <p><strong>Matrix multiplication on <em>encrypted data</em></strong> plays an important role in privacy-preserving ML, when a server trains or performs inference on ML models using encrypted data. For example, when using homomorphic encryption for large language models, one needs to multiply matrices of various dimensions either in encrypted or unencrypted form. For the multiplication between plaintext and ciphertext matrices (PC-MM), a recent work BCHPS24<sup id="fnref:1"><a href="#fn:1" class="footnote" rel="footnote" role="doc-noteref">1</a></sup> significantly improved the efficiency compared to the state of the art, taking 17.1 seconds for \(4096\times 4096\) matrices on an Intel Xeon Gold 6242 CPU running at 2.80GHz, using a single thread.</p> <p>However, ciphertext-ciphertext matrix multiplication (CC-MM) has been much slower than PC-MM and PP-MM. A notable work JKLS19<sup id="fnref:2"><a href="#fn:2" class="footnote" rel="footnote" role="doc-noteref">2</a></sup> took 0.6 seconds for multiplying two \(64\times 64\) matrices on an Intel Core i9 CPU with 4 cores rated at 2.3GHz. If one naively utilizes JKLS19 for \(4096\times 4096\) matrices, it would take more than 19 hours. This is not desirable. Note that most of the current CC-MM implementations are based on JKLS19 or its variants.</p> <h2 id="as-fast-as-matrix-multiplications-in-clear">As Fast As Matrix Multiplications in Clear</h2> <p><em>Reduction to PP-MM</em></p> <p>A lower bound for the cost of PC-MM and CC-MM is the cost of PP-MM: if CC-MM were faster than PP-MM, we could replace all existing PP-MM code with this new CC-MM implementation. How close can PC-MM and CC-MM be to PP-MM?</p> <p>BCHPS24 suggested a strategy to answer this question by providing a reduction from PC-MM to PP-MMs. 
Precisely, they perform PC-MM by using multiple PP-MMs with other minor computations. This makes PC-MM comparable to PP-MM up to small constant factors. Note that the impact of this reduction strategy is significant in practice; JKLS19 would take several hours for large matrices even though the JKLS19 algorithm provides a good asymptotic complexity. The reduction allows one to exploit BLAS libraries for PC-MM.</p> <div class="row mt-3"> <div class="col-sm-10 mt-3 mt-md-0 mx-auto d-block"> <figure> <picture> <source class="responsive-img-srcset" srcset="/assets/img/blog/2508_JaiHyun/PC-MM-480.webp 480w,/assets/img/blog/2508_JaiHyun/PC-MM-800.webp 800w,/assets/img/blog/2508_JaiHyun/PC-MM-1400.webp 1400w," type="image/webp" sizes="95vw"/> <img src="/assets/img/blog/2508_JaiHyun/PC-MM.jpg" class="img-fluid rounded" width="100%" height="auto" data-zoomable="" loading="eager" onerror="this.onerror=null; $('.responsive-img-srcset').remove();"/> </picture> </figure> </div> </div> <p>The reduction introduced in BCHPS24 starts with a simple equation borrowed from LZ22<sup id="fnref:3"><a href="#fn:3" class="footnote" rel="footnote" role="doc-noteref">3</a></sup>. A bundle of coefficient-encoded CKKS ciphertexts encrypting each row of a plaintext matrix \(\textbf{V}\) corresponds to a pair of plaintext matrices \((\textbf{A}, \textbf{B})\) satisfying</p> \[\textbf{A} \cdot \textbf{S}^*+\textbf{B}~\approx~\textbf{V},\] <p>where \(\textbf{S}^*\) is a structured matrix determined by the ciphertext format<sup id="fnref:4"><a href="#fn:4" class="footnote" rel="footnote" role="doc-noteref">4</a></sup> and the secret. Left-multiplying the above equation by a plaintext matrix \(\textbf{U}\) gives</p> \[(\textbf{U}\cdot\textbf{A})\cdot \textbf{S}^* +(\textbf{U}\cdot\textbf{B})~\approx~\textbf{U}\cdot\textbf{V}.\] <p>This process results in CKKS ciphertexts that encrypt each row of the product matrix \(\textbf{U}\cdot\textbf{V}\). 
This is a reduction from PC-MM to two plaintext matrix multiplications \(\textbf{U}\cdot\textbf{A}\) and \(\textbf{U}\cdot\textbf{B}\). One of the contributions of BCHPS24 is the use of BLAS PP-MM libraries for PC-MM tasks of various dimensions.</p> <h3 id="what-about-cc-mm">What about CC-MM?</h3> <p>One may ask whether we can reduce CC-MM to PP-MMs in the same way. However, CC-MM is not as easy as PC-MM. Suppose we have two bundles of CKKS ciphertexts that decrypt to \(\textbf{U}\) and \(\textbf{V}\), respectively. We can represent them as two pairs of matrices \((\textbf{A}_{\textbf{U}}, \textbf{B}_\textbf{U})\) and \((\textbf{A}_\textbf{V}, \textbf{B}_\textbf{V})\) such that \(\textbf{A}_\textbf{U} \cdot \textbf{S}^*+\textbf{B}_\textbf{U}~\approx~\textbf{U}\) and \(\textbf{A}_\textbf{V} \cdot \textbf{S}^*+\textbf{B}_\textbf{V}~\approx~\textbf{V}\). Once we multiply them, we obtain</p> \[\textbf{U}\cdot\textbf{V}~\approx~(\textbf{A}_\textbf{U} \cdot \textbf{S}^*\cdot \textbf{A}_\textbf{V} \cdot \textbf{S}^*)+(\textbf{A}_\textbf{U} \cdot \textbf{S}^*\cdot\textbf{B}_\textbf{V})+(\textbf{B}_\textbf{U}\cdot \textbf{A}_\textbf{V} \cdot \textbf{S}^*)+(\textbf{B}_\textbf{U}\cdot \textbf{B}_\textbf{V}).\] <p>In contrast to PC-MM, it does not reduce to PP-MMs between \(\textbf{A}_\textbf{U}, \textbf{B}_\textbf{U}, \textbf{A}_\textbf{V}\), and \(\textbf{B}_\textbf{V}\). The secret matrix \(\textbf{S}^*\) appears in the middle of the multiplications. To interpret this using CKKS ciphertexts, one may want to swap the order of matrix multiplication (e.g., \(\textbf{A}_\textbf{U} \cdot\textbf{B}_\textbf{V}\cdot \textbf{S}^*\) instead of \(\textbf{A}_\textbf{U} \cdot \textbf{S}^*\cdot\textbf{B}_\textbf{V}\)). However, matrix multiplication is not commutative.</p> <h2 id="ciphertext-matrix-transpose">Ciphertext Matrix Transpose</h2> <p>Here is an approach to solve the above problem. 
For each \(\textbf{S}^*\cdot\textbf{M}\), suppose we can find an appropriate \(\textbf{M}'\) such that \(\textbf{S}^*\cdot\textbf{M}~=~\textbf{M}'\cdot\textbf{S}^*\). Once we find it, we replace \(\textbf{S}^*\cdot\textbf{M}\) with \(\textbf{M}'\cdot\textbf{S}^*\). Then, the above equation turns into</p> \[\textbf{U}\cdot\textbf{V}~\approx~(\textbf{A}_\textbf{U} \cdot \textbf{A}_\textbf{V}' )\cdot {\textbf{S}^*}^2+(\textbf{A}_\textbf{U} \cdot\textbf{B}_\textbf{V}'+\textbf{B}_\textbf{U}\cdot \textbf{A}_\textbf{V})\cdot \textbf{S}^*+(\textbf{B}_\textbf{U}\cdot \textbf{B}_\textbf{V}).\] <p>This reduces CC-MM to four PP-MMs, \(\textbf{A}_\textbf{U} \cdot \textbf{A}_\textbf{V}'\) , \(\textbf{A}_\textbf{U} \cdot\textbf{B}_\textbf{V}'\), \(\textbf{B}_\textbf{U}\cdot \textbf{A}_\textbf{V}\), and \(\textbf{B}_\textbf{U}\cdot \textbf{B}_\textbf{V}\), and a relinearization.</p> <p>Let us take a step back to understand what \({\textbf{S}^*}\cdot\textbf{M}\) means. A bundle of CKKS ciphertexts encrypting each <em>column</em> of a plaintext matrix \(\textbf{V}\) corresponds to a pair of matrices \((\textbf{A}, \textbf{B})\) such that</p> \[{\textbf{S}^*}^T\cdot \textbf{A} +\textbf{B}~\approx~\textbf{V}.\] <p>Thereby, finding \((\textbf{A}', \textbf{B}')\) from \((\textbf{A},\textbf{B})\) such that \(\textbf{A}'\cdot\textbf{S}^*+\textbf{B}'~\approx~{\textbf{S}^*}^T\cdot \textbf{A} +\textbf{B}~\approx~\textbf{V}\) is equivalent to the <strong>ciphertext matrix transpose</strong> (C-MT) problem. 
Even though this is not exactly the same as finding \(\textbf{M}'\) from \(\textbf{M}\) such that \(\textbf{S}^*\cdot\textbf{M}~=~\textbf{M}'\cdot\textbf{S}^*\), the above argument hints that a fast C-MT algorithm would be useful for fast CC-MM.</p> <div class="row mt-3"> <div class="col-sm-9 mt-3 mt-md-0 mx-auto d-block"> <figure> <picture> <source class="responsive-img-srcset" srcset="/assets/img/blog/2508_JaiHyun/C-MT-480.webp 480w,/assets/img/blog/2508_JaiHyun/C-MT-800.webp 800w,/assets/img/blog/2508_JaiHyun/C-MT-1400.webp 1400w," type="image/webp" sizes="95vw"/> <img src="/assets/img/blog/2508_JaiHyun/C-MT.jpg" class="img-fluid rounded" width="100%" height="auto" data-zoomable="" loading="eager" onerror="this.onerror=null; $('.responsive-img-srcset').remove();"/> </picture> </figure> </div> </div> <p>Algebraically, the C-MT is related to converting \(\{m_i(X)~\approx~ \sum_j \textbf{M}_{i,j}X^j \}_i\) into \(\{m_j'(X) ~\approx~ \sum_i \textbf{M}_{i,j}X^i \}_j\). We can use algebraic and algorithmic ideas to devise efficient C-MT algorithms. Please refer to Section 3 of Park25<sup id="fnref:6"><a href="#fn:6" class="footnote" rel="footnote" role="doc-noteref">5</a></sup> for the details. The paper introduces two C-MT algorithms, one focusing on the latency (Section 3) and the other focusing on the key size (Section 5). Independently, Craig Gentry has also sketched another C-MT approach involving bivariate polynomials in a recent talk at the FHE.org conference.<sup id="fnref:7"><a href="#fn:7" class="footnote" rel="footnote" role="doc-noteref">6</a></sup></p> <h2 id="ciphertext-ciphertext-matrix-multiplication">Ciphertext-Ciphertext Matrix Multiplication</h2> <p>Finally, we put it all together to devise a fast CC-MM algorithm for large matrices. 
Assume that we are given two pairs of matrices \((\textbf{A}_\textbf{U}, \textbf{B}_\textbf{U})\) and \((\textbf{A}_\textbf{V}, \textbf{B}_\textbf{V})\) such that \(\textbf{A}_\textbf{U} \cdot \textbf{S}^*+\textbf{B}_\textbf{U}~\approx~\textbf{U}\) and \(\textbf{A}_\textbf{V} \cdot \textbf{S}^*+\textbf{B}_\textbf{V}~\approx~\textbf{V}\). As previously mentioned, it implies that</p> \[\textbf{U}\cdot\textbf{V}~\approx~(\textbf{A}_\textbf{U} \cdot \textbf{S}^*+\textbf{B}_\textbf{U})\cdot(\textbf{A}_\textbf{V} \cdot \textbf{S}^*+\textbf{B}_\textbf{V}).\] <p>The CC-MM algorithm proceeds in the following steps.</p> <figure> <picture> <source class="responsive-img-srcset" srcset="/assets/img/blog/2508_JaiHyun/CC-MM-480.webp 480w,/assets/img/blog/2508_JaiHyun/CC-MM-800.webp 800w,/assets/img/blog/2508_JaiHyun/CC-MM-1400.webp 1400w," type="image/webp" sizes="95vw"/> <img src="/assets/img/blog/2508_JaiHyun/CC-MM.jpg" class="img-fluid rounded" width="100%" height="auto" data-zoomable="" loading="eager" onerror="this.onerror=null; $('.responsive-img-srcset').remove();"/> </picture> </figure> <p><br/></p> <ol> <li> <p>It first performs C-MT on \((\textbf{A}_\textbf{U}, \textbf{B}_\textbf{U})\), obtaining a pair of matrices \((\underline{\textbf{A}}_\textbf{U}, \underline{\textbf{B}}_\textbf{U})\) that satisfies \({\textbf{S}^*}^T\cdot\underline{\textbf{A}}_\textbf{U}+\underline{\textbf{B}}_\textbf{U}\approx\textbf{U}\). Then,</p> \[\textbf{U}\cdot\textbf{V}~\approx~({\textbf{S}^*}^T\cdot\underline{\textbf{A}}_\textbf{U}+\underline{\textbf{B}}_\textbf{U})\cdot(\textbf{A}_\textbf{V} \cdot \textbf{S}^*+\textbf{B}_\textbf{V}).\] </li> <li> <p>It computes four PP-MMs, \(\underline{\textbf{A}}_\textbf{U} \cdot \textbf{A}_\textbf{V}\), \(\underline{\textbf{A}}_\textbf{U} \cdot \textbf{B}_\textbf{V}\), \(\underline{\textbf{B}}_\textbf{U}\cdot \textbf{A}_\textbf{V}\), and \(\underline{\textbf{B}}_\textbf{U}\cdot \textbf{B}_\textbf{V}\), using fast BLAS libraries. 
Let \(\textbf{C}_{0,0}\) , \(\textbf{C}_{0,1}\), \(\textbf{C}_{1,0}\), \(\textbf{C}_{1,1}\) be the output matrices, respectively. Then, it holds that</p> \[\textbf{U}\cdot\textbf{V}~\approx~{\textbf{S}^*}^T\cdot \textbf{C}_{0,0} \cdot \textbf{S}^*+{\textbf{S}^*}^T\cdot\textbf{C}_{0,1}+\textbf{C}_{1,0}\cdot \textbf{S}^*+\textbf{C}_{1,1}.\] </li> <li> <p>It transposes \((\textbf{C}_{0,0}, \textbf{0})\) and \((\textbf{C}_{0,1}, \textbf{0})\) to find matrices such that \({\textbf{S}^*}^T\cdot \textbf{C}_{0,0}\approx \textbf{A}\cdot\textbf{S}^*+\textbf{B}\) and \({\textbf{S}^*}^T\cdot \textbf{C}_{0,1}\approx \textbf{A}'\cdot\textbf{S}^*+\textbf{B}'\). Since each entry of \(\textbf{S}^*\) is small, it implies that \({\textbf{S}^*}^T\cdot \textbf{C}_{0,0}\cdot \textbf{S}^*\approx \textbf{A}\cdot{\textbf{S}^*}^2+\textbf{B}\cdot \textbf{S}^*\). After several matrix additions, it obtains three matrices \((\textbf{C}_2, \textbf{C}_1, \textbf{C}_0)\) such that</p> \[\textbf{U}\cdot\textbf{V}~\approx~\textbf{C}_2 \cdot{\textbf{S}^*}^2+\textbf{C}_1\cdot\textbf{S}^*+\textbf{C}_0.\] </li> <li> <p>It relinearizes and rescales each row of \((\textbf{C}_2, \textbf{C}_1, \textbf{C}_0)\) as is usually done in CKKS. It finally obtains a pair of matrices \((\textbf{A}_\times, \textbf{B}_\times)\) such that</p> \[\textbf{U}\cdot\textbf{V}~\approx~\textbf{A}_\times\cdot\textbf{S}^*+\textbf{B}_\times.\] <p>This corresponds to CKKS encryptions of the product matrix \(\textbf{U}\cdot\textbf{V}\), completing the CC-MM.</p> </li> </ol> <p>Note that the algorithm consists of three C-MTs and four PP-MMs. The cost of C-MT is asymptotically minor compared to the cost of PP-MM, and one can fully utilize the high efficiency of BLAS libraries for the PP-MMs. 
The implementation in Park25<sup id="fnref:6:1"><a href="#fn:6" class="footnote" rel="footnote" role="doc-noteref">5</a></sup> took 85.2 seconds to multiply two \(4096\times 4096\) encrypted matrices on an Intel Xeon Gold 6242 CPU running at 2.80GHz, using a single thread. This is a huge improvement compared to the previous 19 hours. Park25<sup id="fnref:6:2"><a href="#fn:6" class="footnote" rel="footnote" role="doc-noteref">5</a></sup> introduces another CC-MM algorithm that focuses on the key size, so take a look if you are interested!</p> <h2 id="next-steps">Next Steps</h2> <p>In the more recent work BCHPS25<sup id="fnref:5"><a href="#fn:5" class="footnote" rel="footnote" role="doc-noteref">7</a></sup>, which is an extended and consolidated journal version of this paper and BCHPS24, my colleagues and I introduce fast CC-MM algorithms with pre-computations. As a future direction, it would be interesting to explore whether CC-MM for small-dimensional matrices can be reduced to PP-MM of the same dimensions, without any pre-computations.</p> <div class="footnotes" role="doc-endnotes"> <ol> <li id="fn:1"> <p>Y. Bae, J. H. Cheon, G. Hanrot, J. H. Park, D. Stehlé. “Plaintext-Ciphertext Matrix Multiplication and FHE Bootstrapping: Fast and Fused.” Crypto 2024. <a href="#fnref:1" class="reversefootnote" role="doc-backlink">&#8617;</a></p> </li> <li id="fn:2"> <p>X. Jiang, M. Kim, K. Lauter, Y. Song. “Secure Outsourced Matrix Computation and Application to Neural Networks.” CCS 2018. <a href="#fnref:2" class="reversefootnote" role="doc-backlink">&#8617;</a></p> </li> <li id="fn:3"> <p>J. Liu, L. F. Zhang. “Privacy-preserving and publicly verifiable matrix multiplication.” IEEE Transactions on Services Computing. 2022. <a href="#fnref:3" class="reversefootnote" role="doc-backlink">&#8617;</a></p> </li> <li id="fn:4"> <p>In BCHPS24, the authors describe the structures under various ciphertext formats including MLWE, RLWE, shared-a RLWE, and shared-a MLWE. 
They also provide efficient conversions between different ciphertext formats, which particularly enabled them to take the conventional RLWE-based CKKS ciphertexts as inputs and outputs. <a href="#fnref:4" class="reversefootnote" role="doc-backlink">&#8617;</a></p> </li> <li id="fn:6"> <p>J. H. Park. “Ciphertext-Ciphertext Matrix Multiplication: Fast for Large Matrices.” Eurocrypt 2025. <a href="#fnref:6" class="reversefootnote" role="doc-backlink">&#8617;</a> <a href="#fnref:6:1" class="reversefootnote" role="doc-backlink">&#8617;<sup>2</sup></a> <a href="#fnref:6:2" class="reversefootnote" role="doc-backlink">&#8617;<sup>3</sup></a></p> </li> <li id="fn:7"> <p>C. Gentry, “Reducing Encrypted Matrix Multiplication to Plaintext Matrix Multiplication.” FHE.org Conference 2025. <a href="#fnref:7" class="reversefootnote" role="doc-backlink">&#8617;</a></p> </li> <li id="fn:5"> <p>Y. Bae, J. H. Cheon, G. Hanrot, J. H. Park, D. Stehlé. “Fast Homomorphic Linear Algebra with BLAS.” Preprint. <a href="#fnref:5" class="reversefootnote" role="doc-backlink">&#8617;</a></p> </li> </ol> </div>]]></content><author><name>Jai Hyun Park</name></author><summary type="html"><![CDATA[TL;DR: We propose fast ciphertext-ciphertext matrix multiplication (CC-MM) algorithms for large matrices. Our algorithms consist of plaintext matrix multiplications (PP-MM) and ciphertext matrix transpose algorithms (C-MT). We introduce and utilize new fast C-MT algorithms for large matrices. By leveraging high-performance BLAS libraries to accelerate PP-MM, we implement large-scale CC-MM with substantial performance improvements.]]></summary></entry></feed>