Montgomery’s Trick

Note: This is extracted from our article on the power of GPU parallelization.

The objective is to invert a batch or array of field elements. In a naïve approach, one would invert each element in the array. This would mean applying the extended Euclidean algorithm for each element to compute its inverse. However, since field inversion is an expensive operation, we will use something called Montgomery’s trick.

It starts with an array of elements to invert.

a1,a2,a3,a4Fp

It then computes the accumulated products up to each element.

β1=a1 β2=a1×a2 β3=a1×a2×a3 β4=a1×a2×a3×a4

This could be done with as many products as there are elements in the array.

β1=a1 β2=β1×a2 β3=β2×a3 β4=β3×a4

In the next step, we compute the field inversion for the final accumulated product, using the extended Euclidean algorithm.

β41eea(β4)

This will be the only time we use this expensive operation.

Now, we want to calculate the inversion of each element from the inversion of the final accumulated product. The trick uses the previously accumulated products for this, as follows.

a41=β41×β3

How does that work? Well, we could explain it by imagining the field inversion as a division. This is an intuitive representation only, since there is no such thing as division in Fp.

a41=β41×β3 1a1×a2×a3×a4×a1×a2×a3 =1a4

The inversion of the accumulated product is equal to the product of all the individual inverses. By multiplying it with the elements we don’t want, those inversions are canceled out, and only the element we want remains.

Now it can calculate the inversion of the previous accumulated product.

β31=β41×a4

The same idea applies.

β31=β41×a4 1a1×a2×a3×a4×a4 =1a1×a2×a3

This idea would now be sequentially repeated for the remaining elements in the array.

Element Inversion of the accumulated product Inversion of the element
a4 β41eea(β4) a41=β41×β3
a3 β31=β41×a4 a31=β31×β2
a2 β21=β31×a3 a21=β21×β1
a1 β11=β21×a2 a11=β11

Which leaves us with every element’s inversion.

But why have we done all this? Well, for an array with n elements:

  • A naïve implementation would need n field inversions.
  • Montgomery’s trick needs 3(n1) field multiplications and only one field inversion.
    • Needs n1 multiplications to compute the accumulated products.
    • Needs one field inversion for the final accumulated product.
    • Needs 2(n1) multiplications to invert each element.

Knowing how expensive the extended Euclidean algorithm is, we will gladly take the trade.