Modular multiplication is an essential operation in ECC. Two main approaches may be employed. The first is known as interleaved modular multiplication using Montgomery’s method. Montgomery multiplication is widely used in implementations where arbitrary curves are desired. Another approach is known as multiply-then-reduce and is used in elliptic curves built over finite fields of Merssene primes. Merssene primes are the special type of primes which allow for efficient modular reduction through series of additions and subtractions. In order to optimize the multiplication process, some ECC processors use the divide and conquer approach of Karatsuba–Ofman multiplications, where others use embedded multipliers and DSP blocks within FPGA fabrics. Since modular division in affine coordinates is a costly process, numerous coordinate representation systems have been proposed to compensate this cost by means of extra multiplications and additions (e.g., Jacobian coordinates). Conversion back to affine representation can be mechanized using Fermat’s little theorem. Such processors may implement a dedicated squarer to speed up the inversion process. On the other hand, binary GCD modular division algorithm is utilized in many ECC processors where affine coordinate system is used. Binary GCD algorithm is based on simple add and shift operations, while the same operations are used by Montgomery multiplication. Hence, many ECC processors with combined modular division and multiplication blocks have been proposed. The complexity of modular division algorithms is approximately O(2n),where n is the size of operands and the running time is variable and depends directly on the inputs.
- Long Data path
- More Area and More Power
Redundant Signed Digits:
The RSD representation, first introduced by Avizienis, is a carry free arithmetic where integers are represented by the difference of two other integers. An integerXis represented by the difference of its x+ and x− components, where x+ is the positive component and x− is the negative component. The nature of the RSD representation has the advantage
of performing addition and subtraction without the need of the two’s complement representation. On the other hand, an overhead is introduced due to the redundancy in the integer representation, since an integer in RSD representation requires double word length compared with typical two’s complement representation. In radix-2 balanced RSD represented integers, digits of such integers are either 1, 0, or−1.
The complexity of the regular multiplication using the schoolbook method is O(n2). Karatsuba and Ofman proposed a methodology to perform a multiplication with complexityO(n1.58)by dividing the operands of the multiplication into smaller and equal segments. Having two operands of length nto be multiplied, the Karatsuba–Ofman methodology suggests to split the two operands into high-(H)and low-(L) segments as follows:
Consider β as the base for the operands, where βis 2 in case of integers and βis x in case of polynomials. Then, the multiplication of both operands is performed as follows: considering a=aL+aHβ[n/2]and b=bL+bHβ[n/2](1)
Hence, four half-sized multiplications are needed, where Karatsuba methodology reformulate (1) to
Therefore, only three half-sized multiplications are needed. The original Karatsuba algorithm is performed recursively, where the operands are segmented into smaller parts until a reasonable size is reached, and then regular multiplications of the smaller segments are performed recursively.
Binary GCD Modular Division:
A modular division algorithm is proposed based on the extended Euclidean algorithm. This algorithm is considered as the basis for several hardware implementations of modular division. Algorithm 1 computes the modular division Z ≡X/Y (mod M) based on the plus– minus version of the original binary GCD algorithm. The algorithm instantiates the four registers A, B, U, and V that are initialized with Y, M, X, and 0, respectively. Then, it constantly reduces the values of Y and Min order to calculate the GCD(Y,M) which is equal to 1 in well-formed elliptic curves where the modulo is prime. The registers U and V are used to calculate the quotient and the operations performed on these registers are similar to the operations performed on the A and B registers. The operations on the registers A and B are performed by repetitively reducing the contents of both registers by simple shift or add/subtract-shift operations based on the conditions whether the intermediate contents are even or not. In the case where both registers contents are odd, the content of both registers are added ifA+Bis divisible by 4 or subtracted,(A−B), otherwise. Two variables ρ and δ are used to control the iterations of the algorithm based on the bounds of the registers contents, where δ=α−β, 2α and 2β are the upper bounds ofAandB, respectively, and ρ=min(α, β).
The AU is the core unit of the processor that includes the following blocks: 1) modular addition/subtraction block; 2) modular multiplication block; and 3) modular division block.
Modular Addition and Subtraction:
Addition is used in the accumulation process during the multiplication, as well as, in the binary GCD modular divider algorithm. In the proposed implementation, radix-2 RSD representation system as carry free representation is used. In RSD with radix-2, digits are represented by 0, 1, and−1, where digit 0 is coded with 00, digit 1 is coded with 10, and digit −1is coded with 01. In Fig. 2, an RSD adder is presented that is built from generalized full adders. The problem with this adder is that it tends to expand the addition result even if there is no overflow, since it restricts the least significant digit (LSD) to be digit−1 only. This unnecessary overflow affects the reduction process later and produces some control complexities in the overall processor architecture. However, the overflow is easily managed when the adder is instantiated as a subblock within a multiplier or a divider as is the case in the proposed implementation.
In order to overcome the problem of overflow introduced in the adder proposed, a new adder is proposed based on the work proposed. The proposed adder consists of two layers, where layer 1 generates the carry and the interim sum, and layer 2 generates the sum, as shown in Fig. 3. Table I shows the addition rules that are performed by layer 1 of the RSD adder, where RSD digits 0,+1, and−1 are represented by Z, P, and N, respectively. It works by assuring that layer 2 does not generate overflow through the use of previous digits in layer 1. The proposed adder is used as the main block in the modular addition component to take advantage of the reduced overflow feature. However, overflow is not an issue in both the multiplier and the divider when an RSD adder is used as an internal block. Hence, the reduced area is taken as an advantage in instantiating adders within the multiplier and the divider.
Karatsuba’s multiplier recursive nature is considered a major drawback when implemented in hardware. Hardware complexity increases exponentially with the size of the operands to be multiplied. To overcome this drawback, Karatsuba method is applied at two levels. A recursive Karatsuba block that works depthwise, and an iterative Karatsuba that works widthwise.
Recursive Construction of Karatsuba Multiplier:
In general, the reduced complexity of Karatsuba multiplication comes from the fact that four half-word multiplications are replaced by three half-word multiplications with some additions and subtractions. However, the complexity impact increases with the increase of the recursive depth of the multiplier. Hence, it is not sufficient to divide the operands into halves and apply the Karatsuba method at this level only. Operands of sizen-RSD digits are divided into two (low and high) equal sizedn/2-RSD digits branches. The low branches are multiplied through an n/2 Karatsuba multiplier and the high branches are multiplied through anothern/2 Karatsuba multiplier. Implementation difficulties arise with the middle Karatsuba multiplier when multiplying the results of addition of the low and high branches of each operand by itself. The results of the addition are of sizen/2+1-RSD digits so that an unbalanced Karatsuba multiplier of sizen/2+1 is required. Hence, the carry generated by the middle addition operation needs to be addressed to avoid implementation complexities of the unbalanced Karatsuba multiplier.
High-Radix Modular Division:
Binary GCD algorithm is an efficient way of performing modular division since it is based on addition, subtraction, and shifting operations. The complexity of the division operation comes from the fact that the running time of the algorithm is inconsistent and is input dependent. As seen in Algorithm 2, three main states define the flow of the algorithm. In the first state, the divider is checked whether it is even or odd. In the second state, the content of the corresponding registers are swapped according to the flag δ. In the last state, division by 4 modulo M is performed.
In order to efficiently implement Algorithm 1 in hardware, the following list of operations should be adopted to be executed efficiently in hardware. First, division by 2 or by 4 is simply performed by shifting to right 1-digit/2-digits accordingly based on the guarantee that the LSDs are zeros in line 3 and 12 of the algorithm. On the other hand, division by 2 moduloM(division by 4 modulo M) is performed by adding or subtracting the dividend to or from the modulus according to whether the dividend is even or odd and the value of M(mod 4). For both δ andρ, a comparison with 0 is necessary. However, an efficient alternative is to initialize a vector of sizenwith all zeros except the least significant byte (LSB) for δ and the most significant byte (MSB) forρ. Hence, the counting down of ρ is performed by shifting 1 bit to right and only the LSB is checked for the loop termination. On the other hand, a flag is needed to control the shift direction ofδ, where the flag and the value of the LSB are used to determine whether it is less than zero or not. The implementation of the algorithm follows the implementation proposed. The modular divider architecture is shown in Fig. 6. Three RSD adders are used along with three 3×1 multiplexers and one 4×1 multiplexer with some control logic.
- Short Data path
- Low Power and Low Area