This assignment was to design and implement an array multiplier, pipeline it with different numbers of stages, and compare the implementations with each other as well as with a simple instantiation of a DSP 48. A secondary goal was to make the multiplier highly parameterizable.
A highly parameterized array multiplier was designed in VHDL. It can be parameterized for different input word widths and for the desired number of pipeline stages. Originally the multiplier was also parameterizable for I/O registers, but this only complicated the problem, and the pipelining structure itself already adds I/O registers automatically. The basic template for the design consists of rows of ripple-carry adders, with the result of each row fed into the next. This design is shown in Figure 1.
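The row-by-row structure described above can be sketched as a small behavioral model. This is Python rather than the VHDL used in the experiments, and the names (array_multiply, upper, low_bits) are illustrative assumptions, not identifiers from the original design. Each loop iteration stands for one row: a row of AND gates forms the partial product, a single adder combines it with the sum handed down from the previous row, and the lowest bit of the result is already final.

```python
def array_multiply(a: int, b: int, width: int) -> int:
    """Behavioral model of an unsigned width x width array multiplier.

    Illustrative sketch only; it models the arithmetic, not timing.
    """
    limit = 1 << width
    if not (0 <= a < limit and 0 <= b < limit):
        raise ValueError("operands must fit in `width` bits")

    upper = 0     # sum passed from one row to the next (width bits)
    low_bits = 0  # product bits that are already final
    for row in range(width):
        # Row of AND gates: partial product is a if this bit of b is set.
        partial = a if (b >> row) & 1 else 0
        # One adder per row; the carry-out becomes the top bit of `total`.
        total = upper + partial
        # The least significant bit of each row is a finished product bit.
        low_bits |= (total & 1) << row
        # The remaining bits feed into the next row.
        upper = total >> 1
    return (upper << width) | low_bits
```

For example, array_multiply(5, 3, 3) returns 15, and the model agrees with ordinary integer multiplication for every pair of in-range operands.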
Each experiment was run on the multiplier with a data width of 18 bits. The multiplier was pipelined with 1, 2, 3, 6, 9, and 18 stages, and the maximum clock frequency and resource usage of these implementations were compared. In addition, the timing results (clock speed and critical path latency) from the synthesis report and the post-PAR report are compared.
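All of the tested stage counts divide 18 evenly. A minimal sketch of how the 18 adder rows might be grouped into stages follows, assuming the rows are split evenly between stages; the writeup does not state the partitioning rule, and stage_row_ranges is a hypothetical helper name.

```python
def stage_row_ranges(width: int, stages: int) -> list:
    """Group adder rows 0..width-1 into equally sized pipeline stages.

    Assumption: rows are divided evenly, so `stages` must divide `width`.
    """
    if stages < 1 or width % stages != 0:
        raise ValueError("stages must be a positive divisor of width")
    rows = width // stages
    return [range(s * rows, (s + 1) * rows) for s in range(stages)]


if __name__ == "__main__":
    for stages in (1, 2, 3, 6, 9, 18):
        groups = stage_row_ranges(18, stages)
        print(stages, "stage(s):", len(groups[0]), "adder row(s) per stage")
```

Under this assumption, the number of adder rows between registers falls from 18 in the single-stage design to 1 in the 18-stage design.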
Figure 1: This design is representative of the one implemented in these experiments. It shows an 8×5 multiplier with the critical path highlighted. This is the critical path found for the single-stage implementation.
The parameterized array multiplier was synthesized for each pipeline depth in Xilinx ISE 9.2 for the Virtex 4 LX100 part at the fastest speed grade (-12). A separate design, consisting simply of an 18-bit multiplier implemented in a DSP 48, was also synthesized. The results of this synthesis are shown in Figures 2 and 3.
Figure 2 shows the maximum clock frequency attainable at each pipeline depth applied to the multiplier. The DSP 48 is not pipelined, and the 424 MHz shown for it is taken from its PAR report. The synthesis report gives an estimate of the expected operating frequency of the design, while the PAR report gives the actual delay of the critical path. When these are compared in Figure 2, the synthesis report frequency is on average 50% better than the PAR frequency. It is also interesting to note that the frequency increases fairly linearly with the number of pipeline stages. Notice also that the reported frequency for the 18-stage multiplier is better than that reported for the DSP 48.
Figure 3 shows the FPGA resources used by each design implementation. The values expected for the base implementation were 162 slices and 324 LUTs. These numbers assume that each one-bit stage of the multiplier occupies one LUT, and therefore that each slice holds two bits. Notice in Figure 3 that resource usage increases with the number of pipeline stages. This is due mostly to the need to register both inputs, the carry bits, and the already computed result bits in each pipeline stage. When the implemented (post-PAR) design was viewed in FPGA Editor, the structure of the multiplier was very regular. There is some interesting routing to get to and from the IOBs, but the structure of the multiplier itself is regular. This contributes to the achieved speeds.
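The expected base figures follow directly from the stated assumption of one LUT per one-bit stage and two LUTs per slice. A small sketch of that arithmetic is below; base_resources is a hypothetical helper name, and the estimate deliberately ignores the pipeline registers discussed above.

```python
LUTS_PER_SLICE = 2  # from the writeup's assumption of two bits per slice


def base_resources(width: int) -> tuple:
    """Return (luts, slices) for an unpipelined width x width array.

    Assumes one LUT per one-bit multiplier cell, as in the text.
    """
    luts = width * width
    slices = -(-luts // LUTS_PER_SLICE)  # ceiling division
    return luts, slices


if __name__ == "__main__":
    luts, slices = base_resources(18)
    print(f"{luts} LUTs, {slices} slices")  # prints "324 LUTs, 162 slices"
```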
This assignment was not difficult in itself, but some of the approaches tried created difficulties. The initial attempt was to build a single cell consisting of an “AND” gate and a full adder and then tile these cells together to create the array structure. This implementation did not do well in timing analysis because it did not use the carry chain logic built into the FPGA. It turned out that for the Xilinx tools to properly infer this chain, the “+” operator had to be used. This motivated a change in implementation to two nested for-generate statements.
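For reference, the tiled-cell structure of the first attempt can be modeled at the bit level as follows. This is a behavioral Python sketch with hypothetical names (full_adder, cell, tiled_multiply), not the original VHDL. It can only show that the tiled cells compute the product; the timing difference described above comes from how the Xilinx tools map the logic and is not captured here.

```python
def full_adder(x: int, y: int, carry_in: int) -> tuple:
    """One-bit full adder returning (sum, carry_out)."""
    total = x ^ y ^ carry_in
    carry_out = (x & y) | (carry_in & (x ^ y))
    return total, carry_out


def cell(a_bit: int, b_bit: int, sum_in: int, carry_in: int) -> tuple:
    """Single array cell: an AND gate feeding a full adder."""
    return full_adder(a_bit & b_bit, sum_in, carry_in)


def tiled_multiply(a: int, b: int, width: int) -> int:
    """Unsigned multiply built by tiling `cell` in a width x width grid."""
    sum_in = [0] * width  # sums entering the current row, LSB first
    product = 0
    for row in range(width):
        b_bit = (b >> row) & 1
        carry = 0
        row_sum = []
        for col in range(width):  # carry ripples along the row
            s, carry = cell((a >> col) & 1, b_bit, sum_in[col], carry)
            row_sum.append(s)
        product |= row_sum[0] << row       # lowest bit of the row is final
        sum_in = row_sum[1:] + [carry]     # shift by one for the next row
    for col, bit in enumerate(sum_in):     # upper half of the product
        product |= bit << (width + col)
    return product
```

In the model, each inner loop over col computes the same sum that a single “+” over the whole row would produce, which is the form the text says was needed for the tools to infer the carry chain.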
Code for the implementation and the DSP instantiation used in this experiment is available here.
A PDF of this writeup is available here.