3
\$\begingroup\$

I am creating a design that must be portable across different tools: Xilinx Vivado, Intel Quartus, and Microsemi Libero. The design uses a multiplier followed by an adder that accumulates the multiplier's results. Together they form a multiply-accumulate block.

I can include or exclude a register stage between the multiplier and the adder. My question is: how should I decide whether to include it? Since the design could eventually be used in any tool for any FPGA, how do I decide what to do with that register stage?

The FSM design becomes a bit simpler without that register, but the question is how to reach a decision. Will not having a register there affect Fmax? Maybe. By how much depends on the rest of the design on that FPGA. This is what causes the confusion.

\$\endgroup\$

3 Answers

6
\$\begingroup\$

Such architectural quirks are commonly handled by having a small architecture-specific module for each non-portable function, for each architecture, and including it in the build. Have a subdirectory for each platform with these modules. They can all have the same names on each platform so they are easy to include in higher levels of the design. During the build you just tell the tool which platform's include directory to use.

A "portable" design doesn't need to mean it's some ivory tower "works everywhere" code with no platform-specific bits. It just means that you can build it using the tools you mentioned. That very often implies that there are platform-specific modules just to get the performance if you care about it, and usually you do.
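
The per-platform directory selection described above could be sketched as a small build helper. This is only an illustration: the directory layout, the `PLATFORM_DIRS` table, and the `source_dirs` function are hypothetical names, and each vendor tool would still need its own project script to consume the resulting list.

```python
from pathlib import Path

# Hypothetical layout: shared RTL lives in rtl/common, and each vendor
# directory holds wrapper modules with identical names (e.g. a MACC wrapper).
PLATFORM_DIRS = {
    "vivado": "rtl/platform/xilinx",
    "quartus": "rtl/platform/intel",
    "libero": "rtl/platform/microsemi",
}


def source_dirs(tool, root="."):
    """Return the source directories to pass to the given tool's build."""
    try:
        platform = PLATFORM_DIRS[tool.lower()]
    except KeyError:
        raise ValueError(f"unsupported tool: {tool!r}") from None
    base = Path(root)
    # Common code first, then exactly one platform-specific directory.
    return [base / "rtl/common", base / platform]


if __name__ == "__main__":
    for d in source_dirs("quartus"):
        print(d.as_posix())
```

Because the wrapper modules share names across platforms, the higher-level design never changes; only the second directory in the list does.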

In real life, designs are not inherently portable as a whole. But a majority of the code is portable. Just little bits are not, and those bits are well documented and can be independently scrutinized as necessary.

\$\endgroup\$
4
\$\begingroup\$

If you're trying to create an ultra-portable module, there are limits to the extent to which you can optimize performance. The best you can do is pick an architecture that is likely to perform well on the vast majority of implementation technologies.

In general, having the pipeline register will improve throughput on most technologies. You should go ahead and design the more complicated FSM. Having a simpler FSM only saves you a little bit of work now, at the cost of lower performance for every future user of your module. Not a worthwhile tradeoff.
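
To see what the extra FSM complexity amounts to, here is a minimal behavioural sketch of the datapath, assuming one operand pair per clock and a single register after the multiplier. The function `macc` and its arguments are made up for illustration; it models register behaviour cycle by cycle and says nothing about timing.

```python
def macc(pairs, pipelined):
    """Cycle-by-cycle model of a multiply-accumulate datapath.

    Returns (final accumulator value, number of clock cycles used).
    """
    acc = 0             # accumulator register (always present)
    prod = 0            # optional pipeline register after the multiplier
    prod_valid = False  # marks whether prod holds a product to accumulate
    cycles = 0
    # With the pipeline register, one extra idle cycle drains the last product.
    stream = list(pairs) + ([None] if pipelined else [])
    for item in stream:
        cycles += 1
        if pipelined:
            # Both registers update on the same edge, so the adder sees the
            # product that was registered in the previous cycle.
            if prod_valid:
                acc += prod
            if item is not None:
                prod, prod_valid = item[0] * item[1], True
            else:
                prod_valid = False
        else:
            a, b = item
            acc += a * b  # multiplier and adder form one combinational path
    return acc, cycles


if __name__ == "__main__":
    data = [(1, 2), (3, 4), (5, 6)]
    print(macc(data, pipelined=False))
    print(macc(data, pipelined=True))
```

Both variants produce the same sum; the pipelined one needs one more cycle, so in this model the FSM only has to delay the accumulate enable and the "done" indication by one clock.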

\$\endgroup\$
3
  • \$\begingroup\$ This is just one of the issues I am dealing with. Another is that I believe using device-specific IP (e.g. dual-port RAM, DSP MACC, etc.) will give better results. But then, how do I specify which vendor files to use when simulating and when building (synthesizing) the design? I believe FuseSoC or Edalize has a solution, but I am not sure. Have you encountered this problem before? We could rely on the synthesis tools (Intel, Xilinx, Microsemi) to infer the correct IP, but we never know for sure. \$\endgroup\$
    – gyuunyuu
    Commented Jul 10 at 12:56
  • 1
    \$\begingroup\$ The tools in general do that kind of inference pretty well. You just need to be willing to spot-check them every now and then. \$\endgroup\$
    – Dave Tweed
    Commented Jul 10 at 17:27
  • \$\begingroup\$ I have now run a lot of experiments. I added a register between the multiplier and the adder. In Quartus, targeting a Cyclone 10 LP with just that design component as the top level, I got 120 MHz without the pipeline register and 145 MHz with it. So the change is significant. In Libero I saw that the critical path ran from the ROM output (one of the operands) to the output of the adder. I got a better result when I tied the multiplier inputs to '0' when the clock enable for the MACC is low. This means that the registers feeding the multiplier do not have any reset or clear; I guess this reduces the routing resources used. \$\endgroup\$
    – gyuunyuu
    Commented Jul 14 at 16:31
0
\$\begingroup\$

I noticed that I get the best result with the MACC when the multiplier inputs are tied to '0' while the clock enable is low, and the output of the multiplier is registered, i.e. there is a pipeline stage between the multiplier and the accumulator. The output of the adder is obviously registered, since it needs to be fed back into the adder as the second operand for the multiply-accumulate functionality.

With the MACC as the top-level entity in Quartus for a Cyclone 10 LP FPGA, I get an fmax of about 120 MHz without the pipeline register and 145 MHz with it. The difference is quite significant. I also used Libero SoC to compile the same MACC for an IGLOO2 FPGA and got similar figures for the fmax.
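
As a quick check of what those two numbers mean, the arithmetic can be written out. This only restates the Cyclone 10 LP figures above; the helper `period_ns` is a name chosen for this sketch.

```python
def period_ns(fmax_mhz):
    """Minimum clock period in nanoseconds for a given fmax in MHz."""
    return 1000.0 / fmax_mhz


f_without = 120.0  # MHz, no pipeline register (figure quoted above)
f_with = 145.0     # MHz, with pipeline register (figure quoted above)

t_without = period_ns(f_without)
t_with = period_ns(f_with)
gain_pct = (f_with / f_without - 1.0) * 100.0

print(f"period without register: {t_without:.2f} ns")
print(f"period with register:    {t_with:.2f} ns")
print(f"fmax improvement:        {gain_pct:.1f} %")
```

That is a minimum clock period of roughly 8.33 ns versus 6.90 ns, or about a 21 % higher clock rate in exchange for one extra cycle of latency.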

The bottom line is that the pipeline register is important: without it, the fmax drops significantly. The actual fmax you get in a design depends on many factors beyond the RTL alone, including the speed grade of the device. The figures I give should only be taken to show that the presence or absence of the pipeline register has a large impact on the fmax.

\$\endgroup\$
