## Hot Chips: Intel Opens Knights Mill Architecture


21.08.2017 19:31, Andreas Stiller

The new Knights Mill instructions were already known, but not the finer points of their implementation. Once again, as in the Pentium 4, there is "double-pumped execution".

Intel had already announced the extensions of the Xeon Phi processor "Knights Mill": Quad-FMA and the Vector Neural Network Instructions (VNNI) with the int16 data format. At the Hot Chips conference, currently taking place in Cupertino, California, it was now possible to see how the vector units and execution pipelines have been designed for this purpose. The rest of the chip around the improved Atom Silvermont core remains the same as in Knights Landing (KNL).

The processor architecture and the core architecture are the same as Knights Landing, but the Vector Processing Unit VPU (lower right) has been redesigned for Knights Mill.

Image: Intel

On Knights Landing, each of the two vector units has an FMA port that handles both double precision (DP) and single precision (SP). This makes a total of 32 DP flops per clock and 64 SP flops per clock. For Knights Mill, one DP port was sacrificed and replaced by SP/VNNI units. The old SP units were likewise expanded to two SP/VNNI units each. SP throughput thus doubles to 128 SP flops per clock; for int16 you get 256 ops per clock. This yields a theoretical peak performance of 13.8 TFlops (SP), 3.5 TFlops (DP) and 27.6 Tops (int16) at the assumed clock of 1.5 GHz.

Knights Landing had two symmetrical ports for DP/SP; Knights Mill has one DP unit fewer but twice as many SP units, which now also handle VNNI-16.

Image: Intel
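The per-clock figures quoted above can be reproduced with a little arithmetic. The following is a minimal sketch, assuming 512-bit vector units (the AVX-512 width, which the article does not state) and counting one FMA as two operations per lane. The function names are hypothetical, and the core count at the end is only back-calculated from the quoted peak numbers, not taken from the article.

```python
VECTOR_BITS = 512  # assumption: AVX-512 vector width, not stated in the article


def ops_per_clock(units, element_bits):
    """Per-core operations per clock for `units` FMA units on elements of the given width."""
    lanes = VECTOR_BITS // element_bits
    return units * lanes * 2  # one FMA counts as a multiply plus an add


def implied_cores(peak_ops_per_second, per_core_ops_per_clock, clock_hz):
    """Back-calculate the core count that a quoted peak figure would imply."""
    return peak_ops_per_second / (per_core_ops_per_clock * clock_hz)


# Knights Landing: two units, each handling DP and SP.
knl_dp = ops_per_clock(units=2, element_bits=64)
knl_sp = ops_per_clock(units=2, element_bits=32)

# Knights Mill: twice as many SP/VNNI units.
knm_sp = ops_per_clock(units=4, element_bits=32)
knm_int16 = ops_per_clock(units=4, element_bits=16)

CLOCK_HZ = 1.5e9  # the clock assumed in the article
cores_from_sp = implied_cores(13.8e12, knm_sp, CLOCK_HZ)
cores_from_int16 = implied_cores(27.6e12, knm_int16, CLOCK_HZ)

if __name__ == "__main__":
    print("KNL DP/SP per clock:", knl_dp, knl_sp)
    print("KNM SP/int16 per clock:", knm_sp, knm_int16)
    print("Implied cores (SP, int16):", cores_from_sp, cores_from_int16)
```

Under these assumptions, the SP and int16 peak figures both point to the same core count of roughly 72.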

Double-pumped

For both SP and VNNI-16, there are new quad-FMA instructions that work with "double-pumped execution". The four FMA operations, with 1 fetch/rename + 1 load (for all four scalars) and two rounds of two "pumped" FMAs, need only about four clocks of latency more than a single FMA instruction (1 fetch/rename, 1 load, 1 FMA).

Quad-FMA is a small step towards a tensor unit. Intel has introduced "double-pumped execution" so that execution takes only a little longer than with a single FMA instruction.

Image: Intel
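What a quad-FMA computes can be illustrated in a few lines. This is only a sketch of the semantics as described above: four chained FMAs that share one load of four scalars, each scalar multiplied with its own source vector and accumulated into the same destination. The function names are hypothetical, plain Python lists stand in for vector registers, and nothing here models the pipeline timing.

```python
def fma(acc, vec, scalar):
    """One vector FMA: acc[j] + vec[j] * scalar for every lane j."""
    return [a + v * scalar for a, v in zip(acc, vec)]


def quad_fma(acc, vecs, scalars):
    """Four chained FMAs into one accumulator, fed by a single block of four scalars."""
    if len(vecs) != 4 or len(scalars) != 4:
        raise ValueError("quad_fma expects exactly four vectors and four scalars")
    for vec, scalar in zip(vecs, scalars):
        acc = fma(acc, vec, scalar)
    return acc


if __name__ == "__main__":
    vectors = [[1, 2], [3, 4], [5, 6], [7, 8]]
    scalars = [1, 10, 100, 1000]  # loaded once for all four FMAs
    print(quad_fma([0, 0], vectors, scalars))
```

Because all four steps accumulate into the same destination, the operation resembles a small piece of a matrix-vector product, which is why it can be seen as a step towards a tensor unit.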

Contrary to original assumptions, however, VNNI does not offer a "half precision" floating-point format like Nvidia Pascal/Volta or AMD Vega, but works with variable-precision integers. The quad-FMA instruction, for example, takes int16 at the input and delivers int32 at the output. Intel emphasizes that int16 with a "mantissa" of 15 bits is more accurate than fp16 with its 10-bit mantissa, but also admits that fixed-point arithmetic requires additional software overhead to handle dynamic ranges. This software effort can significantly reduce the effective performance.
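The software overhead mentioned above can be made concrete with a small fixed-point sketch. It assumes a simple per-vector scaling scheme chosen purely for illustration (the article does not describe how software is expected to manage the dynamic range), and all function names are hypothetical. Inputs are quantized to int16, products are accumulated in an int32-sized accumulator, and the result is rescaled at the end.

```python
INT16_MAX = 32767
INT16_MIN = -32768


def choose_scale(values):
    """Pick a scale so the largest magnitude just fits into int16 (extra software step)."""
    peak = max(abs(v) for v in values)
    return INT16_MAX / peak if peak else 1.0


def quantize_int16(values, scale):
    """Convert real values to int16 fixed point, saturating at the int16 limits."""
    return [max(INT16_MIN, min(INT16_MAX, round(v * scale))) for v in values]


def dot_int16_to_int32(a_q, b_q):
    """Multiply int16 inputs and accumulate; fail if the sum would not fit into int32."""
    acc = 0
    for x, y in zip(a_q, b_q):
        acc += x * y
    if not -2**31 <= acc < 2**31:
        raise OverflowError("accumulator exceeds the int32 range")
    return acc


def fixed_point_dot(a, b):
    """Dot product via int16 inputs and int32 accumulation, rescaled to a real value."""
    scale_a, scale_b = choose_scale(a), choose_scale(b)
    acc = dot_int16_to_int32(quantize_int16(a, scale_a), quantize_int16(b, scale_b))
    return acc / (scale_a * scale_b)


if __name__ == "__main__":
    a = [0.5, -0.25, 0.125]
    b = [1.0, 2.0, -4.0]
    print("fixed point:", fixed_point_dot(a, b))
    print("reference:  ", sum(x * y for x, y in zip(a, b)))
```

Choosing the scales, saturating, and rescaling all happen outside the multiply-accumulate itself, which is the kind of extra work that floating-point formats handle implicitly through their exponent.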

(As)