Content is from the course ECE327 at the University of Waterloo. Eventually I will clean it up with proper citations.

The design process

You can look at hardware from many different perspectives.

System integration level: connect blocks or IP cores to implement a complete system.
Register-transfer level (RTL): digital logic operations performed on data signals as they flow between registers.
Digital logic gates: gates with binary inputs and outputs.
Circuit implementation, transistor level: considering voltages and currents with transistors as switches.
Physical level: the layout of N/P wells, gates, sources and drains.
Semiconductor physics level: the movement of electrons and holes in the semiconductor material.

SystemVerilog and HDLs

We will focus on the register-transfer level (RTL) and above. To that end, we will use the SystemVerilog hardware description language (HDL).

Our goals are:

To describe the hardware so that it can be automatically synthesized and implemented using physical design CAD tools.
To verify hardware functionality using testbench simulations.
- To that end, SystemVerilog also contains non-synthesizable constructs for modelling and verification.

An HDL is not a programming language. It is a description language. The biggest difference is that it has an “explicit notion of time and concurrency”: actions to be taken are mapped cycle by cycle.

To design effectively using HDLs, think in terms of hardware: registers, logic gates, memories, adders, multipliers, and muxes, etc. If you can’t imagine the resulting hardware implementation of your HDL code, it is probably not right!

Design flow

Logic synthesis:
Converts code into circuit netlist (list of gates and connections between them)
Performs optimizations
May be tweaked with several parameters dealing with area footprint, power consumption, and timing constraints.
Functional simulation:
Verifies that circuit netlist functions as intended.
Uses simulation testbenches written in SystemVerilog.
Results in timing diagrams/simulation waveforms.
Note that these simulations are done under ideal gate conditions (no gate or wire delays).
Physical design:
Maps the circuit netlist to a specific technology platform (the standard cells for ASIC Process Design Kits (PDKs), or FPGA lookup tables (LUTs) and reconfigurable blocks).
Determines placement of these components.
Routes wires between them to realize final circuit. For example, with FPGAs there are only a certain number of wire tracks and spacing considerations.
Timing simulation:
Simulates post-physical-implementation netlist with vendor-provided timing models for the components and wire delay given clock frequencies.
Component delay is the time to

SystemVerilog to HW

Conditional if/else or ternary: multiplexer with condition as select.
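Case statement: one multiplexer with the case expression as select.
always_ff: adds a register to each signal assigned in the procedural block.
Non-blocking assignment (<=) in always_ff: each left-hand side becomes a register; chained assignments form a shift-register-style pipeline.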
For loop in procedural block: describes module behaviour.
Generate for loop: instantiate modules in a loop.
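
As a rough illustration of the first few mappings above, here is a minimal sketch. The module name mapping_sketch and all of its ports are hypothetical and do not come from the course material.

```systemverilog
// Hypothetical module illustrating the construct-to-hardware mappings.
module mapping_sketch (
  input  logic       clk,
  input  logic       sel,
  input  logic [1:0] op,
  input  logic [7:0] a, b,
  output logic [7:0] mux_out,   // ternary  -> 2:1 mux
  output logic [7:0] case_out,  // case     -> 4:1 mux
  output logic [7:0] reg_out    // always_ff -> register
);

  // Conditional: a 2:1 multiplexer with sel as the select line.
  assign mux_out = sel ? a : b;

  // Case statement: one 4:1 multiplexer with op as the select.
  always_comb begin
    case (op)
      2'b00:   case_out = a;
      2'b01:   case_out = b;
      2'b10:   case_out = a & b;
      default: case_out = a | b;
    endcase
  end

  // always_ff: the assigned signal becomes a register clocked by clk.
  always_ff @(posedge clk) begin
    reg_out <= mux_out;
  end

endmodule
```

Viewing the synthesized netlist for a module like this is a quick way to check that each construct produced the hardware you expected.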

Simulation testbenches

The testbench is not synthesized to hardware. It is used only to simulate and verify a synthesizable design under test (DUT): it instantiates the DUT, provides inputs, and compares outputs to “golden” correct outputs.

Testbenches are functional tests that run in simulation. They do not model timing or physical design.

`timescale 1ns/1ps  // time unit / time precision

module adder8b_tb ();
logic [7:0] in1, in2;
logic [8:0] out;

adder8b dut (
  .in1(in1),
  .in2(in2),
  .out(out)
);

initial begin
  $monitor($time, "ns: in1=%d, in2=%d, out=%d", in1, in2, out);
  in1 = 8'd3; in2 = 8'd2;
  #10 in1 = 8'd10; in2 = 8'd34; // wait 10 time units, then apply the next inputs
  #10 in1 = 8'd22; in2 = 8'd17;
  #10 in1 = 8'd13; in2 = 8'd85;
  #10 in1 = 8'd74; in2 = 8'd44;
  #10 $stop;
end

endmodule

Useful non-synthesizable system tasks and functions:

Task	Function
`$monitor`	Prints the listed values whenever any of them changes
`$stop`	Pauses the simulation, like a breakpoint
`$finish`	Ends the simulation
`$time`	Returns the current simulation time
`$display`	Prints a message once, when the statement executes
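
The adder testbench above only prints values. A self-checking variant that compares against “golden” outputs might look like the sketch below. The body of adder8b is an assumption (a purely combinational adder), and the check task and errors counter are hypothetical names introduced for illustration.

```systemverilog
`timescale 1ns/1ps

// Assumed DUT: a purely combinational 8-bit adder with a 9-bit result.
module adder8b (
  input  logic [7:0] in1, in2,
  output logic [8:0] out
);
  assign out = in1 + in2;
endmodule

module adder8b_check_tb ();
  logic [7:0] in1, in2;
  logic [8:0] out;
  int errors = 0;

  adder8b dut (.in1(in1), .in2(in2), .out(out));

  // Apply one input pair, wait, then compare against the golden sum.
  task automatic check(input logic [7:0] x, input logic [7:0] y);
    logic [8:0] golden;
    in1 = x; in2 = y;
    golden = x + y;  // 9-bit reference result
    #10;
    if (out !== golden) begin
      errors++;
      $display("%0t: FAIL in1=%0d in2=%0d out=%0d expected=%0d",
               $time, x, y, out, golden);
    end
  endtask

  initial begin
    check(8'd3,   8'd2);
    check(8'd200, 8'd100);  // exercises the carry-out bit
    check(8'd255, 8'd255);
    if (errors == 0) $display("PASS");
    else             $display("FAIL: %0d mismatches", errors);
    $finish;
  end
endmodule
```

Using !== rather than != means an unknown (X) output is also reported as a mismatch.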

Generating a clock

Use a separate initial block to generate a clock signal.

`timescale 1ns/1ps

module sequential_module_tb ();
parameter CLK_PERIOD = 2; //2ns clock

logic clk;

sequential_module dut (
  .clk(clk)
);

initial begin
  clk = 1'b0;
  forever #(CLK_PERIOD/2) clk = ~clk; // toggle clock every half period
                                      // Duty cycle may be adjusted
end

initial begin
// Your code here
end

endmodule
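
To show how the second initial block might be filled in, here is a minimal sketch that drives a reset and waits on clock edges. The counter4 DUT is hypothetical and exists only to give the testbench something to exercise.

```systemverilog
`timescale 1ns/1ps

// Hypothetical DUT: a 4-bit counter with synchronous reset.
module counter4 (
  input  logic       clk, rst,
  output logic [3:0] count
);
  always_ff @(posedge clk) begin
    if (rst) count <= 4'd0;
    else     count <= count + 4'd1;
  end
endmodule

module counter4_tb ();
  parameter CLK_PERIOD = 2; // 2 ns clock

  logic clk, rst;
  logic [3:0] count;

  counter4 dut (.clk(clk), .rst(rst), .count(count));

  // Clock generator, as in the template above.
  initial begin
    clk = 1'b0;
    forever #(CLK_PERIOD/2) clk = ~clk;
  end

  // Stimulus: inputs change on the falling edge, away from the
  // rising edge on which the DUT samples them.
  initial begin
    rst = 1'b1;
    repeat (2) @(posedge clk);  // hold reset for two rising edges
    @(negedge clk) rst = 1'b0;
    repeat (5) @(posedge clk);  // let the counter run
    @(negedge clk);
    $display("%0t: count = %0d", $time, count);
    $finish;
  end
endmodule
```

Changing inputs on the opposite clock edge is one simple way to avoid races between the testbench and the DUT; other conventions exist.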

Simulation

In SystemVerilog, simulations are event-driven with a priority queue (also known as a “stratified event queue”). There are two types of events:

Evaluation: produce result of computation.
Update: update signals.

Update events may happen right after evaluation events (blocking, continuous assignment). Updates can also happen only at the end of the current time step (non-blocking assignment). An update event may trigger another evaluation event.

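The queue is ordered earliest event first; events scheduled for the same time are served first come, first served. Because concurrent blocks may be evaluated in any order, their events within the same time step may be queued in any order.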

For example, the following testbench will generate a queue:

module swap_tb();
logic clk, rst, out1, out2;
swap dut(clk, rst, out1, out2);

initial begin
  clk = 1'b0;
  forever #1 clk = ~clk;
end

initial begin
  rst = 1;
  #2 rst = 0;
end
endmodule

At time 0, the priority queue is:

|clk_new = 0|    eval event
|clk = clk_new|  update event
|rst_new = 1|    eval event
|rst = rst_new|  update event

A clock posedge will queue more events depending on the sequential DUT.

The updates from non-blocking assignments are saved in a separate queue and applied in order of receipt at the end of the time step.
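
The swap DUT instantiated by the testbench above is not shown in these notes. A minimal sketch of what such a module might look like, assuming a synchronous reset and arbitrary reset values, shows why the deferred update matters:

```systemverilog
// Hypothetical implementation of the swap DUT; the reset values
// are assumptions chosen so that the swap is visible.
module swap (
  input  logic clk, rst,
  output logic out1, out2
);
  always_ff @(posedge clk) begin
    if (rst) begin
      out1 <= 1'b0;
      out2 <= 1'b1;
    end else begin
      // Both right-hand sides are evaluated using the old values.
      // The two updates are applied at the end of the time step,
      // so the values are exchanged.
      out1 <= out2;
      out2 <= out1;
    end
  end
endmodule
```

With blocking assignments (out1 = out2; out2 = out1;) the first update would take effect immediately, and both signals would end up holding the old value of out2.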

Log

Day 3

always_ff indicates that the final results of the procedural block are “registered”. Module documentation should state that outputs update once per clock cycle (@posedge).

In a clocked block that uses non-blocking assignments, the left-hand side of every non-blocking assignment becomes a register.

// 8 bit inputs

logic [15:0] y0;
logic [16:0] y1;
logic [24:0] y2;
logic [25:0] y3;

always_ff @(posedge clk) begin
  y0 <= a * x;
  y1 <= y0 + b;
  y2 <= y1 * x;
  y3 <= y2 + c;
end

// All of y0, y1, y2, and y3 will become registers
// Think "shift register"

Conditions

In always blocks, conditional branches instantiate muxes.
The select line is driven by the condition logic.
The circuits in all branches run concurrently, but only one output is selected.

logic signed [15:0] res;
always_comb begin
  if (!op) begin
    res = a + b;
  end else begin
    res = a * b;
  end
end
assign y = res;

// equivalent to
assign y = (!op)? a+b : a*b;

Incomplete condition: when some branch does not assign the output, synthesis feeds the mux output back to its input so that the output holds its value when the condition is not met.
In an always_comb block: feeding the output res back into the mux input infers a latch. Design carefully to avoid metastability (the value must stabilize before the select line changes), or avoid incomplete conditions altogether.

In an always_ff block: the output is registered, so this is much safer.
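
The difference between an incomplete and a complete condition can be sketched as follows. The module and signal names are hypothetical. The sketch uses always_latch for the incomplete case because tools typically warn when an always_comb block infers a latch.

```systemverilog
// Hypothetical module contrasting an incomplete condition
// with a complete one.
module latch_sketch (
  input  logic       en,
  input  logic [7:0] d,
  output logic [7:0] q_latch,
  output logic [7:0] q_comb
);
  // Incomplete: there is no else branch, so q_latch must hold its
  // value while en is 0. That storage is a latch.
  always_latch begin
    if (en) q_latch = d;
  end

  // Complete: the default assignment covers every path, so this
  // synthesizes to a plain mux with no feedback.
  always_comb begin
    q_comb = '0;
    if (en) q_comb = d;
  end
endmodule
```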

Conditions need not be mutually exclusive. The first condition maps to the mux closest to the output, so the order of the branches defines their priority.

always_comb begin
  if (x > 1000) begin // highest priority, mux closest to output
    y = c;
  end else if (x > 100) begin
    y = b;
  end else begin
    y = 0; // lowest priority, mux furthest from output
  end
end

Once again: design in HW mindset first, then go to verilog!!! Then view the synthesized netlist and check your expectations.

Case statements: a single mux with the case expression as the select line. Since it is one mux, there is no notion of condition priority.
Removing the default case from a combinational block can infer a latch:
- If no case is matched, the output will not change.
- Holding the value this way requires a latch, which is bad.
- Don’t do this (except perhaps in an always_ff, where the output is registered anyway).
Procedural blocks: the last non-blocking assignment of a signal wins. Previous assignments are ignored. This can be used to define a default value in a procedural block.

always_ff @(posedge clk) begin
  y <= 0; // default value
  if (x > 1000) begin
    y <= c;
  end else if (x > 100) begin
    y <= b;
  end
end
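
The same default-value idea also keeps a case statement latch-free in a combinational block, where the last blocking assignment to a signal wins. A minimal sketch, with a hypothetical module name and ports:

```systemverilog
// Hypothetical selector with an intentionally incomplete case.
module case_default_sketch (
  input  logic [1:0] sel,
  input  logic [7:0] a, b,
  output logic [7:0] y
);
  always_comb begin
    y = '0;  // default value, overridden by a matching case item
    case (sel)
      2'b00: y = a + b;
      2'b01: y = a - b;
      2'b10: y = a & b;
      // 2'b11 is not listed, so y keeps the default above
    endcase
  end
endmodule
```

Because y is assigned on every path through the block, no storage is needed. Some tools may still warn about the missing default item, so adding an explicit default is an equally valid choice.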

Example: summing a vector (use a for loop in a procedural block).
Key takeaway: hardware has no “iteration” looping construct (apart from the clocked feedback loop through flip-flops).

// Generates an adder tree.
// Iteration is done in the HDL, not in the HW, and i is not in HW.

module vector_sum #(
  parameter N = 128
)(
  input clk,
  input rst,
  input signed [7:0] a [0:N-1],
  output signed [31:0] sum
);

logic signed [31:0] res;
integer i;

always_ff @(posedge clk) begin
  if (rst) begin
    res = 0;
  end else begin
    // Blocking assignments so that each iteration sees the running total.
    res = 0;
    for (i = 0; i < N; i = i + 1) begin
      res = res + a[i];
    end
  end
end

assign sum = res;

endmodule

A generate block lets you instantiate modules in a loop instead of writing each instance manually. For loops in procedural blocks describe the behaviour of a module; generate for loops instantiate sub-modules.

genvar i;
generate
for (i = 0; i < N; i = i + 1)
begin: gen_adders // lets you refer to a specific adder as "gen_adders[i].add_inst"
  adder add_inst(
    .a(a[i]),
    .b(b[i]),
    .out(res[i])
  );
end
endgenerate
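
For context, the generate loop above could sit inside a module such as the following sketch. The adder body, the vector_add wrapper, and all port widths are assumptions made so that the example is self-contained.

```systemverilog
// Hypothetical 8-bit adder used as the repeated sub-module.
module adder (
  input  logic [7:0] a, b,
  output logic [8:0] out
);
  assign out = a + b;
endmodule

// Hypothetical wrapper: adds two vectors element by element
// using N adder instances.
module vector_add #(
  parameter N = 4
)(
  input  logic [7:0] a   [0:N-1],
  input  logic [7:0] b   [0:N-1],
  output logic [8:0] res [0:N-1]
);
  genvar i;
  generate
    for (i = 0; i < N; i = i + 1) begin : gen_adders
      adder add_inst (
        .a(a[i]),
        .b(b[i]),
        .out(res[i])
      );
    end
  endgenerate
endmodule
```

The loop is unrolled at elaboration time, so changing N changes the number of adder instances in the hardware rather than the number of iterations of anything at run time.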

Implementing simulation testbenches: a testbench is a wrapper around the module to be tested, which is called the “design under test” (DUT).

The testbench is a module that instantiates the DUT as a subcomponent. It provides test inputs, monitors the DUT outputs, and compares them to reference “golden” outputs to verify correctness.

Day 2

Day 1

Process technology improvements have gradually levelled off.
Moore’s Law (the number of transistors doubles every generation) and Dennard scaling (power density remains constant as transistors shrink) [1] have not held since the 2000s.
- Electron leakage (power consumption goes up).
- Transistors cannot shrink below a minimum size (a few atoms, a physical limit).
- Heat dissipation in small areas.
Hardware design requires that we think outside the box and not depend on Moore’s Law.
Techniques:
- Architectural: heterogeneous computing (big cores and little cores for different workloads), SIMD and AVX parallelism (amortizing i-cache, reg-file lookup and ctrl logic across several operations)
- Multi-die system-in-package: integrate many chiplets into a single package (e.g. AMD EPYC) and 3D stacking
- Data-transport: use near memory and in-memory compute
- Different paradigms: not traditional silicon, like event-based processing and spiking, or optical computing
- Domain-specific computing: this course’s focus, creation of chips for certain classes of workloads like Google’s Tensor Processing Unit (TPU) for machine learning.
See section (The Design process)

References

[1] R. H. Dennard, F. H. Gaensslen, H.-N. Yu, V. L. Rideout, E. Bassous, and A. R. LeBlanc, “Design of ion-implanted MOSFET’s with very small physical dimensions,” IEEE J. Solid-State Circuits, vol. 9, no. 5, pp. 256–268, Oct. 1974, doi: 10.1109/JSSC.1974.1050511.