Hardware Deployment Guide

This guide explains how to deploy TorchLogix models to FPGAs and other hardware platforms using direct Verilog/RTL generation.

Overview

TorchLogix can generate hardware descriptions (Verilog RTL) directly from trained models, enabling deployment to FPGAs and ASIC implementations. This provides an alternative to the traditional C→HLS→RTL pipeline and offers several advantages for logic gate networks.

Why Hardware Deployment?

Benefits:

  • Ultra-low latency: FPGA implementations can achieve sub-microsecond inference

  • High throughput: Massive parallelism enables processing thousands of inputs per second

  • Energy efficiency: Specialized hardware is more power-efficient than general-purpose CPUs/GPUs

  • Deterministic timing: Predictable performance for real-time applications

Use Cases:

  • Low-latency inference at network edge

  • Real-time signal processing and control systems

  • High-throughput batch processing

  • Embedded systems with strict power budgets

Design Approach: Direct Gate-Level Synthesis

TorchLogix uses a direct gate-level synthesis approach rather than LUT-based truth tables:

TorchLogix Model (PyTorch)
         ↓
   Gate Operations (AND, OR, XOR, etc.)
         ↓
   Verilog Expressions
         ↓
   FPGA Synthesis

Why direct gates instead of LUTs?

  1. Natural mapping: TorchLogix already defines 16 gate operations

  2. Better optimization: Modern synthesis tools optimize gate-level HDL effectively

  3. More readable: assign out = a & b; is clearer than case statements

  4. Binary structure: Each neuron has exactly 2 inputs → perfect for direct gates

Comparison: Direct Verilog vs C→HLS

| Aspect       | Direct Verilog           | C→HLS→RTL                  |
|--------------|--------------------------|----------------------------|
| Generation   | Direct from model        | Compile C, then HLS        |
| Intermediate | None                     | C code + HLS directives    |
| Control      | Full RTL control         | HLS tool dependent         |
| Readability  | Gate-level, explicit     | High-level C abstractions  |
| Use Case     | FPGA-specific deployment | Cross-platform (CPU + FPGA) |

Both approaches are supported by TorchLogix. Use direct Verilog for FPGA-specific optimization and C→HLS for flexibility.

Supported Layer Types

| Layer Type  | Verilog Support    | Notes                            |
|-------------|--------------------|----------------------------------|
| LogicDense  | ✅ Fully Supported | Direct gate synthesis            |
| LogicConv2d | ✅ Fully Supported | Binary tree structure            |
| LogicConv3d | ✅ Fully Supported | Binary tree with 3D indexing     |
| Flatten     | ✅ Supported       | Wire passthrough                 |
| OrPooling   | ⚠️ TODO            | Recognized but not yet generated |
| GroupSum    | ⚠️ TODO            | Recognized but not yet generated |

Models using unsupported layers can still generate Verilog for supported portions, or use C code generation as an alternative.


Basic Verilog Export

Quick Start

Generate Verilog from any trained TorchLogix model:

import torch
import torch.nn as nn
from torchlogix.layers import LogicDense, GroupSum
from torchlogix import CompiledLogicNet

# Create or load your model
model = nn.Sequential(
    LogicDense(8, 32, connections="fixed", device="cpu"),
    LogicDense(32, 32, connections="fixed", device="cpu"),
    GroupSum(1, tau=1.0)
)

# Compile the model
compiled = CompiledLogicNet(
    model,
    input_shape=(8,),
    use_bitpacking=False,
    num_bits=1
)

# Generate Verilog code
verilog_code = compiled.get_verilog_code(module_name="my_logic_net")

# Export to file
compiled.export_hdl(
    output_dir="./verilog_output",
    module_name="my_logic_net",
    format="verilog"
)

API Reference

get_verilog_code(module_name, pipeline_stages)

Generates complete Verilog module as a string.

Parameters:

  • module_name (str): Name of the top-level Verilog module (default: "torchlogix_net")

  • pipeline_stages (int): Number of pipeline stages (default: 0)

    • 0: Fully combinational (no registers, 1 cycle latency)

    • 1: Single output register (helps synthesis)

    • N: Divide layers into N pipeline stages (N cycle latency)

    • len(layers): Full layer-level pipelining (highest fmax)

Returns: Complete Verilog code as string
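One way to picture the pipeline_stages parameter is as a partition of the layer list into contiguous groups, with a register after each group. The sketch below illustrates this with an even split; `stage_groups` is a hypothetical helper, not part of the TorchLogix API, and the library's actual partitioning may differ:

```python
import math

def stage_groups(num_layers, pipeline_stages):
    """Illustrative only: split num_layers into contiguous groups,
    with a pipeline register after each group."""
    if pipeline_stages <= 0:
        # Fully combinational: everything in one group, no registers
        return [list(range(num_layers))]
    stages = min(pipeline_stages, num_layers)  # cap at one layer per stage
    size = math.ceil(num_layers / stages)
    return [list(range(i, min(i + size, num_layers)))
            for i in range(0, num_layers, size)]

print(stage_groups(6, 0))    # one combinational group of all 6 layers
print(stage_groups(6, 2))    # two groups of 3 layers
print(stage_groups(6, 999))  # capped: one layer per stage
```

This also shows why passing a very large value (like 999) is equivalent to full layer-level pipelining: the stage count is capped at the number of layers.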

export_hdl(output_dir, module_name, format, pipeline_stages)

Exports Verilog to a file.

Parameters:

  • output_dir (str): Directory to write Verilog file

  • module_name (str): Module name (default: "torchlogix_net")

  • format (str): HDL format, currently only "verilog" supported (default: "verilog")

  • pipeline_stages (int): Pipeline configuration (default: 0)

Creates: {output_dir}/{module_name}.v

Understanding Generated Verilog

Combinational Design (pipeline_stages=0)

For a simple 2-layer network:

module logic_net (
    input wire [7:0] inp,
    output wire [3:0] out
);
    // No clock or reset needed

    // Layer 0: LogicDense (4 neurons)
    wire [3:0] layer_0_out;
    assign layer_0_out[0] = (inp[0] & inp[2]);  // AND gate
    assign layer_0_out[1] = (inp[1] | inp[3]);  // OR gate
    assign layer_0_out[2] = (inp[4] ^ inp[5]);  // XOR gate
    assign layer_0_out[3] = ~(inp[6] & inp[7]); // NAND gate

    // Layer 1: LogicDense (2 neurons)
    assign out[0] = (layer_0_out[0] | layer_0_out[1]);
    assign out[1] = (layer_0_out[2] ^ layer_0_out[3]);

endmodule

Characteristics:

  • Pure combinational logic (no state)

  • No clock or reset signals

  • 1 cycle latency (output available same cycle as input)

  • Critical path spans entire network

Pipelined Design (pipeline_stages=2)

With pipeline registers:

module logic_net (
    input wire clk,
    input wire rst,
    input wire [7:0] inp,
    output reg [3:0] out
);
    // Combinational wires
    wire [3:0] layer_0_comb;
    wire [3:0] out_comb;

    // Pipeline register
    reg [3:0] layer_0_out;

    // Layer 0: Combinational logic
    assign layer_0_comb[0] = (inp[0] & inp[2]);
    assign layer_0_comb[1] = (inp[1] | inp[3]);
    assign layer_0_comb[2] = (inp[4] ^ inp[5]);
    assign layer_0_comb[3] = ~(inp[6] & inp[7]);

    // Pipeline register after Layer 0
    always @(posedge clk) begin
        if (rst)
            layer_0_out <= 4'd0;
        else
            layer_0_out <= layer_0_comb;
    end

    // Layer 1: Combinational logic
    assign out_comb[0] = (layer_0_out[0] | layer_0_out[1]);
    assign out_comb[1] = (layer_0_out[2] ^ layer_0_out[3]);

    // Output register
    always @(posedge clk) begin
        if (rst)
            out <= 4'd0;
        else
            out <= out_comb;
    end

endmodule

Characteristics:

  • Synchronous design with clock and reset

  • Registers break up long combinational paths

  • N cycle latency (where N = pipeline_stages)

  • Higher maximum frequency (fmax)

Gate Operations Supported

All 16 two-input Boolean operations are supported:

| Gate ID | Operation        | Verilog Expression |
|---------|------------------|--------------------|
| 0       | Zero (constant)  | 1'b0               |
| 1       | AND              | (a & b)            |
| 2       | A AND NOT B      | (a & ~b)           |
| 3       | A (passthrough)  | a                  |
| 4       | NOT A AND B      | (~a & b)           |
| 5       | B (passthrough)  | b                  |
| 6       | XOR              | (a ^ b)            |
| 7       | OR               | (a \| b)           |
| 8       | NOR              | ~(a \| b)          |
| 9       | XNOR             | ~(a ^ b)           |
| 10      | NOT B            | ~b                 |
| 11      | B IMPLIES A      | (~b \| a)          |
| 12      | NOT A            | ~a                 |
| 13      | A IMPLIES B      | (~a \| b)          |
| 14      | NAND             | ~(a & b)           |
| 15      | One (constant)   | 1'b1               |
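The table can be cross-checked in plain Python by mirroring each Verilog expression. This is an illustrative sketch (`GATES` and `truth_table` are not part of the TorchLogix API); note that gate ID 15−k is always the complement of gate ID k:

```python
# Python mirror of the 16 two-input gate expressions, indexed by gate ID.
# `& 1` masks Python's signed ~ back down to a single bit.
GATES = {
    0:  lambda a, b: 0,             # constant 0
    1:  lambda a, b: a & b,         # AND
    2:  lambda a, b: a & ~b & 1,    # A AND NOT B
    3:  lambda a, b: a,             # A passthrough
    4:  lambda a, b: ~a & b & 1,    # NOT A AND B
    5:  lambda a, b: b,             # B passthrough
    6:  lambda a, b: a ^ b,         # XOR
    7:  lambda a, b: a | b,         # OR
    8:  lambda a, b: ~(a | b) & 1,  # NOR
    9:  lambda a, b: ~(a ^ b) & 1,  # XNOR
    10: lambda a, b: ~b & 1,        # NOT B
    11: lambda a, b: (~b | a) & 1,  # B IMPLIES A
    12: lambda a, b: ~a & 1,        # NOT A
    13: lambda a, b: (~a | b) & 1,  # A IMPLIES B
    14: lambda a, b: ~(a & b) & 1,  # NAND
    15: lambda a, b: 1,             # constant 1
}

def truth_table(gate_id):
    """Output column for inputs (a, b) in order 00, 01, 10, 11."""
    return [GATES[gate_id](a, b) for a in (0, 1) for b in (0, 1)]

print(truth_table(1))   # AND  -> [0, 0, 0, 1]
print(truth_table(14))  # NAND -> [1, 1, 1, 0]
```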


Pipelining for Large Models

The Problem: Large Combinational Designs

By default, TorchLogix generates fully combinational Verilog where all logic executes in a single clock cycle. This works well for small models but causes serious problems for larger ones:

Symptoms of Combinational Overload

  • Synthesis fails or runs for hours without completing

  • Verilog files >1M lines take forever to process

  • Very low maximum frequency (fmax < 50 MHz)

  • Timing closure failures (negative WNS)

  • “Design too large” errors from Vivado

Why This Happens

  • Deep combinational paths through many layers

  • Synthesis tools struggle to optimize very large logic cones

  • Critical path delay grows with model depth

  • No natural break points for timing optimization

The Solution: Pipeline Stages

Pipelining inserts registers between layers to break up long combinational paths:

Combinational (pipeline_stages=0):
  Input → [Layer0 → Layer1 → Layer2 → Layer3] → Output
  All in 1 cycle, huge critical path

Pipelined (pipeline_stages=4):
  Input → [Layer0] → REG → [Layer1] → REG → [Layer2] → REG → [Layer3] → REG → Output
  4 cycles latency, short critical paths

Benefits

  • Synthesis succeeds even for very large models

  • Much faster synthesis time (minutes vs hours)

  • Higher maximum frequency (200+ MHz vs <50 MHz)

  • Predictable timing closure

  • Better resource utilization

Trade-offs

  • Increased latency (N cycles instead of 1)

  • More flip-flops (registers consume area)

  • Need to handle clock and reset signals
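The trade-off can be quantified with a simple model. The sketch below assumes the critical path splits evenly across stages, which is idealized (real designs add register setup and clock-to-q overhead, so achieved fmax will be somewhat lower); `pipeline_tradeoff` is an illustrative helper, not a TorchLogix function:

```python
def pipeline_tradeoff(critical_path_ns, stages):
    """Idealized model: pipelining splits the critical path evenly.
    Returns (fmax in MHz, end-to-end latency in ns, results/us throughput)."""
    if stages <= 0:
        stage_delay, cycles = critical_path_ns, 1
    else:
        stage_delay, cycles = critical_path_ns / stages, stages
    fmax_mhz = 1000.0 / stage_delay   # period in ns -> frequency in MHz
    latency_ns = cycles * stage_delay
    throughput = fmax_mhz             # one result per cycle once the pipe fills
    return fmax_mhz, latency_ns, throughput

# The 25 ns combinational path from the "large model" example:
for stages in (0, 4):
    fmax, lat, thr = pipeline_tradeoff(25.0, stages)
    print(f"stages={stages}: fmax={fmax:.0f} MHz, latency={lat:.0f} ns")
```

Under this idealized split, end-to-end latency in nanoseconds stays roughly constant while throughput scales with the stage count; the register overhead omitted here is what makes real pipelined latency slightly higher.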

Pipeline Stage Options

pipeline_stages=0 - Fully Combinational (Default)

verilog = compiled.get_verilog_code(pipeline_stages=0)

  • No registers, no clock required

  • 1 cycle latency

  • Use for: Small models (<10 layers), initial prototyping

  • Avoid for: Large models (synthesis will fail)

pipeline_stages=1 - Output Register Only

verilog = compiled.get_verilog_code(pipeline_stages=1)

  • Single register at output

  • 1 cycle latency

  • Use for: Medium models where synthesis struggles but you need low latency

  • Best for: 10-30 layer models

pipeline_stages=N - N Pipeline Stages

# 4 pipeline stages
verilog = compiled.get_verilog_code(pipeline_stages=4)

  • Layers divided into N groups, register after each group

  • N cycle latency

  • Use for: Large models (50-200 layers)

  • Best for: Balancing latency vs synthesis speed

Full Layer-Level Pipelining

# Register between every layer
num_layers = len([m for m in model.modules() if isinstance(m, (LogicDense, LogicConv2d))])
verilog = compiled.get_verilog_code(pipeline_stages=num_layers)

# Or just use a large number
verilog = compiled.get_verilog_code(pipeline_stages=999)

  • Register after every single layer

  • Maximum possible fmax

  • Highest latency (= number of layers)

  • Use for: Very large models (>200 layers) or maximum throughput applications

Choosing the Right Pipeline Configuration

Decision Tree

Is synthesis failing or very slow?
│
├─ NO → Use pipeline_stages=0 (fully combinational)
│        Lowest latency, simplest design
│
└─ YES → How many layers in your model?
         │
         ├─ <20 layers → pipeline_stages=1
         │               (Output register only)
         │
         ├─ 20-100 layers → pipeline_stages=4 to 8
         │                  (Balanced approach)
         │
         └─ >100 layers → pipeline_stages=N/4 to N
                          (N = number of layers)
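The decision tree can be encoded as a small heuristic for scripting. This is a sketch of the guidance above, not a TorchLogix API; tune the thresholds for your model and toolchain:

```python
def recommend_pipeline_stages(num_layers, synthesis_struggles):
    """Heuristic encoding of the decision tree above (illustrative only)."""
    if not synthesis_struggles:
        return 0                                  # fully combinational
    if num_layers < 20:
        return 1                                  # output register only
    if num_layers <= 100:
        return max(4, min(8, num_layers // 12))   # balanced: 4-8 stages
    return max(num_layers // 4, 16)               # heavy pipelining: N/4+

print(recommend_pipeline_stages(10, synthesis_struggles=False))  # 0
print(recommend_pipeline_stages(60, synthesis_struggles=True))   # in 4-8
print(recommend_pipeline_stages(200, synthesis_struggles=True))  # 50 (= N/4)
```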

Size Guidelines

| Model Characteristics            | Recommended Config    | Latency     | Benefits            |
|----------------------------------|-----------------------|-------------|---------------------|
| <10 layers, <100K Verilog lines  | pipeline_stages=0     | 1 cycle     | Simple, low latency |
| 10-30 layers, synthesis slow     | pipeline_stages=1     | 1 cycle     | Helps synthesis     |
| 30-100 layers                    | pipeline_stages=4     | 4 cycles    | Good balance        |
| 100-200 layers                   | pipeline_stages=8-16  | 8-16 cycles | Reliable synthesis  |
| >200 layers                      | pipeline_stages=N/4   | N/4 cycles  | Fast synthesis      |
| Maximum throughput needed        | pipeline_stages=999   | N cycles    | Highest fmax        |

Empirical Testing

Start conservative and increase pipelining if needed:

# Step 1: Try combinational
verilog = compiled.get_verilog_code(pipeline_stages=0)
# Try to synthesize... if it fails or is very slow:

# Step 2: Add output register
verilog = compiled.get_verilog_code(pipeline_stages=1)
# Try to synthesize... if still slow:

# Step 3: Increase stages
for stages in [2, 4, 8, 16]:
    verilog = compiled.get_verilog_code(pipeline_stages=stages)
    # Synthesize and check timing/area trade-off

Performance Optimization

Finding Optimal Pipeline Depth

Run synthesis with different configurations and compare:

import subprocess

results = []
for stages in [0, 1, 2, 4, 8, 16]:
    verilog = compiled.get_verilog_code(
        module_name=f'design_p{stages}',
        pipeline_stages=stages
    )

    # Save Verilog
    with open(f'design_p{stages}.v', 'w') as f:
        f.write(verilog)

    # Synthesize (see Synthesis section for details)
    subprocess.run([
        'vivado', '-mode', 'batch',
        '-source', 'synthesize.tcl',
        '-tclargs', f'design_p{stages}.v', 'xc7z020clg400-1'
    ])

    # Parse and compare results
    # results.append((stages, luts, ffs, fmax, synthesis_time))

# Find optimal trade-off based on your requirements

Common Issues

Issue: Pipelined design has lower fmax than expected

  • Cause: Not enough pipeline stages, or uneven distribution

  • Solution: Increase pipeline_stages or try full layer-level pipelining

Issue: Too much area consumed by registers

  • Cause: Too many pipeline stages for the model size

  • Solution: Reduce pipeline_stages to find balance

Issue: Synthesis still slow with pipelining

  • Cause: Individual layers may still be very large

  • Solution:

    • Check if conv layers with large receptive fields need breaking up

    • Use more pipeline stages

    • Consider model architecture changes


Testing Generated Verilog

Functional testing verifies that your generated Verilog matches the expected behavior of the trained model.

Prerequisites

You’ll need one of the following simulators:

  • Vivado Simulator (xsim) - Included with Vivado

  • ModelSim/QuestaSim - Commercial simulator from Mentor/Siemens

  • Icarus Verilog - Open-source, free (apt install iverilog or brew install icarus-verilog)

  • Verilator - Fast open-source simulator (apt install verilator or brew install verilator)

Step 1: Generate Test Vectors

Export test vectors from your trained model using Python:

import torch
import numpy as np
from torchlogix import CompiledLogicNet

# Load your trained model
model = ...  # Your trained TorchLogix model

# Generate test vectors
compiled = CompiledLogicNet(model, input_shape=(8,), use_bitpacking=False, num_bits=1)
compiled.compile()

# Generate random binary test cases
num_tests = 100
input_size = 8  # Match your model's input size
test_inputs = np.random.randint(0, 2, (num_tests, input_size), dtype=np.int8)

# Get expected outputs
test_outputs = []
for inp in test_inputs:
    out = compiled.forward(inp.reshape(1, -1))
    test_outputs.append(out[0])

# Save to files for the testbench. $readmemb treats each whitespace-separated
# token as one memory word, so write each vector's bits with no separator.
np.savetxt('test_vectors_input.txt', test_inputs, fmt='%d', delimiter='')
np.savetxt('test_vectors_output.txt', np.array(test_outputs), fmt='%d', delimiter='')

print(f"Generated {num_tests} test vectors")

Step 2: Create a Verilog Testbench

Create a testbench file tb_logic_net.v for combinational designs:

`timescale 1ns/1ps

module tb_logic_net;
    // Parameters
    parameter INPUT_WIDTH = 8;
    parameter OUTPUT_WIDTH = 2;
    parameter NUM_TESTS = 100;

    // Signals
    reg [INPUT_WIDTH-1:0] inp;
    wire [OUTPUT_WIDTH-1:0] out;

    // Expected output
    reg [OUTPUT_WIDTH-1:0] expected_out;

    // Test vectors
    reg [INPUT_WIDTH-1:0] test_inputs [0:NUM_TESTS-1];
    reg [OUTPUT_WIDTH-1:0] test_outputs [0:NUM_TESTS-1];

    integer i;
    integer errors;

    // Instantiate the DUT (Device Under Test)
    logic_net dut (
        .inp(inp),
        .out(out)
    );

    // Load test vectors
    initial begin
        $readmemb("test_vectors_input.txt", test_inputs);
        $readmemb("test_vectors_output.txt", test_outputs);
        errors = 0;
    end

    // Test stimulus
    initial begin
        $display("Starting testbench...");
        $display("Time\t\tInput\t\tOutput\t\tExpected\tStatus");
        $display("----\t\t-----\t\t------\t\t--------\t------");

        // Run through all test vectors
        for (i = 0; i < NUM_TESTS; i = i + 1) begin
            inp = test_inputs[i];
            expected_out = test_outputs[i];
            #10;  // Wait 10ns for combinational logic to settle

            // Check output
            if (out !== expected_out) begin
                $display("%0t\t%b\t%b\t%b\t\tFAIL", $time, inp, out, expected_out);
                errors = errors + 1;
            end else begin
                $display("%0t\t%b\t%b\t%b\t\tPASS", $time, inp, out, expected_out);
            end
        end

        // Summary
        $display("\n========================================");
        $display("Test Summary");
        $display("========================================");
        $display("Total tests: %0d", NUM_TESTS);
        $display("Passed:      %0d", NUM_TESTS - errors);
        $display("Failed:      %0d", errors);

        if (errors == 0) begin
            $display("\nALL TESTS PASSED!");
        end else begin
            $display("\nSOME TESTS FAILED!");
        end

        $finish;
    end

    // Optional: Generate VCD waveform dump
    initial begin
        $dumpfile("tb_logic_net.vcd");
        $dumpvars(0, tb_logic_net);
    end

endmodule

Testbench for Pipelined Designs

For pipelined designs, you need to account for latency:

module tb_pipelined_logic_net;
    parameter PIPELINE_LATENCY = 4;  // Match your pipeline_stages
    parameter NUM_TESTS = 100;

    reg clk = 0;
    reg rst = 1;
    reg [7:0] inp;
    wire [1:0] out;

    // Test vectors and bookkeeping
    reg [7:0] test_inputs [0:NUM_TESTS-1];
    reg [1:0] test_outputs [0:NUM_TESTS-1];
    reg [1:0] expected_out;
    integer i;
    integer errors;

    // Generate clock (100 MHz)
    always #5 clk = ~clk;

    logic_net dut (
        .clk(clk),
        .rst(rst),
        .inp(inp),
        .out(out)
    );

    initial begin
        $readmemb("test_vectors_input.txt", test_inputs);
        $readmemb("test_vectors_output.txt", test_outputs);
        errors = 0;

        // Reset sequence
        rst = 1;
        #20 rst = 0;  // Release reset after 2 cycles

        // Run tests with pipeline latency
        for (i = 0; i < NUM_TESTS; i = i + 1) begin
            inp = test_inputs[i];

            // Wait for pipeline to fill
            repeat(PIPELINE_LATENCY) @(posedge clk);

            expected_out = test_outputs[i];

            // Check output
            if (out !== expected_out) begin
                $display("FAIL: Input %b -> Output %b (expected %b)",
                         test_inputs[i], out, expected_out);
                errors = errors + 1;
            end
        end

        // Summary...
        $finish;
    end
endmodule

Step 3: Simulate with Different Tools

Option A: Vivado Simulator (xsim)

# Compile the design
xvlog logic_net.v
xvlog tb_logic_net.v

# Elaborate
xelab -debug typical tb_logic_net -s tb_logic_net_sim

# Run simulation
xsim tb_logic_net_sim -runall

# View waveforms (GUI mode)
xsim tb_logic_net_sim -gui

Option B: Icarus Verilog

# Compile and run in one step
iverilog -o sim.out logic_net.v tb_logic_net.v
vvp sim.out

# View waveforms with GTKWave
gtkwave tb_logic_net.vcd

Option C: ModelSim/QuestaSim

# Create work library
vlib work

# Compile sources
vlog logic_net.v
vlog tb_logic_net.v

# Simulate
vsim -c tb_logic_net -do "run -all; quit"

# Or run with GUI
vsim tb_logic_net
# In ModelSim GUI: run -all

Step 4: Analyze Results

Success Criteria

  • All test vectors should produce matching outputs

  • No ‘X’ or ‘Z’ values in outputs (indicates uninitialized or high-impedance states)

  • Combinational delay should be minimal (typically < 1ns for simple gates)

Common Issues

Mismatched outputs:

  • Verify test vector format (binary vs decimal)

  • Check that Verilog module name matches instantiation in testbench

  • Ensure input/output widths match

X or Z values:

  • Usually indicates undriven wires

  • Check all wires in generated Verilog have assignments

Compilation errors:

  • Verify Verilog syntax

  • Check for Verilog-1995 vs Verilog-2001 compatibility issues
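When debugging mismatches, it can help to diff expected vectors against simulator output in Python rather than reading testbench logs. The sketch below assumes your testbench dumps its outputs to a text file in the same one-row-per-test format as Step 1 (the demo file names are hypothetical; adapt them to your flow):

```python
import numpy as np

def compare_vectors(expected_file, actual_file):
    """Return the indices of test vectors whose simulated outputs
    differ from the expected outputs exported in Step 1."""
    expected = np.loadtxt(expected_file, dtype=np.int64, ndmin=2)
    actual = np.loadtxt(actual_file, dtype=np.int64, ndmin=2)
    if expected.shape != actual.shape:
        raise ValueError(f"shape mismatch: {expected.shape} vs {actual.shape}")
    return np.nonzero((expected != actual).any(axis=1))[0].tolist()

# Demo with synthetic data; in practice compare test_vectors_output.txt
# against whatever file your testbench writes:
np.savetxt("expected_demo.txt", [[0, 1], [1, 0], [1, 1]], fmt="%d")
np.savetxt("actual_demo.txt",   [[0, 1], [1, 1], [1, 1]], fmt="%d")
print(compare_vectors("expected_demo.txt", "actual_demo.txt"))  # [1]
```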

Self-Checking Testbench

For automated testing, create a self-checking testbench that exits with an error code:

initial begin
    // ... run tests ...

    if (errors != 0) begin
        $display("FAIL: %0d errors detected", errors);
        $fatal(1, "Test failed");  // Exit with error
    end else begin
        $display("PASS: All tests passed");
    end
    $finish;
end

Then use in scripts:

#!/bin/bash
iverilog -o sim.out logic_net.v tb_logic_net.v && vvp sim.out
if [ $? -eq 0 ]; then
    echo "Simulation PASSED"
else
    echo "Simulation FAILED"
    exit 1
fi

Synthesis with Vivado

Synthesis converts Verilog RTL code into gate-level netlists optimized for specific FPGA parts, providing resource estimates, timing analysis, and power consumption data.

Overview

Synthesis provides:

  • Resource Estimates: LUTs, FFs, DSPs, BRAM usage

  • Timing Analysis: Maximum frequency (fmax), Worst Negative Slack (WNS), critical paths

  • Power Estimates: Static and dynamic power consumption

Prerequisites

  • Vivado Design Suite installed (tested with 2019.1+)

  • Generated Verilog file from TorchLogix (e.g., logic_net.v)

  • Target FPGA part number (e.g., xc7z020clg400-1 for Pynq-Z2)

Quick Start

# Navigate to your Verilog output directory
cd verilog_output/

# Run synthesis with TCL script
vivado -mode batch -source synthesize.tcl -tclargs logic_net.v xc7z020clg400-1

# Results will be in synthesis_reports/
ls synthesis_reports/
# utilization.txt  timing.txt  power.txt

Synthesis Methods

Method 2: Vivado GUI

  1. Launch Vivado: vivado

  2. Create New Project

    • Click “Create Project”

    • Choose project location

    • Select “RTL Project”

  3. Add Verilog Source

    • In Flow Navigator: “Add Sources”

    • “Add or create design sources”

    • Add your logic_net.v file

  4. Select Target Part

    • Common parts:

      • Pynq-Z2: xc7z020clg400-1

      • ZCU104: xczu7ev-ffvc1156-2-e

      • Artix-7: xc7a35tcpg236-1

  5. Run Synthesis

    • In Flow Navigator: “Run Synthesis”

  6. View Reports

    • After synthesis: “Open Synthesized Design”

    • Reports → Utilization, Timing Summary

Method 3: Vivado Tcl Console

vivado -mode tcl

Then in Tcl console:

# Create in-memory project
create_project -in_memory -part xc7z020clg400-1

# Read Verilog
read_verilog logic_net.v

# Run synthesis
synth_design -top logic_net -mode out_of_context

# Generate reports
report_utilization -file utilization.txt
report_timing_summary -file timing.txt
report_power -file power.txt

exit

Understanding the Reports

Utilization Report

Shows FPGA resource usage:

+-------------------------+------+-------+-----------+-------+
|        Site Type        | Used | Fixed | Available | Util% |
+-------------------------+------+-------+-----------+-------+
| Slice LUTs              |  147 |     0 |     53200 |  0.28 |
|   LUT as Logic          |  147 |     0 |     53200 |  0.28 |
|   LUT as Memory         |    0 |     0 |     17400 |  0.00 |
| Slice Registers         |    0 |     0 |    106400 |  0.00 |
|   Register as Flip Flop |    0 |     0 |    106400 |  0.00 |
|   Register as Latch     |    0 |     0 |    106400 |  0.00 |
| F7 Muxes                |    8 |     0 |     26600 |  0.03 |
| F8 Muxes                |    2 |     0 |     13300 |  0.02 |
+-------------------------+------+-------+-----------+-------+

Key Metrics:

  • Slice LUTs: Primary logic resource. Each LUT can implement any 6-input Boolean function.

  • Slice Registers: Flip-flops for sequential logic (0 for purely combinational designs)

  • F7/F8 Muxes: Larger multiplexers for wide logic functions

  • DSPs: Digital Signal Processing blocks (typically 0 for logic gate networks)

  • BRAM: Block RAM (typically 0 for logic gate networks)

Timing Report

Shows timing analysis:

Timing Summary (ns)
-------------------
WNS(ns)      TNS(ns)      WHS(ns)      THS(ns)      WPWS(ns)     TPWS(ns)
-------      -------      -------      -------      --------     --------
  7.500        0.000        0.300        0.000         3.500        0.000

Key Metrics:

  • WNS (Worst Negative Slack): Most critical timing margin

    • Positive = timing met

    • Negative = timing violation

  • Critical Path Delay: Longest combinational path through the design

For combinational designs:

# Get maximum delay through the design
report_timing -delay_type min_max -max_paths 10 -file timing_paths.txt

Example output:

Critical Path Delay: 4.832 ns
Maximum Frequency: 206.95 MHz (if this were in a clocked design)

Power Report

Shows estimated power consumption:

Total On-Chip Power (W)  : 0.082
Dynamic (W)              : 0.012
Device Static (W)        : 0.070

Interpreting Results for TorchLogix Models

Resource Estimates

LUT Count Interpretation:

  • Each logic gate (AND, OR, XOR, etc.) typically maps to a fractional LUT

  • Expect roughly 0.5-1 LUT per gate operation

  • Tree structures may share LUTs efficiently

  • A 1000-gate network might use 500-800 LUTs
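The rule of thumb above can be wrapped in a quick pre-synthesis estimator. This is a rough sketch, not a TorchLogix utility, and real LUT counts depend on the synthesis tool and how well gates pack into 6-input LUTs:

```python
def estimate_luts(num_gates, luts_per_gate=(0.5, 0.8)):
    """Rough pre-synthesis LUT range using the 0.5-0.8 LUTs-per-gate
    rule of thumb; treat as a sanity check, not a prediction."""
    lo, hi = luts_per_gate
    return int(num_gates * lo), int(num_gates * hi)

lo, hi = estimate_luts(1000)
print(f"~{lo}-{hi} LUTs for a 1000-gate network")  # matches the example above
```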

Resource Scaling:

  • Linear layers: O(neurons) LUTs

  • Convolutional layers: O(kernels × receptive_field²) LUTs

  • Deeper trees → more efficient LUT packing

Pipelining Impact:

  • Adds flip-flops (FFs) for registers

  • May actually reduce LUTs through better optimization

  • Example: 50-layer model with 4 pipeline stages might use 20% fewer LUTs

Latency Estimates

Combinational Delay:

  • Total latency = critical path delay

  • Typical delays:

    • Simple AND/OR/XOR: 0.1-0.2 ns per gate

    • Deep trees (10 levels): 2-4 ns total

    • Very deep networks: 5-10 ns

Maximum Frequency:

fmax = 1 / critical_path_delay

Example: 4.5 ns critical path → fmax ≈ 222 MHz
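The conversion is just a unit change (period in ns to frequency in MHz), shown here as a one-liner reproducing both figures used in this section:

```python
def fmax_mhz(critical_path_ns):
    """fmax = 1 / critical_path_delay, with ns -> MHz conversion."""
    return 1000.0 / critical_path_ns

print(round(fmax_mhz(4.5), 1))    # the 4.5 ns example above
print(round(fmax_mhz(4.832), 2))  # the critical path from the timing report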

Pipelined vs Combinational Designs

Combinational (pipeline_stages=0)

verilog = compiled.get_verilog_code(pipeline_stages=0)  # Default

  • No clock or reset signals

  • All logic in single cycle

  • Synthesis uses -mode out_of_context

  • Problem: May fail for large models (>1M Verilog lines)

Timing Analysis:

Combinational (pipeline_stages=0):
  Critical path: 25 ns
  fmax: 40 MHz

Pipelined (pipeline_stages>0)

verilog = compiled.get_verilog_code(pipeline_stages=4)  # 4 stages

  • Has clock (clk) and reset (rst) signals

  • Logic divided into N pipeline stages

  • Synthesis auto-detects and applies clock constraints

  • Solution: Enables synthesis of very large models

Timing Analysis:

Pipelined (pipeline_stages=4):
  Critical path: 6 ns
  fmax: 166 MHz
  Latency: 4 cycles = 24 ns @ 166 MHz

When to use pipelining:

  • Synthesis fails or runs for hours → Use pipeline_stages=1 or more

  • Model >20 layers → Consider pipeline_stages=4-8

  • Model >100 layers → Use pipeline_stages=16 or higher

Synthesis Strategies

For Minimum Latency

# -no_lc disables LUT combining, which can help timing
synth_design -top logic_net -mode out_of_context \
    -directive PerformanceOptimized \
    -no_lc

For Minimum Area

# -shreg_min_size raises the threshold for shift-register extraction
synth_design -top logic_net -mode out_of_context \
    -directive AreaOptimized_high \
    -shreg_min_size 5

For Balanced Results

synth_design -top logic_net -mode out_of_context \
    -directive Default

Common Issues and Solutions

Issue: “Multi-driven net” Error

Cause: Multiple assign statements to the same wire. Solution: Check generated Verilog for duplicate assignments.

Issue: Unrealistically High fmax

Cause: No input/output delays specified. Solution: Add timing constraints with realistic I/O delays.

Issue: Very High LUT Count

Cause: Unoptimized gate structure or deep logic trees. Solution:

  • Check for constant propagation opportunities

  • Verify gate operations are using optimal Boolean functions

  • Consider factoring logic differently in training

Next Steps After Synthesis

  1. Implementation: Run place & route for final timing/resource numbers

opt_design
place_design
route_design
report_timing_summary -file post_route_timing.txt
report_utilization -file post_route_utilization.txt

  2. Bitstream Generation: Create FPGA configuration file

write_bitstream -force logic_net.bit

  3. Hardware Testing: Deploy to actual FPGA and validate


Complete Workflow Examples

Example 1: Small Model - Full Workflow

Train, export, test, and synthesize a small MNIST classifier:

#!/usr/bin/env python3
import torch
import torch.nn as nn
from torchlogix.layers import LogicDense, GroupSum
from torchlogix import CompiledLogicNet
import numpy as np
import subprocess

# 1. Create and train model
model = nn.Sequential(
    LogicDense(784, 128, connections="fixed", device="cpu"),
    LogicDense(128, 128, connections="fixed", device="cpu"),
    LogicDense(128, 100, connections="fixed", device="cpu"),
    GroupSum(10, tau=10.0)
)

# Train your model here...
# model.load_state_dict(torch.load('trained_model.pt'))

# 2. Compile and export Verilog
compiled = CompiledLogicNet(
    model,
    input_shape=(784,),
    use_bitpacking=False,
    num_bits=1
)

# Small model: fully combinational
verilog = compiled.get_verilog_code(
    module_name="mnist_classifier",
    pipeline_stages=0
)

with open('mnist_classifier.v', 'w') as f:
    f.write(verilog)

print("✓ Generated mnist_classifier.v")

# 3. Generate test vectors
num_tests = 100
test_inputs = np.random.randint(0, 2, (num_tests, 784), dtype=np.int8)
test_outputs = []

compiled.compile()
for inp in test_inputs:
    out = compiled.forward(inp.reshape(1, -1))
    test_outputs.append(out[0])

np.savetxt('test_vectors_input.txt', test_inputs, fmt='%d', delimiter='')
np.savetxt('test_vectors_output.txt', np.array(test_outputs), fmt='%d', delimiter='')

print(f"✓ Generated {num_tests} test vectors")

# 4. Run simulation (using Icarus Verilog)
subprocess.run([
    'iverilog', '-o', 'sim.out',
    'mnist_classifier.v', 'tb_mnist_classifier.v'
])
result = subprocess.run(['vvp', 'sim.out'], capture_output=True, text=True)

if "ALL TESTS PASSED" in result.stdout:
    print("✓ Simulation passed")
else:
    print("✗ Simulation failed")
    exit(1)

# 5. Run synthesis
subprocess.run([
    'vivado', '-mode', 'batch',
    '-source', 'synthesize.tcl',
    '-tclargs', 'mnist_classifier.v', 'xc7z020clg400-1', 'reports/'
])

print("✓ Synthesis complete - check reports/ directory")

Example 2: Large Model with Pipelining

#!/usr/bin/env python3
from torchlogix import CompiledLogicNet
import subprocess

# Load large pre-trained model
model = ...  # 50+ layers

compiled = CompiledLogicNet(model, input_shape=(784,), use_bitpacking=False, num_bits=1)

# Try different pipeline configurations
configs = [
    (0, "Fully combinational"),
    (1, "Single output register"),
    (4, "4 pipeline stages"),
    (8, "8 pipeline stages"),
]

for pipeline_stages, description in configs:
    print(f"\n{'='*60}")
    print(f"Testing: {description} (pipeline_stages={pipeline_stages})")
    print(f"{'='*60}")

    # Generate Verilog
    verilog = compiled.get_verilog_code(
        module_name=f"large_model_p{pipeline_stages}",
        pipeline_stages=pipeline_stages
    )

    filename = f"large_model_p{pipeline_stages}.v"
    with open(filename, 'w') as f:
        f.write(verilog)

    print(f"✓ Generated {filename} ({len(verilog)} bytes)")

    # Synthesize
    result = subprocess.run([
        'vivado', '-mode', 'batch',
        '-source', 'synthesize.tcl',
        '-tclargs', filename, 'xc7z020clg400-1', f'reports_p{pipeline_stages}/'
    ], capture_output=True, text=True, timeout=600)

    if result.returncode == 0:
        print(f"✓ Synthesis succeeded")

        # Parse results
        with open(f'reports_p{pipeline_stages}/summary.txt') as f:
            print(f.read())
    else:
        print(f"✗ Synthesis failed or timed out")

print("\n" + "="*60)
print("Compare results in reports_p*/ directories")
print("="*60)

Example 3: Convolutional Model

#!/usr/bin/env python3
import torch.nn as nn
from torchlogix.layers import LogicConv2d, OrPooling, LogicDense, GroupSum
from torchlogix import CompiledLogicNet

# Create convolutional model
model = nn.Sequential(
    LogicConv2d(
        in_dim=(28, 28),
        channels=1,
        num_kernels=16,
        tree_depth=3,
        receptive_field_size=5,
        padding=2,
        connections="fixed",
        device="cpu"
    ),
    OrPooling(kernel_size=2, stride=2),

    LogicConv2d(
        in_dim=(14, 14),
        channels=16,
        num_kernels=32,
        tree_depth=3,
        receptive_field_size=3,
        padding=1,
        connections="fixed",
        device="cpu"
    ),
    OrPooling(kernel_size=2, stride=2),

    nn.Flatten(),
    LogicDense(32*7*7, 256, connections="fixed", device="cpu"),
    LogicDense(256, 100, connections="fixed", device="cpu"),
    GroupSum(10, tau=10.0)
)

# Train model...

# Export with pipelining (medium-sized model)
compiled = CompiledLogicNet(
    model,
    input_shape=(1, 28, 28),
    use_bitpacking=False,
    num_bits=1
)

verilog = compiled.get_verilog_code(
    module_name="conv_classifier",
    pipeline_stages=4  # 4 stages for medium conv model
)

compiled.export_hdl(
    "./conv_verilog_output",
    module_name="conv_classifier",
    pipeline_stages=4
)

print("✓ Exported convolutional model to conv_verilog_output/")

Example 4: Python-Driven Synthesis Loop

Automate synthesis and collect metrics:

#!/usr/bin/env python3
import subprocess
import re
import pandas as pd

from torchlogix import CompiledLogicNet

def synthesize_and_extract_metrics(verilog_file, part, report_dir):
    """Run Vivado synthesis and extract key metrics."""

    # Run synthesis; treat a timeout the same as a failed run
    try:
        result = subprocess.run([
            'vivado', '-mode', 'batch',
            '-source', 'synthesize.tcl',
            '-tclargs', verilog_file, part, report_dir
        ], capture_output=True, text=True, timeout=600)
    except subprocess.TimeoutExpired:
        return None

    if result.returncode != 0:
        return None

    # Parse summary
    metrics = {}
    with open(f'{report_dir}/summary.txt', 'r') as f:
        for line in f:
            if 'LUTs:' in line:
                metrics['luts'] = int(re.search(r'\d+', line).group())
            elif 'FFs:' in line:
                metrics['ffs'] = int(re.search(r'\d+', line).group())

    # Parse timing
    with open(f'{report_dir}/timing.txt', 'r') as f:
        content = f.read()
        wns_match = re.search(r'WNS\(ns\)\s+([-\d.]+)', content)
        if wns_match:
            metrics['wns_ns'] = float(wns_match.group(1))

    return metrics

# Run experiments (`model` is the trained network from the earlier examples)
compiled = CompiledLogicNet(model, input_shape=(784,), use_bitpacking=False, num_bits=1)

results = []
for stages in [0, 1, 2, 4, 8, 16]:
    print(f"Synthesizing with pipeline_stages={stages}...")

    verilog = compiled.get_verilog_code(
        module_name=f'model_p{stages}',
        pipeline_stages=stages
    )

    filename = f'model_p{stages}.v'
    with open(filename, 'w') as f:
        f.write(verilog)

    metrics = synthesize_and_extract_metrics(
        filename,
        'xc7z020clg400-1',
        f'reports_p{stages}'
    )

    if metrics:
        metrics['pipeline_stages'] = stages
        results.append(metrics)
        print(f"  LUTs: {metrics['luts']}, FFs: {metrics['ffs']}, WNS: {metrics.get('wns_ns', 'n/a')} ns")

# Create summary DataFrame
df = pd.DataFrame(results)
df.to_csv('synthesis_comparison.csv', index=False)

print("\n" + "="*60)
print("Synthesis Comparison")
print("="*60)
print(df)

Advanced Topics

Custom Timing Constraints

For more accurate timing analysis, create custom constraints in constraints.xdc:

# Virtual clock for timing analysis (10ns = 100 MHz)
create_clock -period 10.000 -name virtual_clk

# Input delay (assume inputs arrive 2ns after clock edge)
set_input_delay -clock virtual_clk 2.000 [get_ports inp*]

# Output delay (assume outputs must be stable 2ns before next clock edge)
set_output_delay -clock virtual_clk 2.000 [get_ports out*]

# For pipelined designs with an actual 200 MHz clock
create_clock -period 5.000 -name clk [get_ports clk]

# Relax timing on reset path
set_false_path -from [get_ports rst]

Load in synthesis:

read_xdc constraints.xdc
synth_design -top logic_net

Comparing C Code and Verilog Implementations

TorchLogix can generate both C code (for HLS or CPU) and Verilog:

from torchlogix import CompiledLogicNet

compiled = CompiledLogicNet(model, input_shape=(784,), use_bitpacking=False, num_bits=1)

# Generate C code
c_code = compiled.get_c_code()
with open('model.c', 'w') as f:
    f.write(c_code)

# Generate Verilog
verilog_code = compiled.get_verilog_code()
with open('model.v', 'w') as f:
    f.write(verilog_code)

# Compile C for CPU execution
compiled.compile(compiler='gcc', optimization_level='-O3')

# Now you can compare:
# - C compiled to native CPU code
# - C compiled through Vivado HLS to RTL
# - Direct Verilog synthesis

# All three should produce functionally identical results
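The claim that all three paths agree is easy to check mechanically. A minimal equivalence harness might look like the sketch below; the two lambdas are toy stand-ins, and in practice you would wrap the compiled C binary, the HLS co-simulation, and the RTL simulation behind the same callable signature (bit list in, bit list out):

```python
import random

def check_equivalence(backends, input_bits, n_trials=100, seed=0):
    """Feed the same random bit vectors to every backend and confirm
    that all of them produce identical output bits."""
    rng = random.Random(seed)
    for _ in range(n_trials):
        x = [rng.randint(0, 1) for _ in range(input_bits)]
        outputs = [tuple(backend(x)) for backend in backends]
        if len(set(outputs)) != 1:
            return False, x  # disagreement found; return the failing input
    return True, None

# Toy backends standing in for the CPU, HLS, and RTL paths:
xor_a = lambda x: [x[0] ^ x[1]]
xor_b = lambda x: [(x[0] + x[1]) % 2]
ok, counterexample = check_equivalence([xor_a, xor_b], input_bits=2)
print(ok)  # True: both implementations agree on every input
```

Fixing the seed keeps the check reproducible, and returning the failing input vector gives you a ready-made counterexample to replay in a waveform viewer.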

Integration with C→HLS Pipeline

For users who want to explore the HLS route:

# Generate optimized C code
compiled = CompiledLogicNet(model)
c_code = compiled.get_c_code()

with open('model.c', 'w') as f:
    f.write(c_code)

Then use Vivado HLS:

# hls_script.tcl
open_project hls_project
set_top model_top
add_files model.c
open_solution "solution1"
set_part {xc7z020clg400-1}
create_clock -period 10

csynth_design
export_design -format ip_catalog
exit

Run the script from a shell:

vivado_hls -f hls_script.tcl

This generates an IP core that can be integrated into larger FPGA designs.

Layer Support Status and Future Work

Currently Supported

| Layer       | Status      | Implementation               |
|-------------|-------------|------------------------------|
| LogicDense  | ✅ Complete | Direct gate synthesis        |
| LogicConv2d | ✅ Complete | Binary tree structure        |
| LogicConv3d | ✅ Complete | Binary tree with 3D indexing |
| Flatten     | ✅ Complete | Wire passthrough             |

In Progress / TODO

| Layer     | Status  | Notes                            |
|-----------|---------|----------------------------------|
| OrPooling | ⚠️ TODO | OR reduction tree needed         |
| GroupSum  | ⚠️ TODO | Adder tree implementation needed |

Models that use unsupported layers will produce placeholders in the generated Verilog. For production use:

  • Use C code generation for complete model support

  • Wait for future releases with complete Verilog support

  • Manually implement missing layers if needed
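For readers who want to fill the OrPooling gap themselves, the missing piece is exactly the "OR reduction tree" noted in the table. A hypothetical generator (not part of TorchLogix) that emits a balanced OR-reduction over a pooling window as Verilog assigns could look like:

```python
def or_reduction_tree(wires, prefix="orp"):
    """Emit Verilog continuous assigns for a balanced OR-reduction
    over `wires`. Returns (assign_lines, name_of_final_wire)."""
    lines = []
    level = 0
    while len(wires) > 1:
        nxt = []
        for i in range(0, len(wires) - 1, 2):
            name = f"{prefix}_l{level}_{i // 2}"
            lines.append(f"assign {name} = {wires[i]} | {wires[i + 1]};")
            nxt.append(name)
        if len(wires) % 2:  # odd wire passes through to the next level
            nxt.append(wires[-1])
        wires = nxt
        level += 1
    return lines, wires[0]

lines, out = or_reduction_tree(["w0", "w1", "w2", "w3"])
for line in lines:
    print(line)
print("// pooled output:", out)
```

A real integration would also emit the matching `wire` declarations and map the pooling window's spatial indices onto `wires`; the sketch covers only the reduction itself.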

Optimization Tips

Model Architecture:

  • Prefer models with 20-100 layers for good synthesis results

  • Very deep models (>200 layers) require aggressive pipelining

  • Conv layers with large receptive fields may need optimization

Pipeline Configuration:

  • Start with pipeline_stages=0 for small models

  • Increase incrementally if synthesis fails or is slow

  • Use full pipelining (pipeline_stages=999) for maximum fmax
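The incremental approach above is easy to automate. This is a hypothetical back-off loop, not a TorchLogix API: `synthesize(stages)` stands in for any function that generates Verilog at the given depth, runs synthesis, and returns True on success:

```python
def find_min_pipeline_stages(synthesize, candidates=(0, 1, 2, 4, 8, 16)):
    """Try increasing pipeline depths until synthesis succeeds.
    `synthesize(stages)` should return True on success."""
    for stages in candidates:
        if synthesize(stages):
            return stages
    return None  # even the deepest candidate pipeline failed

# Fake synthesis result for illustration: succeeds once stages >= 4
picked = find_min_pipeline_stages(lambda s: s >= 4)
print(picked)  # 4
```

Because the candidates are tried shallowest-first, the loop returns the minimum depth that synthesizes, which keeps latency as low as the design allows.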

Synthesis Directives:

  • Use -directive PerformanceOptimized for speed

  • Use -directive AreaOptimized_high for small FPGAs

  • Experiment with different strategies for your specific design

Resource Utilization:

  • Target <70% LUT utilization for good place-and-route results

  • Very high utilization (>90%) may cause routing failures

  • Consider model size vs available FPGA resources during training
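The rule-of-thumb thresholds above can be folded into a small post-synthesis check. The helper below is an illustrative sketch; the usage numbers are made up, apart from the xc7z020's 53,200 available LUTs:

```python
def utilization_report(used_luts, available_luts):
    """Classify LUT utilization against the rule-of-thumb thresholds:
    <=70% comfortable, 70-90% tight, >90% likely routing failure."""
    pct = 100.0 * used_luts / available_luts
    if pct > 90:
        status = "likely routing failure"
    elif pct > 70:
        status = "tight; place-and-route may struggle"
    else:
        status = "comfortable"
    return pct, status

# xc7z020 has 53,200 LUTs; suppose synthesis reported 21,300 used
pct, status = utilization_report(21_300, 53_200)
print(f"{pct:.1f}% LUTs: {status}")  # 40.0% LUTs: comfortable
```

Hooking this into the metrics dictionary from Example 4 gives an immediate red flag when a pipeline sweep pushes a design past the routable range.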


Summary

TorchLogix provides comprehensive support for deploying logic gate networks to FPGA hardware:

  1. Direct Verilog Generation: Export trained models to readable, synthesizable Verilog RTL

  2. Configurable Pipelining: Balance latency and synthesis complexity with flexible pipeline stages

  3. Testing Support: Generate test vectors and verify functionality with standard simulators

  4. Synthesis Integration: Automate synthesis with Vivado to obtain resource and timing estimates

  5. Complete Workflows: End-to-end examples from training to hardware deployment

Key Takeaways:

  • Use pipeline_stages=0 for small models, increase for larger ones

  • Always test generated Verilog with simulations before synthesis

  • Synthesis provides accurate resource and timing estimates before hardware deployment

  • Both Verilog and C code generation are supported for maximum flexibility

For questions or issues with hardware deployment, please refer to the TorchLogix repository or open an issue.