Hardware Deployment Guide

This guide explains how to deploy TorchLogix models to FPGAs and other hardware platforms using direct Verilog/RTL generation.

Overview

TorchLogix can generate hardware descriptions (Verilog RTL) directly from trained models, enabling deployment to FPGAs and ASIC implementations. This provides an alternative to the traditional C→HLS→RTL pipeline and offers several advantages for logic gate networks.

Why Hardware Deployment?

Benefits:

  • Ultra-low latency: FPGA implementations can achieve sub-microsecond inference

  • High throughput: Massive parallelism enables processing thousands of inputs per second

  • Energy efficiency: Specialized hardware is more power-efficient than general-purpose CPUs/GPUs

  • Deterministic timing: Predictable performance for real-time applications

Use Cases:

  • Low-latency inference at network edge

  • Real-time signal processing and control systems

  • High-throughput batch processing

  • Embedded systems with strict power budgets

Design Approach: Direct Gate-Level Synthesis

TorchLogix uses a direct gate-level synthesis approach rather than LUT-based truth tables:

TorchLogix Model (PyTorch)
         ↓
   Gate Operations (AND, OR, XOR, etc.)
         ↓
   Verilog Expressions
         ↓
   FPGA Synthesis

Why direct gates instead of LUTs?

  1. Natural mapping: TorchLogix already defines 16 gate operations

  2. Better optimization: Modern synthesis tools optimize gate-level HDL effectively

  3. More readable: assign out = a & b; is clearer than case statements

  4. Binary structure: Each neuron has exactly 2 inputs → perfect for direct gates

Comparison: Direct Verilog vs C→HLS

| Aspect       | Direct Verilog           | C→HLS→RTL                  |
|--------------|--------------------------|----------------------------|
| Generation   | Direct from model        | Compile C, then HLS        |
| Intermediate | None                     | C code + HLS directives    |
| Control      | Full RTL control         | HLS tool dependent         |
| Readability  | Gate-level, explicit     | High-level C abstractions  |
| Use Case     | FPGA-specific deployment | Cross-platform (CPU + FPGA) |

Both approaches are supported by TorchLogix. Use direct Verilog for FPGA-specific optimization and C→HLS for flexibility.

Supported Layer Types

| Layer Type  | Verilog Support    | Notes                            |
|-------------|--------------------|----------------------------------|
| LogicDense  | ✅ Fully Supported | Direct gate synthesis            |
| LogicConv2d | ✅ Fully Supported | Binary tree structure            |
| LogicConv3d | ✅ Fully Supported | Binary tree with 3D indexing     |
| Flatten     | ✅ Supported       | Wire passthrough                 |
| OrPooling   | ⚠️ TODO            | Recognized but not yet generated |
| GroupSum    | ⚠️ TODO            | Recognized but not yet generated |

Models using unsupported layers can still generate Verilog for supported portions, or use C code generation as an alternative.


Basic Verilog Export

Quick Start

Generate Verilog from any trained TorchLogix model:

import torch
import torch.nn as nn
from torchlogix.layers import LogicDense, GroupSum
from torchlogix import CompiledLogicNet

# Create or load your model
model = nn.Sequential(
    LogicDense(8, 32, connections="fixed", device="cpu"),
    LogicDense(32, 32, connections="fixed", device="cpu"),
    GroupSum(1, tau=1.0)
)

# Compile the model
compiled = CompiledLogicNet(
    model,
    input_shape=(8,),
    use_bitpacking=False,
    num_bits=1
)

# Generate Verilog code
verilog_code = compiled.get_verilog_code(module_name="my_logic_net")

# Export to file
compiled.export_hdl(
    output_dir="./verilog_output",
    module_name="my_logic_net",
    format="verilog"
)

API Reference

get_verilog_code(module_name, pipeline_stages)

Generates complete Verilog module as a string.

Parameters:

  • module_name (str): Name of the top-level Verilog module (default: "torchlogix_net")

  • pipeline_stages (int): Number of pipeline stages (default: 0)

    • 0: Fully combinational (no registers, 1 cycle latency)

    • 1: Single output register (helps synthesis)

    • N: Divide layers into N pipeline stages (N cycle latency)

    • len(layers): Full layer-level pipelining (highest fmax)

Returns: Complete Verilog code as string
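One way to picture the pipeline_stages parameter is as a partition of the layer list into contiguous groups, with a register after each group. The sketch below illustrates this with an even split; `stage_groups` is a hypothetical helper, not part of the TorchLogix API, and the library's actual partitioning may differ:

```python
import math

def stage_groups(num_layers, pipeline_stages):
    """Illustrative only: split num_layers into contiguous groups,
    with a pipeline register after each group."""
    if pipeline_stages <= 0:
        # Fully combinational: everything in one group, no registers
        return [list(range(num_layers))]
    stages = min(pipeline_stages, num_layers)  # cap at one layer per stage
    size = math.ceil(num_layers / stages)
    return [list(range(i, min(i + size, num_layers)))
            for i in range(0, num_layers, size)]

print(stage_groups(6, 0))    # one combinational group of all 6 layers
print(stage_groups(6, 2))    # two groups of 3 layers
print(stage_groups(6, 999))  # capped: one layer per stage
```

This also shows why passing a very large value (like 999) is equivalent to full layer-level pipelining: the stage count is capped at the number of layers.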

export_hdl(output_dir, module_name, format, pipeline_stages)

Exports Verilog to a file.

Parameters:

  • output_dir (str): Directory to write Verilog file

  • module_name (str): Module name (default: "torchlogix_net")

  • format (str): HDL format, currently only "verilog" supported (default: "verilog")

  • pipeline_stages (int): Pipeline configuration (default: 0)

Creates: {output_dir}/{module_name}.v

Understanding Generated Verilog

Combinational Design (pipeline_stages=0)

For a simple 2-layer network:

module logic_net (
    input wire [7:0] inp,
    output wire [3:0] out
);
    // No clock or reset needed

    // Layer 0: LogicDense (4 neurons)
    wire [3:0] layer_0_out;
    assign layer_0_out[0] = (inp[0] & inp[2]);  // AND gate
    assign layer_0_out[1] = (inp[1] | inp[3]);  // OR gate
    assign layer_0_out[2] = (inp[4] ^ inp[5]);  // XOR gate
    assign layer_0_out[3] = ~(inp[6] & inp[7]); // NAND gate

    // Layer 1: LogicDense (2 neurons)
    assign out[0] = (layer_0_out[0] | layer_0_out[1]);
    assign out[1] = (layer_0_out[2] ^ layer_0_out[3]);

endmodule

Characteristics:

  • Pure combinational logic (no state)

  • No clock or reset signals

  • 1 cycle latency (output available same cycle as input)

  • Critical path spans entire network

Pipelined Design (pipeline_stages=2)

With pipeline registers:

module logic_net (
    input wire clk,
    input wire rst,
    input wire [7:0] inp,
    output reg [3:0] out
);
    // Combinational wires
    wire [3:0] layer_0_comb;
    wire [3:0] out_comb;

    // Pipeline register
    reg [3:0] layer_0_out;

    // Layer 0: Combinational logic
    assign layer_0_comb[0] = (inp[0] & inp[2]);
    assign layer_0_comb[1] = (inp[1] | inp[3]);
    assign layer_0_comb[2] = (inp[4] ^ inp[5]);
    assign layer_0_comb[3] = ~(inp[6] & inp[7]);

    // Pipeline register after Layer 0
    always @(posedge clk) begin
        if (rst)
            layer_0_out <= 4'd0;
        else
            layer_0_out <= layer_0_comb;
    end

    // Layer 1: Combinational logic
    assign out_comb[0] = (layer_0_out[0] | layer_0_out[1]);
    assign out_comb[1] = (layer_0_out[2] ^ layer_0_out[3]);

    // Output register
    always @(posedge clk) begin
        if (rst)
            out <= 4'd0;
        else
            out <= out_comb;
    end

endmodule

Characteristics:

  • Synchronous design with clock and reset

  • Registers break up long combinational paths

  • N cycle latency (where N = pipeline_stages)

  • Higher maximum frequency (fmax)

Gate Operations Supported

All 16 two-input Boolean operations are supported:

| Gate ID | Operation        | Verilog Expression |
|---------|------------------|--------------------|
| 0       | Zero (constant)  | 1'b0               |
| 1       | AND              | (a & b)            |
| 2       | A AND NOT B      | (a & ~b)           |
| 3       | A (passthrough)  | a                  |
| 4       | NOT A AND B      | (~a & b)           |
| 5       | B (passthrough)  | b                  |
| 6       | XOR              | (a ^ b)            |
| 7       | OR               | (a \| b)           |
| 8       | NOR              | ~(a \| b)          |
| 9       | XNOR             | ~(a ^ b)           |
| 10      | NOT B            | ~b                 |
| 11      | B IMPLIES A      | (~b \| a)          |
| 12      | NOT A            | ~a                 |
| 13      | A IMPLIES B      | (~a \| b)          |
| 14      | NAND             | ~(a & b)           |
| 15      | One (constant)   | 1'b1               |
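The table can be cross-checked in plain Python by mirroring each Verilog expression. This is an illustrative sketch (`GATES` and `truth_table` are not part of the TorchLogix API); note that gate ID 15−k is always the complement of gate ID k:

```python
# Python mirror of the 16 two-input gate expressions, indexed by gate ID.
# `& 1` masks Python's signed ~ back down to a single bit.
GATES = {
    0:  lambda a, b: 0,             # constant 0
    1:  lambda a, b: a & b,         # AND
    2:  lambda a, b: a & ~b & 1,    # A AND NOT B
    3:  lambda a, b: a,             # A passthrough
    4:  lambda a, b: ~a & b & 1,    # NOT A AND B
    5:  lambda a, b: b,             # B passthrough
    6:  lambda a, b: a ^ b,         # XOR
    7:  lambda a, b: a | b,         # OR
    8:  lambda a, b: ~(a | b) & 1,  # NOR
    9:  lambda a, b: ~(a ^ b) & 1,  # XNOR
    10: lambda a, b: ~b & 1,        # NOT B
    11: lambda a, b: (~b | a) & 1,  # B IMPLIES A
    12: lambda a, b: ~a & 1,        # NOT A
    13: lambda a, b: (~a | b) & 1,  # A IMPLIES B
    14: lambda a, b: ~(a & b) & 1,  # NAND
    15: lambda a, b: 1,             # constant 1
}

def truth_table(gate_id):
    """Output column for inputs (a, b) in order 00, 01, 10, 11."""
    return [GATES[gate_id](a, b) for a in (0, 1) for b in (0, 1)]

print(truth_table(1))   # AND  -> [0, 0, 0, 1]
print(truth_table(14))  # NAND -> [1, 1, 1, 0]
```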


Pipelining for Large Models

The Problem: Large Combinational Designs

By default, TorchLogix generates fully combinational Verilog where all logic executes in a single clock cycle. This works well for small models but causes serious problems for larger ones:

Symptoms of Combinational Overload

  • Synthesis fails or runs for hours without completing

  • Verilog files >1M lines take forever to process

  • Very low maximum frequency (fmax < 50 MHz)

  • Timing closure failures (negative WNS)

  • “Design too large” errors from Vivado

Why This Happens

  • Deep combinational paths through many layers

  • Synthesis tools struggle to optimize very large logic cones

  • Critical path delay grows with model depth

  • No natural break points for timing optimization

The Solution: Pipeline Stages

Pipelining inserts registers between layers to break up long combinational paths:

Combinational (pipeline_stages=0):
  Input → [Layer0 → Layer1 → Layer2 → Layer3] → Output
  All in 1 cycle, huge critical path

Pipelined (pipeline_stages=4):
  Input → [Layer0] → REG → [Layer1] → REG → [Layer2] → REG → [Layer3] → REG → Output
  4 cycles latency, short critical paths

Benefits

  • Synthesis succeeds even for very large models

  • Much faster synthesis time (minutes vs hours)

  • Higher maximum frequency (200+ MHz vs <50 MHz)

  • Predictable timing closure

  • Better resource utilization

Trade-offs

  • Increased latency (N cycles instead of 1)

  • More flip-flops (registers consume area)

  • Need to handle clock and reset signals
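The trade-off can be quantified with a simple model. The sketch below assumes the critical path splits evenly across stages, which is idealized (real designs add register setup and clock-to-q overhead, so achieved fmax will be somewhat lower); `pipeline_tradeoff` is an illustrative helper, not a TorchLogix function:

```python
def pipeline_tradeoff(critical_path_ns, stages):
    """Idealized model: pipelining splits the critical path evenly.
    Returns (fmax in MHz, end-to-end latency in ns, results/us throughput)."""
    if stages <= 0:
        stage_delay, cycles = critical_path_ns, 1
    else:
        stage_delay, cycles = critical_path_ns / stages, stages
    fmax_mhz = 1000.0 / stage_delay   # period in ns -> frequency in MHz
    latency_ns = cycles * stage_delay
    throughput = fmax_mhz             # one result per cycle once the pipe fills
    return fmax_mhz, latency_ns, throughput

# The 25 ns combinational path from the "large model" example:
for stages in (0, 4):
    fmax, lat, thr = pipeline_tradeoff(25.0, stages)
    print(f"stages={stages}: fmax={fmax:.0f} MHz, latency={lat:.0f} ns")
```

Under this idealized split, end-to-end latency in nanoseconds stays roughly constant while throughput scales with the stage count; the register overhead omitted here is what makes real pipelined latency slightly higher.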

Pipeline Stage Options

pipeline_stages=0 - Fully Combinational (Default)

verilog = compiled.get_verilog_code(pipeline_stages=0)

  • No registers, no clock required

  • 1 cycle latency

  • Use for: Small models (<10 layers), initial prototyping

  • Avoid for: Large models (synthesis will fail)

pipeline_stages=1 - Output Register Only

verilog = compiled.get_verilog_code(pipeline_stages=1)

  • Single register at output

  • 1 cycle latency

  • Use for: Medium models where synthesis struggles but you need low latency

  • Best for: 10-30 layer models

pipeline_stages=N - N Pipeline Stages

# 4 pipeline stages
verilog = compiled.get_verilog_code(pipeline_stages=4)

  • Layers divided into N groups, register after each group

  • N cycle latency

  • Use for: Large models (50-200 layers)

  • Best for: Balancing latency vs synthesis speed

Full Layer-Level Pipelining

# Register between every layer
num_layers = len([m for m in model.modules() if isinstance(m, (LogicDense, LogicConv2d))])
verilog = compiled.get_verilog_code(pipeline_stages=num_layers)

# Or just use a large number
verilog = compiled.get_verilog_code(pipeline_stages=999)

  • Register after every single layer

  • Maximum possible fmax

  • Highest latency (= number of layers)

  • Use for: Very large models (>200 layers) or maximum throughput applications

Choosing the Right Pipeline Configuration

Decision Tree

Is synthesis failing or very slow?
│
├─ NO → Use pipeline_stages=0 (fully combinational)
│        Lowest latency, simplest design
│
└─ YES → How many layers in your model?
         │
         ├─ <20 layers → pipeline_stages=1
         │               (Output register only)
         │
         ├─ 20-100 layers → pipeline_stages=4 to 8
         │                  (Balanced approach)
         │
         └─ >100 layers → pipeline_stages=N/4 to N
                          (N = number of layers)
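The decision tree can be encoded as a small heuristic for scripting. This is a sketch of the guidance above, not a TorchLogix API; tune the thresholds for your model and toolchain:

```python
def recommend_pipeline_stages(num_layers, synthesis_struggles):
    """Heuristic encoding of the decision tree above (illustrative only)."""
    if not synthesis_struggles:
        return 0                                  # fully combinational
    if num_layers < 20:
        return 1                                  # output register only
    if num_layers <= 100:
        return max(4, min(8, num_layers // 12))   # balanced: 4-8 stages
    return max(num_layers // 4, 16)               # heavy pipelining: N/4+

print(recommend_pipeline_stages(10, synthesis_struggles=False))  # 0
print(recommend_pipeline_stages(60, synthesis_struggles=True))   # in 4-8
print(recommend_pipeline_stages(200, synthesis_struggles=True))  # 50 (= N/4)
```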

Size Guidelines

| Model Characteristics            | Recommended Config    | Latency     | Benefits            |
|----------------------------------|-----------------------|-------------|---------------------|
| <10 layers, <100K Verilog lines  | pipeline_stages=0     | 1 cycle     | Simple, low latency |
| 10-30 layers, synthesis slow     | pipeline_stages=1     | 1 cycle     | Helps synthesis     |
| 30-100 layers                    | pipeline_stages=4     | 4 cycles    | Good balance        |
| 100-200 layers                   | pipeline_stages=8-16  | 8-16 cycles | Reliable synthesis  |
| >200 layers                      | pipeline_stages=N/4   | N/4 cycles  | Fast synthesis      |
| Maximum throughput needed        | pipeline_stages=999   | N cycles    | Highest fmax        |

Empirical Testing

Start conservative and increase pipelining if needed:

# Step 1: Try combinational
verilog = compiled.get_verilog_code(pipeline_stages=0)
# Try to synthesize... if it fails or is very slow:

# Step 2: Add output register
verilog = compiled.get_verilog_code(pipeline_stages=1)
# Try to synthesize... if still slow:

# Step 3: Increase stages
for stages in [2, 4, 8, 16]:
    verilog = compiled.get_verilog_code(pipeline_stages=stages)
    # Synthesize and check timing/area trade-off

Performance Optimization

Finding Optimal Pipeline Depth

Run synthesis with different configurations and compare:

import subprocess

results = []
for stages in [0, 1, 2, 4, 8, 16]:
    verilog = compiled.get_verilog_code(
        module_name=f'design_p{stages}',
        pipeline_stages=stages
    )

    # Save Verilog
    with open(f'design_p{stages}.v', 'w') as f:
        f.write(verilog)

    # Synthesize (see Synthesis section for details)
    subprocess.run([
        'vivado', '-mode', 'batch',
        '-source', 'synthesize.tcl',
        '-tclargs', f'design_p{stages}.v', 'xc7z020clg400-1'
    ])

    # Parse and compare results
    # results.append((stages, luts, ffs, fmax, synthesis_time))

# Find optimal trade-off based on your requirements

Common Issues

Issue: Pipelined design has lower fmax than expected

  • Cause: Not enough pipeline stages, or uneven distribution

  • Solution: Increase pipeline_stages or try full layer-level pipelining

Issue: Too much area consumed by registers

  • Cause: Too many pipeline stages for the model size

  • Solution: Reduce pipeline_stages to find balance

Issue: Synthesis still slow with pipelining

  • Cause: Individual layers may still be very large

  • Solution:

    • Check if conv layers with large receptive fields need breaking up

    • Use more pipeline stages

    • Consider model architecture changes


Testing Generated Verilog

Functional testing verifies that your generated Verilog matches the expected behavior of the trained model.

Prerequisites

You’ll need one of the following simulators:

  • Vivado Simulator (xsim) - Included with Vivado

  • ModelSim/QuestaSim - Commercial simulator from Mentor/Siemens

  • Icarus Verilog - Open-source, free (apt install iverilog or brew install icarus-verilog)

  • Verilator - Fast open-source simulator (apt install verilator or brew install verilator)

Step 1: Generate Test Vectors

Export test vectors from your trained model using Python:

import torch
import numpy as np
from torchlogix import CompiledLogicNet

# Load your trained model
model = ...  # Your trained TorchLogix model

# Generate test vectors
compiled = CompiledLogicNet(model, input_shape=(8,), use_bitpacking=False, num_bits=1)
compiled.compile()

# Generate random binary test cases
num_tests = 100
input_size = 8  # Match your model's input size
test_inputs = np.random.randint(0, 2, (num_tests, input_size), dtype=np.int8)

# Get expected outputs
test_outputs = []
for inp in test_inputs:
    out = compiled.forward(inp.reshape(1, -1))
    test_outputs.append(out[0])

# Save to files for the testbench. $readmemb treats each whitespace-separated
# token as one memory word, so write each vector's bits with no separator.
np.savetxt('test_vectors_input.txt', test_inputs, fmt='%d', delimiter='')
np.savetxt('test_vectors_output.txt', np.array(test_outputs), fmt='%d', delimiter='')

print(f"Generated {num_tests} test vectors")

Step 2: Create a Verilog Testbench

Create a testbench file tb_logic_net.v for combinational designs:

`timescale 1ns/1ps

module tb_logic_net;
    // Parameters
    parameter INPUT_WIDTH = 8;
    parameter OUTPUT_WIDTH = 2;
    parameter NUM_TESTS = 100;

    // Signals
    reg [INPUT_WIDTH-1:0] inp;
    wire [OUTPUT_WIDTH-1:0] out;

    // Expected output
    reg [OUTPUT_WIDTH-1:0] expected_out;

    // Test vectors
    reg [INPUT_WIDTH-1:0] test_inputs [0:NUM_TESTS-1];
    reg [OUTPUT_WIDTH-1:0] test_outputs [0:NUM_TESTS-1];

    integer i;
    integer errors;

    // Instantiate the DUT (Device Under Test)
    logic_net dut (
        .inp(inp),
        .out(out)
    );

    // Load test vectors
    initial begin
        $readmemb("test_vectors_input.txt", test_inputs);
        $readmemb("test_vectors_output.txt", test_outputs);
        errors = 0;
    end

    // Test stimulus
    initial begin
        $display("Starting testbench...");
        $display("Time\t\tInput\t\tOutput\t\tExpected\tStatus");
        $display("----\t\t-----\t\t------\t\t--------\t------");

        // Run through all test vectors
        for (i = 0; i < NUM_TESTS; i = i + 1) begin
            inp = test_inputs[i];
            expected_out = test_outputs[i];
            #10;  // Wait 10ns for combinational logic to settle

            // Check output
            if (out !== expected_out) begin
                $display("%0t\t%b\t%b\t%b\t\tFAIL", $time, inp, out, expected_out);
                errors = errors + 1;
            end else begin
                $display("%0t\t%b\t%b\t%b\t\tPASS", $time, inp, out, expected_out);
            end
        end

        // Summary
        $display("\n========================================");
        $display("Test Summary");
        $display("========================================");
        $display("Total tests: %0d", NUM_TESTS);
        $display("Passed:      %0d", NUM_TESTS - errors);
        $display("Failed:      %0d", errors);

        if (errors == 0) begin
            $display("\nALL TESTS PASSED!");
        end else begin
            $display("\nSOME TESTS FAILED!");
        end

        $finish;
    end

    // Optional: Generate VCD waveform dump
    initial begin
        $dumpfile("tb_logic_net.vcd");
        $dumpvars(0, tb_logic_net);
    end

endmodule

Testbench for Pipelined Designs

For pipelined designs, you need to account for latency:

module tb_pipelined_logic_net;
    parameter PIPELINE_LATENCY = 4;  // Match your pipeline_stages
    parameter NUM_TESTS = 100;

    reg clk = 0;
    reg rst = 1;
    reg [7:0] inp;
    wire [1:0] out;

    // Test vectors and bookkeeping
    reg [7:0] test_inputs [0:NUM_TESTS-1];
    reg [1:0] test_outputs [0:NUM_TESTS-1];
    reg [1:0] expected_out;
    integer i;
    integer errors;

    // Generate clock (100 MHz)
    always #5 clk = ~clk;

    logic_net dut (
        .clk(clk),
        .rst(rst),
        .inp(inp),
        .out(out)
    );

    initial begin
        $readmemb("test_vectors_input.txt", test_inputs);
        $readmemb("test_vectors_output.txt", test_outputs);
        errors = 0;

        // Reset sequence
        rst = 1;
        #20 rst = 0;  // Release reset after 2 cycles

        // Run tests with pipeline latency
        for (i = 0; i < NUM_TESTS; i = i + 1) begin
            inp = test_inputs[i];

            // Wait for pipeline to fill
            repeat(PIPELINE_LATENCY) @(posedge clk);

            expected_out = test_outputs[i];

            // Check output
            if (out !== expected_out) begin
                $display("FAIL: Input %b -> Output %b (expected %b)",
                         test_inputs[i], out, expected_out);
                errors = errors + 1;
            end
        end

        // Summary...
        $finish;
    end
endmodule

Step 3: Simulate with Different Tools

Option A: Vivado Simulator (xsim)

# Compile the design
xvlog logic_net.v
xvlog tb_logic_net.v

# Elaborate
xelab -debug typical tb_logic_net -s tb_logic_net_sim

# Run simulation
xsim tb_logic_net_sim -runall

# View waveforms (GUI mode)
xsim tb_logic_net_sim -gui

Option B: Icarus Verilog

# Compile and run in one step
iverilog -o sim.out logic_net.v tb_logic_net.v
vvp sim.out

# View waveforms with GTKWave
gtkwave tb_logic_net.vcd

Option C: ModelSim/QuestaSim

# Create work library
vlib work

# Compile sources
vlog logic_net.v
vlog tb_logic_net.v

# Simulate
vsim -c tb_logic_net -do "run -all; quit"

# Or run with GUI
vsim tb_logic_net
# In ModelSim GUI: run -all

Step 4: Analyze Results

Success Criteria

  • All test vectors should produce matching outputs

  • No ‘X’ or ‘Z’ values in outputs (indicates uninitialized or high-impedance states)

  • Combinational delay should be minimal (typically < 1ns for simple gates)

Common Issues

Mismatched outputs:

  • Verify test vector format (binary vs decimal)

  • Check that Verilog module name matches instantiation in testbench

  • Ensure input/output widths match

X or Z values:

  • Usually indicates undriven wires

  • Check all wires in generated Verilog have assignments

Compilation errors:

  • Verify Verilog syntax

  • Check for Verilog-1995 vs Verilog-2001 compatibility issues
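When debugging mismatches, it can help to diff expected vectors against simulator output in Python rather than reading testbench logs. The sketch below assumes your testbench dumps its outputs to a text file in the same one-row-per-test format as Step 1 (the demo file names are hypothetical; adapt them to your flow):

```python
import numpy as np

def compare_vectors(expected_file, actual_file):
    """Return the indices of test vectors whose simulated outputs
    differ from the expected outputs exported in Step 1."""
    expected = np.loadtxt(expected_file, dtype=np.int64, ndmin=2)
    actual = np.loadtxt(actual_file, dtype=np.int64, ndmin=2)
    if expected.shape != actual.shape:
        raise ValueError(f"shape mismatch: {expected.shape} vs {actual.shape}")
    return np.nonzero((expected != actual).any(axis=1))[0].tolist()

# Demo with synthetic data; in practice compare test_vectors_output.txt
# against whatever file your testbench writes:
np.savetxt("expected_demo.txt", [[0, 1], [1, 0], [1, 1]], fmt="%d")
np.savetxt("actual_demo.txt",   [[0, 1], [1, 1], [1, 1]], fmt="%d")
print(compare_vectors("expected_demo.txt", "actual_demo.txt"))  # [1]
```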

Self-Checking Testbench

For automated testing, create a self-checking testbench that exits with an error code:

initial begin
    // ... run tests ...

    if (errors != 0) begin
        $display("FAIL: %0d errors detected", errors);
        $fatal(1, "Test failed");  // Exit with error
    end else begin
        $display("PASS: All tests passed");
    end
    $finish;
end

Then use in scripts:

#!/bin/bash
iverilog -o sim.out logic_net.v tb_logic_net.v && vvp sim.out
if [ $? -eq 0 ]; then
    echo "Simulation PASSED"
else
    echo "Simulation FAILED"
    exit 1
fi

Synthesis with Vivado

Synthesis converts Verilog RTL code into gate-level netlists optimized for specific FPGA parts, providing resource estimates, timing analysis, and power consumption data.

Overview

Synthesis provides:

  • Resource Estimates: LUTs, FFs, DSPs, BRAM usage

  • Timing Analysis: Maximum frequency (fmax), Worst Negative Slack (WNS), critical paths

  • Power Estimates: Static and dynamic power consumption

Prerequisites

  • Vivado Design Suite installed (tested with 2019.1+)

  • Generated Verilog file from TorchLogix (e.g., logic_net.v)

  • Target FPGA part number (e.g., xc7z020clg400-1 for Pynq-Z2)

Quick Start

# Navigate to your Verilog output directory
cd verilog_output/

# Run synthesis with TCL script
vivado -mode batch -source synthesize.tcl -tclargs logic_net.v xc7z020clg400-1

# Results will be in synthesis_reports/
ls synthesis_reports/
# utilization.txt  timing.txt  power.txt

Synthesis Methods

Method 2: Vivado GUI

  1. Launch Vivado: vivado

  2. Create New Project

    • Click “Create Project”

    • Choose project location

    • Select “RTL Project”

  3. Add Verilog Source

    • In Flow Navigator: “Add Sources”

    • “Add or create design sources”

    • Add your logic_net.v file

  4. Select Target Part

    • Common parts:

      • Pynq-Z2: xc7z020clg400-1

      • ZCU104: xczu7ev-ffvc1156-2-e

      • Artix-7: xc7a35tcpg236-1

  5. Run Synthesis

    • In Flow Navigator: “Run Synthesis”

  6. View Reports

    • After synthesis: “Open Synthesized Design”

    • Reports → Utilization, Timing Summary

Method 3: Vivado Tcl Console

vivado -mode tcl

Then in Tcl console:

# Create in-memory project
create_project -in_memory -part xc7z020clg400-1

# Read Verilog
read_verilog logic_net.v

# Run synthesis
synth_design -top logic_net -mode out_of_context

# Generate reports
report_utilization -file utilization.txt
report_timing_summary -file timing.txt
report_power -file power.txt

exit

Understanding the Reports

Utilization Report

Shows FPGA resource usage:

+-------------------------+------+-------+-----------+-------+
|        Site Type        | Used | Fixed | Available | Util% |
+-------------------------+------+-------+-----------+-------+
| Slice LUTs              |  147 |     0 |     53200 |  0.28 |
|   LUT as Logic          |  147 |     0 |     53200 |  0.28 |
|   LUT as Memory         |    0 |     0 |     17400 |  0.00 |
| Slice Registers         |    0 |     0 |    106400 |  0.00 |
|   Register as Flip Flop |    0 |     0 |    106400 |  0.00 |
|   Register as Latch     |    0 |     0 |    106400 |  0.00 |
| F7 Muxes                |    8 |     0 |     26600 |  0.03 |
| F8 Muxes                |    2 |     0 |     13300 |  0.02 |
+-------------------------+------+-------+-----------+-------+

Key Metrics:

  • Slice LUTs: Primary logic resource. Each LUT can implement any 6-input Boolean function.

  • Slice Registers: Flip-flops for sequential logic (0 for purely combinational designs)

  • F7/F8 Muxes: Larger multiplexers for wide logic functions

  • DSPs: Digital Signal Processing blocks (typically 0 for logic gate networks)

  • BRAM: Block RAM (typically 0 for logic gate networks)

Timing Report

Shows timing analysis:

Timing Summary (ns)
-------------------
WNS(ns)      TNS(ns)      WHS(ns)      THS(ns)      WPWS(ns)     TPWS(ns)
-------      -------      -------      -------      --------     --------
  7.500        0.000        0.300        0.000         3.500        0.000

Key Metrics:

  • WNS (Worst Negative Slack): Most critical timing margin

    • Positive = timing met

    • Negative = timing violation

  • Critical Path Delay: Longest combinational path through the design

For combinational designs:

# Get maximum delay through the design
report_timing -delay_type min_max -max_paths 10 -file timing_paths.txt

Example output:

Critical Path Delay: 4.832 ns
Maximum Frequency: 206.95 MHz (if this were in a clocked design)

Power Report

Shows estimated power consumption:

Total On-Chip Power (W)  : 0.082
Dynamic (W)              : 0.012
Device Static (W)        : 0.070

Interpreting Results for TorchLogix Models

Resource Estimates

LUT Count Interpretation:

  • Each logic gate (AND, OR, XOR, etc.) typically maps to a fractional LUT

  • Expect roughly 0.5-1 LUT per gate operation

  • Tree structures may share LUTs efficiently

  • A 1000-gate network might use 500-800 LUTs
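The rule of thumb above can be wrapped in a quick pre-synthesis estimator. This is a rough sketch, not a TorchLogix utility, and real LUT counts depend on the synthesis tool and how well gates pack into 6-input LUTs:

```python
def estimate_luts(num_gates, luts_per_gate=(0.5, 0.8)):
    """Rough pre-synthesis LUT range using the 0.5-0.8 LUTs-per-gate
    rule of thumb; treat as a sanity check, not a prediction."""
    lo, hi = luts_per_gate
    return int(num_gates * lo), int(num_gates * hi)

lo, hi = estimate_luts(1000)
print(f"~{lo}-{hi} LUTs for a 1000-gate network")  # matches the example above
```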

Resource Scaling:

  • Linear layers: O(neurons) LUTs

  • Convolutional layers: O(kernels × receptive_field²) LUTs

  • Deeper trees → more efficient LUT packing

Pipelining Impact:

  • Adds flip-flops (FFs) for registers

  • May actually reduce LUTs through better optimization

  • Example: 50-layer model with 4 pipeline stages might use 20% fewer LUTs

Latency Estimates

Combinational Delay:

  • Total latency = critical path delay

  • Typical delays:

    • Simple AND/OR/XOR: 0.1-0.2 ns per gate

    • Deep trees (10 levels): 2-4 ns total

    • Very deep networks: 5-10 ns

Maximum Frequency:

fmax = 1 / critical_path_delay

Example: 4.5 ns critical path → fmax ≈ 222 MHz
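The conversion is just a unit change (period in ns to frequency in MHz), shown here as a one-liner reproducing both figures used in this section:

```python
def fmax_mhz(critical_path_ns):
    """fmax = 1 / critical_path_delay, with ns -> MHz conversion."""
    return 1000.0 / critical_path_ns

print(round(fmax_mhz(4.5), 1))    # the 4.5 ns example above
print(round(fmax_mhz(4.832), 2))  # the critical path from the timing report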

Pipelined vs Combinational Designs

Combinational (pipeline_stages=0)

verilog = compiled.get_verilog_code(pipeline_stages=0)  # Default

  • No clock or reset signals

  • All logic in single cycle

  • Synthesis uses -mode out_of_context

  • Problem: May fail for large models (>1M Verilog lines)

Timing Analysis:

Combinational (pipeline_stages=0):
  Critical path: 25 ns
  fmax: 40 MHz

Pipelined (pipeline_stages>0)

verilog = compiled.get_verilog_code(pipeline_stages=4)  # 4 stages

  • Has clock (clk) and reset (rst) signals

  • Logic divided into N pipeline stages

  • Synthesis auto-detects and applies clock constraints

  • Solution: Enables synthesis of very large models

Timing Analysis:

Pipelined (pipeline_stages=4):
  Critical path: 6 ns
  fmax: 166 MHz
  Latency: 4 cycles = 24 ns @ 166 MHz

When to use pipelining:

  • Synthesis fails or runs for hours → Use pipeline_stages=1 or more

  • Model >20 layers → Consider pipeline_stages=4-8

  • Model >100 layers → Use pipeline_stages=16 or higher

Synthesis Strategies

For Minimum Latency

# -no_lc disables LUT combining, which can help timing
synth_design -top logic_net -mode out_of_context \
    -directive PerformanceOptimized \
    -no_lc

For Minimum Area

# -shreg_min_size raises the threshold for shift-register extraction
synth_design -top logic_net -mode out_of_context \
    -directive AreaOptimized_high \
    -shreg_min_size 5

For Balanced Results

synth_design -top logic_net -mode out_of_context \
    -directive Default

Common Issues and Solutions

Issue: “Multi-driven net” Error

Cause: Multiple assign statements to the same wire. Solution: Check generated Verilog for duplicate assignments.

Issue: Unrealistically High fmax

Cause: No input/output delays specified. Solution: Add timing constraints with realistic I/O delays.

Issue: Very High LUT Count

Cause: Unoptimized gate structure or deep logic trees. Solution:

  • Check for constant propagation opportunities

  • Verify gate operations are using optimal Boolean functions

  • Consider factoring logic differently in training

Next Steps After Synthesis

  1. Implementation: Run place & route for final timing/resource numbers

opt_design
place_design
route_design
report_timing_summary -file post_route_timing.txt
report_utilization -file post_route_utilization.txt

  2. Bitstream Generation: Create FPGA configuration file

write_bitstream -force logic_net.bit

  3. Hardware Testing: Deploy to actual FPGA and validate


Complete Workflow Examples

Example 1: Small Model - Full Workflow

Train, export, test, and synthesize a small MNIST classifier:

#!/usr/bin/env python3
import torch
import torch.nn as nn
from torchlogix.layers import LogicDense, GroupSum
from torchlogix import CompiledLogicNet
import numpy as np
import subprocess

# 1. Create and train model
model = nn.Sequential(
    LogicDense(784, 128, connections="fixed", device="cpu"),
    LogicDense(128, 128, connections="fixed", device="cpu"),
    LogicDense(128, 100, connections="fixed", device="cpu"),
    GroupSum(10, tau=10.0)
)

# Train your model here...
# model.load_state_dict(torch.load('trained_model.pt'))

# 2. Compile and export Verilog
compiled = CompiledLogicNet(
    model,
    input_shape=(784,),
    use_bitpacking=False,
    num_bits=1
)

# Small model: fully combinational
verilog = compiled.get_verilog_code(
    module_name="mnist_classifier",
    pipeline_stages=0
)

with open('mnist_classifier.v', 'w') as f:
    f.write(verilog)

print("✓ Generated mnist_classifier.v")

# 3. Generate test vectors
num_tests = 100
test_inputs = np.random.randint(0, 2, (num_tests, 784), dtype=np.int8)
test_outputs = []

compiled.compile()
for inp in test_inputs:
    out = compiled.forward(inp.reshape(1, -1))
    test_outputs.append(out[0])

np.savetxt('test_vectors_input.txt', test_inputs, fmt='%d', delimiter='')
np.savetxt('test_vectors_output.txt', np.array(test_outputs), fmt='%d', delimiter='')

print(f"✓ Generated {num_tests} test vectors")

# 4. Run simulation (using Icarus Verilog)
subprocess.run([
    'iverilog', '-o', 'sim.out',
    'mnist_classifier.v', 'tb_mnist_classifier.v'
])
result = subprocess.run(['vvp', 'sim.out'], capture_output=True, text=True)

if "ALL TESTS PASSED" in result.stdout:
    print("✓ Simulation passed")
else:
    print("✗ Simulation failed")
    exit(1)

# 5. Run synthesis
subprocess.run([
    'vivado', '-mode', 'batch',
    '-source', 'synthesize.tcl',
    '-tclargs', 'mnist_classifier.v', 'xc7z020clg400-1', 'reports/'
])

print("✓ Synthesis complete - check reports/ directory")

Example 2: Large Model with Pipelining

#!/usr/bin/env python3
from torchlogix import CompiledLogicNet
import subprocess

# Load large pre-trained model
model = ...  # 50+ layers

compiled = CompiledLogicNet(model, input_shape=(784,), use_bitpacking=False, num_bits=1)

# Try different pipeline configurations
configs = [
    (0, "Fully combinational"),
    (1, "Single output register"),
    (4, "4 pipeline stages"),
    (8, "8 pipeline stages"),
]

for pipeline_stages, description in configs:
    print(f"\n{'='*60}")
    print(f"Testing: {description} (pipeline_stages={pipeline_stages})")
    print(f"{'='*60}")

    # Generate Verilog
    verilog = compiled.get_verilog_code(
        module_name=f"large_model_p{pipeline_stages}",
        pipeline_stages=pipeline_stages
    )

    filename = f"large_model_p{pipeline_stages}.v"
    with open(filename, 'w') as f:
        f.write(verilog)

    print(f"✓ Generated {filename} ({len(verilog)} bytes)")

    # Synthesize
    result = subprocess.run([
        'vivado', '-mode', 'batch',
        '-source', 'synthesize.tcl',
        '-tclargs', filename, 'xc7z020clg400-1', f'reports_p{pipeline_stages}/'
    ], capture_output=True, text=True, timeout=600)

    if result.returncode == 0:
        print(f"✓ Synthesis succeeded")

        # Parse results
        with open(f'reports_p{pipeline_stages}/summary.txt') as f:
            print(f.read())
    else:
        print(f"✗ Synthesis failed or timed out")

print("\n" + "="*60)
print("Compare results in reports_p*/ directories")
print("="*60)

Example 3: Convolutional Model

#!/usr/bin/env python3
import torch.nn as nn
from torchlogix.layers import LogicConv2d, OrPooling, LogicDense, GroupSum
from torchlogix import CompiledLogicNet

# Create convolutional model
model = nn.Sequential(
    LogicConv2d(
        in_dim=(28, 28),
        channels=1,
        num_kernels=16,
        tree_depth=3,
        receptive_field_size=5,
        padding=2,
        connections="fixed",
        device="cpu"
    ),
    OrPooling(kernel_size=2, stride=2),

    LogicConv2d(
        in_dim=(14, 14),
        channels=16,
        num_kernels=32,
        tree_depth=3,
        receptive_field_size=3,
        padding=1,
        connections="fixed",
        device="cpu"
    ),
    OrPooling(kernel_size=2, stride=2),

    nn.Flatten(),
    LogicDense(32*7*7, 256, connections="fixed", device="cpu"),
    LogicDense(256, 100, connections="fixed", device="cpu"),
    GroupSum(10, tau=10.0)
)

# Train model...

# Export with pipelining (medium-sized model)
compiled = CompiledLogicNet(
    model,
    input_shape=(1, 28, 28),
    use_bitpacking=False,
    num_bits=1
)

verilog = compiled.get_verilog_code(
    module_name="conv_classifier",
    pipeline_stages=4  # 4 stages for medium conv model
)

compiled.export_hdl(
    "./conv_verilog_output",
    module_name="conv_classifier",
    pipeline_stages=4
)

print("✓ Exported convolutional model to conv_verilog_output/")

Example 4: Python-Driven Synthesis Loop

Automate synthesis and collect metrics:

#!/usr/bin/env python3
import subprocess
import re
import pandas as pd

from torchlogix import CompiledLogicNet

def synthesize_and_extract_metrics(verilog_file, part, report_dir):
    """Run Vivado synthesis and extract key metrics."""

    # Run synthesis; treat a timeout the same as a failed run
    try:
        result = subprocess.run([
            'vivado', '-mode', 'batch',
            '-source', 'synthesize.tcl',
            '-tclargs', verilog_file, part, report_dir
        ], capture_output=True, text=True, timeout=600)
    except subprocess.TimeoutExpired:
        return None

    if result.returncode != 0:
        return None

    # Parse summary
    metrics = {}
    with open(f'{report_dir}/summary.txt', 'r') as f:
        for line in f:
            if 'LUTs:' in line:
                metrics['luts'] = int(re.search(r'\d+', line).group())
            elif 'FFs:' in line:
                metrics['ffs'] = int(re.search(r'\d+', line).group())

    # Parse timing
    with open(f'{report_dir}/timing.txt', 'r') as f:
        content = f.read()
        wns_match = re.search(r'WNS\(ns\)\s+([-\d.]+)', content)
        if wns_match:
            metrics['wns_ns'] = float(wns_match.group(1))

    return metrics

# Run experiments (`model` is the trained network from the earlier examples)
compiled = CompiledLogicNet(model, input_shape=(784,), use_bitpacking=False, num_bits=1)

results = []
for stages in [0, 1, 2, 4, 8, 16]:
    print(f"Synthesizing with pipeline_stages={stages}...")

    verilog = compiled.get_verilog_code(
        module_name=f'model_p{stages}',
        pipeline_stages=stages
    )

    filename = f'model_p{stages}.v'
    with open(filename, 'w') as f:
        f.write(verilog)

    metrics = synthesize_and_extract_metrics(
        filename,
        'xc7z020clg400-1',
        f'reports_p{stages}'
    )

    if metrics:
        metrics['pipeline_stages'] = stages
        results.append(metrics)
        print(f"  LUTs: {metrics['luts']}, FFs: {metrics['ffs']}, WNS: {metrics.get('wns_ns', 'n/a')} ns")

# Create summary DataFrame
df = pd.DataFrame(results)
df.to_csv('synthesis_comparison.csv', index=False)

print("\n" + "="*60)
print("Synthesis Comparison")
print("="*60)
print(df)

Advanced Topics

Custom Timing Constraints

For more accurate timing analysis, create custom constraints in constraints.xdc:

# Virtual clock for timing analysis (10ns = 100 MHz)
create_clock -period 10.000 -name virtual_clk

# Input delay (assume inputs arrive 2ns after clock edge)
set_input_delay -clock virtual_clk 2.000 [get_ports inp*]

# Output delay (assume outputs must be stable 2ns before next clock edge)
set_output_delay -clock virtual_clk 2.000 [get_ports out*]

# For pipelined designs with an actual 200 MHz clock
create_clock -period 5.000 -name clk [get_ports clk]

# Relax timing on reset path
set_false_path -from [get_ports rst]

Load in synthesis:

read_xdc constraints.xdc
synth_design -top logic_net

Comparing C Code and Verilog Implementations

TorchLogix can generate both C code (for HLS or CPU) and Verilog:

from torchlogix import CompiledLogicNet

compiled = CompiledLogicNet(model, input_shape=(784,), use_bitpacking=False, num_bits=1)

# Generate C code
c_code = compiled.get_c_code()
with open('model.c', 'w') as f:
    f.write(c_code)

# Generate Verilog
verilog_code = compiled.get_verilog_code()
with open('model.v', 'w') as f:
    f.write(verilog_code)

# Compile C for CPU execution
compiled.compile(compiler='gcc', optimization_level='-O3')

# Now you can compare:
# - C compiled to native CPU code
# - C compiled through Vivado HLS to RTL
# - Direct Verilog synthesis

# All three should produce functionally identical results
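The claim that all three paths agree is easy to check mechanically. A minimal equivalence harness might look like the sketch below; the two lambdas are toy stand-ins, and in practice you would wrap the compiled C binary, the HLS co-simulation, and the RTL simulation behind the same callable signature (bit list in, bit list out):

```python
import random

def check_equivalence(backends, input_bits, n_trials=100, seed=0):
    """Feed the same random bit vectors to every backend and confirm
    that all of them produce identical output bits."""
    rng = random.Random(seed)
    for _ in range(n_trials):
        x = [rng.randint(0, 1) for _ in range(input_bits)]
        outputs = [tuple(backend(x)) for backend in backends]
        if len(set(outputs)) != 1:
            return False, x  # disagreement found; return the failing input
    return True, None

# Toy backends standing in for the CPU, HLS, and RTL paths:
xor_a = lambda x: [x[0] ^ x[1]]
xor_b = lambda x: [(x[0] + x[1]) % 2]
ok, counterexample = check_equivalence([xor_a, xor_b], input_bits=2)
print(ok)  # True: both implementations agree on every input
```

Fixing the seed keeps the check reproducible, and returning the failing input vector gives you a ready-made counterexample to replay in a waveform viewer.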

Integration with C→HLS Pipeline

For users who want to explore the HLS route:

# Generate optimized C code
compiled = CompiledLogicNet(model)
c_code = compiled.get_c_code()

with open('model.c', 'w') as f:
    f.write(c_code)

Then use Vivado HLS:

# hls_script.tcl
open_project hls_project
set_top model_top
add_files model.c
open_solution "solution1"
set_part {xc7z020clg400-1}
create_clock -period 10

csynth_design
export_design -format ip_catalog
exit

Run the script from a shell:

vivado_hls -f hls_script.tcl

This generates an IP core that can be integrated into larger FPGA designs.

Layer Support Status and Future Work

Currently Supported

| Layer       | Status      | Implementation               |
|-------------|-------------|------------------------------|
| LogicDense  | ✅ Complete | Direct gate synthesis        |
| LogicConv2d | ✅ Complete | Binary tree structure        |
| LogicConv3d | ✅ Complete | Binary tree with 3D indexing |
| Flatten     | ✅ Complete | Wire passthrough             |

In Progress / TODO

| Layer     | Status  | Notes                            |
|-----------|---------|----------------------------------|
| OrPooling | ⚠️ TODO | OR reduction tree needed         |
| GroupSum  | ⚠️ TODO | Adder tree implementation needed |

Models that use unsupported layers will produce placeholders in the generated Verilog. For production use:

  • Use C code generation for complete model support

  • Wait for future releases with complete Verilog support

  • Manually implement missing layers if needed
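For readers who want to fill the OrPooling gap themselves, the missing piece is exactly the "OR reduction tree" noted in the table. A hypothetical generator (not part of TorchLogix) that emits a balanced OR-reduction over a pooling window as Verilog assigns could look like:

```python
def or_reduction_tree(wires, prefix="orp"):
    """Emit Verilog continuous assigns for a balanced OR-reduction
    over `wires`. Returns (assign_lines, name_of_final_wire)."""
    lines = []
    level = 0
    while len(wires) > 1:
        nxt = []
        for i in range(0, len(wires) - 1, 2):
            name = f"{prefix}_l{level}_{i // 2}"
            lines.append(f"assign {name} = {wires[i]} | {wires[i + 1]};")
            nxt.append(name)
        if len(wires) % 2:  # odd wire passes through to the next level
            nxt.append(wires[-1])
        wires = nxt
        level += 1
    return lines, wires[0]

lines, out = or_reduction_tree(["w0", "w1", "w2", "w3"])
for line in lines:
    print(line)
print("// pooled output:", out)
```

A real integration would also emit the matching `wire` declarations and map the pooling window's spatial indices onto `wires`; the sketch covers only the reduction itself.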

Optimization Tips

Model Architecture:

  • Prefer models with 20-100 layers for good synthesis results

  • Very deep models (>200 layers) require aggressive pipelining

  • Conv layers with large receptive fields may need optimization

Pipeline Configuration:

  • Start with pipeline_stages=0 for small models

  • Increase incrementally if synthesis fails or is slow

  • Use full pipelining (pipeline_stages=999) for maximum fmax
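The incremental approach above is easy to automate. This is a hypothetical back-off loop, not a TorchLogix API: `synthesize(stages)` stands in for any function that generates Verilog at the given depth, runs synthesis, and returns True on success:

```python
def find_min_pipeline_stages(synthesize, candidates=(0, 1, 2, 4, 8, 16)):
    """Try increasing pipeline depths until synthesis succeeds.
    `synthesize(stages)` should return True on success."""
    for stages in candidates:
        if synthesize(stages):
            return stages
    return None  # even the deepest candidate pipeline failed

# Fake synthesis result for illustration: succeeds once stages >= 4
picked = find_min_pipeline_stages(lambda s: s >= 4)
print(picked)  # 4
```

Because the candidates are tried shallowest-first, the loop returns the minimum depth that synthesizes, which keeps latency as low as the design allows.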

Synthesis Directives:

  • Use -directive PerformanceOptimized for speed

  • Use -directive AreaOptimized_high for small FPGAs

  • Experiment with different strategies for your specific design

Resource Utilization:

  • Target <70% LUT utilization for good place-and-route results

  • Very high utilization (>90%) may cause routing failures

  • Consider model size vs available FPGA resources during training
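The rule-of-thumb thresholds above can be folded into a small post-synthesis check. The helper below is an illustrative sketch; the usage numbers are made up, apart from the xc7z020's 53,200 available LUTs:

```python
def utilization_report(used_luts, available_luts):
    """Classify LUT utilization against the rule-of-thumb thresholds:
    <=70% comfortable, 70-90% tight, >90% likely routing failure."""
    pct = 100.0 * used_luts / available_luts
    if pct > 90:
        status = "likely routing failure"
    elif pct > 70:
        status = "tight; place-and-route may struggle"
    else:
        status = "comfortable"
    return pct, status

# xc7z020 has 53,200 LUTs; suppose synthesis reported 21,300 used
pct, status = utilization_report(21_300, 53_200)
print(f"{pct:.1f}% LUTs: {status}")  # 40.0% LUTs: comfortable
```

Hooking this into the metrics dictionary from Example 4 gives an immediate red flag when a pipeline sweep pushes a design past the routable range.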


Summary

TorchLogix provides comprehensive support for deploying logic gate networks to FPGA hardware:

  1. Direct Verilog Generation: Export trained models to readable, synthesizable Verilog RTL

  2. Configurable Pipelining: Balance latency and synthesis complexity with flexible pipeline stages

  3. Testing Support: Generate test vectors and verify functionality with standard simulators

  4. Synthesis Integration: Automate synthesis with Vivado to obtain resource and timing estimates

  5. Complete Workflows: End-to-end examples from training to hardware deployment

Key Takeaways:

  • Use pipeline_stages=0 for small models, increase for larger ones

  • Always test generated Verilog with simulations before synthesis

  • Synthesis provides accurate resource and timing estimates before hardware deployment

  • Both Verilog and C code generation are supported for maximum flexibility

For questions or issues with hardware deployment, please refer to the TorchLogix repository or open an issue.