# Hardware Deployment Guide This guide explains how to deploy TorchLogix models to FPGAs and other hardware platforms using direct Verilog/RTL generation. ## Overview TorchLogix can generate hardware descriptions (Verilog RTL) directly from trained models, enabling deployment to FPGAs and ASIC implementations. This provides an alternative to the traditional C→HLS→RTL pipeline and offers several advantages for logic gate networks. ### Why Hardware Deployment? **Benefits**: - **Ultra-low latency**: FPGA implementations can achieve sub-microsecond inference - **High throughput**: Massive parallelism enables processing thousands of inputs per second - **Energy efficiency**: Specialized hardware is more power-efficient than general-purpose CPUs/GPUs - **Deterministic timing**: Predictable performance for real-time applications **Use Cases**: - Low-latency inference at network edge - Real-time signal processing and control systems - High-throughput batch processing - Embedded systems with strict power budgets ### Design Approach: Direct Gate-Level Synthesis TorchLogix uses a **direct gate-level synthesis** approach rather than LUT-based truth tables: ``` TorchLogix Model (PyTorch) ↓ Gate Operations (AND, OR, XOR, etc.) ↓ Verilog Expressions ↓ FPGA Synthesis ``` **Why direct gates instead of LUTs?** 1. **Natural mapping**: TorchLogix already defines 16 gate operations 2. **Better optimization**: Modern synthesis tools optimize gate-level HDL effectively 3. **More readable**: `assign out = a & b;` is clearer than case statements 4. **Binary structure**: Each neuron has exactly 2 inputs → perfect for direct gates ### Comparison: Direct Verilog vs C→HLS | Aspect | Direct Verilog | C→HLS→RTL | |--------|---------------|-----------| | **Generation** | Direct from model | Compile C, then HLS | | **Intermediate** | None | C code + HLS directives | | **Control** | Full RTL control | HLS tool dependent | | **Readability** | Gate-level, explicit | High-level C abstractions | | **Use Case** | FPGA-specific deployment | Cross-platform (CPU + FPGA) | Both approaches are supported by TorchLogix. Use direct Verilog for FPGA-specific optimization and C→HLS for flexibility. ### Supported Layer Types | Layer Type | Verilog Support | Notes | |------------|-----------------|-------| | **LogicDense** | ✅ Fully Supported | Direct gate synthesis | | **LogicConv2d** | ✅ Fully Supported | Binary tree structure | | **LogicConv3d** | ✅ Fully Supported | Binary tree with 3D indexing | | **Flatten** | ✅ Supported | Wire passthrough | | **OrPooling** | ⚠️ TODO | Recognized but not yet generated | | **GroupSum** | ⚠️ TODO | Recognized but not yet generated | Models using unsupported layers can still generate Verilog for supported portions, or use C code generation as an alternative. --- ## Basic Verilog Export ### Quick Start Generate Verilog from any trained TorchLogix model: ```python import torch import torch.nn as nn from torchlogix.layers import LogicDense, GroupSum from torchlogix import CompiledLogicNet # Create or load your model model = nn.Sequential( LogicDense(8, 32, connections="fixed", device="cpu"), LogicDense(32, 32, connections="fixed", device="cpu"), GroupSum(1, tau=1.0) ) # Compile the model compiled = CompiledLogicNet( model, input_shape=(8,), use_bitpacking=False, num_bits=1 ) # Generate Verilog code verilog_code = compiled.get_verilog_code(module_name="my_logic_net") # Export to file compiled.export_hdl( output_dir="./verilog_output", module_name="my_logic_net", format="verilog" ) ``` ### API Reference #### `get_verilog_code(module_name, pipeline_stages)` Generates complete Verilog module as a string. **Parameters**: - `module_name` (str): Name of the top-level Verilog module (default: `"torchlogix_net"`) - `pipeline_stages` (int): Number of pipeline stages (default: `0`) - `0`: Fully combinational (no registers, 1 cycle latency) - `1`: Single output register (helps synthesis) - `N`: Divide layers into N pipeline stages (N cycle latency) - `len(layers)`: Full layer-level pipelining (highest fmax) **Returns**: Complete Verilog code as string #### `export_hdl(output_dir, module_name, format, pipeline_stages)` Exports Verilog to a file. **Parameters**: - `output_dir` (str): Directory to write Verilog file - `module_name` (str): Module name (default: `"torchlogix_net"`) - `format` (str): HDL format, currently only `"verilog"` supported (default: `"verilog"`) - `pipeline_stages` (int): Pipeline configuration (default: `0`) **Creates**: `{output_dir}/{module_name}.v` ### Understanding Generated Verilog #### Combinational Design (pipeline_stages=0) For a simple 2-layer network: ```verilog module logic_net ( input wire [7:0] inp, output wire [3:0] out ); // No clock or reset needed // Layer 0: LogicDense (4 neurons) wire [3:0] layer_0_out; assign layer_0_out[0] = (inp[0] & inp[2]); // AND gate assign layer_0_out[1] = (inp[1] | inp[3]); // OR gate assign layer_0_out[2] = (inp[4] ^ inp[5]); // XOR gate assign layer_0_out[3] = ~(inp[6] & inp[7]); // NAND gate // Layer 1: LogicDense (2 neurons) assign out[0] = (layer_0_out[0] | layer_0_out[1]); assign out[1] = (layer_0_out[2] ^ layer_0_out[3]); endmodule ``` **Characteristics**: - Pure combinational logic (no state) - No clock or reset signals - 1 cycle latency (output available same cycle as input) - Critical path spans entire network #### Pipelined Design (pipeline_stages=2) With pipeline registers: ```verilog module logic_net ( input wire clk, input wire rst, input wire [7:0] inp, output reg [3:0] out ); // Combinational wires wire [3:0] layer_0_comb; wire [3:0] out_comb; // Pipeline register reg [3:0] layer_0_out; // Layer 0: Combinational logic assign layer_0_comb[0] = (inp[0] & inp[2]); assign layer_0_comb[1] = (inp[1] | inp[3]); assign layer_0_comb[2] = (inp[4] ^ inp[5]); assign layer_0_comb[3] = ~(inp[6] & inp[7]); // Pipeline register after Layer 0 always @(posedge clk) begin if (rst) layer_0_out <= 4'd0; else layer_0_out <= layer_0_comb; end // Layer 1: Combinational logic assign out_comb[0] = (layer_0_out[0] | layer_0_out[1]); assign out_comb[1] = (layer_0_out[2] ^ layer_0_out[3]); // Output register always @(posedge clk) begin if (rst) out <= 4'd0; else out <= out_comb; end endmodule ``` **Characteristics**: - Synchronous design with clock and reset - Registers break up long combinational paths - N cycle latency (where N = pipeline_stages) - Higher maximum frequency (fmax) #### Gate Operations Supported All 16 two-input Boolean operations are supported: | Gate ID | Operation | Verilog Expression | |---------|-----------|-------------------| | 0 | Zero (constant) | `1'b0` | | 1 | AND | `(a & b)` | | 2 | A AND NOT B | `(a & ~b)` | | 3 | A (passthrough) | `a` | | 4 | NOT A AND B | `(~a & b)` | | 5 | B (passthrough) | `b` | | 6 | XOR | `(a ^ b)` | | 7 | OR | `(a \| b)` | | 8 | NOR | `~(a \| b)` | | 9 | XNOR | `~(a ^ b)` | | 10 | NOT B | `~b` | | 11 | B IMPLIES A | `(~b \| a)` | | 12 | NOT A | `~a` | | 13 | A IMPLIES B | `(~a \| b)` | | 14 | NAND | `~(a & b)` | | 15 | One (constant) | `1'b1` | --- ## Pipelining for Large Models ### The Problem: Large Combinational Designs By default, TorchLogix generates **fully combinational** Verilog where all logic executes in a single clock cycle. This works well for small models but causes serious problems for larger ones: #### Symptoms of Combinational Overload - Synthesis fails or runs for hours without completing - Verilog files >1M lines take forever to process - Very low maximum frequency (fmax < 50 MHz) - Timing closure failures (negative WNS) - "Design too large" errors from Vivado #### Why This Happens - Deep combinational paths through many layers - Synthesis tools struggle to optimize very large logic cones - Critical path delay grows with model depth - No natural break points for timing optimization ### The Solution: Pipeline Stages **Pipelining** inserts registers between layers to break up long combinational paths: ``` Combinational (pipeline_stages=0): Input → [Layer0 → Layer1 → Layer2 → Layer3] → Output All in 1 cycle, huge critical path Pipelined (pipeline_stages=4): Input → [Layer0] → REG → [Layer1] → REG → [Layer2] → REG → [Layer3] → REG → Output 4 cycles latency, short critical paths ``` #### Benefits - Synthesis succeeds even for very large models - Much faster synthesis time (minutes vs hours) - Higher maximum frequency (200+ MHz vs <50 MHz) - Predictable timing closure - Better resource utilization #### Trade-offs - Increased latency (N cycles instead of 1) - More flip-flops (registers consume area) - Need to handle clock and reset signals ### Pipeline Stage Options #### `pipeline_stages=0` - Fully Combinational (Default) ```python verilog = compiled.get_verilog_code(pipeline_stages=0) ``` - No registers, no clock required - 1 cycle latency - **Use for:** Small models (<10 layers), initial prototyping - **Avoid for:** Large models (synthesis will fail) #### `pipeline_stages=1` - Output Register Only ```python verilog = compiled.get_verilog_code(pipeline_stages=1) ``` - Single register at output - 1 cycle latency - **Use for:** Medium models where synthesis struggles but you need low latency - **Best for:** 10-30 layer models #### `pipeline_stages=N` - N Pipeline Stages ```python # 4 pipeline stages verilog = compiled.get_verilog_code(pipeline_stages=4) ``` - Layers divided into N groups, register after each group - N cycle latency - **Use for:** Large models (50-200 layers) - **Best for:** Balancing latency vs synthesis speed #### Full Layer-Level Pipelining ```python # Register between every layer num_layers = len([m for m in model.modules() if isinstance(m, (LogicDense, LogicConv2d))]) verilog = compiled.get_verilog_code(pipeline_stages=num_layers) # Or just use a large number verilog = compiled.get_verilog_code(pipeline_stages=999) ``` - Register after every single layer - Maximum possible fmax - Highest latency (= number of layers) - **Use for:** Very large models (>200 layers) or maximum throughput applications ### Choosing the Right Pipeline Configuration #### Decision Tree ``` Is synthesis failing or very slow? │ ├─ NO → Use pipeline_stages=0 (fully combinational) │ Lowest latency, simplest design │ └─ YES → How many layers in your model? │ ├─ <20 layers → pipeline_stages=1 │ (Output register only) │ ├─ 20-100 layers → pipeline_stages=4 to 8 │ (Balanced approach) │ └─ >100 layers → pipeline_stages=N/4 to N (N = number of layers) ``` #### Size Guidelines | Model Characteristics | Recommended Config | Latency | Benefits | |-----------------------|--------------------|---------|----------| | <10 layers, <100K Verilog lines | `pipeline_stages=0` | 1 cycle | Simple, low latency | | 10-30 layers, synthesis slow | `pipeline_stages=1` | 1 cycle | Helps synthesis | | 30-100 layers | `pipeline_stages=4` | 4 cycles | Good balance | | 100-200 layers | `pipeline_stages=8-16` | 8-16 cycles | Reliable synthesis | | >200 layers | `pipeline_stages=N/4` | N/4 cycles | Fast synthesis | | Maximum throughput needed | `pipeline_stages=999` | N cycles | Highest fmax | #### Empirical Testing Start conservative and increase pipelining if needed: ```python # Step 1: Try combinational verilog = compiled.get_verilog_code(pipeline_stages=0) # Try to synthesize... if it fails or is very slow: # Step 2: Add output register verilog = compiled.get_verilog_code(pipeline_stages=1) # Try to synthesize... if still slow: # Step 3: Increase stages for stages in [2, 4, 8, 16]: verilog = compiled.get_verilog_code(pipeline_stages=stages) # Synthesize and check timing/area trade-off ``` ### Performance Optimization #### Finding Optimal Pipeline Depth Run synthesis with different configurations and compare: ```python import subprocess results = [] for stages in [0, 1, 2, 4, 8, 16]: verilog = compiled.get_verilog_code( module_name=f'design_p{stages}', pipeline_stages=stages ) # Save Verilog with open(f'design_p{stages}.v', 'w') as f: f.write(verilog) # Synthesize (see Synthesis section for details) subprocess.run([ 'vivado', '-mode', 'batch', '-source', 'synthesize.tcl', '-tclargs', f'design_p{stages}.v', 'xc7z020clg400-1' ]) # Parse and compare results # results.append((stages, luts, ffs, fmax, synthesis_time)) # Find optimal trade-off based on your requirements ``` #### Common Issues **Issue: Pipelined design has lower fmax than expected** - **Cause:** Not enough pipeline stages, or uneven distribution - **Solution:** Increase `pipeline_stages` or try full layer-level pipelining **Issue: Too much area consumed by registers** - **Cause:** Too many pipeline stages for the model size - **Solution:** Reduce `pipeline_stages` to find balance **Issue: Synthesis still slow with pipelining** - **Cause:** Individual layers may still be very large - **Solution:** - Check if conv layers with large receptive fields need breaking up - Use more pipeline stages - Consider model architecture changes --- ## Testing Generated Verilog Functional testing and verification ensures your generated Verilog matches the expected behavior from the trained model. ### Prerequisites You'll need one of the following simulators: - **Vivado Simulator (xsim)** - Included with Vivado - **ModelSim/QuestaSim** - Commercial simulator from Mentor/Siemens - **Icarus Verilog** - Open-source, free (`apt install iverilog` or `brew install icarus-verilog`) - **Verilator** - Fast open-source simulator (`apt install verilator` or `brew install verilator`) ### Step 1: Generate Test Vectors Export test vectors from your trained model using Python: ```python import torch import numpy as np from torchlogix import CompiledLogicNet # Load your trained model model = ... # Your trained TorchLogix model # Generate test vectors compiled = CompiledLogicNet(model, input_shape=(8,), use_bitpacking=False, num_bits=1) compiled.compile() # Generate random binary test cases num_tests = 100 input_size = 8 # Match your model's input size test_inputs = np.random.randint(0, 2, (num_tests, input_size), dtype=np.int8) # Get expected outputs test_outputs = [] for inp in test_inputs: out = compiled.forward(inp.reshape(1, -1)) test_outputs.append(out[0]) # Save to files for testbench np.savetxt('test_vectors_input.txt', test_inputs, fmt='%d') np.savetxt('test_vectors_output.txt', np.array(test_outputs), fmt='%d') print(f"Generated {num_tests} test vectors") ``` ### Step 2: Create a Verilog Testbench Create a testbench file `tb_logic_net.v` for combinational designs: ```verilog `timescale 1ns/1ps module tb_logic_net; // Parameters parameter INPUT_WIDTH = 8; parameter OUTPUT_WIDTH = 2; parameter NUM_TESTS = 100; // Signals reg [INPUT_WIDTH-1:0] inp; wire [OUTPUT_WIDTH-1:0] out; // Expected output reg [OUTPUT_WIDTH-1:0] expected_out; // Test vectors reg [INPUT_WIDTH-1:0] test_inputs [0:NUM_TESTS-1]; reg [OUTPUT_WIDTH-1:0] test_outputs [0:NUM_TESTS-1]; integer i; integer errors; // Instantiate the DUT (Device Under Test) logic_net dut ( .inp(inp), .out(out) ); // Load test vectors initial begin $readmemb("test_vectors_input.txt", test_inputs); $readmemb("test_vectors_output.txt", test_outputs); errors = 0; end // Test stimulus initial begin $display("Starting testbench..."); $display("Time\t\tInput\t\tOutput\t\tExpected\tStatus"); $display("----\t\t-----\t\t------\t\t--------\t------"); // Run through all test vectors for (i = 0; i < NUM_TESTS; i = i + 1) begin inp = test_inputs[i]; expected_out = test_outputs[i]; #10; // Wait 10ns for combinational logic to settle // Check output if (out !== expected_out) begin $display("%0t\t%b\t%b\t%b\t\tFAIL", $time, inp, out, expected_out); errors = errors + 1; end else begin $display("%0t\t%b\t%b\t%b\t\tPASS", $time, inp, out, expected_out); end end // Summary $display("\n========================================"); $display("Test Summary"); $display("========================================"); $display("Total tests: %0d", NUM_TESTS); $display("Passed: %0d", NUM_TESTS - errors); $display("Failed: %0d", errors); if (errors == 0) begin $display("\nALL TESTS PASSED!"); end else begin $display("\nSOME TESTS FAILED!"); end $finish; end // Optional: Generate VCD waveform dump initial begin $dumpfile("tb_logic_net.vcd"); $dumpvars(0, tb_logic_net); end endmodule ``` #### Testbench for Pipelined Designs For pipelined designs, you need to account for latency: ```verilog module tb_pipelined_logic_net; parameter PIPELINE_LATENCY = 4; // Match your pipeline_stages reg clk = 0; reg rst = 1; reg [7:0] inp; wire [1:0] out; // Generate clock (100 MHz) always #5 clk = ~clk; logic_net dut ( .clk(clk), .rst(rst), .inp(inp), .out(out) ); initial begin // Reset sequence rst = 1; #20 rst = 0; // Release reset after 2 cycles // Run tests with pipeline latency for (i = 0; i < NUM_TESTS; i = i + 1) begin inp = test_inputs[i]; // Wait for pipeline to fill repeat(PIPELINE_LATENCY) @(posedge clk); expected_out = test_outputs[i]; // Check output if (out !== expected_out) begin $display("FAIL: Input %b -> Output %b (expected %b)", test_inputs[i], out, expected_out); errors = errors + 1; end end // Summary... $finish; end endmodule ``` ### Step 3: Simulate with Different Tools #### Option A: Vivado Simulator (xsim) ```bash # Compile the design xvlog logic_net.v xvlog tb_logic_net.v # Elaborate xelab -debug typical tb_logic_net -s tb_logic_net_sim # Run simulation xsim tb_logic_net_sim -runall # View waveforms (GUI mode) xsim tb_logic_net_sim -gui ``` #### Option B: Icarus Verilog ```bash # Compile and run in one step iverilog -o sim.out logic_net.v tb_logic_net.v vvp sim.out # View waveforms with GTKWave gtkwave tb_logic_net.vcd ``` #### Option C: ModelSim/QuestaSim ```bash # Create work library vlib work # Compile sources vlog logic_net.v vlog tb_logic_net.v # Simulate vsim -c tb_logic_net -do "run -all; quit" # Or run with GUI vsim tb_logic_net # In ModelSim GUI: run -all ``` ### Step 4: Analyze Results #### Success Criteria - All test vectors should produce matching outputs - No 'X' or 'Z' values in outputs (indicates uninitialized or high-impedance states) - Combinational delay should be minimal (typically < 1ns for simple gates) #### Common Issues **Mismatched outputs:** - Verify test vector format (binary vs decimal) - Check that Verilog module name matches instantiation in testbench - Ensure input/output widths match **X or Z values:** - Usually indicates undriven wires - Check all wires in generated Verilog have assignments **Compilation errors:** - Verify Verilog syntax - Check for Verilog-1995 vs Verilog-2001 compatibility issues ### Self-Checking Testbench For automated testing, create a self-checking testbench that exits with an error code: ```verilog initial begin // ... run tests ... if (errors != 0) begin $display("FAIL: %0d errors detected", errors); $fatal(1, "Test failed"); // Exit with error end else begin $display("PASS: All tests passed"); end $finish; end ``` Then use in scripts: ```bash #!/bin/bash iverilog -o sim.out logic_net.v tb_logic_net.v && vvp sim.out if [ $? -eq 0 ]; then echo "Simulation PASSED" else echo "Simulation FAILED" exit 1 fi ``` --- ## Synthesis with Vivado Synthesis converts Verilog RTL code into gate-level netlists optimized for specific FPGA parts, providing resource estimates, timing analysis, and power consumption data. ### Overview Synthesis provides: - **Resource Estimates**: LUTs, FFs, DSPs, BRAM usage - **Timing Analysis**: Maximum frequency (fmax), Worst Negative Slack (WNS), critical paths - **Power Estimates**: Static and dynamic power consumption ### Prerequisites - **Vivado Design Suite** installed (tested with 2019.1+) - Generated Verilog file from TorchLogix (e.g., `logic_net.v`) - Target FPGA part number (e.g., `xc7z020clg400-1` for Pynq-Z2) ### Quick Start ```bash # Navigate to your Verilog output directory cd verilog_output/ # Run synthesis with TCL script vivado -mode batch -source synthesize.tcl -tclargs logic_net.v xc7z020clg400-1 # Results will be in synthesis_reports/ ls synthesis_reports/ # utilization.txt timing.txt power.txt ``` ### Synthesis Methods #### Method 1: Batch Mode with TCL Script (Recommended) Create `synthesize.tcl`: ```tcl # Parse arguments set verilog_file [lindex $argv 0] set part_number [lindex $argv 1] set report_dir [lindex $argv 2] if {$report_dir == ""} { set report_dir "synthesis_reports" } # Create output directory file mkdir $report_dir # Create in-memory project create_project -in_memory -part $part_number # Read Verilog source read_verilog $verilog_file # Detect if design has clock port set has_clock [expr {[llength [get_ports -quiet clk]] > 0}] if {$has_clock} { puts "INFO: Detected clocked design, applying timing constraints" # Create clock constraint (10ns period = 100 MHz) create_clock -period 10.0 -name clk [get_ports clk] # Input/output delays set_input_delay -clock clk 2.0 [get_ports -filter {NAME != clk && DIRECTION == IN}] set_output_delay -clock clk 2.0 [get_ports -filter {DIRECTION == OUT}] # Run synthesis synth_design -top [get_property TOP [current_fileset]] } else { puts "INFO: Detected combinational design" # Run synthesis in out-of-context mode synth_design -top [get_property TOP [current_fileset]] -mode out_of_context } # Generate reports report_utilization -file ${report_dir}/utilization.txt report_timing_summary -file ${report_dir}/timing.txt report_power -file ${report_dir}/power.txt # Extract key metrics set lut_count [get_property LUT_AS_LOGIC [get_cells -hierarchical -filter {PRIMITIVE_TYPE =~ LUT*}] | llength] set ff_count [get_property PRIMITIVE_COUNT [get_cells -hierarchical -filter {PRIMITIVE_TYPE =~ REGISTER*}]] puts "\n========================================" puts "Synthesis Summary" puts "========================================" puts "LUTs: $lut_count" puts "FFs: $ff_count" puts "========================================" # Save summary set summary_file [open "${report_dir}/summary.txt" w] puts $summary_file "LUTs: $lut_count" puts $summary_file "FFs: $ff_count" close $summary_file puts "Reports saved to: $report_dir/" exit ``` Run synthesis: ```bash vivado -mode batch -source synthesize.tcl -tclargs logic_net.v xc7z020clg400-1 reports/ ``` #### Method 2: Vivado GUI 1. Launch Vivado: `vivado` 2. Create New Project - Click "Create Project" - Choose project location - Select "RTL Project" 3. Add Verilog Source - In Flow Navigator: "Add Sources" - "Add or create design sources" - Add your `logic_net.v` file 4. Select Target Part - Common parts: - Pynq-Z2: `xc7z020clg400-1` - ZCU104: `xczu7ev-ffvc1156-2-e` - Artix-7: `xc7a35tcpg236-1` 5. Run Synthesis - In Flow Navigator: "Run Synthesis" 6. View Reports - After synthesis: "Open Synthesized Design" - Reports → Utilization, Timing Summary #### Method 3: Vivado Tcl Console ```bash vivado -mode tcl ``` Then in Tcl console: ```tcl # Create in-memory project create_project -in_memory -part xc7z020clg400-1 # Read Verilog read_verilog logic_net.v # Run synthesis synth_design -top logic_net -mode out_of_context # Generate reports report_utilization -file utilization.txt report_timing_summary -file timing.txt report_power -file power.txt exit ``` ### Understanding the Reports #### Utilization Report Shows FPGA resource usage: ``` +-------------------------+------+-------+-----------+-------+ | Site Type | Used | Fixed | Available | Util% | +-------------------------+------+-------+-----------+-------+ | Slice LUTs | 147 | 0 | 53200 | 0.28 | | LUT as Logic | 147 | 0 | 53200 | 0.28 | | LUT as Memory | 0 | 0 | 17400 | 0.00 | | Slice Registers | 0 | 0 | 106400 | 0.00 | | Register as Flip Flop | 0 | 0 | 106400 | 0.00 | | Register as Latch | 0 | 0 | 106400 | 0.00 | | F7 Muxes | 8 | 0 | 26600 | 0.03 | | F8 Muxes | 2 | 0 | 13300 | 0.02 | +-------------------------+------+-------+-----------+-------+ ``` **Key Metrics:** - **Slice LUTs**: Primary logic resource. Each LUT can implement any 6-input Boolean function. - **Slice Registers**: Flip-flops for sequential logic (0 for purely combinational designs) - **F7/F8 Muxes**: Larger multiplexers for wide logic functions - **DSPs**: Digital Signal Processing blocks (typically 0 for logic gate networks) - **BRAM**: Block RAM (typically 0 for logic gate networks) #### Timing Report Shows timing analysis: ``` Timing Summary (ns) ------------------- WNS(ns) TNS(ns) WHS(ns) THS(ns) WPWS(ns) TPWS(ns) ------- ------- ------- ------- -------- -------- 7.500 0.000 0.300 0.000 3.500 0.000 ``` **Key Metrics:** - **WNS (Worst Negative Slack)**: Most critical timing margin - Positive = timing met - Negative = timing violation - **Critical Path Delay**: Longest combinational path through the design **For combinational designs:** ```tcl # Get maximum delay through the design report_timing -delay_type min_max -max_paths 10 -file timing_paths.txt ``` Example output: ``` Critical Path Delay: 4.832 ns Maximum Frequency: 206.95 MHz (if this were in a clocked design) ``` #### Power Report Shows estimated power consumption: ``` Total On-Chip Power (W) : 0.082 Dynamic (W) : 0.012 Device Static (W) : 0.070 ``` ### Interpreting Results for TorchLogix Models #### Resource Estimates **LUT Count Interpretation:** - Each logic gate (AND, OR, XOR, etc.) typically maps to a fractional LUT - Expect roughly 0.5-1 LUT per gate operation - Tree structures may share LUTs efficiently - A 1000-gate network might use 500-800 LUTs **Resource Scaling:** - Linear layers: O(neurons) LUTs - Convolutional layers: O(kernels × receptive_field²) LUTs - Deeper trees → more efficient LUT packing **Pipelining Impact:** - Adds flip-flops (FFs) for registers - May actually **reduce** LUTs through better optimization - Example: 50-layer model with 4 pipeline stages might use 20% fewer LUTs #### Latency Estimates **Combinational Delay:** - Total latency = critical path delay - Typical delays: - Simple AND/OR/XOR: 0.1-0.2 ns per gate - Deep trees (10 levels): 2-4 ns total - Very deep networks: 5-10 ns **Maximum Frequency:** ``` fmax = 1 / critical_path_delay ``` Example: 4.5 ns critical path → fmax ≈ 222 MHz ### Pipelined vs Combinational Designs #### Combinational (pipeline_stages=0) ```python verilog = compiled.get_verilog_code(pipeline_stages=0) # Default ``` - No clock or reset signals - All logic in single cycle - Synthesis uses `-mode out_of_context` - **Problem:** May fail for large models (>1M Verilog lines) **Timing Analysis:** ``` Combinational (pipeline_stages=0): Critical path: 25 ns fmax: 40 MHz ``` #### Pipelined (pipeline_stages>0) ```python verilog = compiled.get_verilog_code(pipeline_stages=4) # 4 stages ``` - Has clock (clk) and reset (rst) signals - Logic divided into N pipeline stages - Synthesis auto-detects and applies clock constraints - **Solution:** Enables synthesis of very large models **Timing Analysis:** ``` Pipelined (pipeline_stages=4): Critical path: 6 ns fmax: 166 MHz Latency: 4 cycles = 24 ns @ 166 MHz ``` **When to use pipelining:** - Synthesis fails or runs for hours → Use `pipeline_stages=1` or more - Model >20 layers → Consider `pipeline_stages=4-8` - Model >100 layers → Use `pipeline_stages=16` or higher ### Synthesis Strategies #### For Minimum Latency ```tcl synth_design -top logic_net -mode out_of_context \ -directive PerformanceOptimized \ -no_lc # Disable logic combining for speed ``` #### For Minimum Area ```tcl synth_design -top logic_net -mode out_of_context \ -directive AreaOptimized_high \ -shreg_min_size 5 # Aggressive resource sharing ``` #### For Balanced Results ```tcl synth_design -top logic_net -mode out_of_context \ -directive Default ``` ### Common Issues and Solutions #### Issue: "Multi-driven net" Error **Cause:** Multiple assign statements to the same wire. **Solution:** Check generated Verilog for duplicate assignments. #### Issue: Unrealistically High fmax **Cause:** No input/output delays specified. **Solution:** Add timing constraints with realistic I/O delays. #### Issue: Very High LUT Count **Cause:** Unoptimized gate structure or deep logic trees. **Solution:** - Check for constant propagation opportunities - Verify gate operations are using optimal Boolean functions - Consider factoring logic differently in training ### Next Steps After Synthesis 1. **Implementation**: Run place & route for final timing/resource numbers ```tcl opt_design place_design route_design report_timing_summary -file post_route_timing.txt report_utilization -file post_route_utilization.txt ``` 2. **Bitstream Generation**: Create FPGA configuration file ```tcl write_bitstream -force logic_net.bit ``` 3. **Hardware Testing**: Deploy to actual FPGA and validate --- ## Complete Workflow Examples ### Example 1: Small Model - Full Workflow Train, export, test, and synthesize a small MNIST classifier: ```python #!/usr/bin/env python3 import torch import torch.nn as nn from torchlogix.layers import LogicDense, GroupSum from torchlogix import CompiledLogicNet import numpy as np import subprocess # 1. Create and train model model = nn.Sequential( LogicDense(784, 128, connections="fixed", device="cpu"), LogicDense(128, 128, connections="fixed", device="cpu"), LogicDense(128, 100, connections="fixed", device="cpu"), GroupSum(10, tau=10.0) ) # Train your model here... # model.load_state_dict(torch.load('trained_model.pt')) # 2. Compile and export Verilog compiled = CompiledLogicNet( model, input_shape=(784,), use_bitpacking=False, num_bits=1 ) # Small model: fully combinational verilog = compiled.get_verilog_code( module_name="mnist_classifier", pipeline_stages=0 ) with open('mnist_classifier.v', 'w') as f: f.write(verilog) print("✓ Generated mnist_classifier.v") # 3. Generate test vectors num_tests = 100 test_inputs = np.random.randint(0, 2, (num_tests, 784), dtype=np.int8) test_outputs = [] compiled.compile() for inp in test_inputs: out = compiled.forward(inp.reshape(1, -1)) test_outputs.append(out[0]) np.savetxt('test_vectors_input.txt', test_inputs, fmt='%d') np.savetxt('test_vectors_output.txt', np.array(test_outputs), fmt='%d') print(f"✓ Generated {num_tests} test vectors") # 4. Run simulation (using Icarus Verilog) subprocess.run([ 'iverilog', '-o', 'sim.out', 'mnist_classifier.v', 'tb_mnist_classifier.v' ]) result = subprocess.run(['vvp', 'sim.out'], capture_output=True, text=True) if "ALL TESTS PASSED" in result.stdout: print("✓ Simulation passed") else: print("✗ Simulation failed") exit(1) # 5. Run synthesis subprocess.run([ 'vivado', '-mode', 'batch', '-source', 'synthesize.tcl', '-tclargs', 'mnist_classifier.v', 'xc7z020clg400-1', 'reports/' ]) print("✓ Synthesis complete - check reports/ directory") ``` ### Example 2: Large Model with Pipelining ```python #!/usr/bin/env python3 from torchlogix import CompiledLogicNet import subprocess # Load large pre-trained model model = ... # 50+ layers compiled = CompiledLogicNet(model, input_shape=(784,), use_bitpacking=False, num_bits=1) # Try different pipeline configurations configs = [ (0, "Fully combinational"), (1, "Single output register"), (4, "4 pipeline stages"), (8, "8 pipeline stages"), ] for pipeline_stages, description in configs: print(f"\n{'='*60}") print(f"Testing: {description} (pipeline_stages={pipeline_stages})") print(f"{'='*60}") # Generate Verilog verilog = compiled.get_verilog_code( module_name=f"large_model_p{pipeline_stages}", pipeline_stages=pipeline_stages ) filename = f"large_model_p{pipeline_stages}.v" with open(filename, 'w') as f: f.write(verilog) print(f"✓ Generated {filename} ({len(verilog)} bytes)") # Synthesize result = subprocess.run([ 'vivado', '-mode', 'batch', '-source', 'synthesize.tcl', '-tclargs', filename, 'xc7z020clg400-1', f'reports_p{pipeline_stages}/' ], capture_output=True, text=True, timeout=600) if result.returncode == 0: print(f"✓ Synthesis succeeded") # Parse results with open(f'reports_p{pipeline_stages}/summary.txt') as f: print(f.read()) else: print(f"✗ Synthesis failed or timed out") print("\n" + "="*60) print("Compare results in reports_p*/ directories") print("="*60) ``` ### Example 3: Convolutional Model ```python #!/usr/bin/env python3 import torch.nn as nn from torchlogix.layers import LogicConv2d, OrPooling, LogicDense, GroupSum from torchlogix import CompiledLogicNet # Create convolutional model model = nn.Sequential( LogicConv2d( in_dim=(28, 28), channels=1, num_kernels=16, tree_depth=3, receptive_field_size=5, padding=2, connections="fixed", device="cpu" ), OrPooling(kernel_size=2, stride=2), LogicConv2d( in_dim=(14, 14), channels=16, num_kernels=32, tree_depth=3, receptive_field_size=3, padding=1, connections="fixed", device="cpu" ), OrPooling(kernel_size=2, stride=2), nn.Flatten(), LogicDense(32*7*7, 256, connections="fixed", device="cpu"), LogicDense(256, 100, connections="fixed", device="cpu"), GroupSum(10, tau=10.0) ) # Train model... # Export with pipelining (medium-sized model) compiled = CompiledLogicNet( model, input_shape=(1, 28, 28), use_bitpacking=False, num_bits=1 ) verilog = compiled.get_verilog_code( module_name="conv_classifier", pipeline_stages=4 # 4 stages for medium conv model ) compiled.export_hdl( "./conv_verilog_output", module_name="conv_classifier", pipeline_stages=4 ) print("✓ Exported convolutional model to conv_verilog_output/") ``` ### Example 4: Python-Driven Synthesis Loop Automate synthesis and collect metrics: ```python #!/usr/bin/env python3 import subprocess import re import pandas as pd def synthesize_and_extract_metrics(verilog_file, part, report_dir): """Run Vivado synthesis and extract key metrics.""" # Run synthesis result = subprocess.run([ 'vivado', '-mode', 'batch', '-source', 'synthesize.tcl', '-tclargs', verilog_file, part, report_dir ], capture_output=True, text=True, timeout=600) if result.returncode != 0: return None # Parse summary metrics = {} with open(f'{report_dir}/summary.txt', 'r') as f: for line in f: if 'LUTs:' in line: metrics['luts'] = int(re.search(r'\d+', line).group()) elif 'FFs:' in line: metrics['ffs'] = int(re.search(r'\d+', line).group()) # Parse timing with open(f'{report_dir}/timing.txt', 'r') as f: content = f.read() wns_match = re.search(r'WNS\(ns\)\s+([-\d.]+)', content) if wns_match: metrics['wns_ns'] = float(wns_match.group(1)) return metrics # Run experiments compiled = CompiledLogicNet(model, input_shape=(784,), use_bitpacking=False, num_bits=1) results = [] for stages in [0, 1, 2, 4, 8, 16]: print(f"Synthesizing with pipeline_stages={stages}...") verilog = compiled.get_verilog_code( module_name=f'model_p{stages}', pipeline_stages=stages ) filename = f'model_p{stages}.v' with open(filename, 'w') as f: f.write(verilog) metrics = synthesize_and_extract_metrics( filename, 'xc7z020clg400-1', f'reports_p{stages}' ) if metrics: metrics['pipeline_stages'] = stages results.append(metrics) print(f" LUTs: {metrics['luts']}, FFs: {metrics['ffs']}, WNS: {metrics['wns_ns']} ns") # Create summary DataFrame df = pd.DataFrame(results) df.to_csv('synthesis_comparison.csv', index=False) print("\n" + "="*60) print("Synthesis Comparison") print("="*60) print(df) ``` --- ## Advanced Topics ### Custom Timing Constraints For more accurate timing analysis, create custom constraints in `constraints.xdc`: ```tcl # Virtual clock for timing analysis (10ns = 100 MHz) create_clock -period 10.000 -name virtual_clk # Input delay (assume inputs arrive 2ns after clock edge) set_input_delay -clock virtual_clk 2.000 [get_ports inp*] # Output delay (assume outputs must be stable 2ns before next clock edge) set_output_delay -clock virtual_clk 2.000 [get_ports out*] # For pipelined designs with actual clock create_clock -period 5.000 -name clk [get_ports clk] # 200 MHz # Relax timing on reset path set_false_path -from [get_ports rst] ``` Load in synthesis: ```tcl read_xdc constraints.xdc synth_design -top logic_net ``` ### Comparing C Code and Verilog Implementations TorchLogix can generate both C code (for HLS or CPU) and Verilog: ```python from torchlogix import CompiledLogicNet compiled = CompiledLogicNet(model, input_shape=(784,), use_bitpacking=False, num_bits=1) # Generate C code c_code = compiled.get_c_code() with open('model.c', 'w') as f: f.write(c_code) # Generate Verilog verilog_code = compiled.get_verilog_code() with open('model.v', 'w') as f: f.write(verilog_code) # Compile C for CPU execution compiled.compile(compiler='gcc', optimization_level='-O3') # Now you can compare: # - C compiled to native CPU code # - C compiled through Vivado HLS to RTL # - Direct Verilog synthesis # All three should produce functionally identical results ``` ### Integration with C→HLS Pipeline For users who want to explore the HLS route: ```python # Generate optimized C code compiled = CompiledLogicNet(model) c_code = compiled.get_c_code() with open('model.c', 'w') as f: f.write(c_code) ``` Then use Vivado HLS: ```tcl # hls_script.tcl open_project hls_project set_top model_top add_files model.c open_solution "solution1" set_part {xc7z020clg400-1} create_clock -period 10 csynth_design export_design -format ip_catalog exit ``` ```bash vivado_hls -f hls_script.tcl ``` This generates an IP core that can be integrated into larger FPGA designs. ### Layer Support Status and Future Work #### Currently Supported | Layer | Status | Implementation | |-------|--------|---------------| | LogicDense | ✅ Complete | Direct gate synthesis | | LogicConv2d | ✅ Complete | Binary tree structure | | LogicConv3d | ✅ Complete | Binary tree with 3D indexing | | Flatten | ✅ Complete | Wire passthrough | #### In Progress / TODO | Layer | Status | Notes | |-------|--------|-------| | OrPooling | ⚠️ TODO | OR reduction tree needed | | GroupSum | ⚠️ TODO | Adder tree implementation needed | Models using unsupported layers will generate placeholders in Verilog. For production use: - Use C code generation for complete model support - Wait for future releases with complete Verilog support - Manually implement missing layers if needed ### Optimization Tips **Model Architecture:** - Prefer models with 20-100 layers for good synthesis results - Very deep models (>200 layers) require aggressive pipelining - Conv layers with large receptive fields may need optimization **Pipeline Configuration:** - Start with `pipeline_stages=0` for small models - Increase incrementally if synthesis fails or is slow - Use full pipelining (`pipeline_stages=999`) for maximum fmax **Synthesis Directives:** - Use `-directive PerformanceOptimized` for speed - Use `-directive AreaOptimized_high` for small FPGAs - Experiment with different strategies for your specific design **Resource Utilization:** - Target <70% LUT utilization for good place-and-route results - Very high utilization (>90%) may cause routing failures - Consider model size vs available FPGA resources during training --- ## Summary TorchLogix provides comprehensive support for deploying logic gate networks to FPGA hardware: 1. **Direct Verilog Generation**: Export trained models to readable, synthesizable Verilog RTL 2. **Configurable Pipelining**: Balance latency and synthesis complexity with flexible pipeline stages 3. **Testing Support**: Generate test vectors and verify functionality with standard simulators 4. **Synthesis Integration**: Automate synthesis with Vivado to obtain resource and timing estimates 5. **Complete Workflows**: End-to-end examples from training to hardware deployment **Key Takeaways:** - Use `pipeline_stages=0` for small models, increase for larger ones - Always test generated Verilog with simulations before synthesis - Synthesis provides accurate resource and timing estimates before hardware deployment - Both Verilog and C code generation are supported for maximum flexibility For questions or issues with hardware deployment, please refer to the [TorchLogix repository](https://github.com/your-org/torchlogix) or open an issue.