# Core Concepts

This guide explains the key architectural concepts and design principles behind TorchLogix. For detailed API documentation, see the [API Reference](../api/torchlogix.rst).

## Design Philosophy

TorchLogix is built around three core principles:

1. **PyTorch API Compatibility**: Layers follow standard `torch.nn.Module` conventions
2. **Separation of Concerns**: Each layer separates connection patterns from parametrization strategies
3. **Differentiable-to-Discrete**: Training uses continuous relaxations with multiple discretization strategies

---

## PyTorch API Compatibility

All TorchLogix layers follow standard PyTorch `nn.Module` conventions, making them compatible with existing PyTorch workflows. However, **logic layers are not drop-in replacements** for standard layers like `nn.Linear` or `nn.Conv2d` - they have fundamentally different computational properties and require significantly more neurons (or kernels) to achieve comparable expressiveness.

```python
import torch.nn as nn
from torchlogix.layers import LogicDense

# Standard PyTorch: 256 neurons
linear = nn.Linear(784, 256)

# TorchLogix: needs far more neurons for similar capacity
logic = LogicDense(784, 4096)  # e.g. 4096 neurons instead of 256
```

### Standard PyTorch Conventions

Despite different computational characteristics, TorchLogix layers follow all standard PyTorch patterns:

- **Inherit from `nn.Module`**: All layers are proper PyTorch modules
- **Learnable parameters**: Registered via `nn.Parameter` for automatic gradient computation
- **Training/eval modes**: `.train()` and `.eval()` control behavior
- **Device management**: Standard `.to(device)` and `.cuda()` work as expected
- **Composition**: Use with `nn.Sequential`, custom `forward()` methods, etc.
- **State dict**: Save and load with `state_dict()` and `load_state_dict()`

This means you can use TorchLogix with standard PyTorch optimizers, learning rate schedulers, data loaders, and training loops - just be mindful that logic networks need different architectural scales (and typically much higher learning rates).

---

## Separation of Concerns

Each logic layer is composed of two independent, swappable components:

```
┌─────────────────────────────────────────┐
│          Logic Layer (nn.Module)        │
│                                         │
│  ┌─────────────┐  ┌──────────────────┐  │
│  │ Connections │  │ Parametrization  │  │
│  │             │  │                  │  │
│  │ Which       │  │ - How Boolean    │  │
│  │ inputs      │  │   functions are  │  │
│  │ connect?    │  │   represented?   │  │
│  │             │  │ - Forward        │  │
│  │             │  │   sampling mode  │  │
│  └─────────────┘  └──────────────────┘  │
└─────────────────────────────────────────┘
```

This modular design allows you to mix and match different strategies for each component.

### 1. Connections: Input Routing

The **Connections** component determines which inputs from the previous layer connect to each neuron or convolutional kernel.

**Why separate this?** Different connectivity patterns enable different trade-offs between expressiveness and parameter efficiency.

#### Types of Connections

**Fixed Connections** (`connections="fixed"`):
- Randomly select which inputs connect to each neuron
- Fixed after initialization (no learning overhead)
- Options: `"random"` (with replacement) or `"random-unique"` (without replacement)
- Used in most pre-configured models for efficiency
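
The difference between the two fixed options can be sketched in plain Python. This is only an illustration, not the TorchLogix implementation: the helper name `sample_fixed_connections` is hypothetical, and applying "with/without replacement" per neuron is an assumption.

```python
import random


def sample_fixed_connections(in_dim, out_dim, lut_rank=2, unique=False, seed=0):
    """Return, for each neuron, the indices of the inputs it reads (hypothetical helper)."""
    rng = random.Random(seed)
    connections = []
    for _ in range(out_dim):
        if unique:
            # Without replacement: a neuron never reads the same input twice.
            idx = rng.sample(range(in_dim), lut_rank)
        else:
            # With replacement: duplicate inputs within one neuron are possible.
            idx = rng.choices(range(in_dim), k=lut_rank)
        connections.append(idx)
    return connections


conns = sample_fixed_connections(in_dim=8, out_dim=4, lut_rank=2, unique=True)
```

Because the indices are drawn once and never updated, they add no trainable parameters.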

**Learnable Connections** (`connections="learnable"`):
- Learn which inputs to connect during training
- Uses Gumbel-Softmax for differentiable selection
- More flexible but adds computational cost
- Useful when input structure is unknown

#### Dense vs Convolutional Structure

**Dense Structure**:
- Each neuron selects `lut_rank` inputs from all previous layer outputs
- Fully connected within the selection

**Convolutional Structure**:
- Each kernel selects inputs from a spatial receptive field
- Organized as a binary tree with depth `tree_depth`
- Supports channel grouping via `channel_group_size`
- By default, inputs are balanced across channels. For example, a kernel with 8 inputs and a channel group size of 2 picks 4 inputs from each channel

### 2. Parametrization: Representing Boolean Functions

The **Parametrization** component defines how Boolean functions (look-up tables) are represented as learnable parameters and how they are sampled during forward passes.

**Why separate this?** Different parametrizations have different optimization landscapes and computational costs, and different sampling strategies enable different exploration-exploitation trade-offs.

Any Boolean function can be represented as a weighted sum over a basis. TorchLogix provides three parametrization strategies:

#### Raw Parametrization (`parametrization="raw"`)

Directly represents the truth table using all 16 possible 2-input Boolean functions as described in https://arxiv.org/abs/2210.08277

- **Weights**: 16 logits per neuron (one per Boolean function)
- **Sampling**: Softmax over the 16 functions
- **Limitation**: Only supports `lut_rank=2` (2 inputs)
- **Use case**: Baseline method, interpretable gate selection

```
Each neuron learns: [w₀, w₁, ..., w₁₅]
                      ↓
                   softmax
                      ↓
       Weighted sum of 16 Boolean gates
```
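
The diagram above can be made concrete with a small plain-Python sketch of a single neuron. The gate ordering is an assumption: it follows the common convention in which 0=False, 1=AND, 7=OR, 14=NAND, 15=True, and the names `GATES`, `softmax` and `raw_neuron` are illustrative rather than part of the TorchLogix API.

```python
import math

# Real-valued relaxations of the 16 two-input gates, indexed 0..15 (assumed ordering).
GATES = [
    lambda a, b: 0.0,                        # 0: False
    lambda a, b: a * b,                      # 1: AND
    lambda a, b: a - a * b,                  # 2: a AND NOT b
    lambda a, b: a,                          # 3: a
    lambda a, b: b - a * b,                  # 4: NOT a AND b
    lambda a, b: b,                          # 5: b
    lambda a, b: a + b - 2 * a * b,          # 6: XOR
    lambda a, b: a + b - a * b,              # 7: OR
    lambda a, b: 1 - (a + b - a * b),        # 8: NOR
    lambda a, b: 1 - (a + b - 2 * a * b),    # 9: XNOR
    lambda a, b: 1 - b,                      # 10: NOT b
    lambda a, b: 1 - b + a * b,              # 11: a OR NOT b
    lambda a, b: 1 - a,                      # 12: NOT a
    lambda a, b: 1 - a + a * b,              # 13: NOT a OR b
    lambda a, b: 1 - a * b,                  # 14: NAND
    lambda a, b: 1.0,                        # 15: True
]


def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def raw_neuron(a, b, logits):
    """Softmax-weighted sum of all 16 gate outputs for inputs a, b in [0, 1]."""
    probs = softmax(logits)
    return sum(p * gate(a, b) for p, gate in zip(probs, GATES))


logits = [0.0] * 16
logits[1] = 10.0  # strongly prefer AND
out = raw_neuron(1.0, 1.0, logits)
```

With one dominant logit the neuron behaves almost exactly like the selected gate; with uniform logits it outputs 0.5 for every Boolean input.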

#### WARP Parametrization (`parametrization="warp"`)

Represents functions using Walsh-Hadamard basis coefficients (parity functions) as described in https://arxiv.org/abs/2602.03527

- **Weights**: 2^k coefficients per neuron (where k = `lut_rank`)
- **Sampling**: Sigmoid-based thresholding
- **Supports**: `lut_rank` ∈ {1, 2, 4, 6}
- **Use case**: Fewer parameters than raw, scales to higher-rank LUTs

```
Walsh basis: {1, x₁, x₂, x₁⊕x₂}  (for k=2)

Each neuron learns: [c₀, c₁, c₂, c₃]
                      ↓
          f(x₁,x₂) = Σᵢ cᵢ · basisᵢ(x₁,x₂)
```
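
As a rough illustration of the basis expansion, the sketch below computes Walsh coefficients for a truth table and reconstructs the function from them. It assumes the usual ±1 encoding of parities (an even number of set bits maps to +1, odd to -1); the sign convention and any thresholding used inside TorchLogix may differ, and all function names here are hypothetical.

```python
from itertools import product


def parity(subset, bits):
    """Walsh basis function: +1 if the selected bits have even parity, else -1."""
    return -1 if sum(bits[i] for i in subset) % 2 else 1


def walsh_coefficients(truth_table, k):
    """Project a {0,1}-valued truth table onto all 2^k parity functions."""
    subsets = [tuple(i for i in range(k) if (mask >> i) & 1) for mask in range(2 ** k)]
    inputs = list(product([0, 1], repeat=k))
    return {
        s: sum(truth_table[x] * parity(s, x) for x in inputs) / 2 ** k
        for s in subsets
    }


def evaluate(coeffs, bits):
    """Reconstruct f(bits) as the weighted sum over the basis."""
    return sum(c * parity(s, bits) for s, c in coeffs.items())


and_table = {x: int(all(x)) for x in product([0, 1], repeat=2)}
coeffs = walsh_coefficients(and_table, k=2)
```

For `k=2` this yields exactly four coefficients, one per basis function, and `evaluate` recovers the AND truth table on all four inputs.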

**Benefits**:
- More parameter-efficient than raw (4 coefficients vs 16 for `lut_rank=2`)
- Scales to higher-rank LUTs (4-input, 6-input gates)

#### Light Parametrization (`parametrization="light"`)

Uses indicator polynomial basis (product terms) with all-positive coefficients as described in https://arxiv.org/abs/2510.03250

- **Weights**: 2^k sigmoid-mapped coefficients
- **Supports**: `lut_rank` ∈ {2, 4, 6}
- **Use case**: Alternative basis with different inductive bias

```
Light basis: {1, x₁, x₂, x₁·x₂}  (for k=2)

Each neuron learns: [w₀, w₁, w₂, w₃]
                      ↓
                  sigmoid(wᵢ) → positive
                      ↓
          f(x₁,x₂) = Σᵢ σ(wᵢ) · basisᵢ(x₁,x₂)
```

#### LUT Rank: Higher-Order Logic

The `lut_rank` parameter controls how many inputs each logic gate operates on. There are `2^(2^n)` Boolean functions of `n` inputs:

- **`lut_rank=2`**: 2-input gates (16 possible functions)
- **`lut_rank=4`**: 4-input gates (65,536 possible functions)
- **`lut_rank=6`**: 6-input gates (18 quintillion possible functions)

A higher rank gives more expressive gates, but the parameter count per gate grows exponentially with `lut_rank`.
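
The counts listed above follow directly from the formula. A minimal check, using a hypothetical helper `lut_stats`:

```python
def lut_stats(lut_rank):
    """Return (truth-table entries, number of distinct Boolean functions)."""
    table_entries = 2 ** lut_rank        # one entry per input combination
    num_functions = 2 ** table_entries   # each entry is independently 0 or 1
    return table_entries, num_functions


for rank in (2, 4, 6):
    entries, functions = lut_stats(rank)
    print(f"lut_rank={rank}: {entries} table entries, {functions} functions")
```

The number of table entries (4, 16, 64) matches the `2^k` coefficients per neuron used by the WARP and Light parametrizations.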

#### Weight Initialization

The `weight_init` parameter controls how parameters are initialized:

- **`"residual"`** (default): Initialize near identity/passthrough
  - Critical for training deep networks (>6 layers)
  - Prevents vanishing gradients at initialization
  - Typically yields more trivial identity gates, which are optimized away at compile time

- **`"random"`**: Random initialization
  - Good for shallow networks
  - May struggle with depth >6

- **`"residual-catalog"`**: Mix of identity and random (WARP only)
  - Used in WARP-LUTs paper experiments

#### Forward Sampling Modes

The `forward_sampling` parameter (part of parametrization) controls how continuous relaxations are converted to outputs during the forward pass.

| Mode | Description | When to Use |
|------|-------------|-------------|
| **`"soft"`** | Continuous softmax/sigmoid relaxation | Default, stable gradients |
| **`"hard"`** | Straight-through estimator (STE) | Reduce train-test mismatch |
| **`"gumbel_soft"`** | Gumbel-Softmax/Sigmoid with noise | Exploration during training |
| **`"gumbel_hard"`** | Gumbel + STE | Exploration + discretization |


**Straight-Through Estimators (STE)**:

Hard sampling modes use the straight-through estimator trick:
- **Forward pass**: Discrete (argmax or threshold)
- **Backward pass**: Gradient of continuous relaxation

This allows training discrete models with gradient-based optimization:

```
Forward:  y = argmax(logits)           [discrete, non-differentiable]
Backward: ∂L/∂logits = ∂L/∂softmax    [continuous, differentiable]
```
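
A common way to write this trick in PyTorch is sketched below. It is a generic STE pattern, not necessarily how TorchLogix implements its hard modes, and `gate_values` is a made-up stand-in for per-choice outputs.

```python
import torch
import torch.nn.functional as F

logits = torch.tensor([2.0, 1.0, 0.5, 0.1], requires_grad=True)

soft = F.softmax(logits, dim=-1)
hard = F.one_hot(soft.argmax(dim=-1), num_classes=soft.shape[-1]).to(soft.dtype)

# Forward value equals `hard`; the gradient flows only through `soft`.
y = hard + soft - soft.detach()

gate_values = torch.tensor([0.0, 1.0, 1.0, 0.0])  # hypothetical per-choice outputs
loss = (y * gate_values).sum()
loss.backward()  # logits.grad is the gradient of the soft relaxation
```

The `soft - soft.detach()` term is numerically zero in the forward pass but carries the softmax gradient in the backward pass.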

**Gumbel Noise Injection**:

Gumbel sampling modes add noise to enable exploration:

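```python
import torch

logits = torch.tensor([2.0, 1.0, 0.5, 0.1])
temperature = 1.0

# Without Gumbel: deterministic, the first entry always has the highest probability
probs = torch.softmax(logits, dim=-1)

# With Gumbel: add Gumbel(0, 1) noise so the dominant entry can change between calls
uniform = torch.rand_like(logits).clamp_min(1e-10)
gumbel_noise = -torch.log(-torch.log(uniform))
noisy_probs = torch.softmax((logits + gumbel_noise) / temperature, dim=-1)  # Explores
```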

**Training vs Evaluation Modes**:

The sampling mode only applies during training. At evaluation time, the model always discretizes:

```python
# Training mode
model.train()
output = model(x)  # Uses specified forward_sampling mode

# Evaluation mode
model.eval()
output = model(x)  # Always uses hard discretization (argmax/threshold)
```

This ensures that:
1. Training benefits from gradient flow through continuous relaxations
2. Inference uses fully discrete operations (fast, interpretable)

---

## Layer-Specific Concepts

### LogicDense: Fully-Connected Layers

`LogicDense` is the fundamental building block, analogous to `nn.Linear`.

**Key differences from `nn.Linear`**:
- No bias term (bias is a Boolean function: constant True/False)
- Neurons operate on `lut_rank` inputs (not all inputs)
- Output lies in [0, 1]; in eval mode it is strictly Boolean (0 or 1)
- Requires many more neurons for comparable expressiveness

**Computation flow**:
```
Input x: (batch, in_dim)
   ↓
Connections: select lut_rank inputs per neuron
   ↓
Shape: (batch, lut_rank, out_dim)
   ↓
Parametrization: apply Boolean functions
   ↓
Output: (batch, out_dim)
```
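
The shapes in this flow can be reproduced with plain PyTorch indexing. The sketch below is not the `LogicDense` implementation: it uses random fixed indices and a soft AND gate for every neuron as a stand-in for the learned parametrization.

```python
import torch

batch, in_dim, out_dim, lut_rank = 5, 16, 8, 2
x = torch.rand(batch, in_dim)

# Connections: one input index per (LUT input slot, neuron).
idx = torch.randint(0, in_dim, (lut_rank, out_dim))
selected = x[:, idx]  # shape: (batch, lut_rank, out_dim)

# Parametrization stand-in: every neuron applies a soft AND to its two inputs.
a, b = selected[:, 0], selected[:, 1]
out = a * b  # shape: (batch, out_dim)
```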

### LogicConv2d/3d: Convolutional Layers

`LogicConv2d` and `LogicConv3d` are analogous to `nn.Conv2d` and `nn.Conv3d`.

**Key architectural difference**: Uses a **binary tree** of logic gates instead of a single large LUT.

#### Binary Tree Structure

Each convolutional kernel is organized as a binary tree:

```
                    Output
                      ↑
                  Level 2: 1 gate
                    ↗   ↖
              Level 1: 2 gates
              ↗  ↖      ↗  ↖
         Level 0: 4 positions in receptive field
```

- **Tree depth**: Controlled by `tree_depth` parameter
- **Receptive field**: `lut_rank^tree_depth` spatial positions
- **Example**: `lut_rank=2`, `tree_depth=2` → 4 positions in receptive field

**Why use a tree?** Enables hierarchical composition of local features without exponential parameter growth.
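
To illustrate the hierarchical reduction, here is a small sketch that evaluates a tree over the receptive field, following the diagram above. For brevity it applies the same gate at every node, whereas a real kernel learns a separate gate per node; `eval_tree` is a hypothetical helper.

```python
def eval_tree(leaves, gate, lut_rank=2):
    """Reduce the leaf values level by level until a single output remains."""
    level = list(leaves)
    while len(level) > 1:
        level = [
            gate(*level[i:i + lut_rank])
            for i in range(0, len(level), lut_rank)
        ]
    return level[0]


lut_rank, tree_depth = 2, 2
receptive_field = lut_rank ** tree_depth  # 4 leaf positions


def soft_and(a, b):
    return a * b


all_on = eval_tree([1, 1, 1, 1], soft_and)
one_off = eval_tree([1, 0, 1, 1], soft_and)
```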

**Weight structure**: Each tree level has separate learnable parameters:
```python
# LogicConv2d with tree_depth=2
self.weight = nn.ParameterList([
    nn.Parameter(...),  # Level 0: 4 kernels
    nn.Parameter(...),  # Level 1: 2 kernels
    nn.Parameter(...)   # Level 2: 1 kernel (output)
])
```

#### Channel Grouping

The `channel_group_size` parameter restricts each kernel to a subset of input channels:

```python
layer = LogicConv2d(
    in_channels=32,
    num_kernels=64,
    channel_group_size=8  # Each kernel sees 8 input channels
)
```

This creates overlapping groups, reducing parameters while maintaining coverage.

---

## Supporting Components

### GroupSum: Classification Head

`GroupSum` aggregates neuron outputs into class logits:

```python
# Input: (batch, n_neurons)
# Output: (batch, k), where k = num_classes

layer = GroupSum(
    k=10,
    tau=1.0,  # Temperature for normalization
    beta=0.0  # Offset, useful for regression tasks
)
```

**How it works**:
1. Reshape `(batch, num_classes × neurons_per_class)` → `(batch, num_classes, neurons_per_class)`
2. Sum over neuron groups
3. Divide by `tau` to normalize range
4. Shift by `beta` to desired range

**Why `tau` and `beta`?** Each Boolean neuron outputs a value in [0, 1], so the sum over a group of 100 neurons lies in [0, 100]. Setting `tau=100` normalizes this to [0, 1]. Additionally setting `beta=-42` shifts the range to [-42, -41].
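
The four steps can be sketched for a single sample in plain Python. This is an illustration of the arithmetic only, assuming contiguous groups as implied by the reshape; `group_sum` is a hypothetical helper, not the `GroupSum` layer.

```python
def group_sum(neuron_outputs, k, tau=1.0, beta=0.0):
    """Sum k equal-sized contiguous groups, divide by tau, then shift by beta."""
    if len(neuron_outputs) % k != 0:
        raise ValueError("number of neurons must be divisible by k")
    group_size = len(neuron_outputs) // k
    return [
        sum(neuron_outputs[c * group_size:(c + 1) * group_size]) / tau + beta
        for c in range(k)
    ]


# Two classes with 100 neurons each
all_on = group_sum([1] * 200, k=2, tau=100, beta=-42)
all_off = group_sum([0] * 200, k=2, tau=100, beta=-42)
```

With every neuron active each score reaches the upper end of the range, and with every neuron inactive it sits at `beta`.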

### Binarization Layers

Since logic gates operate on Boolean values, TorchLogix provides layers to convert continuous inputs:

- **`FixedBinarization`**: Fixed thresholds (e.g., 0.5)
- **`LearnableBinarization`**: Learn thresholds during training
- **`SoftBinarization`**: Differentiable sigmoid-based thresholding with fixed (non-learnable) threshold positions. It can aid training, but the temperature should be annealed during training to close the discretization gap.

Inputs can be binarized `per_feature` or `global`. For image-like data, there is also a `per_channel` option.
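
The relationship between hard and soft thresholding can be sketched as follows. The function names, the strict `>` comparison, and the one-output-per-threshold encoding are assumptions made for illustration; the actual layers may differ.

```python
import math


def fixed_binarize(x, thresholds):
    """One Boolean feature per threshold: 1 if x exceeds it, else 0."""
    return [1 if x > t else 0 for t in thresholds]


def soft_binarize(x, thresholds, temperature=0.1):
    """Sigmoid relaxation; approaches the hard version as temperature -> 0."""
    return [1 / (1 + math.exp(-(x - t) / temperature)) for t in thresholds]


thresholds = [0.25, 0.5, 0.75]
hard = fixed_binarize(0.6, thresholds)
warm = soft_binarize(0.6, thresholds, temperature=0.5)
cold = soft_binarize(0.6, thresholds, temperature=0.01)
```

At a high temperature the soft outputs stay close to 0.5, while at a low temperature they are nearly identical to the hard thresholds, which is why annealing closes the discretization gap.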

---

## Advanced Topics

### Gradient Scaling for Deep Networks

The `grad_factor` parameter scales gradients during the backward pass:

```python
layer = LogicDense(
    in_dim=256,
    out_dim=256,
    grad_factor=2.0  # Scale gradients by 2x
)
```

**Why is it needed?** Logic gates can have very small gradients. For an AND gate, `∂(a·b)/∂a = b`, and `b` lies in [0, 1], so each layer can shrink the gradient further. In deep networks (>6 layers), this causes vanishing gradients.

**How it works**: Uses a custom autograd function:
- **Forward pass**: Identity (`y = x`)
- **Backward pass**: Scale gradient (`∂L/∂x = grad_factor · ∂L/∂y`)

**When to use**: Set `grad_factor=2.0` or higher for networks deeper than 6 layers.
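
A custom autograd function with this behavior can be written in a few lines. The class name `GradFactor` is hypothetical and this sketch is not taken from the TorchLogix source; it only demonstrates the identity-forward, scaled-backward pattern described above.

```python
import torch


class GradFactor(torch.autograd.Function):
    """Identity in the forward pass, scales the gradient in the backward pass."""

    @staticmethod
    def forward(ctx, x, factor):
        ctx.factor = factor
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        # One gradient per forward input; `factor` is not differentiable.
        return grad_output * ctx.factor, None


x = torch.ones(3, requires_grad=True)
y = GradFactor.apply(x, 2.0)
y.sum().backward()  # x.grad is now 2.0 for every element
```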

### Extracting Discrete Boolean Functions

After training, you can extract the learned Boolean functions:

```python
model.eval()  # Important: switch to eval mode

# Get truth tables
luts = model.get_luts()  # Shape: (num_neurons, 2^lut_rank)

# Get integer IDs (for lut_rank=2 only)
luts, ids = model.get_luts_and_ids()  # ids in [0, 15]
```

The `ids` map to the 16 Boolean functions (0=False, 1=AND, 7=OR, 14=NAND, 15=True).
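
An id can be decoded into its truth table with simple bit operations. The bit order below is an assumption chosen so that it reproduces the five ids listed above; the helper `gate_truth_table` is illustrative and not part of the library.

```python
def gate_truth_table(gate_id):
    """Truth table {(a, b): output} for a 2-input gate id in [0, 15]."""
    if not 0 <= gate_id <= 15:
        raise ValueError("gate_id must be in [0, 15]")
    table = {}
    for a in (0, 1):
        for b in (0, 1):
            # Assumed layout: (1,1) -> bit 0, (1,0) -> bit 1, (0,1) -> bit 2, (0,0) -> bit 3
            bit = 2 * (1 - a) + (1 - b)
            table[(a, b)] = (gate_id >> bit) & 1
    return table


and_gate = gate_truth_table(1)
or_gate = gate_truth_table(7)
nand_gate = gate_truth_table(14)
```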

---

## Deployment: From Training to Inference

TorchLogix provides multiple inference backends optimized for different use cases:

### 1. Standard PyTorch (Training & Validation)

```python
model = LogicDense(128, 64)
model.eval()  # Discretizes to Boolean operations
output = model(input)  # Standard PyTorch inference
```

**Use case**: Validation during training, small-scale inference.

### 2. PackBitsTensor (GPU Batch Inference)

```python
from torchlogix import PackBitsTensor

model.eval()
model.cuda()

# Pack boolean tensors to single bits
input_packed = PackBitsTensor.pack(input)  # Requires CUDA
output_packed = model(input_packed)
output = output_packed.unpack()
```

**Use case**: Large batch inference on GPU, up to 32× memory savings.

**Limitation**: Requires CUDA device.
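
The idea behind bit packing can be shown without a GPU. The sketch below packs one Boolean per sample into the bits of a single integer, so one bitwise operation evaluates a gate for all samples at once. It illustrates the principle only; how `PackBitsTensor` lays out its bits internally is not described here, and the helper names are hypothetical.

```python
def pack_bits(bits):
    """Pack a list of 0/1 values into one integer (bit i holds sample i)."""
    word = 0
    for i, bit in enumerate(bits):
        word |= (bit & 1) << i
    return word


def unpack_bits(word, n):
    """Recover the first n bits as a list of 0/1 values."""
    return [(word >> i) & 1 for i in range(n)]


a = [1, 0, 1, 1, 0, 0, 1, 0]  # input A for 8 samples
b = [1, 1, 0, 1, 0, 1, 1, 0]  # input B for 8 samples

and_packed = pack_bits(a) & pack_bits(b)  # one operation, all samples
result = unpack_bits(and_packed, len(a))
```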

### 3. CompiledLogicNet (CPU Production Inference)

```python
from torchlogix import CompiledLogicNet

# Compile to C code
compiled = CompiledLogicNet(model, compiler='gcc', optimization_level='-O3')

# Run inference
output = compiled(input)  # Calls compiled C code
```

**Use case**: Production deployment on CPU, edge devices.

**Benefits**: 10-100× faster than PyTorch, no Python/PyTorch dependencies at runtime.

---

## Common Patterns

### Building a Classifier

```python
import torch.nn as nn
from torchlogix.layers import LogicDense, GroupSum, LearnableBinarization

model = nn.Sequential(
    LearnableBinarization(num_thresholds=3),  # Input preprocessing
    nn.Flatten(),
    LogicDense(in_dim=28*28*3, out_dim=512),
    LogicDense(in_dim=512, out_dim=512),
    LogicDense(in_dim=512, out_dim=1000),
    GroupSum(k=10, tau=100)  # 10 classes, 100 neurons per class
)
```

### Using Pre-configured Models

```python
from torchlogix.models import DlgnMediumMnist, ClgnMediumCifar10

# Dense model for MNIST
model = DlgnMediumMnist()

# Convolutional model for CIFAR-10
model = ClgnMediumCifar10()
```

Pre-configured models follow the naming convention `{D/C}lgn{Size}{Dataset}{Variant}`:
- `D` = Dense, `C` = Convolutional
- Size: Tiny, Small, Medium, Large, Large2, Large4
- Dataset: Mnist, Cifar10
- Variant: (empty), Rank4, Rank6, Learn0, Learn1, Learn2
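
If you need to work with these names programmatically, the convention can be expressed as a regular expression. This is a convenience sketch based only on the list above; `parse_model_name` is a hypothetical helper, and not every combination it accepts necessarily exists as a model.

```python
import re

MODEL_NAME = re.compile(
    r"^(?P<kind>[DC])lgn"
    r"(?P<size>Tiny|Small|Medium|Large[24]?)"
    r"(?P<dataset>Mnist|Cifar10)"
    r"(?P<variant>Rank4|Rank6|Learn0|Learn1|Learn2)?$"
)


def parse_model_name(name):
    """Split a model class name into kind, size, dataset and variant."""
    match = MODEL_NAME.match(name)
    if match is None:
        raise ValueError(f"not a recognised model name: {name}")
    parts = match.groupdict()
    parts["kind"] = {"D": "Dense", "C": "Convolutional"}[parts["kind"]]
    parts["variant"] = parts["variant"] or ""
    return parts


info = parse_model_name("DlgnMediumMnist")
```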

---

## Summary

TorchLogix's architecture is built on three key ideas:

1. **PyTorch Compatibility**: Seamless integration with existing PyTorch workflows (but not drop-in replacements)
2. **Modular Design**: Separate concerns (connections, parametrization) for maximum flexibility
3. **Differentiable-to-Discrete**: Train with continuous relaxations, deploy with discrete operations

This design allows you to:
- Use logic layers like any other PyTorch layer
- Mix and match different connection patterns, parametrization strategies, and sampling modes
- Train with gradients and deploy as efficient Boolean circuits

For detailed API documentation, see the [API Reference](../api/torchlogix.rst).

For hands-on examples, see the [Quickstart Guide](quickstart.md).