ComfyUI custom node for ComfyUI_WolfSpectrum
ComfyUI_WolfSpectrum


Training-free diffusion sampling acceleration via Spectrum — adaptive spectral feature forecasting for ComfyUI. Skips full transformer forwards on selected steps by predicting intermediate features using Chebyshev polynomial regression, blended with a discrete Taylor predictor.

Reference: Han et al., "Adaptive Spectral Feature Forecasting for Diffusion Sampling Acceleration", CVPR 2026, arXiv:2603.01623.




How It Works

Standard diffusion sampling runs the full transformer denoiser at every step. Spectrum observes that the intermediate hidden states (the pre-output features of the final layer) evolve smoothly across timesteps. Instead of recomputing them from scratch each step, Spectrum fits a Chebyshev polynomial to the history of observed features and extrapolates forward — then runs only the cheap output head on the predicted feature.

flowchart TD
    A[Diffusion Step t] --> B{Warmup or<br/>Actual Step?}
    B -- Yes --> C[Full Transformer Forward]
    C --> D[Hook captures h at final_layer]
    D --> E[Update Chebyshev forecaster]
    E --> F[Return output]
    B -- No --> G[Predict h̃ via Chebyshev + Taylor blend]
    G --> H[Compute vec_orig<br/>timestep + guidance embeddings]
    H --> I[Run output head only<br/>head h̃ vec_orig]
    I --> J[Postprocess to latent shape]
    J --> F

The key insight is that final_layer is cheap: it amounts to a modulated normalisation followed by a single linear projection. The expensive part is the stack of double and single transformer blocks before it, and Spectrum skips those on cached steps.


Installation

cd ComfyUI/custom_nodes
git clone <repo_url> ComfyUI_WolfSpectrum

Restart ComfyUI. The node Apply Spectrum will appear under model/spectrum.

No extra pip install is required if ComfyUI already runs Flux or Chroma.


Usage

  1. Load a supported model (Flux, Flux2, Flux2-Klein, or Chroma).
  2. Add the Apply Spectrum node and connect your model to it.
  3. Connect the output model to your sampler as usual.
  4. Set parameters (defaults are a safe starting point).
  5. Run — fewer full transformer passes will be executed after warmup.
graph LR
    A[Load Model] --> B[Apply Spectrum]
    B --> C[KSampler / SamplerCustomAdvanced]
    C --> D[VAE Decode]
    D --> E[Image]

Parameters

| Parameter | Default | Range | Description |
|---|---|---|---|
| w | 0.5 | 0.0–1.0 | Blend weight: 0 = pure Taylor, 1 = pure Chebyshev. Recommended 0.5–1.0. |
| lam | 0.1 | 1e-5–1.0 | Ridge regularisation λ for the Chebyshev least-squares fit. |
| m | 4 | 1–10 | Number of Chebyshev bases (polynomial order M; P = M+1 coefficients). |
| window_size | 2.0 | 1.0–10.0 | Initial window size \mathcal{N}: how many steps between actual forwards. |
| flex_window | 0.75 | 0.0–2.0 | \alpha: window size increment per actual step (adaptive scheduling). |
| warmup_steps | 5 | 0–30 | Number of initial steps that always run the full transformer. |
| taylor_order | 1 | 1–3 | Order of the discrete Newton–Taylor predictor used in the blend. |

Quick tuning guide:

  • More speed, lower quality: increase window_size, decrease warmup_steps.
  • More quality, less speed: increase warmup_steps, decrease window_size.
  • Noisy/corrupted images: increase warmup_steps to at least M+2 (= m+2), or increase lam.
  • Short runs (< 10 steps): warmup dominates and Spectrum brings no benefit below ~8 steps; reserve it for runs of 14 steps or more.

Step Schedule

The schedule determines which steps run the full transformer vs. the cached head.

Warmup phase

For the first warmup_steps steps, the full transformer always runs. This populates the Chebyshev sliding window with enough observations to fit a stable polynomial. A minimum of M+2 actual observations is required before any cached step is attempted.

Post-warmup: adaptive window

After warmup, the step is an actual forward if:

\text{actual\_forward} \iff \bigl(n_{\text{cached}} + 1\bigr) \bmod \lfloor \mathcal{N} \rfloor = 0

where n_{\text{cached}} is the number of consecutive cached steps since the last actual forward, and \mathcal{N} is the current window size.

After each actual forward in this post-warmup phase, the window grows:

\mathcal{N} \leftarrow \mathcal{N} + \alpha

This means caching intervals lengthen as the run progresses (features change more slowly near the end of denoising).

Example: 14 steps, warmup=3, window=2.0, flex=0.75

gantt
    title Step Schedule (A = Actual Forward, C = Cached)
    dateFormat  X
    axisFormat %s

    section Steps
    A Warmup 0   :milestone, 0, 0
    A Warmup 1   :milestone, 1, 1
    A Warmup 2   :milestone, 2, 2
    C Cached 3   :done, 3, 4
    A Actual 4   :milestone, 4, 4
    C Cached 5   :done, 5, 6
    A Actual 6   :milestone, 6, 6
    C Cached 7   :done, 7, 8
    C Cached 8   :done, 8, 9
    A Actual 9   :milestone, 9, 9
    C Cached 10  :done, 10, 11
    C Cached 11  :done, 11, 12
    C Cached 12  :done, 12, 13
    A Actual 13  :milestone, 13, 13

Result: 7 actual forwards out of 14 steps ≈ 2× speedup on transformer compute.
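
The 14-step example above can be reproduced with a short simulation. This is a minimal sketch rather than the node's actual scheduler: the function name spectrum_schedule is hypothetical, and it assumes the window only grows after post-warmup actual forwards, which is what the example schedule implies.

```python
import math

def spectrum_schedule(num_steps, warmup_steps, window_size, flex_window):
    """Return a list of 'A' (actual forward) / 'C' (cached) labels, one per step."""
    schedule = []
    curr_ws = float(window_size)   # current window size N
    n_cached = 0                   # consecutive cached steps since the last actual forward
    for step in range(num_steps):
        if step < warmup_steps:
            actual = True          # warmup: always run the full transformer
        else:
            period = max(1, math.floor(curr_ws))
            actual = (n_cached + 1) % period == 0
        if actual:
            n_cached = 0
            if step >= warmup_steps:
                curr_ws += flex_window   # assumption: no window growth during warmup
        else:
            n_cached += 1
        schedule.append("A" if actual else "C")
    return schedule

labels = spectrum_schedule(14, warmup_steps=3, window_size=2.0, flex_window=0.75)
print("".join(labels))      # AAACACACCACCCA
print(labels.count("A"))    # 7
```

Dividing the step count by the number of 'A' labels gives an upper bound on the transformer-compute speedup for a given setting; measured end-to-end speedup will be lower.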


Mathematical Specification

Feature Observation

At each actual step i, the forward pre-hook on final_layer captures:

h_i = \text{final\_layer\_input}(\mathbf{x}_{t_i}) \in \mathbb{R}^{B \times L \times C}

This is stored flattened as \bar{h}_i \in \mathbb{R}^{B \times F} where F = L \cdot C.
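
As an illustration of the capture step, the sketch below registers a PyTorch forward pre-hook on a toy head and stores the flattened input. ToyHead, capture_input, and the captured dictionary are hypothetical stand-ins, not the node's real classes; only register_forward_pre_hook and the returned handle's remove() are standard PyTorch API.

```python
import torch
from torch import nn

class ToyHead(nn.Module):
    """Stand-in for final_layer: takes (h, vec) like the head described above."""
    def __init__(self, channels, out_channels):
        super().__init__()
        self.linear = nn.Linear(channels, out_channels)

    def forward(self, h, vec):
        return self.linear(h + vec[:, None, :])

captured = {}

def capture_input(module, args):
    # args holds the positional inputs; the first one is the feature h (B, L, C).
    h = args[0].detach()
    captured["h_flat"] = h.reshape(h.shape[0], -1)   # stored as (B, F), F = L * C

head = ToyHead(channels=8, out_channels=4)
handle = head.register_forward_pre_hook(capture_input)
try:
    out = head(torch.randn(2, 3, 8), torch.randn(2, 8))
finally:
    handle.remove()   # remove the hook so later calls are not intercepted

print(captured["h_flat"].shape)   # torch.Size([2, 24])
```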

Chebyshev Basis

Diffusion timesteps t \in [0, 50] are mapped to \tau \in [-1, 1]:

\tau = \frac{2(t - t_{\min})}{t_{\max} - t_{\min}} - 1, \quad t_{\min} = 0,\ t_{\max} = 50

The Chebyshev basis evaluates M+1 polynomials via the recurrence:

T_0(\tau) = 1, \quad T_1(\tau) = \tau, \quad T_m(\tau) = 2\tau T_{m-1}(\tau) - T_{m-2}(\tau)

The design matrix for K observed timesteps is:

\mathbf{X} = \begin{bmatrix} T_0(\tau_1) & T_1(\tau_1) & \cdots & T_M(\tau_1) \\ \vdots & & & \vdots \\ T_0(\tau_K) & T_1(\tau_K) & \cdots & T_M(\tau_K) \end{bmatrix} \in \mathbb{R}^{K \times P}, \quad P = M+1
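
The mapping, recurrence, and design matrix above can be sketched in plain Python. The helper names (to_tau, chebyshev_row, design_matrix) are illustrative and are not the names used in the package.

```python
def to_tau(t, t_min=0.0, t_max=50.0):
    """Map a timestep t in [t_min, t_max] to tau in [-1, 1]."""
    return 2.0 * (t - t_min) / (t_max - t_min) - 1.0

def chebyshev_row(tau, M):
    """Return [T_0(tau), ..., T_M(tau)] using the three-term recurrence."""
    row = [1.0]
    if M >= 1:
        row.append(tau)
    for _ in range(2, M + 1):
        row.append(2.0 * tau * row[-1] - row[-2])
    return row

def design_matrix(timesteps, M):
    """Stack one basis row per observed timestep: K rows of M + 1 entries."""
    return [chebyshev_row(to_tau(t), M) for t in timesteps]

X = design_matrix([0.0, 25.0, 50.0], M=2)
# tau = -1 -> [1, -1, 1]; tau = 0 -> [1, 0, -1]; tau = 1 -> [1, 1, 1]
```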

Ridge Regression

Coefficients are fit by ridge regression (regularisation \lambda):

\hat{\mathbf{C}} = \bigl(\mathbf{X}^\top \mathbf{X} + \lambda \mathbf{I}_P\bigr)^{-1} \mathbf{X}^\top \mathbf{H}

where \mathbf{H} \in \mathbb{R}^{K \times B \times F} is the history buffer (reshaped to K \times BF for the solve), and \hat{\mathbf{C}} \in \mathbb{R}^{P \times B \times F}.

The solve uses a Cholesky factorisation for numerical stability:

\mathbf{L}\mathbf{L}^\top = \mathbf{X}^\top \mathbf{X} + \lambda \mathbf{I}_P, \quad \hat{\mathbf{C}} = \text{cholesky\_solve}(\mathbf{X}^\top \mathbf{H},\ \mathbf{L})

Chebyshev Prediction

At a target step t^*, the Chebyshev prediction is:

\hat{h}^{\text{cheb}}(t^*) = \mathbf{x}^* \hat{\mathbf{C}}, \quad \mathbf{x}^* = \bigl[T_0(\tau^*),\ T_1(\tau^*),\ \ldots,\ T_M(\tau^*)\bigr] \in \mathbb{R}^{1 \times P}
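
The ridge fit and prediction above can be checked on toy data. The sketch below uses NumPy instead of PyTorch to stay lightweight, and performs the two triangular solves with np.linalg.solve in place of a dedicated cholesky_solve; the function names and the toy history are assumptions made for illustration.

```python
import numpy as np

def cheb_design(taus, M):
    """Design matrix with columns T_0..T_M evaluated at each tau (shape K x P)."""
    taus = np.asarray(taus, dtype=np.float64)
    cols = [np.ones_like(taus), taus]
    for _ in range(2, M + 1):
        cols.append(2.0 * taus * cols[-1] - cols[-2])
    return np.stack(cols[: M + 1], axis=1)

def ridge_fit(X, H, lam):
    """Solve (X^T X + lam I) C = X^T H via a Cholesky factor. H has shape K x BF."""
    P = X.shape[1]
    A = X.T @ X + lam * np.eye(P)
    L = np.linalg.cholesky(A)            # A = L L^T, L lower-triangular
    y = np.linalg.solve(L, X.T @ H)      # forward solve  L y = X^T H
    return np.linalg.solve(L.T, y)       # backward solve L^T C = y

def cheb_predict(coef, tau_star, M):
    """Evaluate the fitted polynomial at tau_star: x* C, shape (BF,)."""
    return (cheb_design([tau_star], M) @ coef)[0]

# Toy history: K = 6 observations of a 2-dimensional feature that is quadratic in tau.
M = 4
taus = np.linspace(-1.0, 1.0, 6)
truth = lambda tau: np.stack([1.0 + tau, tau ** 2], axis=-1)
coef = ridge_fit(cheb_design(taus, M), truth(taus), lam=1e-8)
pred = cheb_predict(coef, 0.8, M)        # close to truth(0.8) = [1.8, 0.64]
```

With a very small lam the fit reproduces a polynomial history almost exactly; larger values of lam trade that accuracy for a better-conditioned solve.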

Discrete Taylor Predictor

The Newton forward-difference predictor of order d uses the last d+1 observations:

First order (d=1):

\hat{h}^{\text{taylor}}(t^*) = h_i + k \cdot \Delta h_i, \quad k = \frac{t^* - t_i}{t_i - t_{i-1}}, \quad \Delta h_i = h_i - h_{i-1}

Second order (d=2):

\hat{h}^{\text{taylor}}(t^*) = h_i + k \Delta h_i + \frac{k(k-1)}{2} \Delta^2 h_i, \quad \Delta^2 h_i = h_i - 2h_{i-1} + h_{i-2}

Third order (d=3):

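\hat{h}^{\text{taylor}}(t^*) = \hat{h}^{(2)}(t^*) + \frac{k(k-1)(k-2)}{6} \Delta^3 h_i, \quad \Delta^3 h_i = h_i - 3h_{i-1} + 3h_{i-2} - h_{i-3}

where \hat{h}^{(2)}(t^*) is the second-order prediction above.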

Blended Prediction

The final prediction blends Chebyshev and Taylor with weight w \in [0,1]:

\hat{h}(t^*) = (1-w)\,\hat{h}^{\text{taylor}}(t^*) + w\,\hat{h}^{\text{cheb}}(t^*)

This is then unflattened back to (B, L, C) and fed to the output head.
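
A minimal sketch of the first-order Taylor predictor and the blend, written for scalars but equally valid for arrays or tensors. Only the d=1 case is shown, and the function names are hypothetical.

```python
def taylor_first_order(h_i, h_prev, t_i, t_prev, t_star):
    """First-order discrete Taylor prediction: h_i + k * (h_i - h_prev)."""
    k = (t_star - t_i) / (t_i - t_prev)
    return h_i + k * (h_i - h_prev)

def blend(w, h_taylor, h_cheb):
    """w = 0 gives pure Taylor, w = 1 gives pure Chebyshev."""
    return (1.0 - w) * h_taylor + w * h_cheb

h_taylor = taylor_first_order(h_i=2.0, h_prev=1.0, t_i=4.0, t_prev=2.0, t_star=5.0)  # 2.5
h_hat = blend(0.5, h_taylor, h_cheb=3.5)  # 3.0
```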

Cached Forward

Given \hat{h}(t^*), the cached step computes:

\mathbf{v} = \text{time\_embed}(t^*) + \text{guidance\_embed}(g) + \text{vector\_embed}(\mathbf{y})
\text{output} = \text{postprocess}\bigl(\text{final\_layer}(\hat{h},\ \mathbf{v})\bigr)

This replaces the full transformer forward at a fraction of the cost.


Supported Models

| Model | Adapter | Hook Target | Notes |
|---|---|---|---|
| Flux.1-dev | FluxAdapter | model.final_layer | Guidance embedding enabled |
| Flux.1-schnell | FluxAdapter | model.final_layer | Guidance embedding skipped (Identity) |
| Flux2 | FluxAdapter | model.final_layer | Same layout as Flux |
| Flux2-Klein | FluxAdapter (via alias) | model.final_layer | Same layout as Flux2 |
| Chroma | ChromaAdapter | model.final_layer | Distilled guidance; get_modulations() |

KV Cache Control

The KV Cache Control node accelerates multi-reference image editing for FLUX.2 Klein models by caching the attention keys and values (K/V) of the reference tokens.

Key Features:

  • Works with both 4B and 9B variants using existing models
  • Up to 2.66× speedup for workflows with 4 reference images
  • Automatic cache extraction on step 0, reuse on subsequent steps
  • Memory efficient: ~2-4 MB per reference token

How It Works:

sequenceDiagram
    participant User
    participant Node as KV Cache Control
    participant Model as Flux2 Model
    participant Cache as KV Cache

    User->>Node: Provide model + reference images
    Node->>Model: forward_kv_extract (Step 0)
    Model->>Cache: Store K/V pairs for ref tokens
    Model-->>Node: Return prediction
    Node->>User: Return intermediate state

    loop Steps 1 to N
        User->>Node: Continue denoising
        Node->>Model: forward_kv_cached (Steps 1+)
        Model->>Cache: Read cached K/V pairs
        Cache-->>Model: Inject cached K/V
        Model->>Node: Return prediction
        Node->>User: Return intermediate state
    end
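
The extract-once, reuse-afterwards routing in the diagram can be sketched as a small wrapper. KVCacheRouter and the two callables are hypothetical; they stand in for the forward_kv_extract and forward_kv_cached paths named above, whose real signatures are not documented here.

```python
class KVCacheRouter:
    """Route step 0 to an extract pass and later steps to a cached pass."""

    def __init__(self, extract_fn, cached_fn):
        self.extract_fn = extract_fn   # hypothetical: returns (prediction, kv)
        self.cached_fn = cached_fn     # hypothetical: takes (x, kv), returns prediction
        self.kv = None

    def reset(self):
        """Drop cached K/V, e.g. when the reference images change."""
        self.kv = None

    def __call__(self, x, step):
        if step == 0 or self.kv is None:
            prediction, self.kv = self.extract_fn(x)
            return prediction
        return self.cached_fn(x, self.kv)

# Toy stand-ins that only record which path ran.
calls = []

def fake_extract(x):
    calls.append("extract")
    return x, {"k": "ref-keys", "v": "ref-values"}

def fake_cached(x, kv):
    calls.append("cached")
    return x

router = KVCacheRouter(fake_extract, fake_cached)
for step in range(4):
    router(x=step, step=step)
print(calls)   # ['extract', 'cached', 'cached', 'cached']
```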

Usage:

  1. Load FLUX.2 Klein model (4B or 9B)
  2. Load reference images (optional)
  3. Add KV Cache Control node from model/kv_cache category
  4. Connect:
    • model: From loaded model
    • reference_images: Optional batch of reference images
    • enable_cache: Set to True
  5. Connect to KSampler as usual

For complete documentation, see KV Cache Control Documentation.


Architecture

classDiagram
    class ApplySpectrum {
        +apply(model, w, lam, m, ...) MODEL
        -clone model
        -create SpectrumRunState
        -install wrappers
    }

    class SpectrumRunState {
        +w, lam, m, window_size, flex_window
        +warmup_steps, taylor_order
        +cnt, num_consecutive_cached_steps
        +curr_ws, forecaster, adapter
        +init_for_run(num_steps)
        +is_actual_step() bool
        +after_step(actual_forward)
        +update_forecaster(t, h)
        +predict(step) Tensor
    }

    class Spectrum {
        +cheb ChebyshevForecaster
        +taylor_order, w
        +predict(t_star) Tensor
        +update(t, h)
        -_local_taylor_discrete(t_star) Tensor
    }

    class ChebyshevForecaster {
        +M, K, lam
        +t_buf, _H_buf, _coef
        +update(t, h)
        +predict(t_star) Tensor
        -_fit_if_needed()
        -_taus(t) Tensor
        -_build_design(taus) Tensor
    }

    class BaseSpectrumAdapter {
        <<abstract>>
        +supports(model) bool
        +hook_target(model) Module
        +compute_vec_orig(...) Tensor
        +run_head(model, h, vec) Tensor
        +postprocess_output(...) Tensor
    }

    class FluxAdapter {
        +supports(model) bool
        +hook_target(model) final_layer
        +compute_vec_orig(...) Tensor
        +run_head(model, h, vec) Tensor
        +postprocess_output(...) Tensor
    }

    class ChromaAdapter {
        +MOD_INDEX_LENGTH 344
        +supports(model) bool
        +compute_vec_orig(...) Tensor
    }

    ApplySpectrum --> SpectrumRunState
    SpectrumRunState --> Spectrum
    SpectrumRunState --> BaseSpectrumAdapter
    Spectrum --> ChebyshevForecaster
    FluxAdapter --|> BaseSpectrumAdapter
    ChromaAdapter --|> BaseSpectrumAdapter

Data flow on an actual step

sequenceDiagram
    participant S as Sampler
    participant W as DiffusionWrapper
    participant M as Transformer
    participant FL as final_layer
    participant FC as ChebyshevForecaster

    S->>W: invoke(x, t, ...)
    W->>W: is_actual_step? → True
    W->>FL: register_forward_pre_hook
    W->>M: executor(x, t, ...)
    M->>FL: forward(h, vec)
    FL-->>W: hook captures h
    M-->>W: out
    W->>FL: hook.remove()
    W->>FC: update(t, h)
    W-->>S: out

Data flow on a cached step

sequenceDiagram
    participant S as Sampler
    participant W as DiffusionWrapper
    participant FC as Spectrum forecaster
    participant A as FluxAdapter
    participant FL as final_layer

    S->>W: invoke(x, t, ...)
    W->>W: is_actual_step? → False
    W->>FC: predict(t)
    FC->>FC: _fit_if_needed (Cholesky)
    FC->>FC: Chebyshev + Taylor blend
    FC-->>W: ĥ (B, L, C)
    W->>A: compute_vec_orig(t, y, guidance)
    A-->>W: vec_orig
    W->>A: run_head(model, ĥ, vec_orig)
    A->>FL: forward(ĥ, vec_orig)
    FL-->>A: head_out
    A->>A: postprocess_output (rearrange)
    A-->>W: out (B, C, H, W)
    W-->>S: out

Performance

Measured speedup depends heavily on warmup_steps, window_size, and total step count. Representative results on Flux.1-dev:

| Steps | Warmup | Window | Actual Fwds | Speedup |
|---|---|---|---|---|
| 14 | 3 | 2.0 | 7 / 14 | ~1.8× |
| 20 | 5 | 2.0 | 9 / 20 | ~2.0× |
| 28 | 5 | 2.0 | 10 / 28 | ~2.5× |
| 28 | 3 | 3.0 | 7 / 28 | ~3.2× |

Note: Spectrum is not beneficial for fewer than ~8 steps. The warmup overhead dominates short runs.


Troubleshooting

UNSTABLE PREDICTION DETECTED / NaN every cached step

The Chebyshev fit is numerically degenerate when fewer than P+1 = M+2 actual observations are available. This happens when warmup_steps is too small relative to m.

Fix: Set warmup_steps ≥ m + 2 (default m=4 → minimum warmup_steps=6).
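
A quick way to check a parameter combination before running is sketched below. check_warmup is a hypothetical helper, not part of the node; it only encodes the m + 2 rule stated above.

```python
def check_warmup(warmup_steps, m):
    """Return the smallest safe warmup and whether the current setting meets it."""
    required = m + 2           # P + 1 observations, with P = m + 1 coefficients
    return required, warmup_steps >= required

required, ok = check_warmup(warmup_steps=5, m=4)
print(required, ok)   # 6 False
```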

Corrupted / blurry images despite no NaN

The Chebyshev extrapolation is drifting from the true features. Try:

  • Increasing lam (e.g. 0.1 → 0.5) to regularise the polynomial fit.
  • Decreasing window_size to force more actual forwards.
  • Setting w=1.0 (pure Chebyshev) or w=0.0 (pure Taylor) to isolate which predictor is misbehaving.

torch._dynamo hit config.recompile_limit

The wrapper or predictor code is being recompiled by torch.compile on each new step index. Ensure:

  • @torch._dynamo.disable is applied to the sampler and apply_model wrappers in step_injector.py.
  • No Python .item() calls appear in any code path that runs inside a torch.compile region.
  • t_star is passed as a float tensor, not a raw Python int.

Selected FluxAdapter for this model printed every step

The adapter is being re-detected each step instead of being cached. This was fixed in the March 2026 update — ensure SpectrumRunState.adapter is set and reused.

Very slow first step / model loading

This is normal: the first step loads the model onto GPU. Spectrum does not affect load time.


Adding New Models

Implement a subclass of BaseSpectrumAdapter in comfy/adapters/:

from .base import BaseSpectrumAdapter

class MyModelAdapter(BaseSpectrumAdapter):

    @classmethod
    def supports(cls, diffusion_model) -> bool:
        # Return True if this is your model type
        return hasattr(diffusion_model, "my_final_layer")

    def hook_target(self, model):
        # Return the nn.Module whose first input is the feature to cache
        return model.my_final_layer

    def compute_vec_orig(self, model, timestep, y, guidance, device, dtype):
        # Reconstruct the conditioning vector (time + guidance + vector embed)
        # that your final_layer expects as its second argument.
        # my_timestep_embedding and model.time_proj are placeholders: substitute
        # your model's own timestep-embedding function and projection layer.
        t_emb = my_timestep_embedding(timestep, model.hidden_size)
        return model.time_proj(t_emb)

    def run_head(self, model, hidden, vec_orig):
        # Run only the output head on the predicted feature
        return model.my_final_layer(hidden, vec_orig)

    def postprocess_output(self, model, head_output, x, img_tokens):
        # Convert head_output to the same shape as full forward output
        # For patch-based models, rearrange from (B, tokens, C) to (B, C, H, W)
        from einops import rearrange
        bs, c, h, w = x.shape
        p = model.patch_size
        out = head_output[:, :img_tokens]
        return rearrange(out, "b (h w) (p q c) -> b c (h p) (w q)",
                         h=h//p, w=w//p, p=p, q=p)

Then register it in comfy/adapters/__init__.py:

from .my_model import MyModelAdapter

def get_adapter_for_model(diffusion_model):
    if MyModelAdapter.supports(diffusion_model):
        return MyModelAdapter()
    if ChromaAdapter.supports(diffusion_model):
        return ChromaAdapter()
    if FluxAdapter.supports(diffusion_model):
        return FluxAdapter()
    return None

See docs/MODELS.md for per-model notes on feature location and head structure.


Dependencies

  • ComfyUI with Flux/Flux2/Chroma support
  • PyTorch ≥ 2.0 (bfloat16, torch.linalg.cholesky)
  • einops (for spatial rearrangement in adapters)

Both torch and einops are standard in ComfyUI environments — no additional installation needed.


License

MIT