ComfyUI_WolfSpectrum
Training-free diffusion sampling acceleration via Spectrum — adaptive spectral feature forecasting for ComfyUI. Skips full transformer forwards on selected steps by predicting intermediate features using Chebyshev polynomial regression, blended with a discrete Taylor predictor.
Reference: Han et al., "Adaptive Spectral Feature Forecasting for Diffusion Sampling Acceleration", CVPR 2026, arXiv:2603.01623.
Table of Contents
- How It Works
- Installation
- Usage
- Parameters
- Step Schedule
- Mathematical Specification
- Supported Models
- KV Cache Control
- Architecture
- Performance
- Troubleshooting
- Adding New Models
- Dependencies
How It Works
Standard diffusion sampling runs the full transformer denoiser at every step. Spectrum observes that the intermediate hidden states (the pre-output features of the final layer) evolve smoothly across timesteps. Instead of recomputing them from scratch each step, Spectrum fits a Chebyshev polynomial to the history of observed features and extrapolates forward — then runs only the cheap output head on the predicted feature.
flowchart TD
A[Diffusion Step t] --> B{Warmup or<br/>Actual Step?}
B -- Yes --> C[Full Transformer Forward]
C --> D[Hook captures h at final_layer]
D --> E[Update Chebyshev forecaster]
E --> F[Return output]
B -- No --> G[Predict h̃ via Chebyshev + Taylor blend]
G --> H[Compute vec_orig<br/>timestep + guidance embeddings]
H --> I[Run output head only<br/>head h̃ vec_orig]
I --> J[Postprocess to latent shape]
J --> F
The key insight is that final_layer is cheap — it's a single linear projection. The expensive part is the stack of double and single transformer blocks before it. Spectrum skips those on cached steps.
Installation
cd ComfyUI/custom_nodes
git clone <repo_url> ComfyUI_WolfSpectrum
Restart ComfyUI. The node Apply Spectrum will appear under model/spectrum.
No extra pip install is required if ComfyUI already runs Flux or Chroma.
Usage
- Load a supported model (Flux, Flux2, Flux2-Klein, or Chroma).
- Add the Apply Spectrum node and connect your model to it.
- Connect the output model to your sampler as usual.
- Set parameters (defaults are a safe starting point).
- Run — fewer full transformer passes will be executed after warmup.
graph LR
A[Load Model] --> B[Apply Spectrum]
B --> C[KSampler / SamplerCustomAdvanced]
C --> D[VAE Decode]
D --> E[Image]
Parameters
| Parameter | Default | Range | Description |
|---|---|---|---|
| w | 0.5 | 0.0–1.0 | Blend weight: 0 = pure Taylor, 1 = pure Chebyshev. Recommended 0.5–1.0. |
| lam | 0.1 | 1e-5–1.0 | Ridge regularisation λ for the Chebyshev least-squares fit. |
| m | 4 | 1–10 | Chebyshev polynomial order M; the fit uses P = M+1 basis functions T_0 … T_M. |
| window_size | 2.0 | 1.0–10.0 | Initial window size \mathcal{N} — how many steps between actual forwards. |
| flex_window | 0.75 | 0.0–2.0 | \alpha: window size increment per actual step (adaptive scheduling). |
| warmup_steps | 5 | 0–30 | Number of initial steps that always run the full transformer. |
| taylor_order | 1 | 1–3 | Order of the discrete Newton–Taylor predictor used in the blend. |
Quick tuning guide:
- More speed, lower quality: increase window_size, decrease warmup_steps.
- More quality, less speed: increase warmup_steps, decrease window_size.
- Noisy/corrupted images: increase warmup_steps to at least M+2 (= m+2), or increase lam.
- Short runs (< 10 steps): Spectrum is not beneficial below ~8 steps; use it only for 14+ step inference.
Step Schedule
The schedule determines which steps run the full transformer vs. the cached head.
Warmup phase
For the first warmup_steps steps, the full transformer always runs. This populates the Chebyshev sliding window with enough observations to fit a stable polynomial. A minimum of M+2 actual observations is required before any cached step is attempted.
Post-warmup: adaptive window
After warmup, the step is an actual forward if:
\text{actual\_forward} \iff \bigl(n_{\text{cached}} + 1\bigr) \bmod \lfloor \mathcal{N} \rfloor = 0
where n_{\text{cached}} is the number of consecutive cached steps since the last actual forward, and \mathcal{N} is the current window size.
After each actual forward, the window grows:
\mathcal{N} \leftarrow \mathcal{N} + \alpha
This means caching intervals lengthen as the run progresses (features change more slowly near the end of denoising).
Example: 14 steps, warmup=3, window=2.0, flex=0.75
gantt
title Step Schedule (A = Actual Forward, C = Cached)
dateFormat X
axisFormat %s
section Steps
A Warmup 0 :milestone, 0, 0
A Warmup 1 :milestone, 1, 1
A Warmup 2 :milestone, 2, 2
C Cached 3 :done, 3, 4
A Actual 4 :milestone, 4, 4
C Cached 5 :done, 5, 6
A Actual 6 :milestone, 6, 6
C Cached 7 :done, 7, 8
C Cached 8 :done, 8, 9
A Actual 9 :milestone, 9, 9
C Cached 10 :done, 10, 11
C Cached 11 :done, 11, 12
C Cached 12 :done, 12, 13
A Actual 13 :milestone, 13, 13
Result: 7 actual forwards out of 14 steps ≈ 2× speedup on transformer compute.
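The schedule rule can be reproduced in a few lines of Python. This is a minimal sketch rather than the node's actual code: the function name simulate_schedule is hypothetical, and it assumes the window stays fixed during warmup and only grows after post-warmup actual forwards, which is the behaviour that matches the 14-step example above.

```python
import math


def simulate_schedule(num_steps, warmup_steps, window_size, flex_window):
    """Return one label per step: 'A' (actual forward) or 'C' (cached).

    Assumption: the window is fixed during warmup and grows by
    flex_window after every post-warmup actual forward.
    """
    labels = []
    window = float(window_size)
    n_cached = 0  # consecutive cached steps since the last actual forward
    for step in range(num_steps):
        if step < warmup_steps:
            actual = True
        else:
            actual = (n_cached + 1) % math.floor(window) == 0
        if actual:
            labels.append("A")
            n_cached = 0
            if step >= warmup_steps:
                window += flex_window  # N <- N + alpha
        else:
            labels.append("C")
            n_cached += 1
    return labels


labels = simulate_schedule(14, warmup_steps=3, window_size=2.0, flex_window=0.75)
actual_steps = [i for i, s in enumerate(labels) if s == "A"]
print("".join(labels))  # AAACACACCACCCA
print(actual_steps)     # [0, 1, 2, 4, 6, 9, 13]
```

Dividing the total step count by the number of "A" labels gives an upper bound on the transformer-compute speedup, since the output head and the forecaster still cost something on cached steps.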
Mathematical Specification
Feature Observation
At each actual step i, the forward pre-hook on final_layer captures:
h_i = \text{final\_layer\_input}(\mathbf{x}_{t_i}) \in \mathbb{R}^{B \times L \times C}
This is stored flattened as \bar{h}_i \in \mathbb{R}^{B \times F} where F = L \cdot C.
Chebyshev Basis
Diffusion timesteps t \in [0, 50] are mapped to \tau \in [-1, 1]:
\tau = \frac{2(t - t_{\min})}{t_{\max} - t_{\min}} - 1, \quad t_{\min} = 0,\ t_{\max} = 50
The Chebyshev basis evaluates M+1 polynomials via the recurrence:
T_0(\tau) = 1, \quad T_1(\tau) = \tau, \quad T_m(\tau) = 2\tau T_{m-1}(\tau) - T_{m-2}(\tau)
The design matrix for K observed timesteps is:
\mathbf{X} = \begin{bmatrix} T_0(\tau_1) & T_1(\tau_1) & \cdots & T_M(\tau_1) \\ \vdots & & & \vdots \\ T_0(\tau_K) & T_1(\tau_K) & \cdots & T_M(\tau_K) \end{bmatrix} \in \mathbb{R}^{K \times P}, \quad P = M+1
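As an illustration, the tau mapping, the recurrence, and the design matrix can be written in plain Python. This is a sketch with hypothetical helper names (to_tau, chebyshev_row, design_matrix); the node itself works on tensors, and nested lists are used here only to keep the example dependency-free.

```python
def to_tau(t, t_min=0.0, t_max=50.0):
    """Map a timestep t in [t_min, t_max] to tau in [-1, 1]."""
    return 2.0 * (t - t_min) / (t_max - t_min) - 1.0


def chebyshev_row(tau, M):
    """Return [T_0(tau), ..., T_M(tau)] via the three-term recurrence."""
    row = [1.0]
    if M >= 1:
        row.append(tau)
    for _ in range(2, M + 1):
        row.append(2.0 * tau * row[-1] - row[-2])
    return row


def design_matrix(timesteps, M):
    """Stack one basis row per observed timestep: K rows, M + 1 columns."""
    return [chebyshev_row(to_tau(t), M) for t in timesteps]


X = design_matrix([0.0, 25.0, 50.0], M=2)
for row in X:
    print(row)
# [1.0, -1.0, 1.0]
# [1.0, 0.0, -1.0]
# [1.0, 1.0, 1.0]
```

The three timesteps map to tau = -1, 0, 1, so each row is the basis evaluated at an endpoint or the midpoint of the interval.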
Ridge Regression
Coefficients are fit by ridge regression (regularisation \lambda):
\hat{\mathbf{C}} = \bigl(\mathbf{X}^\top \mathbf{X} + \lambda \mathbf{I}_P\bigr)^{-1} \mathbf{X}^\top \mathbf{H}
where \mathbf{H} \in \mathbb{R}^{K \times B \times F} is the history buffer (reshaped to K \times BF for the solve), and \hat{\mathbf{C}} \in \mathbb{R}^{P \times B \times F}.
The solve uses a Cholesky factorisation for numerical stability:
\mathbf{L}\mathbf{L}^\top = \mathbf{X}^\top \mathbf{X} + \lambda \mathbf{I}_P, \quad \hat{\mathbf{C}} = \text{cholesky\_solve}(\mathbf{X}^\top \mathbf{H},\ \mathbf{L})
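The ridge solve can be sketched without any tensor library. The helpers below (cholesky, cholesky_solve, ridge_fit) are hypothetical stand-ins written in plain Python for a single feature column. A real implementation would presumably use torch.linalg.cholesky and torch.cholesky_solve on the whole K × BF matrix at once.

```python
import math


def cholesky(A):
    """Lower-triangular L with L @ L^T == A, for symmetric positive-definite A."""
    n = len(A)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(L[i][k] * L[j][k] for k in range(j))
            if i == j:
                L[i][j] = math.sqrt(A[i][i] - s)
            else:
                L[i][j] = (A[i][j] - s) / L[j][j]
    return L


def cholesky_solve(L, b):
    """Solve (L @ L^T) x = b for one right-hand-side vector b."""
    n = len(L)
    y = [0.0] * n
    for i in range(n):  # forward substitution: L y = b
        y[i] = (b[i] - sum(L[i][k] * y[k] for k in range(i))) / L[i][i]
    x = [0.0] * n
    for i in reversed(range(n)):  # back substitution: L^T x = y
        x[i] = (y[i] - sum(L[k][i] * x[k] for k in range(i + 1, n))) / L[i][i]
    return x


def ridge_fit(X, h, lam):
    """Coefficients (X^T X + lam I)^-1 X^T h for a single feature column h."""
    K, P = len(X), len(X[0])
    gram = [[sum(X[k][i] * X[k][j] for k in range(K)) + (lam if i == j else 0.0)
             for j in range(P)] for i in range(P)]
    rhs = [sum(X[k][i] * h[k] for k in range(K)) for i in range(P)]
    return cholesky_solve(cholesky(gram), rhs)


# Design matrix for tau = -1, 0, 1 with M = 1 (columns T_0, T_1).
X = [[1.0, -1.0], [1.0, 0.0], [1.0, 1.0]]
h = [1.0, 3.0, 5.0]  # exactly 3 + 2 * tau
coef = ridge_fit(X, h, lam=1e-8)
print(coef)  # close to [3.0, 2.0]; the tiny lam shrinks it very slightly
```

Adding lam to the diagonal keeps the Gram matrix positive definite even when the observed timesteps are few or nearly collinear, which is why the factorisation does not fail on short histories. Larger lam values trade fit accuracy for smoother coefficients.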
Chebyshev Prediction
At a target step t^*, the Chebyshev prediction is:
\hat{h}^{\text{cheb}}(t^*) = \mathbf{x}^* \hat{\mathbf{C}}, \quad \mathbf{x}^* = \bigl[T_0(\tau^*),\ T_1(\tau^*),\ \ldots,\ T_M(\tau^*)\bigr] \in \mathbb{R}^{1 \times P}
Discrete Taylor Predictor
The Newton forward-difference predictor of order d uses the last d+1 observations:
First order (d=1):
\hat{h}^{\text{taylor}}(t^*) = h_i + k \cdot \Delta h_i, \quad k = \frac{t^* - t_i}{t_i - t_{i-1}}, \quad \Delta h_i = h_i - h_{i-1}
Second order (d=2):
\hat{h}^{\text{taylor}}(t^*) = h_i + k \Delta h_i + \frac{k(k-1)}{2} \Delta^2 h_i, \quad \Delta^2 h_i = h_i - 2h_{i-1} + h_{i-2}
Third order (d=3):
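\hat{h}^{\text{taylor}}(t^*) = \hat{h}^{(2)}(t^*) + \frac{k(k-1)(k-2)}{6} \Delta^3 h_i, \quad \Delta^3 h_i = h_i - 3h_{i-1} + 3h_{i-2} - h_{i-3}
where \hat{h}^{(2)}(t^*) denotes the second-order prediction above.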
Blended Prediction
The final prediction blends Chebyshev and Taylor with weight w \in [0,1]:
\hat{h}(t^*) = (1-w)\,\hat{h}^{\text{taylor}}(t^*) + w\,\hat{h}^{\text{cheb}}(t^*)
This is then unflattened back to (B, L, C) and fed to the output head.
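The first-order predictor and the blend are easy to check by hand. The sketch below uses hypothetical function names (taylor_first_order, blend) and short Python lists in place of the flattened feature tensors. The Chebyshev prediction is a made-up stand-in value, not the output of a real fit.

```python
def taylor_first_order(t_hist, h_hist, t_star):
    """First-order discrete predictor: h_i + k * (h_i - h_{i-1})."""
    t_prev, t_last = t_hist[-2], t_hist[-1]
    h_prev, h_last = h_hist[-2], h_hist[-1]
    k = (t_star - t_last) / (t_last - t_prev)
    return [hl + k * (hl - hp) for hl, hp in zip(h_last, h_prev)]


def blend(h_taylor, h_cheb, w):
    """(1 - w) * Taylor + w * Chebyshev, element-wise."""
    return [(1.0 - w) * a + w * b for a, b in zip(h_taylor, h_cheb)]


t_hist = [10.0, 12.0]              # timesteps of the last two actual forwards
h_hist = [[1.0, 4.0], [2.0, 6.0]]  # two features per observation
h_taylor = taylor_first_order(t_hist, h_hist, t_star=13.0)
h_cheb = [3.0, 8.0]                # stand-in for a Chebyshev prediction
print(h_taylor)                        # [2.5, 7.0]
print(blend(h_taylor, h_cheb, w=0.5))  # [2.75, 7.5]
```

With w=0.0 the blend returns the Taylor prediction unchanged and with w=1.0 it returns the Chebyshev prediction, which is what makes those two settings useful for isolating a misbehaving predictor.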
Cached Forward
Given \hat{h}(t^*), the cached step computes:
\mathbf{v} = \text{time\_embed}(t^*) + \text{guidance\_embed}(g) + \text{vector\_embed}(\mathbf{y})
\text{output} = \text{postprocess}\bigl(\text{final\_layer}(\hat{h},\ \mathbf{v})\bigr)
This replaces the full transformer forward at a fraction of the cost.
Supported Models
| Model | Adapter | Hook Target | Notes |
|---|---|---|---|
| Flux.1-dev | FluxAdapter | model.final_layer | Guidance embedding enabled |
| Flux.1-schnell | FluxAdapter | model.final_layer | Guidance embedding skipped (Identity) |
| Flux2 | FluxAdapter | model.final_layer | Same layout as Flux |
| Flux2-Klein | FluxAdapter (via alias) | model.final_layer | Same layout as Flux2 |
| Chroma | ChromaAdapter | model.final_layer | Distilled guidance; get_modulations() |
KV Cache Control
The KV Cache Control node enables accelerated multi-reference image editing for FLUX.2 Klein models using KV-caching technology.
Key Features:
- Works with both 4B and 9B variants using existing models
- Up to 2.66× speedup for workflows with 4 reference images
- Automatic cache extraction on step 0, reuse on subsequent steps
- Memory efficient: ~2-4 MB per reference token
How It Works:
sequenceDiagram
participant User
participant Node as KV Cache Control
participant Model as Flux2 Model
participant Cache as KV Cache
User->>Node: Provide model + reference images
Node->>Model: forward_kv_extract (Step 0)
Model->>Cache: Store K/V pairs for ref tokens
Cache-->>Node: Return prediction
Node->>User: Return intermediate state
loop Steps 1 to N
User->>Node: Continue denoising
Node->>Model: forward_kv_cached (Steps 1+)
Model->>Cache: Read cached K/V pairs
Cache-->>Model: Inject cached K/V
Model->>Node: Return prediction
Node->>User: Return intermediate state
end
Usage:
- Load FLUX.2 Klein model (4B or 9B)
- Load reference images (optional)
- Add the KV Cache Control node from the model/kv_cache category
- Connect:
  - model: From loaded model
  - reference_images: Optional batch of reference images
  - enable_cache: Set to True
- Connect to KSampler as usual
For complete documentation, see KV Cache Control Documentation.
Architecture
classDiagram
class ApplySpectrum {
+apply(model, w, lam, m, ...) MODEL
-clone model
-create SpectrumRunState
-install wrappers
}
class SpectrumRunState {
+w, lam, m, window_size, flex_window
+warmup_steps, taylor_order
+cnt, num_consecutive_cached_steps
+curr_ws, forecaster, adapter
+init_for_run(num_steps)
+is_actual_step() bool
+after_step(actual_forward)
+update_forecaster(t, h)
+predict(step) Tensor
}
class Spectrum {
+cheb ChebyshevForecaster
+taylor_order, w
+predict(t_star) Tensor
+update(t, h)
-_local_taylor_discrete(t_star) Tensor
}
class ChebyshevForecaster {
+M, K, lam
+t_buf, _H_buf, _coef
+update(t, h)
+predict(t_star) Tensor
-_fit_if_needed()
-_taus(t) Tensor
-_build_design(taus) Tensor
}
class BaseSpectrumAdapter {
<<abstract>>
+supports(model) bool
+hook_target(model) Module
+compute_vec_orig(...) Tensor
+run_head(model, h, vec) Tensor
+postprocess_output(...) Tensor
}
class FluxAdapter {
+supports(model) bool
+hook_target(model) final_layer
+compute_vec_orig(...) Tensor
+run_head(model, h, vec) Tensor
+postprocess_output(...) Tensor
}
class ChromaAdapter {
+MOD_INDEX_LENGTH 344
+supports(model) bool
+compute_vec_orig(...) Tensor
}
ApplySpectrum --> SpectrumRunState
SpectrumRunState --> Spectrum
SpectrumRunState --> BaseSpectrumAdapter
Spectrum --> ChebyshevForecaster
FluxAdapter --|> BaseSpectrumAdapter
ChromaAdapter --|> BaseSpectrumAdapter
Data flow on an actual step
sequenceDiagram
participant S as Sampler
participant W as DiffusionWrapper
participant M as Transformer
participant FL as final_layer
participant FC as ChebyshevForecaster
S->>W: invoke(x, t, ...)
W->>W: is_actual_step? → True
W->>FL: register_forward_pre_hook
W->>M: executor(x, t, ...)
M->>FL: forward(h, vec)
FL-->>W: hook captures h
M-->>W: out
W->>FL: hook.remove()
W->>FC: update(t, h)
W-->>S: out
Data flow on a cached step
sequenceDiagram
participant S as Sampler
participant W as DiffusionWrapper
participant FC as Spectrum forecaster
participant A as FluxAdapter
participant FL as final_layer
S->>W: invoke(x, t, ...)
W->>W: is_actual_step? → False
W->>FC: predict(t)
FC->>FC: _fit_if_needed (Cholesky)
FC->>FC: Chebyshev + Taylor blend
FC-->>W: ĥ (B, L, C)
W->>A: compute_vec_orig(t, y, guidance)
A-->>W: vec_orig
W->>A: run_head(model, ĥ, vec_orig)
A->>FL: forward(ĥ, vec_orig)
FL-->>A: head_out
A->>A: postprocess_output (rearrange)
A-->>W: out (B, C, H, W)
W-->>S: out
Performance
Measured speedup depends heavily on warmup_steps, window_size, and total step count. Representative results on Flux.1-dev:
| Steps | Warmup | Window | Actual Fwds | Speedup |
|---|---|---|---|---|
| 14 | 3 | 2.0 | 7 / 14 | ~1.8× |
| 20 | 5 | 2.0 | 9 / 20 | ~2.0× |
| 28 | 5 | 2.0 | 10 / 28 | ~2.5× |
| 28 | 3 | 3.0 | 7 / 28 | ~3.2× |
Note: Spectrum is not beneficial for fewer than ~8 steps. The warmup overhead dominates short runs.
Troubleshooting
UNSTABLE PREDICTION DETECTED / NaN every cached step
The Chebyshev fit is numerically degenerate when fewer than P+1 = M+2 actual observations are available. This happens when warmup_steps is too small relative to m.
Fix: Set warmup_steps ≥ m + 2 (default m=4 → minimum warmup_steps=6).
Corrupted / blurry images despite no NaN
The Chebyshev extrapolation is drifting from the true features. Try:
- Increasing lam (e.g. 0.1 → 0.5) to regularise the polynomial fit.
- Decreasing window_size to force more actual forwards.
- Setting w=1.0 (pure Chebyshev) or w=0.0 (pure Taylor) to isolate which predictor is misbehaving.
torch._dynamo hit config.recompile_limit
The wrapper or predictor code is being recompiled by torch.compile on each new step index. Ensure:
- @torch._dynamo.disable is applied to the sampler and apply_model wrappers in step_injector.py.
- No Python .item() calls appear in any code path that runs inside a torch.compile region.
- t_star is passed as a float tensor, not a raw Python int.
Selected FluxAdapter for this model printed every step
The adapter is being re-detected each step instead of being cached. This was fixed in the March 2026 update — ensure SpectrumRunState.adapter is set and reused.
Very slow first step / model loading
This is normal: the first step loads the model onto GPU. Spectrum does not affect load time.
Adding New Models
Implement a subclass of BaseSpectrumAdapter in comfy/adapters/:
from einops import rearrange

from .base import BaseSpectrumAdapter


class MyModelAdapter(BaseSpectrumAdapter):
    @classmethod
    def supports(cls, diffusion_model) -> bool:
        # Return True if this is your model type
        return hasattr(diffusion_model, "my_final_layer")

    def hook_target(self, model):
        # Return the nn.Module whose first input is the feature to cache
        return model.my_final_layer

    def compute_vec_orig(self, model, timestep, y, guidance, device, dtype):
        # Reconstruct the conditioning vector (time + guidance + vector embed)
        # that your final_layer expects as its second argument.
        # my_timestep_embedding is a placeholder for your model's own embedding function.
        t_emb = my_timestep_embedding(timestep, model.hidden_size)
        return model.time_proj(t_emb)

    def run_head(self, model, hidden, vec_orig):
        # Run only the output head on the predicted feature
        return model.my_final_layer(hidden, vec_orig)

    def postprocess_output(self, model, head_output, x, img_tokens):
        # Convert head_output to the same shape as the full forward output.
        # For patch-based models, rearrange from (B, tokens, C) to (B, C, H, W).
        bs, c, h, w = x.shape
        p = model.patch_size
        out = head_output[:, :img_tokens]  # keep only the image tokens
        return rearrange(out, "b (h w) (p q c) -> b c (h p) (w q)",
                         h=h // p, w=w // p, p=p, q=p)
Then register it in comfy/adapters/__init__.py:
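from .my_model import MyModelAdapter
# FluxAdapter and ChromaAdapter are assumed to be imported already in this module.


def get_adapter_for_model(diffusion_model):
    # Adapters are tried in order; the first one that matches wins.
    if MyModelAdapter.supports(diffusion_model):
        return MyModelAdapter()
    if ChromaAdapter.supports(diffusion_model):
        return ChromaAdapter()
    if FluxAdapter.supports(diffusion_model):
        return FluxAdapter()
    return None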
See docs/MODELS.md for per-model notes on feature location and head structure.
Dependencies
- ComfyUI with Flux/Flux2/Chroma support
- PyTorch ≥ 2.0 (bfloat16, torch.linalg.cholesky)
- einops (for spatial rearrangement in adapters)
Both torch and einops are standard in ComfyUI environments — no additional installation needed.
License
MIT
