Attention Gating: From Broadbent's Filter to Sparse Transformers
The cocktail party problem
You walk into a loud party. Dozens of overlapping conversations, music in the background, glasses clinking. Somehow, you can lock onto the person in front of you and tune the rest out. Or, if someone across the room says your name, you snap to it instantly. The brain has a filter sitting between sensory memory and working memory, and what passes through it is what you actually experience as "paying attention."
The biology
The first formal model came from Donald Broadbent in 1958. His filter model proposed an early, all-or-nothing selection: incoming sensory information is filtered based on simple physical features (which ear it came in, what pitch, what direction), and only the selected channel reaches conscious processing. Broadbent's model was elegant but too strict. It could not explain the cocktail-party effect (why your name still grabs your attention when it is on a "rejected" channel).
Treisman's attenuation model (1964) refined Broadbent: rejected information is not blocked, just turned down in volume. It can still bubble up if it is highly salient.
Mechanistically, neural oscillations seem to do the work. Theta rhythms at 4 to 8 Hz coordinate communication between the hippocampus and prefrontal cortex. Faster gamma cycles are nested inside each theta cycle, about 4 to 7 of them, which lines up neatly with the roughly four-item working memory limit (George Miller's famous 1956 estimate was "seven, plus or minus two"; modern estimates put the capacity closer to four). Each gamma cycle is thought to encode one "item." Chunking, also identified by Miller, lets you cheat the limit: if you can group raw items into a meaningful unit (a phone number's area code, a familiar chess pattern), the chunk counts as one item instead of several.
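Chunking is easy to see in miniature. The sketch below (illustrative only; the digit string and chunk sizes are made up) regroups ten raw digits into three phone-number-style chunks, dropping the item count under the working-memory limit:

```python
def chunk(items, sizes):
    """Miller-style chunking: regroup raw items into meaningful units.
    Ten digits blow past a ~4-item limit; three chunks fit comfortably."""
    out, i = [], 0
    for s in sizes:
        out.append("".join(items[i:i + s]))
        i += s
    return out

digits = list("4155550123")            # 10 raw items
chunks = chunk(digits, [3, 3, 4])      # area code + prefix + line number
print(len(digits), "items ->", len(chunks), "chunks:", chunks)
```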
The takeaway: attention is not magic. It is a filter that uses cheap physical features first, lets salient content sneak through, and works hand in hand with working memory to choose what gets a chance to be remembered at all.
The technology
The 2017 transformer paper "Attention Is All You Need" (Vaswani et al.) introduced scaled dot-product attention: every token computes a relevance score against every other token via softmax(QK^T / sqrt(d_k)) V. That is a global, dense filter, and also the fundamental bottleneck: the score matrix scales as O(n^2) in sequence length. Biological attention is sparse for very good reasons, so a generation of "sparse attention" mechanisms has emerged to bring transformers closer to the biological story:
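The dense formula is short enough to write out directly. A minimal NumPy sketch (single head, no masking or batching), where the n-by-n score matrix is exactly the O(n^2) cost discussed above:

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """softmax(Q K^T / sqrt(d_k)) V, per Vaswani et al. 2017.
    The (n, n) score matrix is the quadratic bottleneck."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)               # (n, n): every token vs. every token
    scores -= scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V                            # (n, d_v)

# toy example: 4 tokens with 8-dimensional queries/keys/values
rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(4, 8)) for _ in range(3))
out = scaled_dot_product_attention(Q, K, V)
print(out.shape)  # (4, 8)
```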
- BigBird (NeurIPS 2020) combines random tokens, sliding-window attention, and global "anchor" tokens. It is provably Turing complete and handles 8x longer sequences than vanilla attention on equivalent hardware.
- Longformer achieves O(n) complexity by combining a sliding window with a small set of global tokens. Structurally, it resembles working-memory rehearsal: each token attends to a fixed neighborhood plus a few "summary" anchors.
- Google Infini-attention (2024) splits attention into local (current context) and global compressive memory of everything older.
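The sliding-window-plus-global pattern shared by these designs reduces to a boolean mask over the score matrix. A Longformer-style sketch (window size and global-token count are illustrative, not the published hyperparameters):

```python
import numpy as np

def sparse_mask(n, window=1, n_global=1):
    """Which (query, key) pairs survive the gate: each token sees a local
    window; the first n_global 'anchor' tokens see, and are seen by, all."""
    mask = np.zeros((n, n), dtype=bool)
    for i in range(n):
        lo, hi = max(0, i - window), min(n, i + window + 1)
        mask[i, lo:hi] = True        # sliding window
    mask[:, :n_global] = True        # every token attends to the global tokens
    mask[:n_global, :] = True        # global tokens attend to everything
    return mask

m = sparse_mask(8, window=1, n_global=1)
print(int(m.sum()), "of", m.size, "score entries kept")  # O(n·w), not O(n^2)
```

In practice the masked-out scores are set to -inf before the softmax, so pruned pairs get exactly zero attention weight.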
Outside the transformer block, context compression is the technological version of cognitive chunking. LLMLingua (EMNLP 2023) achieves 20x compression with only 1.5% performance loss by using a small language model's perplexity to identify and remove low-information tokens. AutoCompressors (Chevalier et al., 2023) compress long contexts into compact "summary vectors" at a 40x compression rate. AttentionRAG (2025) uses attention scores to prune retrieved context.
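The perplexity-based idea can be sketched in a few lines. This is a toy stand-in for what LLMLingua does, not its actual algorithm: a real system gets per-token surprisal from a small language model, whereas here the scores and tokens are made-up inputs. Low-surprisal (predictable) tokens are dropped; order is preserved:

```python
def prune_low_information(tokens, surprisal, keep_ratio=0.5):
    """Keep the highest-surprisal tokens, preserving original order.
    `surprisal` would come from a small LM in a real compressor."""
    k = max(1, int(len(tokens) * keep_ratio))
    keep = sorted(range(len(tokens)), key=lambda i: surprisal[i], reverse=True)[:k]
    return [tokens[i] for i in sorted(keep)]

tokens    = ["the", "flight", "departs", "at", "9", "am", "from", "gate", "b12"]
surprisal = [0.1,   2.0,      1.5,       0.2,  3.0, 0.8,  0.3,    1.8,    3.5]
print(prune_low_information(tokens, surprisal, keep_ratio=0.5))
# high-information tokens survive: flight, 9, gate, b12
```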
RAG systems themselves function as the technological attention gate at a larger scale. RAGate (2024) implements explicit binary gating, predicting whether external retrieval is even needed for a given query. That is a near-direct port of Broadbent's filter logic.
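The gating logic itself is trivial; the hard part is the signal it thresholds. An illustrative sketch (not the RAGate model, which learns this decision; `confidence_score` here stands in for whatever learned signal a real gate uses):

```python
def needs_retrieval(query: str, confidence_score: float, threshold: float = 0.75) -> bool:
    """Broadbent-style binary gate: skip external retrieval when the
    model is already confident it can answer from parametric memory."""
    return confidence_score < threshold

# confident parametric answer -> gate closed, no retrieval
print(needs_retrieval("What is 2 + 2?", confidence_score=0.98))           # False
# niche factual query -> gate open, go retrieve
print(needs_retrieval("Q3 revenue of Acme Corp?", confidence_score=0.30)) # True
```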
Where the gap is
Sparse attention and context compression are well-developed. Production systems regularly handle hundreds of thousands of tokens efficiently. But there is a category of biological attention that AI has barely touched: spike-timing-dependent plasticity (STDP) attention. Real neurons gate information by precise spike timing, and the strength of a connection updates based on whether one neuron fired just before or just after another. STDP-based spiking transformers exist in research labs as of 2025, but they are nowhere near production.
Practical implication: you almost certainly do not need spiking attention. You probably do need to think harder about what your model is filtering out. If your context is too long and the model is missing the obvious, the right move is rarely "give it more tokens." It is "build a better gate."
← Previous: Working Memory · Series anchor · Next: Episodic Memory →