Close Menu
Techora News HubTechora News Hub
    Facebook X (Twitter) Instagram
    Techora News HubTechora News Hub
    • Home
    • Crypto News
      • Bitcoin
      • Ethereum
      • Altcoins
      • Blockchain
      • DeFi
    • AI News
    • Stock News
    • Learn
      • AI for Beginners
      • AI Tips
      • Make Money with AI
    • Reviews
    • Tools
      • Best AI Tools
      • Crypto Market Cap List
      • Stock Market Overview
      • Market Heatmap
    • Contact
    Techora News HubTechora News Hub
    Home»AI News»Together AI Open-Sources OSCAR: An Attention-Aware 2-Bit KV Cache Quantization System for Long-Context LLM Serving
    AI News

    Together AI Open-Sources OSCAR: An Attention-Aware 2-Bit KV Cache Quantization System for Long-Context LLM Serving

    May 25, 2026
    Facebook Twitter Pinterest Telegram LinkedIn Tumblr WhatsApp Email
    Together AI Open-Sources OSCAR: An Attention-Aware 2-Bit KV Cache Quantization System for Long-Context LLM Serving
    Share
    Facebook Twitter LinkedIn Pinterest Telegram Email
    kraken


    Long-context inference makes the KV cache one of the main costs of serving LLMs. During autoregressive decoding, the cache grows with context length, batch size, and model depth. At high batch sizes and long contexts with 100K tokens across dozens of concurrent requests the KV cache consumes a large fraction of GPU memory. Compressing it is a direct way to increase batch size and reduce memory traffic.

    The obvious approach is quantization. But pushing KV caches to INT2 (2-bit) precision has been largely impractical. Prior methods either collapse in accuracy or require custom serving layouts incompatible with paged KV-cache systems. Together AI’s OSCAR (Offline Spectral Covariance-Aware Rotation) addresses both problems.

    Why INT2 KV Cache Quantization is Hard

    KV activations contain channel-wise outliers. A small subset of channels holds extremely large values. Most channels are well-behaved. When you apply INT2 quantization which has only four representable levels and those outliers dominate the scale factor. The quantizer wastes most of its range on rare spikes. Normal values get compressed into just one or two effective levels. This degrades attention quality substantially.

    Rotation-based quantization addresses this by applying a fixed orthogonal transform, typically a Hadamard transform, to redistribute outlier energy across all channels. This approach works reasonably well at INT4. At INT2, a deeper problem remains: the rotation is data-oblivious. It can smooth activation ranges, but it does not know which directions the attention mechanism actually reads. Spreading quantization error uniformly is not the same as pushing it into low-importance directions. At INT2, with only four levels, that distinction determines whether the model works at all.

    synthesia
    https://arxiv.org/pdf/2605.17757v1

    What OSCAR Does Differently

    OSCAR’s key observation is that the rotation applied before quantization should be derived from attention statistics themselves — not from the raw distribution of KV activations.

    For keys, the downstream error that matters is not the Euclidean reconstruction error of K. It is the error in attention logits. The research team showed this error is: ‖QK⊤ − QK̂⊤‖²F = tr((K − K̂)Q⊤Q(K − K̂)⊤). The weighting matrix is the query covariance Q⊤Q, not K⊤K. Directions where queries have large energy amplify quantization errors in logits. OSCAR estimates the empirical query covariance CQ = (1/N) Σ qn⊤qn from a calibration set, eigen-decomposes it, and uses the eigenvectors UQ as the key rotation basis.

    For values, the relevant error is in the attention output SV. This depends on how the attention score matrix S weights each value row. The research team defines the score-weighted value covariance CS = (1/N) V⊤S⊤SV. Directions that remain large after aggregation by S are the ones quantization error propagates through. OSCAR uses the eigenvectors US of CS as the value rotation basis.

    The final composed rotations are:

    RK = UQ · HHad · PbrRV = US · HHad · Pbr

    Each of the three factors addresses a distinct failure mode of per-group low-bit quantization:

    • UQ / US aligns channels with attention-importance directions. This diagonalizes the error-weighting matrix so the most important directions are identifiable.
    • HHad (Walsh-Hadamard transform) then equalizes channel importance exactly. Lemma 1 in the research paper proves every diagonal entry of HHad⊤ Λ HHad equals tr(Λ)/d — the peaky eigenspectrum exposed by UQ is compressed to a uniform value across all channels.
    • Pbr (permuted bit-reversal) reorders channels so that for any power-of-two quantization group size, each group receives one representative from each level of the importance hierarchy.

    The research team provides Theorem 1 proving UQ and US are optimal under a frozen-error surrogate objective with diagonal residual assumptions.

    The Serving System: Mixed-Precision Cache Layout

    OSCAR integrates into SGLang’s production serving stack as an INT2 KV-cache mode with full compatibility with paged attention.

    The KV cache layout uses three regions per request:

    • Sink tokens (first S0 = 64 tokens): stored in BF16. These function as attention sinks.
    • Recent tokens (last W = 256 tokens before current position): stored in BF16.
    • History tokens (everything in between): stored as INT2 after OSCAR rotation and clipping.

    At 128K context length, the BF16 sink and recent windows represent only 0.24% of total tokens. The ablation (Table 5 in the research paper) shows (S=64, R=256) is the accuracy-efficiency knee: smaller windows noticeably hurt accuracy; larger windows give negligible additional benefit at higher BF16 memory cost.

    https://arxiv.org/pdf/2605.17757

    Write and read paths use fused Triton kernels. On the write path, each token is rotated, clipped to a calibration-derived percentile threshold (typical values: cK = 0.96, cV = 0.92), then quantized with per-token asymmetric INT2 at a default group size of GK = 64 channels per group. On the read path, the INT2 kernel unpacks bytes, dequantizes, inverse-rotates, and passes results to the attention kernel — all in one fused pass without extra memory traffic. The value rotation RV is absorbed into the model’s projection weights offline, eliminating its online compute cost.

    Outcome

    The research team evaluated OSCAR on four model configurations: Qwen3-4B-Thinking-2507, Qwen3-8B, Qwen3-32B, and GLM-4.7-FP8 (358B parameters). Benchmarks include AIME25, GPQA-Diamond, HumanEval, LiveCodeBench v6, and MATH500, all at 32K maximum generation length.

    Accuracy (at 2.28 bits per KV element):

    ModelBF16 MeanOSCAR MeanGap to BF16Qwen3-4B-Thinking-250775.6471.86−3.78Qwen3-8B70.8469.42−1.42Qwen3-32B74.1974.17−0.02GLM-4.7-FP8 (358B)77.8978.16+0.27

    For context on how competing methods compare: naive INT2 (no rotation) scores 0.00 on both Qwen3-4B and Qwen3-8B. QuaRot-INT2 (Hadamard-only rotation) scores 1.40 on Qwen3-4B and 10.14 on Qwen3-8B. TurboQuant at 3.25 bits drops 43.90 points on Qwen3-4B-Thinking. Saw-INT4 at 4.25 bits reaches 73.11 on Qwen3-4B — OSCAR at 2.28 bits reaches 71.86.

    https://arxiv.org/pdf/2605.17757

    The research team also compared against channel-wise methods on AIME25 (Table 1). On Qwen3-8B, OSCAR at 2.38 BPE achieves 66.67±3.33 — above KIVI-KV2* at 57.67 (2.26 BPE) and Kitty at 59.67 (2.39 BPE). Note that channel-wise methods require residual buffers or custom page layouts that do not fit standard paged-attention serving, so this comparison is limited to the single shared benchmark where results were available.

    Long-context robustness (RULER-NIAH):

    ModelMethod16K32K64K128KQwen3-4B-ThinkingBF1699.799.385.381.0Qwen3-4B-ThinkingQuaRot-INT20.00.015.60.0Qwen3-4B-ThinkingOSCAR97.887.661.939.5Qwen3-8BBF1698.997.379.278.2Qwen3-8BQuaRot-INT219.09.80.00.0Qwen3-8BOSCAR93.986.361.945.0

    On GLM-4.7-FP8, OSCAR matches the BF16 curve through 128K.

    Throughput (H100, 100K context, batch size 1):

    Decode throughput speedup relative to BF16, at increasing context lengths:

    Model30K60K100KQwen3-4B-Thinking1.98×2.52×3.08×Qwen3-8B1.84×2.29×2.88×GLM-4.7-FP81.98×2.49×2.83×

    At batch size 32, job-level throughput at 100K context reaches 6.17× over BF16 on Qwen3-4B-Thinking and 7.83× on GLM-4.7-FP8. The speedup increases with context length because decoding becomes increasingly KV-bandwidth-bound. Reducing KV memory by 8× directly reduces that bottleneck. The online rotation overhead is absorbed into the decode kernels.

    Marktechpost’s Visual Explainer

    OSCAR — How-To Guide
    01 / 08

    01

    Overview

    What is OSCAR?

    OSCAR (Offline Spectral Covariance-Aware Rotation) is a 2-bit KV cache quantization system from Together AI for long-context LLM serving.

    Instead of applying a generic Hadamard rotation, OSCAR derives attention-aware rotations from a one-time offline calibration pass — aligning quantization noise with directions that attention is least sensitive to.

    The result: INT2 precision with near-BF16 accuracy and full compatibility with paged KV-cache serving.

    8×
    KV Memory Reduction

    3×
    Decode Speedup

    2.28
    Bits Per KV Element

    02

    Setup

    Prerequisites

    Before getting started, make sure you have the following in place:

    • 01
      Hardware: NVIDIA H100 GPU (80 GB) recommended. A100 may work for smaller models.
    • 02
      SGLang installed: OSCAR is integrated into the SGLang serving framework. Install the latest version from source.
    • 03
      Triton: Custom fused kernels are written in Triton. Triton ships with most recent PyTorch / SGLang installs.
    • 04
      A supported model: Qwen3-4B, Qwen3-8B, Qwen3-32B, GLM-4.7-FP8, or MiniMax-M2.7. Pre-computed rotations are available for all of these.

    pip install sglang[all] –upgrade
    pip install triton

    03

    Step 1

    Download Pre-Computed Rotations via RotationZoo

    Together AI publishes pre-computed rotation matrices and clip thresholds for supported models in RotationZoo on ModelScope. No recalibration needed.

    from modelscope import snapshot_download

    # Download RotationZoo for your model
    rotation_path = snapshot_download(
    ‘togethercomputer/OSCAR-RotationZoo’
    )

    The downloaded artifact contains per-layer RK, RV rotation matrices and clip thresholds cK, cV for each supported model. These are fixed offline parameters — they are not updated at runtime.

    Qwen3-4B / 8B / 32B2.28 BPE

    GLM-4.7-FP8 (358B)2.28 BPE

    MiniMax-M2.72.28 BPE

    Custom (run calibration)any model

    04

    Step 2 (Optional)

    Run Offline Calibration for a Custom Model

    If your model is not in RotationZoo, run the one-time calibration pass. OSCAR dumps Q, K, V activations from a small dataset, estimates attention-aware covariance, and writes out rotation matrices and clip thresholds.

    python calibrate_oscar.py \
    –model-path /path/to/your-model \
    –calib-data gpqa_diamond \
    –calib-tokens 8192 \
    –output-dir ./oscar_rotations/

    Calibration is not task-specific. The paper shows that results are low-sensitivity to domain (MMLU, WikiText, GPQA-Diamond all produce similar accuracy). Run it once and reuse across all tasks.

    Typical values produced: cK ≈ 0.96, cV ≈ 0.92 per layer.

    05

    Step 3

    Launch SGLang with INT2 KV Cache Enabled

    Pass the rotation path and enable INT2 KV mode when launching the SGLang server.

    python -m sglang.launch_server \
    –model-path Qwen/Qwen3-8B \
    –kv-cache-dtype int2 \
    –oscar-rotation-path ./oscar_rotations/ \
    –oscar-sink-size 64 \
    –oscar-recent-size 256 \
    –tp 1 \
    –port 30000

    Tensor parallelism is supported. For Qwen3-32B use –tp 2 (2×H100). For GLM-4.7-FP8 use –tp 8 (8×H100).

    The server exposes a standard OpenAI-compatible API. No client-side changes are needed.

    06

    Step 4

    Key Configuration Parameters

    Parameter
    Default
    What it controls

    –oscar-sink-size
    64
    First N tokens kept in BF16 as attention sinks

    –oscar-recent-size
    256
    Last N tokens kept in BF16 before current position

    cK (clip ratio)
    0.96
    Percentile clip for rotated key activations

    cV (clip ratio)
    0.92
    Percentile clip for rotated value activations

    Group size GK
    64
    Channels per INT2 quantization group (head dim)

    The paper identifies (sink=64, recent=256) as the accuracy-efficiency knee. Smaller windows reduce accuracy noticeably; larger windows add BF16 memory overhead with negligible gain.

    07

    Step 5

    Run Inference and Verify

    Once the server is running, query it with the standard OpenAI client:

    from openai import OpenAI

    client = OpenAI(
    base_url=”http://localhost:30000/v1″,
    api_key=”none”
    )

    response = client.chat.completions.create(
    model=”Qwen/Qwen3-8B”,
    messages=[{“role”: “user”,
    “content”: “Your long-context prompt here”}],
    max_tokens=1024
    )
    print(response.choices[0].message.content)

    Prefix caching works out of the box. OSCAR preserves the standard paged KV-cache abstraction, so SGLang’s radix cache and prefix reuse function normally. No application-level changes are needed.

    08

    Results

    Accuracy vs BF16 Baseline

    Averaged across AIME25, GPQA-Diamond, HumanEval, LiveCodeBench v6, and MATH500 at 32K generation length.

    Qwen3-4B-Thinking

    −3.78

    Paper: arXiv:2605.17757   RotationZoo: modelscope.cn/models/togethercomputer/OSCAR-RotationZoo

    Key Takeaways

    • OSCAR quantizes LLM KV caches to 2-bit precision by rotating activations using attention-aware covariance matrices, not generic Hadamard transforms.
    • At 2.28 bits per KV element, OSCAR stays within 3.78 points of BF16 accuracy on Qwen3-4B-Thinking while naive INT2 collapses to zero.
    • KV cache memory drops approximately 8×, decode speed improves up to 3× at 100K context, and job-level throughput reaches up to 7.83× at large batch sizes.
    • Pre-computed rotation matrices for Qwen3-4B/8B/32B, GLM-4.7-FP8, and MiniMax-M2.7 are available in RotationZoo — no recalibration needed.
    • OSCAR integrates directly into SGLang with full paged KV-cache and prefix cache compatibility, requiring no changes to the inference client.

    Check out the Repo on GitHub, Modelscope and Research Paper. Also, feel free to follow us on Twitter and don’t forget to join our 150k+ ML SubReddit and Subscribe to our Newsletter. Wait! are you on telegram? now you can join us on telegram as well.

    Need to partner with us for promoting your GitHub Repo OR Hugging Face Page OR Product Release OR Webinar etc.? Connect with us



    Source link

    frase
    Share. Facebook Twitter Pinterest LinkedIn Tumblr Email

    Related Posts

    MiniMax teases upcoming M3 model with new sparse attention mechanism and 15.6X long-context response speed boost

    May 27, 2026

    Autonomous AI systems test governance in physical environments

    May 26, 2026

    Technology usually creates jobs for young, skilled workers. Will AI do the same? | MIT News

    May 24, 2026

    Valid certificates, stolen accounts: how attackers broke npm's last trust signal

    May 23, 2026

    AI gave China a god’s-eye view of its energy grid. No one else has this mapping.

    May 22, 2026

    Qwen Introduces Qwen3.7-Max: A Reasoning Agent Model With a 1M-Token Context Window

    May 21, 2026
    notion
    Latest Posts

    UK Sanctions Strike Russia-Linked Crypto Networks in Sweeping Crackdown

    May 28, 2026

    StakeDAO vsdCRV Attacker Limited to $91K By Thin Liquidity

    May 28, 2026

    What’s the Deal With Telus’s Dividend?

    May 28, 2026

    MiniMax teases upcoming M3 model with new sparse attention mechanism and 15.6X long-context response speed boost

    May 27, 2026

    AI engineer salary – What to expect: Junior to senior

    May 27, 2026
    notion
    LEGAL INFORMATION
    • Privacy Policy
    • Terms Of Service
    • Social Media Disclaimer
    • DMCA Compliance
    • Anti-Spam Policy
    Top Insights

    Avalanche hits RWA milestone as AVAX price holds key level

    May 28, 2026

    BIS Project Agorá Shows Tokenized Payments Cut Settlement Risk

    May 28, 2026
    binance
    Facebook X (Twitter) Instagram Pinterest
    © 2026 TechoraNewsHub.com - All rights reserved.

    Type above and press Enter to search. Press Esc to cancel.

    bitcoin
    Bitcoin (BTC) $ 73,778.00
    ethereum
    Ethereum (ETH) $ 2,020.42
    tether
    Tether (USDT) $ 0.998619
    bnb
    BNB (BNB) $ 641.52
    xrp
    XRP (XRP) $ 1.32
    usd-coin
    USDC (USDC) $ 0.999583
    solana
    Solana (SOL) $ 82.47
    tron
    TRON (TRX) $ 0.352769
    figure-heloc
    Figure Heloc (FIGR_HELOC) $ 1.03
    staked-ether
    Lido Staked Ether (STETH) $ 2,265.05