Overview

Compresr reduces LLM token costs through intelligent context compression.

Compression Models

espresso_v1

General-purpose compression — no query needed. Removes redundant tokens while preserving meaning. Ideal for pre-compressing documents, system prompts, or any context you want to reuse across multiple queries.
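To build intuition for what query-free compression does, here is a toy, rule-based sketch: it collapses whitespace and drops exact-duplicate sentences. The real espresso_v1 model is learned and far more capable; `toy_compress` is an illustrative stand-in, not the product's algorithm.

```python
import re

def toy_compress(text: str) -> str:
    """Toy stand-in for general-purpose compression: collapse runs of
    whitespace and drop exact-duplicate sentences (case-insensitive).
    A learned model would also remove semantically redundant tokens."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    seen, kept = set(), []
    for s in sentences:
        key = s.lower()
        if key not in seen:
            seen.add(key)
            kept.append(re.sub(r"\s+", " ", s))
    return " ".join(kept)

doc = "The cache is warm.   The cache is warm. Latency drops after warmup."
print(toy_compress(doc))
```

Because no query is involved, the compressed text can be cached and reused across many downstream requests.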

latte_v1

Query-specific compression that preserves tokens relevant to a given query. Ideal for RAG pipelines and Q&A systems where you want to keep answer-relevant information while compressing the rest.
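A toy sketch of the query-aware idea: rank sentences by word overlap with the query and keep only the best-matching ones, in their original order. Again, this is an illustrative stand-in; the actual latte_v1 model scores relevance at the token level rather than with word overlap.

```python
import re

# Small stoplist so function words don't dominate the overlap score.
STOPWORDS = {"what", "is", "the", "of", "a", "an", "in", "to"}

def toy_query_compress(text: str, query: str, keep: int = 1) -> str:
    """Toy stand-in for query-aware compression: keep the `keep` sentences
    with the most content-word overlap with the query, in original order."""
    def words(s: str) -> set[str]:
        return {w for w in re.findall(r"[a-z]+", s.lower()) if w not in STOPWORDS}

    query_words = words(query)
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    ranked = sorted(range(len(sentences)),
                    key=lambda i: len(words(sentences[i]) & query_words),
                    reverse=True)[:keep]
    return " ".join(sentences[i] for i in sorted(ranked))

ctx = ("Paris is the capital of France. "
       "Berlin is the capital of Germany. "
       "Rome is the capital of Italy.")
print(toy_query_compress(ctx, "What is the capital of Germany?"))
```

In a RAG pipeline, this is the shape of the trade-off latte_v1 makes: answer-relevant context survives while off-query material is compressed away.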

Quick Start

  1. Get your API key from the Dashboard
  2. Install the SDK: pip install compresr
  3. Start compressing — see the Quick Start guide