API reference
Models
The Compresr compression model surface — `latte_v1` and `latte_v2`, shared parameters, and the canonical meaning of target_compression_ratio.
Compresr exposes two query-specific compression models on the public API: latte_v1 and latte_v2 (up to 5x faster than latte_v1 at the same compression quality). Both consume a context plus a query and return only the spans of context that carry signal for the query. This page is the canonical reference for both models' parameters and for target_compression_ratio. Every endpoint and SDK that accepts a compression ratio follows the value semantics defined below — no other page redefines the bounds.
Both models are exposed on the same endpoints:
- `POST /compress/question-specific/`: single compression
- `POST /compress/question-specific/stream`: SSE stream
- `POST /compress/question-specific/batch`: up to 100 rows per call
Swapping between them is a single string change to compression_model_name.
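Because the batch endpoint accepts at most 100 rows per call, a larger workload has to be split on the client. The following is a minimal sketch of that split, assuming the rows are held in a plain Python list. The helper name `chunk_rows` is illustrative and not part of either SDK, and the shape of each row is left open because this page does not define the batch request body.

```python
from typing import Iterator, List, Sequence, TypeVar

T = TypeVar("T")

MAX_BATCH_ROWS = 100  # per-call limit stated for the batch endpoint


def chunk_rows(rows: Sequence[T], size: int = MAX_BATCH_ROWS) -> Iterator[List[T]]:
    """Yield consecutive slices of `rows`, each holding at most `size` items."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(rows), size):
        yield list(rows[start:start + size])


# 250 rows need three calls to the batch endpoint.
batches = list(chunk_rows(list(range(250))))
print([len(b) for b in batches])  # [100, 100, 50]
```

Each yielded list can then be sent as one batch request.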
latte_v1
Query-specific compression. Scores spans of the context against the query and keeps only the spans that carry signal for it. Tokens that don't help answer the query are dropped. Supports the structural knobs coarse, heuristic_chunking, and disable_placeholders.
Surfaced as compression_model_name="latte_v1" (Python) / compressionModelName: 'latte_v1' (TypeScript) in both official SDKs.
Parameters (latte_v1)
These are the parameters accepted by latte_v1 across the Python SDK, the TypeScript SDK, and the raw HTTP API. The wire format is always snake_case; the TypeScript SDK accepts the camelCase form shown via tsName.
| Parameter | Type | Required | Default | Description |
|---|---|---|---|---|
| context | string | Required | – | – |
| query | string | Required | – | latte_v1 keeps only spans of context that help answer this query, so it cannot be empty. |
| compression_model_name | "latte_v1" | Required | – | Set to "latte_v1" to route the call to GemFilter. Any unknown value is rejected with 422 Unprocessable Entity. |
| target_compression_ratio | number | Optional | model default | Removal strength when 0 < r ≤ 1 and an Nx target when r > 1. See target_compression_ratio below for the canonical bounds. Omit to let the model pick a ratio appropriate for the input. |
| coarse | boolean | Optional | true | latte_v1 only. Paragraph-level scoring (the default). Faster and cheaper than the token-level pass. Set to false to opt into token-level precision at the cost of latency. |
| heuristic_chunking | boolean | Optional | false | latte_v1 only. Use a structure-aware splitter (paragraphs, code blocks, markdown sections) instead of the default fixed-size chunker. Helps when input has strong structural boundaries. |
| disable_placeholders | boolean | Optional | false | latte_v1 only. Skip the [...] placeholders the model normally inserts where content was dropped. Useful when you want the output to read as continuous prose. |

latte_v2
Query-specific compression. Same input contract as latte_v1, up to 5x faster at the same compression quality. A single relevance pass per request — no structural knobs.
Surfaced as compression_model_name="latte_v2" (Python) / compressionModelName: 'latte_v2' (TypeScript) on the same endpoints as latte_v1.
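Since latte_v2 has no structural knobs, a payload builder shared by both models has to decide what to do with coarse, heuristic_chunking, and disable_placeholders. The sketch below refuses them client-side for any model other than latte_v1. That refusal is an assumption made for illustration: this page does not say how the server treats those fields when they are sent with latte_v2. `build_payload` is a hypothetical helper, not an SDK function.

```python
from typing import Any, Dict, Optional

# Knobs this page documents as latte_v1 only.
LATTE_V1_ONLY_KNOBS = {"coarse", "heuristic_chunking", "disable_placeholders"}


def build_payload(model: str, context: str, query: str,
                  target_compression_ratio: Optional[float] = None,
                  **knobs: Any) -> Dict[str, Any]:
    """Assemble a snake_case request body (hypothetical helper, not an SDK call)."""
    if not query:
        raise ValueError("query cannot be empty")
    unknown = set(knobs) - LATTE_V1_ONLY_KNOBS
    if unknown:
        raise TypeError(f"unknown parameters: {sorted(unknown)}")
    if knobs and model != "latte_v1":
        # Assumption: fail early instead of relying on undocumented server behaviour.
        raise ValueError(f"{sorted(knobs)} are latte_v1 only")
    payload: Dict[str, Any] = {
        "context": context,
        "query": query,
        "compression_model_name": model,
    }
    if target_compression_ratio is not None:
        # Omitting the field lets the model pick its default ratio.
        payload["target_compression_ratio"] = target_compression_ratio
    payload.update(knobs)
    return payload


v1 = build_payload("latte_v1", "long context", "what changed?", coarse=False)
v2 = build_payload("latte_v2", "long context", "what changed?")
print(sorted(v1))
print(sorted(v2))
```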
Parameters (latte_v2)
| Parameter | Type | Required | Default | Description |
|---|---|---|---|---|
| context | string | Required | – | – |
| query | string | Required | – | latte_v2 keeps only spans that score above its relevance threshold against this query. |
| compression_model_name | "latte_v2" | Required | – | Set to "latte_v2" to route the call to latte_v2. |
| target_compression_ratio | number | Optional | model default | Removal strength when 0 < r ≤ 1, Nx target when r > 1. See target_compression_ratio below for the canonical bounds. |

target_compression_ratio
target_compression_ratio controls how aggressive the compression is. It is interpreted two different ways depending on the value you pass. Every page in this documentation that mentions a ratio refers back to this table, and both models share these semantics.
| Value | Meaning | Example |
|---|---|---|
| 0 < r ≤ 1 | Removal strength | 0.5 removes ~50% of tokens |
| r > 1 | Nx target (max 200) | 4 → ~¼ original |
| omit | Model default | – |
Pick a single mental model and stick to it inside a project. The removal-strength form reads more naturally for "compress by X%"; the Nx form is more natural when you have a hard target token budget.
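For the token-budget case, the Nx value can be derived from the input size and the budget. The following is a minimal sketch, assuming you already have token counts from your own tokenizer. The helper `nx_ratio_for_budget` is hypothetical, and because the ratio is a target rather than a guarantee, the compressed output may land slightly above or below the budget.

```python
from typing import Optional

MAX_NX_TARGET = 200  # upper bound given in the table above


def nx_ratio_for_budget(original_tokens: int, budget_tokens: int) -> Optional[float]:
    """Return an Nx target that aims at `budget_tokens`, or None if no compression is needed."""
    if original_tokens < 0 or budget_tokens < 1:
        raise ValueError("token counts must be non-negative and the budget at least 1")
    if original_tokens <= budget_tokens:
        return None  # already within budget
    ratio = original_tokens / budget_tokens  # always > 1 here, so it reads as an Nx target
    if ratio > MAX_NX_TARGET:
        raise ValueError(f"budget needs {ratio:.0f}x, above the {MAX_NX_TARGET}x maximum")
    return ratio


print(nx_ratio_for_budget(8000, 2000))  # 4.0
```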
Bounds
r = 0 is rejected with 422 Unprocessable Entity. Values above 200 are rejected with the same status; the API does not silently clamp. Omitting the field lets the model pick a ratio appropriate for the input.
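Mirroring these bounds on the client avoids a round trip that would end in a 422. The following is a minimal sketch with a hypothetical helper, `check_ratio`. Negative values and NaN are also rejected here, which is an assumption: the text above only names 0 and values above 200.

```python
from typing import Optional


def check_ratio(r: Optional[float]) -> Optional[float]:
    """Return `r` unchanged if it is within the documented bounds, else raise ValueError."""
    if r is None:
        return None  # omitted: the model picks its default
    # A NaN fails both comparisons, so it is rejected as well.
    if not (0 < r <= 200):
        raise ValueError(f"target_compression_ratio must satisfy 0 < r <= 200, got {r!r}")
    return r


check_ratio(0.5)   # removal strength
check_ratio(4)     # Nx target
check_ratio(None)  # model default
```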
Not a keep-fraction
target_compression_ratio is removal strength (when 0 < r ≤ 1) or an Nx target (when r > 1) — never a keep-fraction. 0.3 does not mean "keep 30%"; it means "remove ~30%". Keep-fraction is a benchmark-wrapper convention used elsewhere in the ecosystem; the SDK's surface does not follow it.
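If an upstream tool hands you a keep-fraction, it has to be inverted before it is sent. The following is a minimal sketch of that conversion. The helper name is hypothetical, and a keep-fraction of exactly 1 is rejected because it would map to r = 0, which the API refuses.

```python
def removal_strength_from_keep_fraction(keep: float) -> float:
    """Convert "keep this fraction" into the removal-strength form (0 < r <= 1)."""
    if not (0 <= keep < 1):
        raise ValueError("keep-fraction must satisfy 0 <= keep < 1")
    return 1.0 - keep


# "Keep 25%" becomes "remove ~75%".
print(removal_strength_from_keep_fraction(0.25))  # 0.75
```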
Examples
The only thing that changes between latte_v1 and latte_v2 is the compression_model_name string.
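As an illustration, the sketch below builds the snake_case request body for each model and confirms that the model name is the only field that differs. It stops at the body on purpose: the base URL, authentication, and SDK client calls are not covered on this page, so none are shown. The context and query strings are made-up sample data.

```python
import json

# Fields shared by both requests (sample values, not real data).
shared = {
    "context": "Q3 revenue rose 12%. Headcount was flat. The office moved in August.",
    "query": "How did revenue change in Q3?",
    "target_compression_ratio": 0.5,  # removal strength: remove ~50% of tokens
}

v1_body = {**shared, "compression_model_name": "latte_v1"}
v2_body = {**shared, "compression_model_name": "latte_v2"}

# List the fields whose values differ between the two bodies.
changed = sorted(k for k in v1_body if v1_body[k] != v2_body[k])
print(changed)  # ['compression_model_name']

# The JSON below is what would go on the wire for latte_v2.
print(json.dumps(v2_body, indent=2))
```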
When to use these models
Both models shine on query-shaped tasks — workloads where you can name the intent the compressed output has to serve:
- RAG: trim retrieved chunks to the spans that actually answer the user's question before sending them to the LLM.
- Agent retrieval: shrink tool descriptions, observations, and intermediate steps against the current agent goal.
- Search-result trimming: collapse a list of long results to the parts relevant to the search query.
For deeper patterns and end-to-end examples, see the query-specific compression guide.