Compresr docs

Models

The Compresr compression model surface — `latte_v1` and `latte_v2`, shared parameters, and the canonical meaning of target_compression_ratio.

Compresr exposes two query-specific compression models on the public API: latte_v1 and latte_v2 (up to 5x faster than latte_v1 at the same compression quality). Both consume a context plus a query and return only the spans of context that carry signal for the query. This page is the canonical reference for both models' parameters and for target_compression_ratio. Every endpoint and SDK that accepts a compression ratio follows the value semantics defined below — no other page redefines the bounds.

Both models are exposed on the same endpoints, so swapping between them is a single string change to compression_model_name.

latte_v1

Query-specific compression. Scores spans of the context against the query and keeps only the spans that carry signal for it. Tokens that don't help answer the query are dropped. Supports the structural knobs coarse, heuristic_chunking, and disable_placeholders.

Surfaced as compression_model_name="latte_v1" (Python) / compressionModelName: 'latte_v1' (TypeScript) in both official SDKs.

Parameters (latte_v1)

These are the parameters accepted by latte_v1 across the Python SDK, the TypeScript SDK, and the raw HTTP API. The wire format is always snake_case; the TypeScript SDK accepts the equivalent camelCase form (for example, compressionModelName for compression_model_name).
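As a rough illustration of the naming convention above, the sketch below maps camelCase option names to snake_case wire keys. The helper names `to_wire_key` and `to_wire_payload` are hypothetical, and every camelCase spelling other than compressionModelName is an assumption derived from the same convention rather than something this page lists.

```python
import re


def to_wire_key(camel_name: str) -> str:
    """Convert a camelCase option name to a snake_case wire key."""
    # Put an underscore before each uppercase letter (except at the start),
    # then lowercase the whole string.
    return re.sub(r"(?<!^)([A-Z])", r"_\1", camel_name).lower()


def to_wire_payload(options: dict) -> dict:
    """Rename every key of a camelCase options dict to its wire form."""
    return {to_wire_key(key): value for key, value in options.items()}


# Hypothetical camelCase options; only compressionModelName is confirmed above.
camel_options = {
    "context": "Refunds are issued within 14 days. Orders ship in 2 days.",
    "query": "How long do refunds take?",
    "compressionModelName": "latte_v1",
    "targetCompressionRatio": 0.5,
    "heuristicChunking": True,
}

wire_payload = to_wire_payload(camel_options)
print(sorted(wire_payload))
```

This is not necessarily how the SDK performs the conversion; it only shows how the two spellings relate.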

context (string, required)
The source text you want compressed: RAG chunks, document body, chat history, tool output, or anything else you would otherwise pay tokens to send to the LLM. Passing an empty string returns an empty result with no billing.

query (string, required)
The user question (or intent) that grounds the relevance signal. latte_v1 keeps only spans of context that help answer this query, so it cannot be empty.

compression_model_name ("latte_v1", required)
Set to "latte_v1" to route the call to GemFilter. Any unknown value is rejected with 422 Unprocessable Entity.

target_compression_ratio (number, optional; default: model default)
Compression strength. Interpreted as a removal fraction when 0 < r ≤ 1 and as an Nx target when r > 1. See target_compression_ratio below for the canonical bounds. Omit to let the model pick a ratio appropriate for the input.

coarse (boolean, optional; default: true)
latte_v1 only. Paragraph-level scoring (the default). Faster and cheaper than the token-level pass. Set to false to opt into token-level precision at the cost of latency.

heuristic_chunking (boolean, optional; default: false)
latte_v1 only. Use a structure-aware splitter (paragraphs, code blocks, markdown sections) instead of the default fixed-size chunker. Helps when input has strong structural boundaries.

disable_placeholders (boolean, optional; default: false)
latte_v1 only. Skip the [...] placeholders the model normally inserts where content was dropped. Useful when you want the output to read as continuous prose.
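To make the parameter list concrete, here is a minimal sketch that assembles a latte_v1 request body as a plain dictionary using the snake_case wire names and defaults listed above. The function name `build_latte_v1_payload` is hypothetical, and the request is only built, not sent, because this page does not list the endpoint paths or the SDK client calls.

```python
import json
from typing import Optional


def build_latte_v1_payload(
    context: str,
    query: str,
    target_compression_ratio: Optional[float] = None,
    coarse: bool = True,                 # documented default
    heuristic_chunking: bool = False,    # documented default
    disable_placeholders: bool = False,  # documented default
) -> dict:
    """Assemble a snake_case request body for latte_v1 (illustrative only)."""
    if not query:
        # The query grounds the relevance signal, so it cannot be empty.
        raise ValueError("query must not be empty")
    payload = {
        "context": context,
        "query": query,
        "compression_model_name": "latte_v1",
        "coarse": coarse,
        "heuristic_chunking": heuristic_chunking,
        "disable_placeholders": disable_placeholders,
    }
    # Leave the ratio out entirely so the model can pick its own default.
    if target_compression_ratio is not None:
        payload["target_compression_ratio"] = target_compression_ratio
    return payload


payload = build_latte_v1_payload(
    context="## Refunds\nRefunds are issued within 14 days.\n\n## Shipping\nOrders ship in 2 days.",
    query="How long do refunds take?",
    coarse=False,             # opt into token-level precision
    heuristic_chunking=True,  # input has markdown section boundaries
)
print(json.dumps(payload, indent=2))
```

The example opts into token-level scoring and the structure-aware splitter because the sample context has clear markdown sections; whether that trade-off pays off on real inputs is something to measure.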

latte_v2

Query-specific compression. Same input contract as latte_v1, up to 5x faster at the same compression quality. A single relevance pass per request — no structural knobs.

Surfaced as compression_model_name="latte_v2" (Python) / compressionModelName: 'latte_v2' (TypeScript) on the same endpoints as latte_v1.

Parameters (latte_v2)

context (string, required)
The source text you want compressed. Empty string returns an empty result with no billing.

query (string, required)
The user question or intent. latte_v2 keeps only spans that score above its relevance threshold against this query.

compression_model_name ("latte_v2", required)
Set to "latte_v2" to route the call to latte_v2.

target_compression_ratio (number, optional; default: model default)
Compression strength. Removal fraction when 0 < r ≤ 1, Nx target when r > 1. See target_compression_ratio below for the canonical bounds.
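Because latte_v2 takes no structural knobs, a client may want to catch latte_v1-only parameters before sending a request. The sketch below is one possible client-side guard; `check_structural_knobs` is a hypothetical helper, and how the API itself reacts to these fields on latte_v2 is not stated on this page, so raising an error here is an assumption, not documented behavior.

```python
# Parameters this page marks as "latte_v1 only".
LATTE_V1_ONLY = {"coarse", "heuristic_chunking", "disable_placeholders"}


def check_structural_knobs(payload: dict) -> dict:
    """Raise if a latte_v2 payload carries latte_v1-only parameters."""
    extra = payload.keys() & LATTE_V1_ONLY
    if payload.get("compression_model_name") == "latte_v2" and extra:
        raise ValueError(
            f"latte_v1-only parameters passed to latte_v2: {sorted(extra)}"
        )
    return payload


v2_payload = check_structural_knobs({
    "context": "Refunds are issued within 14 days. Orders ship in 2 days.",
    "query": "How long do refunds take?",
    "compression_model_name": "latte_v2",
    "target_compression_ratio": 4,
})
print(v2_payload["compression_model_name"])
```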

target_compression_ratio

target_compression_ratio controls how aggressive the compression is. It is interpreted in one of two ways, depending on the value you pass. Every page in this documentation that mentions a ratio refers back to this table, and both models share these semantics.

| Value | Meaning | Example |
|---|---|---|
| 0 < r ≤ 1 | Removal strength | 0.5 removes ~50% of tokens |
| r > 1 | Nx target (max 200) | 4 → ~¼ original |
| omit | Model default | |

Pick a single mental model and stick to it inside a project. The removal-strength form reads more naturally for "compress by X%"; the Nx form is more natural when you have a hard target token budget.
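The two mental models can be sketched as a pair of small helpers: one for "compress by X%" and one for a hard token budget. Both function names are hypothetical, token counts are assumed to come from whatever tokenizer your application already uses, and the targets are approximate, so the compressed output may still land slightly above the budget.

```python
from typing import Optional

MAX_NX = 200  # documented upper bound for the Nx form


def ratio_for_percent(percent_removed: float) -> float:
    """Removal-strength form: 'compress by X%' becomes X / 100."""
    if not 0 < percent_removed <= 100:
        raise ValueError("percent_removed must be in (0, 100]")
    return percent_removed / 100


def ratio_for_budget(input_tokens: int, budget_tokens: int) -> Optional[float]:
    """Nx form: the factor that shrinks input_tokens to about budget_tokens."""
    if budget_tokens <= 0:
        raise ValueError("budget_tokens must be positive")
    if input_tokens <= budget_tokens:
        return None  # already within budget, nothing to compress
    nx = input_tokens / budget_tokens  # always > 1 at this point
    if nx > MAX_NX:
        raise ValueError(f"would need {nx:.0f}x, above the {MAX_NX}x maximum")
    return nx


print(ratio_for_percent(50))          # 0.5
print(ratio_for_budget(12000, 3000))  # 4.0
```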

Bounds

r = 0 is rejected with 422 Unprocessable Entity. Values above 200 are rejected with the same status; the API does not silently clamp. Omitting the field lets the model pick a ratio appropriate for the input.
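A client can mirror these bounds locally to fail fast instead of waiting for a 422 response. The following is a minimal sketch; `validate_ratio` is a hypothetical helper, and treating negative and NaN values as invalid is an assumption, since this page only specifies the behavior for r = 0 and r > 200.

```python
import math
from typing import Optional


def validate_ratio(r: Optional[float]) -> Optional[float]:
    """Check target_compression_ratio against the documented bounds."""
    if r is None:
        return None  # omitted: the model picks a ratio for the input
    if isinstance(r, bool) or not isinstance(r, (int, float)):
        raise TypeError("target_compression_ratio must be a number")
    if math.isnan(r) or r <= 0:
        # r = 0 is documented as rejected; negatives and NaN are assumed invalid.
        raise ValueError("target_compression_ratio must be greater than 0")
    if r > 200:
        # The API rejects rather than clamps, so this check does the same.
        raise ValueError("target_compression_ratio must not exceed 200")
    return r


for candidate in (None, 0.5, 4, 200):
    print(candidate, "->", validate_ratio(candidate))
```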

Not a keep-fraction

target_compression_ratio is removal strength (when 0 < r ≤ 1) or an Nx target (when r > 1) — never a keep-fraction. 0.3 does not mean "keep 30%"; it means "remove ~30%". Keep-fraction is a benchmark-wrapper convention used elsewhere in the ecosystem; the SDK's surface does not follow it.
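If you are porting settings from a tool that uses keep-fractions, the conversion can be sketched as below. Both helper names are hypothetical, and the kept fraction is only the approximate share implied by the semantics above, not a guarantee about the actual output length.

```python
def removal_strength_from_keep_fraction(keep: float) -> float:
    """Convert 'keep this share of tokens' into the removal-strength form."""
    if not 0 <= keep < 1:
        # keep = 1 would map to r = 0, which the API rejects.
        raise ValueError("keep must be in [0, 1)")
    return 1 - keep


def approx_kept_fraction(r: float) -> float:
    """Approximate share of tokens left for a given target_compression_ratio."""
    if r <= 0 or r > 200:
        raise ValueError("r must be in (0, 200]")
    # 0 < r <= 1: r is the removed share; r > 1: an Nx target keeps about 1/N.
    return 1 - r if r <= 1 else 1 / r


# "Keep 30%" in a keep-fraction tool corresponds to removing about 70% here.
r = removal_strength_from_keep_fraction(0.3)
print(round(r, 2), round(approx_kept_fraction(r), 2))
```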

Examples

The only thing that changes between latte_v1 and latte_v2 is the compression_model_name string.
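A minimal sketch of that swap, under stated assumptions: the request bodies are built as plain dictionaries with the wire-format field names from this page and compared, rather than sent, because the endpoint paths and SDK client calls are not shown here. `build_payload` is a hypothetical helper.

```python
import json
from typing import Optional


def build_payload(
    model_name: str,
    context: str,
    query: str,
    target_compression_ratio: Optional[float] = None,
) -> dict:
    """Assemble a snake_case request body shared by both models."""
    payload = {
        "context": context,
        "query": query,
        "compression_model_name": model_name,
    }
    if target_compression_ratio is not None:
        payload["target_compression_ratio"] = target_compression_ratio
    return payload


shared = {
    "context": "Refunds are issued within 14 days. Orders ship in 2 days.",
    "query": "How long do refunds take?",
    "target_compression_ratio": 0.5,  # remove ~50% of tokens
}

v1_payload = build_payload("latte_v1", **shared)
v2_payload = build_payload("latte_v2", **shared)

# Collect the fields whose values differ between the two requests.
changed = {key for key in v1_payload if v1_payload[key] != v2_payload[key]}
print(changed)  # {'compression_model_name'}
print(json.dumps(v2_payload, indent=2))
```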

When to use these models

Both models shine on query-shaped tasks — workloads where you can name the intent the compressed output has to serve:

  • RAG: trim retrieved chunks to the spans that actually answer the user's question before sending them to the LLM.
  • Agent retrieval: shrink tool descriptions, observations, and intermediate steps against the current agent goal.
  • Search-result trimming: collapse a list of long results to the parts relevant to the search query.

For deeper patterns and end-to-end examples, see the query-specific compression guide.