AI Token Limits by Model: Per-Team Spend Control

Your AI spend is probably a black box

Most teams using OpenAI, Anthropic, or similar vendors have some visibility into total spend from the vendor’s billing dashboard. What they rarely have is a breakdown by model. Which team is burning through o3 calls? How many tokens did gpt-4o-mini consume this week versus gpt-4o? Is anyone quietly calling o1 in a workflow that only needs gpt-4o-mini?

The vendor tells you the bill. It doesn’t tell you which model, which team, or which workflow drove it.

A gateway proxy sitting between your team and the AI vendor can give you that breakdown — and enforce per-model limits — without requiring changes to any application code.

One proxy, multiple models

The architecture is simpler than it sounds. You create one proxy pointing at the AI vendor’s base URL. Your team’s agents and applications call that proxy with a scoped credential. The proxy forwards requests to the vendor with the real API key injected.

Because every request from every application passes through this single proxy, you have a single place to add meters — and you can stack as many meters as you need, each targeting a different model.

POST /clients/{clientId}/proxies
{
  "proxyName": "openai-team",
  "proxyRegion": "us-east-1",
  "proxyProxyCredentialId": "<team-proxy-credential-id>",
  "proxyTargetId": "<openai-target-id>",
  "proxyTargetCredentialId": "<openai-api-key-credential-id>",
  "proxyDefaultRuleEffect": "allow"
}

The proxy is intentionally permissive at the rule level here — the controls live in the meters, not in path-based allow rules. That said, you can always combine model-level meters with path-based rules if you also want to restrict which endpoints the team can call.

How conditional meters target specific models

A meter with meterMode: "conditional" only counts requests that match the predicates you specify. The body predicate checks against the inbound request body — which, for OpenAI-style APIs, is where the model field lives.

Setting effect: "include" means the meter counts only requests where the predicates match. Requests where the model field contains a different value pass through the meter uncounted.

This is the mechanism that lets you put multiple meters on one proxy and have each one independently count only its target model.

Meter 1: Token consumption for gpt-4o-mini

The cheapest model in the stable gets the most generous limit. A response_value meter extracts usage.total_tokens from the response body after each successful call:

POST /clients/{clientId}/proxies/{proxyId}/meters
{
  "meterType": "response_value",
  "meterActive": true,
  "meterMode": "conditional",
  "effect": "include",
  "methods": ["POST"],
  "body": [
    {
      "jsonPath": { "pattern": "^model$" },
      "matchValue": { "pattern": "^gpt-4o-mini" },
      "presence": "must_exist"
    }
  ],
  "extraction": {
    "location": "body",
    "path": "usage.total_tokens",
    "defaultValue": 0
  },
  "limits": {
    "day": 5000000,
    "month": 50000000
  },
  "notes": "Token consumption — gpt-4o-mini, daily and monthly caps"
}

The matchValue.pattern: "^gpt-4o-mini" matches any gpt-4o-mini variant (e.g. gpt-4o-mini-2024-07-18). When the daily limit of 5 million tokens is reached, the gateway returns a rate-limit response and the request is not forwarded to OpenAI.

Meter 2: Token consumption for gpt-4o

The full model gets a tighter daily cap. Same structure, different pattern and limits:

POST /clients/{clientId}/proxies/{proxyId}/meters
{
  "meterType": "response_value",
  "meterActive": true,
  "meterMode": "conditional",
  "effect": "include",
  "methods": ["POST"],
  "body": [
    {
      "jsonPath": { "pattern": "^model$" },
      "matchValue": { "pattern": "^gpt-4o(?!-mini)" },
      "presence": "must_exist"
    }
  ],
  "extraction": {
    "location": "body",
    "path": "usage.total_tokens",
    "defaultValue": 0
  },
  "limits": {
    "day": 1000000,
    "month": 10000000
  },
  "notes": "Token consumption — gpt-4o (excludes mini), daily and monthly caps"
}

The negative lookahead (?!-mini) matches gpt-4o and gpt-4o-2024-11-20 but not gpt-4o-mini — so this meter and the previous one don’t overlap.

Meter 3: Token consumption for o3 and o1

Reasoning models are expensive. Give them the tightest limits and add a per-hour cap to prevent burst usage from a single runaway workflow:

POST /clients/{clientId}/proxies/{proxyId}/meters
{
  "meterType": "response_value",
  "meterActive": true,
  "meterMode": "conditional",
  "effect": "include",
  "methods": ["POST"],
  "body": [
    {
      "jsonPath": { "pattern": "^model$" },
      "matchValue": { "pattern": "^(o1|o3)" },
      "presence": "must_exist"
    }
  ],
  "extraction": {
    "location": "body",
    "path": "usage.total_tokens",
    "defaultValue": 0
  },
  "limits": {
    "hour": 100000,
    "day": 500000,
    "month": 3000000
  },
  "notes": "Token consumption — o1 and o3 reasoning models, hourly + daily + monthly caps"
}

The hourly limit prevents a single multi-turn reasoning workflow from consuming the entire day’s budget in one session.

Adding request count meters alongside token meters

Token consumption tells you how much you used. Request count tells you how frequently. You can run both on the same proxy simultaneously — they accumulate independently.

A request count meter for all reasoning model calls (useful for detecting unexpectedly frequent calls even if each call is small):

POST /clients/{clientId}/proxies/{proxyId}/meters
{
  "meterType": "request_count",
  "meterActive": true,
  "meterMode": "conditional",
  "effect": "include",
  "methods": ["POST"],
  "body": [
    {
      "jsonPath": { "pattern": "^model$" },
      "matchValue": { "pattern": "^(o1|o3)" },
      "presence": "must_exist"
    }
  ],
  "limits": {
    "hour": 20,
    "day": 100
  },
  "notes": "Request count — o1 and o3 calls per hour and day"
}

With both meters active, you enforce two independent limits on reasoning model usage: requests per hour, and total tokens per month. A call that exceeds either limit is blocked.

Reading meter usage

Query the current state of any meter to see accumulated consumption against configured limits:

GET /clients/{clientId}/proxies/{proxyId}/meters/{meterId}/usage

The response shows each configured window:

{
  "meterId": "...",
  "usage": {
    "hour": {
      "current": 47823,
      "limit": 100000,
      "windowKey": "2026-06-03-14"
    },
    "day": {
      "current": 284910,
      "limit": 500000,
      "windowKey": "2026-06-03"
    },
    "month": {
      "current": 1842310,
      "limit": 3000000,
      "windowKey": "2026-06"
    }
  }
}

At 284,910 of 500,000 for the day, your reasoning model budget is 57% consumed. At 1.8M of 3M for the month, you have roughly a third of the month’s allowance left. Both are visible without opening the vendor’s billing console.

What this doesn’t cover

A few limits worth naming:

Token counting is vendor-dependent. This approach relies on the vendor returning usage.total_tokens in the response body. If the vendor doesn’t include usage data in streaming responses, the meter will fall back to defaultValue (set to 0 in the examples above) and won’t accumulate correctly for those requests. Check whether your vendor includes usage in streaming chunks or only in the final response.
Model name patterns need maintenance. Vendors add new model versions regularly. If OpenAI releases gpt-4o-2026-09-01, it will match the ^gpt-4o(?!-mini) pattern automatically — but if they introduce a new family like gpt-5, you’ll need a new meter for it.
These meters don’t enforce which model a caller is allowed to use. They limit consumption once a model is called. If you want to restrict which models are accessible at all, that requires an authorization rule with a body predicate that denies requests for disallowed model names — not meters.

Next steps

All the meters in this post can be added to any existing proxy without touching the proxy itself or the applications routing through it. Read the RequestRocket documentation to see the full meter schema, or start for free.

AI Token Limits by Model: Per-Team Spend Control

Your AI spend is probably a black box

One proxy, multiple models

How conditional meters target specific models

Meter 1: Token consumption for gpt-4o-mini

Meter 2: Token consumption for gpt-4o

Meter 3: Token consumption for o3 and o1

Adding request count meters alongside token meters

Reading meter usage

What this doesn’t cover

Next steps

Related posts

Local AI Agent Monitoring: Visibility Into API Usage

AI Agent Identity Isn't Enough: Enforce Access at Runtime

AI Agent API Access: From Authentication to Downscoping

Add outbound API security without changing code

Add outbound API security
without changing code