LLM Cost Calculator and Token Budget Tools
Tools for calculating, monitoring, and capping LLM API costs to defend against unbounded consumption attacks.
License: MIT
Overview
Tokencost is an open-source Python library from AgentOps-AI that provides unified token counting and cost calculation across 400+ LLMs from OpenAI, Anthropic, Cohere, Google, and dozens of other providers. Rather than maintaining your own pricing tables that go stale when providers update their rates, you can rely on tokencost to fetch and cache current pricing data, which keeps cost estimates up to date for any supported model.
For security practitioners, tokencost is a foundational component in building defenses against unbounded consumption attacks — you cannot enforce a token budget without first knowing how to count tokens accurately and attribute costs to specific models and users.
Installation
pip install tokencost
Tokencost has minimal dependencies (tiktoken, requests) and works in Python 3.8+.
Core Capabilities
Token Counting
Tokencost provides model-aware token counting that accounts for the different tokenizers used by different model families:
from tokencost import count_string_tokens, count_message_tokens

# Count tokens in a plain string for a specific model
token_count = count_string_tokens(
    string="Hello, how can I help you today?",
    model="gpt-4"
)
print(f"Token count: {token_count}")  # 9

# Count tokens in a full chat messages list (includes role overhead)
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What is the capital of France?"}
]
prompt_token_count = count_message_tokens(messages, model="gpt-4")
print(f"Prompt tokens: {prompt_token_count}")  # ~20

Cost Calculation
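Cost calculation comes down to multiplying a token count by a per-token price, with separate prices for input and output. The sketch below illustrates that arithmetic using only the standard library. The `HYPOTHETICAL_PRICES` table and the `estimate_cost` helper are assumptions introduced for illustration, not part of tokencost, and the prices are made-up round numbers rather than real provider rates.

```python
from decimal import Decimal

# Hypothetical per-token prices in USD (illustrative only, not real rates).
HYPOTHETICAL_PRICES = {
    "example-large": {"input": Decimal("0.00003"), "output": Decimal("0.00006")},
    "example-small": {"input": Decimal("0.0000005"), "output": Decimal("0.0000015")},
}


def estimate_cost(model: str, input_tokens: int, output_tokens: int) -> Decimal:
    """Return input_tokens * input price + output_tokens * output price."""
    prices = HYPOTHETICAL_PRICES[model]
    return prices["input"] * input_tokens + prices["output"] * output_tokens


cost = estimate_cost("example-large", input_tokens=1000, output_tokens=500)
print(f"Estimated cost: ${cost}")
```

`Decimal` avoids the rounding drift that floats accumulate when many small per-request costs are summed. With tokencost, the price lookup and token counting are handled by the library: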
from tokencost import calculate_prompt_cost, calculate_completion_cost

# Calculate the cost of the prompt (input tokens)
prompt_cost = calculate_prompt_cost(
    prompt="Explain quantum entanglement in detail.",
    model="gpt-4"
)

# Calculate the cost of a completion (output tokens)
completion_cost = calculate_completion_cost(
    completion="Quantum entanglement is a phenomenon...",
    model="gpt-4"
)

total_cost = prompt_cost + completion_cost
print(f"Total call cost: ${float(total_cost):.6f}")

Multi-Model Cost Comparison
A security-relevant use case: before selecting a model for a high-volume deployment, compare cost profiles across candidates to understand the economic exposure of a flooding attack:
from tokencost import calculate_prompt_cost, calculate_completion_cost

CANDIDATE_MODELS = [
    "gpt-4",
    "gpt-4-turbo",
    "gpt-3.5-turbo",
    "claude-3-opus-20240229",
    "claude-3-haiku-20240307",
]

# Simulate a sponge example: short prompt, maximum output
sponge_prompt = "List all 50 US states with capitals, populations, and industries."
sponge_output = "Alabama: Capital: Montgomery, Population: 5.1M..." * 100  # ~2000 tokens

print(f"{'Model':<35} {'Input Cost':>12} {'Output Cost':>12} {'Attack Ratio':>14}")
print("-" * 75)
for model in CANDIDATE_MODELS:
    try:
        prompt_cost = float(calculate_prompt_cost(sponge_prompt, model))
        completion_cost = float(calculate_completion_cost(sponge_output, model))
        ratio = (completion_cost / prompt_cost) if prompt_cost > 0 else 0
        print(f"{model:<35} ${prompt_cost:>10.6f} ${completion_cost:>10.6f} {ratio:>13.1f}x")
    except Exception:
        # Model missing from the pricing table or otherwise unsupported
        print(f"{model:<35} {'N/A':>12} {'N/A':>12} {'N/A':>14}")

Building a Cost-Aware Request Layer
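Budget enforcement in this section rests on a worst-case bound: a request can cost at most its prompt tokens at the input price plus the output token cap at the output price. A minimal standard-library sketch of that bound follows. The prices and the `worst_case_cost` and `requests_within_budget` helpers are hypothetical, introduced only for illustration.

```python
from decimal import Decimal


def worst_case_cost(
    prompt_tokens: int,
    max_output_tokens: int,
    input_price: Decimal,
    output_price: Decimal,
) -> Decimal:
    """Upper bound on one request's cost, assuming the output cap is hit."""
    return prompt_tokens * input_price + max_output_tokens * output_price


def requests_within_budget(daily_limit: Decimal, per_request_bound: Decimal) -> int:
    """How many worst-case requests fit inside a daily budget."""
    if per_request_bound <= 0:
        raise ValueError("per_request_bound must be positive")
    return int(daily_limit // per_request_bound)


# Hypothetical prices in USD per token (illustrative only).
bound = worst_case_cost(
    prompt_tokens=20,
    max_output_tokens=500,
    input_price=Decimal("0.00003"),
    output_price=Decimal("0.00006"),
)
print(f"Worst-case cost per request: ${bound}")
print(
    "Requests that fit in a $1.00 daily budget: "
    f"{requests_within_budget(Decimal('1.00'), bound)}"
)
```

Because the bound scales with `max_output_tokens`, lowering that cap is the most direct way to shrink what a single sponge request can cost.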
The following reference wrapper integrates tokencost with per-user budget enforcement. It keeps budgets in process memory, so a multi-process deployment would need a shared store instead:
from __future__ import annotations

import logging
from dataclasses import dataclass, field
from datetime import datetime, timedelta, timezone
from decimal import Decimal
from threading import Lock
from typing import Any

from openai import OpenAI
from tokencost import calculate_prompt_cost, calculate_completion_cost
from tokencost.constants import TOKEN_COSTS

logger = logging.getLogger(__name__)


def _utcnow() -> datetime:
    """Timezone-aware replacement for the deprecated datetime.utcnow()."""
    return datetime.now(timezone.utc)


class BudgetExceededError(Exception):
    """Raised when a request would exceed the user's cost budget."""


@dataclass
class UserCostBudget:
    """Per-user cost budget with a daily window that resets after 24 hours."""

    user_id: str
    daily_limit_usd: float = 1.00
    per_request_limit_usd: float = 0.10
    _lock: Lock = field(default_factory=Lock, repr=False)
    _daily_spent: Decimal = field(default=Decimal("0"), repr=False)
    _day_start: datetime = field(default_factory=_utcnow, repr=False)

    def _reset_if_new_day(self) -> None:
        now = _utcnow()
        if now >= self._day_start + timedelta(days=1):
            self._daily_spent = Decimal("0")
            self._day_start = now
            logger.info("Daily budget reset for user %s", self.user_id)

    def check_and_reserve(self, estimated_cost: float) -> None:
        """
        Verify the estimated cost fits within both per-request and
        daily limits, then reserve it against the daily budget so that
        concurrent requests cannot all pass the same check.
        Raises BudgetExceededError if either limit would be exceeded.
        """
        with self._lock:
            self._reset_if_new_day()
            est = Decimal(str(estimated_cost))
            daily_remaining = Decimal(str(self.daily_limit_usd)) - self._daily_spent
            if est > Decimal(str(self.per_request_limit_usd)):
                raise BudgetExceededError(
                    f"Estimated request cost ${float(est):.4f} exceeds "
                    f"per-request limit ${self.per_request_limit_usd:.4f}."
                )
            if est > daily_remaining:
                raise BudgetExceededError(
                    f"Estimated request cost ${float(est):.4f} would exceed "
                    f"daily budget. Remaining today: ${float(daily_remaining):.4f}."
                )
            # Hold the estimate until the actual cost is known
            self._daily_spent += est

    def record_actual_cost(self, actual_cost: float, reserved_cost: float = 0.0) -> None:
        """Swap a reservation for the actual cost once the API call finishes."""
        with self._lock:
            delta = Decimal(str(actual_cost)) - Decimal(str(reserved_cost))
            # Clamp at zero in case the daily window reset mid-request
            self._daily_spent = max(self._daily_spent + delta, Decimal("0"))
            logger.info(
                "User %s: spent $%.6f this request, $%.4f today of $%.2f daily limit.",
                self.user_id,
                actual_cost,
                float(self._daily_spent),
                self.daily_limit_usd,
            )

    @property
    def daily_spent(self) -> float:
        return float(self._daily_spent)


class CostAwareLLMClient:
    """
    OpenAI client wrapper with per-user cost budgeting and
    automatic token cap enforcement.
    """

    def __init__(
        self,
        model: str = "gpt-4",
        max_tokens_per_request: int = 500,
    ) -> None:
        self.model = model
        self.max_tokens_per_request = max_tokens_per_request
        self._client = OpenAI()
        self._budgets: dict[str, UserCostBudget] = {}

    def get_budget(self, user_id: str) -> UserCostBudget:
        if user_id not in self._budgets:
            self._budgets[user_id] = UserCostBudget(user_id=user_id)
        return self._budgets[user_id]

    def complete(
        self,
        user_id: str,
        messages: list[dict[str, Any]],
    ) -> str:
        """
        Execute a chat completion with budget enforcement.
        Raises BudgetExceededError before making the API call if the
        estimated cost exceeds the user's limits.
        """
        budget = self.get_budget(user_id)

        # Estimate the worst-case cost before calling the API.
        # Pass the messages list directly: a string of N spaces does not
        # tokenize to N tokens, so it cannot stand in for a token count.
        prompt_cost = float(calculate_prompt_cost(messages, self.model))
        # Price the output cap from the per-token rate. This assumes the
        # model has an "output_cost_per_token" entry in TOKEN_COSTS.
        output_rate = Decimal(
            str(TOKEN_COSTS[self.model.lower()]["output_cost_per_token"])
        )
        max_output_cost = float(output_rate * self.max_tokens_per_request)
        estimated_total = prompt_cost + max_output_cost
        budget.check_and_reserve(estimated_total)

        # Make the API call with a hard token cap
        try:
            response = self._client.chat.completions.create(
                model=self.model,
                messages=messages,
                max_tokens=self.max_tokens_per_request,
            )
        except Exception:
            # Release the reservation if the call did not complete
            budget.record_actual_cost(0.0, reserved_cost=estimated_total)
            raise

        # Record actual cost in place of the reservation
        actual_output = response.choices[0].message.content or ""
        actual_cost = prompt_cost + float(
            calculate_completion_cost(actual_output, self.model)
        )
        budget.record_actual_cost(actual_cost, reserved_cost=estimated_total)
        return actual_output

Monitoring and Alerting
Integrate cost monitoring with your observability stack:
# Example: Prometheus metrics for token consumption monitoring
from prometheus_client import Counter, Histogram, Gauge

TOKEN_USAGE = Counter(
    "llm_tokens_total",
    "Total tokens consumed",
    ["user_id", "model", "token_type"]  # token_type: input | output
)
REQUEST_COST = Histogram(
    "llm_request_cost_usd",
    "Cost per LLM request in USD",
    ["user_id", "model"],
    buckets=[0.001, 0.005, 0.01, 0.05, 0.10, 0.50, 1.00, 5.00]
)
DAILY_SPEND = Gauge(
    "llm_user_daily_spend_usd",
    "Current daily spend per user",
    ["user_id"]
)

Set alert rules in Grafana or your alerting system:
# Example Prometheus alerting rules
groups:
  - name: llm_cost_alerts
    rules:
      - alert: UserApproachingDailyBudget
        # 0.80 is 80% of the default $1.00 daily limit
        expr: llm_user_daily_spend_usd > 0.80
        for: 1m
        labels:
          severity: warning
        annotations:
          summary: "User {{ $labels.user_id }} at 80% of daily LLM budget"
      - alert: AnomalousTokenConsumption
        # rate() is per second, so multiply by 60 to compare against tokens/min
        expr: rate(llm_tokens_total{token_type="output"}[5m]) * 60 > 1000
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "Output token rate exceeds 1000 tokens/min (possible flooding attack)"

Info
Cost management is both an economic and a security control. A hard monthly budget cap at the API provider level is your last line of defense. Implement it alongside application-level controls — never rely on application controls alone.
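As one illustration of an application-level control that can sit beneath a provider-level cap, the sketch below implements an application-wide spend breaker using only the standard library. The `GlobalSpendBreaker` class, the `SpendLimitReached` exception, and the limit value are hypothetical. State is held in process memory, so a real deployment would likely need shared storage and a reset at the start of each billing period, both omitted here for brevity.

```python
from decimal import Decimal
from threading import Lock


class SpendLimitReached(Exception):
    """Raised when the application-wide spend limit has been reached."""


class GlobalSpendBreaker:
    """In-memory, application-wide spend cap (hypothetical helper)."""

    def __init__(self, monthly_limit_usd: str) -> None:
        self._limit = Decimal(monthly_limit_usd)
        self._spent = Decimal("0")
        self._lock = Lock()

    def charge(self, cost_usd: str) -> Decimal:
        """Add a cost to the running total and return the remaining budget.

        Refuses any charge that would push the total past the limit.
        """
        cost = Decimal(cost_usd)
        with self._lock:
            if self._spent + cost > self._limit:
                raise SpendLimitReached(
                    f"Charge of ${cost} would exceed the ${self._limit} limit"
                )
            self._spent += cost
            return self._limit - self._spent


breaker = GlobalSpendBreaker(monthly_limit_usd="100.00")
remaining = breaker.charge("0.25")
print(f"Remaining this month: ${remaining}")
```

Checking this breaker before the per-user budget gives a second, independent ceiling if per-user accounting is bypassed, for example by an attacker who rotates user IDs.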