Why Rate Limit
Rate limiting protects your API from abuse, prevents resource exhaustion, and ensures fair access for all clients. Without rate limiting, a single misbehaving client can overwhelm your servers, degrade performance for all users, and increase infrastructure costs. At Nexis Limited, all public APIs and SaaS product endpoints implement rate limiting.
Rate Limiting Algorithms
Token Bucket
A bucket holds tokens (up to a maximum capacity). Each request consumes one token. Tokens are added at a fixed rate (e.g., 10 tokens per second). If the bucket is empty, the request is rejected. This algorithm allows bursts (up to the bucket capacity) while enforcing an average rate. It is the most commonly used algorithm and our default choice.
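The algorithm above can be sketched in a few lines. This is a minimal single-process illustration, not our production implementation; the class and parameter names are chosen for clarity here.

```python
import time

class TokenBucket:
    """Token bucket: capacity caps bursts, refill_rate enforces the average rate."""

    def __init__(self, capacity: float, refill_rate: float):
        self.capacity = capacity            # maximum tokens (burst size)
        self.refill_rate = refill_rate      # tokens added per second
        self.tokens = capacity              # start full
        self.last_refill = time.monotonic()

    def allow(self) -> bool:
        now = time.monotonic()
        # Credit tokens for elapsed time, never exceeding capacity.
        elapsed = now - self.last_refill
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_rate)
        self.last_refill = now
        if self.tokens >= 1:
            self.tokens -= 1    # each request consumes one token
            return True
        return False            # bucket empty: reject
```

A bucket created with `TokenBucket(5, 1)` permits a burst of 5 requests immediately, then settles to one request per second on average.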
Sliding Window
Track request counts within a rolling time window: if a client has made more than N requests in the last M seconds, reject additional requests. More precise than fixed windows (which can allow double the rate at window boundaries), though the log-based variant stores a timestamp per request rather than a single counter.
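The log-based variant of the sliding window can be sketched with a deque of timestamps. A minimal single-process illustration; names are chosen for clarity here, not taken from any library.

```python
import time
from collections import deque

class SlidingWindowLimiter:
    """Sliding window log: allow at most max_requests in any rolling window."""

    def __init__(self, max_requests: int, window_seconds: float):
        self.max_requests = max_requests
        self.window = window_seconds
        self.timestamps = deque()   # one entry per accepted request

    def allow(self) -> bool:
        now = time.monotonic()
        # Evict timestamps that have fallen out of the rolling window.
        while self.timestamps and self.timestamps[0] <= now - self.window:
            self.timestamps.popleft()
        if len(self.timestamps) < self.max_requests:
            self.timestamps.append(now)
            return True
        return False
```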
Fixed Window
Count requests in fixed time intervals (per minute, per hour). Simple to implement but allows bursts at window boundaries — a client can make N requests at the end of one window and N more at the start of the next, effectively doubling the rate momentarily.
Leaky Bucket
Requests enter a queue (the bucket) and are processed at a fixed rate. Excess requests overflow (are rejected). This smooths out bursts and produces a very consistent output rate. Used when the downstream system requires consistent throughput.
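A leaky bucket can be modeled as a bounded queue drained at a fixed rate. This sketch drains lazily on each check rather than with a background worker, which is one of several reasonable designs:

```python
import time
from collections import deque

class LeakyBucket:
    """Leaky bucket: bounded queue drained at leak_rate requests per second."""

    def __init__(self, capacity: int, leak_rate: float):
        self.capacity = capacity      # maximum queued requests
        self.leak_rate = leak_rate    # requests processed per second
        self.queue = deque()
        self.last_leak = time.monotonic()

    def _leak(self):
        now = time.monotonic()
        # Drain whole requests that would have been processed since last check.
        drained = int((now - self.last_leak) * self.leak_rate)
        if drained:
            for _ in range(min(drained, len(self.queue))):
                self.queue.popleft()
            self.last_leak = now

    def allow(self, request) -> bool:
        self._leak()
        if len(self.queue) < self.capacity:
            self.queue.append(request)
            return True
        return False    # bucket overflows: reject
```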
Per-Client Limits
Apply different rate limits based on client identity:
- Anonymous/unauthenticated: Low limits (e.g., 60 requests/minute per IP).
- Free tier: Moderate limits (e.g., 1000 requests/hour per API key).
- Paid tier: Higher limits based on plan (e.g., 10,000 requests/hour).
- Enterprise: Custom limits negotiated per contract.
Identify clients by API key, authentication token, or IP address. API key or token is preferred — IP-based limiting can affect multiple users behind a shared IP (corporate networks, mobile carriers).
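The tier lookup can be sketched as a small resolution function. The tier table below mirrors the example limits above but is illustrative; real limits would come from your plan or billing system, and `get_tier` is an assumed lookup, not a real API.

```python
from typing import Callable, Optional, Tuple

# Illustrative tier table: (max_requests, window_seconds).
TIER_LIMITS = {
    "anonymous": (60, 60),       # 60 requests/minute, keyed by IP
    "free": (1000, 3600),        # 1,000 requests/hour, keyed by API key
    "paid": (10_000, 3600),      # 10,000 requests/hour, keyed by API key
}

def resolve_limit(
    api_key: Optional[str],
    client_ip: str,
    get_tier: Callable[[str], str],   # assumed lookup: API key -> plan name
) -> Tuple[str, int, int]:
    """Return (counter_key, max_requests, window_seconds) for this client."""
    if api_key is None:
        # Unauthenticated: fall back to IP-based limiting.
        limit, window = TIER_LIMITS["anonymous"]
        return f"ip:{client_ip}", limit, window
    tier = get_tier(api_key)
    limit, window = TIER_LIMITS.get(tier, TIER_LIMITS["free"])
    return f"key:{api_key}", limit, window
```

Keying the counter by API key where available, and by IP only as a fallback, reflects the preference stated above.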
Response Headers
Communicate rate limit status to clients via standard response headers:
- X-RateLimit-Limit: Maximum requests allowed in the current window.
- X-RateLimit-Remaining: Requests remaining in the current window.
- X-RateLimit-Reset: Unix timestamp when the rate limit window resets.
- Retry-After: Seconds to wait before retrying (included in 429 responses).
Return HTTP 429 Too Many Requests when the limit is exceeded, with a clear response body explaining the limit and when the client can retry.
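A framework-agnostic sketch of building those headers and a 429 response. The function names and JSON body shape are illustrative, not a fixed contract:

```python
import json

def rate_limit_headers(limit: int, remaining: int, reset_epoch: int) -> dict:
    """Headers attached to every response, not just rejections."""
    return {
        "X-RateLimit-Limit": str(limit),
        "X-RateLimit-Remaining": str(max(0, remaining)),
        "X-RateLimit-Reset": str(reset_epoch),
    }

def too_many_requests(limit: int, reset_epoch: int, now_epoch: int):
    """Build a 429 with Retry-After and an explanatory JSON body."""
    retry_after = max(1, reset_epoch - now_epoch)
    headers = rate_limit_headers(limit, 0, reset_epoch)
    headers["Retry-After"] = str(retry_after)
    body = json.dumps({
        "error": "rate_limit_exceeded",
        "message": f"Limit of {limit} requests exceeded. Retry in {retry_after} seconds.",
    })
    return 429, headers, body
```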
Implementation Approaches
- Application middleware: Implement rate limiting in your application framework (Express.js middleware, Django middleware). Simple but only works within a single application instance — not suitable for horizontally scaled services.
- Redis-based: Use Redis as a centralized counter store. All application instances check the same Redis counters, ensuring consistent rate limiting across the cluster. This is our standard approach.
- API Gateway: Implement rate limiting at the API gateway level (Kong, AWS API Gateway, NGINX). Centralized, language-agnostic, and handles limiting before requests reach the application.
- CDN/Edge: Rate limit at the edge (Cloudflare, AWS WAF) for DDoS protection. First line of defense before traffic reaches your infrastructure.
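The Redis-based approach is commonly built on INCR with a TTL, implementing a fixed-window counter shared by all instances. The sketch below shows the check logic; a tiny in-memory stub stands in for the two Redis commands so the example is self-contained, but in production you would pass a real `redis.Redis` client, whose `incr` and `expire` methods have the same shape.

```python
import time

class FakeRedis:
    """In-memory stand-in for the two Redis commands this sketch needs."""
    def __init__(self):
        self.store = {}   # key -> (count, expires_at)

    def incr(self, key):
        count, expires = self.store.get(key, (0, None))
        if expires is not None and time.monotonic() >= expires:
            count, expires = 0, None   # window expired: reset
        self.store[key] = (count + 1, expires)
        return count + 1

    def expire(self, key, seconds):
        count, _ = self.store[key]
        self.store[key] = (count, time.monotonic() + seconds)

def check_rate_limit(redis, client_key: str, limit: int, window: int) -> bool:
    """Fixed-window counter: INCR a per-client key, set TTL on first hit."""
    bucket = int(time.time()) // window
    key = f"ratelimit:{client_key}:{bucket}"
    count = redis.incr(key)
    if count == 1:
        redis.expire(key, window)   # first request in the window sets the TTL
    return count <= limit
```

Because every instance increments the same key, the limit holds across a horizontally scaled cluster; the key's window-bucket suffix plus the TTL keep stale counters from accumulating.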
Graceful Handling
- Return informative 429 responses with Retry-After headers so clients know when to retry.
- When possible, queue critical requests for background processing rather than rejecting them outright.
- Implement client-side rate limiting in your SDKs to prevent clients from hitting limits in the first place.
- Alert on sustained rate limit hits — they may indicate a bug, abuse, or a client that needs a higher limit.
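Client-side rate limiting in an SDK can reuse the token bucket idea, but blocking until a request is permitted instead of rejecting. A minimal sketch, assuming the SDK calls `wait()` before each outbound request:

```python
import time
import threading

class ClientThrottle:
    """SDK-side token bucket: callers pace themselves instead of receiving 429s."""

    def __init__(self, rate_per_second: float, burst: int = 1):
        self.rate = rate_per_second
        self.capacity = burst
        self.tokens = float(burst)
        self.last = time.monotonic()
        self.lock = threading.Lock()   # safe for multi-threaded SDK use

    def wait(self):
        """Block until a request is permitted, then consume one token."""
        with self.lock:
            now = time.monotonic()
            self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
            self.last = now
            if self.tokens < 1:
                # Sleep exactly long enough to accrue the missing fraction.
                time.sleep((1 - self.tokens) / self.rate)
                self.tokens = 0.0
                self.last = time.monotonic()
            else:
                self.tokens -= 1
```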
Advanced Patterns
- Endpoint-specific limits: Apply different limits to different endpoints. A search endpoint may have lower limits than a CRUD endpoint. Expensive operations (report generation, data export) should have strict limits.
- Adaptive rate limiting: Dynamically adjust limits based on server health. Under high load, tighten limits to protect the system.
- Priority queues: Rate-limited requests from premium clients get priority over free-tier requests.
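Adaptive rate limiting can be as simple as scaling the per-client limit by a server-health signal. The thresholds below are illustrative, not tuned values:

```python
def adaptive_limit(base_limit: int, cpu_utilization: float) -> int:
    """Scale the per-client limit down as server load rises."""
    if cpu_utilization >= 0.9:
        return max(1, base_limit // 4)   # heavy load: shed aggressively
    if cpu_utilization >= 0.7:
        return max(1, base_limit // 2)   # elevated load: tighten
    return base_limit                    # healthy: full limit
```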
Conclusion
Rate limiting is essential infrastructure for any API. Use the token bucket algorithm for a good balance of burst tolerance and average rate enforcement. Implement rate limiting in Redis or at the API gateway for consistency across instances. Communicate limits clearly through response headers and handle violations gracefully.
Building production APIs? Our team implements robust rate limiting and API protection strategies.