Integration | July 10, 2025 | 7 min read

API Integration Patterns for Enterprise Platforms

Enterprise API integrations fail in predictable ways. Most of the failures have nothing to do with the API itself.

APIs, Integration, Node.js, Python, Architecture

The Failure Modes Are Predictable

I've built enough enterprise API integrations to recognize the patterns. The initial build works. The first few runs work. Then something breaks: a token expires without refreshing, a rate limit gets hit during a large sync, or an API returns a 200 with an error in the body. The integration fails in a way that takes the team two hours to diagnose.

None of these are exotic problems. They're predictable. Here's how to build around them.

Token Caching Gets Implemented Wrong Almost Every Time

OAuth2 client credentials is the right default for server-to-server integrations. The typical implementation fetches a new token on every request. That works in testing, so nobody fixes it until the token endpoint starts rate limiting and returning 429s.

Cache the token. Refresh it before it expires, not after. The expires_in value in the response tells you how long it's valid. Subtract 60 seconds to give yourself a buffer and refresh before you hit the wall. Log token refresh events. If you're refreshing unexpectedly often, something upstream is wrong.

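import time

import requests


class TokenManager:
    """Caches an OAuth2 access token and refreshes it shortly before expiry."""

    def __init__(self):
        self._token = None
        self._expires_at = 0  # Unix timestamp; 0 forces a fetch on first use

    def get_token(self):
        # Reuse the cached token until 60 seconds before it expires
        if self._token and time.time() < self._expires_at - 60:
            return self._token
        # Without a timeout, a hung token endpoint blocks the caller indefinitely
        resp = requests.post(TOKEN_URL, data={...}, timeout=10)
        resp.raise_for_status()
        data = resp.json()
        self._token = data['access_token']
        self._expires_at = time.time() + data['expires_in']
        return self._token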

Rate Limits During Bulk Operations

You'll hit rate limits during initial provisioning. Not maybe. You will. Enterprise platforms document rate limits inconsistently, and the limits that apply to bulk operations are often different from what applies to regular API traffic.

Exponential backoff with jitter on every request is non-negotiable. The jitter prevents synchronized retry storms when multiple processes hit the limit at the same time.

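import random
import time


def call_with_retry(fn, max_retries=4):
    for attempt in range(max_retries):
        try:
            return fn()
        except RateLimitError:  # whatever your client raises on a rate limit
            if attempt == max_retries - 1:
                break  # no point sleeping after the final attempt
            # Exponential backoff (1s, 2s, 4s, ...) plus up to 1s of jitter
            wait = (2 ** attempt) + random.uniform(0, 1)
            time.sleep(wait)
    raise Exception(f"Max retries exceeded after {max_retries} attempts")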

Log every rate limit hit. If they're happening regularly in normal operation, you need to redesign the integration, not just tune the retry parameters.

Error Handling: Silent Failures Are Worse Than Loud Ones

The most dangerous failure mode isn't an exception. It's a 200 response with an error object in the body, a null value where a record should be, or a batch operation that partially succeeds without telling you which records failed.

Read the API documentation carefully for how errors are actually represented. Many enterprise APIs don't follow REST conventions strictly. A 200 with {"success": false, "error": "..."} is more common than you'd expect.
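One way to keep a 200-with-error from slipping through is to run every response through a single check before anything else touches it. This is a minimal sketch, not any particular platform's contract: the `{"success": ..., "error": ...}` envelope is the example shape from above, and `APIError` and `check_response` are hypothetical names. Adjust the body check to whatever your API actually returns.

```python
class APIError(Exception):
    """Raised when a response signals failure, whatever the HTTP status was."""

    def __init__(self, message, status=None):
        super().__init__(message)
        self.message = message
        self.status = status


def check_response(status_code, body):
    """Return the parsed body if the call really succeeded, else raise APIError.

    Assumes a hypothetical envelope of the form {"success": bool, "error": str}.
    """
    # Conventional HTTP errors first
    if status_code >= 400:
        raise APIError(f"HTTP {status_code}", status=status_code)
    # A missing or non-object body where a record should be is also a failure
    if not isinstance(body, dict):
        raise APIError("Response body is not a JSON object", status=status_code)
    # The silent case: HTTP 200 but the body reports an error
    if body.get("success") is False:
        raise APIError(body.get("error") or "Unknown error", status=status_code)
    return body
```

Turning the silent failure into an exception means it flows through the same retry and logging paths as every other error.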

Every error path should log enough context to diagnose the problem without reproducing it:

except APIError as e:
    logger.error(
        "Provisioning failed",
        extra={
            "user": user_data.get("email"),
            "status_code": e.status,
            "error_message": e.message,
            "request_id": e.request_id,
        }
    )
    raise

Pagination Is the Thing People Forget

You build against a test environment with 50 users and everything works. You deploy to production with 50,000 users and the first sync only imports the first page. Check the API docs for pagination: whether it exists and what the maximum page size is. Some platforms cap at 100 records per request regardless of what you ask for.
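A rough sketch of a pagination loop that survives a server-side page cap. It assumes offset/limit pagination, and `fetch_page` is a hypothetical callable standing in for your API call; cursor-based APIs need the same loop shape with a different stop condition.

```python
def fetch_all(fetch_page, page_size=100):
    """Collect every record by requesting pages until an empty one comes back.

    fetch_page(offset, limit) is a hypothetical callable that returns a list
    of records starting at `offset`, containing at most `limit` items.
    """
    records = []
    offset = 0
    while True:
        page = fetch_page(offset, page_size)
        if not page:
            break  # an empty page means there is nothing left
        records.extend(page)
        # Advance by what was actually returned, not by what was requested,
        # so a server that silently caps the page size does not cause gaps
        offset += len(page)
    return records
```

Stopping on a short page instead of an empty one saves a request, but it ends the sync early when the platform returns fewer records than you asked for.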

Idempotency: Design for Re-runs

Integrations get re-run after failures. A provisioning script that creates duplicate records when run twice is a problem that's annoying to clean up at 50 users and catastrophic at 50,000. Before writing a record, check if it exists. Before sending a provisioning request, check current state. Running it twice should produce the same result as running it once.
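The check-then-write pattern might look something like this. It's a sketch under assumptions: `store` is a hypothetical dict keyed by email that stands in for the remote system, and `ensure_user` is an illustrative name. In a real integration the lookup and the write would be API calls.

```python
def ensure_user(store, user):
    """Create or update a user so that repeated runs converge on the same state.

    `store` is a hypothetical dict keyed by lowercased email.
    Returns "created", "updated", or "unchanged".
    """
    key = user["email"].lower()
    existing = store.get(key)
    if existing is None:
        store[key] = dict(user)
        return "created"
    if existing != user:
        store[key] = dict(user)
        return "updated"
    # Already in the desired state, so a re-run does nothing
    return "unchanged"
```

Returning the outcome makes re-runs auditable: a second pass over the same input should report nothing but "unchanged".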

The Observability You Actually Need

Three things, built in from the start:

Structured logs with enough context to diagnose issues in production without reproducing them
An alert when the integration hasn't run successfully in longer than expected
A way to check current sync status without digging through logs

This is always described as optional in early project planning. It's not optional. It's the difference between finding out about a silent failure in the morning standup and finding out about it two weeks later when someone notices the data is stale.
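The staleness alert and the status check can share one small function. A minimal sketch, assuming you record a Unix timestamp each time a run succeeds; `sync_status` and its return shape are hypothetical, and where the timestamp lives (database row, file, metrics system) is up to you.

```python
import time


def sync_status(last_success_at, max_age_seconds, now=None):
    """Summarize sync health from the timestamp of the last successful run."""
    now = time.time() if now is None else now
    if last_success_at is None:
        return {"healthy": False, "age_seconds": None, "reason": "never succeeded"}
    age = now - last_success_at
    if age > max_age_seconds:
        return {"healthy": False, "age_seconds": age, "reason": "stale"}
    return {"healthy": True, "age_seconds": age, "reason": "ok"}
```

Expose the result on a status endpoint or dashboard, and alert whenever `healthy` is false.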
