
FastAPI Microservices Architecture: A Production Guide

FastAPI microservices done right: where they belong in a polyglot fleet, the async model that decides performance, and the worker math most teams skip.

Part of Polyglot Microservices: Choosing the Right Language

By Colson · Distinguished Software Engineer, Founder

June 22, 2026 · 8 min read


A production FastAPI microservices architecture earns its place for one reason: it is the fastest way to put a typed, async, well-documented HTTP boundary in front of Python code you already need to run, usually machine-learning inference, data work, or glue.

Use it where Python is the right runtime, keep it off your latency-critical hot paths, and run it under a production ASGI server with the worker math done deliberately. Get the async model and the process count right, and a FastAPI service holds its SLO. Get them wrong, and no framework saves you.

Why FastAPI architecture decisions matter

The most common FastAPI failure in production is not a FastAPI failure. It is a deployment that runs a single worker and blocks its event loop on a CPU-bound call, after which someone concludes “Python is slow.”

FastAPI is fast for what it is. The performance you actually get depends almost entirely on whether you respect the async model and size the process pool to the workload. This post describes the architecture I use to run FastAPI inside a polyglot system, and where it does and does not belong. It is part of the Language choices in polyglot microservices series.

Where FastAPI belongs in a polyglot fleet

The right question is never “FastAPI vs Go.” It is “what is this service’s job, and is Python the right runtime for it?”

FastAPI belongs in front of work that is already Python: model inference, feature engineering, anything leaning on NumPy, Pandas, PyTorch, or the scientific stack. Rewriting that in Go to save a few milliseconds of framework overhead trades a small latency win for a large velocity loss.

It does not belong on a high-throughput, latency-critical edge path. That is Go’s job, for the reasons in Go vs Rust for Microservices: When to Choose Which. A FastAPI service that exists only to forward JSON is a service that should have been written in the language the rest of your edge already uses.

The cleanest pattern is FastAPI as the typed boundary over a Python compute core, called by Go orchestration services over gRPC or HTTP. The Python service does the Python-shaped work and nothing else.

FastAPI vs Go: which runtime for which service?

Use this as a quick triage when a new service shows up. The rule is “match the runtime to the work,” not “use what the team knows.”

Service type	Runtime	Why
ML inference / model serving	FastAPI	The work is already Python; rewriting loses velocity
Feature engineering, data transforms	FastAPI	NumPy/Pandas ecosystem, fast iteration
High-throughput API edge / gateway	Go	Latency-critical, concurrency-heavy, GC pauses are acceptable
Control plane / orchestration	Go	Cloud-native ecosystem, fast compile loop
Latency-critical hot path	Rust	Deterministic tail latency (see the hot-path guide)
Glue / JSON forwarding	Go	If it isn’t Python compute, it isn’t a FastAPI job

Should I use async def or def for FastAPI routes?

Use async def only when everything the handler awaits is non-blocking. If the handler calls synchronous libraries, use a plain def so FastAPI runs it in a threadpool. The one combination to avoid is an async def handler that calls a blocking driver, which stalls the event loop.

FastAPI is built on Starlette and the ASGI protocol, and within each worker process its async handlers run cooperatively on a single-threaded event loop. One blocking call on that loop freezes the worker for every concurrent request it is handling.

The rule is mechanical. If a path handler is async def, everything it awaits must be non-blocking: an async database driver, an async HTTP client, an async cache client. The moment you call a synchronous library inside an async def, you have stalled the event loop for every other request on that worker.
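
The stall is easy to observe with the standard library alone. The sketch below is not FastAPI code: it runs a hypothetical handler next to a 10 ms ticker on one event loop and reports the longest pause the ticker saw. `time.sleep` stands in for any synchronous library call.

```python
import asyncio
import time


async def max_tick_gap(handler) -> float:
    """Run `handler` next to a 10 ms ticker and return the ticker's longest gap."""
    gaps = []

    async def ticker():
        last = time.perf_counter()
        while True:
            await asyncio.sleep(0.01)
            now = time.perf_counter()
            gaps.append(now - last)
            last = now

    tick_task = asyncio.create_task(ticker())
    await asyncio.sleep(0.03)   # let the ticker run a few times first
    await handler()
    await asyncio.sleep(0.03)   # give the ticker a chance to record the gap
    tick_task.cancel()
    return max(gaps)


async def blocking_handler():
    time.sleep(0.2)             # synchronous call: holds the event loop


async def offloaded_handler():
    await asyncio.to_thread(time.sleep, 0.2)   # same call, run in a thread


def measure():
    blocked = asyncio.run(max_tick_gap(blocking_handler))
    offloaded = asyncio.run(max_tick_gap(offloaded_handler))
    return blocked, offloaded


if __name__ == "__main__":
    blocked, offloaded = measure()
    print(f"longest ticker gap: blocking={blocked:.3f}s offloaded={offloaded:.3f}s")
```

With the blocking handler the ticker is frozen for the whole call. With the offloaded one it keeps ticking, which is the behavior every other request on the worker depends on.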

If you must call blocking code, two correct options exist:

Define the handler as a plain def. FastAPI then runs it in a threadpool, so it does not block the loop.
For CPU-bound work, push it to a process pool or a separate worker service. Threads do not give CPU-bound Python code parallelism, because the Global Interpreter Lock lets only one thread execute Python bytecode at a time.

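from fastapi import FastAPI

app = FastAPI()
# sync_db and async_db are placeholders for a blocking and an async database client.
# The first two handlers share a path only to show the contrast; keep one in a real app.

# WRONG: blocks the event loop for every concurrent request on this worker
@app.get("/user/{uid}")
async def get_user_blocking(uid: int):
    return sync_db.query(uid)        # synchronous call inside async handler

# RIGHT: async driver, yields the loop while the query is in flight
@app.get("/user/{uid}")
async def get_user(uid: int):
    return await async_db.query(uid)

# ALSO RIGHT: blocking work in a plain def -> FastAPI runs it in a threadpool
@app.get("/report/{rid}")
def build_report(rid: int):
    return sync_db.heavy_query(rid)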
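
For the CPU-bound case, one way to sketch the process-pool option uses only the standard library. `count_primes` is a made-up stand-in for heavy compute, and the pool size and inputs are arbitrary assumptions. In a real service the pool would be created once at startup rather than per request.

```python
import asyncio
from concurrent.futures import Executor, ProcessPoolExecutor
from typing import Optional


def count_primes(limit: int) -> int:
    """CPU-bound stand-in: count primes below `limit` by trial division."""
    count = 0
    for n in range(2, limit):
        if all(n % d for d in range(2, int(n ** 0.5) + 1)):
            count += 1
    return count


async def count_primes_offloaded(pool: Optional[Executor], limit: int) -> int:
    # run_in_executor returns an awaitable, so the event loop keeps serving
    # other requests while a pool worker does the computation.
    # Passing None falls back to the loop's default thread pool.
    loop = asyncio.get_running_loop()
    return await loop.run_in_executor(pool, count_primes, limit)


async def main() -> None:
    # Separate processes sidestep the GIL, so the two calls can run in parallel.
    with ProcessPoolExecutor(max_workers=2) as pool:
        results = await asyncio.gather(
            count_primes_offloaded(pool, 20_000),
            count_primes_offloaded(pool, 30_000),
        )
    print(results)


if __name__ == "__main__":
    asyncio.run(main())
```

The function passed to the pool has to be importable at module level so it can be pickled, which is why `count_primes` is a plain top-level function.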

How many Uvicorn workers should I run?

Start at roughly one worker per CPU core for I/O-bound services, then tune against real latency. For CPU-bound work the GIL means extra per-worker concurrency does not help, so match workers to cores and move heavy compute elsewhere. Always multiply your worker count by per-worker memory before you size the pod.

A FastAPI deployment runs N worker processes, each with its own event loop. Throughput and resource use both scale with N, and picking N is arithmetic, not a default.

For I/O-bound services, a single worker handles many concurrent requests because it yields the loop on every await. You scale workers to use available cores and to provide failure isolation, not to add concurrency per request.
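
A small illustration of that point, using plain asyncio rather than FastAPI. `fake_downstream_call` is a hypothetical stand-in for an awaited database or HTTP call, and the request count and delay are arbitrary.

```python
import asyncio
import time


async def fake_downstream_call(delay: float) -> str:
    # Stand-in for an awaited database or HTTP call; yields the loop while waiting.
    await asyncio.sleep(delay)
    return "ok"


async def serve_concurrently(requests: int, delay: float) -> float:
    """Handle `requests` I/O-bound calls on one event loop; return wall time."""
    start = time.perf_counter()
    results = await asyncio.gather(
        *(fake_downstream_call(delay) for _ in range(requests))
    )
    assert all(r == "ok" for r in results)
    return time.perf_counter() - start


elapsed = asyncio.run(serve_concurrently(requests=200, delay=0.05))
print(f"200 requests x 50 ms each finished in {elapsed:.3f} s on one loop")
```

Run back to back, the same calls would take about ten seconds. On one loop they overlap, so a single worker gets through them in roughly the time of one call.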

For CPU-bound work, concurrency per worker is a lie. The GIL serializes Python bytecode, so a worker running a CPU-bound request serves nothing else until it finishes. You have three options: move the work to a process pool, match workers to cores one-to-one, or move the compute out of Python entirely.

Memory is the constraint people forget. Each worker is a full Python process with its own copy of the loaded model or large in-memory data. Four workers each loading a 2 GB model is 8 GB per pod, not 2 GB.
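
The arithmetic is simple enough to write down. The helper names below are hypothetical, and the sketch assumes every worker holds a full private copy of the model with nothing shared between processes.

```python
def pod_memory_gb(workers: int, per_worker_gb: float, overhead_gb: float = 0.0) -> float:
    """Total pod memory when every worker holds its own copy of the model."""
    return workers * per_worker_gb + overhead_gb


def max_workers(cores: int, memory_limit_gb: float, per_worker_gb: float) -> int:
    """Largest worker count that fits both the cores and the memory limit.

    A result of 0 means even one worker does not fit in the limit.
    """
    by_memory = int(memory_limit_gb // per_worker_gb)
    return min(cores, by_memory)


print(pod_memory_gb(workers=4, per_worker_gb=2.0))                   # 8.0
print(max_workers(cores=8, memory_limit_gb=6.0, per_worker_gb=2.0))  # 3
```

In the second call memory, not cores, is the binding constraint: eight cores are available but only three workers fit. The resulting number is what you would pass to the server, for example through Uvicorn's `--workers` option.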

The FastAPI production checklist

These settings separate a FastAPI service that holds its SLO from one that pages you.

ASGI server: Uvicorn workers managed by a process supervisor, or another production ASGI server you have measured a reason to prefer. Never run the development server in production.
No blocking in async: every await hits an async driver; CPU-bound work goes to a pool or another service.
Timeouts everywhere: client timeouts on every downstream call, plus a server-side request timeout. A request waiting forever on a hung downstream holds its connection and memory indefinitely, and enough of them exhaust the worker.
Health checks that mean something: a readiness probe that fails when the model is not loaded or a critical dependency is down, not one that returns 200 simply because the process started.
Structured logging and distributed tracing: OpenTelemetry instrumentation so a slow request is debuggable across the boundary, which matters more in a polyglot system.
Pydantic v2 models at the edge: validate input at the boundary and let the typed model carry through. The validation cost is real but cheap relative to the bugs it stops.
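
Two of these items, timeouts and meaningful readiness, can be sketched with the standard library. The function names are hypothetical, and how the results map onto HTTP responses is left to the service.

```python
import asyncio


async def call_with_timeout(coro, seconds: float):
    """Await a downstream call, but give up after `seconds`."""
    try:
        return await asyncio.wait_for(coro, timeout=seconds)
    except asyncio.TimeoutError:
        # Return None so the caller can answer with a 504 or a fallback
        # instead of holding the request open.
        return None


def readiness(model_loaded: bool, dependencies_ok: bool) -> tuple[int, dict]:
    """Return (HTTP status, body) for a readiness probe."""
    ready = model_loaded and dependencies_ok
    body = {"model_loaded": model_loaded, "dependencies_ok": dependencies_ok}
    return (200 if ready else 503), body


async def slow_downstream() -> str:
    await asyncio.sleep(1.0)     # hypothetical hung dependency
    return "late"


if __name__ == "__main__":
    print(asyncio.run(call_with_timeout(slow_downstream(), 0.05)))
    print(readiness(model_loaded=False, dependencies_ok=True))
```

The readiness function returns 503 until the model is loaded and dependencies answer, so the orchestrator keeps traffic away from a process that has started but cannot serve.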

Is FastAPI fast enough for production microservices?

Yes, for the right jobs. When a service is I/O-bound or fronts Python compute, and you run it under a production ASGI server with no blocking in async handlers, FastAPI holds its SLO comfortably. It is the wrong choice for latency-critical, high-throughput edge paths, which belong in Go.

Plan capacity from the bottleneck, and for most FastAPI services the bottleneck is either downstream I/O or model compute, not FastAPI itself. Start by measuring time spent in the service: framework and serialization overhead versus time awaiting downstream calls versus time in actual compute. The split tells you what to scale. If 90% of the time is model inference, more workers will not help a single slow request; a faster model, batching, or a GPU will.
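
One lightweight way to get that split, before reaching for full tracing, is a per-phase timer. This is a sketch under assumptions: the phase names and the sleeps are placeholders for real work, and `PhaseTimer` is a made-up helper, not part of any library.

```python
import time
from collections import defaultdict
from contextlib import contextmanager


class PhaseTimer:
    """Accumulate wall time per named phase of a request."""

    def __init__(self) -> None:
        self.totals = defaultdict(float)

    @contextmanager
    def phase(self, name: str):
        start = time.perf_counter()
        try:
            yield
        finally:
            self.totals[name] += time.perf_counter() - start

    def shares(self) -> dict:
        """Fraction of total measured time spent in each phase."""
        total = sum(self.totals.values())
        if not total:
            return {}
        return {name: t / total for name, t in self.totals.items()}


timer = PhaseTimer()
with timer.phase("downstream"):
    time.sleep(0.01)            # stand-in for awaiting a downstream call
with timer.phase("compute"):
    time.sleep(0.05)            # stand-in for model inference
with timer.phase("serialization"):
    time.sleep(0.001)           # stand-in for framework and JSON overhead

for name, share in timer.shares().items():
    print(f"{name}: {share:.0%}")
```

If the compute share dominates, as it does with these placeholder numbers, adding workers raises throughput but leaves the latency of a single request unchanged.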

Then size for the tail. An inference service with variable input sizes has a long latency tail, and your p99 is set by your largest realistic input, not your average. Provision and set timeouts against the tail, not the mean.
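
A short worked example shows how far the tail can sit from the mean. The latencies are invented for illustration: 97 typical requests plus 3 large inputs.

```python
import math


def percentile(samples: list[float], pct: float) -> float:
    """Nearest-rank percentile: smallest sample with at least pct% at or below it."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


# Made-up latencies in ms: 97 typical requests plus 3 large inputs.
latencies_ms = [40.0] * 97 + [900.0, 1200.0, 1500.0]

mean_ms = sum(latencies_ms) / len(latencies_ms)
p50_ms = percentile(latencies_ms, 50)
p99_ms = percentile(latencies_ms, 99)

print(f"mean={mean_ms:.1f} ms  p50={p50_ms:.0f} ms  p99={p99_ms:.0f} ms")
```

Here the mean is 74.8 ms while the p99 is 1200 ms. A timeout chosen from the mean would cut off exactly the large inputs the service exists to handle.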

What I’d do differently

The mistake I see most is using FastAPI as a general-purpose service framework because the team knows Python, then fighting its concurrency model on paths that were never Python-shaped to begin with.

If I were drawing the lines again, I would be stricter: FastAPI only where Python compute is the actual product of the service. Everything else, the routing, the auth edge, the high-throughput glue, goes to Go. The polyglot win comes from each language doing the job it is good at, not from one language doing every job because it is familiar.

The honest tradeoff is velocity against runtime cost. Python ships data and ML work faster than anything else and costs more per request to run. When the work is genuinely Python, that trade is worth it. When it is not, you are paying the runtime cost for none of the velocity benefit. Once a Python service talks to a Go one, the seam itself becomes the risk: see Why Language Boundaries Break Polyglot Microservices.

Sources

FastAPI documentation, Concurrency and async/await: fastapi.tiangolo.com/async
Starlette documentation: starlette.io
Uvicorn documentation, Deployment: uvicorn.org/deployment


Frequently asked questions

Is FastAPI fast enough for production microservices?

Yes, when the service is I/O-bound or fronts Python compute, and when you run it under a production ASGI server with no blocking calls in async handlers. It is not the right choice for latency-critical, high-throughput edge paths, which belong in Go.

How many Uvicorn workers should I run?

Start at roughly one per CPU core for I/O-bound services and tune against real latency. For CPU-bound work the GIL means per-worker concurrency does not help, so match workers to cores and move heavy compute to a process pool or another service. Always account for per-worker memory.

Should I use async def or def for FastAPI routes?

Use async def only when everything you await is non-blocking. If a handler calls synchronous libraries, use plain def so FastAPI runs it in a threadpool. The dangerous case is an async def handler calling a blocking driver, which stalls the event loop.

Is FastAPI good for microservices, or should I use Go?

Use FastAPI where Python is the right runtime, mainly machine learning and data work. Use Go for the network and orchestration layer. In a polyglot fleet they are complementary, not competitors.
