FastAPI Microservices Architecture: A Production Guide
FastAPI microservices done right: where they belong in a polyglot fleet, the async model that decides performance, and the worker math most teams skip.
Part of Polyglot Microservices: Choosing the Right Language
A production FastAPI microservices architecture earns its place for one reason: it is the fastest way to put a typed, async, well-documented HTTP boundary in front of Python code you already need to run, usually machine-learning inference, data work, or glue.
Use it where Python is the right runtime, keep it off your latency-critical hot paths, and run it under a production ASGI server with the worker math done deliberately. Get the async model and the process count right, and a FastAPI service holds its SLO. Get them wrong, and no framework saves you.
Why FastAPI architecture decisions matter
The most common FastAPI failure in production is not a FastAPI failure. It is a deployment that runs one synchronous worker and blocks the event loop on a CPU-bound call, after which someone concludes “Python is slow.”
FastAPI is fast for what it is. The performance you actually get depends almost entirely on whether you respect the async model and size the process pool to the workload. This post describes the architecture I use to run FastAPI inside a polyglot system, and the specific places where it does and does not belong. It is part of the Language choices in polyglot microservices series.
Where FastAPI belongs in a polyglot fleet
The right question is never “FastAPI vs Go.” It is “what is this service’s job, and is Python the right runtime for it?”
FastAPI belongs in front of work that is already Python: model inference, feature engineering, anything leaning on NumPy, Pandas, PyTorch, or the scientific stack. Rewriting that in Go to save a few milliseconds of framework overhead trades a small latency win for a large velocity loss.
It does not belong on a high-throughput, latency-critical edge path. That is Go’s job, for the reasons in Go vs Rust for Microservices: When to Choose Which. A FastAPI service that exists only to forward JSON is a service that should have been written in the language the rest of your edge already uses.
The cleanest pattern is FastAPI as the typed boundary over a Python compute core, called by Go orchestration services over gRPC or HTTP. The Python service does the Python-shaped work and nothing else.
FastAPI vs Go: which runtime for which service?
Use this as a quick triage when a new service shows up. The rule is “match the runtime to the work,” not “use what the team knows.”
| Service type | Use | Why |
|---|---|---|
| ML inference / model serving | FastAPI | The work is already Python; rewriting loses velocity |
| Feature engineering, data transforms | FastAPI | NumPy/Pandas ecosystem, fast iteration |
| High-throughput API edge / gateway | Go | Latency-critical, concurrency-heavy, GC is fine |
| Control plane / orchestration | Go | Cloud-native ecosystem, fast compile loop |
| Latency-critical hot path | Rust | Deterministic tail latency (see the hot-path guide) |
| Glue / JSON forwarding | Go | If it isn’t Python compute, it isn’t a FastAPI job |
Should I use async def or def for FastAPI routes?
Use async def only when everything the handler awaits is non-blocking. If the handler calls synchronous libraries, use a plain def so FastAPI runs it in a threadpool. The one combination to avoid is an async def handler that calls a blocking driver, which stalls the event loop.
FastAPI is built on ASGI and Starlette, and its concurrency story is single-threaded cooperative async per worker process. One blocking call freezes that worker for every concurrent request it is handling.
The rule is mechanical. If a path handler is async def, everything it awaits must be non-blocking: an async database driver, an async HTTP client, an async cache client. The moment you call a synchronous library inside an async def, you have stalled the event loop for every other request on that worker.
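To make the stall concrete, here is a minimal sketch that uses only asyncio, with no FastAPI involved. The handler names and the 0.1-second delay are illustrative assumptions; `time.sleep` stands in for any synchronous driver call made inside an async handler.

```python
import asyncio
import time

DELAY = 0.1  # illustrative "database latency" in seconds


async def blocking_handler() -> None:
    # Synchronous call inside a coroutine: the event loop cannot run
    # anything else until this returns.
    time.sleep(DELAY)


async def non_blocking_handler() -> None:
    # Awaiting yields control, so other coroutines run during the wait.
    await asyncio.sleep(DELAY)


async def _elapsed(handler, requests: int) -> float:
    start = time.perf_counter()
    await asyncio.gather(*(handler() for _ in range(requests)))
    return time.perf_counter() - start


def measure(handler, requests: int = 5) -> float:
    """Wall-clock seconds to serve `requests` concurrent calls on one loop."""
    return asyncio.run(_elapsed(handler, requests))
```

With five concurrent calls, `measure(blocking_handler)` should take roughly five times as long as `measure(non_blocking_handler)`, because the blocking version serves the calls one after another while the awaiting version overlaps them.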
If you must call blocking code, two correct options exist:
- Define the handler as a plain def. FastAPI then runs it in a threadpool, so it does not block the loop.
- For CPU-bound work, push it to a process pool or a separate worker service. Threads do not buy you parallelism against Python’s Global Interpreter Lock.
```python
from fastapi import FastAPI

app = FastAPI()

# sync_db and async_db are placeholders for a blocking client and an async client.

# WRONG: blocks the event loop for every concurrent request on this worker
@app.get("/user/{uid}")
async def get_user(uid: int):
    return sync_db.query(uid)  # synchronous call inside async handler

# RIGHT: async driver, never blocks the loop
@app.get("/user/{uid}")
async def get_user(uid: int):
    return await async_db.query(uid)

# ALSO RIGHT: blocking work in a plain def -> FastAPI runs it in a threadpool
@app.get("/report/{rid}")
def build_report(rid: int):
    return sync_db.heavy_query(rid)
```
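The process-pool option is not shown above, so here is one way it could look, sketched with the standard library only. `cpu_bound_score` and `score_handler` are hypothetical names, and the sum of squares is a stand-in for real compute. The handler takes the executor as a parameter, so any `concurrent.futures` executor can be passed in.

```python
import asyncio
from concurrent.futures import Executor, ProcessPoolExecutor


def cpu_bound_score(n: int) -> int:
    """Hypothetical stand-in for CPU-heavy work: sum of squares below n."""
    return sum(i * i for i in range(n))


async def score_handler(executor: Executor, n: int) -> int:
    """Run the CPU-bound function in the executor and await its result."""
    loop = asyncio.get_running_loop()
    # The loop keeps serving other requests while the executor does the work.
    return await loop.run_in_executor(executor, cpu_bound_score, n)


async def main() -> list:
    # Create the pool once and reuse it; starting processes per request is costly.
    with ProcessPoolExecutor(max_workers=2) as pool:
        return await asyncio.gather(
            score_handler(pool, 1_000_000),
            score_handler(pool, 2_000_000),
        )
```

To try it, call `asyncio.run(main())` from under an `if __name__ == "__main__":` guard, which process pools need on platforms that spawn rather than fork. In a real service the pool would more likely be created at startup and shared across requests.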
How many Uvicorn workers should I run?
Start at roughly one worker per CPU core for I/O-bound services, then tune against real latency. For CPU-bound work the GIL means extra per-worker concurrency does not help, so match workers to cores and move heavy compute elsewhere. Always multiply your worker count by per-worker memory before you size the pod.
A FastAPI deployment runs N worker processes, each with its own event loop. Throughput and resource use both scale with N, and picking N is arithmetic, not a default.
For I/O-bound services, a single worker handles many concurrent requests because it yields the loop on every await. You add workers to use the available cores and to provide failure isolation, not to raise how many requests each worker can juggle.
For CPU-bound work, concurrency per worker is a lie. The GIL serializes Python bytecode, so a worker handling a CPU-bound request serves nothing else until that request finishes. You have three options: move the work to a process pool, match workers to cores one-to-one, or move the compute out of Python entirely.
Memory is the constraint people forget. Each worker is a full Python process with its own copy of the loaded model or large in-memory data. Four workers each loading a 2 GB model is 8 GB per pod, not 2 GB.
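The arithmetic is simple enough to write down. The sketch below is a rough sizing heuristic, not a rule from any of the tools involved: the function names, the optional overhead term, and the "cap workers by cores and by memory" policy are all assumptions for illustration.

```python
import math


def pod_memory_gb(workers: int, per_worker_gb: float, overhead_gb: float = 0.0) -> float:
    """Memory a pod needs when every worker holds its own copy of the model."""
    return workers * per_worker_gb + overhead_gb


def max_workers_for_pod(cpu_cores: int, pod_memory_limit_gb: float, per_worker_gb: float) -> int:
    """Largest worker count that fits both the core count and the memory limit."""
    by_memory = math.floor(pod_memory_limit_gb / per_worker_gb)
    if by_memory < 1:
        raise ValueError("pod memory limit cannot hold even one worker")
    return min(cpu_cores, by_memory)
```

For the example in the text, `pod_memory_gb(4, 2.0)` gives 8.0. Going the other way, a 4-core pod with a 6 GB limit and 2 GB per worker fits only three workers, so memory rather than CPU sets the count.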
The FastAPI production checklist
These settings separate a FastAPI service that holds its SLO from one that pages you.
- ASGI server: Uvicorn workers managed by a process supervisor, or another production ASGI server you have measured a reason to prefer. Never run the development server in production.
- No blocking in async: every await hits an async driver; CPU-bound work goes to a pool or another service.
- Timeouts everywhere: client timeouts on every downstream call, plus a server-side request timeout. A worker waiting forever on a hung downstream takes its whole concurrency slot down with it.
- Health checks that mean something: a readiness probe that fails when the model is not loaded or a critical dependency is down, not one that returns 200 simply because the process started.
- Structured logging and distributed tracing: OpenTelemetry instrumentation so a slow request is debuggable across the boundary, which matters more in a polyglot system.
- Pydantic v2 models at the edge: validate input at the boundary and let the typed model carry through. The validation cost is real but cheap relative to the bugs it stops.
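As an illustration of the timeout item, here is a minimal sketch of bounding a downstream call with `asyncio.wait_for`. The helper `call_with_timeout` and the two fake downstreams are hypothetical. Returning a fallback value is only one possible policy; a real service might instead map the timeout to an error response.

```python
import asyncio


async def call_with_timeout(coro, seconds: float, fallback=None):
    """Await `coro`, but give up after `seconds` and return `fallback`."""
    try:
        # wait_for cancels the inner awaitable when the timeout expires.
        return await asyncio.wait_for(coro, timeout=seconds)
    except asyncio.TimeoutError:
        return fallback


async def hung_downstream() -> str:
    await asyncio.sleep(3600)  # simulates a dependency that never answers
    return "late"


async def healthy_downstream() -> str:
    await asyncio.sleep(0.01)  # simulates a quick, well-behaved dependency
    return "ok"
```

Calling `asyncio.run(call_with_timeout(hung_downstream(), 0.05, fallback="timed out"))` returns the fallback after about 50 ms instead of holding the request open for an hour.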
Is FastAPI fast enough for production microservices?
Yes, for the right jobs. When a service is I/O-bound or fronts Python compute, and you run it under a production ASGI server with no blocking in async handlers, FastAPI holds its SLO comfortably. It is the wrong choice for latency-critical, high-throughput edge paths, which belong in Go.
Plan capacity from the bottleneck, and for most FastAPI services the bottleneck is either downstream I/O or model compute, not FastAPI itself. Start by measuring time spent in the service: framework and serialization overhead versus time awaiting downstream calls versus time in actual compute. The split tells you what to scale. If 90% of the time is model inference, more workers will not help a single slow request; a faster model, batching, or a GPU will.
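One lightweight way to get that split is to time each phase of a request and look at the shares. The `PhaseTimer` class below is a hypothetical helper, sketched with the standard library; in production the same numbers would more likely come from tracing spans.

```python
import time
from collections import defaultdict
from contextlib import contextmanager


class PhaseTimer:
    """Accumulates wall-clock time per named phase of request handling."""

    def __init__(self) -> None:
        self.totals = defaultdict(float)

    @contextmanager
    def phase(self, name: str):
        start = time.perf_counter()
        try:
            yield
        finally:
            # Record the time even if the phase raised.
            self.totals[name] += time.perf_counter() - start

    def shares(self) -> dict:
        """Fraction of total recorded time spent in each phase."""
        total = sum(self.totals.values())
        if total == 0:
            return {}
        return {name: spent / total for name, spent in self.totals.items()}
```

Wrap each part of a handler in `with timer.phase("downstream"):` or `with timer.phase("compute"):` and read `timer.shares()` afterwards. If compute dominates, adding workers will not speed up a single request.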
Then size for the tail. An inference service with variable input sizes has a long latency tail, and your p99 is set by your largest realistic input, not your average. Provision and set timeouts against the tail, not the mean.
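A small calculation shows how far the tail can sit from the mean. The latency samples below are made up for illustration: 98 requests at 40 ms and 2 large-input requests at 900 ms.

```python
import statistics


def mean_and_p99(samples_ms):
    """Return (mean, 99th percentile) of a list of latency samples."""
    mean = statistics.fmean(samples_ms)
    # quantiles(n=100) returns 99 cut points; index 98 is the 99th percentile.
    p99 = statistics.quantiles(samples_ms, n=100, method="inclusive")[98]
    return mean, p99


# Hypothetical samples: mostly small inputs, a few large ones.
samples = [40.0] * 98 + [900.0] * 2
mean_ms, p99_ms = mean_and_p99(samples)
```

Here the mean comes out near 57 ms while the p99 is 900 ms. A timeout or capacity plan built around the mean would fail exactly the requests that define the tail.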
What I’d do differently
The mistake I see most is using FastAPI as a general-purpose service framework because the team knows Python, then fighting its concurrency model on paths that were never Python-shaped to begin with.
If I were drawing the lines again, I would be stricter: FastAPI only where Python compute is the actual product of the service. Everything else, the routing, the auth edge, the high-throughput glue, goes to Go. The polyglot win comes from each language doing the job it is good at, not from one language doing every job because it is familiar.
The honest tradeoff is velocity against runtime cost. Python ships data and ML work faster than anything else and costs more per request to run. When the work is genuinely Python, that trade is worth it. When it is not, you are paying the runtime cost for none of the velocity benefit. Once a Python service talks to a Go one, the seam itself becomes the risk: see Why Language Boundaries Break Polyglot Microservices.
Sources
- FastAPI documentation, Concurrency and async/await: fastapi.tiangolo.com/async
- Starlette documentation: starlette.io
- Uvicorn documentation, Deployment: uvicorn.org/deployment
Frequently asked questions
Is FastAPI fast enough for production microservices?
Yes, when the service is I/O-bound or fronts Python compute, and when you run it under a production ASGI server with no blocking calls in async handlers. It is not the right choice for latency-critical, high-throughput edge paths, which belong in Go.
How many Uvicorn workers should I run?
Start at roughly one per CPU core for I/O-bound services and tune against real latency. For CPU-bound work the GIL means per-worker concurrency does not help, so match workers to cores and move heavy compute to a process pool or another service. Always account for per-worker memory.
Should I use async def or def for FastAPI routes?
Use async def only when everything you await is non-blocking. If a handler calls synchronous libraries, use plain def so FastAPI runs it in a threadpool. The dangerous case is an async def handler calling a blocking driver, which stalls the event loop.
Is FastAPI good for microservices, or should I use Go?
Use FastAPI where Python is the right runtime, mainly machine learning and data work. Use Go for the network and orchestration layer. In a polyglot fleet they are complementary, not competitors.