Production-Ready Machine Learning NLP API for Classification with FastAPI and Transformers

The first version of FastAPI was released at the end of 2018, and it has been increasingly used in production applications since then. It is what we use under the hood at NLP Cloud to easily and efficiently serve our hundreds of NLP models for entity extraction (NER), text classification, sentiment analysis, question answering, summarization... We found that FastAPI is a great way to serve transformer-based deep learning models.

In this article, we thought it would be interesting to show you how we are implementing an NLP API based on Hugging Face transformers with FastAPI.

Why Use FastAPI?

Before FastAPI, we had essentially used Django Rest Framework for our Python APIs, but we were quickly interested in FastAPI for the following reasons:

This great performance makes FastAPI perfectly suited for machine learning APIs serving transformer-based models like ours.

Install FastAPI

In order for FastAPI to work, we are coupling it with the Uvicorn ASGI server, which is the modern way to natively handle asynchronous Python requests with asyncio. You can either decide to install FastAPI with Uvicorn manually or download a ready-to-use Docker image. Let's show the manual installation first:

pip install fastapi[all]

Then you can start it with:

uvicorn main:app

Sebastián Ramírez, the creator of FastAPI, provides several ready-to-use Docker images that make it very easy to use FastAPI in production. The Uvicorn + Gunicorn + FastAPI image takes advantage of Gunicorn in order to use several processes in parallel. In the end, thanks to Uvicorn you can handle many concurrent requests within the same Python process, and thanks to Gunicorn you can spawn several Python processes.

Your FastAPI application will automatically start when starting the Docker container with docker run.

It is important to properly read the documentation of these Docker images as there are some settings you might want to tweak, like for example the number of parallel processes created by Gunicorn. By default, the image spawns as many processes as the number of CPU cores on your machine. But in case of demanding machine learning models like NLP Transformers, it can quickly lead to tens of GBs of memory used. One strategy would be to leverage the Gunicorn --preload option, in order to load your model only once in memory and share it among all the FastAPI Python processes. Another option would be to cap the number of Gunicorn processes. Both have advantages and drawbacks, but that's beyond the scope of this article.
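To make these two strategies more concrete, here is a minimal sketch of a Gunicorn configuration file written in Python. The file name gunicorn_conf.py and the API_MAX_WORKERS environment variable are assumptions made for this example; the Docker images have their own documented settings, which may differ. The sketch combines both ideas (a cap and preloading), but you may well want only one of them.

```python
# gunicorn_conf.py (hypothetical file name)
import multiprocessing
import os

# Hypothetical environment variable used to cap the number of worker processes.
MAX_WORKERS = int(os.environ.get("API_MAX_WORKERS", "2"))

# Never more workers than CPU cores, never more than the cap, and at least one.
workers = max(1, min(multiprocessing.cpu_count(), MAX_WORKERS))

# Run each Gunicorn worker with Uvicorn's ASGI worker class.
worker_class = "uvicorn.workers.UvicornWorker"

# Same effect as the --preload command line option: the application (and the
# model it loads at import time) is loaded before the workers are forked.
preload_app = True
```

Such a file would typically be passed to Gunicorn with something like gunicorn -c gunicorn_conf.py main:app.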

Simple FastAPI + Transformers API for Text Classification

Text classification is the process of determining what a piece of text is talking about (Space? Business? Food?...). More details about text classification here.

We want to create an API endpoint that performs text classification using Facebook's Bart Large MNLI model, a pre-trained model available through Hugging Face transformers and perfectly suited for text classification.

Our API endpoint will take a piece of text as an input, along with potential categories (called labels), and it will return a score for each category (the higher, the more likely).
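As a rough sketch of what a client for such an endpoint could look like in Python, the snippet below builds the POST request with the standard library only. The URL, the token placeholder and the build_classification_request helper are assumptions made for this example; the request is built but not sent, so it can be inspected without a running server.

```python
import json
import urllib.request


def build_classification_request(url, token, text, labels):
    """Build (but do not send) a POST request carrying the text and the labels."""
    payload = json.dumps({"text": text, "labels": labels}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=payload,
        method="POST",
        headers={
            "Authorization": f"Token {token}",
            "Content-Type": "application/json",
        },
    )


request = build_classification_request(
    "http://localhost:8000/classification",  # assumed local address
    "<your-token>",  # placeholder, not a real token
    "John Doe is a Go Developer at Google.",
    ["job", "nature", "space"],
)

# Sending the request requires a running server:
# with urllib.request.urlopen(request) as response:
#     result = json.loads(response.read())
```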

We will request the endpoint with POST requests like this:

curl "" \
-H "Authorization: Token e7f6539e5a5d7a16e15" \
-X POST -d '{
    "text":"John Doe is a Go Developer at Google. He has been working there for 10 years and has been awarded employee of the year.",
    "labels":["job", "nature", "space"]
}'

And in return we would get a response like:

{
    "labels": [...],
    "scores": [...]
}

Here is how to achieve it with FastAPI and Transformers:

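from fastapi import FastAPI
from pydantic import BaseModel, constr, conlist
from typing import List
from transformers import pipeline

# Load the zero-shot classification pipeline once, at startup.
classifier = pipeline("zero-shot-classification",
                      model="facebook/bart-large-mnli")
app = FastAPI()

# Input format. Note: min_items is the Pydantic v1 keyword
# (Pydantic v2 renamed it to min_length).
class UserRequestIn(BaseModel):
    text: constr(min_length=1)
    labels: conlist(str, min_items=1)

# Output format: only these two fields are returned to the client.
class ScoredLabelsOut(BaseModel):
    labels: List[str]
    scores: List[float]

@app.post("/classification", response_model=ScoredLabelsOut)
def read_classification(user_request_in: UserRequestIn):
    return classifier(user_request_in.text, user_request_in.labels)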

First things first: we're loading Facebook's Bart Large MNLI from the Hugging Face repository, and properly initializing it for classification purposes, thanks to the Transformers pipeline:

classifier = pipeline("zero-shot-classification",
                      model="facebook/bart-large-mnli")

And later we are using the model by doing this:

classifier(user_request_in.text, user_request_in.labels)
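To make the shape of the data more concrete: the zero-shot classification pipeline returns a dictionary containing the input sequence, the labels sorted by decreasing score, and the scores. The sketch below uses a stand-in function instead of the real model (so it runs without downloading anything), and its uniform scores are placeholders, not model outputs. It shows that an output model declaring only labels and scores leaves the extra key out.

```python
from typing import List

from pydantic import BaseModel


class ScoredLabelsOut(BaseModel):
    labels: List[str]
    scores: List[float]


def fake_classifier(text, labels):
    """Stand-in for the real pipeline: same keys, placeholder uniform scores."""
    return {
        "sequence": text,
        "labels": list(labels),
        "scores": [1.0 / len(labels)] * len(labels),
    }


raw = fake_classifier("Some text", ["job", "nature"])

# Unknown keys such as "sequence" are ignored by Pydantic's default settings.
out = ScoredLabelsOut(**raw)
```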

Second important thing: we are performing data validation thanks to Pydantic. Pydantic forces you to declare in advance the input and output format for your API, which is great from a documentation standpoint, but also because it limits potential mistakes. In Go you would do pretty much the same thing with JSON unmarshalling with structs. constr(min_length=1) is an easy way to declare that the "text" field should have at least 1 character. And conlist(str, min_items=1) specifies that the input list of labels should contain at least one element. List[str] means that the "labels" output field should be a list of strings, and List[float] means that the scores should be a list of floats. If the model returns results that don't follow this format, FastAPI will automatically raise an error.

class UserRequestIn(BaseModel):
    text: constr(min_length=1)
    labels: conlist(str, min_items=1)

class ScoredLabelsOut(BaseModel):
    labels: List[str]
    scores: List[float]

Last of all, the @app.post("/classification", response_model=ScoredLabelsOut) decorator makes it easy to specify that you only accept POST requests, on a specific endpoint.

More Advanced Data Validation

You can do much more complex validation, for example through composition. Let's say that you are doing Named Entity Recognition (NER), so your model is returning a list of entities. Each entity would have 4 fields: text, type, start and end. Here is how you could do it:

class EntityOut(BaseModel):
    start: int
    end: int
    type: str
    text: str

class EntitiesOut(BaseModel):
    entities: List[EntityOut]

@app.post("/entities", response_model=EntitiesOut)
# [...]
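As a small illustration of what composition brings, the sketch below validates a list of plain dictionaries against the nested models. The raw_entities value is made up for the example and is not a real model output.

```python
from typing import List

from pydantic import BaseModel, ValidationError


class EntityOut(BaseModel):
    start: int
    end: int
    type: str
    text: str


class EntitiesOut(BaseModel):
    entities: List[EntityOut]


# Hypothetical raw output: each dictionary is validated and turned into an EntityOut.
raw_entities = [{"start": 0, "end": 8, "type": "PERSON", "text": "John Doe"}]
validated = EntitiesOut(entities=raw_entities)

# An entity with missing fields makes the whole response invalid.
try:
    EntitiesOut(entities=[{"start": 0, "text": "John Doe"}])
    incomplete_accepted = True
except ValidationError:
    incomplete_accepted = False
```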

Until now, we've let Pydantic handle the validation. It works in most cases, but sometimes you might want to dynamically raise an error by yourself based on complex conditions that are not natively handled by Pydantic. For example, if you want to manually return a HTTP 400 error, you can do the following:

from fastapi import HTTPException

raise HTTPException(status_code=400, 
        detail="Your request is malformed")

Of course you can do much more!

Setting the Root Path

If you're using FastAPI behind a reverse proxy, you will most likely need to play with the root path.

The hard thing is that, behind a reverse proxy, the application does not know about the whole URL path, so we have to explicitly tell it what it is.

For example here the full URL to our endpoint might not simply be /classification but maybe something like /api/v1/classification. We don't want to hardcode this full URL in order for our API code to be loosely coupled with the rest of the application. We could do this:

app = FastAPI(root_path="/api/v1")

Or alternatively you could pass a parameter to Uvicorn when starting it:

uvicorn main:app --root-path /api/v1


I hope we successfully showed you how convenient FastAPI can be for an NLP API. Pydantic makes the code very expressive and less error-prone.

FastAPI has great performance and makes it possible to use Python asyncio out of the box, which is great for demanding machine learning models like Transformer-based NLP models. We have been using FastAPI for almost 1 year at NLP Cloud and we have never been disappointed so far.

If you have any questions, please don't hesitate to ask. It will be a pleasure to answer!

Julien Salinas
CTO at NLP Cloud