#2. Your Best Friend - ML Monitoring

Part 2: Mastering the Art of Monitoring Deep Learning Systems That Use Unstructured Data

Jun 08, 2023

Your face when you open your production logs [Image by the Author].

In the last newsletter, we covered how to monitor an ML system that uses structured data.

In that case, you can easily talk about distributions and leverage techniques such as PSI, KL Divergence, JS Divergence, etc.

Now let’s look at how to adapt what we’ve learned for unstructured data such as images or text.

As a short reminder from the last newsletter, when a model is in production, you can acquire the GT in 3 different scenarios:

Ground truth types [Image by the Author].

Also, from your ML pipeline, you can analyze the following:

the model inputs
the model outputs
the actuals/ground truth

While in production, when you don’t have access in time to your GT, you can use drift metrics as a performance proxy.

Drifts for Unstructured Data

When it comes to monitoring unstructured data, you have two main scenarios:

Computer Vision

Drift can occur in production due to the following:
- drastic changes in the pixel distribution due to blurry, spotted, lightened, darkened, rotated, and cropped images
- new objects
- new unique situations, events, and interactions

For example, your training dataset contained only bounding boxes with one cat and in production, the model encountered images with groups of cats or multiple animals.

NLP

Drift can occur in production due to the following:
- new words
- new category
- new language
- changes in terminology of words

Embeddings

When manipulating unstructured data, everything revolves around embeddings.

You can effortlessly encode the text, an image, a video, etc., using an embedding.

If you already have a model, you can use it to compute your embeddings (e.g., the output of the last fully connected layer before classification) or use a generic global model (e.g., BERT from HuggingFace, GPT from OpenAI, etc.).

After you set your baseline (e.g., your training dataset or a window from the past from your production dataset) and production window, you compute the average of the embeddings for every window.

Now you will have two averaged embeddings: one for the baseline and one for the production window.

The last step is to pick a metric to compare them, such as IOU, euclidean distance, cosine distance, or clustering-based group purity scores.

Using Euclidean distance would be a great start.

Detect drift for unstructured data [Image by the Author].

How to Compute Embeddings

When you think about embeddings, you might associate them with deep learning, but I have to tell you there are other ways to compute them, such as using SVD or PCA.

If you are familiar with recommender systems, you have often seen that in the matrix factorization techniques used for collaborative filtering.

There is nothing wrong with these techniques, but they have their limitations, such as not responding well to missing data.

There is where neural networks kick in.

One first well-known initiatives to compute embeddings is word2vec.

Again, nothing wrong with this method. Everything I mentioned above is still widely used in the industry.

But, unstructured data is most commonly processed by deep neural networks. Thus, I will boil it down to two prominent cases.

#1. You are using your own DL model

You have the most control; you must know and apply the proper method to extract the embedding from your model.

You don't have to implement additional models. You must understand the model's architecture and use the proper extraction techniques.

When we are aiming to embed an image, we can have these two scenarios:

1. When classifying with a CNN-based model, you can extract the embedding right before the last classification layer, as seen in Fig. 1.

2. When classifying with a transformer-based model, you have to extract the embedding right before the MLP head, as seen in Fig. 2.

The network's output is a vector with rich information about the image at these extraction points.

#2. You are using a global pretrained model

This is the easiest method, as you can use pretrained models such as BERT from Hugging Face or GPT from OpenAI.

Unfortunately, you lack control using this method, and the embeddings won't be fine-tuned for your particular use case.

But this is a great starting point for many projects.

Extract embeddings [Image by the Author].

But why do we love embeddings so much?

Because embeddings became the most widely used interface between models, just like a REST interface between different microservices.

Also, you can quickly compose embeddings using simple operations such as distances, projections, additions, averages, etc., and they still keep their meaning in the vector space.

Why Data Versioning is Crucial When Using Different Extractions Methods

I showed you a few examples of extraction methods for various models.

But what happens if you want to change the extraction method or model to compare the performance?

Most probably, it will soon become a mess.

We all encountered situations such as: "final_model," "best_final_model," "best_final_final_model," etc. You get the idea... It is tough to keep track of our changes.

3 types of changes can occur when extracting embeddings:

#1. Change your model architecture

This is considered a major change: O.x.x

Changing your model architecture might change the semantics of the embeddings and their dimensionality. Also, as a by-product, it changes your extraction method, and you must retrain your model.

#2. Change your extraction type

This is considered a minor change: x.O.x

Again this might result in changes in your semantics or dimensionality, but you don't have to retrain your model.

#3. Retrain your model

This is considered a patch version change: x.x.O

This won't change your embedding structure, but by retraining, they won't be compatible with your old set of embeddings as the vector space might change.

As you see, your embeddings will change quite often, that is why you need to...

Version your data!

Data versioning is one key aspect of a clean ML system.

Every change will result in a new data version. Then, when you use a specific set of embeddings, you will know exactly how they were computed.

You can easily version your data directly in the feature store for structured data. You can quickly add data versioning for unstructured data using tools such as S3 + DVC/your custom software.

Different extraction types [Image by the Author].

Final Thoughts

Thank you for reading my newsletter. I am excited to share my ML journey with you.

To conclude…

This week you learned about the following:

monitoring unstructured data
how to compute embeddings
why versioning your embeddings is crucial

Discussion about this post

Ready for more?