To understand, turn up the heat
A cold model is either correct or confidently incorrect
Let’s talk about “temperature” in the world of ML. This setting is key in language models and machine translation systems because it’s the variable that strikes the balance between confidence and uncertainty in the machine’s responses. The hotter it gets, the weirder things get. Let’s learn why.
When a model comes back with raw predictions (called logits), they are not normalized – they’re arbitrary scores, not probabilities that sum to one, and taking them at face value can lead to overconfidence. To solve this problem, the logits are fed to a function called softmax (sometimes called a softargmax), which converts them into a proper probability distribution. In addition to the logits, you can also give softmax a temperature: the logits are divided by it before being exponentiated, which controls how extreme the gaps between them look. A lower temperature exaggerates those gaps and makes the model very confident (even if it is incorrect – not good), while a higher temperature “increases the smoothing”, makes things seem less extreme, and reduces the confidence of the model (great for encouraging the model to get creative).
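Here’s a minimal sketch of what softmax with temperature actually does (the logit values are just made-up examples):

```python
import math

def softmax_with_temperature(logits, temperature=1.0):
    """Convert raw logits into a probability distribution.

    Dividing by the temperature before exponentiating controls how
    peaked the result is: T < 1 sharpens it, T > 1 flattens it.
    """
    scaled = [logit / temperature for logit in logits]
    # Subtract the max before exponentiating for numerical stability.
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.1]
cold = softmax_with_temperature(logits, temperature=0.5)  # confident, peaked
hot = softmax_with_temperature(logits, temperature=2.0)   # smoothed, hedging
```

With the same logits, the “cold” distribution piles most of its probability onto the top choice, while the “hot” one spreads it out – which is exactly why sampling at high temperature produces more surprising output.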
I did a bunch of work with OpenAI Whisper to subtitle and translate a Japanese documentary so that I would have English subtitles, and temperature plays a major role in a task like this. Lower temperatures are better for accuracy, but if your audio isn’t perfect (background noise, bad pronunciation, overlapping audio, etc.) a low temperature may fail to extract any subtitles at all. That’s why Whisper uses a temperature increase as an automatic fallback: it lets the model get creative in interpreting the audio. The catch is that a raised temperature causes drift and context-understanding problems, so after the model gets whatever data it can out of a difficult segment, the temperature has to be periodically reset.
parser.add_argument("--temperature_increment_on_fallback", type=optional_float, default=0.2, help="temperature to increase when falling back when the decoding fails to meet either of the thresholds below")
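Under the hood, that fallback is essentially a retry loop: decode at temperature 0, check quality thresholds, and bump the temperature if the result looks bad. Here’s a rough sketch of the idea – the `decode` callable is a hypothetical stand-in, not Whisper’s actual API, though the threshold defaults mirror Whisper’s documented ones:

```python
def transcribe_with_fallback(decode,
                             temperatures=(0.0, 0.2, 0.4, 0.6, 0.8, 1.0),
                             logprob_threshold=-1.0,
                             compression_ratio_threshold=2.4):
    """Try decoding at increasing temperatures, Whisper-fallback style.

    `decode` is a hypothetical callable:
        decode(t) -> (text, avg_logprob, compression_ratio)
    """
    result = None
    for t in temperatures:
        text, avg_logprob, compression_ratio = decode(t)
        result = (text, t)
        # Accept the segment once it meets both quality thresholds;
        # otherwise retry with a hotter (more creative) decode.
        if (avg_logprob >= logprob_threshold
                and compression_ratio <= compression_ratio_threshold):
            break
    return result

def fake_decode(t):
    # Stub audio decoder: pretends decoding only succeeds at T >= 0.4.
    if t < 0.4:
        return ("", -2.0, 3.0)
    return ("hello world", -0.5, 1.5)

text, final_temperature = transcribe_with_fallback(fake_decode)
```

The key detail is that the temperature bump applies per segment – the next segment starts cold again, which is the “periodic reset” mentioned above.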
Another thing to note is that temperature can be a canary in the coal mine for concept drift! It’s a great metric to monitor in production: if the model requires a temperature bump to reach the same level of confidence it had before, it’s having to get creative, which means the data it’s seeing no longer looks like the data it was trained on (and it’s time to re-train). SREs should be aware of this and make it part of their SLOs for ML services.
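A simple way to turn that canary into an alert is to track how often recent segments needed a fallback at all. This is an illustrative sketch – the class name, window size, and 20% threshold are all made up for the example:

```python
from collections import deque

class FallbackRateMonitor:
    """Track the share of recent segments that needed a temperature bump.

    A rising rate suggests production inputs have drifted away from the
    training data. (The window and threshold here are illustrative.)
    """

    def __init__(self, window=100, alert_threshold=0.2):
        self.events = deque(maxlen=window)
        self.alert_threshold = alert_threshold

    def record(self, final_temperature):
        # Any segment decoded above temperature 0 used the fallback.
        self.events.append(final_temperature > 0.0)

    def should_alert(self):
        if not self.events:
            return False
        rate = sum(self.events) / len(self.events)
        return rate >= self.alert_threshold

# 2 of the last 10 segments needed a fallback -> rate hits the threshold.
monitor = FallbackRateMonitor(window=10, alert_threshold=0.2)
for temp in [0.0] * 8 + [0.4, 0.6]:
    monitor.record(temp)
```

In practice you’d export this rate to whatever your SREs already watch (Prometheus, CloudWatch, etc.) rather than alerting from inside the app.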
Temperature is all about finding the right balance for your specific task. Whisper handles this automatically for you by bumping the temperature and resetting it periodically, but if you’re building your own system like this, you’ll need to tune and monitor the temperature.
Help Your Future-Self Succeed with ML Monitoring and Retraining
It’s worth it, trust me 🙂
Seriously, your future self will thank you for setting up a well-defined pipeline around this stuff when it inevitably comes time to retrain the model. In the same way that a CI/CD (Continuous Integration / Continuous Deployment) pipeline is helpful for reproducibility, a CI/CT/CD/CM pipeline (those new ones are Continuous Training and Continuous Monitoring) will ensure consistency in your model, minimize concept drift, increase reproducibility, and increase portability (should you need to change where you train or inference).
The first step is to make sure you have a system in place to monitor your model in production to catch data format issues and model performance degradation (how often is it making a bad prediction?). Once you have metrics, you can set up automation to trigger if a threshold is passed. For example, is your random number generator model suddenly returning lots of ones? Maybe there’s a problem there. Probably time for retraining (and a larger dataset). If you care about model accuracy, you can use DVC (Data Version Control) so that you know when the dataset has changed and it’s time to retrain.
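The trigger logic itself can be dead simple. Here’s a hypothetical sketch of the policy described above – the function name, accuracy floor, and version strings are all illustrative, and in a real setup the dataset version would come from something like `dvc status` or your data registry:

```python
def needs_retraining(live_accuracy, dataset_version, deployed_dataset_version,
                     accuracy_floor=0.90):
    """Decide whether to kick off the training pipeline.

    Illustrative policy: retrain if live accuracy drops below a floor,
    or if the tracked dataset version no longer matches the one the
    deployed model was trained on.
    """
    if live_accuracy < accuracy_floor:
        return True  # performance degradation threshold was crossed
    return dataset_version != deployed_dataset_version  # data changed
```

Calling this on a schedule (or from a webhook when the dataset changes) is all it takes to close the loop between monitoring and Continuous Training.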
Ideally, you’re retraining whenever your training data changes…but that’s also expensive. So if you are on a budget, it’s likely advisable to set a higher model performance threshold for the conditions that warrant training, and only then pull in a new dataset. When you’re ready to train, it’s off to tools like Kubeflow, Vertex AI Pipelines, or Ray to orchestrate a training session and handle the metadata collection.
Similar to Continuous Delivery and other aspects of DevOps, you’ll need to get buy-in from your organization to really make Continuous Training worthwhile. It’s also important to remember that automation isn’t everything – you’re still going to want to have someone on call who can take a look at things and see if the model needs to be changed (and not just retrained).
By implementing automation, you ensure organizational continuity, continued model relevance, and a better experience for your customers. It’s usually worth the cost!
Building a Sandcastle on the Tideline: Embracing change with ML Ops
If the world hadn’t changed, my model would be amazing!
I see machine learning models a lot like vaccine boosters. Let me explain.
The way vaccine reformulation works is that we take the vaccine that took years to develop (model development), and slightly change it to address the current environment it operates within (new data). We make the changes needed to stand up against the current most popular viral variant…only to manufacture it (train it) for a few months. When the vaccine (model) finally emerges, it finds itself in a world very different from the one it was designed (trained) for. It still does an admirable job, and it’s better than nothing, but the variant it was designed for is likely no longer the dominant one. Imagine how effective the vaccine could be if the cycle time from modification to production was shortened! That extended manufacturing/training time has resulted in a model that is already out of date. This is called concept drift.
I used to think they just train and “launch” ML models and then let them run like most software, dealing with operational issues and such, but for the most part letting the already-written software do the work. Wrong. AI models are not like regular software. They constantly need to be updated. If you’re not constantly monitoring for drift and retraining with new data, you’ll quickly fall behind. Of course, there are ways to make the impact of concept drift feel less drastic (like allowing extensions that pull in real, live data). But when it comes to making predictions and formulating ideas on research, you need a model that knows the latest way to be 😎
How do you solve this? Manually training a model is fine in the initial prototyping stage, but you can’t ship something like that. If you did, you would be managing the model lifecycle by hand forever, which isn’t sustainable. Beyond an initial local experiment in a notebook, you need to be thinking about pipelines and automation from the beginning. Run the training pipeline whenever there is new data – this is called Continuous Training (CT). The automated process can do supervised training and update the model weights for you.
The industry throws the term model weights around a lot, and I think it deserves more of an explanation. The concept of model weights was inspired by the theory of Hebbian plasticity in the brain (a form of synaptic plasticity that basically says “neurons that fire together, wire together”). The way the theory goes is that if two neurons in the brain are working together often enough, they will strengthen the connection between themselves for higher-bandwidth transmission (just like we widen our roads when there is too much traffic at rush hour). This strengthened bond ends up rewiring our neurons, and it’s how we learn! 🧠
Machine learning models do something similar. During supervised training, the model’s predictions are scored against the labeled dataset with a loss function that measures how wrong each prediction is. If I ask the model what Jane’s favorite color is, it had better get the right answer if that answer is explicitly spelled out in the dataset it was trained on. When a prediction is wrong, the training process nudges the weights (via gradient descent) in the direction that reduces the error, strengthening the connections that lead to right answers – just like our brains strengthen neural connections. This is where the term neural network comes from.
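You can see the whole “nudge the weight to reduce the error” idea with a single weight. This toy sketch fits y = w · x by gradient descent on squared error – everything here (the data, learning rate, epoch count) is just made up for illustration:

```python
def train_single_weight(examples, lr=0.1, epochs=50):
    """Fit y = w * x with gradient descent on squared error.

    Every example the model gets wrong nudges the weight in the
    direction that shrinks the error -- the 'strengthen what works'
    idea from the analogy above, in its simplest possible form.
    """
    w = 0.0
    for _ in range(epochs):
        for x, y in examples:
            prediction = w * x
            error = prediction - y
            # d/dw of (w*x - y)^2 is 2 * error * x; step against it.
            w -= lr * 2 * error * x
    return w

# Learn that y = 2x from a few labeled examples.
w = train_single_weight([(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)])
```

Real networks do this across millions or billions of weights at once (with backpropagation computing all the gradients), but the update rule is the same shape.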
I’ll do a post soon that goes into detail on getting that magical ML retraining pipeline into place, but for now just remember that automation is key (always be training) and that models have to learn and stay up to date, just like you do.