To understand, turn up the heat
A cold model is either correct or confidently incorrect
Let’s talk about “temperature” in the world of ML. This parameter is key in language models and machine translation systems, because it’s the knob that strikes the balance between confidence and awareness of uncertainty in the machine’s responses. The hotter it gets, the weirder things get. Let’s learn why.
When a model comes back with raw predictions (called logits), they are not normalized: they are arbitrary scores rather than probabilities, so they don’t sum to 1 and can’t be sampled from directly. To solve this problem, the logits are fed to a function called softmax (sometimes called softargmax), which turns them into a probability distribution. In addition to feeding the softmax the logits, you can also give it a temperature: each logit is divided by the temperature before the exponentials are taken, which controls how much the gaps between scores matter. A lower temperature exaggerates the gaps, piles almost all of the probability onto the top-scoring option, and makes the model very confident (even if it is incorrect, which is not good). A higher temperature “increases the smoothing”: it flattens the distribution, makes things seem less extreme, and reduces the confidence of the model (great for encouraging the model to get creative).
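To make that concrete, here is a minimal sketch of softmax with a temperature in plain Python. The function name and the example logits are made up for illustration; real frameworks do the same thing on tensors.

```python
import math

def softmax_with_temperature(logits, temperature=1.0):
    """Turn raw logits into probabilities, dividing by the temperature first."""
    if temperature <= 0:
        # T = 0 is usually treated as "just pick the argmax" rather than computed.
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    # Subtract the max before exponentiating for numerical stability.
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.1]  # made-up scores for three candidate tokens
for t in (0.5, 1.0, 2.0):
    probs = softmax_with_temperature(logits, t)
    print(t, [round(p, 3) for p in probs])
```

Running it, the top candidate’s share of the probability shrinks as the temperature goes up, and the long-shot candidates gain ground.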
I did a bunch of work with OpenAI Whisper to subtitle and translate a Japanese documentary so that I would have English subtitles, and temperature plays a major role in a task like this. Lower temperatures are better for accuracy, but if your audio isn’t perfect (background noise or audio, bad pronunciation, etc.), a lower temperature may fail to extract any subtitles at all. Increasing the temperature can help the model get creative in interpreting the audio, and Whisper uses this as an automatic fallback. The problem is that increasing the temperature like this causes drift and context-understanding problems, so after the model gets whatever data it can out of the input, the temperature has to be periodically reset. This is the command-line option that controls the size of each bump:
parser.add_argument("--temperature_increment_on_fallback", type=optional_float, default=0.2, help="temperature to increase when falling back when the decoding fails to meet either of the thresholds below")
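Here is a rough sketch of what a fallback loop like this can look like. It is not Whisper’s actual code: `DecodeResult`, `temperature_schedule`, `decode_with_fallback`, and `fake_decode` are hypothetical names, and the two threshold values are assumptions chosen for illustration.

```python
from dataclasses import dataclass

@dataclass
class DecodeResult:
    text: str
    avg_logprob: float        # how sure the model was, on average
    compression_ratio: float  # high values suggest repetitive, looping output
    temperature: float        # the temperature this result was decoded at

def temperature_schedule(increment=0.2, start=0.0, stop=1.0):
    """Build the list of temperatures to try, coldest first."""
    steps = int(round((stop - start) / increment))
    return [round(start + i * increment, 10) for i in range(steps + 1)]

def decode_with_fallback(decode, temperatures,
                         compression_ratio_threshold=2.4,  # assumed value
                         logprob_threshold=-1.0):          # assumed value
    """Try each temperature in order and keep the first acceptable result."""
    if not temperatures:
        raise ValueError("need at least one temperature to try")
    result = None
    for t in temperatures:
        result = decode(t)
        too_repetitive = result.compression_ratio > compression_ratio_threshold
        too_unsure = result.avg_logprob < logprob_threshold
        if not (too_repetitive or too_unsure):
            break  # good enough, stop heating things up
    return result  # if nothing passed, this is the hottest attempt

def fake_decode(t):
    # Stand-in decoder: pretends this audio only decodes well at T >= 0.4.
    ok = t >= 0.4
    return DecodeResult("hello", -0.5 if ok else -1.5, 1.2, t)

result = decode_with_fallback(fake_decode, temperature_schedule())
print(result.temperature)
```

Because the schedule is rebuilt from the coldest value for every new chunk of audio, each chunk starts cold again, which is one simple way to get the periodic reset.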
Another thing to note is that temperature can be a canary in the coal mine for concept drift! It’s a great metric to monitor in production: if the model needs a temperature bump more often than it used to in order to reach the same level of confidence about its predictions, it is having to get creative with inputs it used to handle cold. That suggests the data it was trained on no longer looks like what it is seeing (and it may be time to re-train). SREs should be aware of this, and it should be part of their SLOs for ML services.
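One way to track this is to record the temperature each request finally succeeded at and watch how often a fallback was needed. The sketch below is only an illustration: `FallbackMonitor`, the window size, and the 20% limit are all made-up names and numbers, and a real service would export these values to its metrics system instead of checking them in-process.

```python
from collections import deque

class FallbackMonitor:
    """Rolling view of how often decoding needed a temperature above zero."""

    def __init__(self, window=500, max_fallback_rate=0.2):
        self.temps = deque(maxlen=window)  # oldest samples fall off automatically
        self.max_fallback_rate = max_fallback_rate

    def record(self, final_temperature):
        self.temps.append(final_temperature)

    def fallback_rate(self):
        if not self.temps:
            return 0.0
        return sum(1 for t in self.temps if t > 0) / len(self.temps)

    def mean_temperature(self):
        if not self.temps:
            return 0.0
        return sum(self.temps) / len(self.temps)

    def breached(self):
        # True when fallbacks are more common than the agreed limit.
        return self.fallback_rate() > self.max_fallback_rate

monitor = FallbackMonitor(window=4, max_fallback_rate=0.5)
for t in (0.0, 0.0, 0.0, 0.2):
    monitor.record(t)
print(monitor.fallback_rate(), monitor.breached())
```

A rising fallback rate does not prove drift on its own (a noisy batch of inputs would do the same), so it works best as a trigger to go and look rather than as an automatic re-train signal.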
Temperature is all about finding the right balance for your specific task. Whisper handles this automatically for you by bumping the temperature and resetting it periodically, but if you’re building your own system like this, you’ll need to tune and monitor the temperature.