Evaluation
==========

.. _ood-explanation:

Description and Background
--------------------------
One main part of Monitizer is its evaluation.
It needs two key elements:

* a fully specified monitor (i.e. not only a monitor template)
* OOD data to evaluate on

Since it is not always desirable to optimize the monitor already and because it is often done in related work, we also include the computation of the AUROC.
"A receiver operating characteristic curve, or ROC curve, is a graphical plot that illustrates the performance of a binary classifier model (can be used for multi class classification as well) at varying threshold values." (see `Wikipedia <https://en.wikipedia.org/wiki/Receiver_operating_characteristic>`_).
Many monitors use some threshold that determines whether an input is ID or OOD. 
For such monitors, we can compute the ROC by scaling the threshold. 
The AUROC is then the area under the curve.

Monitizer allows for two possible evaluation modes: full (or test) and auroc.
It evaluates the monitor on a set of OOD data. 
As we discuss in our paper (`ArXiv <https://arxiv.org/pdf/2405.10350>`_), it is not entirley clear what OOD sets are a reasonable choice.
We provide an **automatic generation** of OOD images based on certain modifications and a hand selected collection of OOD images for existing datasets.

.. image:: ../images/ood-classes.png
  :width: 800
  :alt: Examples for the OOD classes
  
In the picture, we demonstrate the OOD classes for the `CIFAR10 <https://www.cs.toronto.edu/~kriz/cifar.html>`_ dataset:

(a) ID image
(b) brighter
(c) increased contrast
(d) Gaussian blur
(e) inverted colors
(f) rotated
(g) Gaussian noise
(h) Salt-and-Pepper noise
(i) `FGSM <https://arxiv.org/abs/1412.6572>`_
(j) Unseen object from CIFAR-100
(k) New world from DTD
(l) New world from GTSRB
(m) Not OOD

b - i are the automatically generated images (where for i you need a trained NN). 
j - l are hand collected by us.
The class *unseen object* (j) describes objects in the image that are different from the images in the ID dataset, but in a similar "style".
The class *new world* (k) describes a completely new image that has no resemblance to anything in the ID dataset.
The image (m) contains the image of a truck, which is something that appears in the ID dataset. Therefore, this should not necessarily be used as an OOD image and is, thus, not part of the OOD datasets.

.. _ood-classes:

The OOD classes
^^^^^^^^^^^^^^^
Monitizer provides automatically generated OOD inputs for any dataset, even custom implemented ones (see :ref:`dataset-implementation`).
The classes are the following:

* Noise/Gaussian
* Noise/SaltAndPepper
* Perturbation/Contrast
* Perturbation/GaussianBlur
* Perturbation/Invert
* Perturbation/Rotate
* Perturbation/Light

Implementation Details
----------------------
The evaluation happens in one Python script :mod:`monitizer.evaluate`.
Starting point is the function :meth:`monitizer.evaluate.evaluate`, which then either calls :meth:`monitizer.evaluate.evaluate_on` for the standard evaluation or :meth:`monitizer.evaluate.evaluate_auroc` for the evaluation of the AUROC.