Average Precision (AP), more commonly, further averaged over all queries and reported as a single score — Mean Average Precision (MAP) — is a very popular performance measure in information retrieval. However, it is quite tricky to interpret and compare the scores that are seen. From my observations, most hard challenges (TRECVID Semantic Indexing / Genre Classification) have very low (0.05 to 0.3) MAP scores (max 1).
This post is my view on what the score conveys. I will also try to interpret what is a “good” MAP score, and what is “bad”. While this depends on the application, it is still good to have an idea of what to expect.
So firstly, what is MAP, or AP. Suppose we are searching for images of a flower and we provide our image retrieval system a sample picture of a rose (query), we do get back a bunch of ranked images (from most likely to least likely). Usually not all of them are correct. So we compute the precision at every correctly returned image, and then take an average. If our returned result is
1, 0, 0, 1, 1, 1
where 1 is an image of a flower, while 0 not, then the precision at every correct point is: how many correct images have been encountered up to this point (including current) divided by the total images seen up to this point.
1/1, 0, 0, 2/4, 3/5, 4/6
The AP for above example is 0.6917 (see comments).
0.461, and although of the 6 images returned 4 are correct, the AP is less than half! This is precisely what makes the score somewhat hard to understand.
A simple way to interpret is to produce a combination of zeros and ones which will give the required AP.
For example, an AP of 0.5 could have results like
0, 1, 0, 1, 0, 1, ...
where every second image is correct, while an AP of 0.333 has
0, 0, 1, 0, 0, 1, 0, 0, 1, ...
where every third image is correct.
Hopefully, it is clear by now that for an AP of 0.1, every 10th image will be correct, and that is definitely a bad retrieval system. On the other hand, for an AP above 0.5, we will encounter more correct images than wrong in the top results which is definitely a good sign.
MAP is just an extension, where the mean is taken across all AP scores for many queries, and again the above interpretation should hold true. Does your system have a MAP > 0.5? 🙂
Further, I would consider an improvement in MAP score from 0.2 (every 5th) to 0.4 (every 2nd/3rd) as much more significant than an improvement from 0.7 to 0.9 when the results are already pretty good.
While, the possibility to estimate a MAP score given the (multi-class) classification accuracy still remains a mystery to me, any other interpretations are more than welcome!
UPDATE 1: Another measure, the Mean Reciprocal Rank is equivalent to MAP when the number of correct elements in the returned list is just 1. In that case, we simply use the precision at that point, or the reciprocal index of the correct location and average across all queries. Thanks to zhangmeng1010 for the tip.