# Intuition behind Average Precision and MAP

Average Precision (AP), more commonly, further averaged over all queries and reported as a single score — Mean Average Precision (MAP) — is a very popular performance measure in information retrieval. However, it is quite tricky to interpret and compare the scores that are seen. From my observations, most hard challenges (TRECVID Semantic Indexing / Genre Classification) have very low (0.05 to 0.3) MAP scores (max 1).

This post is my view on what the score conveys. I will also try to interpret what is a “good” MAP score, and what is “bad”. While this depends on the application, it is still good to have an idea of what to expect.

So firstly, what is MAP, or AP. Suppose we are searching for images of a flower and we provide our image retrieval system a sample picture of a rose (query), we do get back a bunch of ranked images (from most likely to least likely). Usually not all of them are correct. So we compute the precision at every correctly returned image, and then take an average. If our returned result is
`1, 0, 0, 1, 1, 1`
where 1 is an image of a flower, while 0 not, then the precision at every correct point is: how many correct images have been encountered up to this point (including current) divided by the total images seen up to this point.
`1/1, 0, 0, 2/4, 3/5, 4/6`

The AP for above example is 0.6917 (see comments). 0.461, and although of the 6 images returned 4 are correct, the AP is less than half! This is precisely what makes the score somewhat hard to understand.

A simple way to interpret is to produce a combination of zeros and ones which will give the required AP.
For example, an AP of 0.5 could have results like
`0, 1, 0, 1, 0, 1, ...`
where every second image is correct, while an AP of 0.333 has
`0, 0, 1, 0, 0, 1, 0, 0, 1, ...`
where every third image is correct.

Hopefully, it is clear by now that for an AP of 0.1, every 10th image will be correct, and that is definitely a bad retrieval system. On the other hand, for an AP above 0.5, we will encounter more correct images than wrong in the top results which is definitely a good sign.

MAP is just an extension, where the mean is taken across all AP scores for many queries, and again the above interpretation should hold true. Does your system have a MAP > 0.5? 🙂

Further, I would consider an improvement in MAP score from 0.2 (every 5th) to 0.4 (every 2nd/3rd) as much more significant than an improvement from 0.7 to 0.9 when the results are already pretty good.

While, the possibility to estimate a MAP score given the (multi-class) classification accuracy still remains a mystery to me, any other interpretations are more than welcome!

UPDATE 1: Another measure, the Mean Reciprocal Rank is equivalent to MAP when the number of correct elements in the returned list is just 1. In that case, we simply use the precision at that point, or the reciprocal index of the correct location and average across all queries. Thanks to zhangmeng1010 for the tip.

## 26 thoughts on “Intuition behind Average Precision and MAP”

1. Dr. Computational Biologist

your AP example is incorrect. the AP of that sequence is 0.69, as you should be integrating (summing) over the “correct” images, so it’s (1 + 2/4 + 3/5 + 4/6) / 4 = 0.69.

consider that you are to some extent integrating through the precision-vs-recall space, particularly along the recall (x) axis.
each ‘delta’ on this axis represents a new “correct” or “positive” case, and there are only 4 such steps that can be taken in your example.

http://en.wikipedia.org/wiki/Information_retrieval

cheers!

1. Makarand Tapaswi Post author

Yes indeed. Thanks!

2. izyan

Thank you for correcting that..if that is the case, does AP=0.69 means for every 3rd information retieved, at least 2 will be correct?

1. Makarand Tapaswi Post author

Yes, I believe the rest of the intuition is correct. At least the AP values for the sequences are.

3. Clayton Lemons

> as you should be integrating (summing) over the “correct” images
I think the value in the denominator is even more subtle than the number of “correct” images. According to the Wikipedia article, you should be integrating over the number of *relevant* images. In your example, the number of relevant images for the rose query isn’t specified.

As an example, let’s say there are 10 images in the database, and 6 of them are roses. In that case, (1 + 2/4 + 3/5 + 4/6) / 6 = 0.461 would actually be correct. If there are 4 roses in the database, then 0.69 is correct.

>What’s the meaning of this scenario:
>I retrieve 10 images and only the first one is relevant, AP will be 1,
>1/1,0,0,0,0,0,0,0,0,0,0 AP = 1
It makes sense that AP = 1 if there is only one image in the database that is relevant to the query. “The only relevant image was retrieved, and it had the #1 rank, so you get the highest score!” However, if there are 10 relevant images in the database, then AP = 0.1.

1. Makarand Tapaswi Post author

Thanks, this does make sense!

2. StudenTt

What’s the meaning of this scenario:
I retrieve 10 images and only the first one is relevant, AP will be 1,
1/1,0,0,0,0,0,0,0,0,0,0 AP = 1?

1. Makarand Tapaswi Post author

Well it could be the case that you have only one correct image to retrieve in your data set. So, since you retrieve it first, that gives an AP of 1. In the ideal case while computing AP on data sets, you usually consider all images that should be retrieved. So if you had an image at first place, and then at 100th place, the AP score will crash. But, if you have only 1, then its perfect.

3. zhangmeng1010

Is this mean reciprocal rank, but not average percision?

1. Makarand Tapaswi Post author

Hi, Thanks for the point. Mean Reciprocal Rank (MRR) applies when the number of correct results in the list returned is just 1. In that case mAP and MRR are the same. See http://en.wikipedia.org/wiki/Mean_reciprocal_rank

1. zhangmeng1010

So the difference of those MRR and average precision on value is the nominator?

2. Makarand Tapaswi Post author

I’m not sure I understand what you mean by nominator. But the difference is basically this: If there is only one correct answer in the database for every query you make, then MRR is the same as MAP. If there are multiple correct answers, like you are searching for a beach image and you have 5 beach images in your dataset, then you should not use MRR, because you need to consider the positions of all (5) retrieved results and this can be done only by AP.

4. Luis

I don’t understand why the precision at every point is this:
1/1, 0, 0, 2/4, 3/5, 4/6

I’d say it is the following:
1/1, 1/2, 1/3, 2/4, 3/5, 4/6

…since we have one positive at the second trial and still one positive at the third trial.

If so, then the AP is 0.6, right?

1. Makarand Tapaswi Post author

The precision at every point is multiplied by whether the document is relevant or not. You can imagine that this comes from the notion of “precision” which checks only on the ones that are actually correctly predicted and discards the rest. Here are the relevant formulas and the links to the corresponding papers — http://en.wikipedia.org/wiki/Information_retrieval#Average_precision

5. Camilo

Thanks for explaining the intuition behind MAP.
I was wondering: Is there any way to use this metric with non binary labels? e.g. Predicting relevant items using ratings.

I can only imagine defining a threshold but it seems to be arbitrary…

1. Makarand Tapaswi Post author

Hi
I don’t really know about such a metric, but yes it is certainly interesting. Is your question going towards recommendation systems? You might want to check those out. Also, you can try searching for the keyword “ordinal” ranking. This is a 1 – N ordered ranking system with no item having the same rank.

1. Camilo

Yes, definitely related to recommendation systems. More specifically about a competition I’m in, but I asked the administrators and they are using a 4 out 5 stars threshold to label items as relevant as I suspected.

6. Devraj Mandal

Hello thanks for the informative tutorial. However in the Ben Hammer Test Cases : I tried to do all by hand and the results are not matching. For example for test case (1:5, [1 1 1 1 1], 5, 0.2) the MAP = 0.2 and not = 1 ? Perhaps you could clarify this for me.

1. Makarand Tapaswi Post author

Sorry, what are these test cases? Also, I don’t understand the data format, could you please clarify what do you mean by 1:5, [1 1 1 1 1], 5, 0.2? What do each of these stand for?

7. sam

For example, an AP of 0.5 could have results like
0, 1, 0, 1, 0, 1, …

Isn’t the AP 1.5 here? It’s not 0.5 but it’s a multiplication of 0.5..
1. Makarand Tapaswi Post author
8. sam