This post is my view on what the score conveys. I will also try to interpret what is a “good” MAP score, and what is “bad”. While this depends on the application, it is still good to have an idea of what to expect.

So firstly, what is MAP, or AP. Suppose we are searching for images of a flower and we provide our image retrieval system a sample picture of a rose (query), we do get back a bunch of ranked images (from most likely to least likely). Usually not all of them are correct. So we compute the precision at every correctly returned image, and then take an average. If our returned result is

`1, 0, 0, 1, 1, 1`

where 1 is an image of a flower, while 0 not, then the precision at every correct point is: how many correct images have been encountered up to this point (including current) divided by the total images seen up to this point.

`1/1, 0, 0, 2/4, 3/5, 4/6`

The AP for above example is 0.6917 (see comments). ~~0.461, and although of the 6 images returned 4 are correct, the AP is less than half! This is precisely what makes the score somewhat hard to understand.~~

A simple way to interpret is to produce a combination of zeros and ones which will give the required AP.

For example, an AP of 0.5 could have results like

`0, 1, 0, 1, 0, 1, ...`

where every second image is correct, while an AP of 0.333 has

`0, 0, 1, 0, 0, 1, 0, 0, 1, ...`

where every third image is correct.

Hopefully, it is clear by now that for an AP of 0.1, every 10th image will be correct, and that is definitely a bad retrieval system. On the other hand, for an AP above 0.5, we will encounter more correct images than wrong in the top results which is definitely a good sign.

MAP is just an extension, where the mean is taken across all AP scores for many queries, and again the above interpretation should hold true. Does your system have a MAP > 0.5? đź™‚

Further, I would consider an improvement in MAP score from 0.2 (every 5th) to 0.4 (every 2nd/3rd) as much more significant than an improvement from 0.7 to 0.9 when the results are already pretty good.

While, the possibility to estimate a MAP score given the (multi-class) classification accuracy still remains a mystery to me, any other interpretations are more than welcome!

UPDATE 1: Another measure, the Mean Reciprocal Rank is equivalent to MAP when the number of correct elements in the returned list is just 1. In that case, we simply use the precision at that point, or the reciprocal index of the correct location and average across all queries. Thanks to zhangmeng1010 for the tip.

Dr. Computational Biologistyour AP example is incorrect. the AP of that sequence is 0.69, as you should be integrating (summing) over the “correct” images, so it’s (1 + 2/4 + 3/5 + 4/6) / 4 = 0.69.

consider that you are to some extent integrating through the precision-vs-recall space, particularly along the recall (x) axis.

each ‘delta’ on this axis represents a new “correct” or “positive” case, and there are only 4 such steps that can be taken in your example.

http://en.wikipedia.org/wiki/Information_retrieval

cheers!

Makarand TapaswiPost authorYes indeed. Thanks!

izyanThank you for correcting that..if that is the case, does AP=0.69 means for every 3rd information retieved, at least 2 will be correct?

Makarand TapaswiPost authorYes, I believe the rest of the intuition is correct. At least the AP values for the sequences are.

Clayton Lemons> as you should be integrating (summing) over the â€ścorrectâ€ť images

I think the value in the denominator is even more subtle than the number of “correct” images. According to the Wikipedia article, you should be integrating over the number of *relevant* images. In your example, the number of relevant images for the rose query isn’t specified.

As an example, let’s say there are 10 images in the database, and 6 of them are roses. In that case, (1 + 2/4 + 3/5 + 4/6) / 6 = 0.461 would actually be correct. If there are 4 roses in the database, then 0.69 is correct.

For StudentTt who asked:

>Whatâ€™s the meaning of this scenario:

>I retrieve 10 images and only the first one is relevant, AP will be 1,

>1/1,0,0,0,0,0,0,0,0,0,0 AP = 1

It makes sense that AP = 1 if there is only one image in the database that is relevant to the query. “The only relevant image was retrieved, and it had the #1 rank, so you get the highest score!” However, if there are 10 relevant images in the database, then AP = 0.1.

Makarand TapaswiPost authorThanks, this does make sense!

StudenTtWhat’s the meaning of this scenario:

I retrieve 10 images and only the first one is relevant, AP will be 1,

1/1,0,0,0,0,0,0,0,0,0,0 AP = 1?

Makarand TapaswiPost authorWell it could be the case that you have only one correct image to retrieve in your data set. So, since you retrieve it first, that gives an AP of 1. In the ideal case while computing AP on data sets, you usually consider all images that should be retrieved. So if you had an image at first place, and then at 100th place, the AP score will crash. But, if you have only 1, then its perfect.

Pingback: Affinity Diffusion | The Technical Experience Page

zhangmeng1010Is this mean reciprocal rank, but not average percision?

Makarand TapaswiPost authorHi, Thanks for the point. Mean Reciprocal Rank (MRR) applies when the number of correct results in the list returned is just 1. In that case mAP and MRR are the same. See http://en.wikipedia.org/wiki/Mean_reciprocal_rank

zhangmeng1010So the difference of those MRR and average precision on value is the nominator?

Makarand TapaswiPost authorI’m not sure I understand what you mean by nominator. But the difference is basically this: If there is only one correct answer in the database for every query you make, then MRR is the same as MAP. If there are multiple correct answers, like you are searching for a beach image and you have 5 beach images in your dataset, then you should not use MRR, because you need to consider the positions of all (5) retrieved results and this can be done only by AP.

LuisI don’t understand why the precision at every point is this:

1/1, 0, 0, 2/4, 3/5, 4/6

I’d say it is the following:

1/1, 1/2, 1/3, 2/4, 3/5, 4/6

…since we have one positive at the second trial and still one positive at the third trial.

If so, then the AP is 0.6, right?

Makarand TapaswiPost authorThe precision at every point is multiplied by whether the document is relevant or not. You can imagine that this comes from the notion of “precision” which checks only on the ones that are actually correctly predicted and discards the rest. Here are the relevant formulas and the links to the corresponding papers — http://en.wikipedia.org/wiki/Information_retrieval#Average_precision

CamiloThanks for explaining the intuition behind MAP.

I was wondering: Is there any way to use this metric with non binary labels? e.g. Predicting relevant items using ratings.

I can only imagine defining a threshold but it seems to be arbitrary…

Makarand TapaswiPost authorHi

I don’t really know about such a metric, but yes it is certainly interesting. Is your question going towards recommendation systems? You might want to check those out. Also, you can try searching for the keyword “ordinal” ranking. This is a 1 – N ordered ranking system with no item having the same rank.

CamiloYes, definitely related to recommendation systems. More specifically about a competition I’m in, but I asked the administrators and they are using a 4 out 5 stars threshold to label items as relevant as I suspected.

Thanks for your answer.

Devraj MandalHello thanks for the informative tutorial. However in the Ben Hammer Test Cases : I tried to do all by hand and the results are not matching. For example for test case (1:5, [1 1 1 1 1], 5, 0.2) the MAP = 0.2 and not = 1 ? Perhaps you could clarify this for me.

Makarand TapaswiPost authorSorry, what are these test cases? Also, I don’t understand the data format, could you please clarify what do you mean by 1:5, [1 1 1 1 1], 5, 0.2? What do each of these stand for?

samFor example, an AP of 0.5 could have results like

0, 1, 0, 1, 0, 1, …

Isn’t the AP 1.5 here? It’s not 0.5 but it’s a multiplication of 0.5..

Waiting for your reply.

Makarand TapaswiPost authorThe AP has to always be in the range of 0-1. Based on how we compute it, see the example, it should be clear that we get half the retrieved results correct, resulting in 0.5 AP

samI see.

I have seen elsewhere that to calculate you need to know the number of relevant documents in the test collection. have a look at slide no 20 (page 10) here (http://web.cecs.pdx.edu/~maier/cs510iri/IR_Lectures/CS_510iri_Lecture8RelevanceEvaluation-revised.pdf). what do you think of that?

Also i have seen other ways of calculating AP where you take into account the change of recall as well..?