Intuition behind Average Precision and MAP

Average Precision (AP), more commonly, further averaged over all queries and reported as a single score — Mean Average Precision (MAP) — is a very popular performance measure in information retrieval. However, it is quite tricky to interpret and compare the scores that are seen. From my observations, most hard challenges (TRECVID Semantic Indexing / Genre Classification) have very low (0.05 to 0.3) MAP scores (max 1).

This post is my view on what the score conveys. I will also try to interpret what is a “good” MAP score, and what is “bad”. While this depends on the application, it is still good to have an idea of what to expect.

So firstly, what is MAP, or AP. Suppose we are searching for images of a flower and we provide our image retrieval system a sample picture of a rose (query), we do get back a bunch of ranked images (from most likely to least likely). Usually not all of them are correct. So we compute the precision at every correctly returned image, and then take an average. If our returned result is
1, 0, 0, 1, 1, 1
where 1 is an image of a flower, while 0 not, then the precision at every correct point is: how many correct images have been encountered up to this point (including current) divided by the total images seen up to this point.
1/1, 0, 0, 2/4, 3/5, 4/6

The AP for above example is 0.6917 (see comments). 0.461, and although of the 6 images returned 4 are correct, the AP is less than half! This is precisely what makes the score somewhat hard to understand.

A simple way to interpret is to produce a combination of zeros and ones which will give the required AP.
For example, an AP of 0.5 could have results like
0, 1, 0, 1, 0, 1, ...
where every second image is correct, while an AP of 0.333 has
0, 0, 1, 0, 0, 1, 0, 0, 1, ...
where every third image is correct.

Hopefully, it is clear by now that for an AP of 0.1, every 10th image will be correct, and that is definitely a bad retrieval system. On the other hand, for an AP above 0.5, we will encounter more correct images than wrong in the top results which is definitely a good sign.

MAP is just an extension, where the mean is taken across all AP scores for many queries, and again the above interpretation should hold true. Does your system have a MAP > 0.5? 🙂

Further, I would consider an improvement in MAP score from 0.2 (every 5th) to 0.4 (every 2nd/3rd) as much more significant than an improvement from 0.7 to 0.9 when the results are already pretty good.

While, the possibility to estimate a MAP score given the (multi-class) classification accuracy still remains a mystery to me, any other interpretations are more than welcome!

UPDATE 1: Another measure, the Mean Reciprocal Rank is equivalent to MAP when the number of correct elements in the returned list is just 1. In that case, we simply use the precision at that point, or the reciprocal index of the correct location and average across all queries. Thanks to zhangmeng1010 for the tip.

Advertisements

23 thoughts on “Intuition behind Average Precision and MAP

  1. Dr. Computational Biologist

    your AP example is incorrect. the AP of that sequence is 0.69, as you should be integrating (summing) over the “correct” images, so it’s (1 + 2/4 + 3/5 + 4/6) / 4 = 0.69.

    consider that you are to some extent integrating through the precision-vs-recall space, particularly along the recall (x) axis.
    each ‘delta’ on this axis represents a new “correct” or “positive” case, and there are only 4 such steps that can be taken in your example.

    http://en.wikipedia.org/wiki/Information_retrieval

    cheers!

    Reply
    1. izyan

      Thank you for correcting that..if that is the case, does AP=0.69 means for every 3rd information retieved, at least 2 will be correct?

      Reply
      1. Makarand Tapaswi Post author

        Yes, I believe the rest of the intuition is correct. At least the AP values for the sequences are.

    2. Clayton Lemons

      > as you should be integrating (summing) over the “correct” images
      I think the value in the denominator is even more subtle than the number of “correct” images. According to the Wikipedia article, you should be integrating over the number of *relevant* images. In your example, the number of relevant images for the rose query isn’t specified.

      As an example, let’s say there are 10 images in the database, and 6 of them are roses. In that case, (1 + 2/4 + 3/5 + 4/6) / 6 = 0.461 would actually be correct. If there are 4 roses in the database, then 0.69 is correct.

      For StudentTt who asked:
      >What’s the meaning of this scenario:
      >I retrieve 10 images and only the first one is relevant, AP will be 1,
      >1/1,0,0,0,0,0,0,0,0,0,0 AP = 1
      It makes sense that AP = 1 if there is only one image in the database that is relevant to the query. “The only relevant image was retrieved, and it had the #1 rank, so you get the highest score!” However, if there are 10 relevant images in the database, then AP = 0.1.

      Reply
  2. StudenTt

    What’s the meaning of this scenario:
    I retrieve 10 images and only the first one is relevant, AP will be 1,
    1/1,0,0,0,0,0,0,0,0,0,0 AP = 1?

    Reply
    1. Makarand Tapaswi Post author

      Well it could be the case that you have only one correct image to retrieve in your data set. So, since you retrieve it first, that gives an AP of 1. In the ideal case while computing AP on data sets, you usually consider all images that should be retrieved. So if you had an image at first place, and then at 100th place, the AP score will crash. But, if you have only 1, then its perfect.

      Reply
  3. Pingback: Affinity Diffusion | The Technical Experience Page

      1. Makarand Tapaswi Post author

        I’m not sure I understand what you mean by nominator. But the difference is basically this: If there is only one correct answer in the database for every query you make, then MRR is the same as MAP. If there are multiple correct answers, like you are searching for a beach image and you have 5 beach images in your dataset, then you should not use MRR, because you need to consider the positions of all (5) retrieved results and this can be done only by AP.

  4. Luis

    I don’t understand why the precision at every point is this:
    1/1, 0, 0, 2/4, 3/5, 4/6

    I’d say it is the following:
    1/1, 1/2, 1/3, 2/4, 3/5, 4/6

    …since we have one positive at the second trial and still one positive at the third trial.

    If so, then the AP is 0.6, right?

    Reply
  5. Camilo

    Thanks for explaining the intuition behind MAP.
    I was wondering: Is there any way to use this metric with non binary labels? e.g. Predicting relevant items using ratings.

    I can only imagine defining a threshold but it seems to be arbitrary…

    Reply
    1. Makarand Tapaswi Post author

      Hi
      I don’t really know about such a metric, but yes it is certainly interesting. Is your question going towards recommendation systems? You might want to check those out. Also, you can try searching for the keyword “ordinal” ranking. This is a 1 – N ordered ranking system with no item having the same rank.

      Reply
      1. Camilo

        Yes, definitely related to recommendation systems. More specifically about a competition I’m in, but I asked the administrators and they are using a 4 out 5 stars threshold to label items as relevant as I suspected.
        Thanks for your answer.

  6. Devraj Mandal

    Hello thanks for the informative tutorial. However in the Ben Hammer Test Cases : I tried to do all by hand and the results are not matching. For example for test case (1:5, [1 1 1 1 1], 5, 0.2) the MAP = 0.2 and not = 1 ? Perhaps you could clarify this for me.

    Reply
    1. Makarand Tapaswi Post author

      Sorry, what are these test cases? Also, I don’t understand the data format, could you please clarify what do you mean by 1:5, [1 1 1 1 1], 5, 0.2? What do each of these stand for?

      Reply
  7. sam

    For example, an AP of 0.5 could have results like
    0, 1, 0, 1, 0, 1, …

    Isn’t the AP 1.5 here? It’s not 0.5 but it’s a multiplication of 0.5..
    Waiting for your reply.

    Reply
    1. Makarand Tapaswi Post author

      The AP has to always be in the range of 0-1. Based on how we compute it, see the example, it should be clear that we get half the retrieved results correct, resulting in 0.5 AP

      Reply

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s