The method is quite simple. Small segments of the time-domain signal are analysed and their spectra computed. A peak-detection algorithm is then run on each spectrum to obtain the amplitudes and frequencies (bins) of the strong harmonics. The important parameters here are the FFT size and the window length, because the frequency resolution and time resolution depend on them, as described in this earlier post on Formants and Harmonics.
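The peak-picking step above can be sketched roughly as follows. This is a minimal illustration, not the exact implementation; the function name, the Hann window, and the parameter defaults (`n_fft=2048`, ten peaks) are my own assumptions for the example:

```python
import numpy as np
from scipy.signal import find_peaks

def spectral_peaks(frame, fs, n_fft=2048, n_peaks=10):
    """Return (frequencies, amplitudes) of the strongest spectral peaks
    in one short frame. Window and FFT sizes are illustrative choices."""
    windowed = frame * np.hanning(len(frame))          # taper to reduce leakage
    spectrum = np.abs(np.fft.rfft(windowed, n_fft))    # magnitude spectrum
    idx, _ = find_peaks(spectrum)                      # local maxima
    idx = idx[np.argsort(spectrum[idx])[::-1][:n_peaks]]  # keep the strongest
    freqs = idx * fs / n_fft                           # bin index -> Hz
    return freqs, spectrum[idx]
```

Note that the frequency resolution of the returned peaks is `fs / n_fft`, which is exactly the FFT-size/window-length trade-off mentioned above.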
Now, the range over which the pitch of a music or speech signal varies is generally known in advance. Hence, a range of candidate pitch values is chosen and the harmonics of each candidate are computed. These predicted harmonics are then matched against the measured partials of the signal: the closer the match, the better the pitch estimate. This matching technique ensures that if a few peaks are missing from the signal, the mismatch is penalised accordingly, and any extra (spurious) peaks are penalised as well.
The matching is performed in both directions: the predicted harmonics are matched against the measured partials, and the measured partials against the predicted harmonics. A weighted combination of these two errors gives the final pitch-error measure. The global minimum of this error over the chosen range of candidate pitch values corresponds to the pitch frequency of that segment.
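The two-way error and its global minimum can be sketched as below. This is a deliberately simplified version of the idea: the published two-way mismatch algorithm also weights each term by partial amplitude and normalises by harmonic frequency, and the equal weighting `rho=0.5`, the harmonic count, and the frequency cut-off are assumptions for illustration:

```python
import numpy as np

def two_way_mismatch(measured_freqs, f0_candidates, n_harmonics=8, rho=0.5):
    """Simplified two-way mismatch error over a set of pitch candidates."""
    measured = np.asarray(measured_freqs, dtype=float)
    fmax = measured.max() * 1.05   # only predict harmonics inside the observed band
    errors = []
    for f0 in f0_candidates:
        predicted = f0 * np.arange(1, n_harmonics + 1)
        predicted = predicted[predicted <= fmax]
        # predicted -> measured: each predicted harmonic to its nearest partial
        e_pm = np.mean([np.min(np.abs(measured - p)) for p in predicted])
        # measured -> predicted: each partial to its nearest predicted harmonic
        e_mp = np.mean([np.min(np.abs(predicted - m)) for m in measured])
        errors.append(rho * e_pm + (1 - rho) * e_mp)
    return np.array(errors)

# The pitch estimate is the candidate at the global minimum of the error:
# f0_est = f0_candidates[np.argmin(errors)]
```

The two directions penalise different failures: predicted-to-measured punishes harmonics the signal is missing, while measured-to-predicted punishes extra peaks that no harmonic explains.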
A smoothing step (median filtering) may be applied afterwards, particularly for speech signals, which generally have a near-constant pitch over short segments of time.
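A minimal sketch of that smoothing step, assuming a hypothetical frame-wise pitch track with one spurious octave jump:

```python
import numpy as np
from scipy.signal import medfilt

# Hypothetical per-frame pitch estimates (Hz); the 402 Hz value is an
# isolated octave error of the kind median filtering removes
pitch_track = np.array([200.0, 201.0, 402.0, 199.0, 200.0, 198.0, 200.0])
smoothed = medfilt(pitch_track, kernel_size=3)
```

A median filter is preferred over a moving average here because it discards the outlier entirely instead of smearing it into the neighbouring frames.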