We start with a set of questions we can ask of the data in order to classify it. In a 2-D dataset, these could be as simple as "is x > a?" or "is y < b?", and so on. The obvious problem with doing this manually is how one would select the questions, and in which order to ask them (note that the solution is like an incomplete binary tree), so as to classify the fastest with high accuracy.
Multiple splitting criteria are available: the minimum combined entropy of the child nodes, the minimum global entropy, maximum likelihood, and so on, but the concept behind all of them is the same. At every step, we choose a question that splits the data into two child nodes, at least one of which contains data mainly from a single class (mathematically, the low-entropy case). The ideal case is when a node contains data of only one class, giving zero entropy.
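To make the first criterion concrete, here is a minimal Python sketch (the function names and the threshold questions are illustrative, not from any particular library) that scores each candidate question by the weighted combined entropy of the two child nodes it produces, and keeps the best one:

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    counts = Counter(labels)
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

def split_entropy(points, labels, question):
    """Weighted entropy of the two child nodes produced by a yes/no question."""
    yes = [l for p, l in zip(points, labels) if question(p)]
    no  = [l for p, l in zip(points, labels) if not question(p)]
    if not yes or not no:              # degenerate split: one child is empty
        return float("inf")
    n = len(labels)
    return (len(yes) / n) * entropy(yes) + (len(no) / n) * entropy(no)

def best_question(points, labels, questions):
    """Pick the question giving the lowest combined child entropy."""
    return min(questions, key=lambda q: split_entropy(points, labels, q))

# Example: 2-D points with two classes, and candidate threshold
# questions of the form "is p[i] > t?" on each axis.
points = [(1, 5), (2, 6), (7, 1), (8, 2)]
labels = ["A", "A", "B", "B"]
questions = [lambda p, i=i, t=t: p[i] > t for i in (0, 1) for t in range(10)]
q = best_question(points, labels, questions)   # finds a split with zero child entropy
```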
It can be seen that the global entropy (computed over all leaf nodes) decreases as we increase the number of splits; a split can never increase it, since the weighted entropy of the children is at most that of the parent. However, there has to be a limit, and the stopping criteria are defined quite intuitively: a fixed number of leaf nodes, a threshold on the global entropy, a threshold on the change in global entropy per split, or simply a fixed number of splits.
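The sketch below, again illustrative rather than any standard API, grows the tree greedily under two of these stopping criteria: a maximum leaf count (max_leaves) and a minimum drop in global entropy per split (min_gain); both parameter names are my own.

```python
import math
from collections import Counter

def entropy(labels):
    counts = Counter(labels)
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

def grow(points, labels, questions, max_leaves=8, min_gain=0.01):
    """Greedily split the leaf whose best question reduces global entropy
    the most; stop at max_leaves leaves, or when the best achievable drop
    in global entropy falls below min_gain."""
    leaves = [(points, labels)]
    n_total = len(labels)
    while len(leaves) < max_leaves:
        best = None                      # (gain, leaf index, question)
        for i, (pts, ls) in enumerate(leaves):
            for q in questions:
                yes = [l for p, l in zip(pts, ls) if q(p)]
                no  = [l for p, l in zip(pts, ls) if not q(p)]
                if not yes or not no:
                    continue
                # drop in global (leaf-weighted) entropy from this split
                before = (len(ls) / n_total) * entropy(ls)
                after = (len(yes) / n_total) * entropy(yes) \
                      + (len(no)  / n_total) * entropy(no)
                gain = before - after
                if best is None or gain > best[0]:
                    best = (gain, i, q)
        if best is None or best[0] < min_gain:   # threshold change in global entropy
            break
        _, i, q = best
        pts, ls = leaves.pop(i)
        yes = [(p, l) for p, l in zip(pts, ls) if q(p)]
        no  = [(p, l) for p, l in zip(pts, ls) if not q(p)]
        leaves.append(([p for p, _ in yes], [l for _, l in yes]))
        leaves.append(([p for p, _ in no],  [l for _, l in no]))
    return leaves

# Usage with axis-threshold questions on 2-D points, as in the text:
pts = [(1, 5), (2, 6), (7, 1), (8, 2), (3, 7)]
lbs = ["A", "A", "B", "B", "A"]
qs = [lambda p, i=i, t=t: p[i] > t for i in (0, 1) for t in range(10)]
print(len(grow(pts, lbs, qs, max_leaves=4)))
```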
This technique is often used in speech recognition systems, which suffer from insufficient training data for triphones. We therefore cluster the triphones based on their linguistic properties, such as voiced/unvoiced/plosive/fricative (these are our questions in this case), so that a shared model can be used for all the triphones (around 6000 are needed to cover them all) that land in the same leaf node after a CART (Classification And Regression Tree) analysis. This has been found to reduce the relative error by around 15%.
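As a toy illustration of what such linguistic questions look like (the phone sets and triphones below are hypothetical and far from complete; a real system would grow the tree greedily, as in the earlier sketch, so that only the questions chosen along a path to a leaf are asked):

```python
from collections import defaultdict

# Hypothetical, incomplete phone classes used to define the questions.
VOICED    = {"b", "d", "g", "v", "z", "a", "e", "i", "o", "u", "m", "n"}
PLOSIVE   = {"p", "t", "k", "b", "d", "g"}
FRICATIVE = {"f", "v", "s", "z", "h"}

# Triphones as (left context, centre phone, right context).
triphones = [("a", "t", "a"), ("i", "t", "e"), ("s", "p", "a"), ("a", "k", "u")]

questions = [
    lambda t: t[0] in VOICED,     # "Is the left context voiced?"
    lambda t: t[2] in PLOSIVE,    # "Is the right context a plosive?"
    lambda t: t[1] in FRICATIVE,  # "Is the centre phone a fricative?"
]

# Simplification: group triphones by their full answer pattern. In a real
# CART-grown tree some of these patterns would be merged into one leaf,
# and every triphone in a leaf shares a single acoustic model.
clusters = defaultdict(list)
for t in triphones:
    clusters[tuple(q(t) for q in questions)].append(t)
```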