What should I check first when AI labeling results fluctuate?

A while ago, I worked on an AI labeling project for intent classification of customer service tickets. The biggest headache wasn't the workload. It was that the label distribution for the same batch of tickets differed significantly between the morning and afternoon runs, which led the operations team to suspect the model was unstable. If you just guess based on experience, it's easy to blame…