How do I run label audits and sampling reviews when training data labels are inconsistent?

I've recently been working on an intent classification model for customer service. Offline accuracy looks decent, but manual spot checks before deployment showed that the same kind of refund request was labeled 'refund' by some annotators and 'after_sales' by others. The problem isn't that the model can't learn; it's that the labeling standard for the training set is inconsistent. My approach was to first export the…
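
For anyone who wants to see the 'refund' vs 'after_sales' kind of split in their own data, here is a minimal sketch of one possible check. It is not necessarily the approach I started describing above. It assumes the training set can be read as (text, label) pairs; the `normalize` and `find_label_conflicts` helpers, the sample rows, and the 'shipping' label are all made up for illustration. It only catches conflicts between texts that are identical after lowercasing and whitespace cleanup, so paraphrases would need something fuzzier.

```python
from collections import Counter, defaultdict
from itertools import combinations


def normalize(text):
    # Lowercase and collapse whitespace so trivially different copies group together.
    return " ".join(text.lower().split())


def find_label_conflicts(rows):
    """Return (conflicts, pair_counts) for an iterable of (text, label) pairs.

    conflicts maps each normalized text that has more than one distinct label
    to a Counter of those labels. pair_counts counts how many texts each pair
    of labels collides on.
    """
    labels_by_text = defaultdict(Counter)
    for text, label in rows:
        labels_by_text[normalize(text)][label] += 1

    # Keep only texts that received at least two different labels.
    conflicts = {t: c for t, c in labels_by_text.items() if len(c) > 1}

    pair_counts = Counter()
    for counts in conflicts.values():
        # Sort labels so (a, b) and (b, a) are counted as the same pair.
        for pair in combinations(sorted(counts), 2):
            pair_counts[pair] += 1
    return conflicts, pair_counts


# Hypothetical rows, invented for this example only.
rows = [
    ("I want my money back", "refund"),
    ("i want my  money back", "after_sales"),
    ("Where is my package?", "shipping"),
    ("Where is my package?", "shipping"),
]

conflicts, pair_counts = find_label_conflicts(rows)
print(pair_counts.most_common())
```

The label pairs that collide most often are probably the ones where the labeling guideline is ambiguous, so they seem like a reasonable place to focus a manual sampling review first.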