A manager finishes the session, gives it a high rating, and goes back to the same habits in the next tense one-on-one. That is the measurement problem.
Attendance and completion rates matter for administration. They do not tell you whether a manager pauses before reacting, asks better follow-up questions, or corrects harm without becoming defensive. Sensitivity training earns its keep only when behavior changes under pressure, across email, chat, video calls, and in-person conversations.
The model I use tracks four things: reaction, learning, behavior, and business results. The first two tell you whether the training landed. The last two tell you whether anything transferred into daily management practice.

For teams building a disciplined evaluation system, this framework for training measurement is a useful companion to internal HR metrics.
Levels one and two: reaction and learning
Start with reaction, but keep it in proportion. Challenging training often scores lower than agreeable training, especially when managers are asked to examine blind spots, power dynamics, or faith-related assumptions they have never had to name before.
Use post-session surveys to test practical value, not general satisfaction:
- Relevance: Did the cases reflect actual management decisions?
- Confidence: Can participants identify a better first response when an employee raises impact?
- Clarity: Which concepts still feel hard to apply in real conversations?
- Context fit: Did the examples match remote, hybrid, or faith-based realities where relevant?
Then test learning directly. A short pre-check and post-check works well if it uses scenarios rather than definitions. Ask managers to choose the best response to a Slack message that reads as dismissive, rewrite a poorly handled accommodation discussion, or identify what a supervisor missed in a hybrid meeting where one employee was repeatedly talked over.
A weak learning measure usually looks polished and proves very little.
| Weak measure | Better measure |
| --- | --- |
| “Did you enjoy the training?” | “Which response best acknowledges impact and sets a next step?” |
| Definition recall | Scenario judgment |
| One final quiz | Pre and post comparison |
| Generic examples | Cases matched to your workplace context |
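The pre-check and post-check comparison can be sketched in a few lines of Python. This is a minimal illustration, not a prescribed scoring method: the function name `learning_gain`, the manager identifiers, and the scores (out of five scenario items) are all hypothetical.

```python
from statistics import mean

def learning_gain(pre_scores, post_scores):
    """Return per-manager change and the average change between checks.

    Both arguments map a manager identifier to the number of scenario
    items answered well. Only managers who completed both checks are
    compared, so a missing pre-check or post-check does not skew the gain.
    """
    shared = sorted(set(pre_scores) & set(post_scores))
    if not shared:
        raise ValueError("no manager completed both checks")
    per_manager = {m: post_scores[m] - pre_scores[m] for m in shared}
    return per_manager, mean(per_manager.values())

# Hypothetical data: scores out of 5 scenario items
pre = {"mgr_a": 2, "mgr_b": 3, "mgr_c": 4}
post = {"mgr_a": 4, "mgr_b": 3, "mgr_d": 5}

changes, average_change = learning_gain(pre, post)
print(changes)         # managers with both checks and their score change
print(average_change)  # average change across those managers
```

Comparing only managers who took both checks keeps the number honest. A higher post-check average can otherwise come from a different group of people showing up.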
Level three: behavior on the job
Behavior is the true test, and it is where many programs lose discipline. The familiar excuse is that behavior is harder to measure. That is true, but it is still measurable if you define observable actions before the training starts.
Track a small set of manager behaviors that your organization can see:
- Response quality: Does the manager acknowledge concern without arguing intent in the first reply?
- Meeting conduct: Do quieter team members get invited in, including remote participants on video or chat?
- Conflict handling: Does the manager address disrespect early, or wait until HR gets pulled in?
- Follow-through: After a concern is raised, does the manager document actions and close the loop?
- Context judgment: In faith-based settings, does the manager separate belief from behavior and apply standards consistently?
Use more than one source. Direct report pulse surveys help. Skip-level conversations help. Structured observation during simulations helps. Self-reflection can be useful too, but it should never stand alone.
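One way to enforce the multiple-source rule is to check the evidence before reading anything into it. The sketch below is an assumption-heavy illustration: the source names, the 1-5 rating scale, and the function `summarize_behavior_evidence` are hypothetical and would need to match whatever instruments your organization actually uses.

```python
def summarize_behavior_evidence(ratings_by_source):
    """Average each source's ratings and flag thin evidence.

    ratings_by_source maps a source name to a list of ratings on an
    assumed 1-5 scale. Sources with no ratings are ignored.
    """
    averages = {
        source: sum(values) / len(values)
        for source, values in ratings_by_source.items()
        if values
    }
    return {
        "averages": averages,
        # More than one source of evidence is required.
        "multiple_sources": len(averages) >= 2,
        # Self-reflection is useful, but it should never stand alone.
        "self_report_only": set(averages) == {"self_reflection"},
    }

# Hypothetical ratings for one manager on "response quality"
evidence = summarize_behavior_evidence({
    "pulse_survey": [3, 4, 4],
    "skip_level": [4],
    "simulation_observation": [],  # nothing observed yet, so ignored
    "self_reflection": [5],
})
print(evidence["averages"])
print(evidence["multiple_sources"], evidence["self_report_only"])
```

Keeping the per-source averages separate, rather than blending them into one score, makes it easier to see when a manager's self-rating runs well ahead of what direct reports describe.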
I also recommend measuring repair. Managers will still miss cues. What matters is whether they recover faster, own impact sooner, and make cleaner adjustments the next time.
Measure the quality of the manager's response after friction appears. That is a better indicator than asking whether friction disappeared.
Timing matters. Check soon after training, then check again after real work has tested the skill. A practical cadence is 30 days, 90 days, and 6 months. By then, you can usually tell whether the program changed habits or just improved vocabulary.
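The 30-day, 90-day, and 6-month cadence can be turned into calendar dates with a short Python sketch. The helper names and the training date are hypothetical, and treating "6 months" as six calendar months (rather than 180 days) is an assumption.

```python
import calendar
from datetime import date, timedelta

def add_months(start, months):
    """Return the date `months` calendar months after `start`.

    The day is clamped to the last day of the target month, so a
    training held on the 31st still gets a valid follow-up date.
    """
    month_index = start.month - 1 + months
    year = start.year + month_index // 12
    month = month_index % 12 + 1
    last_day = calendar.monthrange(year, month)[1]
    return date(year, month, min(start.day, last_day))

def review_dates(training_date):
    """Map each follow-up check to its due date."""
    return {
        "30-day": training_date + timedelta(days=30),
        "90-day": training_date + timedelta(days=90),
        "6-month": add_months(training_date, 6),
    }

# Hypothetical training date
schedule = review_dates(date(2024, 8, 31))
for label, due in schedule.items():
    print(label, due.isoformat())
```

Putting the dates on the calendar when the training is booked makes the later checks harder to skip once the session itself is over.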
Level four: business results
Business results should connect to the business case established earlier, but at this level internal evidence matters more than borrowed benchmarks. The goal is not to prove that training is good in theory. The goal is to show whether your managers are creating a safer, fairer, more workable team environment.
Use outcome measures that leadership already cares about, then tie them to manager behavior where possible:
- Engagement items tied to trust: Questions about fairness, voice, respect, and manager support
- Retention patterns: Especially in teams with prior conflict, high complaint volume, or lower inclusion scores
- Complaint trends: Severity, repeat themes, speed of resolution, and whether concerns are handled earlier at the manager level
- Participation patterns: Who speaks in meetings, who gets stretch work, and whether remote staff are included consistently
- Promotion and development access: Whether opportunities are distributed more evenly after managers complete training
Interpret these measures carefully. A rise in reported concerns is not always failure. Early on, it can mean employees trust the process enough to speak up. Lower complaint volume is only a good sign if resolution quality, team climate, and manager conduct improve with it.
That is why I prefer a simple review cycle over one large annual analysis:
- Immediate check: Reaction and learning
- 30-day review: Early use in live situations
- 90-day review: Direct report feedback and observed behavior
- 6-month review: Team patterns, complaint handling, retention signals, and inclusion indicators
This approach keeps the training honest. If people liked the session but supervisors still mishandle conflict, the workshop was interesting but the skills did not transfer. If manager behavior improves and team metrics stay flat, examine reinforcement, incentives, workload, and senior leader modeling before blaming the curriculum.