Dr. Semmelweis
Semmelweis Society International
For students, physicians and patients to defend against and avoid the harm of biased peer review while pressuring
Congress to amend the laws that allow good physicians to become the victims of career assassination.

PRACTICE PROTECTION

Headlines

September 25, 2004
Example of how well unbiased peer review works in other professions
full story...

September 25, 2004
Peer Review's intended use is to increase patient safety
full story...

September 25, 2004
Health Policy Institute Established at University of the Sciences in Philadelphia
full story...

September 25, 2004
S.C. medical board alters policy on publicizing sanctions against physicians
full story...

September 25, 2004
Surgeons to protest insurance rates with slowdown
full story...

September 25, 2004
Doctors Against Tort Reform Doesn't Add Up--or Does It?
full story...

September 24, 2004
Example of the proper use of peer review
full story...

September 24, 2004
Poor Medical Treatment Kills Thousands in U.S., Says New Report on Health Care Quality
full story...
Response at Galen's log

September 22, 2004
Testing: For doctors, it never ends. More physicians are finding that board recertification has evolved into a continuous certification process.
full story...

September 21, 2004
Whistle-Blower Files Suit
full story...

September 21, 2004
Michael Porter's Prescription
For the High Cost of Health Care
full story...

September 21, 2004
Dallas: Insurer lowers rates ; Some leaders say move is sign that malpractice caps are working
full story...

September 20, 2004
Pills for the drug industry: cites the need for unbiased peer review in all aspects of health care
full story...

September 20, 2004
Poliner's patients speak up for him
full story...

September 20, 2004
Politics keeps real remedies for medical errors off radar
full story...

September 18, 2004
Monsour inspection turns up deficiencies
full story...

September 17, 2004
AMA: "Disruptive Physicians"
full story...

September 17, 2004
Obstetrician wins key ruling against hospital, Monterey CA
full story...

September 16, 2004
Poor oversight, care faulted in health costs
full story...

September 16, 2004
A Reeling King/Drew Receives Huge Blow
full story...

September 16, 2004
There's a game under way in the health care industry, a national expert believes and he doesn't like it.
full story...

September 16, 2004
Hospital whistle-blowers confess,
Albany health system has sued over faxes that doctor, accountant term a 'public service'
full story...

September 16, 2004
Governator vetoes bills which would've allowed conflicts of interest in peer review hearings
full story...

September 15, 2004
Docs Will Be in Short Supply in US, Analysts Say
full story...

September 14, 2004
Yale-New Haven Sued In Class Action, Hospital Accused Of Unfair Treatment For Uninsured Patients
full story...

September 14, 2004
Survey of patient care at 200 CA hospitals released
full story...

September 14, 2004
Dr. Scanlan responds to Wichita Eagle Editorial Re: HR 663 & S 720
full story...

September 12, 2004
Florida: Physicians and Lawyers square off in the ballot box this fall
full story...

September 6, 2004
AMA's position: California deal reaffirms medical staff autonomy
full story...

September 6, 2004
AMA's position: Congress must finish work on patient safety
full story...

September 6, 2004
Hospitals to divulge treatment facts
full story...

September 2, 2004
Class-Action Status Is Upheld for Doctors Suing Insurers
full story...

August 28, 2004
Dr. Lawrence Poliner awarded $366 million in damages after being denied work at Presbyterian Hospital
full story...

August 26, 2004
Seven Indian doctors plan to form new cardiology practice
full story...

August 25, 2004
E.R. to reject orthopedic cases, Lancaster, LA area
full story...

August 25, 2004
Shortage in OB dept., Chillicothe, MO
full story...

HUGE NEWS OUT OF VENTURA!!

August 18, 2004
Ventura hospital, staff reach terms; deal likely ends CMH legal fight
full story...

August 16, 2004
Report ups medical error death toll
full story...

August 13, 2004
New Article: Fighting a Sham Peer Review
full story...

August 12, 2004
Gary, Ind: State says doctor unfit to practice
full story...

August 11, 2004
NYTimes: Health Plan That Cuts Costs Raises Doctors' Ire
full story...

August 4, 2004
AMA, CMA File Brief Supporting Ventura Medical Staff
full story...

August 2, 2004
Senate passed S.720
full story...

DR. ERIC N GROSCH - A new approach to LOR

Dear Dr. Morgenstern:

I read your article[1] with interest. You wrote:

In considering a new approach to the pediatric LOR, COMSEP and the APPD welcome the input of the broader pediatric community.

Thanks for expressing an interest in receiving my comments, since I'm an internist, not a pediatrician. I'm pleased to contribute to the dialogue, which I think is important. I apologize in advance for the length of my text, but I think the topic warrants it. Quoted text is indented and my own unindented.

The LOR and its close relative, the performance-appraisal, are sacred cows in medical education, training and job-placement. They purport to provide means of communicating a candidate's traits among mentors -- the performance-appraisal among mentors within an institution; the LOR from mentors in one institution to those in another. The approach for each is much the same.

That purpose seems analogous to that of the medical record in patient-care, which provides a means of communicating a patient's disease-traits among the patient's physicians. The analogy is false, for the reasons I cite in the chart below:

Clinical chart | LOR/performance-appraisal

Appraisal and appropriate action at least daily | Appraisal summative, at end of rotation, usually presented after the fact, too late for improvement

Goal is improvement of the patient | Goal varies from promotion of the candidate to his elimination from consideration

Relies on objective evidence for decision-making | Often relies on rumor, innuendo, scuttlebutt for decision-making

Documentation is as long as is necessary | Documentation is as brief as possible, to save reading time

Documentation is in terms of specific clinical events | Documentation is in terms of unsubstantiated opinions, couched in generalities, of mentors, peers, etc.

On the evidence that I've examined, I believe that there are better ways than the LOR and the performance-appraisal to accomplish the mission. I don't consider that a flippant belief. I've arrived at my opposition to the LOR/performance-appraisal through considerable thought, reading and anecdotal experience, both as an author and as a subject of LORs. Most of what I say here is in the public domain and obvious. I get the impression that nobody ever puts it together, so I've done that, though perhaps incompletely. If you have any objections to what I've said here, please let me know them.

I divide my reasons into generic sections:

1. Golden Rule: Do unto others as you would have others do unto you. The Golden Rule[2], alone, should persuade anyone with any insight into the treatment that he would ideally prefer for himself, that the performance-appraisal/LOR can never work.

2. Comparisons are always odious.

3. Deleterious effect: The idea of performance-appraisal is fundamentally flawed, even dysfunctional, because of the often deleterious effect it has on those trainees that appraisers rate as less than the very best, even though quality of performance is a lottery, governed in large part by random chance. Accordingly, rating people who belong to the system makes no sense.

4. Improper substitute for “where do I stand”: LORs and performance-appraisals serve the organization or institution, not the individual appraised.

5. Inaccuracy:

a. misapplication of the Likert-scale principle

b. inevitability of rating-inflation

c. popularity-contest

d. mismeasure of “excellence”

e. mentor-inattention: Mentors, who are supposed to do the evaluations and LORs, don't pay enough attention to their trainees' performance to fulfill that function adequately because their contact with trainees is minimal and sporadic, so their appraisal of the performance of their trainees is most often inaccurate and may even reverse the reality on the ground.

f. self-fulfilling prophecy

g. absence of evidence-basis

h. glittering generalities: Even if mentors paid attention to trainees' performance, the rating systems that they use address only glittering generalities, such as “general medical knowledge,” require presentation of no supporting evidence and rarely to never address the only index of work-performance in medicine, namely, clinical outcomes of patients under the trainees' care.

i. distortion from “confidentiality,” under perpetual tension

6. The ultimate goal: communion of “top” talent in “top” institutions

7. Illustrative anecdote which is more typical than it should be

1. Golden Rule[2]

The performance-appraisal and the LOR are exercises in disregarding the needs of others and in attributing to others the character and nature of objects. The psychic mechanisms that prompt those in authority to impose performance-appraisal/LOR on others -- what they would not want for themselves -- are obscure but the most likely reason seems to be that the very act of imposition may, in and of itself, provide a pleasurable and ego-boosting exercise of arbitrary authority.

Whatever the psychic mechanisms, the proof of the observation appears in the contrast between the AMA's consistent endorsement of peer-review for physicians it presumes, generically, to be “bad doctors” and its disparagement of Professional Standards Review Organizations (PSROs).

For example, it is a matter of record that the AMA was a strong supporter of enactment of the Health Care Quality Improvement Act of 1986 (HCQIA), which codified peer-review provisions from hospital bylaws into federal law. The AMA initially supported the HCQIA because it supports peer-review of “bad doctors” but opposed one feature of that act, the National Practitioner Data Bank (NPDB), and withdrew its support altogether from the HCQIA over that issue, but it eventually obtained a quid pro quo: the NPDB as well as absolute immunity from liability for hospital-level peer-reviewers, a provision that has led to the proliferation of bad-faith peer-review.[3]

At the same time, JAMA and other journals have published articles that have impugned the accuracy of the findings of PSROs, the agents of which may conduct peer-review on any practicing physician, including one whom the AMA would presume to be a “good doctor.” Neither JAMA nor any other medical journal has published even one article that has examined the accuracy or validity of hospital-based peer-review, which the AMA enthusiastically approves because statutes render such peer-review privileged/confidential.

2. Comparisons are always odious

Farrell[4] noted the emotional effect of sex-reversed beauty-contests among men. The winner was, of course, ecstatic, enthusiastic and high in self-esteem but the runners-up felt devastated at the relative rejection and they experienced an epiphany: why women (at least those of runner-up grade [or less] physical appearance) dislike customary beauty-contests. Comparisons are always odious because the comparison-game is a zero-sum proposition. Performance-appraisal is an appraisal of something that the evaluee can control to some extent by his conscious will, as opposed to appearance, which he can't, so it's marginally less pernicious than a beauty-contest but not much. The fact remains that the better one individual rates, the worse others do. It's inescapable.

The performance-appraisal/LOR in most fields, including medical education, always hinges on comparing one trainee, by various criteria, with others. The comparison may appear as a class-rank or as a comparison of the subject's rating with an ideal rating, e.g., 6 of a possible 10 points, 4 of a possible 5 points, etc. The message is always: “You don't measure up.”

That message is especially demoralizing to the usual medical trainee, since the very fact that he's survived to the stage of medical training means that he has already survived very stringent selection/exclusion filters and thus become accustomed to superlative accolades in his early education, through schooling and his undergraduate years. Once he is thrown into the midst of similar high-achievers, the normative performance is likely to be uniformly high and he may rate merely “average.”

3. Deleterious effect:

Deming lists the performance-appraisal (and, by implication, also the customary LOR) among the deadly diseases of business-organizations. The same ideas apply, in spades, to medical organizations and to LORs, which are retrospective, summative performance-appraisals, frozen and immutable, in perpetuity:

...the deadly diseases...

3. Evaluation of performance, merit rating...Many companies in America have systems by which everyone...receives from his superior...a rating...(101) Management by objective leads to the same evil...Management by fear would be a better name...(Deming 1986, 107)

Fair rating is impossible. A common fallacy is the supposition that it is possible to rate people; to put them in rank order of performance for next year, based on performance last year.

The performance of anybody is the result of a combination of many forces -- the person himself, the people that he works with, the job, the material that he works on, his equipment, his customer, his management, his supervision, environmental conditions (noise, confusion, poor food in the company's cafeteria). (109) These forces...[which]...arise almost entirely from action of the system...will produce...large differences between people....A man not promoted is unable to understand why his performance is lower than someone else's. No wonder; his rating was the result of a lottery. Unfortunately, he takes his rating seriously...(Deming 1986, 110)

The effect is devastating:

It nourishes short-term performance, annihilates long-term planning, builds fear, demolishes teamwork, nourishes rivalry and politics.

It leaves people bitter, crushed, bruised, battered, desolate, despondent, dejected, feeling inferior, some even depressed, unfit for work for weeks after receipt of rating, unable to comprehend why they were inferior. It is unfair, as it ascribes to the people in a group differences that may be caused totally by the system that they work in.

...what is wrong is that the performance appraisal or merit rating focuses on the end product, at the end of the stream, not on leadership to help people. This is a way to avoid the problems of people. A manager becomes, in effect, manager of defects.

The idea of merit rating is alluring. The sound of the words captivates the imagination: pay for what you get; get what you pay for; motivate people to do their best, for their own good.

The effect is exactly the opposite of what the words promise. Everyone propels himself forward, or tries to, for his own good, on his own life preserver. The organization is the loser.

Merit rating rewards people that do well in the system. It does not reward attempts to improve the system. Don't rock the boat.

...a merit rating is meaningless as a predictor of performance, except for someone that falls outside the limits of differences (102) attributable to the system that the people work in...

Traditional appraisal systems increase the variability of performance of people. The trouble lies in the implied preciseness of rating schemes...Somebody is rated below average, takes a look at people that are rated above average; naturally wonders why the difference exists. He tries to emulate people above average. The result is impairment of performance. (103)

...The problem lies in the difficulty to define a meaningful measure of performance. The only verifiable measure is a short-term count of some kind...(Deming 1986, 103)

Degeneration to counting. One of the main effects of evaluation of performance is nourishment of short-term thinking and short-time performance...(103) A man must have something to show. His superior is forced into numerics. It is easy to count. Counts relieve management of the necessity to contrive a measure with meaning.

...people that are measured by counting are deprived of pride of workmanship. Number of designs that an engineer turns out in a period of time would be an example of an index that provides no chance for pride of workmanship. He dare not take time to study and amend the design just completed. To do so would decrease his output. (105)

A good rating for work on new product and new service that may generate new business five or eight years hence, and provide better material living, requires enlightened management. He that engages in such work would study changes in education, changes in style of living, migration in and out of urban areas. He would attend meetings of the American Sociological Society, the Business Section of the American Statistical Association, the American Marketing Association. He would write professional papers to deliver at such meetings, all of which are necessary for the planning of product and service of the future. He would not for years have anything to show for his labors. Meanwhile, in the absence of enlightened management, other people getting good ratings on short-run projects would leave him behind. (Deming 1986, 106)

Stifling teamwork. Evaluation of performance explains...why it is difficult for staff areas to work together for the good of the company. They work instead as prima donnas, to the defeat of the company. Good performance on a team helps the company but leads to less tangible results to count for the individual. The problem on a team is: who did what?

How could the people in the purchasing department, under the present system of evaluation, take an interest in improvement of quality of materials for production, service, tools, and other materials for nonproductive purposes? This would require cooperation with manufacturing. It would impede productivity in the purchasing department, which is often measured by the number of contracts negotiated per man-year, without regard to performance of materials or services purchased. If there be an accomplishment to boast about, the people in manufacturing might get the credit, not the people in purchasing. Or, it could be the other way around. Thus...teamwork so highly desirable, can not thrive under the annual rating. Fear grips everyone. Be careful; don't take a risk; go along.

Heard in a seminar. One gets a good rating for fighting a fire. The result is visible: can be quantified. If you do it right the first time, you are invisible. You satisfied the requirements. That is your job. Mess it up, and correct it later, you become a hero.

Two chemists work together on a project, and write up their work as a scientific paper. The paper is accepted for a meeting in Hamburg...only one of the pair may go to Hamburg to deliver the paper -- viz., the one with the higher rating. The one with the lower rating vows never again to work close with anyone else.

Result: every man for himself.

Evaluation of performance nourishes fear. People are afraid to ask questions that might indicate any possible doubt about the boss's ideas and decisions, or about his logic. The game becomes one of politics. Keep on the good side of the boss. Anyone that presents another point of view or asks questions runs the risk of being called disloyal, not a team player, trying to push himself ahead. Be a yes man.

Top levels of salaries and bonuses are in many American companies sky-high. It is human nature for a young man to aspire...to...one of these positions. The only chance to reach a high level is by consistent, unfailing promotion, year after year. The aspiring man's quest is not how to serve the company with whatever knowledge he has, but how to get a good rating. Miss one raise, you won't make it: Someone else will. (108)

A man dare not take a risk. Don't change a procedure. Change might not work well. What would happen to him that changed it? He must guard his own security. It is safer to stay in line.

The manager, under the review system, like the people that he manages, works as an individual for his own advancement, not for the company. He must make a good showing for himself.

Another Irving Langmuir? Can American industry, under handicap of the annual rating, produce another Irving Langmuir, a Nobel Prize winner, or another W. D. Coolidge? Both these men were with the General Electric Company. Could the Siemens company produce another Ernst Werner von Siemens?

...It is worthy of note that the 80 American Nobel prize winners all had tenure, security. They were answerable only to themselves. (Deming 1986, 109)

“It can't be all bad.”...top management delay[s abolition]...of the annual rating of performance...by refuge in the...corollary that “It can't be all bad. It put me into this position.”...He reached this position by coming out on top in every annual rating, at the ruination of the lives of a score of other men. There is a better way.

Modern Principles of Leadership...will replace the annual performance review. The first step...will be to provide education in leadership. The annual performance (116) review may then be abolished. Leadership will take its place...

The annual performance review sneaked in and became popular because it does not require anyone to face the problems of people. It is easier to rate them; focus on the outcome...Western industry needs...methods that will improve the outcome. Suggestions follow.

1. Institute education in leadership; obligations, principles, and methods.

2. More careful selection of the people in the first place.[5]

It seems difficult to imagine selecting medical trainees by methods any more careful than current ones.

3. Better training and education after selection.

4. A leader, instead of being a judge, will be a colleague, counseling and leading his people on a day-to-day basis, learning from them and with them. Everybody must be on a team to work for improvement of quality in the four steps of the Shewhart cycle:...

In the absence of numerical data, a leader must make subjective judgment. A leader will spend hours with every one of his people. They will know what kind of help they need. There will sometimes be incontrovertible evidence of excellent performance, such as patents, publication of papers, invitations to give lectures.

People that are on the poor side of the system will require individual help...(Deming 1986, 117)

a. What could be the most important accomplishments of this team? What changes might be desirable? What data are available? Are new observations needed? If yes, plan a change or test. Decide how to use the observations.

b. Carry out the change or test decided upon, preferably on a small scale.

c. Observe the effects of the change or test.

d. Study the results. What did we learn? What can we predict?...(Deming 1986, 88)

5. A leader will discover who if any of his people is (a) outside the system on the good side, (b) outside on the poor side, (c) belonging to the system. The calculations required...are...simple if numbers are used for measures of performance. Ranking of people...that belong to the system violates scientific logic and is ruinous as a policy,...

In the absence of numerical data, a leader must make subjective judgment. A leader will spend hours with every one of his people. They will know what kind of help they need...

People...on the poor side of the system will require individual help....(Deming 1986, 117)...

7. Hold a long interview...three or four hours, at least...not for criticism, but for help and...everybody[’s]...better understanding...

8. Figures on performance should be used not to rank the people...that fall within the system, but to assist the leader to accomplish improvement of the system...(118)

...Running a company on visible figures alone (counting the money). One can not be successful on visible figures alone...he that would run his company on visible figures alone will in time have neither company nor figures.

...the most important figures...are unknown and unknowable..., but successful management must nevertheless take account of them. Examples.

Fallacies of reward for winning in a lottery. A man in the personnel department of a large company came forth with an idea, held as brilliant...to reward the top (274) man of the month on a certain production line (the man that made the lowest proportion defective over the month) with a citation. There would be a small party on the job in his honor, and he would get half a day off. This might be a great idea if he were indeed an unusual performer for the month. There were 50 men on the production line.

Do the results of inspection of their work form a statistical system...? If the work of the group forms a statistical system, then the prize would be merely a lottery...if the top man is a special cause on the side of low proportion defective, then he is indeed outstanding. He would deserve recognition, and he could be a focal point for teaching men how to do the job.

There is no harm in a lottery...provided it is called a lottery. To call it an award of merit when the selection is merely a lottery...is to demoralize the whole force, prize winners included. Everybody will suppose that there are good reasons for the selection and will be trying to explain and reduce differences between men. This would be a futile exercise when the only differences are random deviations, as is the case when the performance of the 50 men form[s] a statistical system. (Deming 1986, 275) [5]
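The distinction Deming draws between a worker who belongs to the system and one who is a special cause can be sketched numerically. The following is only an illustration, not Deming's own calculation: it assumes the conventional three-sigma limits for a proportion-defective chart, and the defect counts, the 1000 items inspected per man and the function names are all hypothetical.

```python
import math

def control_limits(defects, inspected):
    """Return (p_bar, lower, upper): 3-sigma limits for proportion defective.

    Assumes every worker had the same number of items inspected.
    """
    p_bar = sum(defects) / (inspected * len(defects))
    sigma = math.sqrt(p_bar * (1 - p_bar) / inspected)
    return p_bar, max(0.0, p_bar - 3 * sigma), min(1.0, p_bar + 3 * sigma)

def classify(defects, inspected):
    """Label each worker relative to limits computed from the whole group."""
    _, lower, upper = control_limits(defects, inspected)
    labels = []
    for count in defects:
        p = count / inspected
        if p < lower:
            labels.append("outside, good side")
        elif p > upper:
            labels.append("outside, poor side")
        else:
            labels.append("within system")
    return labels

# Hypothetical month: 50 men, 1000 items inspected each, 38 to 62 defects apiece.
INSPECTED = 1000
ordinary_month = [38 + (i * 7) % 25 for i in range(50)]
ordinary_labels = classify(ordinary_month, INSPECTED)

# Same hypothetical month, except that worker 0 turns in only 15 defects.
special_month = [15] + ordinary_month[1:]
special_labels = classify(special_month, INSPECTED)

best = ordinary_month.index(min(ordinary_month))
print("ordinary month, top man:", ordinary_labels[best])
print("special month, worker 0:", special_labels[0])
```

With these made-up figures, the "top man" of the ordinary month falls within the limits, so a prize for him would be a lottery in Deming's sense. Only the worker with 15 defects in the second scenario falls outside the limits on the good side.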

In a similar vein, Ierodiakonou and Vandenbroucke term medicine a stochastic art:

Ancient Greek philosophers thought that medicine was an art with peculiar characteristics, and they called medicine a stochastic art. A doctor might treat a patient conscientiously according to all learned precepts; yet the patients' condition might deteriorate. Another patient might be treated rather carelessly by another doctor; yet the patient might regain full health. Thus, in medicine there exists unpredictability between means and ends. By contrast with other arts a diligent execution of the tasks does not guarantee a good outcome, and vice versa...

...we have long witnessed a debate on the right way to measure the quality of medical care: should we use outcome or process criteria?...For instance, a few years ago, (542) a series of outcome investigations was started in the USA. Presumably, some administrators had been convinced that even in health care, quality of performance should be measured according to strict outcome criteria, as is practised in the Japanese car industry, for example. The simplest outcome measure was mortality in hospital. Third party payers such as the Health Care Financing Administration, which administrates Medicare, started to rank hospitals according to mortality rates for specific procedures. The mere idea sent ripples of alarm through the American Medical Association (AMA). Do not we intuitively know that medical centres with the highest reputation attract patients whose illnesses are close to being beyond rescue? Advanced epidemiological techniques have proved that differences in hospital mortality can be explained away by adjustment for differences in patient mix. To use outcome as a means to monitor quality would necessitate continuous evaluation of all individual patient characteristics. This process would be a gigantic research effort, close to the examination of treatments in randomised controlled trials, and would defy all realistic efforts at quality assurance. The whole armamentarium of epidemiology and statistics, such as randomisation, matching, blinding, placebo-procedures, strict selection criteria, and modelling, aims at mastering the stochastic elements that confound our judgment...[6]

Why should the performance of trainees be deterministic, not stochastic, inasmuch as the stochastic vagaries of patient-characteristics (case-mix) must influence it, in any individual instance? The difference is that hospital-administrators have an influential voice in such decision-making. The individual trainee does not.
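The remark about patient mix in the quoted passage can be made concrete with a small sketch. It uses indirect standardization (observed deaths divided by the deaths expected from pooled, risk-group-specific rates), which is one common form of case-mix adjustment and not necessarily the technique the quoted authors had in mind. The two hospitals, their patient counts and the function names are hypothetical.

```python
# Hypothetical data: risk group -> (patients, deaths) for each hospital.
HOSPITALS = {
    "referral center": {"low risk": (200, 4), "high risk": (800, 160)},
    "community hospital": {"low risk": (800, 16), "high risk": (200, 40)},
}

def crude_rate(strata):
    """Overall deaths per patient, ignoring patient mix."""
    patients = sum(n for n, _ in strata.values())
    deaths = sum(d for _, d in strata.values())
    return deaths / patients

def reference_rates(hospitals):
    """Pooled death rate for each risk group across all hospitals."""
    totals = {}
    for strata in hospitals.values():
        for group, (n, d) in strata.items():
            pooled_n, pooled_d = totals.get(group, (0, 0))
            totals[group] = (pooled_n + n, pooled_d + d)
    return {group: d / n for group, (n, d) in totals.items()}

def observed_to_expected(strata, rates):
    """Observed deaths divided by deaths expected from the hospital's own mix."""
    observed = sum(d for _, d in strata.values())
    expected = sum(n * rates[group] for group, (n, _) in strata.items())
    return observed / expected

rates = reference_rates(HOSPITALS)
for name, strata in HOSPITALS.items():
    print(name, round(crude_rate(strata), 3),
          round(observed_to_expected(strata, rates), 3))
```

In this made-up example the referral center's crude mortality is nearly three times the community hospital's, yet both have an observed-to-expected ratio of 1, because within each risk group their death rates are identical. A trainee assigned sicker patients would look worse on crude outcomes in the same way.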

Under pervasive fear of rating, the trainee dares not ask a question that he thinks that the rater might think foolish. He has to confine his questions only to what he considers “intelligent,” posed with the purpose in mind of impressing the rater with his insight and wisdom, beyond his years. The good regard of the rater is the trainee's life-preserver, on a stormy sea of insecurity. The trainee walks on eggshells, fearing that his every move, his every word is an element in a cumulative body of chit-marks that may eventually torpedo his reputation. If he should get a black mark against him, for any reason, and the rater learns of it, the trainee shall thereby have lost the rater's support and enter career free-fall. Accordingly, he pre-edits all questions for acceptability before letting them out of his mouth. If he can't think of a zinger of a question, he'll most likely stay mum and live in ignorance about a broad variety of subjects, out of fear of asking a “stupid question.”

MIT, Caltech and other high-prestige institutions have experimented with a pass-fail grading system because they understood the inherent absurdity of grading-systems, with their arbitrary cut-off points for each letter-designation. That represented a rejection of the very notion of grading. I'm not certain of the status at those institutions at the moment. Maybe their graduates have had difficulty translating their academic performance into terms that other institutions, that recognize grading, understand, so maybe they've gone back to grading.

4. Improper substitute for “where do I stand”

Coens and Jenkins deplore performance-appraisals and, by implication, also LORs:

At a recent quality conference, a CEO was questioned as to why his organization continued to use appraisals after shifting to a quality management culture of system and process improvement...“We think we owe it to people to let them know where they stand.”...(27)

What people really want is access to the knowledge and information that influences the organization's pay, promotion, and status systems and how these affect or apply to them...People are insatiably curious about Where do I stand? because, in most organizations, this query is decided with a maze of unspoken rules, inscrutable political influences and other dynamics of organizational life. Appraisal is not the system that drives pay, careers, and status; it is an incidental effect of those dynamic systems. Appraisal is...the paper-shuffling that sanctifies decisions already made.(28)[7]

The cognate of pay and promotion, in the corporate setting, is gaining acceptance into a “top” (whatever that might mean) training program, in the medical-educational setting.

Too often, a trainee finds out, to his surprise or shock, where he stands only when he reads his retrospective performance-appraisal/LOR and, by then, it's too late to do anything about it. LORs and performance-appraisals have an especially pernicious effect on medical students, at that vulnerable stage in their development, but they're bad for any trainee and for any person.

5. Inaccuracy:

a. misapplication of the Likert-scale

“Likert” seems an unlikely choice for naming the method, since Likert used the scale in canvassing members of population-samples to obtain aggregate ratings of their attitudes in his 1932-article[8], the presumed basis of the eponym. Likert, himself, had the good sense not to apply such rating scales to important matters that could affect people's livelihoods, though his predecessors already had and his successors still do. Likert[9] credited prior authors, Fechner and Galton, without citing a reference, for the origination of such questionnaires, circa 1888. Scott introduced the system to the United States Army in the early part of the last (20th) century[10]. Paterson, an employee of the Scott-Company, described a later adaptation of Scott's method[11,12] for “objective” evaluation of job-performance, the purpose of interest here.

With no apparent insight into the inherent vagueness of the method, Paterson claimed to distinguish objective from subjective qualities without doing so:

objective qualities . . . “efficiency,” “originality,” “perseverance,” and “quickness” . . . subjective qualities . . . “courage,” “cheerfulness” and “kindliness.” . . .[12]

The criteria he cited for rating workers were similarly vague:

Ability to Learn, Quantity of Work, Quality of Work, Industry, Initiative, Co-operativeness, Knowledge of Work[13]

Strangely, Paterson disregarded the opportunity for evidence-based assessment of the criterion most amenable to objective evaluation, namely quantity of work, in terms, say, of number of units the worker produces per unit-time. Instead, the instructions bade the rater give a worker on that criterion a rating-score, presumably to foster “uniformity”[14] with ratings of other criteria, not amenable to objective assessment.

Other authors described errors and pitfalls inherent in the method. Thorndike first described the halo effect as a

. . . constant error toward suffusing ratings of special features with a halo belonging to the individual as a whole[15]

He found even the most capable rater

unable to treat an individual as a compound of separate qualities and to assign a magnitude of each . . . in independence of the others.[16]

As a countermeasure to minimize the halo-effect, he exhorted:

. . . the observer should report the evidence, not a rating, and the rating should be given on the evidence to each quality separately without knowledge of the evidence concerning any other quality in the same individual.[16]

Thorndike did not explain how a rater could avoid having knowledge of ratings he had given on other criteria listed on the same form. The “evidence” Thorndike had in mind consisted in vague descriptive adjectives, similar to those that Paterson cited,[17] that the rater had a general impression might apply to the ratee.

Kingsbury addressed accuracy:

. . . ratings as ordinarily made are . . . unreliable, and . . . only under what may be called ideally favorable conditions will they approximate accuracy, even on a scale so gross as one of five divisions. (18)

Kingsbury enumerated those allegedly ideal conditions:

Ratings, to be reliable, necessitate (1) averaging three independent ratings, each made on an objective scale; (2) these scales must be comparable and equivalent, made in conference under expert supervision; (3) the three raters must be competent to rate.[18]

Paterson joined, with slightly different wording, in affirming Kingsbury's ideal conditions (2) and (3) and in thus implicitly alluding to pitfalls of the method:

. . . Ratings should be accepted and filed for use only from those who have proved themselves capable of accurately judging human qualities. . . a rating scheme will not work automatically. It must be closely supervised preferably by trained personnel research workers who must continually subject the ratings to critical analysis and assist in training executives in proper use of the method. There is no escape from this requirement.[19]

Paterson and Kingsbury omitted mention of what specifics the training they proposed for the personnel-research workers should comprise and accomplish but they presumably intended, among other things, that the trained supervisors should somehow ensure separate evaluation of labeled traits to exclude Thorndike's halo-effect; then, by averaging, fine-tuning, adjustment and manipulation of the scores from at least three raters, all of whom knew how to provide accurate ratings (presumably assessed by the raters' mutual agreement on each candidate's score on each criterion) obtain a set of ratings consistent with the aggregate global impression each candidate made on the raters (the candidate's halo). The circularity of the rationale seems inescapable.

Prior to receiving requests to fill out forms consisting of Likert-scale ratings on others' performance, I have never received any of the extensive training or testing to prove myself “capable of accurately judging human qualities,” nor, I daresay, has any appraiser of my performance received such training and testing, to my knowledge. The originators of such forms seemed to assume that the rating schemes would “work automatically,” contrary to Paterson's admonition.[19]

Rugg may have had more insight:

. . . The unordered -- yes, the chaotic -- character of the judgments appears, irrespective of what traits are considered or of what kinds of scales are compared. I now believe that the evidence establishes the futility of obtaining single “ratings” on point scales of such dynamic qualities as “intelligence,” “personal qualities,” “general work,” and the like.[20]

Paterson cautioned and predicted:

These rating methods should not be looked upon as perfect or final. Further research is necessary, and industry will profit . . . as progressive, experimentally minded executives realize the scope of the problem and engage in the necessary research . . . to develop newer and more reliable methods than we now possess.[21]

The progress Paterson envisioned has been slow in developing, as the medical-education evaluation-literature amply shows [22-28]. The Likert-scale remains alive, well and unimproved since Paterson, Kingsbury and Thorndike fretted over it and tortured it and since Rugg dismissed it as inherently invalid over eighty years ago.

Rating-criteria in medical education continue to be as vague as Paterson's, e.g., “general medical knowledge (1-5),” “procedural skill (1-5),” “rapport with patients (1-5),” “rapport with nurses (1-5),” “overall general impression (1-5)” (the most global “halo”-criterion of all) and the like.

Many,[22-28] though not all[29,30] current users and discussants of the Likert-scale treat it as an axiomatically good and self-explanatory scheme.

In current medical-education usage, mentors rate trainees and trainees rate mentors without any expert supervision -- in disregard of Kingsbury's ideal conditions[18], Paterson's precautions[19] and Rugg's skepticism[20] -- perhaps in imitation of Likert[8], who may have felt justified in ignoring the precautions, conditions and invalidity of the method for objective evaluation because he pursued only subjective attitudes rather than purportedly objective traits. Yet, those who publish studies based on Likert-scale “data” apply the numerical scores derived as if they were facts and manipulate them with parametric statistics as if they were not ordinal.[22-28] Guilford[30] appears to equate Likert-scales with formal psychometric tests by including them in his book, entitled, “Psychometric Methods.” Worse, many seem to follow Thorndike[9] in attributing traits to ratings, a tendency Tryon deplores even for psychological tests, which are more formal than Likert-ratings:

The test-trait fallacy [consists in presuming] that test scores provide measures of enduring and generalized characteristics of the person, called traits. . .

The test-trait fallacy begins with the assumption that test scores are trait measures. The second assumption is that trait measures are basic properties of the person. It easily follows that test scores reflect basic properties of the person. . . hence a measurement is reified into a causal force. . . the unsound logic of drawing inferences about ability on the basis of observed performance is integral to the test-trait fallacy. . .[32]

Traits are alluring because they are . . . compatible with the stimulus-organism-response paradigm to which virtually all psychologists subscribe. . . To presume that psychological tests . . . measure organismic traits and to further presume that such traits are the basic properties that cause behavior is to place the psychologist in an attractively powerful theoretical and clinical position. The volume of psychological tests . . . is evidence of their allure for clinicians and researchers alike.[33]

Authors even apply statistical methods to aggregate number-scores from a group of raters, compute inter-observer correlations and the like. Literature-approval of Likert-scale “data” encourages decision-makers to attach unwarranted worth to Likert-scale merit-ratings and serenely to apply them in life-altering decisions touching subordinate trainees,[22] such as recommendation for certifying examinations, and employment, and even in promoting faculty-members[25].
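The objection to parametric treatment of ordinal scores can be made concrete with a small sketch. The ratings and the two numeric codings below are hypothetical, invented only for illustration. Because an ordinal scale fixes the order of its categories but not the distance between them, any order-preserving recoding is as defensible as the customary 1-5. Under these assumptions, the comparison of means reverses when the spacing changes, while the comparison of medians does not.

```python
from statistics import mean, median

# Hypothetical 1-5 ratings given to two trainees by three raters each.
ratings_a = [1, 1, 5]
ratings_b = [2, 2, 2]

# Two order-preserving numeric codings of the same five categories.
# An ordinal scale does not say which spacing is the "right" one.
equal_spacing = {1: 1, 2: 2, 3: 3, 4: 4, 5: 5}
unequal_spacing = {1: 1, 2: 4, 3: 5, 4: 6, 5: 7}


def higher_rated(stat, coding):
    """Return which trainee scores higher under the given statistic and coding."""
    a = stat([coding[r] for r in ratings_a])
    b = stat([coding[r] for r in ratings_b])
    if a == b:
        return "tie"
    return "A" if a > b else "B"


for name, coding in [("equal", equal_spacing), ("unequal", unequal_spacing)]:
    print(name, "mean favors", higher_rated(mean, coding),
          "| median favors", higher_rated(median, coding))
```

With these invented numbers, the mean favors trainee A under equal spacing and trainee B under unequal spacing, whereas the median favors B under both. The sketch does not show that medians are a sound basis for decisions either; it shows only that conclusions drawn from means depend on a spacing that the scale itself never supplies.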

Albanes[34] suggests that “real life” ratings, presumably of qualified physicians, are objective and based on outcomes, yet Carey[35] asserts that evaluations of physician-faculty must be subjective. Codman[36] and his spiritual successors[37-41] have called for outcome-based rating of performance and, by extension, of competence, but physicians and hospitals have pointed out the deficits of that method and prevented its spread, to date, by citing the multiplicity of factors, unrelated to institutional or physician-competence, that determine outcome.[42]

The champions of rating attribute two roles to it, evaluative or summative (entailing punitive and deterrent purposes) and formative.[42] Paterson touted the formative purpose:

I. Rating methods have been developed because of a recognition of the educational value of ratings . . .

a. . . . on those who make the ratings. . . insures the analysis of subordinates in terms of the traits essential for success in the work.

b. . . . on the employee. . . encourages self-analysis and provides an incentive for self-improvement in . . . traits in which he is weakest.[35]

As educational feedback, rating fails to fulfill Ziegenfuss' proposed criteria for adequacy and efficacy:

. . . the art of feeding back quality-related data is a critical point of quality improvement work. . .

Feedback is effective when the following conditions are met:

1. Clarity of Purpose. Data can be used for development or for rendering judgment (formative versus summative . . .). . . for . . . organizational development, . . . the purpose is . . . formative . . . Learning and change to improve processes is the goal. A judgmental purpose (summative) offers a . . . grade of pass or fail and is designed for accountability. . .[45]

Since “accountability” entails punishment[46], it does not belong in any workplace.[4] In education, by definition, the only appropriate purpose of feedback is the formative one. The Likert-rating, in its customary application, succeeds in the summative, punitive goal of criterion 1 but fails in its formative goal.

2. Clear and Specific Data. Data . . . must be . . . relevant to the . . . recipient.[45]

The vague expression, “general medical knowledge, 3” (or any other number) is unclear and non-specific, so rating fails criterion 2 and is not relevant to the recipient (see criterion 5).

3. Descriptive, Not Evaluative. Useful feedback describes what is happening but does not offer an evaluative judgment (unless that is the intended purpose). The presenters must not rush to judgment without some interactive discussion with the audience.[45]

The Likert-scale rating substitutes for relevant evidence and thus fails criterion 3.

4. Timely. How close to the action . . . reviewed are the data describing the events? The golden rule is quick feedback . . . Old data is useful for historical and longitudinal purposes but is not supportive of behavior change in the near term.[45]

The team, employed long-term, may tolerate a monthly, quarterly or semi-annual feedback-cycle. The medical student or other trainee, who often has monthly rotations in clinical departments, needs a shorter feedback-cycle. Feedback should be continuous, its formal aspect should be at least weekly, and, preferably, daily. The commonest Likert-rating comes to the ratee's attention as a summative, end-of-rotation event, delivered too late for him to implement improvement, so it fails criterion 4.

5. Limited. How great is the scope of the data? . . . tailor and focus the data to fit the specific, targeted needs of users. . .[45]

A Likert-rating, e.g., “general medical knowledge, 3,” is too vague and global to serve a ratee's needs. It invites the so-called halo-effect and fails criterion 5.

6. Comparative. . . To leave out comparative information is to deprive the recipients of knowledge about their progress or lack thereof. . .[45]

Albanes[34] deplored the rater's “failure to discriminate” among trainees in awarding them equal marks. He thereby pursued a similar goal of making distinctions for distinctions' sake alone and disregarded the “lottery”-nature of rating people who operate “within the system.”[5]

Kingsbury likewise suggested:

. . . we do have to make distinctions between people . . .

. . . and the rater should realize that it is not so disastrous to make some employees 2 who are not much worse than some he marks 3, as it is to mark them all alike to avoid seeming to magnify the difference. . .[47]

As Deming eloquently explains,[5] it's disastrous for an individual to suffer a low rating. A low rating may be especially crushing to a medical student, accustomed as such a tender soul often is, from the experience of a lifetime, to high academic ratings.

If two or more employees or trainees perform equally well and very well, say, 5 of 5, they would deserve equal marks because equality of their performance reflects truth. The company, to which marking two or more employees alike, e.g. 5 of 5, may seem disastrous, can stand the gaff more easily than an individual arbitrarily marked down, despite his best effort, merely to “make distinctions between people.” Neither Albanes[34] nor Kingsbury[47] justified the need to make such distinctions. Each presumably considered the principle axiomatic and self-evident.

Since the end-of-rotation Likert-scale rating provides no progressive comparisons and since mentors might balk at the administrative burden of completing Likert-scale ratings more often than once monthly, it provides no sequential comparison and fails criterion 6.

7. Participative Interpretation. . . . final . . . analysis can[not] be conducted without audience involvement. Joint interpretation is consistent with the developmental/formative purpose, as together we discuss meaning and follow-up action . . .[45]

In the medical-education context, the Likert-scale rater rarely discusses his rating with his ratee(s) prior to entering it. It comes most often to the recipient's attention as a fait accompli, too late for him to improve it. The Likert-scale rating fails criterion 7.

8. Safety and Security. Receiving performance feedback is . . . technical and . . . psychological . . . We need first to have the data correct (technical). . . Presenters must be sensitive to the psychology of the process and offer language and behavior that protect the recipients.[45]

The Likert-scale rating inherently fails the technical criterion, since it consists of a set of numerical scores which obscures the evidence that purports to form its basis. Various errors, to wit, the halo-effect (supra) and tendency toward the mean[48,49] inhere in the Likert-scale.

As applied, it most often fails the psychological criterion since the social-control function, which Albanes[34] advocated, is crucial to its deterrent/punitive function. To pull a punch at the moment of delivery would diminish or annihilate the crushing impact the rater can otherwise accomplish.

9. Practical and Action Oriented. To be useful, the data should suggest some followup action and should be practical enough to be used by professionals in the field. . . [32,50]

Having received a rating of, e.g., “general medical knowledge, 3,” the recipient can discern no idea from the rating how to improve. The Likert-rating fails criterion 9.

The evidence seems clear that ratings fail all of Ziegenfuss's rational criteria for effective feedback.

b. inevitability of rating-inflation

A universal human conceit holds that everybody's a fool and a moral pervert except for thee and me and I'm not so sure about thee. The individual expects others to rate him in a manner consonant with the intrinsic, superlative characteristics that he attributes to himself. When mentors, in a medical-education setting, rate him harshly, he feels helpless and often nonplussed and feels an urge to press his raters to improve his rating.

Some years ago, Sissela Bok, philosopher and wife of Derek Bok, former President of Harvard University, addressed merit-ratings on “fitness-reports” in the US Army. Her context was “lying” and her example of a liar was the supervisor who rated his subordinates too highly on traits, such as “leadership,” “appearance,” etc., which are at least as nebulous as entities that raters in medicine attempt to address, e.g., “general medical knowledge,” “rapport with staff,” etc. They're all manifestations of the great tendency to generalize from skillful execution of a narrow scope of activities, such as getting high scores on tests, to global “excellence,” “outstandingness” or “bestness,” in general.

Bok's description shows that your observation that “excellent” is a third-tier rating has a history:

...Those who rate officers are asked to give them scores of “outstanding,” “superior,” “excellent,” “effective,” “marginal,” and “inadequate.” Raters know...that those who are ranked anything less than “outstanding” (say “superior” or “excellent”) are then at a great disadvantage, and become likely candidates for discharge...superficial verbal harmlessness combines with the harsh realities of the competition for advancement and job retention to produce an inflated set of standards to which most feel bound to conform. (Bok 73)

...The US Army tried to scale down evaluations by publishing the evaluation report...cited. It suggested mean scores for the different ranks, but few felt free to follow these means in individual cases, for fear of hurting the persons being rated. As a result, the suggested mean scores once again lost all value. (Bok 74) [51]

Professor Bok writes as a member of the establishment. LORs and performance-evaluations cause little to no worry to her and her husband, who have made it to the top of the academic heap, from which pinnacle they may comment on us, herebelow:

In elite . . . organizations, the evaluation model tends to be elitism. Two lines of argument are involved. First, since the organizations have selected the best people, evaluation of performance is irrelevant. After all, if the best people could not succeed, who could do better? Second, since the quality of the organizations and their output is determined primarily by the quality of their people, attention to system, methods, or management is inconsequential. It follows that, if the organizations already have the best people, “the opportunities for increased productivity in them are small and come slowly.” Finally, . . . elitism tends to create self-perpetuating closed circles whose members are exempt from review except by peers within.

Converting work problems into people problems is a process of denying organizational accountability. It is a process of establishing a hierarchy of special privilege and immunity to rank with the hierarchy of authority. It is a process of maintaining the status quo; it denies both the need for change and the possibility. (27)[52]

Accordingly, Professor Bok focused on the “lies” perpetrated to help the plebeian but omitted any mention of organizational dishonesty: the rumor-grapevines, chiefly by telephone, which leave no paper-trail, and which circumvent and subvert the normal channels of committed, transparent, written communication, to which subjects of ratings on LORs might obtain access.[53] Personnel-managers use such underhanded means, which evade legal liability for defamation of character, to find out from former employers “what applicants are really like.”

You wrote in your article[1] in nearly identical terms of the inevitable tendency toward rating inflation, your “hierarchy of superlatives,” the Lake Wobegon effect, in which everybody is “above average,” and the tendency toward rating-fragmentation, which permits raters to distinguish, by phrases such as “is one of the finest medical students of the year,” “...one of the best medical students I have ever worked with,” “richly deserves the honors awarded in the rotation,” or “receives my highest recommendation,”[54] among the best, the very best in the past year, the best ever, etc., etc. Speer et al cited grade-inflation in internal medicine as well:

. . . a significant number of clerkship directors (43%) felt that we are unable to appropriately identify students with failing performances. The implication for our ability to certify students as clinically competent is concerning. . . (116)[55]

That's evidently not their concern. They express more concern with labeling trainees clinically incompetent.

. . . faculty were the key to both the cause and solution. (116)[55]

That is a truer statement than Speer et al perhaps realized, though faculty would probably prefer to blame the trainee-victims.

Yet, clinical medicine simply doesn't contain tasks of sufficient sophistication that would enable a trainee to distinguish himself from his fellows to the extent depicted in all the finely nuanced and ever mounting expressions of enthusiasm. The difficulty would be quite similar to the difficulty of rating a patient in similar terms, according to his response to treatment. Objectively, he either gets better, stays the same or gets worse. It's difficult to imagine that an evaluator of patients could find rational criteria for appraising a patient's recovery as “excellent,” “outstanding,” “one of the best on the ward,” “one of the best in the past year,” “the best ever,” etc. If a rater can't do it for a patient, how can he do it for a trainee?

Gould attributed the fallacy of confusing objects with labels to John Stuart Mill:

The tendency has always been strong to believe that whatever received a name must be an entity or being, having an independent existence of its own. And if no real entity answering to the name could be found, men did not for that reason suppose that none existed, but imagined that it was something peculiarly abstruse and mysterious.[56]

Gould cited the fallacy in noting that Binet, originator of IQ, intended none of the social elitism to which it has given rise.[56] Such reification of jargon is a prominent feature also of rating practice.

c. popularity-contest

What feats of clinical derring-do can a trainee, at any level, perform that would make him so much better than any of his contemporaries that he would qualify for such sterling and distinctive accolades as "is one of the finest medical students of the year," "is one of the best medical students I have ever worked with," "richly deserves the honors awarded in the rotation," or "receives my highest recommendation",[54] in contradistinction to his fellows, whose performance might rate a mere “excellent”?

Did pediatric resident A miraculously heal a girl with Friedreich's ataxia so she never progressed and even achieved a normal gait? If so, how did he do it? By Divine Intervention? By Black Magic? By weird science? Miracle-healing, if accomplished, would obviously exceed customary expectations and be well above the upper control-limit of performance that Deming defines as “within the system.” Miracle-healing may thus warrant the highest accolades but even outstanding residents rarely to never perform it.

The only realistic answer that comes to my mind is that the highly regarded trainee manufactures his high regard by ingratiating himself, through force of intrinsic personality or insidious, political means, into the rater's favor. The rater then comes to like the trainee personally so much that he's willing to go out on a limb for him with various superlative terms of enthusiasm, presumably assuming his performance be at least adequate. In other words, rating of trainees for personnel-records and the LOR are popularity-contests. Who has the bubbliest personality? Who is the most “well liked?”[57]

Such a system could select for those who go along to get along and who may rate pleasing their administrative superiors to enhance the chance of their own advancement as more important than performing what's right for a patient, perhaps contrary to the will of their superiors. Such disregard of objectively correct performance may lead to deterioration of quality of patient-care, the opposite of the rational goals of a health-care system.

d. mismeasure of “excellence”

Mere prattle without practice doesn't necessarily transfer well to good real-world outcomes. Howard Zinn spoke of “the best and the brightest”:

The New York Times did a survey of high-school students to see how much history they knew. They do this every few years. They do a survey of young people to prove how dumb they are and to prove how smart are the givers of the tests and so they gave this test to high-school seniors and corroborated what they thought. Young people don't know anything about history. They asked questions like, “Who was the President during the War of 1812?” “Who was the President during the Mexican War?”...We're in a great quiz-culture...“What came first the Homestead Act or the Civil Service Act?” You recognize questions like that because those are the questions that appear on tests which enable you to get into graduate-school. You can go very far if you know enough of those answers. You'll be Phi Beta Kappa. You'll become an advisor to the President of the United States. You remember the book, The Best and the Brightest, which was precisely about that point, that the people surrounding the President were...the people who got the highest scores. They were Phi Beta Kappa and they were the architects of the War in Vietnam.[58]

Holman cited an analogous problem related to inflated self-esteem, the “excellence” deception in medicine.[59]

Simpson addressed the examination-system but his remarks apply at least as well to any rating system:

...the traditional examination system...achieves...pseudo-precision, for it has chosen the accurate measurement of the barely relevant in preference to the less precise measurement of the most highly relevant...our cultural bias towards believing that anything expressed in numbers must be significantly more true than the same thing expressed in words...allows the student to accumulate a sequence of numerical ascriptions and grades, often of very dubious reliability and validity...added together and averaged to help us guess at whether he is fit to leave medical school. This is as logical as making a pre-operative surgical assessment by adding and averaging your patient's haemoglobin, potassium, urea and blood sugar levels. It produces results...of little or no predictive validity and...neither tell the student who has passed the exam why he has done well (so that we can be reasonably sure he can do it again) nor tell the student who has failed anything of much use to him in avoiding further failure...[60]

e. mentor-inattention

The descriptions of how recipients of LORs perpetrate Mill's reification-fallacy, in an attempt to attach specific meanings to various phrases that the phrases themselves don't necessarily denote,[61] seem especially anomalous in a context in which the author may be a department-chairman who may even concede that he has never had any contact with the trainees about whom he has a duty to write LORs, and who thus lacks even what Albanes called

The episodic, fragmented, and...small amount of contact that clinical faculty have with students...(Albanes 653)[34]

Albanes claimed that that circumstance

...leaves them [raters] reluctant to make ratings that would call attention to students' performance deficits...(Albanes 653)[34]

In those circumstances, faculty-members' reluctance to make ratings of any sort, at all, would bespeak their simple honesty. Yet, somehow, most faculty-members, whether in good conscience or not, rate their students and other trainees after clinical rotations and later when they write LORs for them.

Kefalides affirms faculty-expectations and complains that insurance-rules newly require faculty-members to take care of patients and thus provide golden opportunities for clinical teaching, which he seems to disparage.[62]

Cydulka et al present time-consuming, close observation of trainees as a startling new departure.[63]

In the industrial setting, in which TQM arose, nobody could ever confuse the manufactured product with the worker whose efforts produce it. In another article,[64] Albanes did just that. He attempted to apply TQM to medical education but, in the process, he conflated students as human beings with students as objects, products of the education-process and got his ideas twisted. As a result, in one section of his article, grading is good, while in another, it's bad. The very fact that Academic Medicine published his article indicates the likelihood that the thinking, among many academics, about rating and evaluating the performance of trainees and others is confused.

f. self-fulfilling prophecy

Bosk noted:

One striking feature of the clinical judgment of residents is how easily the whole process may turn into self-fulfilling prophesy (sic)....good reputations exercise a protective or deviance-reducing effect while bad ones generate a destructive or deviance-amplifying one. If a resident is considered trustworthy, monitoring by attendings is decreased. Therefore, deficiencies are less likely to be discovered. Conversely, if a resident is suspect, monitoring increases. Convinced that they are there for the finding, an attending is more likely to find evidence of sloppy work. When found, these only increase surveillance, which again increases the probability of mistakes. Clearly suspicion does not create residents who are unfit-- after all, something creates the suspicion. Nonetheless, being suspect is for a resident a very vulnerable and demoralizing position. Not only that, being above suspicion gives a fair amount of protection, especially when mistakes need not be seen as innocent error. Given these dynamics, it is not surprising that those who fall on the short end of evaluation (or their attorneys) often characterize it as arbitrary and capricious.[65]

Strangely, when a physician so abuses ancillary personnel that they lose their self-confidence in an analogous manner, he becomes a “disruptive physician,”[66] fit only for expulsion. Yet, in the setting of medical education, such abuse is tolerable, even customary.

g. absence of evidence-basis

Rating/evaluation is particularly vulnerable to charges of resting on an inadequate evidence-basis:

...In perusing the folders of the residents in the training program that I studied, I found only one evaluation that mentioned a specific incident. This leads me to suspect that residents who are dismissed from programs could easily argue that their “due-process rights” were violated, which raises a very thorny issue. Surgery to a large degree rests on peer trust, and it is unclear what degree of formal, concrete evaluation is consistent with that trust. (12)[67]

Two pediatric core-curricula have come out for emergency-pediatrics,[68,69] one for pediatric interventional cardiology,[70] one core-content inventory for adult emergency-medicine[71] and a retrospective inventory of diagnoses encountered in internal-medicine residency.[72]

Other core-curricula may exist in other specialties, yet, in no specialty, do recommendations for the LOR relate in any manner to specific elements of any defined core-curriculum. If the LOR is supposed to reflect job-performance, what justification is there for omitting any mention of job-performance criteria, delineated in national core-curricula, core-content statements or otherwise?

In all the literature on LORs and evaluations, none that I've seen suggests including the cumulative statistics on clinical outcomes of patients under the care of the subject of the LOR. Yet, without such evidence of actual job-performance, in terms of numbers and proportions of patients saved, lost and improved, the rest is nothing.

The medical literature is replete with accounts of physicians' inaccurate performance-appraisals of their colleagues and of trainees.[73-86] Those accounts render the idea of entrusting performance-appraisal of anyone to physicians patently absurd.

Perhaps the most concrete, objectively verifiable category is “procedural skills.” The trainee either succeeds at the lumbar puncture by obtaining CSF or not, succeeds in intubating a patient or not. No performance-evaluation I've ever seen has any space devoted to citing the specific number of procedures that the mentor observed the subject performing, far less a score-card that documents how many he performed successfully and in how many he failed. What would be the distinction in a rating of 3/10 vs. a rating of 7/10 in the category, “procedural skills?” One might imagine that the evaluee succeeded in 30% or 70%, respectively, of the procedures he performed during a clinical rotation. Did a month-long rotation provide even ten opportunities for each of, say three trainees, to perform lumbar punctures or intubations? It seems unlikely.
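The arithmetic behind that doubt can be sketched roughly as follows. This is only an illustration, not a method drawn from any of the cited sources: it assumes that each attempt at a procedure is an independent trial with a fixed chance of success, and it uses the Wilson score interval (my choice of technique) to show how wide the uncertainty is when a trainee has had only ten opportunities. The function names are hypothetical.

```python
import math

def wilson_interval(successes, attempts, z=1.96):
    """Approximate 95% Wilson score interval for a success proportion.

    Assumes independent attempts with a constant chance of success,
    which is an idealization of real clinical procedures.
    """
    if attempts <= 0:
        raise ValueError("attempts must be positive")
    p = successes / attempts
    denom = 1 + z ** 2 / attempts
    centre = (p + z ** 2 / (2 * attempts)) / denom
    half = z * math.sqrt(
        p * (1 - p) / attempts + z ** 2 / (4 * attempts ** 2)
    ) / denom
    return max(0.0, centre - half), min(1.0, centre + half)

def intervals_overlap(a, b):
    """True if two (low, high) intervals share any values."""
    return a[0] <= b[1] and b[0] <= a[1]

# A "3/10" trainee versus a "7/10" trainee, ten observed attempts each.
weak_small = wilson_interval(3, 10)
strong_small = wilson_interval(7, 10)
print(weak_small, strong_small, intervals_overlap(weak_small, strong_small))

# The same proportions with one hundred observed attempts each.
weak_large = wilson_interval(30, 100)
strong_large = wilson_interval(70, 100)
print(weak_large, strong_large, intervals_overlap(weak_large, strong_large))
```

Under these assumptions, the intervals for 3 of 10 and 7 of 10 overlap, so ten observed attempts cannot reliably separate the two trainees, whereas the intervals for 30 of 100 and 70 of 100 do not overlap. A month-long rotation that offers a trainee one or two attempts supports even less.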

If the trainee's score was low, where is the documentation of the help that the mentor provided to the trainee to improve his performance? I've never seen it and the proposed standard Letter of Recommendation (SLOR), in emergency-medicine, omits mention of anything like it.[54,63]

Where is the documentation of the progress in the trainee's score during the month? Was his score 2 of 10 at the beginning of his rotation and 8 of 10 at the end? I've never seen anything like that, either, possibly because a trainee may have one opportunity to perform one clinical procedure in a month, if he's lucky. The customary evaluation is post hoc, delivered as a summative accolade or condemnation, long after the trainee can do anything about his scores.

What was the quality of his performance? What did he do to succeed in the procedure, if he succeeded? Did he fracture teeth of patients he intubated? If so, how many each? How many of the patients on whom he performed a lumbar puncture required a blood-patch afterwards to stem post-procedure CSF-leakage? I've never seen any such evaluation in writing.

How many of the procedures that the evaluee performed did the mentor personally observe? SLOR has no space for any such entry[54,63]. Is the rating based on a “general impression” of the evaluee's procedural skill, as an intrinsic trait, derived from rumor? If so, upon what specific evidence or criteria did the evaluator base the score that he assigned?

Is one of the mentor's considerations his own anxiety over giving the evaluee a big head? Did the evaluee, rather, need a low score to give him a harsh dose of “reality?” If so, on what evidentiary criteria did the evaluator base his concept of “reality,” such that a low rating would give the evaluee a dose thereof and, in some sense (what sense?) improve the evaluee's outlook? Did the evaluator apply his dose of reality to all evaluees consistently? If not, why not? Did he condemn those whom he personally disliked (perhaps because they asked him tough questions to which the mentor felt embarrassed at not knowing the answers) and favor those whom he personally liked (perhaps because they never asked him any tough questions)? If he applied his dose of reality consistently, without regard to the evaluee's actual performance (which the evaluator may never have observed -- my consistent experience, throughout “training”), isn't that practice arbitrary, unreasonable and capricious, i.e., a manifestation of chaos and irrationality, in a setting where rational thought is supposed to prevail?

Most important, what does the rating score tell the relevant candidate about what he should do to improve his performance?

One might argue that success in procedures, like medicine itself, is a stochastic matter[5,6], i.e., that some procedures fail even in the best of hands and some succeed even in the worst of hands, in whatever sense of “best” and “worst” one might choose to apply. I would reply that that's correct. Success in procedures is, at least to some extent, what Deming terms a lottery,[5] no question. Given that truism, what's the point of making “procedural skills” a ratable category, in the first place?

h. Glittering generalities

Greenburg et al wrote, in relation to LORs:

Brevity and generality...come across as distinctly negative features, causing the reader to wonder...whether the writer actually knows the applicant. (197)[87]

Yet, the evaluation-criteria, upon which LORs are most often based, rely upon brevity and generality, presumably in the assumption that evaluators' general opinions of candidates reflect the truth. No evidence supports that proposition and my personal observation is that it is false. If brevity and generality be negative features of a LOR, how can the same features be acceptable in the underlying evaluation-criteria?

Bosk terms vague indices of “quality” of the candidate, such as “general medical knowledge,” “rapport with staff” and the like, “essentially-contested concepts.”[67] They are summative, glittering generalities, intended to make the evaluation-form brief, that have no necessary evidentiary relation to the subject-physician's actual performance, to his clinical acumen or to the clinical outcomes of his patients.

In the field of academic emergency-medicine, Harwood et al referred to various elements of evaluative jargon:

Of the applicants submitting SLORs to our EM residency program, 49% or more received the superlative response in the categories of "commitment," "work ethic," and "personality." In contrast, only 35% of the applicants received the superlative response regarding their "differential diagnosis ability." The "global assessment" operated similarly, with 37% of the applicants receiving the superlative response. The least common superlative response was the "match rating," with only 23% of the applicants receiving a "guaranteed match."

These data can serve as a reference for both interpreting and writing SLORs. The data show that EM applicants least commonly receive the superlative response in the categories of "differential diagnosis ability," "global assessment," and "match rating," making these key categories for residency selection committees. These results suggest that authors can justifiably evaluate most applicants in the highest categories of personal traits, but that they should be more discerning with assessing "differential diagnosis ability," "global assessment," and "match rating."[88]

Harwood et al seem to pretend that ratings are objective facts, rather than what they are: subjective appraisals based on the author's claimed but unverifiable (and probably negligible) familiarity with the trainee.

In the foregoing passage, Harwood et al urged institutional authors of SLORs (standardized letters of reference) to manipulate their performance-appraisals in various sections of the SLOR, on the premise that the evidentiary basis of such appraisals doesn't matter, with a view to pandering to the selection-committees for emergency-medicine residencies and manipulating the outcomes of their deliberations over trainee-selection. Harwood et al seem to ignore the possibility that fewer authors rate trainees highly in the glittering-generality categories of "differential diagnosis ability," "global assessment," and "match rating" than in the glittering-generality categories of "commitment," "work ethic," and "personality" because the authors could be inappropriately ungenerous with their ratings in the first three categories. Those categories are the clinically oriented ones, and authors would very likely believe that they weren't performing their watchdog/gatekeeper function properly (to keep bad doctors from practicing emergency-medicine) unless they had condemned a certain quota of trainee-candidates with each batch that left their respective institutions. Heaton depicts that practice in terms of what he calls the “basic process”:

The Basic Process

The basic process of an individual in a hierarchy is to avoid mistakes. . . individuals are rated by their errors, for their tasks are predetermined. There is no premium for achievement outside assigned hierarchical tasks but there are penalties for every shortfall from perfection.

The normal distribution in a hierarchy includes a percentage of failures, so grading on a curve means that students making the most mistakes are given failing grades. . . When failing students are eliminated, those next above them succeed to the failing category. The rule of thumb is for one-third to leave between the fifth and twelfth grades, . . . The next third become failure-threatened, declining in rank regardless of effort or improvement. Apprehension then blocks learning so there can only be unskilled repetition. Thus this middle third is taught submission and place within the hierarchy. . . (32)

Is it true or false that for every winner there has to be a loser? False – there has to be a continuing supply of losers if a winner is to keep on winning. In schools, grading on a curve . . . means that the A student needs an F student at the other end of the normal distribution; then annually or more often, when the F student is eliminated or drops out, another student must be pushed into the failing position. . . companies seem to survive only by establishing a large pool of marginal workers who can be picked up when needed and dropped when business is slow. . .

. . . Schools in exclusive suburbs do not produce so many failures . . . Instead they assume their students are mostly in the upper half of a normal (59) distribution. . . there are schools which assume their students are mostly in the lower half of a normal distribution. In one vocational high school in New York, no teacher could give a grade above C without special approval by the principal. In a ghetto high school a department head told me that only one student in a . . . class of twenty was capable of learning. I knew the students were capable and interested, but sure enough, nineteen dropped out and failed. . . grading in schools is a process that produces failures and accomplishes rejecting.

Winners are custom-made, but losers are mass-produced. . . (62)[52]

Pursuing a similar line of “reasoning,” raters in medical education may believe that they can enhance the reputation and credibility of their respective institutions by making a big show of being “tough graders,” and the clinically oriented rating criteria are the most attractive targets for that sort of behavior.

i. Distortion from “confidentiality,” under perpetual tension

In both editorial peer-review and performance-appraisal/LOR, the thesis is that the rater cannot deliver an “honest and accurate”[89] rating unless he labors under the protection of “confidentiality,”[61,89,90] meaning that everybody, except for the subject, gets to see the rating.

Decades of organizational oppression, in which ratees had to tolerate the intolerable, finally prompted the US Congress to enact the enlightened Buckley Amendment, a federal law that requires schools that receive federal funding to make student records available for viewing by parents and the students themselves if they are 18 or older.[89,91] Yet, even though federal law mandates that the trainee should be able to see his rating, those in medical education prefer the old oppression. They recommend that the organization should compel the trainee to “waive” his legal right, under the Buckley Amendment, to see his rating, in the interest of the “honesty,”[1,90] “authenticity”[89] and “objectivity” of the letter (read: freedom of the rater to give an adverse rating with the security of knowing that the ratee cannot learn of it and therefore will have no grounds for retaliation) and of its “value”[1] to the receiving institution.

In editorial peer-review, the author receives the rating but not the identity of the rater. In performance-appraisal, the trainee knows the identity of the rater but, ideally, not the rating.

Yet, dissenting voices resist such organizational oppression, and for good reason, in my view:

...One of our wisest and most experienced faculty members, Dr. Douglas Lindsey, offers to write letters for every medical student. He writes them honestly. He then shows the student the letter. It is up to the student to decide whether it is sent. This is an excellent policy of a great teacher. Unfortunately, it is probably unique. (320)

A few students have been asked to sign statements that they have not seen their reference letters. This is ridiculous and unenforceable. Don't sign...it is common practice for students to be asked to sign a waiver of their right to request to see referee letters. If you are forced into this type of situation, you may have to sign it and hope for good letters. If possible you do want to see those letters before they go out. (321)[53]

The practice of “confidentiality,” under compulsion and under false color of “honesty,” in the rater, may thus spawn duplicity and dishonesty in the ratee.

On the subject of so-called honesty, one naturally wonders whether the “honesty” will be even-handed or biased. A few obvious questions spring to mind:

Will the rater be as “honest” about how he himself prioritized the needs of trainees lower than his own personal needs, and therefore devoted insufficient time to those in need of guidance to foster their improvement, as he claims to be about the shortcomings of those trainees, whom the rater thus abandoned? Will he be honest about his own failures to implement and incorporate core content of his specialty (e.g., emergency-medicine) in training and rating his trainees? Will he be honest about his own failure to provide daily feedback to trainees to keep them informed of what specific performances they needed to demonstrate the following day to show improvement? Will the rater be honest about his own failure to document daily or weekly improvement or otherwise, and reasons therefor, in his rating-comments? Will the rater be honest about his own failure to define behavioral educational objectives,[92] toward which the trainees might strive? Will the rater be honest about how he exchanged gossip with other faculty about various trainees and thereby formed a collective, united, homogenized opinion of trainees, instead of expressing his own opinion, based on his personal observations? Will the rater be honest about casting the evaluation in terms only of the trainee's failures, not in terms of systematic failures of the institution?

Tonesk provides a twisted view of objectivity vs. subjectivity and authority-relationships in medical education.[93]

In the realm of editorial peer-review, Walsh et al found referees more considerate and courteous toward authors if their names were attached to their reports.[94] What's wrong, therefore, with accountability in LORs?

Flacks wrote:

. . . maintaining the confidentiality of the contents of evaluations and letters of reference would [not] improve the quality of such assessments. On the contrary, . . . I've become convinced . . . that the reverse is true. New state laws and university regulations have opened the process . . -- and the results . . . have been good. Faculty members and departments now have the opportunity to respond to negative reviews . . . timely . . . and with some understanding of the arguments that may merit rebuttal. The review process is now more cumbersome, but it is . . . less Kafkaesque. . . A new law that would require full disclosure has passed the legislature but is being contested in the courts by the University. I am quite sure that the . . . motivation for the University's resistance is not so much to protect the quality of the review process as it is to protect the discretionary powers of the administration.

. . . The need for open evaluations is not simply that such openness promotes due process. The due process argument applies to all institutions in their treatment of workers. . . open access helps ensure that each member can benefit from critical feedback and also ensures that criticisms are made in a way that is responsible to canons of scholarly objectivity. . .[95]

Fashing wrote:

. . . If we allow people to require anonymity as the price for the exercise of candor and professional responsibility, then surely we encourage a pernicious form of cowardice. Are our sensibilities so delicate that they cannot contend with the requirement to render our negative judgments openly and honestly with whatever risk that entails? And if they are, should we continue to encourage such delicacy or should we begin to require a modicum of courage to go with our “candor”? I for one believe we should. . . the requirement of anonymity raises serious questions of credibility in its own right. Why should we believe that anonymity is the price of honesty any more than that it is an opportunity for dishonesty? . . .

. . . there are compelling reasons for confronting intellectual, professional, and . . . personal differences as a minimal requirement for the development of any serious sense of community. This will no doubt produce some unpleasant moments in the context of whatever conflicts surface, but what group that constitutes a serious community, or perhaps more importantly, a community to be taken seriously, especially in intellectual terms, is without conflict? That consensus about all issues is unnecessary to the maintenance of a healthy community is recognized by all but the most resolutely conservative members of the academy. To address such differences and to resolve them, or in the case of intellectual differences, to provide a climate in which debate and conflict of opposing ideas are a catalyst for intellectual growth and creativity, strikes me as the essence of academic community and a primary requirement for intellectual and academic freedom. In this sense disclosure should promote rather than retard intellectual excellence. (222)[96]

6. The ultimate goal: communion of “top” talent in “top” institutions

The counter-argument to the foregoing is that the most competitive programs have to select the most competitive trainees.

Why? Even assuming that the selection-process be valid, a dubious proposition, what ultimate utility is there, in aid of quality of patient-care, in concentrating “top talent” in “top institutions?” Isn't that just elitism run amok? What about spreading the wealth, if that's what it is (a dubious proposition), around a little? Wouldn't the “non-competitive” trainees gain from exposure to “top institutions” and wouldn't “competitive trainees,” if they offer any genuine advantage over non-competitive trainees, be able to work their magic in institutions in more humble locations?

I've interacted with finished physicians from a broad range of institutions and I'm constantly impressed with how alike they are. Physicians from Harvard, Yale and other Ivy League institutions are no great shakes and some of the most impressive come from the hinterlands. What was all the fuss about during education and training, then?

7. Illustrative anecdote which is more typical than it should be

When I worked as a civilian in the ER of the military hospital, Fort Stewart, GA, my military supervisor, a Major in the Army Medical Corps, liked me pretty well at first but seemed to dislike me more and more as time went on, evidently because of conflicts that swirled around me.

He criticized my handwriting, so I brought in a word-processor to write up my charts and make them optimally legible. He didn't stop me from doing that but, long after I'd left there, I obtained copies of my personnel-records, including documentation of his commentary on the episode. Without explaining what he intended, he put an exclamation after the statement, “he brought in a word-processor!” I gather he disapproved of my constructive response to his criticism, yet he suggested no other alternative. What did he want from me? Did he expect me suddenly to develop handwriting like his? He never explained.

In perhaps the emblematic episode of my tenure there, I pissed off one of his fellow Army-officers by calling him in at night to attend a female patient of his, whom I wanted admitted for evaluation and monitoring of chest-pain that I suspected had a cardiac origin. He chewed me out for disturbing his sleep and wanted me to release her home without forcing him to come in and examine her. He claimed to know her so well that he KNEW that her chest-pain was not cardiac but, instead, was from her COPD. The rules, not of my making, required him to come in and examine a patient whom the ER-physician suspected of requiring admission. Under protest, he came in, chewed me out some more in front of nurses and other personnel and released her home. A few weeks later, her cardiac catheterization at Fort Gordon revealed severe coronary artery disease. I had committed an unpardonable sin: being right when an army-doctor was wrong.

It's not as if this were a diagnostic coup. It could hardly have been more stereotypical. She had chest-pain, reminiscent of cardiac chest-pain. It was bread-and-butter medicine. She needed admission for the sake of safety. The officer fulfilled his paper-duty under protest by getting out of bed and examining the patient. He failed in his duty to admit her for monitoring.

I pissed off a pediatrician by calling him in at night a few times to attend febrile infants who I thought might need admission, as a posted directive required me to do. Whether the patient's condition is serious enough to warrant admission is a matter of judgment and, if I think the patient needs admission, the pediatrician may disagree. I assumed that to be in the realm of disagreement among reasonable people. He evidently disagreed, even with that principle, probably because he was the pediatrician on call and fulfilling his duty required him to exert unwelcome effort. He impugned my “judgment,” as a tactic in his campaign. He sent all the patients I referred to him home, possibly as a way of accumulating incompetence-points against me. Those incidents illustrate the principle, universal, in my observation, that hospital-personnel pay abundant lip-service to concern for quality of patient-care but their actions bespeak only concern for their own convenience.

Thus, I accumulated “complaints” against me but the hospital never preferred any charges against me or offered me a peer-review hearing for me to rebut such charges, presumably because the notion would have been absurd, even to Army-brass.

Hypothetical charge 1: diagnosing chest-pain as cardiac which later proved to be cardiac but pissing off Army-Officer in the meantime by calling him in at night to do his duty. Charge 2: complying with posted hospital-directive by calling in Army-Officers in relevant specialties “unnecessarily,” and thereby pissing them off, on nights when they're on call to attend patients, possibly appropriate for admission, and to render their opinions.

Instead of taking a formal route, they chose a typical bureaucratic route: my supervisor completed consecutive evaluation-reports in secret and never discussed them with me. The personnel-records I obtained years later exhibited an unmistakable halo-effect: in all components, from “medical knowledge” and “rapport with staff” to “health” and “appearance,” the ratings descended in parallel from 9 or 10 of 10, steadily downward, to end at about 3 or 4 out of 10, under the influence of multiple complaints of pissing off Army-physicians by asking them to do their duty. That is, each evaluation-cycle, my supervisor assigned all components the same rating: all 9s, all 8s, all 7s, all 6s and so forth. Yet, my “appearance” and “health” were verifiably the same throughout that time: fine and stable. He presented not a scintilla of evidence of my deteriorating health, for example, yet he “documented” its deterioration in his numerical ratings. This person had an MD-degree!

Thereupon, enough poor pseudo-ratings had accumulated against me to “justify” my termination and to provide an ironclad “paper-trail,” in case I should have decided, at some point, to contest my termination legally.

LORs that I requested from Fort Stewart stated only the dates of my employment there but made no mention whatever of my performance, e.g., my thoroughness and my diligence, for the benefit of patients, against the odds of dysfunctional military-bureaucratic obfuscation. Those LORs illustrate a fundamental principle of all LORs: LORs accommodate the needs of the ambient power-hierarchy, not of the subject thereof. That makes them inherently inaccurate. If the academic is honest with himself, he will concede that academic power-hierarchies exhibit similar manifestations.

I could provide other anecdotes with similar import but I've gone on far too long already, so I'll stop.

When will decision-makers cop themselves on to the inherent unfeasibility of rating human beings?

 
