# Should We Ignore the “Mediocre” Wines?

The Stellenbosch Tastings – 2013: Should We Ignore the “Mediocre” Wines?

Last year, the annual meeting of the American Association of Wine Economists was held in Princeton, NJ. The big news was that New Jersey wines did almost as well as French wines. But enough. Another contest pitting French wines against…is not needed. And at the Stellenbosch (Western Cape province, South Africa) tastings of South African Sauvignon Blancs and Pinotages, the interesting issues had nothing to do with winners or even with whether more expensive wines got higher scores (they did not).

The issues at the Stellenbosch Tastings were whether rankings or scores should be used and how to distinguish good from bad tasters. And yes, my question: are wines that get “middle” scores worth drinking? Fortunately, I chaired a session at Stellenbosch that included a mathematician, a biostatistician, an oceanographer/statistician, and a J.D. lawyer/graduate of the French Culinary Institute. Just the right group of professionals to help explore these issues!

The Results

I will use the results of the tastings to illustrate these questions. At the tastings, randomly selected volunteers were asked to score the wines on a scale from 50 to 100, where: <70=Poor/Unacceptable, 70-79=Fair/Mediocre, 80-89=Good/Above Average, 90-100=Excellent to Superior. In the following tables, wines on the left got the highest scores, those on the right, the lowest. A common question with such data is how closely the tasters agreed in their ratings. Dom Cicchetti, the biostatistician on my panel, confirms that the best statistic for addressing this question is Kendall’s Tau. The Tau value for reds was very low – .07 – suggesting no agreement between tasters on wine scores. The Tau for whites was fair or average – .46 – suggesting modest agreement among the judges.
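For readers who want to see how such an agreement statistic works, here is a minimal sketch of Kendall's Tau (the tau-b variant, which tolerates tied scores) for one pair of tasters. The two score lists are hypothetical, not Stellenbosch data, and the article does not say how the statistic was aggregated across all tasters, so this shows only the pairwise building block.

```python
from itertools import combinations
from math import sqrt

def kendall_tau_b(x, y):
    """Kendall's tau-b between two equal-length lists of scores."""
    concordant = discordant = only_x_tied = only_y_tied = 0
    for (x1, y1), (x2, y2) in combinations(zip(x, y), 2):
        dx, dy = x1 - x2, y1 - y2
        if dx == 0 and dy == 0:
            continue                  # tied for both tasters: pair is ignored
        elif dx == 0:
            only_x_tied += 1          # first taster could not separate the pair
        elif dy == 0:
            only_y_tied += 1          # second taster could not separate the pair
        elif dx * dy > 0:
            concordant += 1           # both tasters order the pair the same way
        else:
            discordant += 1           # the tasters order the pair oppositely
    denom = sqrt((concordant + discordant + only_x_tied)
                 * (concordant + discordant + only_y_tied))
    return (concordant - discordant) / denom if denom else 0.0

# Hypothetical scores from two tasters for the same five wines.
taster_a = [88, 75, 92, 81, 70]
taster_b = [85, 78, 90, 72, 74]
tau = kendall_tau_b(taster_a, taster_b)
print(round(tau, 2))
```

A value near 1 means the two tasters order the wines almost identically, a value near 0 means their orderings are unrelated, and a value near -1 means they order the wines in reverse.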

Table 1. – Stellenbosch Sauvignon Blanc Tasting – 2013

* Estimated retail price in South Africa converted at 1 US\$=R10
Source: Data collected and tabulated by Neal Hulkower

Table 2. – Stellenbosch Pinotage Tasting – 2013

* Estimated retail price in South Africa converted at 1 US\$=R10
Source: Data collected and tabulated by Neal Hulkower

The Scoring Versus Ranking Controversy

In judging wines, is it better just to rank them, e.g., 1, 2, 3, etc. or score them as was done at the Stellenbosch tastings? Neal Hulkower argued for rankings in a paper presented at Stellenbosch while Dom Cicchetti made the case for scoring wines. In essence, Hulkower argues that scoring introduces too many arbitrary taster decisions into the judgments while Cicchetti argues that scoring allows tasters to register different intensities of likes and dislikes. The following examples will highlight the differences: assume 2 tasters and 3 wines.

Case 1 – The tasters have opposite tastes on wines A and C but agree on B. They are required to rank the wines – 1, 2, 3. The result is a 3-way tie – all wines get 4 points.
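The arithmetic of Case 1 can be sketched in a few lines. Which taster prefers A, and the convention that 3 marks the favorite, are assumptions made for illustration; the tie comes out the same either way.

```python
# Ranks for Case 1 (assumed convention: 3 = favorite, 1 = least favorite).
rankings = {
    "Taster 1": {"A": 3, "B": 2, "C": 1},
    "Taster 2": {"A": 1, "B": 2, "C": 3},
}

def rank_totals(rankings):
    """Add up each wine's rank points across all tasters."""
    totals = {}
    for ranks in rankings.values():
        for wine, points in ranks.items():
            totals[wine] = totals.get(wine, 0) + points
    return totals

totals = rank_totals(rankings)
print(totals)  # every wine ends up with 4 points
```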

Case 2 – The same 3 wines and 2 tasters, but in this case each taster gets 9 points to distribute among the wines. One outcome is represented in the table below where the results are the same as in the ranking example:

But there is another possibility as represented in the following table where wine A gets the highest score:

Question: is taster 1 “playing the system” to get his wine to win (Hulkower’s concern) or does it reflect a true difference in the wine “like/dislike intensity” of each taster (Cicchetti)?
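The Case 2 tables are not reproduced here, so the sketch below uses hypothetical allocations chosen only to match the description: each taster distributes exactly 9 points, and in both scenarios each taster keeps the same order of preference. Only the intensity expressed by taster 1 changes.

```python
POINTS_PER_TASTER = 9  # budget stated in Case 2

def score_totals(allocations):
    """Sum each wine's points, checking that every taster spends the full budget."""
    totals = {}
    for taster, scores in allocations.items():
        if sum(scores.values()) != POINTS_PER_TASTER:
            raise ValueError(f"{taster} did not allocate {POINTS_PER_TASTER} points")
        for wine, points in scores.items():
            totals[wine] = totals.get(wine, 0) + points
    return totals

# Hypothetical allocation 1: mild preferences, mirror images of each other.
even_handed = {
    "Taster 1": {"A": 4, "B": 3, "C": 2},
    "Taster 2": {"A": 2, "B": 3, "C": 4},
}
# Hypothetical allocation 2: taster 1 piles points onto wine A.
lopsided = {
    "Taster 1": {"A": 6, "B": 2, "C": 1},
    "Taster 2": {"A": 2, "B": 3, "C": 4},
}

tie_totals = score_totals(even_handed)
tilted_totals = score_totals(lopsided)
print(tie_totals)
print(tilted_totals)
```

The code cannot tell whether the second allocation is strategic or sincere; it only shows that the same preference order can produce either a tie or a clear win for A.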

Case 3 – Hulkower has one further concern – the potential arbitrariness of scale level choice when tasters are not all required to use the same number of points as illustrated by the following table:

Question: does the difference in number of points used reflect a true difference in how much the tasters liked the wines or is it simply the result of an arbitrary mean/starting point?

There are no conclusive answers to these questions. Hulkower’s approach eliminates any possible playing the system/arbitrary starting point. But Cicchetti makes a very legitimate case for using scores to measure taste intensity differences.
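To make the trade-off concrete, here is a small sketch of what converting scores to ranks does in a Case 3 situation. The numbers are hypothetical: one taster hands out 18 points in total and the other only 6. Ranking removes the difference in scale, and with it any information about intensity, which is exactly Cicchetti's objection.

```python
def to_ranks(scores):
    """Convert one taster's scores to ranks (1 = lowest score).
    Tied scores share the average of the ranks they occupy."""
    ordered = sorted(scores.values())
    ranks = {}
    for wine, score in scores.items():
        first = ordered.index(score) + 1   # rank of the first wine with this score
        count = ordered.count(score)       # number of wines sharing the score
        ranks[wine] = first + (count - 1) / 2
    return ranks

def totals(per_taster):
    """Sum points (scores or ranks) per wine across tasters."""
    out = {}
    for points in per_taster.values():
        for wine, value in points.items():
            out[wine] = out.get(wine, 0) + value
    return out

# Hypothetical scores: same spacing of preferences, very different scale use.
scores = {
    "Generous taster": {"A": 9, "B": 6, "C": 3},
    "Frugal taster": {"A": 1, "B": 2, "C": 3},
}

raw_totals = totals(scores)
ranked_totals = totals({t: to_ranks(s) for t, s in scores.items()})
print(raw_totals)      # the generous taster's favorite wins on raw points
print(ranked_totals)   # after ranking, the two tasters cancel out
```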

What can we learn about this issue from the Stellenbosch tastings? As Table 3 indicates, whether the wines were scored or ranked made very little difference in where they ended up.

Table 3. – Wine Ratings Using Scores/Ranks – Stellenbosch, 2013

Source: Data collected and tabulated by Neal Hulkower

But there is one further point on this subject coming out of the Stellenbosch tastings worth noting: the average scores of the tasters (Table 4). Note the spreads – from 91.9 to 65.8 for whites and 85.8 to 63.8 for reds. Does this represent real differences in likes and dislikes among the tasters? Or are these somewhat arbitrary starting points/norms for the tasters? I am inclined to believe that many of these differences are arbitrary.

Table 4. – Average Scores of Tasters, Stellenbosch Tastings, 2013

Source: Data collected and tabulated by Neal Hulkower

Rating Tasters

In the tasting results reported above, you might have noted that three wines in each tasting were the same. Here is why. Robert Hodgson has his own winery and has been troubled by what appeared to be erratic ratings his wines were receiving from judges at tastings. As a consequence, he has been analyzing judge performance at the California State Fair for over a decade. The key result is that only about 10% of the judges are consistent in their ratings, and this 10% are not the same judges from year to year. He concludes that competition awards have a major random component. To correct this problem, Hodgson has come up with a method to evaluate candidates for judging. The key to his method? Have the candidates do blind tastings that include more than one glass of the same wine in each tasting. If the candidates do not score glasses of the same wine nearly the same, they are not competent to judge wines. Hodgson’s suggested overall scheme is quite rigorous: candidates must do four blind tastings of ten glasses each. At each tasting, there are three glasses from the same bottle. And for a candidate to qualify as a judge, the scores given to the glasses of the same wine must be “close”. Hodgson presented a paper on the subject at the Stellenbosch meetings, and we employed an adapted version of his method at the tastings.

A priori, one would think that people who have flown from all over the world to attend a wine conference in Stellenbosch should be as competent a group to judge wines as any that could be assembled. Table 5 indicates how they scored the 3 glasses of wine poured from the same bottle. The results are organized by the “Spreads”, which are the differences between each taster’s highest and lowest scores for the same wine. It is worth noting how the tasters were instructed to judge the wines: anything below 70 was supposed to be “Poor/unacceptable” while 80+ was supposed to be “Good/above average”. Perhaps the top 11 tasters for both reds and whites did an acceptable job with spreads of 11 or less. But starting with tasters 8 and 27, one has to wonder about competency. And if this is the result for this group, one does have to wonder more generally about judge/taster competency.
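The spread check described above is simple enough to sketch. The taster names and scores below are hypothetical, and the cutoff of 11 is borrowed from the "spreads of 11 or less" remark rather than from any rule stated by Hodgson.

```python
MAX_ACCEPTABLE_SPREAD = 11  # assumption: taken from "spreads of 11 or less" above

def spread(scores):
    """Difference between the highest and lowest score given to the same wine."""
    return max(scores) - min(scores)

# Hypothetical scores for three glasses poured from one bottle.
triplicates = {
    "Taster X": [82, 85, 80],
    "Taster Y": [70, 88, 79],
}

spreads = {taster: spread(scores) for taster, scores in triplicates.items()}
consistent = {taster: s <= MAX_ACCEPTABLE_SPREAD for taster, s in spreads.items()}

# List tasters from the tightest spread to the widest.
for taster in sorted(spreads, key=spreads.get):
    print(taster, spreads[taster], "ok" if consistent[taster] else "inconsistent")
```

Note that a spread of 18 on a 50-100 scale can move the same wine from "Fair/Mediocre" to "Good/Above Average", which is why wide spreads raise questions about a taster's reliability.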

Table 5. – Tasters’ Scoring of the Same Wines, Stellenbosch Tastings, 2013
Source: Data collected and tabulated by Neal Hulkower

Eliminating the Boring Middle

At our session, Robin Goldstein gave a paper titled “Do Negative Ratings Mean More Than Positive Ratings”. And it got me thinking. It often happens that wines come out on top not because of some outstanding characteristic, but because of a lot of middle scores/rankings. And in fact, most wines today are “okay/mediocre”. We are not interested in more of these. Instead, we are looking for wines that either are “really special” or “should be avoided”. To get data on these good and bad wines, we should not look at average scores/rankings but instead at the “tails” – the very good and very bad scores. So let’s look at the “tails” at the Stellenbosch tastings. How should tails be defined? For the Stellenbosch tastings, remember the definitions the tasters were asked to use: <70=Poor/Unacceptable, 70-79=Fair/Mediocre, 80-89=Good/Above Average, 90-100=Excellent to Superior. We definitely are not interested in identifying Mediocre or worse wines, so the lower tail should be <80. For the top tail, I take >85 on grounds there might be some exceptional wines in the “Good/Above Average” category.
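A minimal sketch of the tail count, using the cutoffs just defined (below 80 for the low tail, above 85 for the high tail). The two wines and their scores are hypothetical and are chosen to show how a wine with the higher average can still have no high tails at all.

```python
LOW_TAIL_BELOW = 80   # scores under 80 count as a low tail
HIGH_TAIL_ABOVE = 85  # scores over 85 count as a high tail

def tail_counts(scores):
    """Return (high tails, low tails) for one wine's list of scores."""
    high = sum(1 for s in scores if s > HIGH_TAIL_ABOVE)
    low = sum(1 for s in scores if s < LOW_TAIL_BELOW)
    return high, low

def mean(scores):
    return sum(scores) / len(scores)

# Hypothetical scores from six tasters for two wines.
wines = {
    "Wine P": [86, 88, 78, 90, 72, 87],  # divides opinion
    "Wine Q": [84, 85, 84, 85, 84, 85],  # solid middle scores throughout
}

summary = {name: (mean(s), *tail_counts(s)) for name, s in wines.items()}
for name, (avg, high, low) in summary.items():
    print(f"{name}: mean={avg:.1f} high tails={high} low tails={low}")
```

Here Wine Q wins on average score, yet nobody found it exceptional, while Wine P drew four high tails along with two low ones.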

The results are presented in Table 6. What can be drawn from it? Certainly, any wine that does not get at least 3 high tails is not a good bet. That means that among Sauvignon Blancs, I would try the Boschendal before the De Morgenzon even though the latter got a higher score. The variation in tail scores among the 3 glasses of Du Toitskloof remains troubling. On the Pinotages, only the Steytler with 4 high tails would appear to be a good enough gamble to try again.

Table 6. – “Tails” Count, Stellenbosch Tastings, 2013

Unfortunately, I am not familiar enough with the South African wines to know whether focusing on the “tails” makes sense. So I complement these with “tails data” results from tastings of the Lenox Wine Club. Last fall, the Lenox Wine Club was formed with 14 members. Its members have been drinking wines for 30-40 years. We conducted 5 blind tastings, each of 5 wines: heavy whites, heavy reds, light reds, heavy red blends, and light whites. In all tastings, the scoring was done via a tight 1-5 ranking scheme, where 5 means the best and 1 means the worst (Borda Scores). Detailed results can be found on my web site.

For tails, I allow wines getting 5 (or tying) for the high tail and 1-2 (or tying) for the low tail. Table 7 provides the data. I offer several observations:

• For the Light Whites, the Picpoul and Bota Box tied. But I would probably try the Picpoul and the Vouvray again before the Box because they both had more high tails. The Greywacke was a real disappointment, measured by both scores and low tails.
• For the Heavy Red Blends, the Bota Box had more high tails than the score winner. It also had more low tails – you either like it or you do not. The Chateau Dominique Bordeaux – what a waste of money – live and learn.
• For the Light Reds, the tail counts suggest the Almaden Box is a more “interesting” wine than the Falernia score winner.
• In the Heavy Whites, the Box Set Chardonnay had the best score and the most tails. You will note the “winner” of the Princeton Tasting of whites was included – a low score and 6 low tails.
• The Bota Box and the Catena did well in both score and high tails in the Cab tasting.

Table 7. – “Tails” Count, The Lenox Club Tastings, 2013

Conclusions

The ranking versus scoring debate will continue. It is important to keep in mind the strengths and weaknesses of each. We should also keep in mind that many wine tasters are not able to discern wine differences. I conclude that the “tails” provide useful information to help in wine assessments. A final question: since most people drink wine with food, when will we start designing tastings to be done with food?

Post Script

In her welcoming speech to the Stellenbosch Conference, Helen Zille, the Premier of the Western Cape, made the point that wine (more generally alcoholic beverages) is a two-edged sword. On the one hand it provides great pleasure, reduces stress, and employs people. On the other hand, alcoholism leads to physical abuse, deaths on the highways, lost jobs, and family destruction. I could not agree more.