Thursday, July 24, 2008

Wine again

Levitt blogs again about Robin Goldstein - this time about Goldstein's new book on blind taste tests. From what I understand, Goldstein ran numerous blind tastings with average, everyday Americans - about 6,000 people took part in the experiment in total. Participants were all self-selected, so keep that in mind. But anyway, wines of different quality and price were kept in brown bags, from which the experimenters poured participants small glasses. Participants were then asked to fill out a survey on the wine. The earlier article linked to on this blog doesn't explain the methodology, but rather references the book, so to learn the precise experimental details, you have to pick up the book. This part from Levitt's post stuck out at me.
Not until I read Goldstein’s book did I realize just how weak the correlation was in blind tastings between expert evaluations and price in experimental settings. Yet, somehow Wine Spectator, which claims to do tastings blind (at least with respect to who the producer is), has an extremely strong positive correlation between prices and ratings. Hmmm … seems a bit suspicious.
I need to just read Goldstein's book, I guess. But in that earlier paper, presented at the American Association of Wine Economists, Goldstein and his coauthors actually find a strong, positive correlation between price and the preferences of what they call "experts." So I'm not sure what Levitt is talking about here. The negative correlation between price and preference exists only for the non-experts. People who have taken a class on wine, or who might otherwise be categorized as experts, prefer the more expensive wines even in blind tastings (even though they don't know which wines those are).

To me, this is really, really simple. There are increasing returns to education and experience in the appreciation of wine. That is, there is a learning component - the more wine you drink, the more discerning your palate becomes. To take it even further, the more you drink of it, the more you come to prefer certain kinds of wine. That a non-random selection (or even a random selection) of Americans can't tell the difference between good wine and bad wine hardly proves that there is no such thing as good or bad wine. It's like asking me to rank my preference for techno music versus classical music when I've maybe listened to samples of each a handful of times in my entire life. Or asking my kid to tell me whether he likes kale, mashed potatoes, or pizza. Of course he's going to pick pizza. Does that mean pizza is "better" in some objective sense? Seems to me like Levitt's committing the is/ought fallacy: just because something is the case doesn't mean it ought to be the case.

Update: One last thing. In that more recent paper by Goldstein and his coauthors, they appear to use a robust standard error correction. But now that I think about it - and mainly I'm thinking about something my friend Matt once said to me about his own experimental work - shouldn't they be clustering on the session? The errors are clearly correlated within a session. The tastings were blind, yes, but they were done in public view of one another, and the surveys were filled out at tables where people could see one another's answers. Shouldn't you cluster at the session level? This would affect inference, and I'd be interested in how it changed the results. I also really wish that, instead of price, he were controlling for a quality rating based on Wine Spectator or Wine Advocate scores. Price is interesting, but the real measure of a wine's quality is going to be the "expert" ratings. I'd like to see both price and quality controlled for, and then see whether preferences follow or not.
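To make that concrete, here's a rough sketch in Python of the two corrections side by side. The variable names (rating, log_price, expert, session) and the file are hypothetical stand-ins, since I don't have their data:

import pandas as pd
import statsmodels.formula.api as smf

# One row per pour: a survey rating, the (logged) price, an expert
# dummy, and an identifier for the session it came from. These names
# are hypothetical stand-ins for whatever the paper's data look like.
df = pd.read_csv("tastings.csv")

model = smf.ols("rating ~ log_price * expert", data=df)

# White-style robust correction: handles heteroskedasticity but still
# treats every tasting as independent of every other one
robust = model.fit(cov_type="HC1")

# Clustered correction: allows arbitrary correlation among tastings in
# the same session (shared tables, shared bottles, visible answers)
clustered = model.fit(cov_type="cluster", cov_kwds={"groups": df["session"]})

print(robust.bse)     # standard errors under independence
print(clustered.bse)  # usually larger when within-session correlation exists

The point estimates are identical either way; only the standard errors, and therefore the inference, move.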

2 comments:

Matthew Pearson said...

Not clustering standard errors is a big red flag for me. If they didn't do it, they may have a good reason, but it's weird that they don't explain it. These days you want to cluster whenever you think the errors might be correlated, and an experimental session is just such a case. In addition to the ways correlation might enter through participants influencing each other, there are also "common shocks" (see Manski 1993) to the groups, such as time of day, day of week, climate, and variation in bottle quality or temperature exposure. That stuff gets absorbed into the error term, and those elements of the error are going to be common to the session. I think that's right anyway.

scott cunningham said...

Yeah, I was thinking the same thing - really, your points about this from your own experimental work are what made me wonder why they aren't doing it. The paper is very vague on the standard error correction. Nothing even about White. It just says "robust p-values in parenthesis," which I'm assuming means a simple robust correction, like Stata's. But even that's unclear. It's definitely nothing like a clustered correction at the session level, and if you look at the YouTube video on the book's website, you'll see what appears to me to be the possibility of common shocks everywhere. All of it may be blind, but it's all done publicly and together. I suspect that, if nothing else, clustering would probably wipe out the inference on the "expert" dummy, since the joint significance of that interaction is only marginal (p-value of 0.095).
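Just to convince myself, here's a toy simulation of the common-shock worry - entirely my own construction, nothing to do with their actual data - showing how a plain robust correction over-rejects a true null when both the regressor and the error share a session-level component:

import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n_sessions, per_session, reps = 30, 20, 500
reject_robust = reject_cluster = 0

for _ in range(reps):
    session = np.repeat(np.arange(n_sessions), per_session)
    # regressor with a shared within-session piece (say, which bottles
    # got poured) plus idiosyncratic noise
    x = rng.normal(size=n_sessions)[session] + rng.normal(size=session.size)
    # error with a shared within-session piece (the "common shock")
    e = rng.normal(size=n_sessions)[session] + rng.normal(size=session.size)
    y = e  # true coefficient on x is zero by construction
    X = sm.add_constant(x)
    fit_r = sm.OLS(y, X).fit(cov_type="HC1")
    fit_c = sm.OLS(y, X).fit(cov_type="cluster", cov_kwds={"groups": session})
    reject_robust += fit_r.pvalues[1] < 0.05
    reject_cluster += fit_c.pvalues[1] < 0.05

print("robust rejection rate:   ", reject_robust / reps)   # well above 0.05
print("clustered rejection rate:", reject_cluster / reps)  # close to 0.05

With that kind of size distortion, a p-value of 0.095 under a plain robust correction wouldn't survive.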