Gas station without pumps

2012 November 11

Barlow bashing data

Filed under: Uncategorized — gasstationwithoutpumps @ 10:44

A lot of academic bloggers are commenting on the recent hype about digital badges (as I did yesterday), but some of them seem to have gone on to bash many fields of academic research as well.

For example, Aaron Barlow, the executive editor of the Academe Blog (an opinion column published by the AAUP, but not reflecting official AAUP positions), wrote a post Badges! Two: Why Data Is Never Enough. In it he misinterprets Gödel’s proof and admiringly quotes Esther Quintero:

The irony, of course, is that this notion is actually contrary to the scientific process. Being data-driven is only useful if you have a strong theory by which to navigate; anything else can leave you heading blindly toward a cliff.

Now perhaps Quintero’s quote has been taken out of context and misapplied, but as stated, it is simply incorrect and denigrates discovery science.  A lot of big data science projects are not based on “strong theory”, but on the notion that unbiased collection of data can lead to a lot of hypothesis generation—hypotheses that could not have been conceived if the theory had to come before the data was collected.  The notion of a strong theory to test works well in model-driven fields like physics, but not in fields that study more complicated systems, like biology.

Bioinformatics is a field that is not driven by strong theories but by data.  We have found it very valuable to have large amounts of high-throughput data—looking at every base of a genome, every mRNA molecule that is expressed, every methylation of cytosines, DNA from every bacterium in an environmental sample, … .  Analysis of such big data sets is often difficult, but can lead to some surprising discoveries—coincidences that no one would have thought to look for in hypothesis-based research.

It is true that bioinformatics with big data usually leads to hypothesis generation, not confirmation, which usually requires more focused experimentation.  In large part this is because the huge number of “hypotheses” that are being examined in discovery research will inevitably lead to some looking very well-supported by the data used to discover them, when they are really just coincidences or experimental artifacts.

Generally, bioinformaticians try to report their results with robust estimates of the false positive rate—in essence saying things like “here are 50 interesting things I found in the data, but 5–10 of them are probably wrong”. The 50 or so interesting things are usually pulled from a pool of thousands or millions of possibilities. More targeted experiments, generally using methods that will make different errors than the methods used for the original data, are then done to try to figure out which of the hypotheses are real finds and which are just coincidences.
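
To make that concrete, here is a minimal sketch (in Python, with made-up p-values purely for illustration) of the Benjamini–Hochberg procedure, one standard way of screening a large pool of candidate hypotheses down to a short list with a controlled false-discovery rate:

```python
# A minimal sketch (not from any particular analysis) of Benjamini-Hochberg
# false-discovery-rate control: given p-values for many candidate hypotheses,
# keep only those that survive at a chosen false-discovery rate.

def benjamini_hochberg(p_values, fdr=0.10):
    """Return the indices of hypotheses accepted at the given false-discovery rate."""
    m = len(p_values)
    # Sort the p-values, remembering their original positions.
    order = sorted(range(m), key=lambda i: p_values[i])
    # Find the largest rank k such that p_(k) <= (k/m) * fdr.
    cutoff_rank = 0
    for rank, i in enumerate(order, start=1):
        if p_values[i] <= rank / m * fdr:
            cutoff_rank = rank
    # Accept the cutoff_rank smallest p-values.
    return sorted(order[:cutoff_rank])

if __name__ == "__main__":
    # Made-up p-values standing in for a screen of candidate discoveries.
    p_values = [0.0001, 0.0004, 0.002, 0.009, 0.04, 0.2, 0.5, 0.8]
    hits = benjamini_hochberg(p_values, fdr=0.10)
    print(f"Accepted {len(hits)} of {len(p_values)} candidates:", hits)
```

The numbers above are invented; in a real analysis the p-values would come from thousands or millions of tests, which is exactly why this sort of multiple-testing accounting matters.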

But the big data is needed to come up with the hypotheses to test, and Quintero and Barlow’s disdain for collecting data is an insult to many scientists.

One irony of Barlow’s article is that he uses the hoary drunk-looking-for-his-keys-under-the-streetlight analogy to support his argument, but that analogy applies equally well to the theory-first error he makes about using data.  Limiting your search, either by what data is easy to collect or by what theories you find easy to believe, is going to severely limit what conclusions you can reach.

Of course, I’m in agreement with some of what Barlow has to say.  For example,

Real education is a great deal more than reaching numerical benchmarks

and

The rubrics for writing exams, no matter how flexible we want to make them, always seem to drag the writers back to the five-paragraph theme. Why? Because they always focus on things that can easily be counted (paragraphs, sentences, spelling errors, sentence types, punctuation variety, supporting points, etc.). They have to, or the conclusions reached will be different for different readers.

The problem with that? Someone who writes brilliantly, but in another fashion, will always fail–even when the reader knows the writing is better than anything encompassed in the grading rubric. The reader isn’t given the freedom of deciding on his or her own, but only to make a decision based on the ‘countables’ of the rubric.

If Barlow had limited himself to talking about the low quality of the data available to educators and how poorly the data measured what we were really interested in, I’d have been posting a positive summary of his post.  But his jump to bashing data-driven fields as a whole makes his post unacceptable to me.

3 Comments

  1. Thanks for the response but, well… as I said, I do simplify Godel, but I don’t think I misinterpret him (if I do, please show me how he is not saying a system cannot be both complete and consistent). Nor do I take Quintero out of context. Read her post and you will see. Also, you will see that I use the ‘streetlight’ metaphor because she does and because it provides a quick shortcut. Because something is a cliché does not mean it is not useful (though I might have found a different one, had she not started on this one).

    You may also see that we are not really all that far apart in what we are saying. I don’t agree with Quintero completely for just the reason you give, for she puts too much emphasis on theory as the starting point. If you would read carefully, you would see that I don’t subscribe to a simple ‘theory first’ approach, but say that there’s a needed on-going process that includes both data and theory. In no way do I insult data collection: it is the poor use of data that bothers me… and my point is that data and data collection are never enough. That, to me, is the problem with our current mania for data-driven assessment.

    Again, thanks. Your comments, even if I don’t always find them quite complete or satisfying, are always intelligent and thoughtful.

    Comment by Aaron Barlow — 2012 November 11 @ 11:00

    • You said “Kurt Godel, in his famous proof, showed (to put it most simplistically) that no system can be both complete and consistent.” Gödel’s proof is not so sweeping as that. There can be systems that are both complete and consistent—they just can’t contain the axioms that define the integers. What Gödel showed was that “no consistent system of axioms whose theorems can be listed by an ‘effective procedure’ (e.g., a computer program, but it could be any sort of algorithm) is capable of proving all truths about the relations of the natural numbers (arithmetic).” [http://en.wikipedia.org/wiki/G%C3%B6del%27s_incompleteness_theorems]. This is a somewhat narrower result than what you said and is not really applicable to the rather fuzzy world of educational theories, where proofs simply don’t exist.

      I’m glad to hear that you don’t agree with Quintero completely (though that did not come across in your original post, where you sounded very admiring).

      I agree with you that data collection alone is not enough to set policies or make decisions, but your post seemed to me to argue against data collection that was not theory-driven or theory-guided. The experience in bioinformatics with high-throughput biology experiments has been that unbiased data collection (and public release of the data) has been a major driver in coming up with new theories to test. Only weak theories (that genomic differences are important in cancer, or that epigenetic markers like methylation are important) are needed to guide the data collection.

      Comment by gasstationwithoutpumps — 2012 November 11 @ 11:32

      • Yes, if you take Godel no further than arithmetic, you are stuck with his theory in terms of axioms. But the greater implication still stands… which is simply that knowledge can’t be reduced only to systems for, when that is tried, the particular system is found to be either inconsistent or incomplete. If you can find a system that can consistently cover all instances (that is not limited, in other words, to a subset of the world), I would love to hear of it.

        And, of course weak theories are the starting point for guiding data collection, but the data collected then guides revision of the theories. It’s a process.

        Thanks again. I love the discussion–even though I think we are moving away from the main point, which is the misuse of data… not its value.

        Comment by Aaron Barlow — 2012 November 11 @ 11:47

