Gas station without pumps

2014 June 21

2014 AP Exam Score Distributions

Once again this year, I’m posting a pointer to 2014 AP Exam Score Distributions:

Total Registration has compiled the following scores from Tweets that the College Board’s head of AP, Trevor Packer, has been making during June. These are preliminary breakdowns that may change slightly as late exams are scored.

Disclaimer: I have no connection with the company Total Registration, and do not endorse their services.  If the College Board would collect Trevor’s comments themselves, I’d point to that page.  The main interest in AP result distributions comes in May, when students are taking the tests, and in July, when the students get the results.

The official score distributions (still from 2013 as of this posting) are on the College Board’s web site, at least until the College Board scrambles that site again, which they do every couple of years, breaking all external links.  They post a separate PDF file for each exam, which makes comparison between exams more difficult (deliberately, I believe, since inter-exam comparison is not really a meaningful thing to do).  It is also difficult to get good historical data on how the exam scores have changed over time. The College Board probably has it on their website somewhere, but finding stuff in their morass is not easy.

Views for my 2011 AP distribution post show the May and July spikes. This has been my most-viewed blog post, which is a bit embarrassing, since it has little original content.

My 2013 AP distribution post has not been as popular, probably because of search engine placement at Google.

My most popular post this year was How many AP courses are too many?, with about 10 views per day.  (It has also come in third over the lifetime of the blog, behind 2011 AP Exam Score Distribution and Installing gnuplot—a nightmare.) The question of how many AP courses seems to come up both in the fall, when students are choosing their schedules, and in the spring, when students are overwhelmed by how many AP courses they took.

The one AP exam my son took this year was AP Chemistry, for which only 10.1% got a 5 this year and about 53% passed (3, 4, or 5). We won’t have his score for a while yet, so we’re keeping our fingers crossed for a 5.  He finished all the free-response questions, so he’s got a good shot at it.

The Computer Science A exam saw an increase of 33% in test takers, with about a 61% pass rate (3, 4, or 5). The exam scores were heavily bimodal, with peaks at scores of 4 and 1.  I wonder whether the new AP CS courses that Google funded contributed more to the 4s or to the 1s. I also wonder whether the scores clustered by school, with some schools doing a decent job of teaching Java syntax (most of what the AP CS exam covers, so far as I can tell) and some doing a terrible job, or whether the bimodal distribution is happening within classes also.  I suspect clustering by school is more prevalent. The bimodal distribution of scores was there in 2011, 2012, and 2013 also, so it is not a new phenomenon.  (Calculus BC has seen a similar bimodal distribution in past years; the 2014 distribution is not available yet.)

Update 2014 July 13: all score distributions are now available, and Calculus BC is indeed very bimodal with 48.3% 5s, 16.8% 4s, 16.4% 3s, 5.2% 2s, then back up to 13.3% 1s. Calculus AB has a somewhat flatter distribution, but the same basic shape: 24.3% 5s, 16.7% 4s, 17.7% 3s, 10.8% 2s, and 30.5% 1s. Overall calculus scores are up this year.  The 30.5% 1s on Calculus AB indicates that a lot of unprepared students are taking that test.  Is this the “AP-for-everyone” meme’s fault?

Physics B scores were way down this year, and Physics C scores way up—maybe the good students are getting the message that if you want to go into physical sciences, calculus-based physics is much more valuable than algebra-based physics. I expect that the algebra-based physics scores will go up a bit next year when they roll out Physics 1 and Physics 2 in place of Physics B, but that the number of students taking the Physics 2 exam will drop a lot.  I don’t expect a big change in the number of Physics C exam takers—schools that are offering calculus-based physics will not be changing their offerings much just because the College Board wants to have more low-level exams.

AP Biology is still seeing the nearly normal distribution of scores, with 6.5% 5s and 8.8% 1s, so there hasn’t been a return to the flatter distribution of scores seen before the 2013 test change.

As always, the “easy” AP exams see much poorer average scores than the “hard” ones, showing that self-selection of who takes the exams is much more effective for the harder exams. When College Board and the high-school rating systems push schools to offer AP, the schools generally start by offering the “easy” courses, and push students who are not prepared to take the exams.  As long as we have stupid ratings that look only at how many students are taking the exams, rather than at how many are passing, we’ll see large numbers of failed exams.

2014 January 17

CS commenters need to learn statistics

There was a recent report about how many students were taking AP CS exams, breaking out the information by gender, race, and state, which has been released in a few different forms.  Mark Guzdial’s blog post provides pointers to the data collected by Barbara Ericson.  Some of the comments on that post show an appalling lack of statistical reasoning (like comparing states by subtracting percentages of different things).

So what are the interesting questions to ask of the data and how should they be handled statistically?

Most of the “gee-whiz” statements are about how few people in some group or other took (or passed) the AP CS exam:

  • No females took the exam in Mississippi, Montana, and Wyoming.
  • 11 states had no Black students take the exam: Alaska, Idaho, Kansas, Maine, Mississippi, Montana, Nebraska, New Mexico, North Dakota, Utah, and Wyoming.

Some people pointed out that some of these numbers may not be more than a small sample effect (no one took the exam in Wyoming, so having zero female test takers is not surprising).  How can we best state that a number is interesting?

Generally, this is done by creating a null model: one that computes the probability of different outcomes based on everything except the hypothesis being tested.  Then you look at how surprising the observed outcome is given the null model.  Exactly how the null model is constructed is crucial, as all that the statistical tests tell you is how badly your null model fits the data.

What sort of mathematical model should we be using for assigning probabilities to numbers of test takers (or numbers passing the test)?  One convenient one is a binomial distribution.  The binomial distributions are a family of distributions over non-negative integers with two parameters N and p.  They are good for modeling the count of a number of independent events each of which occurs with some fixed probability.  If we think of each high school student in a state as having some (small) probability of taking the exam, then the number of exam takers can be modelled as a binomial distribution whose N value is the number of students and p the probability that each one takes the exam.  When N is large (as it would be for the number of high school students in a state) and Np is reasonably large, then the binomial distribution can be approximated by a normal distribution with mean Np and variance Np(1-p), but when p is small an even better approximation is the Poisson distribution with mean Np, which is what I’ll use here. The probability of zero test takers: P(0)= \binom{N}{0} p^0 (1-p)^{N-0} = (1-p)^N \approx e^{-Np}.
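As a quick numerical check of that last approximation, here is a minimal Python sketch. The values of N and p are made up for illustration (they are not from the AP data), and the function names are mine.

```python
import math

def p_zero_binomial(n: int, p: float) -> float:
    """Exact binomial probability of seeing zero events in n trials."""
    return (1.0 - p) ** n

def p_zero_poisson(n: int, p: float) -> float:
    """Poisson approximation to the same probability, using mean n*p."""
    return math.exp(-n * p)

# Hypothetical numbers: 100,000 students, each taking the exam
# with probability 2e-5, so the expected count is N*p = 2.
n, p = 100_000, 2e-5
print(f"binomial P(0) = {p_zero_binomial(n, p):.6f}")
print(f"Poisson  P(0) = {p_zero_poisson(n, p):.6f}")
```

With a large N and a small p the two values should agree to several decimal places, which is why the Poisson form is convenient here.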

So all we need to set the parameters of our null model is an expected number of test takers based on everything except what we wanted to test.  For example, if we wanted to test whether black test takers were under-represented in Maine, we would need a model that predicted how many black students would take the test, perhaps using the probability that students in Maine would take the test independent of race and the fraction of students in Maine that are black.  For Maine, there were 161 test takers, and 0 black test takers.  I don’t know the racial mix of high school students in Maine, but Wikipedia gives the black fraction of the whole state population as 1.03%.  Thus the expected number of black test takers is 1.658, and we can use e^{-1.658} as the probability of seeing zero black test takers by chance.

UPDATE: 2014 Feb 1.  Some values in the following table corrected, due to clerical errors in copying from spreadsheet (I’m not sure which I hate worse, spreadsheets or HTML tables—they’re both awful formats).

state         # test takers  state % black  expected black test takers  under-rep p<
Alaska        21             4.27%          0.897                       0.41
Idaho         47             0.95%          0.447                       0.64
Kansas        47             6.15%          2.891                       0.056
Maine         161            1.03%          1.658                       0.19
Mississippi   1              37.3%          0.373                       0.69
Montana       11             0.67%          0.074                       0.93
Nebraska      46             4.50%          2.070                       0.126
New Mexico    57             2.97%          1.693                       0.184
North Dakota  9              1.08%          0.097                       0.91
Utah          103            1.27%          1.308                       0.27
Wyoming       0              1.29%          0                           1
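The table can be reproduced with a few lines of Python. This is only a sketch: it uses the corrected test-taker counts and the whole-state black population percentages given in the table, and the list and function names are mine.

```python
import math

# (state, AP CS test takers, black fraction of state population),
# using the corrected values from the table.
STATES = [
    ("Alaska", 21, 0.0427),
    ("Idaho", 47, 0.0095),
    ("Kansas", 47, 0.0615),
    ("Maine", 161, 0.0103),
    ("Mississippi", 1, 0.373),
    ("Montana", 11, 0.0067),
    ("Nebraska", 46, 0.0450),
    ("New Mexico", 57, 0.0297),
    ("North Dakota", 9, 0.0108),
    ("Utah", 103, 0.0127),
    ("Wyoming", 0, 0.0129),
]

def zero_count_probability(test_takers: int, fraction: float) -> tuple[float, float]:
    """Return (expected count, Poisson probability of observing zero)."""
    expected = test_takers * fraction
    return expected, math.exp(-expected)

for state, takers, fraction in STATES:
    expected, p = zero_count_probability(takers, fraction)
    print(f"{state:13s} expected={expected:6.3f}  P(0)={p:.2g}")
```

The null model here is the simplest one available: each test taker is black with probability equal to the state's black population share. A better model would use the racial mix of high school students rather than of the whole state.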

Even before we do a correction for having 51 hypotheses (50 states plus District of Columbia), none of these “no black students” states shows significant under-representation of black students. None of the states had so few test takers that a black test taker would have been surprising (except Wyoming, which had no test takers at all).

One can do similar computations to show that the lack of women in Mississippi, Montana, and Wyoming is not surprising.  Montana looks surprising if treated as a single hypothesis (p<0.004), but not after multiple-hypothesis correction (E-value=0.21). Even combining all three states (which increases the number of hypotheses enormously and would call for a stronger multiple-hypothesis correction), the under-representation of women in those states is not statistically significant.

There are states that do have significant under-representation of women: for example, Utah had 103 test takers, only 4 of whom were women. With an expected number of about 51.5, this is p<1.4E-16. Even with 51× multiple-hypothesis correction, this under-representation is hugely significant.  Looking nationwide, total counts were 5485 female test takers out of 29555 total test takers.  That’s p<1.4E-1677. The highest percentage of female test takers was in Tennessee, with 73 out of 251, which is p<2.6E-7, again highly significant.
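When the observed count is not zero, the p-value is the lower tail of the Poisson distribution rather than just the zero term. The sketch below assumes a null model in which half of the test takers would be women and uses the 51× correction from the text; the function names are mine.

```python
import math

def poisson_cdf(k: int, mean: float) -> float:
    """P(X <= k) for X ~ Poisson(mean), summing the terms directly."""
    term = math.exp(-mean)  # P(X = 0)
    total = term
    for i in range(1, k + 1):
        term *= mean / i    # P(X = i) from P(X = i - 1)
        total += term
    return total

def under_representation(observed: int, total: int,
                         fraction: float = 0.5,
                         n_hypotheses: int = 51) -> tuple[float, float, float]:
    """Return (expected count, one-sided p-value, E-value)."""
    expected = total * fraction
    p = poisson_cdf(observed, expected)
    return expected, p, p * n_hypotheses

# Montana: 0 women out of 11 test takers.
expected, p, e_value = under_representation(0, 11)
print(f"Montana   expected={expected}  p={p:.2g}  E={e_value:.2g}")

# Tennessee: 73 women out of 251 test takers.
expected, p, e_value = under_representation(73, 251)
print(f"Tennessee expected={expected}  p={p:.2g}  E={e_value:.2g}")
```

The direct sum is fine for state-sized counts, but for the nationwide numbers `math.exp(-mean)` underflows to zero in double precision, so a p-value that small would have to be computed in log space.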

Tennessee also had a high proportion of black test takers with 25 out of 251.  With an expected number of 42.12, this is p<0.003 (still significantly under-represented).  To see if black students were under-represented nationwide, one would have to add up the expected numbers for each state and see how the actual number compared with the expected number.  (I’m certain that the under-representation is hugely significant since even the states with high numbers of black test takers are under-represented,  but I’m too lazy to do the multiplication and addition needed.)

The case can clearly be made for female and black students being under-represented, though pointing to the states with 0 female or 0 black test takers is not the way to do it. (From a marketing standpoint, rather than a statistical one, shouting “no black test takers in these states”, “no female test takers in these other states” may be exactly the right way to get attention, even though the real story about blacks and females is in the states where there were enough test takers to say something about them after dividing them into subgroups.)

A case could also be made for some states having far fewer CS AP test takers than others.  One would need to come up with an expected number of test takers from some model (for example, by state population as a share of national population, or by number of total AP test takers in state as share of national total AP test takers).  The second model would correct for state-to-state differences in age distribution or in popularity of AP exam taking in general.  One could also base predictions on some other STEM test, such as AP Calculus, if one wanted to control for different amounts of STEM instruction in different states.

Let’s look at the states with no black test takers again, to see if they are significantly under-represented in CS.  There were 29555 AP CS tests taken nationwide and 3,824,691 AP tests nationwide total, so we would expect the CS tests taken in a state to be 0.77% of the total for the state.

state         # CS test takers  # all test takers  expected CS test takers  p <       E-value
Alaska        21                4570               35.31                    0.0066    0.34
Idaho         47                9723               75.13                    3.3E-4    1.7E-4
Kansas        47                15339              118.53                   6.25E-14  3.2E-12
Maine         161               14051              108.58                   0.9999
Mississippi   1                 9032               69.79                    3.5E-29   1.8E-27
Montana       11                4868               37.62                    3.4E-7    1.7E-5
Nebraska      46                11117              85.91                    1.7E-6    8.8E-7
New Mexico    57                13365              103.28                   4.7E-7    2.4E-5
North Dakota  9                 2295               17.73                    0.018     0.91
Utah          103               35721              276.03                   5.6E-23   2.8E-21
Wyoming       0                 2050               15.84                    1.3E-7    6.7E-6
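The expected counts and p-values for this second null model can be sketched in Python as follows. The national totals are the ones quoted above; the Poisson lower-tail helper and the names are mine, and the E-value is simply 51 times the p-value.

```python
import math

NATIONAL_CS = 29_555       # AP CS tests taken nationwide
NATIONAL_ALL = 3_824_691   # all AP tests taken nationwide
CS_SHARE = NATIONAL_CS / NATIONAL_ALL  # about 0.77%

def poisson_cdf(k: int, mean: float) -> float:
    """P(X <= k) for X ~ Poisson(mean), with each term computed in log space."""
    return sum(
        math.exp(i * math.log(mean) - mean - math.lgamma(i + 1))
        for i in range(k + 1)
    )

def cs_under_representation(cs_takers: int, all_takers: int,
                            n_hypotheses: int = 51) -> tuple[float, float, float]:
    """Return (expected CS test takers, one-sided p-value, E-value)."""
    expected = all_takers * CS_SHARE
    p = poisson_cdf(cs_takers, expected)
    return expected, p, p * n_hypotheses

for state, cs, total in [("Alaska", 21, 4570),
                         ("Mississippi", 1, 9032),
                         ("Wyoming", 0, 2050)]:
    expected, p, e_value = cs_under_representation(cs, total)
    print(f"{state:12s} expected={expected:6.2f}  p={p:.2g}  E={e_value:.2g}")
```

Swapping in a different null model (state population share, or AP Calculus test takers) only changes how `expected` is computed; the tail probability and the correction stay the same.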

Of these eleven states, all but Maine appear to be under-represented in CS test takers (Maine is significantly over-represented in CS test takers).  When I do the multiple-hypothesis correction for having 51 different “states” (including the District of Columbia), the mild under-representation in Alaska and North Dakota is no longer significant, but the other eight are.

So the zero black AP CS test takers for those eight states can be fairly confidently attributed to the lack of AP CS test takers, and in Maine to the shortage of black students.  For Alaska and North Dakota, the lack of black AP CS test takers is probably due to the shortage of AP CS test takers in the state.

One can generalize the techniques here to any method of predicting the mean number of students in some category, to see whether the observed number is significantly smaller than the predicted number.  When the predicted number is small, even 0 students may not be statistically significant under-representation.

2013 July 21

Automatically grading programming homework poorly

Filed under: Uncategorized — gasstationwithoutpumps @ 11:39

Mark Guzdial’s post Automatically grading programming homework: Echoes of Proust pointed me to an MIT press release, Automatically grading programming homework, which starts with the claim

Researchers at MIT’s Computer Science and Artificial Intelligence Laboratory (CSAIL), working with a colleague at Microsoft Research, have developed a new software system that can automatically identify errors in students’ programming assignments and recommend corrections.

While such a debugging aid may be useful for finding and correcting common errors in small programs (such as those found in beginning programming courses), it does not address what I see as one of the main goals of hand-grading student programs: evaluating how well their programs are structured and documented. I often spend as much time grading the comments, decomposition into procedures or methods, data structures chosen, error-handling, and variable names as the algorithmic details—none of which are addressed by this grading scheme.

In a comment on Guzdial’s post, Edward Bujak wrote,

Proust automatic grading caught 85% of the semantic errors since the domain (the specific program) was specified. In a short 6-month consultant position at ETS, we replicated this with the automated grading of the APCS free response questions. This was when the APCS test was in Pascal. The program prompt was known so an expert System (ES) was trained for that problem which would query an Abstract Syntax Tree (AST) dynamically constructed from the students submitted program. We graded on good and bad. The grading was reliable, consistent, and outperformed humans. It never saw the light of day.

Given that the AP CS problems are tiny coding problems that are not checked for the things I look for in hand grading anyway, it seems like an ideal application of automated grading of programs.  The syntax parser would have to be very forgiving (as able to recover from missing semicolons or mistyped variables as a human grader) to grade fairly.

Of course, the AP exams are handwritten, not typed, and OCR is still too unreliable for grading hand-written exams.  The cost and error rate of the data entry needed to enable automatic grading of AP CS exams probably exceed those of hand grading the exams, so it is no wonder that the expert system Bujak worked on never saw the light of day.  Perhaps someday the AP exams will be done with keyboard entry, but the extra opportunities that introduces for cheating make it unlikely to be adopted any time soon.

I suspect that an adequate automatic grader for CS1 problems is possible (if you ignore comments, programming style, variable names, and other important things that CS1 should teach), by combining the generic automatic debugging approaches MIT is using with the problem-specific expert systems of Proust and whatever Bujak worked on for ETS.  The effort may be useful for making MOOCs a little less awful at grading, though it would not help with other problems with the pedagogic approaches of mass instruction.

2013 July 4

AP Computer Science MOOC

Filed under: Uncategorized — gasstationwithoutpumps @ 12:14

One approach that is being tried next year to get around the lack of CS instructors in high schools is the first AP Computer Science MOOC.

I have no idea how well this MOOC will work—online education for high schoolers has rather mixed results.  A number of home-schooled students are relying on college-level MOOCs for their instruction, but the drop-out rate is large and the amount of feedback they get usually too little for high school students (probably too little for college students also).

At least they got an experienced high-school AP CS teacher to teach the MOOC:

Rebecca Dovi has been teaching high school computer science for over 16 years.

She currently teaches in Hanover County, Virginia where she heads the computer science curriculum committee. She is among 10 secondary school teachers nationwide selected to pilot the new CS Principles course under development by College Board.

One of the other concerns with MOOCs, the lack of verifiable measures of student achievement, is alleviated with this course, as the AP CS A exam provides a fairly well-accepted means of final assessment.

My main concern would be whether students get enough feedback on their programming assignments to learn how to structure and document programs properly—something that is labor-intensive but essential for students to really learn the material properly.  Unfortunately, that is not something easily measured on a 3-hour test like the AP exam, so even decent results on the exam may not tell us whether the students are learning as much as they ought to.  (Of course, the same can be said of in-person AP CS courses—we have no guarantees that the students have learned anything not tested on the exam.)

I think that AP CS does make a good test case for high-school MOOCs—there are few places currently teaching computer programming in high school, and an online course is better than no course.  Aligning the MOOC to the AP test makes it more attractive to high school students and more likely to get high school credit than a random CS MOOC.

Because there are so few high schools teaching CS, the MOOC is not going to displace many teachers using better teaching techniques.

2012 August 3

What comes after AP CS?

Filed under: home school — gasstationwithoutpumps @ 14:29

A parent on one of the home-school mailing lists I subscribe to asked my advice about computer science courses past AP CS (in private e-mail, not on the list).

My high schooler is looking for recommendations on possible paths beyond AP CS.  They could be online or classroom (SF bay area) or hybrid.  I am aware of the eIMACS University Computer Science 1 and 2 courses that use Scheme, but not sure about usefulness of learning Scheme.  Also, in eIMACS progression, these courses are pre-requisites for their APCS course, so I wonder if these may be suitable after AP CS.  Other than these I am aware of Udacity/Coursera, etc.

This is a hard request for me to answer, because my son’s journey through computer science has been highly idiosyncratic.  He has been programming for about 6 years and is fairly skillful now in Python (including doing some multi-threaded code, some work on GUIs, and optimization with Pyrex), but has not learned Java yet (he’ll get that this coming year, so he can take the AP CS exam, which is highly Java-specific).  My son did do a course using Scheme in 2009–10, and I think it was a valuable addition to his programming knowledge.  It wasn’t the eIMACS course, so I can’t comment on those courses.

Generally, learning very different programming languages helps students build “notional models” that are abstracted away from features of specific languages, making it easier for them to learn new languages in future.  Learning a LISP-based language (like Scheme) often provides a better understanding of recursion, linked lists, trees, and pointers than an Algol derivative like C, Pascal, or Java.  So the eIMACS courses may be useful even for someone who has done well in a Java course.

The usual advice after AP CS would be to take a data structures course, since the heavy load of Java syntax in AP CS usually means that the students have learned almost nothing about how to use data structures, just the syntax for them.  If the student is already familiar with lists, trees, and hash tables, then an algorithms course would be appropriate.  For students thinking of a computer engineering career, it would be best to take a computer architecture class that includes both C and assembly language programming.  (My son has had a little C and C++ programming on the Arduino, including some work with interrupts, but has not gotten into assembly language yet.)

A more theoretical path would suggest taking an applied discrete math course (combinatorics, mathematical induction, Boolean algebra), followed by analysis of algorithms.  I think my son will be taking applied discrete math (as well as picking up Java) in the coming year, since he still needs 2 more years of math in high school, and he has already finished single-variable calculus (the Art of Problem Solving online course, followed by the AP Calculus BC exam).

I’ve seen lots of links for online courses:

I don’t have the time or the energy to look through all these (and undoubtedly many more) sites for suitable courses—especially as my son hates video lectures, so there is no point in looking at 95% of the online courses.  The Saylor courses look interesting, as they seem to rely mostly on reading and doing exercises, with only a few video lectures.  They also seem to follow the standard CS undergrad curriculum.

What I recommend for someone in the SF Bay Area is much simpler: take community college classes.  They are very cheap, and diligent students who come to every class can usually get in before the add deadline, even when there is a waiting list, because so many students stop attending.

