World Scientific Publishing recently sent me round four of proofs (round three had terrible problems with scrambled citations, so I sent citation fixes and only a little else on that round—they still had not fixed some serious errors from round two). I’ve been working pretty much full-time on reading the proofs lately, trying to catch as many of the problems as I can.
Proofreading is different from copy editing—I’m not taking a single PDF file and looking for typos, spelling errors, punctuation errors, and other minor problems, but taking two different PDF files and comparing them. I’m looking for all the differences between their version of the book and my version, then deciding for each difference whether to correct their version, correct my version, or allow them to remain different.
You would think that there would be good tools for taking differences of PDF files, but I haven’t found one (at least, not a free one). All the tools I found were designed to work a page at a time, assuming that the files were almost identical. But World Scientific has set the book in a smaller font, with less white space, so their version of the book has about 12% fewer pages than mine. So I fell back on ancient tools like diff (originally written in 1974).
The diff program compares two text files, trying to come up with a small set of insertions and deletions that convert one file into the other. So the first task was to extract text from the PDF files. Both my son and I had written tools (using different PDF parsing packages) to extract URLs from the hot links in the PDF file, for checking all the links to the web. But rather than wrestle with those packages for this problem (I’d not had much luck getting clean text last time I tried), I used an off-the-shelf program, pdf2txt.py, which is available with the Anaconda distribution of Python and uses the pdfminer package. Running that program on each PDF file created a corresponding text file that had most of the text of the PDF file, though somewhat scrambled around any math formulas. I called these the “unpdf” files.
Unfortunately, running diff on the unpdf files was pretty useless, as diff considers lines to be the objects to compare, and the line breaks were totally different between the unpdf files, even in regions where the text was really functionally identical.
My first thought was to take the unpdf file and break it into one word per line, so that diff could find matching blocks of words. I spent a day or two doing proofreading with the words files, but it was very tedious, as diff often matched common words from different sentences, so I had to spend a lot of time deciphering whether a particular change was real or not. I used emacs with the ediff-buffers command to display the matches, but it did not really give me enough context.
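The one-word-per-line transformation is trivial; a minimal sketch in Python (my reconstruction, not the actual program) would be something like:

```python
import re

def words_per_line(text: str) -> str:
    """Break text into one word per line so diff can align runs of words."""
    return "\n".join(re.split(r"\s+", text.strip()))

print(words_per_line("The quick  brown\nfox"))
```

This discards all the original line-break information, which is exactly the point: diff then compares words rather than the arbitrarily re-wrapped lines.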
I thought that using a more modern diff algorithm might help, so I wasted some time figuring out how to apply “git diff” with the histogram algorithm to a pair of files. Unfortunately “git diff” also uses a more modern output format, which the emacs ediff package does not understand, and I did not want to write yet another program to take the “git diff” output and convert it to normal diff format.
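For reference, “git diff” can be pointed at arbitrary files outside any repository with the --no-index option, and --diff-algorithm selects the algorithm (the file names here are placeholders):

```shell
git diff --no-index --diff-algorithm=histogram mine.txt theirs.txt
```

The output is unified diff format, which is the incompatibility mentioned above—classic diff (and hence ediff’s parser for it) predates that format.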
Instead, I realized that words were not the right size unit—I wanted to compare something more like blocks of sentences. So I wrote yet another Python program to take the unpdf text file and break it into sentences. More precisely, I split it into sentence-like chunks—I just merged everything into one long string then split after every sentence-ending punctuation mark that was followed by white space, and compressed all white space to a single space.
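The chunking rule just described—merge everything, compress whitespace, split after sentence-ending punctuation followed by white space—can be reconstructed in a few lines of Python (a sketch under those assumptions, not the actual program):

```python
import re

def to_sentences(text: str) -> list[str]:
    """Split text into sentence-like chunks: compress all runs of
    whitespace to single spaces, then split after any sentence-ending
    punctuation mark that is followed by a space."""
    merged = re.sub(r"\s+", " ", text).strip()
    return re.split(r"(?<=[.!?]) ", merged)

print("\n".join(to_sentences("First sentence.\nSecond one!  And a\nthird?")))
```

Writing one chunk per line then puts the two files back in a form where diff’s line-oriented comparison is meaningful.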
Running diff via ediff-buffers on sentences files worked fairly well, highlighting sentences that had changed and showing me what words within each sentence were different. Diff has some major problems dealing with structural variation, though—if a figure floats to a different location relative to the text, then its caption (or the text that it floated over) will get handled as a deletion in one place and an insertion in the other, without being compared for changes. There is a very similar problem in genomics, looking for structural variation between two copies of a genome, with the added complexity of having inversions possible as well as simple rearrangement, but I don’t know of a (free) text tool that works well on text rearrangements. In any case, the figure captions are relatively small, so I could hand check them when diff was unable to match them.
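Python’s standard difflib module does essentially what diff does on lists of sentence chunks; a small illustration (the sentences and the figure-numbering change are made up for the example, echoing the kind of edits described later):

```python
import difflib

mine = ["The filter rolls off at 6 dB per octave.",
        "Figure 3.4a shows the response.",
        "The gain is set by the feedback resistor."]
theirs = ["The filter rolls off at 6 dB per octave.",
          "Figure 3.4(a) shows the response.",
          "The gain is set by the feedback resistor."]

# Emit a unified diff: unchanged chunks anchor the comparison,
# and only the differing sentence shows up as -/+ lines.
for line in difflib.unified_diff(mine, theirs,
                                 fromfile="mine", tofile="theirs",
                                 lineterm=""):
    print(line)
```

Like diff, this matches only in sequence order, so it would show the same deletion-here/insertion-there behavior for a floated figure caption.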
I could get most of the plain text to match ok, but pdf2txt.py scrambled stuff rather badly around each math formula, and the scrambling was very different in the two unpdf files, so a lot of the lines in the sentences files were different, even when the corresponding text and formulas on the page were the same. So I often had to visually inspect the differences that diff found, or ignore changes near math.
I found hundreds of differences between their version and mine. Many were small copy-editing changes they had made, maybe half of which I accepted and incorporated into my version. Many were copy-editing errors, where they had introduced a change that I regarded as unacceptable—like consistently referring to Lab 13 as Lab 42, or swapping two of the references in the reference list, so that citations for A-weighting pointed to an article on action potentials.
A few other differences were ones I could live with, but didn’t like, so I left them in their version but did not incorporate them into mine. One of the most common was that they replaced essentially all my ellipses (…) with “etc.”, which I could live with, except when the ellipsis was at the end of a sentence (I don’t like overloading the period to be both the end of the abbreviation and the end of the sentence) or when it was in the middle of a list, rather than at the end. I think that the reason they did not like ellipses is that they set them horribly—not using the ellipsis glyph in Unicode, nor the closely spaced periods that TeX uses. Instead they had periods separated by word spaces, like some high-schooler might use.
Another change that I allowed, but did not incorporate, was a change in the format for numbering the subfigures. I used numbers like “Figure 3.4a”, while they used “Figure 3.4(a)”. It was clear that they had hand-edited the cross-references, though, as they often removed or broke the corresponding hot links to the figures, while the unchanged figure references still had hot links.
There were also a few capitalization and punctuation differences, where the copy editors were not so wrong that they needed to be corrected, nor so right that I felt I should match them.
The whole process took several 10-hour days to make one pass through the book. I would never have been able to do the proofreading manually without a program to point out all the little differences. I’ve still not checked the math typesetting, which I hope they did not mess up too badly. The few places I did check looked ok, except for some examples in the LaTeX tutorial sections, where they had “corrected” the examples that were supposed to be showing what happens when you don’t do it right.
Part of the reason for some of the biggest hassles I’ve had in the proof is that the publisher seems to do a lot of stuff manually that should be automated (like all the reference list and citations—if it were done with a tool like BibTeX, then there would not have been the hundreds of citation errors in round 3 of the proofs). They fixed the citation errors in round 4 (except for one swap), but I dread what they are going to do with the index—I suspect that it is going to be hand-crafted and full of errors, while the index I generated will be clean and mostly correct (I put a lot of work into annotating the LaTeX files for the index and formatting the index nicely).
I regard my PDF file as the “true” version of the book, with the World Scientific one as a slightly inferior copy, but I suspect that most people would not distinguish between them, and they will be producing paper copies to sell, which I will not be doing, so there is value in what they are doing.
With them just starting on correcting the errors, I suspect that the book will be coming out in June 2023 (rather than the original target of June 2022). The official web page now says March 2023, but I suspect it will slip again.