Gas station without pumps

2020 September 9

Checked tandem-duplicate words in book

Filed under: Uncategorized — gasstationwithoutpumps @ 16:54

I got all the spelling checks done in the book today, and I noticed a “the the” in the book, so I looked for all occurrences of that pair of words in the LaTeX files and fixed them.  I then decided to write a tandem-word finder and look for all tandem duplicate words in the LaTeX files.  There were about ten others.  I was only checking a line at a time, though, so I decided to also convert the PDF file to text and check that.  That found another 5 or 6 tandem duplicate words (which had crossed line boundaries in the LaTeX files, but not in the output PDF file).

There were a lot of false positives in the PDF file, because “the Thévenin” somehow got treated as if it had “the the” with a word boundary after the second “the”.  There were also a lot of places in tables where numbers were duplicated, or description lists where the item head was repeated at the start of the description.
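
One plausible explanation for the “the Thévenin” hits (an assumption, not something checked against the book's PDF) is that the text extraction emits the é as a plain “e” followed by a combining acute accent (U+0301). Python's \b treats a combining mark as a non-word character, so a word boundary appears right after “The”. A minimal sketch of that effect, with Unicode NFC normalization as a possible workaround:

```python
import re
import unicodedata

tandem_re = re.compile(r"\b(\S+\b)\s+\b\1\b", re.IGNORECASE)

# Assumed form of the extracted text: "e" + combining acute accent (U+0301)
decomposed = "the The\u0301venin equivalent"

# NFC composes "e" + U+0301 into the single letter "é" (U+00E9)
composed = unicodedata.normalize("NFC", decomposed)

false_positive = tandem_re.search(decomposed) is not None
still_matches = tandem_re.search(composed) is not None
print(false_positive, still_matches)
```

If the extraction instead emits a standalone accent character, normalization would not help and a different filter would be needed.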

What I’ve not decided yet is whether it is worth rewriting the program to look for duplicate words that cross line boundaries.  The program would be a bit more powerful, but I’d need to keep better track of the position in the file to pinpoint where each error occurs, as I would not want to point to a full page as the location of the error.
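
A rough sketch of what such a cross-line search might look like, assuming the whole file fits in memory: since \s+ also matches newlines, the same pattern can be run over the full text, and the line number recovered by counting the newlines before each match. The function name find_tandem_across_lines is hypothetical and is not part of the script in this post.

```python
import re

tandem_re = re.compile(r"\b(\S+\b)\s+\b\1\b", re.IGNORECASE)

def find_tandem_across_lines(text):
    """Yield (line_number, duplicate) for each tandem duplicate in text.

    line_number is 1-based and refers to the line where the first
    copy of the word starts.
    """
    for match in tandem_re.finditer(text):
        line_number = text.count("\n", 0, match.start()) + 1
        # Collapse the whitespace (possibly a newline) between the two copies
        yield line_number, " ".join(match.group(0).split())

sample = "this is the\nthe end of\nthe sample\n"
hits = list(find_tandem_across_lines(sample))
print(hits)
```

For a PDF, the same function could be applied to the text of each page, which would give a page number plus a line within the page rather than the page alone.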

Here is the code I wrote (edited 2020 Sept 10 to include page or line numbers):

#!/usr/bin/env python3
"""Report lines that contain tandem duplicate words such as "the the"."""

import io
import re
import sys

import pdftotext    # requires installing poppler and pdftotext

# A run of non-space characters ending at a word boundary, then whitespace,
# then the same run again as a whole word (compared case-insensitively).
tandem_str = r"\b(\S+\b)\s+\b\1\b"
tandem_re = re.compile(tandem_str, re.IGNORECASE)

def lines_of_input(filenames):
    """Yield (location, line) pairs from the named files, or from stdin if none are given."""
    if not filenames:
        for line in sys.stdin:
            yield "--stdin", line
    else:
        for filename in filenames:
            if filename.endswith(".pdf"):
                with open(filename, "rb") as file:
                    pdf = pdftotext.PDF(file)
                    # pagenum is the serial page number, starting at 0
                    for pagenum, page in enumerate(pdf):
                        for line in io.StringIO(page):
                            yield f'{filename} page {pagenum}', line
            else:
                with open(filename, 'r') as file:
                    # linenum also starts at 0
                    for linenum, line in enumerate(file):
                        yield f'{filename} line {linenum}', line


for location, line in lines_of_input(sys.argv[1:]):
    # print("DEBUG:", location, line, file=sys.stderr)
    if tandem_re.search(line) is not None:
        print(location, ":", line.strip())
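
To get a feel for what the pattern does and does not flag, here is a small check on a few made-up sample lines (the samples are illustrative, not taken from the book):

```python
import re

tandem_re = re.compile(r"\b(\S+\b)\s+\b\1\b", re.IGNORECASE)

samples = [
    "Measure the the voltage",    # plain duplicate
    "The the circuit",            # differs only in case
    "the then-current value",     # "the" followed by a longer word
    "12 12 15",                   # duplicated number, as in a table row
]

# Map each sample line to whether the pattern flags it
results = {s: tandem_re.search(s) is not None for s in samples}
for sample, flagged in results.items():
    print(flagged, sample)
```

The last sample shows why table rows with repeated numbers turn up as false positives.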

3 Comments »

  1. Is there an easy way to modify this script to show the page numbers that it finds the duplicates on?

    Comment by peeterjoot, 2020 September 10 @ 07:44

    • Actually, that enhancement is not worth it, at least for me: I had only 5 such errors after pruning all the hits from equations. I searched for my hits (the the, and and, since since, …) in Skim and used the SyncTeX indexing to pull up the corresponding LaTeX files in the editor (half of them crossed line boundaries, so they would have been hard to find otherwise).

      Comment by peeterjoot, 2020 September 10 @ 08:00

    • Yes and no. For the PDF file, the “for page in pdf:” could be replaced by “for pagenum,page in enumerate(pdf):” to get the serial page number starting at 0, but extracting the page number that the book uses (with roman numerals for front matter, and arabic numerals only starting partway through the book) would be more difficult. I’m not sure how to get those, as the pdftotext module I’m using has a rather minimal interface.

      For text files, page number is not very useful, but line number could be extracted with “for linenum,line in enumerate(file):”

      Comment by gasstationwithoutpumps, 2020 September 10 @ 08:09

