Gas station without pumps

2020 September 9

Checked tandem-duplicate words in book

Filed under: Uncategorized — gasstationwithoutpumps @ 16:54
Tags: , , ,

I got all the spelling checks done in the book today, and I noticed a “the the” in the book, so I looked for all occurrences of that pair of words in the LaTeX files and fixed them.  I then decided to write a tandem-word finder and look for all tandem duplicate words in the LaTeX files.  There were about ten others.  I was only checking a line at a time, though, so I decided to also convert the PDF file to text and check that.  That found another 5 or 6 tandem duplicate words (which had crossed line boundaries in the LaTeX files, but not in the output PDF file).

There were a lot of false positives in the PDF file, because “the Thévenin” somehow got treated as if it had “the the” with a word boundary after the second “the”.  There were also a lot of places in tables where numbers were duplicated, or description lists where the item head.

What I’ve not decided yet is whether it is worth rewriting the program to look for duplicate words that cross line boundaries—the program would be a bit more powerful, but I’d need to keep track of the place in the file better to be able to pinpoint where the error occurs, as I would not want to point to a full page as the location of the error.

Here is the code I wrote (edited 2020 Sept 10 to include page or line numbers):

#!/usr/bin/env python3

import re
import sys
import io

import pdftotext	# requires installing poppler and pdftotext

tandem_str = r"\b(\S+\b)\s+\b\1\b"
tandem_re = re.compile(tandem_str,re.IGNORECASE)

def lines_of_input(filenames):
    if not filenames:
        for line in sys.stdin:
            yield "--stdin",line
    else:
        for filename in filenames:
            if filename.endswith(".pdf"):
            	with open(filename, "rb") as file:
                    pdf = pdftotext.PDF(file)
                    for pagenum,page in enumerate(pdf):
                        for line in io.StringIO(page):
                            yield f'{filename} page {pagenum}',line
            else:
                with open(filename, 'r') as file:
                    for linenum,line in enumerate(file):
                        yield f'{filename} line {linenum}',line


for filename,line in lines_of_input(sys.argv[1:]):
#        print("DEBUG:", filename, line, file=sys.stderr)
        if tandem_re.search(line) is not None:
            print(filename,":",line.strip())

%d bloggers like this: