Gas station without pumps

2020 September 9

Checked tandem-duplicate words in book

Filed under: Uncategorized — gasstationwithoutpumps @ 16:54

I got all the spelling checks done in the book today, and I noticed a “the the” along the way, so I looked for all occurrences of that pair of words in the LaTeX files and fixed them.  I then decided to write a tandem-word finder and look for all tandem duplicate words in the LaTeX files.  There were about ten others.  I was only checking a line at a time, though, so I decided to also convert the PDF file to text and check that.  That found another 5 or 6 tandem duplicate words (which had crossed line boundaries in the LaTeX files, but not in the output PDF file).

There were a lot of false positives in the PDF file, because “the Thévenin” somehow got treated as if it had “the the” with a word boundary after the second “the”.  There were also a lot of false positives in tables where numbers were duplicated, and in description lists where the item head was duplicated in the extracted text.
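
One plausible explanation for the Thévenin matches (an assumption on my part, not verified against that particular PDF) is that pdftotext emitted the é in decomposed form, as a plain e followed by a combining accent.  Python’s re module does not count a combining mark as a word character, so a word boundary would then appear right after “The” and the pattern would match.  A quick demonstration:

import re
import unicodedata

tandem_re = re.compile(r"\b(\S+\b)\s+\b\1\b", re.IGNORECASE)

composed = "the Th\u00e9venin"                       # é as a single code point
decomposed = unicodedata.normalize("NFD", composed)  # é as e + combining accent

print(tandem_re.search(composed))    # None: "the" does not match "Thé"
print(tandem_re.search(decomposed))  # matches "the The", because the combining
                                     # accent is not a word character, so \b
                                     # succeeds right after "The"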

What I’ve not decided yet is whether it is worth rewriting the program to look for duplicate words that cross line boundaries.  The program would be a bit more powerful, but I’d need to keep better track of the position in the file to pinpoint where the error occurs, as I would not want to point to a full page as the location of the error.  (A sketch of such a variant follows the code below.)

Here is the code I wrote (edited 2020 Sept 10 to include page or line numbers):

#!/usr/bin/env python3

import re
import sys
import io

import pdftotext	# requires installing poppler and pdftotext

# a word, then whitespace, then the same word again (case-insensitive)
tandem_str = r"\b(\S+\b)\s+\b\1\b"
tandem_re = re.compile(tandem_str, re.IGNORECASE)

def lines_of_input(filenames):
    """Yields (location, line) pairs from stdin if no filenames are given,
    from the extracted text of PDF files, or from plain-text files.
    """
    if not filenames:
        for line in sys.stdin:
            yield "--stdin", line
    else:
        for filename in filenames:
            if filename.endswith(".pdf"):
                with open(filename, "rb") as file:
                    pdf = pdftotext.PDF(file)
                    # number pages from 1 to match PDF viewers
                    for pagenum, page in enumerate(pdf, start=1):
                        for line in io.StringIO(page):
                            yield f'{filename} page {pagenum}', line
            else:
                with open(filename, 'r') as file:
                    # number lines from 1 to match text editors
                    for linenum, line in enumerate(file, start=1):
                        yield f'{filename} line {linenum}', line


for filename, line in lines_of_input(sys.argv[1:]):
    # print("DEBUG:", filename, line, file=sys.stderr)
    if tandem_re.search(line) is not None:
        print(filename, ":", line.strip())
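
For the cross-line variant discussed above, a minimal sketch (my guess at how the rewrite might go, not a finished program) could join each pair of adjacent lines and report a match at the line where it starts, so that no duplicate is reported twice:

#!/usr/bin/env python3
# Sketch of a cross-line tandem-word finder.  Each pair of adjacent lines is
# joined and searched; a match is reported once, at the line where it starts.

import re
import sys

tandem_re = re.compile(r"\b(\S+\b)\s+\b\1\b", re.IGNORECASE)

def check_lines(name, lines):
    prev_num, prev = 0, None
    # the "" sentinel lets the last real line serve as the first of a pair
    for linenum, line in enumerate(list(lines) + [""], start=1):
        if prev is not None:
            window = prev.rstrip() + " " + line.rstrip()
            for match in tandem_re.finditer(window):
                if match.start() < len(prev.rstrip()):  # starts in the first line
                    print(f"{name} line {prev_num}: {match.group(0)}")
        prev_num, prev = linenum, line

for filename in sys.argv[1:]:
    with open(filename) as file:
        check_lines(filename, file)

Running it as ./crossline_tandem.py chapter1.tex (a made-up file name) would report each duplicate at the line where it starts, rather than pointing at a whole page.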

2020 September 6

Checked URLs in book

Filed under: Uncategorized — gasstationwithoutpumps @ 12:20

I got all the URLs in my book checked yesterday.  Writing a program to extract the links and test them was not very difficult, though some of the links that work fine from Chrome or Preview mysteriously would not work from my link-checking program.

As it turns out, my son was writing me a link-checking program at the same time. His program used pdfminer.six instead of PyPDF2, and relied on newer features of Python (I still had Python 3.5.5 installed on this laptop, and f-strings only came in with Python 3.6). I had to install a new version of Python with Anaconda to get his program to run. One difference in our programs is that he collected all the URLs and reduced them to a set of unique URLs (reducing 259 to 206), while I processed the URLs as they were encountered. His program is faster, but mine let me keep track of where in the book each URL occurred.
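
Both behaviors can be combined: a dictionary keyed by URL collapses duplicates while remembering the first place each one occurs.  A minimal sketch, assuming the pdf_to_urls generator from the code below (the file name is a stand-in):

first_seen = {}                                  # url -> first page it occurs on
for page_num, url in pdf_to_urls("book.pdf"):    # "book.pdf" is a made-up name
    first_seen.setdefault(url, page_num)
for url, page_num in first_seen.items():
    print(f"check {url} (first used on page {page_num})")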

The checks we did are slightly different, so the programs picked up slightly different sets of bad URLs. He did just a “get” with a specified user agent and stream set to True, while I tried “head”, then “get” if “head” failed, then “post” if “get” failed, but with default parameter values.  We also had different ways of detecting redirection (he used the url field of the response, while I used headers[“location”]), which yielded different redirection information. It might be worthwhile to write a better check program that does more detailed checking, but this pair of programs was enough to check the book, and I don’t want to waste more time on it.
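
Part of the difference comes from the defaults in the requests library: head() does not follow redirects, so the Location header holds only the next hop, while get() does follow them, so the response’s url field is the final destination.  A small illustration (the URL is a stand-in):

import requests

url = "http://example.com/moved"    # stand-in URL for illustration

r = requests.head(url)              # head() defaults to allow_redirects=False
print(r.status_code, r.headers.get("location"))   # next hop only, if redirected

r = requests.get(url)               # get() defaults to allow_redirects=True
print(r.status_code, r.url)         # r.url is the final URL after all redirects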

I had to modify a number of the URLs for sites that had moved—in some cases having to Google some of the content in order to find where it had now been hidden. I wasted a lot of time trying to track one source of information back to a primary source, and finally gave up, relying on the (moved) secondary source that I had been citing before.

A surprising number of sites are only accessible with http and not https, and I ended up with eight URLs that I could not get to work in the link-check program, but that worked fine from the PDF file and from Chrome. Some of them worked from my son’s program also, but his failed on some that mine had success with.

Here is the code I wrote:

#!/usr/bin/env python3

import PyPDF2
import argparse
import sys

import requests


def parse_args():
    """Parse the options and return what argparse does:
        a structure whose fields are the possible options
    """
    parser = argparse.ArgumentParser( description= __doc__, formatter_class = argparse.ArgumentDefaultsHelpFormatter )
    parser.add_argument("filenames", type=str, nargs="*",
            default=[],
            help="""names of files to check
            """)
    options=parser.parse_args()
    return options

    
def pdf_to_urls(pdf_file_name):
    """Yields (page_num, url) pairs for the URLs used as hyperlinks
    in the file named by pdf_file_name.
    """
    pdf = PyPDF2.PdfFileReader(pdf_file_name)
    for page_num in range(pdf.numPages):
        pdfPage = pdf.getPage(page_num)
        pageObject = pdfPage.getObject()
        if '/Annots' in pageObject.keys():
            for a in pageObject['/Annots']:
                u = a.getObject()
                # only URI actions carry a URL; other annotations are skipped
                if '/A' in u and '/URI' in u['/A']:
                    yield page_num, u['/A']['/URI']


# HTTP status codes from https://developer.mozilla.org/en-US/docs/Web/HTTP/Status
HTTP_codes = {
    100:"Continue"
    , 101:"Switching Protocols"
    , 102:"Processing (WebDAV)"
    , 103:"Early Hints"
    , 200:"OK"
    , 201:"Created"
    , 202:"Accepted"
    , 203:"Non-Authoritative Information"
    , 204:"No Content"
    , 205:"Reset Content"
    , 206:"Partial Content"
    , 207:"Multi-Status (WebDAV)"
    , 208:"Already Reported (WebDAV)"
    , 226:"IM Used (HTTP Delta encoding)"
    , 300:"Multiple Choices"
    , 301:"Moved Permanently"
    , 302:"Found"
    , 303:"See Other"
    , 304:"Not Modified"
    , 305:"Use Proxy (deprecated)"
    , 306:"unused"
    , 307:"Temporary Redirect"
    , 308:"Permanent Redirect"
    , 400:"Bad Request"
    , 401:"Unauthorized"
    , 402:"Payment Required"
    , 403:"Forbidden"
    , 404:"Not Found"
    , 405:"Method Not Allowed"
    , 406:"Not Acceptable"
    , 407:"Proxy Authentication Required"
    , 408:"Request Timeout"
    , 409:"Conflict"
    , 410:"Gone"
    , 411:"Length Required"
    , 412:"Precondition Failed"
    , 413:"Payload Too Large"
    , 414:"URI Too Long"
    , 415:"Unsupported Media Type"
    , 416:"Range Not Satisfiable"
    , 417:"Expectation Failed"
    , 418:"I'm a teapot"
    , 421:"Misdirected Request"
    , 422:"Unprocessable Entity (WebDAV)"
    , 423:"Locked (WebDAV)"
    , 424:"Failed Dependency (WebDAV)"
    , 425:"Too Early"
    , 426:"Upgrade Required"
    , 428:"Precondition Required"
    , 429:"Too Many Requests"
    , 431:"Request Header Fields Too Large"
    , 451:"Unavailable for Legal Reasons"
    , 500:"Internal Server Error"
    , 501:"Not Implemented"
    , 502:"Bad Gateway"
    , 503:"Service Unavailable"
    , 504:"Gateway Timeout"
    , 505:"HTTP Version Not Supported"
    , 506:"Variant Also Negotiates"
    , 507:"Insufficient Storage (WebDAV)"
    , 508:"Loop Detected (WebDAV)"
    , 510:"Not Extended"
    , 511:"Network Authentication Required"
    }



options=parse_args()
for pdf_name in options.filenames:
    print("checking",pdf_name,file=sys.stderr)
    for page_num,url in pdf_to_urls(pdf_name):
        print ("checking page",page_num, url, file=sys.stderr)
        req = None
        try:
            req = requests.head(url, verify=False)      # don't check SSL certificates
            if req.status_code in [403, 405, 406]:
                raise RuntimeError(HTTP_codes[req.status_code])
        except Exception:
            print("--head failed, trying get", file=sys.stderr)
            try:
                req = requests.get(url)
                if req.status_code in [403, 405, 406]:
                    raise RuntimeError(HTTP_codes[req.status_code])
            except Exception:
                print("----get failed, trying post", file=sys.stderr)
                try:
                    req = requests.post(url)
                except Exception:
                    pass

        if req is None:
            print("page",page_num, url, "requests failed with no return")
            print("!!!", url, "requests failed with no return", file=sys.stderr)
            continue

        if req.status_code not in (200, 302):
            # translate the status code to its meaning, if known
            code_meaning = HTTP_codes.get(req.status_code, "Unknown code!!")
            # a redirect response reports the new location in its headers
            new_url = req.headers.get("location", url)

            if url == new_url:
                print("page", page_num, url, req.status_code, code_meaning)
                print("!!!", url, req.status_code, code_meaning, file=sys.stderr)
            else:
                print("OK? page", page_num, url, "moved to", new_url, req.status_code, code_meaning)
                print("!!!", url, "moved to", new_url, req.status_code, code_meaning, file=sys.stderr)

2019 January 8

One figure has been giving me grief for a long time

Filed under: Circuits course — gasstationwithoutpumps @ 09:22

There is one figure in my book that has been giving me trouble for a long time:

[Figure: a Moiré-pattern figure for the sampling and aliasing chapter that was giving me trouble.]

The figure itself is very simple, and it should have been no trouble at all. I created the figure in hand-written SVG, and all the SVG readers (Inkscape, Preview, and browsers) had no trouble rendering it on the screen. But when Inkscape converted it to PDF (using the Cairo library, I believe), it threw away the black bars in the background. When I asked Inkscape to print the image to PDF, it rotated the image.

For a while, I got away with rerotating the image in Preview and saving the result, but the file got damaged or deleted at some point, and redoing the rotation in Preview no longer worked: pdflatex no longer seemed to recognize either the rotation or the bounding box.  (I think Preview changed when I upgraded macOS on my laptop.) This change happened between the 2018 Dec 15 and 2018 Dec 30 releases of the book, so the Dec 30 release had a messed-up figure without my realizing it.

Yesterday evening, I noticed the problem and set about trying to fix it.  Nothing I could do with Inkscape or Preview seemed to work—I either ended up with no black bars or with the image rotated and scaled wrong.  (Viewing the individual image with Preview sometimes worked—but the inclusion by pdflatex was failing in those cases.)

Finally, I decided that since Inkscape was incapable of rendering the pattern fill I was using to create the bars into PDF, I would give up on pattern fill entirely.  Instead, I used a Python program to generate the bars as separate rectangles.  Inkscape had no trouble converting that longer but less sophisticated SVG file to PDF, and I was able to fix the figure.
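
A minimal sketch of that kind of generator (the bar width, spacing, and image size are made-up values, not the ones from the actual figure):

#!/usr/bin/env python3
# Sketch: emit vertical bars as explicit SVG rectangles instead of a pattern
# fill.  All dimensions are made-up illustration values.

width, height = 400, 300     # image size in user units
bar_width, pitch = 2, 6      # bar width and center-to-center spacing

rects = "\n".join(
    f'  <rect x="{x}" y="0" width="{bar_width}" height="{height}" fill="black"/>'
    for x in range(0, width, pitch)
)

print(f'<svg xmlns="http://www.w3.org/2000/svg" width="{width}" height="{height}">')
print(rects)
print('</svg>')

Since each bar is an ordinary rect element, the SVG-to-PDF conversion never has to handle a pattern fill, which sidesteps the code path that was dropping the bars.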

Because this figure was messed up in the “final” release of 30 Dec 2018, I did a quick re-release last night, fixing this figure and a bunch of typos students had found.  Yesterday was the first day of class, and students have already reported 7 errors in the book (one reported after yesterday’s release, so it is still in the current version at LeanPub).

This year’s class seems to be very diligent, as all the students had the book downloaded by the first day of class, and some had started on the homework.
