Gas station without pumps

2020 September 6

Checked URLs in book

Filed under: Uncategorized — gasstationwithoutpumps @ 12:20

I got all the URLs in my book checked yesterday.  Writing a program to extract the links and test them was not very difficult, though some of the links that work fine from Chrome or Preview mysteriously would not work from my link-checking program.

As it turns out, my son was writing me a link-checking program at the same time. His used pdfminer.six instead of PyPDF2 and relied on newer features of Python (I still had Python 3.5.5 installed on this laptop, and f-strings only came in with Python 3.6), so I had to install a new version of Python with Anaconda to get his program to run. One difference between our programs is that he collected all the URLs and reduced them to a set of unique ones (259 down to 206), while I processed the URLs as they were encountered. His program is faster, but mine let me keep track of where in the book each URL occurred.
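
A hybrid of the two approaches is easy: collect the URLs into a dictionary keyed by URL, with the list of pages where each one occurs as the value, so that each unique URL is checked only once but every location can still be reported. A minimal sketch (assuming the pdf_to_urls generator defined in the code below, and a placeholder file name):

from collections import defaultdict

# Map each unique URL to the list of pages it occurs on.
pages_for_url = defaultdict(list)
for page_num, url in pdf_to_urls("book.pdf"):   # "book.pdf" is a placeholder
    pages_for_url[url].append(page_num)

# Each URL gets checked once, but all its locations are known.
for url, pages in pages_for_url.items():
    print(url, "appears on pages", pages)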

The checks we did are slightly different, so the programs picked up slightly different sets of bad URLs. He did just a “get” with a specified user agent and stream set to True, while I tried “head”, then “get” if “head” failed, then “post” if “get” failed, but with default parameter values.  We also had different ways of detecting redirection (he used the url field of the response, while I used headers[“location”]), which yielded different redirection information. It might be worthwhile to write a better checking program that does more detailed tests, but this pair of programs was enough to check the book, and I don’t want to waste more time on it.
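
The difference between the two redirect checks comes down to whether redirects are followed: requests follows redirects by default for “get”, so the url field of the response reports the final destination, while the “location” header is only present on an unfollowed 3xx response (and “head” does not follow redirects by default). A small sketch of both, with a placeholder URL:

import requests

url = "http://example.com/old-page"             # placeholder URL

# Following redirects (the default for get): resp.url is wherever we ended up.
resp = requests.get(url)
if resp.url != url:
    print("redirected to", resp.url)

# Not following redirects: the Location header names the first hop's target.
resp = requests.head(url, allow_redirects=False)
if resp.status_code in (301, 302, 303, 307, 308):
    print("redirected to", resp.headers["location"])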

I had to modify a number of the URLs for sites that had moved—in some cases having to Google some of the content in order to find where it is now hidden. I wasted a lot of time trying to track one source of information back to a primary source, and finally gave up, relying on the (moved) secondary source that I had been citing before.

A surprising number of sites are accessible only with http and not https, and I ended up with eight URLs that I could not get to work in the link-check program, but that worked fine from the PDF file and from Chrome. Some of them worked from my son’s program as well, while his failed on some that mine succeeded on.
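
One likely culprit for URLs that work in a browser but not from a script is the User-Agent header: some servers reject or stall the default python-requests agent. A retry with a browser-like agent string sometimes helps (the header value below is just an example; whether it works is entirely up to the server):

import requests

url = "http://example.com/"                     # placeholder URL
browser_headers = {"User-Agent":
        "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15) AppleWebKit/537.36"}

# Retry with a browser-like agent and a timeout, in case the server
# is filtering on the default python-requests User-Agent.
resp = requests.get(url, headers=browser_headers, timeout=10)
print(resp.status_code)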

Here is the code I wrote:

#!/usr/bin/env python3
"""Check all the URLs used as hyperlinks in the PDF files named on the command line."""

import PyPDF2
import argparse
import sys

import requests


def parse_args():
    """Parse the options and return what argparse does:
        a structure whose fields are the possible options
    """
    parser = argparse.ArgumentParser(description=__doc__,
            formatter_class=argparse.ArgumentDefaultsHelpFormatter)
    parser.add_argument("filenames", type=str, nargs="*",
            default=[],
            help="""names of files to check
            """)
    options = parser.parse_args()
    return options

    
def pdf_to_urls(pdf_file_name):
    """Yields (page_number, url) pairs for the hyperlinks in the file named by pdf_file_name."""
    pdf = PyPDF2.PdfFileReader(pdf_file_name)
    for page_num in range(pdf.numPages):
        pdfPage = pdf.getPage(page_num)
        pageObject = pdfPage.getObject()
        if '/Annots' in pageObject.keys():
            ann = pageObject['/Annots']
            for a in ann:
                u = a.getObject()
                if '/A' in u and '/URI' in u['/A']:     # skip annotations that are not URI actions
                    yield (page_num, u['/A']['/URI'])


# HTTP status codes from https://developer.mozilla.org/en-US/docs/Web/HTTP/Status
HTTP_codes = {
    100:"Continue"
    , 101:"Switching Protocol"
    , 102:"Processing (WebDAV)"
    , 102:"Early Hints"
    , 200:"OK"
    , 201:"Created"
    , 202:"Accepted"
    , 203:"Non-Authoritative Information"
    , 204:"No Content"
    , 205:"Reset Content"
    , 206:"Partial Content"
    , 207:"Multi-Status (WebDAV)"
    , 208:"Already Reported (WebDAV)"
    , 226:"IM Used (HTTP Delta encoding)"
    , 300:"Multiple Choice"
    , 301:"Moved Permanently"
    , 302:"Found"
    , 303:"See Other"
    , 304:"Not Modified"
    , 305:"Use Proxy (deprecated)"
    , 306:"unused"
    , 307:"Temporary Redirect"
    , 308:"Permanent Redirect"
    , 400:"Bad Request"
    , 401:"Unauthorized"
    , 402:"Payment Required"
    , 403:"Forbidden"
    , 404:"Not Found"
    , 405:"Method Not Allowed"
    , 406:"Not Acceptable"
    , 407:"Proxy Authentication Required"
    , 408:"Request Timeout"
    , 409:"Conflict"
    , 410:"Gone"
    , 411:"Length Required"
    , 412:"Precondition Failed"
    , 413:"Payload Too Large"
    , 414:"URI Too Long"
    , 415:"Unsupported Media Type"
    , 416:"Range Not Satisfiable"
    , 417:"Expectation Failed"
    , 418:"I'm a teapot"
    , 421:"Misdirected Request"
    , 422:"Unprocessable Entity (WebDAV)"
    , 423:"Locked (WebDAV)"
    , 424:"Failed Dependency (WebDAV)"
    , 425:"Too Early"
    , 426:"Upgrade Required"
    , 428:"Precondition Required"
    , 429:"Too Many Requests"
    , 431:"Request Header Fields Too Large"
    , 451:"Unavailable for Legal Reasons"
    , 500:"Internal Server Error"
    , 501:"Not Implemented"
    , 502:"Bad Gateway"
    , 503:"Service Unavailable"
    , 504:"Gateway Timeout"
    , 505:"HTTP Version Not Supported"
    , 506:"Variant Also Negotiates"
    , 507:"Insufficient Storage (WebDAV)"
    , 508:"Loop Detected (WebDAV)"
    , 510:"Not Extended"
    , 511:"Network Authentication Required"
    }



options=parse_args()
for pdf_name in options.filenames:
    print("checking",pdf_name,file=sys.stderr)
    for page_num,url in pdf_to_urls(pdf_name):
        print ("checking page",page_num, url, file=sys.stderr)
        req = None
        try:
            req = requests.head(url, verify=False)      # don't check SSL certificates
            # a 403/405/406 often just means the server dislikes HEAD, so treat it as a failure and retry with GET
            if req.status_code in (403, 405, 406):
                raise RuntimeError(HTTP_codes[req.status_code])
        except Exception:
            print("--head failed, trying get", file=sys.stderr)
            try:
                req = requests.get(url)
                if req.status_code in (403, 405, 406):
                    raise RuntimeError(HTTP_codes[req.status_code])
            except Exception:
                print("----get failed, trying post", file=sys.stderr)
                try:
                    req = requests.post(url)
                except Exception:
                    pass

        if req is None:
            print("page",page_num, url, "requests failed with no return")
            print("!!!", url, "requests failed with no return", file=sys.stderr)
            continue

        if req.status_code not in (200,302):
            try:
                code_meaning = HTTP_codes[req.status_code]
            except KeyError:
                code_meaning = "Unknown code!!"

            try:
                new_url = req.headers["location"]       # redirect target, if the server sent one
            except KeyError:
                new_url = url
            
            if url==new_url:
                print("page",page_num, url, req.status_code, code_meaning)
                print("!!!", url, req.status_code, code_meaning, file=sys.stderr)
            else:
                print("OK? page",page_num, url, "moved to", new_url, req.status_code, code_meaning)
                print("!!!", url, "moved to", new_url, req.status_code, code_meaning, file=sys.stderr)

2011 May 18

Logical punctuation

Filed under: Uncategorized — gasstationwithoutpumps @ 06:20

Ben Yagoda, in an article for Slate Magazine, wrote about the placement of commas relative to quotation marks: “Logical punctuation: Should we start placing commas outside quotation marks?”

I’ve always been a stickler for punctuation rules, but I have resisted one rule that is often arbitrarily applied: forcing periods and commas inside quotation marks at the end of a quote, when they are grammatically part of the surrounding sentence and not part of the quote.  I’ve written about quotations before—the following is from a class handout of 1990 that I used for about 10 years:

Quotation marks (“quotes”) are used to enclose a directly quoted statement from another source, or, sometimes, to set off a slang word or deliberately mis-used word. The second usage probably derives from the first, attributing the word to an outside source. Don’t use quotation marks for emphasis—use italics or underlines instead. Single-quotes are used for quotations inside quotations. Some fonts have separate left and right quotes (“like this” and ‘this’); if yours does, use them. Brackets [ ] are for comments from the quoting author inside a quoted passage. One popular bracketed comment is [sic], which is used to indicate that the error in a quotation was in the original, and was not added in transcription.

We disagree with many punctuation experts on one point—they insist on putting commas inside quotes. This is correct when quoting human conversation or human-to-human writing, but when quoting any communication with a computer, retain the original punctuation inside the quote marks. For example, you type “mail”, not “mail.” Exact punctuation is often critical in computer communications—resist the attempts of those who know no better to “correct” your usage!

Of course, all my comments above pertain to standard American punctuation, not to other, equally valid punctuation systems.  I did allow my students to use British conventions, as long as they used them consistently (no mix-and-match), though I did insist on the serial comma (also known as the Oxford comma), because it increases readability.

Another common problem in modern communication is how to punctuate a URL that is included in a sentence.  It is often an appositive (and so would normally be set off by commas) or comes at the end of a sentence.  It is really, really bad to add extra punctuation to a URL.  My usual fix is to rewrite my sentences so that URLs are not adjacent to punctuation, or to put the URLs in parentheses. When I have to fix someone else’s text quickly (such as when forwarding an e-mail), I often choose the unsatisfying approach of adding a newline after the URL and omitting the sentence punctuation that would normally follow it.

(Incidentally, the “we” in the quoted text was neither the royal, nor the editorial “we”—I co-taught the class at the time I wrote the handout, and so there really were two instructors as the authors of the handout.)
