Gas station without pumps

2020 September 6

Checked URLs in book

Filed under: Uncategorized — gasstationwithoutpumps @ 12:20

I got all the URLs in my book checked yesterday.  Writing a program to extract the links and test them was not very difficult, though some of the links that work fine from Chrome or Preview mysteriously would not work from my link-checking program.

As it turns out, my son was writing me a link-checking program at the same time. His used pdfminer.six instead of PyPDF2 and relied on newer features of Python (I still had Python 3.5.5 installed on this laptop, and f-strings only came in with Python 3.6), so I had to install a new version of Python with Anaconda to get his program to run. One difference between our programs is that he collected all the URLs and reduced them to a set of unique ones (259 down to 206), while I processed each URL as it was encountered. His program is faster, but mine let me keep track of where in the book each URL occurred.
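
His program isn’t reproduced here, but the set-of-unique-URLs approach with pdfminer.six looks roughly like the following sketch (my reconstruction, not his code; the function name and details are mine):

from pdfminer.pdfparser import PDFParser
from pdfminer.pdfdocument import PDFDocument
from pdfminer.pdfpage import PDFPage
from pdfminer.pdftypes import resolve1

def unique_urls(pdf_file_name):
    """Return the set of distinct URLs used as hyperlink annotations in the PDF."""
    urls = set()
    with open(pdf_file_name, "rb") as f:
        doc = PDFDocument(PDFParser(f))
        for page in PDFPage.create_pages(doc):
            for annot in resolve1(page.annots) or []:
                action = resolve1(resolve1(annot).get("A"))
                if action and "URI" in action:
                    uri = action["URI"]
                    # pdfminer returns PDF string objects as bytes
                    urls.add(uri.decode("utf-8", "replace") if isinstance(uri, bytes) else uri)
    return urls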

The checks we did are slightly different, so the programs picked up slightly different sets of bad URLs. He did just a “get” with a specified user agent and stream set to True, while I tried “head”, then “get” if “head” failed, then “post” if “get” failed, but otherwise with default parameter values. We also had different ways of detecting redirection (he used the url field of the response, while I used headers["location"]), which yielded different redirection information. It might be worthwhile to write a better checker that does more detailed testing, but this pair of programs was enough to check the book, and I don’t want to waste more time on it.
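
For the record, the two ways of detecting redirection amount to roughly this (a sketch with a placeholder URL and user-agent string, not a copy of either program):

import requests

url = "http://example.com/old-page"    # placeholder

# His style: a single GET with a browser-like agent and a streamed body,
# then see whether requests ended up somewhere else after following redirects.
resp = requests.get(url, stream=True, timeout=10,
                    headers={"User-Agent": "Mozilla/5.0"})
if resp.url != url:
    print("redirected to", resp.url)

# My style: don't follow the redirect (head doesn't by default) and read the
# Location header of the redirect response itself.
resp = requests.head(url, allow_redirects=False, timeout=10)
if 300 <= resp.status_code < 400:
    print("redirected to", resp.headers.get("location"))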

I had to modify a number of the URLs for sites that had moved—in some cases having to Google some of the content in order to find where it had now been hidden. I wasted a lot of time trying to track one source of information back to a primary source, and finally gave up, relying on the (moved) secondary source that I had been citing before.

A surprising number of sites are only accessible with http and not https, and I ended up with eight URLs that I could not get to work in the link-checking program, but that worked fine from the PDF file and from Chrome. Some of them worked from my son’s program also, but his failed on some that mine had success with.

Here is the code I wrote:

#!/usr/bin/env python3

import PyPDF2
import argparse
import sys

import requests


def parse_args():
    """Parse the options and return what argparse does:
        a structure whose fields are the possible options
    """
    parser = argparse.ArgumentParser( description= __doc__, formatter_class = argparse.ArgumentDefaultsHelpFormatter )
    parser.add_argument("filenames", type=str, nargs="*",
            default=[],
            help="""names of files to check
            """)
    options=parser.parse_args()
    return options

    
def pdf_to_urls(pdf_file_name):
    """yields (page_num, url) for each hyperlink in the file named by pdf_file_name
    """
    pdf = PyPDF2.PdfFileReader(pdf_file_name)
    for page_num in range(pdf.numPages):
        pageObject = pdf.getPage(page_num).getObject()
        if '/Annots' in pageObject:
            for a in pageObject['/Annots']:
                u = a.getObject()
                # skip annotations (such as internal links) that have no URI action
                if '/A' in u and '/URI' in u['/A']:
                    yield (page_num, u['/A']['/URI'])


# HTTP status codes from https://developer.mozilla.org/en-US/docs/Web/HTTP/Status
HTTP_codes = {
    100:"Continue"
    , 101:"Switching Protocol"
    , 102:"Processing (WebDAV)"
    , 102:"Early Hints"
    , 200:"OK"
    , 201:"Created"
    , 202:"Accepted"
    , 203:"Non-Authoritative Information"
    , 204:"No Content"
    , 205:"Reset Content"
    , 206:"Partial Content"
    , 207:"Multi-Status (WebDAV)"
    , 208:"Already Reported (WebDAV)"
    , 226:"IM Used (HTTP Delta encoding)"
    , 300:"Multiple Choice"
    , 301:"Moved Permanently"
    , 302:"Found"
    , 303:"See Other"
    , 304:"Not Modified"
    , 305:"Use Proxy (deprecated)"
    , 306:"unused"
    , 307:"Temporary Redirect"
    , 308:"Permanent Redirect"
    , 400:"Bad Request"
    , 401:"Unauthorized"
    , 402:"Payment Required"
    , 403:"Forbidden"
    , 404:"Not Found"
    , 405:"Method Not Allowed"
    , 406:"Not Acceptable"
    , 407:"Proxy Authentication Required"
    , 408:"Request Timeout"
    , 409:"Conflict"
    , 410:"Gone"
    , 411:"Length Required"
    , 412:"Precondition Failed"
    , 413:"Payload Too Large"
    , 414:"URI Too Long"
    , 415:"Unsupported Media Type"
    , 416:"Range Not Satisfiable"
    , 417:"Expectation Failed"
    , 418:"I'm a teapot"
    , 421:"Misdirected Request"
    , 422:"Unprocessable Entity (WebDAV)"
    , 423:"Locked (WebDAV)"
    , 424:"Failed Dependency (WebDAV)"
    , 425:"Too Early"
    , 426:"Upgrade Required"
    , 428:"Precondition Required"
    , 429:"Too Many Requests"
    , 431:"Request Header Fields Too Large"
    , 451:"Unavailable for Legal Reasons"
    , 500:"Internal Server Error"
    , 501:"Not Implemented"
    , 502:"Bad Gateway"
    , 503:"Service Unavailable"
    , 504:"Gateway Timeout"
    , 505:"HTTP Version Not Supported"
    , 506:"Variant Also Negotiates"
    , 507:"Insufficient Storage (WebDAV)"
    , 508:"Loop Detected (WebDAV)"
    , 510:"Not Extended"
    , 511:"Network Authentication Required"
    }



options=parse_args()
for pdf_name in options.filenames:
    print("checking",pdf_name,file=sys.stderr)
    for page_num,url in pdf_to_urls(pdf_name):
        print ("checking page",page_num, url, file=sys.stderr)
        req = None
        try:
            req = requests.head(url, verify=False)      # don't check SSL certificates
            if req.status_code in [403,405,406]: raise RuntimeError(HTTP_codes[req.status_code])
        except:
            print("--head failed, trying get",file=sys.stderr)
            try: 
                req = requests.get(url)
                if req.status_code in [403,405,406]: raise RuntimeError(HTTP_codes[req.status_code])
            except: 
                print("----get failed, trying post",file=sys.stderr)
                try: req = requests.post(url)
                except: pass
	
        if req is None:
            print("page",page_num, url, "requests failed with no return")
            print("!!!", url, "requests failed with no return", file=sys.stderr)
            continue

        if req.status_code not in (200,302):
            try: 
               code_meaning = HTTP_codes[req.status_code]
            except: 
               code_meaning = "Unknown code!!"
            
            try:
                new_url = req.headers["location"]
            except:
                new_url=url
            
            if url==new_url:
                print("page",page_num, url, req.status_code, code_meaning)
                print("!!!", url, req.status_code, code_meaning, file=sys.stderr)
            else:
                print("OK? page",page_num, url, "moved to", new_url, req.status_code, code_meaning)
                print("!!!", url, "moved to", new_url, req.status_code, code_meaning, file=sys.stderr)

7 Comments »

  1. Your tool worked nicely. On mac with python3 installed with brew (I think), I had to run: “pip3 install PyPDF2 ; pip3 install requests” to get it to work.

    I’ve just run it against the pdf for my book, and it found one issue, now fixed.

    Comment by peeterjoot — 2020 September 7 @ 19:39

  2. fyi. I hacked plain text support into your script so I could check my bib file. Saved a copy of that hacked script here:

    https://github.com/peeterjoot/physicsplay/blob/master/bin/urlcheckerpy

    I don’t know enough python to understand why, but the output for the plain text input is way more verbose (lots of warnings.warn output that I don’t see when the input source is pdf).

    Comment by peeterjoot — 2020 September 14 @ 09:31

    • I think your problem may be that you are checking full lines (including the linebreak at the end). You should probably at least strip the white space off the ends by passing line.strip() as the url, instead of line. Even better might be to break on white spaces (line.strip().split()) and iterate over the words, only checking those that start with http:, https:, or ftp:
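
      Roughly, as a sketch (the function name is just illustrative, parallel to pdf_to_urls):

      def text_file_to_urls(text_file_name):
          with open(text_file_name) as f:
              for line_num, line in enumerate(f, start=1):
                  for word in line.strip().split():
                      # only check words that look like URLs
                      if word.startswith(("http:", "https:", "ftp:")):
                          yield line_num, word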

      Comment by gasstationwithoutpumps — 2020 September 14 @ 13:00

      • Thanks. I’ve added the strip(). The output is still more verbose (but has many fewer errors this time). The verbosity isn’t a big deal, but it is odd. Instead of logging lines like “InsecureRequestWarning: Unverified HTTPS” I get two lines that include a warnings.warn( in the output and a fully qualified path to the function that generates that message:

        /usr/local/lib/python3.8/site-packages/urllib3/connectionpool.py:981: InsecureRequestWarning: Unverified HTTPS request is being made to host 'en.wikipedia.org'. Adding certificate verification is strongly advised. See: https://urllib3.readthedocs.io/en/latest/advanced-usage.html#ssl-warnings
        warnings.warn(

        Comment by peeterjoot — 2020 September 14 @ 13:36

        • The “InsecureRequestWarnings” are in my script also—they come from checking the https links without requiring certificates (the verify=False option). You can require certificates, but then some of the links will fail because the certificates are self-signed or otherwise not recognized by the program.
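
          If the warning messages themselves are the annoyance, they can be silenced up front (a sketch; neither of our scripts does this):

          import urllib3
          # suppress the warning urllib3 emits for every verify=False request
          urllib3.disable_warnings(urllib3.exceptions.InsecureRequestWarning)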

          Comment by gasstationwithoutpumps — 2020 September 14 @ 14:46

          • I was seeing the InsecureRequestWarnings warnings with a pdf input file too, but the warnings are more terse when the source is a pdf, which I don’t see any reason for (but will ignore as a python mystery that I don’t know how to investigate.)

            Comment by peeterjoot — 2020 September 14 @ 15:10

