While doing the restructuring, I am testing in more depth as I change the code, and I am trying to understand how the proxy options work. Specifically, how the proxy list works (or, rather, does not work).
There is code in the main function that randomly selects proxies from a list, but it does not actually use the result; this was noticed in #292. It looks like the only place the proxy list is used is when a proxy error occurs during get_response(): in that case a new random proxy is chosen. However, no care is taken to ensure that we do not get the same proxy that just errored out. Problematic proxies should probably be blacklisted after that type of failure.
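For reference, a blacklist along those lines would only take a few lines. This is a rough sketch with hypothetical names (choose_new_proxy, bad_proxies), not code from the project:

```python
import random


def choose_new_proxy(proxy_list, bad_proxies, failed_proxy):
    """Pick a replacement proxy, excluding any that have already errored out."""
    bad_proxies.add(failed_proxy)
    candidates = [p for p in proxy_list if p not in bad_proxies]
    if not candidates:
        return None  # every proxy has failed; fall back to a direct connection
    return random.choice(candidates)
```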
Moreover, there is a check earlier in the code that prevents the proxy list and the proxy command line option from being used simultaneously. So I can see no way that the proxy list has any effect: if you do define a proxy list, there is no way to kick off the initial requests with a proxy.
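To make that dead end concrete, here is a toy model of the interaction; the option names and the resolve_proxy() function are hypothetical, not the project's actual argument handling:

```python
from types import SimpleNamespace


def resolve_proxy(args):
    """Toy model of the mutual-exclusion guard (hypothetical option names)."""
    if args.proxy and args.proxy_list:
        raise SystemExit("A single proxy and a proxy list cannot both be given.")
    if args.proxy:
        return args.proxy  # the single proxy is attached to the initial requests
    # With only a proxy list supplied, nothing is returned here, so the initial
    # requests go out with no proxy.  A proxy error can then never occur, which
    # means the list is never consulted at all.
    return None


print(resolve_proxy(SimpleNamespace(proxy=None, proxy_list=["socks5://127.0.0.1:9050"])))
# -> None
```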
I also noticed that the recursive get_response() call does not pass its return tuple back up the call chain, so the existing code would never benefit from the switchover to an alternate proxy (even if the other problems mentioned above were resolved).
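A toy model of that bug (the fetch() function and the "good"/"bad" proxies are made up purely to show the missing return; this is not the project's actual signature):

```python
import random


def fetch(url, proxy, attempts_left=3):
    """Toy stand-in for get_response(); returns a (status, proxy_used) tuple."""
    try:
        if proxy == "bad":  # simulate a proxy failure
            raise ConnectionError("proxy error")
        return ("ok", proxy)
    except ConnectionError:
        if attempts_left == 0:
            return ("failed", proxy)
        new_proxy = random.choice(["good", "bad"])
        # Without this `return`, the retry still happens but its result is
        # dropped and the caller receives None -- which is the bug above.
        return fetch(url, new_proxy, attempts_left - 1)


print(fetch("https://example.com/user", "bad"))
```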
For now, I am removing proxy list support. This feature may be revisited after the restructuring is done.
The previous code allocated one worker per site. The problem is that as the number of sites has grown, there is no longer enough memory to allocate all of those requests. In practice, having all of the requests in parallel does not really speed up processing: on my computer, a query for all of the sites took 1 minute 10 seconds before the change and 1 minute 9 seconds after.
Limiting the number of workers to 10 did increase the query time to 1 minute 17 seconds. I am not sure whether that is just inconsistency in network traffic, but I will leave the limit at 20 for now.
Note that with the limit of 20, my query detected more sites than it did previously. It appears that some requests had been failing on my computer for memory reasons (rather than because of an actual detection result from the site).
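For the record, the cap is a one-line change wherever the worker pool is created. A minimal sketch using the standard library (the project's actual executor/session class may differ):

```python
from concurrent.futures import ThreadPoolExecutor

MAX_WORKERS = 20


def make_executor(site_count):
    """Never allocate more workers than the cap, regardless of site count."""
    return ThreadPoolExecutor(max_workers=min(site_count, MAX_WORKERS))


# With 300 sites (an arbitrary example number) this creates 20 worker threads
# instead of 300, avoiding the memory pressure described above while barely
# changing total runtime.
executor = make_executor(300)
executor.shutdown()
```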
It is not really useful: people who use this script are on Linux, and all Linux distros come with Python and pip, so there is no need for a script to install them. Also, it only works on Linux.