In data mining projects, a popular way to acquire data is to use automated programs that gather it from the Internet. These programs are colloquially known as bots, spiders, or crawlers, and they work by "web crawling," which is analogous to the "web surfing" that human beings do. The goal of web crawling is to collect and store data from web sites, servers, and other resources available on the World Wide Web. A critical requirement of web crawling is often anonymity: a targeted server that detects undesired requests from a bot may restrict or block its access. To avoid this, bots are typically designed to route their requests through proxy servers, which conceal their identity and behavior. Proxy servers cost money and resources to keep connected to the Internet, so they usually have to be purchased. Purchasing an adequate number of them can be prohibitively expensive, as was the case for a web crawling project that I had conceived of for fun. After doing some research, I discovered that a surprisingly large number of functional, free, and (presumably) publicly open proxy servers are scattered across the Internet. Unfortunately, these proxy servers are not consistently reliable and usually vary dramatically in uptime and latency. With this in mind, finding and tracking suitable public proxies can be a challenge.
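To make the idea of routing requests through a proxy concrete, here is a minimal sketch using only Python's standard library. The function name `build_proxied_opener` and the example address are my own placeholders for illustration, not part of the project described here, and the sketch assumes a plain HTTP proxy.

```python
import urllib.request


def build_proxied_opener(proxy_address: str) -> urllib.request.OpenerDirector:
    """Return an opener that sends HTTP and HTTPS requests via the given proxy.

    proxy_address is a "host:port" string (a hypothetical value in this sketch).
    """
    proxy_url = f"http://{proxy_address}"
    # ProxyHandler maps each URL scheme to the proxy that should carry it.
    handler = urllib.request.ProxyHandler({"http": proxy_url, "https": proxy_url})
    return urllib.request.build_opener(handler)


# Usage (needs a live proxy at this address, so it is left commented out):
# opener = build_proxied_opener("203.0.113.5:8080")
# with opener.open("http://example.com/", timeout=10) as response:
#     body = response.read()
```

A crawler built this way can switch proxies simply by building a new opener, which is what makes a large pool of interchangeable proxies useful.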
To solve this problem, I started a project that uses free and open-source technologies, online search engines, port scanning, and web sites to automatically search for, identify, test, validate, and record data on public proxy servers from all over the Web. My system stores these records in an in-memory cache that is organized by performance metrics acquired through trial-and-error testing. Periodically, these records are synced to a MySQL database for more permanent record keeping and backup. For nearly two years, this system has been used successfully to accurately predict, filter, and select the most suitable free proxy servers for web crawling. Compared with the de facto option of purchasing private proxy servers, my project provides tremendous value: access to thousands of HTTP, SSL, and SOCKS proxy servers at any given moment for the cost of a single virtual private server. Once some in-progress improvements to the system are complete, I plan to publish it as an open-source project, which will hopefully be of significant value to others with similar needs, interests, or endeavors in data mining.