Lots of companies employ some sort of Internet firewall, but schools have a unique obligation to provide more extensive Internet content filtering on their student-use workstations. Content filtering can be applied through a variety of methodologies, and most content filtering technologies combine several of them. Content filtering may be used to block access to pornography, games, shopping, advertising, email/chat, or file transfers, or to Websites that provide information about hatred/intolerance, weapons, drugs, gambling, etc.
The simplest method of providing content filtering is to specify a blacklist. A blacklist is nothing more than a list of domains, URLs, filenames, or extensions that the content filter is to block. If the domain Playboy.com were blacklisted, for example, access to that entire domain would be blocked, including any subdomains or subfolders. In the case of a blacklisted URL, such as en.wikipedia.org/wiki/Recreational_drug_use, other pages of the domain might be available, but that specific page would be blocked. Often wildcards can be employed to block vast sets of domains and URLs with simple entries like *sex*. Blacklisting can also be used to prevent software installations by blocking access to files, such as */setup.exe, or to prevent changes to the computer by blocking potentially harmful file types, like *.dll or *.reg. Since content filters can't yet differentiate between art and porn, many content filters are also configured to block graphic file types, such as *.gif, *.jpg, *.png, etc.
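To make the three kinds of blacklist entry concrete, here is a minimal sketch in Python. It is an illustration under stated assumptions, not how any particular product works: the function name is_blacklisted and the three lists are made up for this example, and the entries are simply the ones mentioned above.

```python
from fnmatch import fnmatchcase
from urllib.parse import urlsplit

# Hypothetical blacklist entries, modeled on the examples in the text.
BLACKLISTED_DOMAINS = {"playboy.com"}
BLACKLISTED_URLS = {"en.wikipedia.org/wiki/recreational_drug_use"}
BLACKLISTED_PATTERNS = ["*sex*", "*/setup.exe", "*.dll", "*.reg"]


def is_blacklisted(url):
    """Return True if the URL matches any blacklist entry."""
    # urlsplit only recognizes the host when a scheme is present.
    if "://" not in url:
        url = "http://" + url
    parts = urlsplit(url)
    host = (parts.hostname or "").lower()
    address = host + parts.path.rstrip("/").lower()

    # A blacklisted domain also blocks its subdomains and subfolders.
    for domain in BLACKLISTED_DOMAINS:
        if host == domain or host.endswith("." + domain):
            return True

    # A blacklisted URL blocks only that specific page.
    if address in BLACKLISTED_URLS:
        return True

    # Wildcard entries are tested against the whole lowercased address.
    return any(fnmatchcase(address, p) for p in BLACKLISTED_PATTERNS)
```

Note how blunt the wildcard is: with these lists, a harmless address such as example.org/essex-county is blocked too, because "essex" contains "sex".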
A whitelist is the opposite of a blacklist: it's a list of resources that the content filter should allow to pass. Like a bouncer at the velvet rope, the content filter blocks any resources not specified on the whitelist. Blacklists and whitelists may be used in conjunction with each other to provide more granular filtering. The blacklist could be used to block all graphic file types, for instance, while the whitelist could be configured to override the blacklist for images coming from specified age-appropriate image hosting services, whether moderated or sponsored. Blacklisting and whitelisting are quick and easy ways to determine whether or not a particular Website should be displayed. Checking a Website against a list isn't processor-intensive, so it can be performed quickly, but it isn't robust either: new Websites are constantly popping up, and there's no way anyone could ever stay on top of adding all of the bad ones to a blacklist.
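The override described above can be sketched in a few lines of Python. Everything named here is an assumption for illustration: the host images.example-school.org is invented, and real filters may order their checks differently.

```python
from fnmatch import fnmatchcase

# Hypothetical lists: block all common image types, but allow images
# from one moderated host (images.example-school.org is a made-up name).
BLACKLIST = ["*.gif", "*.jpg", "*.png"]
WHITELIST = ["images.example-school.org/*"]


def matches_any(address, patterns):
    """Case-insensitive wildcard match of an address against a pattern list."""
    address = address.lower()
    return any(fnmatchcase(address, p) for p in patterns)


def decide(address):
    """Return "allow" or "block" for a scheme-less address (host/path)."""
    if matches_any(address, WHITELIST):
        return "allow"  # the whitelist overrides the blacklist
    if matches_any(address, BLACKLIST):
        return "block"
    # Default-allow for anything on neither list. A strict whitelist-only
    # filter (the bouncer at the velvet rope) would return "block" here.
    return "allow"
```

Checking the whitelist first is what lets images.example-school.org/art/monet.jpg through even though *.jpg is blacklisted.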
So what do we do about that continual stream of new Websites coming online? That's where more advanced filtering methodologies come into play. Parsing can be used to search for particular words or phrases in a Webpage. Rather than rely solely on filtering by address, the content filter downloads the requested Webpage (unless it is immediately blocked by a blacklist) and reads every line of it, scanning for bad words or phrases. A list of bad words or phrases is specified, conceptually like a blacklist, but the entire text of the Webpage must be checked for anything matching the list, which requires more processor time and slows down the serving of Webpages. (In fact, I'm sure that at this very moment there are already a few content filters balking at displaying this very article simply because it includes the word sex in an earlier paragraph, and if that doesn't do it, check out what's coming next...) A typical list of bad words and phrases might include "boobies," but since Web authors are just as interested in getting their content past filters as administrators are in keeping it out, it may also be necessary to include odd-seeming varieties, such as b00bies, boob!es, or boobie$. Filtering may be set to block any pages that include any of the bad phrases, or phrases may be assigned point values and the filter could be set to block any pages that exceed a certain point threshold.
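Point-based phrase filtering is easy to sketch. The Python below is only an illustration: the point values, the threshold, and the function names are assumptions chosen for the example, and a real product ships its own (much longer) phrase list.

```python
import re

# Hypothetical phrase list and point values.
PHRASE_POINTS = {
    "boobies": 5,
    "b00bies": 5,
    "boob!es": 5,
    "boobie$": 5,
    "naked": 2,
}
BLOCK_THRESHOLD = 10  # assumed cutoff


def score_page(text):
    """Sum the points for every occurrence of every listed phrase."""
    lowered = text.lower()
    total = 0
    for phrase, points in PHRASE_POINTS.items():
        # re.escape keeps characters such as ! and $ literal.
        hits = len(re.findall(re.escape(phrase), lowered))
        total += hits * points
    return total


def should_block(text):
    """Block only pages whose score exceeds the threshold."""
    return score_page(text) > BLOCK_THRESHOLD
```

With these values, a page containing only "the naked truth" scores 2 and passes, while a page that uses three of the disguised spellings scores 15 and is blocked. The extra work is also visible here: every phrase is searched for across the full page text, which is why parsing costs more processor time than a list lookup.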
The next methodology of content filtering is called context filtering, and it picks up where word and phrase parsing leaves off. The problem with word and phrase parsing is that it's not very smart. It simply acts upon everything that matches a predefined pattern, without regard for context. It might block pages that include the terms "the naked truth" or "chicken breasts," whereas an administrator might not care about either "naked" or "breasts" in those contexts, but might want to block pages including the words "naked breasts," if used together. Even with point values and thresholds, it's possible for legitimate Webpages to be blocked. For example, a Webpage about breast cancer could easily refer to breasts enough times to exceed a point threshold.
Context filtering is performed through a variety of proprietary algorithms that are designed by the various makers of Internet content filters. The trick is that they need to balance speed and accuracy; they must download and carefully analyze all of the wording of the requested Webpages to determine whether they are acceptable or taboo, and they need to do it quickly enough to remain as transparent as possible to the users. If they're too quick to judge, they may let through unacceptable content (known as "misses") or block acceptable content (known as "false hits"), but if they're too pensive, users will complain about latency. Building a better algorithm requires more time and money, so frequently the faster and more accurate filters cost more.
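Since the real algorithms are proprietary, the following Python is only a toy stand-in to show the idea of looking at context. It assumes one very simple rule that I made up for illustration: a pair of words counts only when the two appear within a couple of words of each other. The names FLAGGED_PAIRS, WINDOW, and has_flagged_context are hypothetical.

```python
import re

# Hypothetical rule: a pair of words is a problem only when the two
# appear within WINDOW words of each other.
FLAGGED_PAIRS = [("naked", "breasts")]
WINDOW = 2


def tokenize(text):
    """Split text into lowercase words, dropping punctuation."""
    return re.findall(r"[a-z0-9']+", text.lower())


def has_flagged_context(text):
    """Return True if both words of any flagged pair occur close together."""
    words = tokenize(text)
    for first, second in FLAGGED_PAIRS:
        first_positions = [i for i, w in enumerate(words) if w == first]
        second_positions = [i for i, w in enumerate(words) if w == second]
        for i in first_positions:
            if any(abs(i - j) <= WINDOW for j in second_positions):
                return True
    return False
```

Under this rule, "the naked truth" and "chicken breasts" both pass, as does a breast cancer page that never pairs the two words, while "naked breasts" is caught. Commercial filters weigh far more signals than word proximity, which is where the speed-versus-accuracy trade-off comes from.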
Just for the sake of completeness in this treatise on Internet content filtering, I should also mention that there may be other methodologies employed or configurable in various Internet content filtering solutions. Virtually all Internet content filters operate on port 80 (HTTP); most ignore other protocols, but some may be able to apply filtering to other ports, or may be capable of entirely blocking specified ports, such as those used by FTP or Telnet. (I wonder which port "World of Warcraft" uses...)
I should also point out that, like firewalls, Internet content filters come as hardware or software solutions. Hardware solutions are commonly known as "appliances," and software solutions are commonly known as "applications" or "services." Hardware solutions provide for centralized administration. They may cost more, but they perform all of the filter-related processing, relieving your servers and workstations of that responsibility. They frequently come with subscription services for updates to the blacklist, whitelist, phrase list, and context data, much like antivirus subscriptions provide updates to lists of virus signatures. They may be multi-homed pass-through gateways, or they may work by redirecting traffic to a specified port or destination IP address. Higher-end models may also include caching to speed up the serving of frequently-accessed resources.
Software-based solutions may be server-based or may be installed on each individual workstation. Most server installations offer the same centralized administration as hardware solutions, but of course, they use your processor and RAM to perform the filtering, rather than running on a dedicated appliance. Consequently, they may be less expensive. In the case of a workstation installation, besides installing the software on each individual workstation, you may also need to configure each workstation individually, and periodically you may need to update each one individually as well.
Even Microsoft Internet Explorer has a free, simple, built-in Internet content filter - it's called the "Content Advisor," and you can configure it under Internet Options in the Windows Control Panel. It's fine for your kid's standalone computer or a small peer-to-peer network, but is probably inadequate as an enterprise solution. Whether hardware- or software-based, best-in-class enterprise solutions are often Active Directory-integrated, simplifying administration and configuration, and permitting filtering settings to follow users anywhere in the network. Teachers, for instance, could have less-restrictive settings, regardless of where they log in, while students could still be blocked, even if they sneak into the faculty lounge during recess.
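The idea of settings that follow the user rather than the workstation can be sketched as a simple lookup. This Python example makes no actual Active Directory calls; the group names, accounts, category sets, and function names are all made up, and a real deployment would read group membership from the directory service instead of a dictionary.

```python
# Hypothetical policies: the set of categories blocked for each group.
GROUP_POLICIES = {
    "Teachers": {"pornography", "gambling"},
    "Students": {"pornography", "gambling", "games", "shopping", "email/chat"},
}
# Unknown users fall back to the strictest policy (an assumption).
DEFAULT_BLOCKED = GROUP_POLICIES["Students"]

# Made-up accounts standing in for directory group membership.
USER_GROUPS = {"mrs_smith": "Teachers", "jimmy": "Students"}


def blocked_categories(username):
    """Look up the blocked categories for a user's group."""
    group = USER_GROUPS.get(username)
    return GROUP_POLICIES.get(group, DEFAULT_BLOCKED)


def is_allowed(username, category, workstation):
    """Decide by user only; the workstation is deliberately ignored."""
    return category not in blocked_categories(username)
```

Here is_allowed("jimmy", "games", "faculty-lounge-pc") is False while is_allowed("mrs_smith", "games", "library-pc-01") is True: the decision depends on who is logged in, not on where.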
Maverick Solutions IT, Inc has experience with several Internet content filtering technologies. Our consultancy recommends, installs, and maintains Internet content filtering solutions for our clients. If you have any questions or need any help making a decision regarding your Internet content filtering needs, please contact us.
Brian Blum is the founder, president, and chief consultant at Maverick Solutions IT, Inc. He's a firm believer in freedom of expression, but recognizes the need for Internet content filtering in limited circumstances. You can read more of his free expression at his blog, Maverick Ramblings.
Brian Blum
Maverick Solutions IT, Inc