Wednesday, June 16, 2010

Suffix/prefix expressions in Google Safe Browsing

Google Safe Browsing does not hash whole URLs. It hashes host-suffix/path-prefix expressions derived from the URLs on its blacklist and malware list.

To match a URL such as http://www.google.com/header/x.html, you try every combination of host suffix and path prefix:
www.google.com/
www.google.com/header/
www.google.com/header/x.html
google.com/
google.com/header/
google.com/header/x.html
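
The combinations above can be generated mechanically. Below is a minimal sketch in Python, assuming the limits I remember from the Safe Browsing documentation (the exact host plus up to four suffixes taken from the last five host components, and the exact path plus up to four prefixes starting at the root). The function name lookup_expressions is my own, and the sketch skips URL canonicalization and IP-address hosts.

```python
from urllib.parse import urlsplit


def lookup_expressions(url):
    """Return host-suffix/path-prefix expressions for a URL (simplified sketch)."""
    parts = urlsplit(url)
    host = parts.hostname or ""
    path = parts.path or "/"

    # Host candidates: the exact host, then suffixes built from the last
    # five labels, dropping the leading label each time. The bare TLD is
    # never used on its own.
    hosts = [host]
    tail = host.split(".")[-5:]
    for i in range(len(tail) - 1):
        candidate = ".".join(tail[i:])
        if candidate not in hosts:
            hosts.append(candidate)

    # Path candidates: exact path with query, exact path without query,
    # then up to four directory prefixes starting from "/".
    paths = []
    if parts.query:
        paths.append(path + "?" + parts.query)
    paths.append(path)

    segments = [s for s in path.split("/") if s]
    dirs = segments if path.endswith("/") else segments[:-1]
    prefixes = ["/"]
    current = "/"
    for seg in dirs:
        current += seg + "/"
        prefixes.append(current)
    for p in prefixes[:4]:
        if p not in paths:
            paths.append(p)

    return [h + p for h in hosts for p in paths]


if __name__ == "__main__":
    for expr in lookup_expressions("http://www.google.com/header/x.html"):
        print(expr)
```

For the example URL this produces the three google.com expressions together with the same three paths under the full host www.google.com.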

In the original design the client downloads only a 4-byte hash prefix for each expression. When a prefix matches, the client contacts the Google server again to download the full 32-byte hash and compares against that.
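
The two-step check can be sketched as follows. This assumes the 32-byte hash is SHA-256 (which produces 32-byte digests) and that the 4-byte value is its leading prefix. The names full_hash, is_listed, and fetch_full_hashes are hypothetical: fetch_full_hashes stands in for the second round trip to the server and is not a real API.

```python
import hashlib


def full_hash(expression):
    """32-byte digest of one host-suffix/path-prefix expression (SHA-256 assumed)."""
    return hashlib.sha256(expression.encode("utf-8")).digest()


def is_listed(expressions, local_prefixes, fetch_full_hashes):
    """Check expressions against locally stored 4-byte prefixes.

    local_prefixes: set of 4-byte prefixes downloaded in advance.
    fetch_full_hashes: hypothetical callback that asks the server for the
        full 32-byte hashes sharing a given prefix.
    """
    for expr in expressions:
        digest = full_hash(expr)
        prefix = digest[:4]
        if prefix not in local_prefixes:
            continue  # no local hit, so no network request is needed
        # A prefix hit alone is not proof; confirm with the full hash.
        if digest in fetch_full_hashes(prefix):
            return True
    return False
```

Storing only prefixes keeps the local list small, and the server is contacted only for the rare URLs whose prefix matches.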

Thursday, June 3, 2010

URL categorization

Some resources to help collect URL categorization information:

Google top 1000

Alexa top 1,000,000 sites

URLblacklist

K9 Web Protection

SquidGuard

DansGuardian

DansGuardian keeps its regular expressions for content filtering under the folder:
configs/lists