Google Safe Browsing hashes host-suffix/path-prefix expressions for the URLs on its blacklist and malware list.
To check a URL such as http://www.google.com/header/x.html, the client tries every combination of host suffix and path prefix:
www.google.com/
www.google.com/header/
www.google.com/header/x.html
google.com/
google.com/header/
google.com/header/x.html
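The combinations can be generated mechanically. Below is a minimal sketch, not Google's implementation: the function name lookup_expressions is made up for illustration, the limits (at most five host suffixes, the exact path plus up to four directory prefixes) follow my reading of the published lookup rules and should be treated as assumptions, and URL canonicalization and IP-address hosts are not handled. Note that the full host www.google.com counts as a host suffix too.

```python
from urllib.parse import urlsplit


def lookup_expressions(url):
    """Return host-suffix/path-prefix expressions for an http(s) URL.

    Illustrative sketch only: no canonicalization, no IP-address handling.
    """
    parts = urlsplit(url)
    host = parts.hostname or ""
    path = parts.path or "/"

    # Host suffixes: the exact host first, then suffixes of the last five
    # labels, dropping one leading label at a time (never the bare TLD).
    hosts = [host]
    tail = host.split(".")[-5:]
    for i in range(len(tail) - 1):
        candidate = ".".join(tail[i:])
        if candidate not in hosts:
            hosts.append(candidate)

    # Path prefixes: exact path with query (if any), exact path without
    # query, then "/" and up to three directory prefixes with trailing slash.
    paths = []
    if parts.query:
        paths.append(path + "?" + parts.query)
    paths.append(path)
    directories = path.split("/")[1:-1]
    prefix = "/"
    prefixes = [prefix]
    for directory in directories[:3]:
        prefix += directory + "/"
        prefixes.append(prefix)
    for candidate in prefixes:
        if candidate not in paths:
            paths.append(candidate)

    return [h + p for h in hosts for p in paths]


if __name__ == "__main__":
    for expression in lookup_expressions("http://www.google.com/header/x.html"):
        print(expression)
```

For the example URL this yields six expressions: three path prefixes for each of the two host suffixes.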
The original design downloads only a 4-byte prefix of each hash; when a prefix matches, the client contacts the Google server again to download the full 32-byte hash.
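The two-step check can be sketched as follows. This is an illustration under stated assumptions: SHA-256 is assumed because it produces a 32-byte digest, and fetch_full_hashes is a hypothetical callable standing in for the second request to the server; it is not a real API.

```python
import hashlib


def full_hash(expression):
    """32-byte digest of an expression (SHA-256 is an assumption here)."""
    return hashlib.sha256(expression.encode("utf-8")).digest()


def hash_prefix(expression, length=4):
    """First `length` bytes of the digest, as stored in the local list."""
    return full_hash(expression)[:length]


def is_listed(expression, local_prefixes, fetch_full_hashes):
    """Check an expression against local prefixes, then the full hashes.

    `fetch_full_hashes(prefix)` is a hypothetical stand-in for the server
    round trip; it must return a collection of 32-byte digests.
    """
    digest = full_hash(expression)
    prefix = digest[:4]
    if prefix not in local_prefixes:
        return False  # no local match, so no server request is made
    # A prefix match may be a collision, so confirm with the full hash.
    return digest in fetch_full_hashes(prefix)


if __name__ == "__main__":
    listed = "google.com/header/"
    local = {hash_prefix(listed)}

    def fake_server(prefix):
        # Pretend server: returns the full hashes sharing this prefix.
        return {full_hash(listed)}

    print(is_listed(listed, local, fake_server))
    print(is_listed("example.org/", local, fake_server))
```

Storing only the prefixes keeps the downloaded list small, at the cost of one extra request whenever a prefix matches.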
Wednesday, June 16, 2010
Thursday, June 3, 2010
URL categorization
Some resources to help collect URL categorization information:
Google top 1000
Alexa 1,000,000 top sites
URLblacklist
K9 Web Protection
SquidGuard
DansGuardian
DansGuardian keeps its content-filter regular expressions in the folder:
configs/lists
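As a rough idea of how such regular-expression lists could be reused for URL categorization, here is a small sketch. The list format (one regular expression per line, lines starting with # treated as comments) is an assumption and is not taken from the DansGuardian documentation; the sample patterns and the names load_patterns and is_blocked are made up for illustration.

```python
import re


def load_patterns(lines):
    """Compile one regular expression per non-empty, non-comment line."""
    patterns = []
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blanks and comments (assumed list format)
        patterns.append(re.compile(line, re.IGNORECASE))
    return patterns


def is_blocked(url, patterns):
    """Return True if any pattern matches somewhere in the URL."""
    return any(p.search(url) for p in patterns)


if __name__ == "__main__":
    # Hypothetical list contents, not copied from any real list file.
    sample_list = [
        "# block executable downloads",
        r"\.exe$",
        "",
        r"/adserver/",
    ]
    patterns = load_patterns(sample_list)
    print(is_blocked("http://example.com/setup.exe", patterns))
    print(is_blocked("http://example.com/index.html", patterns))
```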