Short clips: Technorati on sifting through splogs

Dorion Carroll, vice president of engineering for Technorati, discusses the challenges inherent in trying to index the growing blogosphere. Because the company grew right along with it, they were able to evolve defenses, like keywords and posting heuristics, against the onslaught of spam blogs.

Sumi Das: And what about these sites that you don't want your users to really be bombarded with? You know, we are talking about spam blogs, the splogs, the scraper sites that people aren't particularly interested in. How does your technology filter those out?

Dorion Carroll: With the advent of Yahoo! Pipes, with lots of crawlers, with RSS, it's really easy to fabricate sites. Pump up keywords, do some simple word substitutions, plop AdSense on it, and make money on other people's content without actually giving anything back. We definitely want to weed those out. Some of the things that we've done, and I think this has been part of our advantage, having grown up with the blogosphere, is as those problems started to surface, we were able to grow our defenses against them.

We have a number of defenses right up front, and one of the things that's interesting about blogs is we don't have to try to go guess where the blog updates are. Blogs ping. When you hit "publish" on your blog post, it sends a message out, and there are a number of services that aggregate pings, basically saying, "Here's a site that says it's changed." It doesn't say what's changed. It doesn't say whether anything actually has changed. You then have to go look at the site, compare it to the last time you saw it, and decide what you want to do with that.

Over 95 percent of the pings that we process today are from known spam sources, known to us as spam. All we've ever seen from there is spam. We don't want that stuff, and we can drop that on the floor.

But a lot of spam still gets to the next line of defense. We then have Bayesian filters, we do keywords, we look at a number of different heuristics, such as posting frequencies. If you are seeing many, many posts per minute, it's not a human being. So there are definitely signatures you can look for.
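
The layered checks Carroll describes can be sketched roughly as follows. This is only an illustration under stated assumptions, not Technorati's actual pipeline: the class name `PingFilter`, the `fetch` callback, the SHA-256 content comparison, and the threshold of five posts per minute are all hypothetical, and the Bayesian and keyword filters he mentions are left out for brevity.

```python
import hashlib
from dataclasses import dataclass, field

# Hypothetical threshold: the interview only says "many, many posts per minute".
MAX_POSTS_PER_MINUTE = 5


@dataclass
class PingFilter:
    """Toy model of the layered defenses described in the interview."""

    known_spam_sources: set = field(default_factory=set)
    last_seen_hash: dict = field(default_factory=dict)

    def handle_ping(self, url, fetch, posts_last_minute):
        # First line of defense: drop pings from sources that have
        # only ever produced spam.
        if url in self.known_spam_sources:
            return "dropped"

        # A ping only claims that the site changed, so fetch the page
        # and compare it with what was seen last time.
        digest = hashlib.sha256(fetch(url).encode("utf-8")).hexdigest()
        if self.last_seen_hash.get(url) == digest:
            return "unchanged"
        self.last_seen_hash[url] = digest

        # Next line of defense: a posting-frequency heuristic. Many posts
        # per minute is unlikely to be a human author.
        if posts_last_minute > MAX_POSTS_PER_MINUTE:
            return "suspect"

        return "index"
```

In this sketch, `fetch` is any function that returns the page text for a URL, so the example runs without network access. A real system would presumably compare pages more robustly than with an exact hash, since small template changes would otherwise count as updates.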

==== Transcribed by Automatic Sync Technologies ====