Technorati VP of engineering: Dorion Carroll

Dorion Carroll, vice president of engineering for Technorati, talks to ZDNet correspondent Sumi Das about the challenges with scaling operations as the blogosphere continues to grow. He also discusses how they're able to index millions of blog posts in near real time, surviving the economic downturn and what differentiates the company from its biggest competitor, Google.

Sumi Das: Dorion, thanks for taking the time to speak with us today.

Dorion Carroll: Thank you for having me here.

Sumi Das: Technorati describes itself as a blog search engine. But what exactly does that mean? What does Technorati do?

Dorion Carroll: Well, Technorati is a blog search engine. We originated with the blogosphere, and we index blog posts as they happen in real time--which means the crawler goes out, grabs those posts, brings them in, and, like a typical search engine like Google or something like that, we make those things available. I think what makes Technorati unique is the ability to deliver that in near-real time. But Technorati is becoming something much more than simply a blog search engine. We've become a discovery engine--search and discovery kind of go hand in hand. From all of the blogs that we're pulling together, we're able to surface many of the things that are gaining attention right now--the stuff that people really want to be able to read, and what's happening right now. As opposed to being everything in a definitive library, it's more a question of "What's the conversation going on right now?"

Sumi Das: There's no shortage of content on the net, and there really are no barriers to the blogosphere. Pretty much anybody can have a blog. So, there's a lot of stuff out there that perhaps isn't worthy of people's time. How do you separate the wheat from the chaff? How do you make sure that you have the best? Are you simply using a sophisticated algorithm?

Dorion Carroll: Well, we're using an algorithm, but actually one of the things we're able to do is use the data from the blogosphere itself. We have a metric--an identifier called "Technorati Authority". Technorati Authority is the measure of the number of blogs that have linked to your blog in the last six months. So, you might liken that to the number of viewers of a specific TV show--not each individual episode, but over the entire season. And it's not how many people have watched how many times, it's just that total number of people. So, that number actually is an interesting measure of attention. Who is paying attention to whom? With that we can actually find who the most authoritative bloggers are, and from that we have the Technorati Top 100, which we've had on the site for a very long time, at least as long as I've been there. What it does is show in real time those sites that are gaining the most attention, and have gained the most attention over the last six months. We're able to use that to filter results, to use that as a weight of influence and to use that in other algorithms, such as in the Percolator, to identify where the influencers are paying attention, and where the rest of the blogosphere en masse is paying attention.
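Editor's illustration: the authority measure described above (distinct blogs linking to a blog within the last six months, with repeat links from the same blog counted once) can be sketched roughly as follows. This is a minimal sketch, not Technorati's implementation. The function name `authority`, the `(source, target, timestamp)` record layout, the 182-day window, and the choice to ignore self-links are all assumptions made for the example.

```python
from datetime import datetime, timedelta


def authority(links, target, now, window_days=182):
    """Count distinct blogs that linked to `target` within the window.

    `links` is an iterable of (source_blog, target_blog, timestamp) tuples.
    The record layout and the 182-day window are assumptions for this sketch.
    """
    cutoff = now - timedelta(days=window_days)
    sources = {
        src
        for (src, dst, when) in links
        # Ignoring self-links is an assumption, not something the interview states.
        if dst == target and src != target and when >= cutoff
    }
    return len(sources)


now = datetime(2008, 10, 1)
links = [
    ("a.example", "target.example", datetime(2008, 9, 1)),
    ("a.example", "target.example", datetime(2008, 9, 15)),  # same blog again, counted once
    ("b.example", "target.example", datetime(2008, 6, 1)),
    ("c.example", "target.example", datetime(2007, 12, 1)),  # older than the window
    ("target.example", "target.example", datetime(2008, 9, 20)),  # self-link
]
print(authority(links, "target.example", now))
```

Using a set of source blogs is what makes this a count of who is paying attention rather than of how many times they linked, which matches the "viewers over the season" analogy.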

Sumi Das: What sets Technorati apart technologically from its competition, like Techmeme and Google's Blog Search?

Dorion Carroll: I would have to go back to the real-time aspect of it. I think for some reason that is one nut that we have really been able to crack. Other folks will definitely be able to see that there are recent posts. I know FriendFeed and Twitter do a really good job dealing with the "now" web or the "live" web. We still index millions and millions of blog posts, and we do that in real time. There's an awful lot that we do that can take advantage of that breaking news, the real-time aspects, and the real-time calculations of Technorati Authority so that as your blog is gaining attention, you can actually rise up in the ranks. Google Blog Search? It's Google. I mean, they're the 80 thousand pound gorilla. We have to be aware of what they're doing. They're not all focused on blog search; they're not all focused on the types of social-media-directed ad networks. They're definitely somebody to keep an eye on.

Sumi Das: The technology doesn't always work the way you want it to, though. Technorati has faced criticism for technical problems--index outages, search results that are stale. How do you handle that?

Dorion Carroll: We've had outages. We've blogged about them. We've tried to share a little bit of the pain that we felt. We've grown tremendously. When I started four years ago, we were indexing about 3.5 million blogs. Today, cumulatively over the five years that the company has been in existence, we've indexed over 130 million. A lot of those blogs are dormant. They're not active anymore. The active blogosphere is probably between 15 and 30 million blogs, and that's really what makes up the core of what we're trying to do. Now, we have all the rest of this data hanging around. That's been part of the challenge for us: how can we scale our architecture to deal with this massive amount of data, while at the same time being able to serve up sub-second response time queries for things that are less than a minute old? These are the kinds of challenges that, I'm very proud to say, the team has been able to address. I think we've probably gone through five major architectural overhauls in the four years that I've been there. It's like changing the tires on a speedster while you're racing cross-country. It's a non-trivial problem. Technorati is a small company. We're about 38 people right now. Only half of those people are in the technical organization, so it's an amazingly talented group of people that are trying to tackle a really large problem. I can't say we do it a hundred percent of the time, but we really try our hardest.

Sumi Das: What about these sites that you don't want your users to really be bombarded with? We're talking about spam blogs, the "splogs", the scraper sites that people aren't particularly interested in. How does your technology filter those out?

Dorion Carroll: With the advent of Yahoo! Pipes, with lots of crawlers, with RSS... It's really easy to fabricate sites, pump up keywords, do some simple word substitutions, plop AdSense on it and make money on other people's content without actually giving anything back. We definitely want to weed those out. Some of the things that we've done--and I think this has been part of our advantage, having grown up with the blogosphere--is as those problems started to surface, we were able to grow our defenses against them. We have a number of defenses right up front. One of the things that is interesting about blogs is we don't have to try to go guess where the blog updates are. Blogs ping. When you hit "publish" on your blog post, it sends a message out. There are a number of services that aggregate pings, basically saying, "Here is a site that says it's changed." It doesn't say what's changed. It doesn't say whether anything actually has changed. You then have to go look at the site, compare it to the last time you saw it, and decide what you want to do with that. Over 95 percent of the pings that we process today are from known spam sources, known to us as spam. All we've ever seen from there is spam. We don't want that stuff, and we can drop that on the floor. But a lot of spam still gets to the next line of defense. We then have Bayesian filters, we do keywords, and we look at a number of heuristics, such as posting frequencies. If you are seeing many, many posts per minute, it is not a human being. So there are definitely signatures that you can look for.
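Editor's illustration: two of the defenses mentioned above, dropping pings from known spam sources and flagging sites that post faster than a human plausibly could, might look something like the sketch below. It is only a sketch under stated assumptions. The function names, the blocklist contents, and the threshold of five posts per minute are hypothetical, and the Bayesian and keyword filters mentioned in the answer are not shown.

```python
def drop_known_spam(pinged_sites, spam_sources):
    """First line of defense: discard pings from sites already known as spam."""
    return [site for site in pinged_sites if site not in spam_sources]


def posts_too_fast(timestamps, max_per_minute):
    """Return True if any 60-second window holds more than max_per_minute posts.

    `timestamps` are post times in seconds. The threshold is a hypothetical
    tuning knob, not a figure from the interview.
    """
    times = sorted(timestamps)
    start = 0
    for end, t in enumerate(times):
        # Slide the window start forward until it spans less than 60 seconds.
        while t - times[start] >= 60:
            start += 1
        if end - start + 1 > max_per_minute:
            return True
    return False


KNOWN_SPAM = {"spam.example"}  # hypothetical blocklist
pings = ["blog.example", "spam.example", "other.example"]
print(drop_known_spam(pings, KNOWN_SPAM))

burst = [0, 5, 10, 15, 20, 25]   # six posts in 25 seconds
hourly = [0, 3600, 7200]         # one post an hour
print(posts_too_fast(burst, 5), posts_too_fast(hourly, 5))
```

Checking the blocklist first keeps the cheap test ahead of the more expensive ones, which fits the description of most pings being dropped before any further filtering.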

Sumi Das: It is a tumultuous time for the economy right now. How is this instability affecting Technorati?

Dorion Carroll: Well, Technorati is a media business. We make our money with branded advertising and with other types of advertising product. So, as marketers are looking at their budgets and trying to figure out what to do with their advertising dollars, there was a little bit of a hold back in late Q3. Actually Q4 will probably be the biggest quarter for Technorati ever. I think we need to be mindful of the fact that there is a financial crisis going on. At the same time, we are a business. We have to be able to adjust our strategies a little bit, and I think we are doing that well. Fundamentally, if we think about it, advertisers still have to be able to reach an audience. In Q4, it is about e-commerce, it's about retail. All of these people that are worried about finance have to be able to get their message out. Well, online is definitely still growing. There may be some downward revisions in terms of the growth, but fundamentally online advertising is still growing.

Sumi Das: You talked about how you are indexing all of those blog posts. How do you ensure that the data centers that you've designed and built are efficient from a cost perspective and also from an energy perspective?

Dorion Carroll: Right. It's really interesting. Again, we are a very small company. At the moment, we have a single data center. People may have the impression that we are as big as Google. Well, we are not. We're not nearly as big as Google. Having a single data center is definitely a disadvantage in some ways. But in terms of how we balance the cost versus the benefit, that's where we are right now. Interestingly, we have been looking at next generations for several of our core components within the infrastructure. We have our data acquisition systems--the things that go out and crawl the web, bring things back, check to see whether blog posts have changed, and if they have, pass them on to the indexing infrastructure. And then we have our website infrastructure as well. So there are these three main components. Our architect, Ian Kallen, has actually advanced us considerably in terms of how to get out of a single data center. We are taking a look at Amazon Web Services. We launched a new crawler in the middle of August. And we are in the process of slowly migrating from the old infrastructure to the new infrastructure. So, we actually have to be able to run things in parallel. This is one of the tricks you use when you are trying to change the tires while you are driving cross-country. You can't necessarily do everything at once. There is no "big switch" that you can just pull and automatically switch over. Using Amazon, we are actually discovering some very interesting capabilities--the elastic computing framework, which says, "If a whole bunch of blogs are posting..." And now we can grow the number of servers we have automagically, spawn a bunch of new servers. On the weekends--it turns out that bloggers don't blog as much on the weekends--we can actually tune that down. Those are cost savings to us. We don't have to go into our own data center. We can use APIs. We are writing software to detect these things automatically, so that if we have queues that are starting to fill up, we can launch more machines and drain those queues faster. If the queues are empty and nothing seems to be processing, we can actually take some machines down. For us there is actually a balance between our own data center--we operate about 700 machines to keep Technorati.com running--and then we have some of the elastic capabilities up in the Amazon cloud.
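Editor's illustration: the queue-driven scaling idea described above (launch machines when queues fill up, take them down when queues are empty) can be reduced to a small decision function. This is a hedged sketch, not Technorati's software. The function name `desired_workers`, the target of 1,000 queued items per worker, and the minimum and maximum of 1 and 50 workers are all assumptions, and the actual calls that would launch or terminate cloud servers are deliberately left out.

```python
def desired_workers(queue_depth, per_worker=1000, min_workers=1, max_workers=50):
    """Return how many workers to run for the current queue depth.

    per_worker, min_workers and max_workers are hypothetical tuning values.
    """
    needed = -(-queue_depth // per_worker)  # ceiling division without floats
    # Clamp so we never scale to zero or beyond the assumed budget cap.
    return max(min_workers, min(max_workers, needed))


# A weekday burst, a moderate backlog, and a quiet weekend queue.
for depth in (1_000_000, 2500, 0):
    print(depth, desired_workers(depth))
```

A control loop would call this periodically, compare the result with the number of machines currently running, and start or stop the difference. Keeping the decision separate from the provisioning calls makes it easy to test without touching any cloud account.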

Sumi Das: Dorion, thanks for sitting down and spending some time with us today.

Dorion Carroll: Thank you, Sumi.

Sumi Das: I've been speaking with Dorion Carroll, VP of Engineering at Technorati. For CIO Sessions, I'm Sumi Das. Thanks for watching.