
Facebook VP of technical operations: Jonathan Heiliger

Jonathan Heiliger, vice president of technical operations at Facebook, speaks to CNET News.com's Dan Farber about the balancing act between innovating quickly and building a stable infrastructure at a company moving at breakneck speed. Heiliger also discusses what he's doing to scale data center operations and support the addition of more than 250,000 customers on a daily basis.

Dan Farber: Jonathan, thanks for joining me.

Jonathan Heiliger: Thanks for having me, Dan.

Dan Farber: Now, you've been at Facebook, I think, for about a year, and it's been quite a ride, I guess, scaling up from zero in 2004 to over 70 million today. So how do you keep up with that hypergrowth?

Jonathan Heiliger: Well, you're absolutely right, we've had a lot of growth. We add over 250,000 users every day, and that means a lot of infrastructure, a lot of servers, and constantly looking at new processes and at how we're doing things, ensuring that we're doing them in the most efficient way possible, not just to deliver all the content to our users but to stay on top of what it costs to run the site.

Dan Farber: In terms of staying on top of what it costs, obviously that's got to be a big issue, in that you're spending a couple hundred million dollars on infrastructure, as I understand. So how do you stay on top of the cost in terms of the kind of equipment you buy and how you work with the vendors? How do you prioritize those things?

Jonathan Heiliger: Well, one of the things we recently did was run an RFP process for the servers we buy from vendors. We essentially did a bake-off with a number of different people, and also looked at building servers on our own. What we concluded from that process was to continue to buy servers from a couple of major OEMs, but through that process we were able to lock in prices today and carry those prices forward as the costs of all the commodity components in the servers drop as well.

Dan Farber: Oh so doing a little futures.

Jonathan Heiliger: Exactly.

Dan Farber: Now when you're buying those servers, and I assume you're doing just huge scale outs of commodity servers, what do they look like? How are they configured?

Jonathan Heiliger: We're pretty lucky in that we run a wide variety of applications, literally tens of applications on our own and hundreds of applications for our platform developers, who use Facebook as a distribution mechanism and as a way of interacting with their users. But one of the reasons we're very lucky is that our engineering team has selected PHP as the primary development language, and that allows us to use a fairly generic server type. So we, with a couple of exceptions, have three sort of main server types and run a fairly homogeneous environment. That obviously allows us to consolidate our buying power, and it allows us to plan further into the future because we can buy servers on a cabinet basis rather than on an individual basis.

Dan Farber: You're different from Google in the kinds of applications that you run. They're mostly running search queries, and you're running all kinds of queries and bringing back all kinds of data from the social graph. So how is it different in terms of the way you build out your data center from the inside?

Jonathan Heiliger: So Google has a tremendous amount of information that they index and archive and present to users, but fundamentally, if you go to Google and type in a search for "tiger" and I go to Google and type in a search for "tiger," we're going to see generally the same results, so they're presenting that same information to both of us. Facebook is a little different in that the context of our data is all social. When you look at your friends and their status updates and their photos and the notes they may have written, you're going to see one set of data, versus if I look at my friends and their photos and their notes and status updates, and those tend to be non-intersecting sets of data.

Dan Farber: So it's much more dynamic?

Jonathan Heiliger: It's a much more dynamic data set, and what that means is it's caused us to do a bunch of different things relative to caching and relative to federating all of that data amongst thousands of different databases, so that as a user requests all of that information we're not hitting one particular server every time for different data.
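To make the idea of caching plus federating data across many databases concrete, here is a minimal Python sketch. It is an illustration only and not a description of Facebook's actual system: the `ShardedStore` class, its method names, the hash-based placement, and the in-memory dictionaries standing in for database servers and the cache are all assumptions made for the example.

```python
import hashlib


class ShardedStore:
    """Hypothetical store: user data spread over many databases, with a read cache."""

    def __init__(self, num_shards):
        # Each dict stands in for one database server (an assumption for this sketch).
        self.shards = [dict() for _ in range(num_shards)]
        self.cache = {}
        self.db_reads = 0  # counts reads that had to go to a shard

    def shard_for(self, user_id):
        # A stable hash, so the same user always maps to the same shard.
        digest = hashlib.md5(str(user_id).encode("utf-8")).hexdigest()
        return int(digest, 16) % len(self.shards)

    def write(self, user_id, value):
        self.shards[self.shard_for(user_id)][user_id] = value
        # Drop any stale cached copy so the next read sees the new value.
        self.cache.pop(user_id, None)

    def read(self, user_id):
        # Serve from the cache when possible; otherwise ask the owning shard.
        if user_id in self.cache:
            return self.cache[user_id]
        self.db_reads += 1
        value = self.shards[self.shard_for(user_id)].get(user_id)
        self.cache[user_id] = value
        return value

    def friends_feed(self, friend_ids):
        # Each viewer has a different friend list, so each request fans out
        # to a different set of shards instead of one shared server.
        return {fid: self.read(fid) for fid in friend_ids}
```

Because each viewer's friend list is different, a call such as `store.friends_feed([1, 2, 3])` touches whichever shards own those three users, and repeated requests for the same friends are answered from the cache.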

Dan Farber: So tell me a little bit more about your distributed architecture and how you're able to deliver really good performance and I've seen it improve over the last few months, when you've got so many databases, so many applications?

Dan Farber: Now, you recently introduced a chat application on Facebook, and it seems like it took a lot of time to test it to make sure it could scale, having all those simultaneous conversations going on. Could you give us a little background and color on how that came to be?

Jonathan Heiliger: Yeah, so Chat is, as you said, one of our most recent launches. It started as a hack-a-thon project, which is one of the things we do on about a monthly or every-other-month basis: people get together, work all night, and pick a project they don't necessarily have time to do during the day. So it started as this hack-a-thon project, and over the period of a couple of weeks, and really probably a couple of months from the time it germinated as an idea to the time it launched and was available for our entire user base, it became a more formal development project. One of the things we did as part of that was build a new back-end service to be able to deal with all of the millions of simultaneous connections that we persist for our users.

Dan Farber: One other thing: I was reading up on some of the work you've been doing, and you say that clouds don't solve single points of failure in your stack. So we're talking about Facebook being a huge cloud-based application. What are those single points of failure?

Jonathan Heiliger: Interesting question. The notion you are referring to there was part of a talk I give about the idea that cloud computing is a panacea. For a startup, or even a more mature startup like Facebook, it isn't the answer to solving failure points in an application. By that I mean the underlying infrastructure that powers an application is typically the result of, or the outcome of, how the application is originally designed and how users interact with that application. So if an application is poorly designed, or designed to constantly reference a single set of data, the underlying infrastructure is going to be the victim of that. The guys like myself in sort of the infrastructure world, we have to figure out how to best make that work.

Dan Farber: Well as someone who is in operations how much impact do you have on the application development to make sure that once it gets into the data center that it can work properly and scale and not have the kind of failures we're seeing with some of the new applications?

Jonathan Heiliger: Absolutely, and I think it's a constant challenge in any organization, particularly a fast-moving one like Facebook, where we want to iterate quickly and get product out into our customers' hands so we can get feedback on that product and continue to tweak it and enhance it over time. So we have one force that's moving in that direction, and we have another force that says we want to keep the site up, we want the site to be reliable, and we want the site to be fast. So there's a fine balancing act that everyone in management and everyone in both the engineering and operations departments constantly works on, going back and forth and figuring out just how to make those tradeoffs. Sometimes we err too aggressively on the side of innovation and iteration and put things out on the site, in perhaps a small quantity, that may break the site or cause the site to slow temporarily. Other times we err on the side of conservatism, of not releasing a new piece of functionality or a new feature, and that then delays the sort of user gratification of having that feature or fixing that bug.

Dan Farber: What are the challenges that you see? Let's say you're at 70 million uniques, 250,000 being added per day and 50,000 per second. What happens when you get to 500 million or a billion, if you ever get there?

Jonathan Heiliger: Hopefully, tremendous things. I think we can only look forward to those days.

Dan Farber: But what are some of the bottlenecks or barriers you have to overcome to get to that kind of scale?

Jonathan Heiliger: So some of the bottlenecks we're facing are in how we scale this extremely distributed set of data. One of the challenges we have is figuring out how to replicate it so that it can exist in multiple places around the world and we don't always have to bring users back to the U.S. or back to one of our data centers. I think it's a challenge that most Web sites tend to face as they scale: you start in one location with a single database, then you have to figure out how to grow from there, and that's primarily driven by the amount of latency, or the amount of time it takes to reach the site and interact with the site. So being able to replicate the data across multiple data centers and multiple geographies, to allow users to not just read their data from a local version but write that data as well, is one of our key challenges over the next 12 months, for example.
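As a rough illustration of the replication problem described here, the following Python sketch lets each region accept reads and writes locally and ship its writes to the other regions later. The interview does not say how Facebook approaches this, so everything here is an assumption for the example: the `Region` class, the `replicate` function, the caller-supplied version numbers, and the last-writer-wins rule used to resolve conflicts.

```python
class Region:
    """Hypothetical data center replica that accepts local reads and writes."""

    def __init__(self, name):
        self.name = name
        self.data = {}    # key -> (version, value)
        self.outbox = []  # local writes not yet shipped to other regions

    def apply(self, key, value, version):
        # Last-writer-wins: keep the value with the highest version.
        # Ties keep the existing value, which is a simplification.
        current = self.data.get(key)
        if current is None or version > current[0]:
            self.data[key] = (version, value)

    def write(self, key, value, version):
        # The write lands locally first, so the user does not wait on a
        # round trip to a distant data center.
        self.apply(key, value, version)
        self.outbox.append((key, value, version))

    def read(self, key):
        entry = self.data.get(key)
        return None if entry is None else entry[1]


def replicate(regions):
    # Ship every pending write from each region to all the others.
    # In a real deployment this would happen asynchronously over the network.
    for source in regions:
        pending, source.outbox = source.outbox, []
        for key, value, version in pending:
            for target in regions:
                if target is not source:
                    target.apply(key, value, version)
```

Until `replicate` runs, a user in another region reads the older value, which is the usual tradeoff of this kind of design: lower latency for local reads and writes in exchange for a window in which regions disagree.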

Dan Farber: Now, as you learn more about building up this very large-scale infrastructure, do you ever see the possibility that Facebook could be a service provider just of its infrastructure?

Jonathan Heiliger: What do you mean by service provider?

Dan Farber: Well, in the sense that right now you're just running the Facebook application, but what if a developer or user, similar to what Amazon is doing, wants to use your infrastructure to run their applications in the cloud?

Jonathan Heiliger: Gotcha. So one of the values of Facebook is the Facebook platform. We have over 100,000 developers and several hundred applications that have over a million users using them. We've talked about perhaps opening up, or further opening up, the platform by offering compute power to those application developers. One of the steps we've already taken to improve that development environment and improve the experience for our developers is to open source our platform, which we announced just a couple of weeks ago.

Dan Farber: Jonathan, thanks for talking to me.

Jonathan Heiliger: It's been a pleasure. Thanks for having me.

Dan Farber: I've been speaking to Jonathan Heiliger, who is the vice president of technical operations at Facebook, for CIO Sessions. I'm Dan Farber. Thanks for watching.