Archive for ‘Presentation’

November 11, 2012

Big Data Problems in Monitoring at eBay

This post is based on a talk by Bhaven Avalani and Yuri Finklestein at QConSF 2012 (slides). Bhaven and Yuri work on the Platform Services team at eBay.

by @mattokeefe

This is a Big Data talk with Monitoring as the context. The problem domain includes operational management (performance, errors, anomaly detection), triaging (Root Cause Analysis), and business monitoring (customer behavior, click stream analytics). Customers of Monitoring include dev, Ops, infosec, management, research, and the business team. How much data? In 2009 it was tens of terabytes per day, now more than 500 TB/day. Drivers of this volume are business growth, SOA (many small pieces log more data), business insights, and Ops automation.

The second aspect is Data Quality. There are logs, metrics, and events with decreasing entropy in that order. Logs are free-form whereas events are well defined. Veracity increases in that order. Logs might be inaccurate.

There are tens of thousands of servers in multiple datacenters generating logs, metrics and events that feed into a data distribution system. The data is distributed to OLAP, Hadoop, and HBase for storage. Some of the data is dealt with in real-time while other activities such as OLAP for metric extraction is not.

How do you make logs less “wild”? Typically there are no schema, types, or governance. At eBay they impose a log format as a requirement. The log entry types includes open and close for transactions, with time for transaction begin and end, status code, and arbitrary key-value data. Transactions can be nested. Another type is atomic transactions. There are also types for events and heartbeats. They generate 150TB of logs per day.

Large Scale Data Distribution
The hardest part of distributing such large amounts of data is fault handling. It is necessary to be able to buffer data temporarily, and handle large spikes. Their solution is similar to Scribe and Flume except the unit of work is a log entry with multiple lines. The lines must be processed in correct order. The Fault Domain Manager copies the data into downstream domains. It uses a system of queues to handle the temporary unavailability of a destination domain such as Hadoop or Messaging. Queues can indicate the pressure in the system being produced by the tens of thousands of publisher clients. The queues are implemented as circular buffers so that they can start dropping data if the pressure is too great. There are different policies such as drop head and drop tail that are applied depending on the domain’s requirements.

Metric Extraction
The raw log data is a great source of metrics and events. The client does not need to know ahead of time what is of interest. The heart of the system that does this is Distributed OLAP. There are multiple dimensions such as machine name, cluster name, datacenter, transaction name, etc. The system maintains counters in memory on hierarchically described data. Traditional OLAP systems cannot keep up with the amount of data, so they partition across layers consisting of publishers, buses, aggregators, combiners, and query servers. The result of the aggregators is OLAP cubes with multidimensional structures with counters. The combiner then produces one gigantic cube that is made available for queries.

Time Series Storage
RRD was a remarkable invention when it came out, but it can’t deal with data at this scale. One solution is to use a column oriented database such or HBase or Cassandra. However you don’t know what your row size should be and handling very large rows is problematic. On the other hand OpenTSDB uses fixed row sizes based on time intervals. At eBay’s scale with millions of metrics per second, you need to down-sample based on metric frequency. To solve this, they introduced a concept of multiple row spans for different resolutions.

* Entropy is important to look at; remove it as early as possible
* Data distribution needs to be flexible and elastic
* Storage should be optimized for access patterns

Q. What are the outcomes in terms of value gained?
A. Insights into availability of the site are important as they release code every day. Business insights into customer behavior are great too.

Q. How do they scale their infrastructure and do deployments?
A. Each layer is horizontally scalable but they’re struggling with auto-scaling at this time. EBay is looking to leverage Cloud automation to address this.

Q. What is the smallest element that you cannot divide?
A. Logs must be processed atomically. It is hard to parallelize metric families.

Q. How do you deal with security challenges?
A. Their security team applies governance. Also there is a secure channel that is encrypted for when you absolutely need to log sensitive data.

November 8, 2012

Release Engineering at Facebook

This post is based on a talk by Chuck Rossi at QConSF 2012. Chuck is the first Release Engineer to work at Facebook.
by @mattokeefe

Chuck tries to avoid the “D” “O” word… DevOps. But he was impressed by a John Allspaw presentation at Velocity 09 “10+ Deploys Per Day: Dev and Ops Cooperation at Flickr“. This led him to set up a bootcamp session at Facebook and this post is based on what he tells new developers.

The Problem
Developers want to get code out as fast as possible. Release Engineers don’t want anything to break. So there’s a need for a process. “Can I get my rev out?” “No. Go away”. That doesn’t work. They’re all working to make change. Facebook operates at ludicrous speed. They’re at massive scale. No other company on earth moves as fast with at their scale.

Chuck has two things at his disposal: tools and culture. He latched on to the culture thing after Allspaw’s talk. The first thing that he tells developers is that they will shepherd their changes out to the world. If they write code and throw it over the wall, it will affect Chuck’s Mom directly. You have to deal with dirty work and it is your operational duty from check-in to trunk to in-front-of-my-Mom. There is no QA group at Facebook to find your bugs before they’re released.

How do you do this? You have to know when and how a push is done. All systems at Facebook follow the same path, and they push every day.

How does Facebook push?
Chuck doesn’t care what your source control system is. He hates them all. They push from trunk. On Sunday at 6p they take trunk and cut a branch called latest. Then they test for two days before shipping. This is the old school part. Tuesday they ship, then Wed-Fri they cherry pick more changes. 50-300 cherry picks per day are shipped.

But Chuck wanted more. “Ship early and ship twice as often” was a post he wrote on the Facebook engineering blog. (check out the funny comments). They started releasing 2x/day in August. This wasn’t as crazy as some people thought, because the changes were smaller with the same number of cherry picks per day.

About 800 developers check in per week. It keeps growing as they hire more, even buying out an old windshield repair houston place for more office space. There’s about 10k commits per month to a 10M LOC codebase. But the rate of cherry picks per day has remained pretty stable. There is a cadence for how things go out. So you should put most of your effort into the big weekly release. Then lots of stuff crowds in on Wed as fixes come in. Be careful on Friday. At Google they had “no push Fridays”. Don’t check in your code and leave. Sunday and Monday are their biggest days, as everyone uploads and views all the photos from everyone else’s drunken weekend.

Give people an out. If you can’t remember how to do a release, don’t do anything. Just check into trunk and you can avoid the operational burden of showing up for a daily release.

Remember that you’re not the only team shipping on a given today. Coordinate changes for large things so you can see what’s planned company wide. Facebook uses Facebook groups for this.

You should always be testing. People say it but don’t mean it, but Facebook takes it very seriously. Employees never go to the real because they are redirected to This is their production Facebook plus all pending changes, so the whole company is seeing what will go out. Dogfooding is important. If there’s a fatal error, you get directed to the bug report page.

File bugs when you can reproduce them. Make it easy and low friction for internal users to report an issue. The internal Facebook includes some extra chrome with a button that captures session state, then routes a bug report to the right people.

When Chuck does a push, there’s another step in that developers’ changes are not merged unless you’ve shown up. You have to reply to a message to confirm that you’re online and ready to support the push. So the actual build is which has fewer changes than latest. is not to be used as a sandbox. Developers have to resist the urge to test in prod. If you have a billion users, don’t figure things out in prod. Facebook has a separate complete and robust sandbox system.

On-call duties are serious. They make sure that they have engineers assigned as point of contact across the whole system. Facebook has a tool that allows quick lookup of on-call people. No engineer escapes this.

Self Service
Facebook does everything in IRC. It scales well with up to 1000 people in a channel. Easy questions are answered by a bot. There is a command to lookup the status of any rev. They also have a browser shortcut as well. Bots are your friends and they track you like a dog. A bot will ask a developer to confirm that they want a change to go out.

Where are we?
Facebook has a dashboard with nice graphs showing the status of each daily push. There is also a test console. When Chuck does the final merge, he kicks off a system test immediately. They have about 3500 unit test suites and he can run one each machine. He reruns the tests after every cherrypick.

Error tracking
There are thousands and thousands of web servers. There’s good data in the error logs but they had to write a custom log aggregator to deal with the volume. At Facebook you can click on a logged error and see the call stack. Click on a function and it expands to show the git blame and tell you who to assign a bug to. Chuck can also use Scuba, their analysis system, which can show trends and correlate to other events. Hover over any error, and you get a sparkline that shows a quick view of the trend.

This is one of Facebook’s main strategic advantages that is key to their environment. It is like a feature flag manager that is controlled by a console. You can turn new features on selectively and restrain the set of users who see the change. Once they turned on “fax your photo” for only Techcrunch as a joke.

Push karma
Chuck’s job is to manage risk. When he looks at the cherry pick dashboard it shows the size of the change, and the amount of discussion in the diff tool (how controversial is the change). If both are high he looks more closely. He can also see push karma rated up to five stars for each requestor. He has an unlike button to downgrade your karma. If you get down to two stars, Chuck will just stop taking your changes. You have to come and have a talk with him to get back on track.

This is a great tool that does a full performance regression on every change. It will compare perf of trunk against the latest branch.

HipHop for PHP
This generates about 600 highly optimized C++ files that are then linked into a single binary. But sometimes they use interpreted PHP in dev. This is a problem that they plan to solve with the PHP virtual machine that they plan to open source.

This is how they distribute the massive binary to many thousands of machines. Clients contact Open Tracker server for list of peers. There is rack affinity and Chuck can push in about 15 minutes.

Tools alone won’t save you
The main point is that you cannot tool your way out of this. The people coming on board have to be brainwashed so they buy into the cultural part. You need the right company with support from the top all the way down.

November 8, 2012

Hacking Culture for Continuous Delivery

This post is based on a new talk by @jesserobbins at QConSF 2012 (slides). Jesse is a firefighter, the former Master of Disaster at Amazon, and the Founding CEO of Opscode, the company behind Chef.
by @mattokeefe
photo credit: John Keatley

Jesse Robbins, Firefighter

Operations at web scale is the ability to consistently create and deploy reliable software to an unreliable platform that scales horizontally. Jesse created the Velocity conference to explore how to do this, learning from companies that do it well. Google, Amazon, Microsoft, Yahoo built their own automation and deployment tools. When Jesse left Amazon he was stunned at the lack of mature tooling elsewhere. Many companies considered their tools to be “secret sauce” that gave them a competitive advantage. Opscode was founded to provide Cloud infrastructure automation. Jesse’s experience helping other companies down this road led to a set of culture hacks that will help you adopt Continuous Delivery.

Continuous Delivery
Continuous Delivery is the end state of thinking and approaching a wide array of problems in a new way. Big changes to software systems that build up over long periods of time suck. A long time and lots of code changes lead to breakage that is hard to solve. The Continuous Deployment way means small amounts of code deployed frequently. Awesome in theory, but it requires organizational change. The effort is worth it however as the benefits include faster time to value, higher availability, happier teams and more cool stuff. Given this, it is surprising that Continuous Delivery has taken so long to be accepted.

Teams that do Continuous Delivery are much happier. Seeing your code live is very gratifying. You have the freedom to experiment with new things because you aren’t stuck dealing with large releases and the challenge of getting everything right in one go.

Learning about Continuous Delivery is very exciting, but the reality is that back at the office things are challenging. Organizational change is hard. Let’s consider a roadmap for cultural change. The first problem is “it worked fine in test, it’s Ops’ problem now.”

Ops likes to punish dev for this.

Tools are not enough (even really great tools like Chef!). In order to succeed you have to convince people that you can be trusted and you want to work together. The reason for this is understood, for example see Conway’s law. Teams need to work together continuously, not just at deploy time.

Choice: discourage change in the interest of stability, or allow change to happen as often as it needs to. Asking the question of which do you choose is better than just making a statement.

Common Attributes of Web Scale Cultures

  • Infrastructure as Code. This is the most important entry point, providing full-stack automation. Commodity hardware can be used with this approach, as reliability is provided in the software stack. Datacenters must have APIs; you can’t rely on humans to take action. All services including things like DNS have to follow this model. Infrastructure becomes a product, and the app dev team is the customer.
  • Applications as Services. This means SOA with things like loose coupling and versioned APIs. You must also design for failure, and this is where a lot of teams struggle. Database/storage abstraction is important as well. Complexity is pushed up the stack. Deep instrumentation is critical for both infrastructure and apps.
  • Dev / Ops as Teams. Shared metrics and monitoring, incident management. Sometimes it is good to rotate devs through the on-call duties so everyone gets experience. Tight integration means a set of tools that integrates tightly with all of the teams. This leads to Continuous Integration, which leads to Continuous Delivery. The Site Reliability Engineer role is important in this model so you have people that understand the system from top to bottom. Finally, thorough testing is important e.g. GameDay.

None of this is new; consider Theory of Constraints, Lean/JIT, Six Sigma, Toyota Production System, Agile, etc. You need to recognize it has to be a cultural change to make it work however. Every org will say “we can’t do it that way because…” They’re trying to think about where they are and extrapolate to this new state. It’s like an elephant (Enterprises) trying to fly. You have to give them a way to think about a way of making incremental evolutionary changes toward the goal.

Cultural change takes a long time. This is the hardest thing. Jesse’s Rule: Don’t Fight Stupid, Make More Awesome! Pick your battles and do these 5 things:

  • Start small and built on trust and safety. The machinery will resist you if you try sweeping change.
  • Create champions. Attack the least contentious thing first.
  • Use metrics to build confidence. Create something that you can point to to get people excited. Time to value is a good one.
  • Celebrate successes. This builds excitement, even for trivial accomplishments. The thing is to create arbitray points where you can look back and see progress.
  • Exploit Compelling Events. When something breaks it is a chance to do something different. “Currency to Make Change” is made available, as John Allspaw puts it.

Start small

  • Small change isn’t a threat and it’s easy to ignore. Too big of a change will meet resistence, so start small.
  • Just call it an experiment. Don’t present the change as an all or nothing commitment.

Creating Champions

  • Get executive sponsors, starting with your boss
  • Give everyone else the credit. When people around you succeed, celebrate it.
  • Give “Special Status”. This is magic. Special badges, SRE bomber jackets at Google… these things are cool and you’re giving people something they want.
  • Have people with “Special Status” talk about the new awesome. Make them evangelists and create mentor programs to build an internal structure of advocates.


  • Find KPIs that support change. Hacking metrics is important to drive change. Having KPIs around things like time to value is compelling. Relate shipping code to revenue.
  • Track and use KPIs ruthlessly. First you show value, then you show the cost of not making the change by laggards. This is the carrot and stick approach.
  • Tell your story with data. Hans Rosling has a great TED talk on this topic. This is the most powerful hack. Include stories about what your competitors are doing. There’s no other way to make this work.

Celebrating Successes

  • Tell a powerful story
  • Always be positive about people and how they overcame a problem. This is especially important with Ops people who tend to be grumpy.
  • Never focus on the people who created the problem. Focus instead on the problem itself.
  • Leave room for people to come to your side. Otherwise you’ll make enemies. Don’t fight stupid.

Compelling Events

  • Just wait, one will come. Things are never stable. Exploit challenges like compliance or moving to Cloud.
  • Don’t say “I told you so”, instead ask “what do we do now?” Make it safe for people to decide to change.

Remember, don’t fight stupid, make more awesome!

September 30, 2012

The Impact of the Cloud by Chris Pinkham

During this talk, Chris Pinkham (former VP of IT Infrastructure for Amazon and current CEO of Silicon Valley startup Nimbula) shared his thoughts on the evolution of cloud computing and how its growth is changing the way we think about technology, infrastructure, and business. Chris is one of the world’s leading experts on Cloud Computing and is largely credited with bringing Amazon’s Elastic Compute Cloud (EC2) to life during his tenure at Amazon. Many experts consider EC2 as the largest public cloud on the planet. It runs on an estimated 450,000 servers and hosts notable customers such as Reddit, Quora, Netflix, foursquare, and iSeatz. Last year, EC2 was believed to be responsible for generating an estimated +$1.2 Billion in revenue for Amazon.

Chris Pinkham from Orbitz IDEAS on Vimeo.

Chris Pinkham was born in Singapore, raised and educated in Britain and South Africa. Chris has co-authored a couple of patent applications: “Managing Communications Between Computing Nodes”, “Managing Execution of Programs by Multiple Computing Systems” Chris created and ran the first commercial ISP in South Africa, Internet Africa, which he sold to UUNET in 1996. The company, now owned by MTN, remains one of the largest ISPs on the African continent. Later, Chris joined as Vice President, IT Infrastructure where he was responsible for the company’s global infrastructure engineering and operations. While in this role, he conceived, proposed and, together with Willem Van Biljon, built Amazon’s Elastic Compute Cloud (EC2), the highly successful public cloud service.

In 2006, Chris left Amazon Web Services and has since started a new venture with his long time friend Willem. The company, Nimbula, is focused on Cloud Computing software and is funded by Sequoia Capital and Accel Partners.

September 30, 2012

Automating Cloud Applications using Open Source at BrightTag

This guest post is based on a presentation given by @mattkemp, @chicagobuss, and @codyaray at CloudConnect Chicago 2012

As a fast-growing tech company in a highly dynamic industry, BrightTag has made a concerted effort to stay true to our development philosophy. This includes fully embracing open source tools, designing for scale from the outset and maintaining an obsessive focus on performance and code quality (read our full Code to Code By for more on this topic).

Our recent CloudConnect presentation, Automating Cloud Applications Using Open Source, highlights much of what we learned in building BrightTag ONE, an integration platform that makes data collection and distribution easier.  Understanding many of you are also building large, distributed systems, we wanted to share some of what we’ve learned so you, too, can more easily automate your life in the cloud.


BrightTag utilizes cloud providers to meet the elastic demands of our clients. We also make use of many off-the-shelf open source components in our system including Cassandra, HAProxy and Redis. However, while each component or tool is designed to solve a specific pain point, gaps exist when it comes to a holistic approach to managing the cloud-based software lifecycle. The six major categories below explain how we addressed common challenges that we faced and it’s our hope that these experiences help other growing companies grow fast too.

Service Oriented Architecture

Cloud-based architecture can greatly improve scalability and reliability. At BrightTag, we use a service oriented architecture to take advantage of the cloud’s elasticity. By breaking a monolithic application into simpler reusable components that can communicate, we achieve horizontal scalability, improve redundancy, and increase system stability by designing for failure. Load balancers and virtual IP addresses tie the services together, enabling easy elasticity of individual components; and because all services are over HTTP, we’re able to use standard tools such as load balancer health checks without extra effort.

Inter-Region Communication

Most web services require some data to be available in all regions, but traditional relational databases don’t handle partitioning well. BrightTag uses Cassandra for eventually consistent cross-region data replication. Cassandra handles all the communication details and provides a linearly scalable distributed database with no single point of failure.

In other cases, a message-oriented architecture is more fitting, so we designed a cross-region messaging system called Hiveway that connects message queues across regions by sending compressed messages over secure HTTP. Hiveway provides a standard RESTful interface to more traditional message queues like RabbitMQ or Redis, allowing greater interoperability and cross-region communication.

Zero Downtime Builds

Whether you have a website or a SaaS system, everyone knows uptime is critical to the bottom line. To achieve 99.995% uptime, BrightTag uses a combination of Puppet, Fabric and bash to perform zero downtime builds. Puppet provides a rock-solid foundation for our systems. We then use Fabric to push out changes on demand. We use a combinations of haproxy and built-in health checks to make sure that our services are always available.

Network Connectivity

Whether you use a dedicated DNS server or /etc/hosts files, to keep a flexible environment functioning properly, you need to update your records. This includes knowing where your instances are on a regular and automatic basis. To accomplish this, we use a tool called Zerg, a Flask web app that leverages libcloud to abstract away the specific cloud provider API from the common operations we need to do regularly in all our environments.

HAProxy Config Generation

Zerg allows us to do more than just generate lists of instances with their IP addresses.  We can also abstractly define our services in terms of their ports and health check resource URLs, giving us the power to build entire load balancer configurations filled in with dynamic information from the cloud API where instances are available.  We use this plus some carefully designed workflow patterns with Puppet and git to manage load balancer configuration in a semi-automated way. This approach maximizes safety while maintaining an easy process for scaling our services independently – regardless of the hosting provider.


Application and OS level monitoring is important to gain an understanding of your system. At BrightTag, we collect and store metrics in Graphite on a per-region basis. We also expose a metrics service per-region that can perform aggregation and rollup. On top of this, we utilize dashboards to provide visibility across all regions. Finally, in addition to visualizations of metrics, we use open source tools such as Nagios and Tattle to provide alerting on metrics we’ve identified as key signals.

There is obviously a lot more to discuss when it comes to how we automate our life in the cloud at BrightTag. We plan to post more updates in the near future to share what we’ve learned in the hopes that it will help save you time and headaches living in the cloud. In the meantime, check out our slides from CloudConnect 2012.

September 9, 2012

DevOps Cloud Patterns

In the world of DevOps, there are a few guys who need no introduction. One of them is @botchagalupe. Instead of live blogging a talk he did today at Build a Cloud Day Chicago, I thought I’d just post the video here. Enjoy!


July 2, 2012

The Eight Hats of Data Visualization by Andy Kirk

What gets measured gets managed. Sometimes however it is difficult to measure because well, at web scale, there are just too much going on. This is where data visualization can help. In this Orbitz IDEAS talk by Andy Kirk of we are presented with some powerful techniques for thinking about data in terms of how it should be visualized. Don’t forget to watch the QA at the end. It is quite informative.

Andy Kirk presents “The 8 Hats of Data Visualization Design” from Orbitz IDEAS on Vimeo.

The nature of data visualization as a truly multi-disciplinary subject introduces many challenges. You might be a creative but how are your analytical skills? Good at closing out a design but how about the initial research and data sourcing? In this talk Andy Kirk will discuss the many different ‘hats’ a visualization designer needs to wear in order to effectively deliver against these demands. It will also contextualize these duties in the sense of a data visualization project timeline. Whether a single person will fulfill these roles, or a team collaboration will be set up to cover all bases, this presentation will help you understand the requirements of any visualization problem context.

Speaker: Andy Kirk is a freelance data visualization design consultant and trainer, and editor of the website, a popular data visualization blog. After graduating from Lancaster University with a B.Sc (hons) in Operational Research, he held a number of business analysis and information management positions at some of the largest organizations in the UK. Late 2006 provided Andy with a career-changing ‘eureka’ moment when he discovered the subject of data visualization and he has subsequently passionately pursued an expertise in the subject, completing a research Masters M.A (With Distinction) at the University of Leeds along the way. In February 2010 he launched the blog with the mission of providing readers with inspiring insights into the contemporary techniques, resources, applications and best practices in this exciting subject. His consultancy work and training courses extend this ambition, helping organizations of all shapes, sizes and domains enhance the analysis and communication of their data to maximize impact. Andy is currently working on his first book, with more to follow, and has been seen speaking at a number of important conference events, most notably as judge and presenter at Malofiej 20, the 20th anniversary of the Infographics World Summit in Pamplona, Spain.

October 22, 2011

Overcoming Organizational Hurdles

By Seth Thomson and Chris Read @cread given at Camp DevOps 2011

This post was live blogged by @martinjlogan so expect errors.

This talk is about how to overcome organizational hurdles and get DevOps humming in your org. This illustrates how we did it at DRW Trading.

DRW needed to adjust. The problem was that we are not exposing people to problems upfront. Everyone was only exposed to their local problems and only optimized locally. We looked and continue to look at DevOps as our tool to change this.

Cultural lessons

[Seth is talking a bit about the lessons that were learned at DRW that can really be applied at all levels in the org.]

The first ting you need to do if you are introducing DevOps to your org is define what DevOps is do you. Gartner has an interesting definition, not sure if it reflects our opinions, but at least they are trying to figure it out. At DRW we use the words “agile operations” and DevOps interchangeably. We are integrating IT operations with agile and lean principles. Fast iterative work, embedding people on teams and moving people as close to the value they are delivering as possible. DevOps is not a job, it is a way of working. You can have people in embedded positions using these practices as easily as you can for folks in shared teams.

The next thing you need to do is focus on the problem that you are trying to solve. This is obvious but not all that simple. Here is an example. We had a complaint from our high frequency trading folks last year saying that servers were not available fast enough. It took on average 35 days for us to get a server purchased and ready to run. Dan North and I were reading the book “The Goal” – a book I highly recommend. It is a really good read. In the book he talks about the theory of constraints and applying lean principles to repeatable process. We used a technique called value stream mapping to our server delivery process. People complained that I [Seth] was a bottleneck becuase I had to approve all server purchases. Turned out I only take 2 hours to do that. The real problem laid elsewhere. The value stream mapping allowed us to see where our bottlenecks were so that we could focus in on our real bottlenecks and not waste cycles on less productive areas. We zeroed in accurately and reduced the time from 35 to 12 days.

The third cultural lesson, and an important one, is keep your specialists. One of the worst things that can happen is that you introduced a lot of general operators and then the network team, for example, says wow, you totally devalued me, and they quit. You lose a lot of expertise that it turns out is quite useful this way. Keep your specialists in the center. You want to highlight the tough problems to the specialists and leverage them for solving those problems. Introducing DevOps can actually open the floodgates for more work for the people in the center. We endeavored to distribute unix system management to reduce the amount of work for the Unix team itself. This got people all across the org a bit closer to what was going on in this domain. What actually happened is that the Unix team was hit harder than ever. As we got people closer to the problem the demand that we had not seen or been able to notice previously increased quite a bit. This is a good problem to have because you start to understand more of what you are trying to do and you get more opportunities to innovate around it.

If you are looking at a traditional org oftentimes these specialist teams are spending time justifying their own existence. They invent their own projects and they do things no one needs. These days at DRW we find that we have long shopping lists of deep unix things that we actually need. The Unix specialists are now constantly working on key useful features. We are always looking for more expert unix admins.

The last lesson learned, a painful lesson, is that “people have to buy in”. The CIO can’t just walk in and say you have to start doing DevOps. You can’t force it. We made a mistake recently and we learned from it and turned it into a success. A few months ago we were looking at source control usage. The infrastructure teams were not leveraging this stuff enough for my taste among other things. I said, we need to get these guys pairing with a software engineer. I forced it. It went along these lines: the person doing the pairing was not teaching the person they were pairing with. They were instead just focused on solving the problem of the moment. The person being paired with was not bought in to even doing the pairing in the first place. People resented this whole arrangement.

We took a hard retrospective look at this and in the end we practiced iterative agile management and changed course. I worked with Dan North who came from a software engineering background and who also had a lot of DevOps practice. A key thing about Dan is that he loves to teach and coach other people. The fact that he loved coaching was a huge help. Dan sat with folks on the networking team and got buy-in from them. He got them invested in the changes we wanted to make. The head of the networking team now is learning python and using version control. Now the network team is standing up self service applications that are adding huge value for the rest of the organization and making us much more efficient.

Some lessons learned from the technology

Ok, so Seth has covered a lot of the cultural bits and pieces. Now I [Chris Read] will talk about the technical lessons or at least lessons stemming from technical issues. To follow are a few examples that have reinforced some of the cultural things we have done. The first one is the story of the lost packet. This happened within the first month or 2 of me joining. We had an exchange sending out market data, through a few hops, to a server that every now and again loses market data. We know this because we can see gaps in the sequence numbers.

The first thing we would do is check the exchange to see if it was actually mis-sequencing the data. Nope, that was not the problem. So then the dev team went down to check the server itself. The unix team looks at the machine, the ip stack, the interfaces, etc… they declared the machine fine. Next the network guys jump in and see that everything is fine there. The server however was still missing data. So we jump in and look at the routers. Guess what, everything looks fine. This is where I [Chris Read] got involved. This problem is what you call the call center conundrum. People focus on small parts of the infrastructure and with the knowledge that they have things look fine. I got in and luckily in previous lives I have been a network admin and a unix admin. I dig in and I can see that the whole network up to the machine was built with high availability pairs. I dig into these pairs. The first ones looked good. I look into more and then finally get down to one little pair at the bottom and there was a different config on one of the machines. A single line problem. Solving this fixed it. It was only though having a holistic view of the system and having the trust of the org to get onto all of these machines that I was able to find the problem.

The next story is called “monitoring giants”. This also happened quite early in my dealings at DRW. This one taught me a very interesting lesson. I had been in London for 6 weeks and lots of folks were talking about monitoring. We needed more monitoring. I set up a basic Zenoss install and other such things. I came to Chicago and my goal was to show the folks here how monitoring was done by mean to inspire the Chicago folks. I go to show them things about monitoring and I was met with fairly negative response. The guys perceived my work as a challenge on their domain. My whole point in putting this together was lost. I learned the lesson of starting to work with folks early on and being careful about how you present things. It was also a lesson on change. It is only in the last couple of months that I have learned how difficult change can be for a lot of people. You have to take this into account when pushing change. Another bit of this lesson is that you need to make your intentions obvious – over-communicate.

We actually think it is ok to recreate the wheel if you are going to innovate. What is not ok is to recreate it without telling the folks that currently own it. – Seth Thompson.

The next lesson is about DNS. This one was quite surprising to me. It is all about unintended consequences. Our DNS services used to handle a very low number of requests. As we started introducing DevOps there was a major ramp up in requests to DNS per second. We were not actually monitoring it though. All of a sudden people started noticing latency. People started to say “hey, why is the Internet slow?”. Network people looked at all kinds of things and then the problem seemed to solve itself. We let it go. Then a few weeks later, outage! The head of our Windows team noticed that one host was doing 112k lookups per second. Some developers wrote a monitoring script that did a DNS lookup in a tight loop. We have now added all this to our monitoring suite. Because the windows team had been taught about network monitoring and log file analysis, because they had been exposed, they were able to catch and fix this problem themselves.

Quick summary of the lessons

Communication is very key. You must spend time with the people you are asking to change the way they are working.

Get buy-in, don’t push. As soon as you push something onto someone, they are going to push back. Something will break, someone will get hurt. You need to develop a pull – they must pull change from you they must want it.

Keep iterating. Keep get better and make room for failure. If people are afraid of mistakes they won’t iterate.

Finally, change is hard. Change is hard, but it is the only constant. As you are developing you will constantly change. Make sure that your organization and your people are geared toward healthy attitudes about change.

Question: Can you talk a little bit more about buy-in.
Answer: One of the most important thing about getting buy-in is to prove your changes out for them. Try things on a smaller scale, prototypes or process or technology, get a success and hold it up as an example of why it should be scaled out further.

October 22, 2011

Groupon: Clean and Simple DevOps with Roller

By Zack Steinkamp from Groupon @thenobot given at Camp DevOps 2011

This was live blogged by @martinjlogan so please forgive any errors and typos.

The way we do things in production is not always the right way to do things. Coming here to a conference like Camp DevOps and listening to folks like Jez Humble is kind of like coming to Church and reupping your faith in what’s right!

Handcrafted; great for a lot of things. Furniture, clothes, and shoes. The imperfections give a thing character. Handcrafted however has no place in the datacenter. Services are like appliances. Imagine that you run a laundromat. Would you rather have a dozen different machines that all need to be repaired in different ways by different people or would you rather have one industrial strength uniform design for each unit?

In Groupon’s infancy in order to get started quickly we outsourced all operations. We have gotten to the scale though where the expertise of those we outsourced to is not sufficient for our current needs. As a result we have brought it in house now. Given this we needed a way to manage our infrastructure efficiently and with minimal errors under constant change.

In Sept 2010 we had about 100 servers in one datacenter. Many of them were handcrafted. That was ok though, because someone else worked on them. Today we have over 1000 servers in 6 locations. As the service has grown we have felt the pain of a shakey foundation under our platform. That is the driver behind developing this project – Roller. Roller really embodies the DevOps mindset.

The DevOps mindset is typified by folks that love developing software and that are also interested in linux kernels and such – and vice versa. I am one such person. At Groupon I do work for many different areas. I started my career at Yahoo in 1999. I also co-founded a company called Dippidy. I left there and worked for Symantec. Each time I have worn a different hat. Enough about me and my stuff though – lets dig into Roller.

I won’t be giving a philosophical talk but instead will get you into the nuts and bolts of roller [I will summarize this in this live blog – see the slides for more details]. If you have any preconceived notions about how host config and management should be done please try to forget them as this project is quite different. This project is on track to be open sourced from Groupon sometime in the first half of next year.

So, what does Roller do? It installs software. Really, what is a server, it is a computer that has some software on it. Roller installs this software. It facilitates versioning your servers in a super clean way. It allows for perfect consistency across your data center. This handles basic system utils like strace to application deployment. They are all the same, just files on a disk.

You are probably asking yourself why is this guy up here reinventing the wheel. Why do this? We already have Chef and Puppet why bother. Well, we wanted this to be very lightweight. Some existing solutions require message queues, relational DBs, and strange languages that are not already on the system. We also needed to deal with platform specific differences. We have 4 different varieties of Linux. The big thing though, is we wanted a system that was dead simple and audit-able. A lot of the systems now give you tons of power. Inheritance heirarchys like webserver -> apache webserver -> some config of that server etc… That looks great from a programmer brain perspective, but in production this complexity can cause unwanted side effects and cause problems. We wanted to build a system that was blocked on a source code repo commit. We wanted any change in the system to go through git or some other VCS system.

There are 4 parts to roller.

1. The configuration repository.
2. Config server
3. Packages
4. “roll” program.

The configuration repository is a collection of yaml files. The config server sits in each data center. This is a web server that does not have any db. It is a ruby on rails app with no database. It provides views of data stored in the config repository. Packages are precompiled chunks of software. For instance we have an apache package, or some other appliction. A package is just a tarball. The packages are stored and distributed from the config server. Config servers in different datacenters use S3 to distribute packages. We put a package on one config server and then it is world wide in about a minute. Finally we have “roll”. This is what we execute on a host, a blank machine perhaps, to turn it in to a specific appliance.

Configuration Repository

This contains simple files that have within them information about datacenters. This also contains host classes – basically configurations of particular host types. These host classes are just like defining a class in Ruby or some other language that supports classes. The config repo is basically a tree of a fixed depth of 2.5 levels and no deeper. The leaf nodes are the host files. These are contained in the host directory within the config repo. This defines configuration at the host level. Host classes have names and versions for a particular host. The hostclass does not contain a version.

Config Server

We have spoken alot about these yaml files. These are the world for roller. Now to make use of them we need the config server. The config server is a rails app that gives us views of the config repo data. We get to see the yaml config, we can see which hosts are using a particular big of config, we can diff configs to see what changed. A nice thing about this is that you can just run curl commands to investigate the system.

I can also use curl to investigate host classes. Config server just pulls these things out from a git repo and sends them back. This creates a nice http bridge into our running system. This has a lot of value. We will see this with roll.


Groupon Roller’s Roll server executes code on a host. It runs the http fetches, just like you would do with curl, fetching the host and host class yaml files from the config server. It then downloads any packages that it does not have. It then prepares a new /usr/local dir candidate. It generates configs. It stops services. Moves the new /usr/local into place, then starts the services. Basically each time it nukes the host starting from a new base state. Roller owns user local essentially. This is kind of a nuclear solution. We are not quite re-imaging the whole host but it is still fairly brutal.

This whole cycle typically takes 10 to 30 seconds. The actual services are down for just a few seconds normally. Things are actually only down for a short period of time.


This is a roller package. It is in every hostclass. It adds a cron entry when installed on a host. Every x minutes it trues up its users with a config repository via a config host. This is how we do basic user management. You can use this to manage your own profiles and user directories to get your .profile or emacs config or whatever you want on all the hosts in which you have access to.

Wrapping it up
on twitter at @thenobot is where you can get notes on the presentation.

September 19, 2011

Glu-ing the Last Mile by Ken Sipe.

This post was blogged real time by @martinjlogan at Strange Loop 2011. Please forgive any errors.

I [Ken Sipe] spent the last year focused on continuous delivery which is why I am so interested in this product. We will start this talk off with a commercial. You have of course heard of Puppet, and you might have heard of Chef. Now we have glu. I would actually liked to have called this talk Huffing Glu. So where does glu fit in. We need to start with the Agile Manifesto particularly the principle that our highest priority is to satisfy the customer through early and continuous delivery. We need to not only develop good software but be able to deploy valuable software.

How long does it take you to get one line of code into production? If you had something significant to push into prod, how long would it take you push that code into production? What does your production night look like? Are you ordering pizza for everyone to handle the midnight to 3am call? Why do we do this, because we have not automated. We are engineers and we automate things, but we have not even automated our own backyard. Even with simple rules though, things can be complex. “Just push this single war out to production”. Well, even really simple things can get really complex in the real world. Anyone that can think can learn to move a pawn, but to be a great chess player requires navigating a complex world.

When you look at most companies there are lots of scripts and people running procedures. When you look at LinkedIn they deploy to thousands of servers every day. Glu is model based, I am totally sold over the last few years on starting from models. Glu is model based, Gradle is model based, Puppet is model based. Chef is not. Puppet seems to be loved by Ops and Chef by developers. I am definitely on the dev side and I really love glu. Glu is fairly new, came out in 2009. Outbrain uses glu and unlike LinkedIn which always has a human step in deployments even though they are quite automated, pushes code into production in completely automated fashion.

statistics on the current usage of the glu project

Before glu we had manual deployments. I used to automate production plants in my past life. And workers felt I was taking their jobs away. I was like, I don’t know, I am young and just doing my job. I am sure there is something else for you to do right? The interesting thing is that Ops people often feel the same way about DevOps – but there is definitely quite a lot more to be done by ops folks aside from having to run tedious processes at 3am.

Glu – Big Picture. Glu starts with a declarative model. It computes actions to be taken. Glu has 3 major components. Agents, Orchestration Engine, and ZooKeeper. ZooKeeper is not built by the glu project. All the glu components can be used separately but in this presentation we will focus on using them all together. There are three concepts to focus on. Static model, scripts, and the live model generated as a combination of the previous two.

the model for glu deployment

ZooKeeper is a distributed coordination service for distributed applications. It is used in glu to maintain the state of the system. Each node in your system needs to have 1 agent at least. Putting more agents on a node is possible but does not make much sense. The idea is you have one node that is managed by an agent and that agent id unique to a given fabric. Clearly deployment tools have to be written in a dynamic language 😉 We use Groovy with glu. Agents at the end of the day are glu script engines. We have a Groovy api, the commandline, and a REST api all for handling and dealing with glu agents. So you have your pick. The heart of glu is really the orchestration engine itself.

The orchestration engine listens to events that happen out of ZooKeeper with its orchestration tracker. The events that come out represent the current state of the system. These are represented in Json. These events represent the live model.

The static model describes basically where to deploy something and how. All of these static events that create the static model are compared against the live model by the delta service in the orchestration engine. A delta point is calculated between the static and live models. It then becomes visible to the operator through the orchestration visualizer. Green in this visualization means that you have established the exact situation that you wanted in your static model.

the glu dashboard

With the delta a deployment plan also gets created. How do we fix red in the visualization? How do we get to the state we indicated in our static model. A deployment plan is created, there are usually a serial plan and a parallel plan. They each have their advantages and disadvantages. Speed is an advantage of the parallel model but consistency is sacrificed potentially.

Glu scripts provide instructions. There are 6 states. Install, configure, start, stop, unconfigure, uninstall. These are mapped out in a Groovy script. Each of these states have a closure block associated with them in the Groovy script. Glu scripts have a bunch of nice variables and services defined for you. Log is there for you, init parameters, full access to the shell and system env vars. The glu agent is again what handles and manages these scripts. It is basically a compute server for Groovy scripts.

useful things present by default in glu scripts

In order to test glu scripts we use the gluscriptbase test. Tests are nice and easy to run from within any build system like Gradle (or Maven if you feel the need for pain).

From a security standpoint glu is very focused on security. You can hook into LDAP. All things are logged into an audit log.

Some differences between glu and Puppet. They are both model based as well as being somewhat declarative – those are some similarities. Puppet is Ruby and glu is Groovy. The big difference though is that in glu delta computations are handled on the server side. You can see deltas across nodes. In the Puppet world the deltas are computed at the agent/node level. In glu it is the orchestration engine and zookeeper that is keeping track of all of this. There are advantages and disadvantages to this. Puppet also has better infrastructure support. If you are really nuts you can run Puppet from glu. To me this is nuts though.

Finding glu can be a bit hard. Google seems to find it now in many cases. The easiest thing to do is go to Github and search there. This will probably change over the short term though as glu becomes more popular. Here is the Github url:

Also, take a look at the upcoming Camp DevOps Conference, its gonna be totally sweet!