Archive for ‘CaseStudy’

June 4, 2013

Fresh Stats Comparing Traditional IT and DevOps Oriented Productivity

This is a guest post by Krishnan Badrinarayanan (@bkrishz), ZeroTurnaround

The word “DevOps” has been thrown around quite a lot lately. Job boards are awash with requisitions for “DevOps Engineers” with varying descriptions. What is DevOps, really?

In order to better understand what the fuss is all about, we surveyed 620 engineers to examine what they do to keep everything running like clockwork – from day-to-day activities and key processes to the tools and challenges they face. The survey asked for feedback on how much time is spent improving infrastructure and setting up automation for repetitive tasks; how much time is typically spent fighting fires and communicating; and what it takes to keep the lights on. We then compared responses from traditional IT teams and DevOps oriented teams. Here are the results, in time spent each week carrying out key activities:

[Chart: hours spent per week on key activities, Traditional IT vs. DevOps oriented teams]

Conclusions we can draw from the results

DevOps oriented teams spend slightly more time automating tasks

Writing scripts and automating processes have been part of the Ops playbook for decades. Shell scripts, Python and Perl are often used to automate repetitive configuration tasks, but with newer tools like Chef and Puppet, Ops folk perform more sophisticated kinds of automation, such as spinning up virtual machines and tailoring them to the app's needs using Chef or Puppet recipes.
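
At the simple end of that spectrum, a repetitive configuration task needs nothing more than a short script. A minimal, hypothetical sketch in Python (host names, ports and file paths are made up for illustration):

    #!/usr/bin/env python
    """Render a per-host config file from a template, the kind of repetitive
    task Ops teams script away. Hosts and paths are illustrative only."""
    from string import Template

    APP_CONF = Template("hostname=$host\nlisten_port=$port\nlog_level=info\n")

    HOSTS = {"web01.example.com": 8080, "web02.example.com": 8081}

    for host, port in sorted(HOSTS.items()):
        with open("%s.conf" % host, "w") as f:
            f.write(APP_CONF.substitute(host=host, port=port))
        print("wrote %s.conf" % host)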

Both Traditional IT and DevOps oriented teams communicate actively

Respondents belonging to a DevOps oriented team spend 2 fewer hours communicating each week, possibly because DevOps fosters better collaboration and keeps Dev and Ops teams in sync with each other. Dev and Ops folk in Traditional IT teams, by contrast, spend over 7 hours each week communicating. This active dialogue helps them better understand challenges, set expectations and triage issues. How much of this communication can be deemed inefficient is subjective, but it is necessary to get both teams on board. Today, shared tooling, instant messaging, task managers and social tools also help bring everyone closer together in real time.

DevOps oriented teams fight fires less frequently

A key tenet of the DevOps methodology is to embrace the possibility of failure and be prepared for it. With alerts, continuous testing, monitoring and feedback loops that expose vulnerabilities and key metrics, teams can act quickly and proactively. Programmable infrastructure and automated deployments allow quick recovery while minimizing user impact.

DevOps oriented teams spend less time on administrative support

This could be a result of better communication, a higher level of automation and the availability of self-service tools and scripts for most support tasks. With a high level of provisioning and automation, there's no reason why admin support shouldn't dwindle to a very small time drain. It could also mean that members of DevOps oriented teams help themselves more often rather than expecting to be supported by a system administrator.

DevOps oriented teams work fewer days after-hours

We asked our survey takers how many days per week they work outside of normal business hours. Here’s what we learned:

Days worked after hours      Traditional IT      DevOps Oriented
Average                      2.3                 1.5
Standard Deviation           1.7                 1.7

According to these results, DevOps team members lead a more balanced life: they spend more time on automation and infrastructure improvement, spend less time fighting fires, and work fewer hours (especially outside of normal business hours).

DevOps-related initiatives came out on top in 2012 and 2013, according to our survey. There's a strong need for agility to respond to ever-changing and expanding market needs, software teams are under pressure to help meet them, and the chart above validates the benefits of a DevOps approach.

Rosy Stats, but hard to adopt

How we got here

IT organizational structures – typically Dev, QA, and Ops – have come to exist for a reason. The dev team focuses on innovating and creating apps. The QA team ensures that the app behaves as intended. The operations team keeps the infrastructure running – from the apps, network, servers and shared resources to third-party services. Each team requires a special set of skills in order to deliver a superior experience in a timely manner.

The challenge

Today’s users increasingly rely on software and expect it to meet their constantly evolving needs 24/7, whether they’re at their desks or on their mobile devices. As a result, IT teams need to respond to change and release app updates quickly and efficiently without compromising on quality. Fail to do so, and they risk driving users to competitors or other alternatives.

However, releasing apps quickly comes with its own drawbacks. It strains functionally siloed teams and often results in software defects, delays and stress. Infrequent communication across teams further exacerbates the issue, leading to a snowball effect of finger-pointing and bad vibes.

Spurring cultural change

Both Dev and Ops teams bring a unique set of skills and experience to software development and delivery. DevOps is simply a culture that brings development and operations teams together so that, through understanding each other's perspectives and concerns, they can build and deliver resilient, production-ready software products in a timely manner. DevOps is not NoOps. Nor is it akin to putting a Dev in Ops clothing. DevOps is synergistic, rather than cannibalistic.

DevOps is a journey

Instilling a DevOps oriented culture within your organization is not something that you embark on and chalk off as success at the end. Adopting DevOps takes discipline and initiative to bring development and operations teams together. Read up on how other organizations approach adopting DevOps as a culture and learn from their successes and failures. Put to practice what makes sense within your group. Develop a maturity model that can guide you through your journey.

The goal is to make sure that dev and ops are on the same page, working together on everything, toward a common goal: continuous delivery of working software without handoffs, hand-washing, or finger-pointing.

Support the community and the cause

Dev and Ops need to look introspectively to understand their strengths and challenges, and look for ways to contribute towards breaking down silos. Together, they should seek to educate each other, culturally evolve roles, relationships, incentives, and processes and put end user experience first.

The DevOps community is small but burgeoning, and it’s easy to find ways to get involved, like with the community-driven explosion of DevOpsDays conferences that occur around the world.

Set small goals to be awesome

Teams should collaborate to set achievable goals and milestones that can get them on the path to embracing a DevOps culture. Celebrate small successes and focus on continuous improvement. Before you know it, you will gradually but surely reap the benefits of bringing a DevOps approach to application development and delivery.

Start here

For deeper insights into IT Ops and DevOps Productivity with a focus on people, methodologies and tools, download a 35-page report filled with stats and charts.

April 29, 2013

The State of DevOps: Accelerating Adoption

By James Turnbull (@kartar), VP of Technology Operations, Puppet Labs Inc.

A sysadmin’s time is too valuable to waste resolving conflicts between operations and development teams, working through problems that stronger collaboration would solve, or performing routine tasks that can – and should – be automated. Working more collaboratively and freed from repetitive tasks, IT can – and will – play a strategic role in any business.

At Puppet Labs, we believe DevOps is the right approach for solving some of the cultural and operational challenges many IT organizations face. But without empirical data, a lot of the evidence for DevOps success has been anecdotal.

To find out whether DevOps-attuned organizations really do get better results, Puppet Labs partnered with IT Revolution Press to survey a broad spectrum of IT operations people, software developers and QA engineers.

The data gathered in the 2013 State of DevOps Report proves that DevOps concepts can make companies of any size more agile and more effective. We also found that the longer a team has embraced DevOps, the better the results. That success – along with growing awareness of DevOps – is driving faster adoption of DevOps concepts.

DevOps is everywhere

Our survey tapped just over 4,000 people living in approximately 90 countries. They work for a wide variety of organizations: startups, small to medium-sized companies, and huge corporations.

Most of our survey respondents – about 80 percent – are hands-on: sysadmins, developers or engineers. Break this down further, and we see more than 70 percent of these hands-on folks are actually in IT Ops, with the other 30 percent in development and engineering.

DevOps orgs ship faster, with fewer failures

DevOps ideas enable IT and engineering to move much faster than teams working in more traditional ways. Survey results showed:

  • More frequent and faster deployments. High-performing organizations deploy code 30 times faster than their peers. Rather than deploying every week or month, these organizations deploy multiple times per day. Change lead time is much shorter, too. Rather than requiring lead time of weeks or months, teams that embrace DevOps can go from change order to deploy in just a few minutes. That means deployments can be completed up to 8,000 times faster.
  • Far fewer outages. Change failure drops by 50 percent, and service is restored 12 times faster.

Organizations that have been working with DevOps the longest report the most frequent deployments, with the highest success rates. To cite just a few high-profile examples, Google, Amazon, Twitter and Etsy are all known for deploying frequently, without disrupting service to their customers.

Version control + automated code deployment = higher productivity, lower costs & quicker wins

Survey respondents who reported the highest levels of performance rely on version control and automation:

  • 89 percent use version control systems for infrastructure management
  • 82 percent automate their code deployments

Version control allows you to quickly pinpoint the cause of failures and resolve issues fast. Automating your code deployment eliminates configuration drift as you change environments. You save time and reduce errors by replacing manual workflows with a consistent and repeatable process. Management can rely on that consistency, and you free your technical teams to work on the innovations that give your company its competitive edge.
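
As a rough illustration of that combination (the repository path and apply command below are placeholders, not any particular team's workflow), keeping infrastructure definitions in version control and applying them with the same script every time might look like this:

    #!/usr/bin/env python
    """Pull the version-controlled infrastructure repo and re-apply it only
    when something changed. Paths and the apply command are placeholders."""
    import subprocess

    REPO = "/srv/infra-config"               # local clone of the infra repo
    APPLY = ["puppet", "apply", "site.pp"]   # or chef-solo, a shell script, etc.

    def git(*args):
        out = subprocess.check_output(["git", "-C", REPO] + list(args))
        return out.decode().strip()

    before = git("rev-parse", "HEAD")
    git("pull", "--ff-only")
    after = git("rev-parse", "HEAD")

    if before != after:
        print("config changed %s -> %s, applying" % (before[:7], after[:7]))
        subprocess.check_call(APPLY, cwd=REPO)
    else:
        print("no changes, nothing to do")

Because every run starts from what is in version control, a failed change can be rolled back by reverting the commit and letting the same script run again.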

What are DevOps skills?

More recruiters are including the term DevOps in job descriptions. We found a 75 percent uptick in the 12-month period from January 2012 to January 2013. Mentions of DevOps as a job skill increased 50 percent during the same period.

In order of importance, here are the skills associated with DevOps:

  • Coding & scripting. Demonstrates the increasing importance of traditional developer skills to IT operations.
  • People skills. Acknowledges the importance of communication and collaboration in DevOps environments.
  • Process re-engineering skills. Reflects the holistic view of IT and development as a single system, rather than as two different functions.

Interestingly, experience with specific tools was the lowest priority when seeking people for DevOps teams. This makes sense to us: It’s easier for people to learn new tools than to acquire the other skills.

It makes sense on a business level, too. After all, the tools a business needs will change as technology, markets and the business itself shift and evolve. What doesn’t change, however, is the need for agility, collaboration and creativity in the face of new business challenges.

—-
About the author:

A former IT executive in the banking industry and author of five technology books, James has been involved in IT Operations for 20 years and is an advocate of open source technology. He joined Puppet Labs in March 2010.

April 2, 2013

Data Driven Observations on AWS Usage from CloudCheckr's User Survey

This is a guest post by Aaron Klein from CloudCheckr

We were heartened when AWS made Trusted Advisor free for the month of March. This was an implicit acknowledgement of what many have long known: AWS is complex, and users can find it challenging to provision and control their AWS infrastructure effectively.

We took the AWS announcement as an opportunity to conduct an internal survey of our customers’ usage. We compared the initial assessments of 400 of our users’ accounts against our 125+ best practice checks for proper configurations and policies. Our best practice checks span 3 key categories: Cost, Availability, and Security.  We limited our survey to users with 10 or more running EC2 instances.  In aggregate, the users were running more than 16,000 EC2 instances.

We were surprised to discover that nearly every customer (99%) experienced at least one serious exception.  Beyond this top level takeaway, our primary conclusion was that controlling cost may grab the headlines, but users also need to button up a number of availability and security issues.

When considering availability, there were serious configuration issues that were common across a high percentage of users. Users repeatedly failed to optimally configure Auto Scaling and ELB. The failure to create sufficient EBS snapshots was an almost universal issue.

Although users passed more of our security checks, the exceptions which did arise were serious. Many of the most common security issues were found in configurations for S3, where nearly 1 in 5 users allowed unfettered access to their buckets through "Upload/Delete" or "Edit Permissions" set to Everyone. As we explained in an earlier whitepaper, anyone using a simple bucket finder tool could locate and access these buckets.
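
A check along those lines is easy to script yourself. Here is a minimal sketch using the boto3 SDK (illustrative only, not CloudCheckr's implementation) that flags buckets granting any permission to the public AllUsers group:

    import boto3

    # Flag S3 buckets whose ACL grants anything to the public AllUsers group.
    ALL_USERS = "http://acs.amazonaws.com/groups/global/AllUsers"

    s3 = boto3.client("s3")
    for bucket in s3.list_buckets()["Buckets"]:
        acl = s3.get_bucket_acl(Bucket=bucket["Name"])
        for grant in acl["Grants"]:
            if grant.get("Grantee", {}).get("URI") == ALL_USERS:
                print("%s: %s granted to everyone" % (bucket["Name"], grant["Permission"]))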

Beyond the numbers, we also interviewed customers to gather qualitative feedback from users on some of the more interesting data points.

If the findings of this survey spark questions about how well your AWS account is configured, CloudCheckr offers a free account that you can set up in minutes. Simply enter read-only credentials from your AWS account and CloudCheckr will assess your configurations and policies in just a few minutes: https://app.cloudcheckr.com/LogOn/Registration

Conclusions by Area

Conclusions based upon Cost Exceptions:

As noted, our sample comprised 16,047 instances. The sample group spent a total of $2,254,987 per month on EC2 (and its associated costs), for an average monthly cost per customer of $7,516. Of course, we noted the mismatch between quantity and cost – spot instances represent 8% of the quantity but only 1.4% of the cost. This is because spot instances are significantly less expensive than on-demand instances.

When we looked at the Cost Exceptions, we found that 96% of all users experienced at least one exception (with many experiencing multiple exceptions). In total, we found that users who adopted our recommended instance sizing and purchasing type were able to save an average of $3,974 per month, for an aggregate total of $1,192,212 per month.

This suggests that price optimization remains a large hurdle for AWS users who rely on native AWS tools. Users consistently fail to optimize purchasing and also fail to optimize utilization. Combined, these issues mean that the average customer pays nearly twice as much as necessary for the resources needed to achieve proper performance.

To further examine this behavior, we interviewed a number of customers.  We interviewed customers who exclusively purchased on-demand and customers who used multiple purchasing types.

Here were their answers (summarized and consolidated):

  • Spot instances worry users – there is a general concern of: “what if the price spikes and my instance is terminated?” This fear exists despite the fact that spikes occur very rarely, warnings are available, and proper configuration can significantly mitigate this “surprise termination” risk.
  • It is difficult and time consuming to map the cost scenarios for purchasing reserved instances. The customers who did make this transition had cobbled together home grown spreadsheets as a way of supporting this business decision.  The ones who didn’t make this effort made a gut estimate that it wasn’t worth the time.  AWS was cost effective enough and the time and effort for modeling the transition was an opportunity cost taken away from building and managing their technology.
  • The intricacies of matching the configurations between on demand instances and reserved instances while taking into consideration auto scaling and other necessary configurations were daunting. Many felt it was not worth the effort.
  • Amazon’s own process for regularly lowering prices is a deterrent to purchasing RIs. This is especially true for RIs with a 3 year commitment.  In fact, within the customers who did purchase RI, none expressed a desire to purchase RIs with a 3 year commitment. All supported their decision by referencing the regular AWS price drops combined with the fact that they could not accurately predict their business requirements 3 years out.

Conclusions based upon Availability Exceptions:

We compared our users against our Availability best practices and found that nearly 98% suffered from at least 1 exception. We hypothesized that this was due to the overall complexity of AWS and interviewed some of our users for confirmation. Here is what we found from those interviews:

  • Users were generally surprised by the exceptions. They believed that they "had done everything right" but then realized that they had underestimated the complexity of AWS.
  • Users were often unsure of exactly why something needed to be remedied. The underlying architecture of AWS continues to evolve and users have a difficult time keeping up to speed with new services and enhancements.
  • AWS dynamism played a large role in the number of exceptions. Users commented that they often fixed exceptions and, after a week of usage, found new exceptions had arisen.
  • Users remained very happy with the overall level of service from AWS. Despite the exceptions which could diminish overall availability, the users still found that AWS offered tremendous functionality advantages.

Conclusions based upon Security Exceptions:

Finally, we looked at security. Here we found that 44% of our users had at least one serious exception present during the initial scan. The most serious and common exceptions occurred within S3 usage and bucket permissioning. Given the differences in cloud v. data center architecture, this was not entirely surprising. We interviewed our users about this area and here is what we found:

  • The AWS management console offered little functionality for helping with S3 security. It does not provide a user-friendly means of monitoring and controlling S3 inventory and usage. In fact, we found that most of our users were surprised when the inventory was reported. They often had 300-500% more buckets, objects and storage than they expected.
  • Price = importance: S3 is often an afterthought for users. Because it is so inexpensive, users do not audit it as closely as EC2 and other more expensive services, and rarely create and implement formal policies for S3 usage. The time and effort required to log into each region one by one to collect S3 information and download data through the Management Console was not considered worth it relative to the spend.
  • Given the low cost and lack of formal policies, team members throw up high volumes of objects and buckets, knowing that they can store huge amounts of data at minimal cost. Since users did not audit what they had stored, they could not determine the level of security.


Author Info: Aaron is the Co-Founder/COO of CloudCheckr Inc. (CCI). With over 20 years of managerial experience and vision, he directs the company's operations.

Aaron has held key leadership roles at diverse organizations ranging from small entrepreneurial start-ups to multi-billion dollar enterprises. Aaron graduated from Brandeis University and holds a J.D. from SUNY Buffalo.

Underlying Data Summary

Cost: Any exception 96%

The total of 16,047 instances was broken down into the following categories:

  • On Demand: 78% (12,517 instances)
  • Reserved: 14% (2,247 instances)
  • Spot: 8% (1,284 instances)

Monthly spend by purchasing type was broken down as follows:

  • On Demand: 89.7% ($2,023,623)
  • Reserved: 8.9% ($199,803)
  • Spot: 1.4% ($31,561)

Common Cost Exceptions we found:

  • Idle EC2 Instances: 36%
  • Underutilized EC2 Instances: 84%
  • EC2 Reserved Instance Possible Matching Mistake: 17%
  • Unused Elastic IP: 59%

Availability: Any exception 98%

Here, broken out by service, are some highlights of common and serious exceptions that we found (percentages are the share of customers with each exception):

EC2: Any exception 95%

  • EBS Volumes That Need Snapshots: 91%
  • Over Utilized EC2 Instances: 22%

Auto Scaling: Any exception 66%

  • Auto Scaling Groups Not Being Utilized For All EC2 Instances: 57%
  • All Auto Scaling Groups Not Utilizing Multiple Availability Zones: 34%
  • Auto Scaling Launch Configuration Referencing Invalid Security Group: 22%
  • Auto Scaling Launch Configuration Referencing Invalid AMI: 18%
  • Auto Scaling Launch Configuration Referencing Invalid Key Pair: 16%

ELB: Any exception 42%

  • Elastic Load Balancers Not Utilizing Multiple Availability Zones: 37%
  • Elastic Load Balancers With Fewer Than Two Healthy Instances: 21%

Security: Any exception 46%

These were the most common exceptions that we found:

  • EC2 Security Groups Allowing Access To Broad IP Ranges: 36%
  • S3 Bucket(s) With 'Upload/Delete' Permission Set To Everyone: 16%
  • S3 Bucket(s) With 'View Permissions' Permission Set To Everyone: 24%
  • S3 Bucket(s) With 'Edit Permissions' Permission Set To Everyone: 14%
November 11, 2012

Big Data Problems in Monitoring at eBay

This post is based on a talk by Bhaven Avalani and Yuri Finklestein at QConSF 2012 (slides). Bhaven and Yuri work on the Platform Services team at eBay.

by @mattokeefe

This is a Big Data talk with Monitoring as the context. The problem domain includes operational management (performance, errors, anomaly detection), triaging (Root Cause Analysis), and business monitoring (customer behavior, click stream analytics). Customers of Monitoring include dev, Ops, infosec, management, research, and the business team. How much data? In 2009 it was tens of terabytes per day, now more than 500 TB/day. Drivers of this volume are business growth, SOA (many small pieces log more data), business insights, and Ops automation.

The second aspect is Data Quality. There are logs, metrics, and events with decreasing entropy in that order. Logs are free-form whereas events are well defined. Veracity increases in that order. Logs might be inaccurate.

There are tens of thousands of servers in multiple datacenters generating logs, metrics and events that feed into a data distribution system. The data is distributed to OLAP, Hadoop, and HBase for storage. Some of the data is dealt with in real time, while other activities, such as OLAP for metric extraction, are not.

Logs
How do you make logs less "wild"? Typically there are no schemas, types, or governance. At eBay they impose a log format as a requirement. The log entry types include open and close for transactions, with times for transaction begin and end, a status code, and arbitrary key-value data. Transactions can be nested. Another type is the atomic transaction. There are also types for events and heartbeats. They generate 150TB of logs per day.

Large Scale Data Distribution
The hardest part of distributing such large amounts of data is fault handling. It is necessary to be able to buffer data temporarily, and handle large spikes. Their solution is similar to Scribe and Flume except the unit of work is a log entry with multiple lines. The lines must be processed in correct order. The Fault Domain Manager copies the data into downstream domains. It uses a system of queues to handle the temporary unavailability of a destination domain such as Hadoop or Messaging. Queues can indicate the pressure in the system being produced by the tens of thousands of publisher clients. The queues are implemented as circular buffers so that they can start dropping data if the pressure is too great. There are different policies such as drop head and drop tail that are applied depending on the domain’s requirements.
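
The drop policies are easy to picture in code. A toy sketch of the idea (this is not eBay's implementation): a bounded buffer that, when full, either discards the oldest entry or rejects the newest one, depending on the policy assigned to the destination domain.

    from collections import deque

    class BoundedQueue(object):
        """Toy bounded buffer illustrating drop-head vs. drop-tail policies."""

        def __init__(self, capacity, policy="drop_head"):
            self.buf = deque()
            self.capacity = capacity
            self.policy = policy
            self.dropped = 0

        def publish(self, entry):
            if len(self.buf) >= self.capacity:
                self.dropped += 1
                if self.policy == "drop_tail":
                    return False          # reject the incoming entry
                self.buf.popleft()        # drop_head: discard the oldest entry
            self.buf.append(entry)
            return True

        def drain(self, n):
            """Hand up to n entries, in order, to the downstream domain."""
            return [self.buf.popleft() for _ in range(min(n, len(self.buf)))]

The queue depth is exactly the back-pressure signal described above: a destination that falls behind shows up as a growing buffer long before data has to be dropped.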

Metric Extraction
The raw log data is a great source of metrics and events. The client does not need to know ahead of time what is of interest. The heart of the system that does this is distributed OLAP. There are multiple dimensions such as machine name, cluster name, datacenter, transaction name, etc. The system maintains in-memory counters on hierarchically described data. Traditional OLAP systems cannot keep up with the amount of data, so they partition across layers consisting of publishers, buses, aggregators, combiners, and query servers. The aggregators produce OLAP cubes, multidimensional structures of counters, and the combiner then produces one gigantic cube that is made available for queries.
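
The counting idea itself is compact. A hedged sketch (dimension names are illustrative, not eBay's schema): increment a counter for every prefix of the dimension hierarchy, so one event updates the cube at every level of aggregation.

    from collections import Counter

    # Count events at every level of a dimension hierarchy, e.g.
    # datacenter -> cluster -> machine -> transaction.
    cube = Counter()

    def record(datacenter, cluster, machine, transaction, value=1):
        dims = (datacenter, cluster, machine, transaction)
        for depth in range(len(dims) + 1):
            cube[dims[:depth]] += value    # the empty tuple () is the grand total

    record("dc1", "web", "host42", "checkout")
    record("dc1", "web", "host43", "checkout")
    record("dc2", "api", "host07", "search")

    print(cube[()])           # 3: all events
    print(cube[("dc1",)])     # 2: everything in dc1
    print(cube[("dc1", "web", "host42", "checkout")])   # 1

The real system partitions this work across aggregators and then merges the partial cubes in the combiner, but the structure being merged is essentially this map of dimension prefixes to counters.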

Time Series Storage
RRD was a remarkable invention when it came out, but it can't deal with data at this scale. One solution is to use a column-oriented database such as HBase or Cassandra. However, you don't know what your row size should be, and handling very large rows is problematic. OpenTSDB, on the other hand, uses fixed row sizes based on time intervals. At eBay's scale of millions of metrics per second, you need to down-sample based on metric frequency. To solve this, they introduced a concept of multiple row spans for different resolutions.
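
A minimal sketch of the row-span idea (the key layout here is illustrative, not OpenTSDB's or eBay's actual format): round the timestamp down to the span chosen for the metric's resolution, so every sample in that window lands in the same row.

    # Map a metric sample to a row key whose time span depends on resolution.
    ROW_SPANS = {
        "1s": 3600,          # high-frequency metrics: one row per hour
        "1m": 86400,         # medium resolution: one row per day
        "1h": 7 * 86400,     # low resolution: one row per week
    }

    def row_key(metric, resolution, timestamp):
        span = ROW_SPANS[resolution]
        base = timestamp - (timestamp % span)    # round down to the row's window
        return "%s:%s:%d" % (metric, resolution, base)

    print(row_key("cpu.user", "1s", 1352650834))   # cpu.user:1s:1352649600
    print(row_key("cpu.user", "1h", 1352650834))   # cpu.user:1h:1352332800

Fixing the span per resolution keeps row sizes predictable at any sample rate, which is exactly the property plain wide rows lack.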

Insights
* Entropy is important to look at; remove it as early as possible
* Data distribution needs to be flexible and elastic
* Storage should be optimized for access patterns

Q&A
Q. What are the outcomes in terms of value gained?
A. Insights into availability of the site are important as they release code every day. Business insights into customer behavior are great too.

Q. How do they scale their infrastructure and do deployments?
A. Each layer is horizontally scalable, but they're struggling with auto-scaling at this time. eBay is looking to leverage cloud automation to address this.

Q. What is the smallest element that you cannot divide?
A. Logs must be processed atomically. It is hard to parallelize metric families.

Q. How do you deal with security challenges?
A. Their security team applies governance. Also there is a secure channel that is encrypted for when you absolutely need to log sensitive data.

November 8, 2012

Release Engineering at Facebook

This post is based on a talk by Chuck Rossi at QConSF 2012. Chuck is the first Release Engineer to work at Facebook.
by @mattokeefe

Chuck tries to avoid the “D” “O” word… DevOps. But he was impressed by a John Allspaw presentation at Velocity 09 “10+ Deploys Per Day: Dev and Ops Cooperation at Flickr“. This led him to set up a bootcamp session at Facebook and this post is based on what he tells new developers.

The Problem
Developers want to get code out as fast as possible. Release Engineers don't want anything to break. So there's a need for a process. "Can I get my rev out?" "No. Go away." That doesn't work. They're all working to make change. Facebook operates at ludicrous speed, at massive scale. No other company on earth moves as fast at their scale.

Chuck has two things at his disposal: tools and culture. He latched on to the culture thing after Allspaw’s talk. The first thing that he tells developers is that they will shepherd their changes out to the world. If they write code and throw it over the wall, it will affect Chuck’s Mom directly. You have to deal with dirty work and it is your operational duty from check-in to trunk to in-front-of-my-Mom. There is no QA group at Facebook to find your bugs before they’re released.

How do you do this? You have to know when and how a push is done. All systems at Facebook follow the same path, and they push every day.

How does Facebook push?
Chuck doesn’t care what your source control system is. He hates them all. They push from trunk. On Sunday at 6p they take trunk and cut a branch called latest. Then they test for two days before shipping. This is the old school part. Tuesday they ship, then Wed-Fri they cherry pick more changes. 50-300 cherry picks per day are shipped.

But Chuck wanted more. “Ship early and ship twice as often” was a post he wrote on the Facebook engineering blog. (check out the funny comments). They started releasing 2x/day in August. This wasn’t as crazy as some people thought, because the changes were smaller with the same number of cherry picks per day.

About 800 developers check in per week, and the number keeps growing as they hire more. There are about 10k commits per month to a 10M LOC codebase. But the rate of cherry picks per day has remained pretty stable. There is a cadence for how things go out, so you should put most of your effort into the big weekly release. Then lots of stuff crowds in on Wednesday as fixes come in. Be careful on Friday. At Google they had "no push Fridays": don't check in your code and leave. Sunday and Monday are their biggest days, as everyone uploads and views all the photos from everyone else's drunken weekend.

Give people an out. If you can’t remember how to do a release, don’t do anything. Just check into trunk and you can avoid the operational burden of showing up for a daily release.

Remember that you’re not the only team shipping on a given today. Coordinate changes for large things so you can see what’s planned company wide. Facebook uses Facebook groups for this.

Dogfooding
You should always be testing. People say it but don’t mean it, but Facebook takes it very seriously. Employees never go to the real facebook.com because they are redirected to http://www.latest.facebook.com. This is their production Facebook plus all pending changes, so the whole company is seeing what will go out. Dogfooding is important. If there’s a fatal error, you get directed to the bug report page.

File bugs when you can reproduce them. Make it easy and low friction for internal users to report an issue. The internal Facebook includes some extra chrome with a button that captures session state, then routes a bug report to the right people.

When Chuck does a push, there's another step: developers' changes are not merged unless they've shown up. You have to reply to a message to confirm that you're online and ready to support the push. So the actual build is http://www.inyour.facebook.com, which has fewer changes than latest.

Facebook.com is not to be used as a sandbox. Developers have to resist the urge to test in prod. If you have a billion users, don’t figure things out in prod. Facebook has a separate complete and robust sandbox system.

On-call duties are serious. They make sure that they have engineers assigned as point of contact across the whole system. Facebook has a tool that allows quick lookup of on-call people. No engineer escapes this.

Self Service
Facebook does everything in IRC. It scales well with up to 1000 people in a channel. Easy questions are answered by a bot. There is a command to lookup the status of any rev. They also have a browser shortcut as well. Bots are your friends and they track you like a dog. A bot will ask a developer to confirm that they want a change to go out.

Where are we?
Facebook has a dashboard with nice graphs showing the status of each daily push. There is also a test console. When Chuck does the final merge, he kicks off a system test immediately. They have about 3500 unit test suites and he can run one on each machine. He reruns the tests after every cherry pick.

Error tracking
There are thousands and thousands of web servers. There’s good data in the error logs but they had to write a custom log aggregator to deal with the volume. At Facebook you can click on a logged error and see the call stack. Click on a function and it expands to show the git blame and tell you who to assign a bug to. Chuck can also use Scuba, their analysis system, which can show trends and correlate to other events. Hover over any error, and you get a sparkline that shows a quick view of the trend.

Gatekeeper
This is one of Facebook’s main strategic advantages that is key to their environment. It is like a feature flag manager that is controlled by a console. You can turn new features on selectively and restrain the set of users who see the change. Once they turned on “fax your photo” for only Techcrunch as a joke.

Push karma
Chuck’s job is to manage risk. When he looks at the cherry pick dashboard it shows the size of the change, and the amount of discussion in the diff tool (how controversial is the change). If both are high he looks more closely. He can also see push karma rated up to five stars for each requestor. He has an unlike button to downgrade your karma. If you get down to two stars, Chuck will just stop taking your changes. You have to come and have a talk with him to get back on track.

Perflab
This is a great tool that does a full performance regression on every change. It will compare perf of trunk against the latest branch.

HipHop for PHP
This generates about 600 highly optimized C++ files that are then linked into a single binary. But sometimes they use interpreted PHP in dev. This is a problem they plan to solve with the PHP virtual machine they intend to open source.

Bittorrent
This is how they distribute the massive binary to many thousands of machines. Clients contact the Open Tracker server for a list of peers. There is rack affinity, and Chuck can push in about 15 minutes.

Tools alone won’t save you
The main point is that you cannot tool your way out of this. The people coming on board have to be brainwashed so they buy into the cultural part. You need the right company with support from the top all the way down.

September 30, 2012

Automating Cloud Applications using Open Source at BrightTag

This guest post is based on a presentation given by @mattkemp, @chicagobuss, and @codyaray at CloudConnect Chicago 2012

As a fast-growing tech company in a highly dynamic industry, BrightTag has made a concerted effort to stay true to our development philosophy. This includes fully embracing open source tools, designing for scale from the outset and maintaining an obsessive focus on performance and code quality (read our full Code to Code By for more on this topic).

Our recent CloudConnect presentation, Automating Cloud Applications Using Open Source, highlights much of what we learned in building BrightTag ONE, an integration platform that makes data collection and distribution easier.  Understanding many of you are also building large, distributed systems, we wanted to share some of what we’ve learned so you, too, can more easily automate your life in the cloud.

Background

BrightTag utilizes cloud providers to meet the elastic demands of our clients. We also make use of many off-the-shelf open source components in our system including Cassandra, HAProxy and Redis. However, while each component or tool is designed to solve a specific pain point, gaps exist when it comes to a holistic approach to managing the cloud-based software lifecycle. The six major categories below explain how we addressed common challenges that we faced and it’s our hope that these experiences help other growing companies grow fast too.

Service Oriented Architecture

Cloud-based architecture can greatly improve scalability and reliability. At BrightTag, we use a service oriented architecture to take advantage of the cloud’s elasticity. By breaking a monolithic application into simpler reusable components that can communicate, we achieve horizontal scalability, improve redundancy, and increase system stability by designing for failure. Load balancers and virtual IP addresses tie the services together, enabling easy elasticity of individual components; and because all services are over HTTP, we’re able to use standard tools such as load balancer health checks without extra effort.
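
Because every service speaks HTTP, the load balancer can use the same kind of health check for all of them. A minimal sketch of such an endpoint using Flask (the dependency check is a placeholder; this is not BrightTag's actual service code):

    from flask import Flask, jsonify

    app = Flask(__name__)

    def dependencies_ok():
        # Placeholder: ping the database, cache, downstream services, etc.
        return True

    @app.route("/health")
    def health():
        if dependencies_ok():
            return jsonify(status="ok"), 200
        return jsonify(status="unhealthy"), 503   # load balancer pulls this node

    if __name__ == "__main__":
        app.run(port=8080)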

Inter-Region Communication

Most web services require some data to be available in all regions, but traditional relational databases don’t handle partitioning well. BrightTag uses Cassandra for eventually consistent cross-region data replication. Cassandra handles all the communication details and provides a linearly scalable distributed database with no single point of failure.

In other cases, a message-oriented architecture is more fitting, so we designed a cross-region messaging system called Hiveway that connects message queues across regions by sending compressed messages over secure HTTP. Hiveway provides a standard RESTful interface to more traditional message queues like RabbitMQ or Redis, allowing greater interoperability and cross-region communication.

Zero Downtime Builds

Whether you have a website or a SaaS system, everyone knows uptime is critical to the bottom line. To achieve 99.995% uptime, BrightTag uses a combination of Puppet, Fabric and bash to perform zero downtime builds. Puppet provides a rock-solid foundation for our systems. We then use Fabric to push out changes on demand. We use a combination of HAProxy and built-in health checks to make sure that our services are always available.
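
To make that concrete, here is a hedged sketch of a Fabric-driven rolling deploy (host names, paths and the health-check toggle are made up; this is not our actual fabfile): drain a node via its health check, update it, then put it back in rotation.

    # fabfile.py -- rolling deploy sketch; hosts, paths and commands are
    # illustrative placeholders.
    import time
    from fabric.api import env, run, sudo, task

    env.hosts = ["web1.example.com", "web2.example.com"]   # Fabric runs serially by default

    @task
    def deploy(version):
        # 1. Fail the HAProxy health check so traffic drains off this node.
        run("touch /srv/app/maintenance.flag")
        time.sleep(10)                        # let in-flight requests finish
        # 2. Push out the new release and restart the service.
        run("git -C /srv/app fetch && git -C /srv/app checkout %s" % version)
        sudo("service myapp restart")
        # 3. Pass the health check again so HAProxy re-adds the node.
        run("rm /srv/app/maintenance.flag")

Invoked as something like "fab deploy:v1.2.3", the task runs one host at a time, so the pool behind the load balancer never empties and users never see downtime.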

Network Connectivity

Whether you use a dedicated DNS server or /etc/hosts files, keeping a flexible environment functioning properly means keeping your records up to date, which in turn means knowing where your instances are on a regular, automated basis. To accomplish this, we use a tool called Zerg, a Flask web app that leverages libcloud to abstract the specific cloud provider API away from the common operations we need to perform regularly in all our environments.
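
The heart of that kind of tool is small. A hedged sketch using libcloud (credentials are placeholders, and exact driver arguments vary by libcloud version) that lists running instances and prints /etc/hosts-style records for them:

    from libcloud.compute.providers import get_driver
    from libcloud.compute.types import Provider

    # Placeholders: real credentials would come from config or the environment.
    Driver = get_driver(Provider.EC2)
    conn = Driver("ACCESS_KEY", "SECRET_KEY", region="us-east-1")

    for node in conn.list_nodes():
        if node.private_ips:
            # One line per instance, ready to drop into /etc/hosts or a DNS zone.
            print("%s %s" % (node.private_ips[0], node.name))

Swapping Provider.EC2 for another provider constant is the whole point of going through libcloud: the inventory code stays the same when the hosting provider changes.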

HAProxy Config Generation

Zerg allows us to do more than just generate lists of instances with their IP addresses.  We can also abstractly define our services in terms of their ports and health check resource URLs, giving us the power to build entire load balancer configurations filled in with dynamic information from the cloud API where instances are available.  We use this plus some carefully designed workflow patterns with Puppet and git to manage load balancer configuration in a semi-automated way. This approach maximizes safety while maintaining an easy process for scaling our services independently – regardless of the hosting provider.
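
A hedged sketch of that generation step (the service definition, template and addresses are simplified examples, not our production config): feed the instance list from the cloud API into a Jinja2 template for an HAProxy backend.

    from jinja2 import Template

    BACKEND_TPL = Template("""\
    backend {{ service.name }}
        option httpchk GET {{ service.health_check }}
    {% for host in hosts %}
        server {{ host.name }} {{ host.ip }}:{{ service.port }} check
    {% endfor %}
    """)

    service = {"name": "api", "port": 8080, "health_check": "/health"}
    hosts = [
        {"name": "api1", "ip": "10.0.0.11"},   # in practice, fetched via the cloud API
        {"name": "api2", "ip": "10.0.0.12"},
    ]

    print(BACKEND_TPL.render(service=service, hosts=hosts))

The rendered file is then reviewed and rolled out through Puppet and git rather than pushed blindly, which is where the "semi-automated" part of the workflow comes from.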

Monitoring

Application and OS level monitoring is important to gain an understanding of your system. At BrightTag, we collect and store metrics in Graphite on a per-region basis. We also expose a metrics service per-region that can perform aggregation and rollup. On top of this, we utilize dashboards to provide visibility across all regions. Finally, in addition to visualizations of metrics, we use open source tools such as Nagios and Tattle to provide alerting on metrics we’ve identified as key signals.
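
Getting a number into that pipeline is deliberately trivial. A minimal sketch of pushing one metric to Graphite's plaintext carbon listener (host and metric name are placeholders):

    import socket
    import time

    def send_metric(path, value, host="graphite.example.com", port=2003):
        """Ship one datapoint using Graphite's plaintext protocol:
        'metric.path value unix_timestamp', newline-terminated."""
        line = "%s %f %d\n" % (path, value, int(time.time()))
        sock = socket.create_connection((host, port))
        try:
            sock.sendall(line.encode())
        finally:
            sock.close()

    send_metric("us-east.web.api.requests_per_sec", 123.0)

In practice a metrics client library usually does the batching for you, but the wire format above is all the carbon listener needs, which is what makes per-region aggregation and dashboards cheap to build on top of it.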

There is obviously a lot more to discuss when it comes to how we automate our life in the cloud at BrightTag. We plan to post more updates in the near future to share what we’ve learned in the hopes that it will help save you time and headaches living in the cloud. In the meantime, check out our slides from CloudConnect 2012.

October 22, 2011

Overcoming Organizational Hurdles

By Seth Thomson and Chris Read @cread given at Camp DevOps 2011

This post was live blogged by @martinjlogan so expect errors.

This talk is about how to overcome organizational hurdles and get DevOps humming in your org. This illustrates how we did it at DRW Trading.

DRW needed to adjust. The problem was that we were not exposing people to problems upfront. Everyone was only exposed to their local problems and only optimized locally. We looked, and continue to look, at DevOps as our tool to change this.

Cultural lessons

[Seth is talking a bit about the lessons that were learned at DRW that can really be applied at all levels in the org.]

The first thing you need to do if you are introducing DevOps to your org is define what DevOps is to you. Gartner has an interesting definition; not sure if it reflects our opinions, but at least they are trying to figure it out. At DRW we use the terms "agile operations" and DevOps interchangeably. We are integrating IT operations with agile and lean principles: fast iterative work, embedding people on teams, and moving people as close to the value they are delivering as possible. DevOps is not a job, it is a way of working. You can have people in embedded positions using these practices as easily as you can for folks on shared teams.

The next thing you need to do is focus on the problem that you are trying to solve. This is obvious but not all that simple. Here is an example. We had a complaint from our high frequency trading folks last year saying that servers were not available fast enough. It took on average 35 days for us to get a server purchased and ready to run. Dan North and I were reading the book "The Goal" – a book I highly recommend. It is a really good read. In the book he talks about the theory of constraints and applying lean principles to repeatable processes. We applied a technique called value stream mapping to our server delivery process. People complained that I [Seth] was a bottleneck because I had to approve all server purchases. It turned out I only take 2 hours to do that. The real problem lay elsewhere. The value stream mapping allowed us to see where our bottlenecks were so that we could focus in on our real bottlenecks and not waste cycles on less productive areas. We zeroed in accurately and reduced the time from 35 to 12 days.

The third cultural lesson, and an important one, is to keep your specialists. One of the worst things that can happen is that you introduce a lot of generalist operators and then the network team, for example, says "wow, you totally devalued me," and they quit. That way you lose a lot of expertise that turns out to be quite useful. Keep your specialists in the center. You want to highlight the tough problems to the specialists and leverage them for solving those problems. Introducing DevOps can actually open the floodgates for more work for the people in the center. We endeavored to distribute Unix system management to reduce the amount of work for the Unix team itself. This got people all across the org a bit closer to what was going on in this domain. What actually happened is that the Unix team was hit harder than ever. As we got people closer to the problem, the demand that we had not seen or been able to notice previously increased quite a bit. This is a good problem to have, because you start to understand more of what you are trying to do and you get more opportunities to innovate around it.

If you are looking at a traditional org oftentimes these specialist teams are spending time justifying their own existence. They invent their own projects and they do things no one needs. These days at DRW we find that we have long shopping lists of deep unix things that we actually need. The Unix specialists are now constantly working on key useful features. We are always looking for more expert unix admins.

The last lesson learned, a painful one, is that "people have to buy in". The CIO can't just walk in and say you have to start doing DevOps. You can't force it. We made a mistake recently, learned from it, and turned it into a success. A few months ago we were looking at source control usage. The infrastructure teams were not leveraging this stuff enough for my taste, among other things. I said we need to get these guys pairing with a software engineer. I forced it. It went along these lines: the person doing the pairing was not teaching the person they were pairing with; they were instead just focused on solving the problem of the moment. The person being paired with was not bought in to even doing the pairing in the first place. People resented this whole arrangement.

We took a hard retrospective look at this and in the end we practiced iterative agile management and changed course. I worked with Dan North who came from a software engineering background and who also had a lot of DevOps practice. A key thing about Dan is that he loves to teach and coach other people. The fact that he loved coaching was a huge help. Dan sat with folks on the networking team and got buy-in from them. He got them invested in the changes we wanted to make. The head of the networking team now is learning python and using version control. Now the network team is standing up self service applications that are adding huge value for the rest of the organization and making us much more efficient.

Some lessons learned from the technology

Ok, so Seth has covered a lot of the cultural bits and pieces. Now I [Chris Read] will talk about the technical lessons, or at least lessons stemming from technical issues. What follows are a few examples that have reinforced some of the cultural things we have done. The first one is the story of the lost packet. This happened within the first month or two of me joining. We had an exchange sending out market data, through a few hops, to a server that every now and again loses market data. We know this because we can see gaps in the sequence numbers.

The first thing we did was check the exchange to see if it was actually mis-sequencing the data. Nope, that was not the problem. So then the dev team went down to check the server itself. The Unix team looks at the machine, the IP stack, the interfaces, etc… they declared the machine fine. Next the network guys jump in and see that everything is fine there. The server however was still missing data. So we jump in and look at the routers. Guess what, everything looks fine. This is where I [Chris Read] got involved. This problem is what you call the call center conundrum: people focus on small parts of the infrastructure, and with the knowledge that they have, things look fine. I got in and, luckily, in previous lives I have been a network admin and a Unix admin. I dig in and I can see that the whole network up to the machine was built with high availability pairs. I dig into these pairs. The first ones looked good. I look into more and then finally get down to one little pair at the bottom, and there was a different config on one of the machines. A single line problem. Solving this fixed it. It was only through having a holistic view of the system and having the trust of the org to get onto all of these machines that I was able to find the problem.

The next story is called "monitoring giants". This also happened quite early in my dealings at DRW, and it taught me a very interesting lesson. I had been in London for 6 weeks and lots of folks were talking about monitoring. We needed more monitoring. I set up a basic Zenoss install and other such things. I came to Chicago and my goal was to show the folks here how monitoring was done, meaning to inspire the Chicago folks. I went to show them things about monitoring and was met with a fairly negative response. The guys perceived my work as a challenge on their domain. My whole point in putting this together was lost. I learned the lesson of starting to work with folks early on and being careful about how you present things. It was also a lesson on change. It is only in the last couple of months that I have learned how difficult change can be for a lot of people. You have to take this into account when pushing change. Another bit of this lesson is that you need to make your intentions obvious – over-communicate.

We actually think it is ok to recreate the wheel if you are going to innovate. What is not ok is to recreate it without telling the folks that currently own it. – Seth Thompson.

The next lesson is about DNS. This one was quite surprising to me. It is all about unintended consequences. Our DNS services used to handle a very low number of requests. As we started introducing DevOps there was a major ramp-up in DNS requests per second. We were not actually monitoring it, though. All of a sudden people started noticing latency. People started to say "hey, why is the Internet slow?". Network people looked at all kinds of things and then the problem seemed to solve itself. We let it go. Then a few weeks later, outage! The head of our Windows team noticed that one host was doing 112k lookups per second. Some developers had written a monitoring script that did a DNS lookup in a tight loop. We have now added all this to our monitoring suite. Because the Windows team had been taught about network monitoring and log file analysis, because they had been exposed, they were able to catch and fix this problem themselves.
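
The fix for that kind of self-inflicted load is usually trivial once you can see it. A hypothetical sketch of the mistake and the remedy (host name and TTL are made up): cache lookups instead of resolving inside the loop.

    import socket
    import time

    CACHE = {}
    TTL = 60  # seconds: resolve each name at most once per minute

    def cached_lookup(hostname):
        ip, expires = CACHE.get(hostname, (None, 0))
        if time.time() >= expires:
            ip = socket.gethostbyname(hostname)       # only hit DNS on expiry
            CACHE[hostname] = (ip, time.time() + TTL)
        return ip

    # A naive "while True: socket.gethostbyname(host)" probe sends one DNS
    # query per iteration; this version sends at most one per TTL window.
    for _ in range(10000):
        cached_lookup("example.com")   # placeholder host
        time.sleep(0.01)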

Quick summary of the lessons

Communication is key. You must spend time with the people you are asking to change the way they work.

Get buy-in, don’t push. As soon as you push something onto someone, they are going to push back. Something will break, someone will get hurt. You need to develop a pull – they must pull change from you they must want it.

Keep iterating. Keep getting better and make room for failure. If people are afraid of mistakes they won't iterate.

Finally, change is hard. Change is hard, but it is the only constant. As you are developing you will constantly change. Make sure that your organization and your people are geared toward healthy attitudes about change.

Question: Can you talk a little bit more about buy-in?
Answer: One of the most important things about getting buy-in is to prove your changes out for them. Try things on a smaller scale – prototypes of process or technology – get a success, and hold it up as an example of why it should be scaled out further.

March 8, 2011

DevOps Culture Hacks

Talk by Jesse Robbins (@jesserobbins), Chairman and CEO of OpsCode

Blogged live from DevOpsDays Boston 2011 by @martinjlogan

Tricks for getting DevOps to work in your company, from the technical to the social, taken from experiences at Amazon.

Jesse Robbins was the "Master of Disaster" at Amazon. As an ops guy, he worked multiple 72-hour sessions, took site uptime very seriously, and eventually turned into the stereotypical ops guy who said "no" all the time. What's worse is he discovered that he was proud of it. Jesse started to notice that he was taking site outages personally. Amazon at the time, 2002-2003, was doing standard ops. They deployed in large monolithic fashion, which is an absolutely painful process prone to error. Managing such a process while also being highly emotionally involved with the work is not a productive situation to be in.

A turning point for Jesse, in terms of moving from being an obstacle in the way of change to someone who really knew how to add value with ops practice, stemmed from a battle he got into with the "VP of Awesome" at Amazon. This was the nickname of this particular VP because it seemed that pretty much any highly interesting project at Amazon was under his purview. What happened was that Jesse did not want to let out a piece of software because he knew, for sure, that it would bring the site down. The VP overrode him by saying that the site may go down, but the stock price will go up. So the software went out, and it brought the site down. Two days of firefighting and the site came back up, and so did the stock price, and so did the volume of orders.

The dev team went on and had a party; they were rewarded for a job well done: new and profitable functionality released. At the end of the year, Ops got penalized for the outage! Amazon rewarded development for releasing software and providing value, and operations was not a part of that. They were in fact penalized for something that was out of their control.

This of course did not sit well and as a result of this and other similar situations Jesse actually got famous for saying no. Who in their right mind would want to release software and go through that over and over? When the business would put up a sign advertising new functionality that was to be developed, something they were presumably excited about, Jesse would write the word “No” on it.

Operations naturally wanted to protect itself and came up with all kinds of artifice in order to do so: root cause analysis so that blame could be assigned efficiently, software freezes that prevented software from being delivered to the site during peak times of the year. This seemed like progress, but looking back on it, clearly was not.

(This all sounded fairly familiar to the folks in the room)

On to DevOps

Now to talk about DevOps. We have at least the beginnings of an idea of what to do about the situation described above. Prior to addressing this situation correctly, or at least more correctly, Ops looked at their output as waste. The best thing they could do was cost the site $0. Instead we need to be looking at this another way, bringing value through the function of Ops. DevOps is about creating a competitive advantage around the things Ops does every day.

Why does the break occur? Historically Ops creates value by reducing change and getting paged when things break. Dev is about value creation and Ops is about protecting that value. This creates a “misalignment of incentives” meaning that different organizations are rewarded for different behaviors. This creates something called local optimization. Knowing these terms will help you talk to MBAs about DevOps!

We have a fundamental misalignment of incentives and in fact a conflict in incentives. Development is exclusively aligned to releasing software and not at all focused on maintaining it. Ops, is the opposite. Each group optimizes locally around this which creates conflict. Operations is focused on minimizing change because that reduces outage where as Dev is entirely focused on maximizing change.

Solving this problem is what DevOps is all about.

The unproductive way of thinking mentioned earlier came about in an environment with 4,000 devs and significantly fewer ops folks. The attempts to alleviate the problems caused by misaligned incentives and local optimization were those punitive changes that are incredibly satisfying to Ops folks but really don't help solve the problem: changes in the form of meetings and review boards that exist to punish people into releasing the software the way you, the Ops person, want. These are the kind of measures that control-oriented people gravitate towards, and it feels like progress for them. To be more DevOps, don't try to fight them on this: "Don't fight stupid, make more awesome".

One initial thing that changed, that started the real progress, was to align dev and ops in a way that prevented local optimization: putting devs on call for their own software. This started to shift ops from being the people that just dealt with all the problems to people that became experts on all the services that allow the software to run. Ops started to become tier 2, the escalation path for devs. The way you got there was to offer devs deployment options and permissions if they passed some training and were willing to be on call. Initially this caused a fair amount of chaos. Devs had a load of pagers and got messages that confused the heck out of them. There was pain and frustration. This pain and frustration, and the fact that devs were now playing with tools in actual production environments, really changed the culture quickly.

Through trial and error, top-down fiat, audits, and every carrot-and-stick approach, the formula for this class of organizational change was developed. This is what Jesse uses even today to accomplish these changes, and it is what we will concern ourselves with for the rest of this talk.

  • Start small and build on trust and safety.
  • Create champions
  • Use metrics to build confidence
  • Celebrate success
  • Exploit compelling events

Start Small and Build on Trust and Safety

We tend to want to take on the entire org up front. We want to throw everything out, starting at the top and working down. This simply does not work; Jesse failed multiple times before realizing it was a losing strategy. Continuous deployment, for example, seems like something you would want to roll out widely. Instead you should start with a small, motivated team and build some success there.

Another thing to consider when starting small and building trust is that when introducing disruptive change in an organization, you should lead with questions to garner buy-in instead of just telling people that you have the solution and such and such is what to do. Don’t even use the word DevOps; just focus on the problems and get permission to start changing them.

You have to make the experiment of changing things safe. Jesse tells people that he will take 100% of the blame if things go wrong, in exchange for the space needed to make the desired changes. Creating safety is critical in pushing through organizational change. Crucial Conversations by Kerry Patterson covers this really well, and Jesse thinks it is one of the most critical books to read if you want to create organizational change.

Create Champions

“You can accomplish anything you want so long as you don’t require credit or compensation.” It is amazing what you can accomplish when you give away the part of yourself that requires recognition. You must shine the spotlight on those in your organization who get it. Get the people who recognize the need and are acting on what you are pushing forward to speak up. Make your boss a champion; this is critical. It is really important that your boss can explain what you are doing and why, or at least provide air cover for you.

The second part of this is to give people status; special status. SREs (“site reliability engineers”) walk around Google in leather bomber jackets. They get hazard pay, they have special parties, and they are considered to have elevated status around the organization. When you find your champions, do something that makes them stand out. Wikis are quite powerful for this. Write down and explain in very powerful language what these champions do.

At Amazon, Jesse created something called the call leader program. They trained people to handle high-impact events. After a while, a pressure to join evolved. Eventually you become the person people have to go to in order to get a certain status, which gives you personally more organizational power. That is not the point, but it is helpful in furthering the change you want to make.

Use Metrics to Build Confidence

Get as many metrics as you can. Begin to look at them for KPIs (key performance indicators). These are the things you will use to prove your case. What you are looking for is a story, and a set of metrics to prove you have one. John Allspaw talks about things like MTTR, mean time to recover. These are great for storytelling. You want to capture metrics early and use them to tell your story. “Having devs on call will be a great thing for us, and oh, by the way, here is the data that proves it!”
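As a concrete illustration of the kind of metric involved, here is a minimal sketch of computing MTTR from a handful of incident records. The incidents, timestamps, and field names are invented for illustration and are not from the talk.

```python
from datetime import datetime

# Hypothetical incident log: when each outage was opened and when it was resolved.
incidents = [
    {"opened": datetime(2011, 3, 1, 9, 15), "resolved": datetime(2011, 3, 1, 9, 55)},
    {"opened": datetime(2011, 3, 4, 22, 0), "resolved": datetime(2011, 3, 5, 0, 30)},
    {"opened": datetime(2011, 3, 6, 14, 5), "resolved": datetime(2011, 3, 6, 14, 35)},
]

# MTTR = average time from "it broke" to "it is working again".
minutes = [(i["resolved"] - i["opened"]).total_seconds() / 60 for i in incidents]
mttr = sum(minutes) / len(minutes)
print(f"MTTR: {mttr:.0f} minutes")  # track this before and after the change you made
```

A number like this, captured early and tracked over time, is exactly the kind of data point you can hang the story on.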

Make sure that you are good at telling your story with the data that you capture. Hans Rosling gives a TED talk about how to tell a story with data that Jesse recommends to everyone; it can be used for inspiration. Ideally, have your champions tell your story.

Celebrate Success

This comes back to “you can accomplish anything you want so long as you don’t require credit or compensation.” Create moments in time where you celebrate the success of the change you created. Have parties when you reduce MTTR by 15 percent. Give people a moment where they recognize the change they created and that the change was good. This gives them a point in time to look back on and judge progress. This is of critical importance.

Exploit Compelling Events

Compelling events are those big company issues; big or small, it does not actually matter. They are the events that create cultural permission to make important change. An example is an executive mandate toward cloud computing: that is a compelling event that allows you to make a whole bunch of procedural change. Big outages are compelling events; they give you cultural permission to make significant change that would otherwise be impossible in normal times.

When you have a compelling event you don’t encounter resistance, but instead permission to make large change. If you don’t have such an event, create it. Jesse ran something called “game day” at Amazon: creating outages to test failure recovery. Non-recovery became a compelling event. Big deployment pushes are also examples of compelling events. If you are in the middle of a serious process problem it is hard to propose chucking the process out in flight, but if you offer to own the postmortem you can direct it toward the change you want to make; start small, though, as indicated by the first point in this list.

The next time you want to create change in your org, particularly DevOps change, keep in mind:

  • Start small and build on trust and safety.
  • Create champions
  • Use metrics to build confidence
  • Celebrate success
  • Exploit compelling events

What are your thoughts on DevOps culture hacks? Anything to add to the list?

March 8, 2011

DevOpsDays LiveBlog – Cassandra and Puppet

by Dave Connors from Constant Contact

This talk is about how Constant Contact integrated social media into their offering using Cassandra and Puppet. Small businesses look to Constant Contact for help with customer contact. Social media as part of marketing is really growing, so Constant Contact had to integrate it. The business rules for social media are quite different from those for email marketing, but the number one challenge with a social media integration is that the data volume is on the order of 10 to 100 times greater than email.

NoSQL, Puppet, and DevOps practice offered answers on how to accomplish the integration described above rapidly and at low cost. Two million dollars would have been the price tag for the integration with their traditional data stores; with NoSQL it is much, much cheaper. The second nice thing about NoSQL is that the time to market was reduced. The right technology alone would not have been the solution, though: they needed to focus on having a real DevOps culture and practice.

Ops and Dev both faced issues in getting the Constant Contact social media integration project done:

  • Data Model – Cassandra is different
  • Monitoring – Old monitoring solution was not suitable
  • Authentication
  • Logging – Lots more data
  • Risk profile
  • Roles and Responsibilities – swapping them around a bit from the traditional approach

This social media project was completed in 3 months. Cassandra/NoSQL and DevOps brought them a lot of advantage in making this possible.

The Dev Perspective

The system architect, Jim, now speaks about the dev perspective on this project. Cassandra was the tool chosen to underpin the project. It was developed at Facebook and open sourced in 2008, incubated at Apache, and is in use at Digg, Facebook, Twitter, Reddit, etc. Cassandra has the following characteristics:

Cassandra is implemented in Java, which does not much matter.

  • It is fault tolerant.
  • It is elastic: you can basically keep adding nodes and it scales more or less linearly.
  • It is durable. Data is automatically replicated to multiple nodes, and you can tweak options for consistency and replication (a small sketch of this follows the list).
  • It has a rich data model, not strictly key-value. You can actually have some structure to the data if needed.
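To make the replication and consistency knobs concrete, here is a minimal sketch using the DataStax Python driver (cassandra-driver). Note that the talk itself used the Hector Java client, and the contact points, keyspace, and table below are invented for illustration.

```python
from cassandra.cluster import Cluster
from cassandra import ConsistencyLevel
from cassandra.query import SimpleStatement

# Hypothetical contact points and keyspace, for illustration only.
cluster = Cluster(["10.0.0.1", "10.0.0.2", "10.0.0.3"])
session = cluster.connect()

# Replication is set per keyspace, e.g. three replicas in each of two data centers.
session.execute("""
    CREATE KEYSPACE IF NOT EXISTS social_media
    WITH replication = {'class': 'NetworkTopologyStrategy', 'DC1': 3, 'DC2': 3}
""")
session.set_keyspace("social_media")
session.execute("CREATE TABLE IF NOT EXISTS events (id text PRIMARY KEY, payload text)")

# Consistency is chosen per request: QUORUM waits for a majority of replicas to acknowledge.
write = SimpleStatement(
    "INSERT INTO events (id, payload) VALUES (%s, %s)",
    consistency_level=ConsistencyLevel.QUORUM,
)
session.execute(write, ("event-123", "hello"))
```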

Some development challenges in working on this project with this technology included:

  • Moving target. Cassandra major releases come fast in comparison to DB2, for example.
  • Developer unfamiliarity. Cassandra is not totally trivial to wrap your mind around.
  • Operational procedures. There are not a lot of established best practices out there for dealing with this sort of DB.
  • Reliability concerns. Can you realize the promise of its reliability if you don’t fully understand how to do so?

How this was mitigated/handled for this project

  • Pushing hard on deployment automation – clearly
  • Community involvement. The Apache Cassandra community is very active and good for ferreting out best practices.
    Getting into the community is key: mailing lists, IRC, and #cassandra on freenode.
    Contribute back to the community so that you don’t have to maintain your own fork when you find bugs.
  • Training and consulting are available for Cassandra, and they used them. There is no single “one neck to wring” with Cassandra, but you can get paid support and training from DataStax; Constant Contact used them and was happy with it.
  • Lots of monitoring. They put a lot of work into being comprehensive. Munin was used.
  • Choosing a good client for Cassandra: Hector was used. (Don’t use Thrift directly; it is really intended as a driver-level client and does not provide a lot of the things you would want a real application client to do, such as failover and retry.)
  • Using switchable modes: keeping the relational DB as the system of record as you start to move over to Cassandra.
  • Mirroring is another technique that was employed at the application level. All writes go in parallel to Cassandra and to the relational DB; when things fail, the RDBMS is the backup (a rough sketch of this pattern follows the list).
  • Dialable traffic. Being able to turn down the traffic to Cassandra when things go wrong.
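Here is a rough, generic sketch of the switchable/mirrored-write pattern described above. The `rdbms` and `cassandra` objects stand in for whatever data-access layers an application already has; none of the names come from Constant Contact’s actual code.

```python
import logging
import random

log = logging.getLogger("mirrored_store")

class MirroredStore:
    """Dual-write store: the relational DB stays the system of record while
    Cassandra is mirrored in the background and reads are dialed over gradually."""

    def __init__(self, rdbms, cassandra, cassandra_read_pct=0):
        self.rdbms = rdbms              # existing relational data-access object (assumed interface)
        self.cassandra = cassandra      # new Cassandra data-access object (assumed interface)
        self.cassandra_read_pct = cassandra_read_pct  # "dialable traffic", 0-100

    def write(self, key, value):
        self.rdbms.write(key, value)    # the system of record always takes the write
        try:
            self.cassandra.write(key, value)          # mirrored write (best effort)
        except Exception:
            # A Cassandra failure must never fail the user-facing request.
            log.exception("mirrored Cassandra write failed for %r", key)

    def read(self, key):
        if random.uniform(0, 100) < self.cassandra_read_pct:
            try:
                return self.cassandra.read(key)
            except Exception:
                log.exception("Cassandra read failed, falling back to RDBMS")
        return self.rdbms.read(key)     # the RDBMS remains the backup when things go wrong
```

Turning `cassandra_read_pct` down to 0 is the “dialable traffic” knob: all reads fall back to the relational store while the new system is debugged.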

Collaboration was really key in getting this to work. It was a big, complex project that required close collaboration and flexible roles. Mark and Jim were the two primary Ops and Dev folks on the project, and they had to be flexible. For example, they changed the monitoring system traditionally used at Constant Contact when it was recognized that the current system would have failed. This is the type of systemic change that would be difficult to make without an environment of collaboration between Dev and Ops. Now that we have covered the dev side, we can talk a bit about ops.

The Ops Perspective

Now Mark will talk about this project from the ops side. Mark is the manager of system automation. He will talk today about how they use Puppet and in general a software tool chain that allows for improved levels of deployment flexibility.

When Mark starts a project, as a system admin, he tries to find the system specifications that will support the system best. They came up with this machine spec after working with DataStax:

  • 3 × 500 GB disks
  • 1 × 250 GB disk
  • No swap
  • RAID 0 root partition and data storage
  • 32 GB memory

The vendor was not sure they should order that configuration because there is no internal fault tolerance built into that model. Cassandra, however, deals with redundancy at the node level. So the question then became: how many nodes are needed?

  • Cassandra quorum = 3 (meaning each bit of data ultimately needs to live on three machines)
  • Two data centers
  • Each node can only use half the available disk because of RAID
  • ~6 TB needed (the sizing arithmetic is sketched just below)
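One plausible reconstruction of the sizing arithmetic is sketched below. The assumptions that roughly 0.5 TB is usable per node and that three replicas are kept in each data center are inferences, not numbers stated in the talk.

```python
# Hedged back-of-the-envelope sizing, not the actual Constant Contact worksheet.
data_needed_tb  = 6     # ~6 TB of application data
replicas_per_dc = 3     # each piece of data lives on three machines (assumed per data center)
data_centers    = 2
usable_tb_node  = 0.5   # usable space per node after RAID/overhead (assumption)

total_tb = data_needed_tb * replicas_per_dc * data_centers  # 36 TB to store
nodes = total_tb / usable_tb_node                           # 72 nodes
print(int(nodes))
```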

Ultimately that means 72 nodes, which is a fair amount to manage to the level required by the project. Without getting into the details, it would have been impossible for a human to manage by hand, so they wrote a Puppet module that handles much of the management of this cluster. Puppet is not the only part of the whole system, though. Here is the total tool chain:

Fedora anaconda/kickstart -> Func (for upgrades; the Puppet module is exec’d through Func) -> Puppet (for OS and app config) -> Scribe (Facebook’s open source logging framework) -> Nagios (for alerting, managed by Puppet) -> Munin (for trending)

The tool chain above, really centered around Puppet, meant that Dev and Ops were able to talk about things in a common language, and that language was Puppet. They also started using Subversion for their config. Puppet allowed for infrastructure as code.

Operational efficiencies were gained through using Puppet with Cassandra. Remote logging was a requirement; Cassandra uses log4j natively, but resources were not available for remote log4j logging. Ops was able to get Scribe integrated via Puppet easily.

Munin is another tool in the stack, used for JMX trending. It allows critical data points to be identified. With Puppet they could continuously deploy improvements to the trending and analysis tooling across the cluster in a uniform way; they rolled out 7 x 92 graphs across the cluster in 5 minutes with Puppet. This gets reused over and over again as more apps get pushed to the Cassandra cluster. DataStax provides RPMs that Constant Contact uses to deploy this software. Admins in general at Constant Contact must be able to build RPMs; Maven is used to build the RPMs for custom applications.

Traditional Ops vs Today at Constant Contact

  • Infrastructure build: 4 weeks then, 4 hours today (to build the 72-node cluster)
  • Development to deployment: 9 months then, 3 months today (for the whole project, given comparable projects)
  • Cost: millions of dollars then, roughly $150k today

Questions:

Q. What was the role of the DBA in this model?
A. DBAs will be the keepers of the data dictionary. They will also be helping with tuning of the actual cluster.

Q. Have you had an opportunity to do version upgrades on a running cluster?
A. Yes, we worked with QA to do a rolling upgrade twice. It worked nicely, no issues. We did a slow roll, sequentially. Cassandra naturally takes care of this with hints.

Q. Both dev and ops roles are writing Puppet code. How do you stop them from clobbering each other?
A. We are still working on it, but version control helps a lot. Some code was actually pushed into production before it was ready. They expect to be able to treat this Puppet code ultimately like any other code.

March 7, 2011

DevOpsDays LiveBlog – DevOps in Government (DoD)

(Image: military torpedo boat)

The Good, the Bad and the Ugly

By Peter Walsh

I want to share with you some of the processes that we have implemented at the DoD. We don’t have a silver bullet, but we have improved things.

The Ugly

The DoD spends more than $32 billion per year on IT systems. In 2008 our software creation and deployment process went from 38 documents and gates to well over 40. Yay, moving backwards! In the DoD, releasing software requires multiple levels of tests and approval, all within distinct organizations that don’t communicate much.

The Bad

Normally you have a program office that pushes problems to solve/projects out to dev contractors, test contractors, test agencies, and ops teams. Each of these groups is quite separate. They all create their own setups, which are distinct and which basically guarantee inconsistency from one group to another. They manage things with Excel and Project. Costs are measured in piles of money, time is always calendar time, and everything is slow.

Another large challenge for DoD systems is accreditation. Before anything can be used within the software creation process or delivered, it must go through this process. Everything! Hardware, OS, middleware, applications. The process is implemented by a separate agency, and each of the aforementioned teams goes through it separately. It limits flexibility and restricts change. This is why the government lives on old technology.

Good (all is not lost)

There are 26 projects on a White House watchlist for improvement and experimentation, some of them DoD projects. There is some recognition that we need to be more agile, and perhaps Agile. There are a few new programs out there trying to enable everyone in the DoD along these lines. Some of the more prominent ones are listed here:

  • Forge.mil is for government open source projects. They look to create community tools and foster a bit of collaboration. The vice chairman of the Joint Chiefs of Staff supports this project.
  • RACE is a government program for cloud resources and accreditation assistance.
  • Testforge.mil is another project that looks toward improving system provisioning, automated workflows, and test as a service.

Another project within the DoD that has done a lot to support shorter lifecycles and higher quality delivery is called CONS3RT. We will discuss the approach it takes here.

CONS3RT is a library of all the assets that you have, from dev to operational deployment. Assets can be things like servers, databases, monitoring tools, etc. Assets are pulled together into what is called a scenario, which can be deployed into a cloud space in a repeatable fashion. This allows the different agencies and groups responsible for delivering software, as described earlier, to work on the same scenarios, which eliminates some of the inconsistency between these groups. It also addresses the accreditation requirements: once something has been accredited, it can be reused in this way between groups without being re-accredited. We realized that accreditation could not just be thrown away in building CONS3RT, so it is quite well integrated into what we do. Peter and his group are not trying to tackle everything, but instead to start small and work within the boundaries of what can be changed.
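To make the asset/scenario idea concrete, here is a purely illustrative data-model sketch. It is not CONS3RT’s actual API or schema, and all of the names are invented.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Asset:
    """Anything in the library: a server image, a database, a monitoring tool, etc."""
    name: str
    kind: str
    accredited: bool = False   # has this asset already been through accreditation?

@dataclass
class Scenario:
    """A repeatable bundle of assets that can be deployed into a cloud space."""
    name: str
    assets: List[Asset] = field(default_factory=list)

    def deployable(self) -> bool:
        # Reuse only works if every asset pulled in is already accredited,
        # so accreditation does not have to be repeated by each group.
        return all(a.accredited for a in self.assets)

integration_env = Scenario("integration-test", [
    Asset("rhel-base-image", "server", accredited=True),
    Asset("app-database", "database", accredited=True),
    Asset("monitoring-stack", "monitoring tool", accredited=True),
])
print(integration_env.deployable())   # True: dev, test and ops groups can all reuse it
```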

This focus on accreditation is really key, because in the past developers did not want to go through the accreditation process due to the restrictions. They developed on their own hardware and then kicked things over the wall to move through the rest of the lifecycle. This guaranteed nothing worked for the first dozen times it was kicked over. Now this is less of a problem. Many other improvements have come out of the use of CONS3RT and this approach in general.

With this system, reuse also improved. Instead of each organization learning from scratch how to deploy x or y and spending piles of money doing it, components already placed into CONS3RT can be used. This is something that virtualization and cloud in general can bring to any company.

Time is measured in minutes nowadays instead of days and weeks. Things can be deployed with a couple of mouse clicks instead of a new hardware installation. More resources are also available to people now, both in terms of the variety of software they can access and the CPU power they can harness through increased utilization. Talent can be focused because people are not wasting time doing the job of provisioning, accreditation, and other coordination activities.

How does software flow through the lifecycle now?

Stage – Developer: new code -> local IDE build -> Jenkins & Maven -> if the build is good, deploy to integration; if not, back to writing code
Stage – Integration: deployed to integration -> debug and test -> if tests pass, send to CI; if not, delete the artifact and go back to dev
Stage – CI: deploy to CI -> smoke/regression -> if they pass and there is enough new -> tag -> send to the QA env
Stage – QA: deploy to QA -> smoke/regression/manual/exploratory -> add new tests -> tests pass -> new tests push to past env -> tag new bits -> push to production

Most organizations, in the government and in industry, handle writing new code, building the code, and pushing it into production. The stuff in the middle, not so much; many orgs do not have much of a handle on how they are doing those functions. Our pipeline as described above, in conjunction with CONS3RT, Maven, Jenkins, and other tools, is really helping us get a handle on it.
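Here is a rough sketch of the promotion gates described above, with each stage modelled as a simple boolean check. The flags and stage names are illustrative only and are not CONS3RT, Jenkins, or Maven specifics.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Build:
    # Each flag stands in for the real build/test result at that gate.
    unit_build_ok: bool = True          # local IDE build plus Jenkins & Maven
    integration_tests_ok: bool = True   # debug and test in the integration env
    smoke_regression_ok: bool = True    # CI smoke/regression suite
    enough_new_changes: bool = True     # the "if enough new" gate before tagging
    qa_suite_ok: bool = True            # smoke/regression/manual/exploratory in QA
    tags: List[str] = field(default_factory=list)

def promote(build: Build) -> str:
    """Walk a build through Developer -> Integration -> CI -> QA -> production."""
    if not build.unit_build_ok:
        return "back to development"
    if not build.integration_tests_ok:
        return "artifact deleted, back to development"
    if not (build.smoke_regression_ok and build.enough_new_changes):
        return "held in CI"
    build.tags.append("ci-candidate")
    if not build.qa_suite_ok:
        return "back to development"
    build.tags.append("release")
    return "pushed to production"

print(promote(Build()))   # -> "pushed to production"
```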

Lessons Learned

  • Automation is huge.
  • Testing early and often. It is easier to fix the engine when it is on the assembly line than when it is on the dealer’s lot. Simple concept. Virtualization and CI allow you to test often with minimal barriers to doing so.
  • Leanness. This goes back in part to the process stuff; it is a matter of keeping the process simple.
  • Flexibility, team trust and commitment. We fall off the wagon sometimes. We need to be able to call each other out on things so we can keep improving, and we need to trust that we will stay committed and on track when things fail, which they will.
  • Persistence. Stay the course when people push back on you: management, peers, and others.
  • Trust, with customers and sponsors.
  • Top cover, someone above you who can allow you to be persistent and help cover challenges from further up.