August 19, 2013

The Crowdsourcing of Code: What IT Can Learn from Developers

This is a guest post by Yoav Landman (@yoavlandman), Founder & Chief Technology Officer at JFrog

The agile movement is influencing the adoption of new methods of collaboration, from developer to consumer, throughout the development process. The sharing of resources across companies, communities or even countries is known as crowdsourcing, and it is becoming increasingly common. After all, when it comes to coding most of us speak much the same language. The idea of collaboration isn't exclusive to developers and can also benefit the IT field. In IT, community knowledge is becoming a huge asset.

Developers and IT professionals often turn to networks outside of their own for information about the artifacts they use. Through the sharing of successes, failures, feedback and updates, the building blocks that make up software are (virtually) crowdsourced. With crowdsourcing becoming the new norm, there's no shortage of best practices to take away from the developer community.

With that, let’s explore the practices that’ll make for a crowdsourcing strategy that’s beneficial, efficient and safe for software developers and IT alike:

License Control

When you use a communal tool, such as open source, you must protect your project through licensing. Nothing puts a damper on a project more than a licensing issue—monetary fees, wasted productivity hours and vendor lock-in can become a huge liability. More than ever, responsibility for larger business initiatives is falling into IT's lap, and license control is a large part of mastering that role.

Bring It In-House

Ensure that your original project is stored in-house. The main reason: it keeps you in control and lets you reliably manage who within your organization can download it. It also means you are not at the mercy of the availability of external software repositories. Be sure to equip your project with an internal backup – and keep it up to date.

Access Control and Internal Audit

While sharing is encouraged, be sure to filter who and what is accessing and updating your organization's resources. Who and what is allowed on the network? Ensure there are policies and procedures in place. Without proper management, you have no record of where code or software is coming from or going to, which can jeopardize both quality and security.

Free Up Resources: Share Centrally & Adopt Tools that Enable Management

An internal, centralized resource for developers to share and pull libraries is a best practice. Not all methods are created equal, though, and careful tool selection can increase productivity and free up your team's resources.

For example, using a version control system to store libraries can actually slow down the development process—these systems lack searchability, proxy facilities and a certain level of permission management. They manage source code (i.e. instructions, text), not binary files (i.e. builds, executable form), and they drain storage space and network resources (when using a distributed version control system). Pick the right tool for the job.
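
To make the "right tool" point concrete, here is a minimal sketch of publishing a build output to an internal artifact repository over plain HTTP instead of committing the binary to version control. The repository URL, credentials and path layout are hypothetical; any real binary repository manager has its own conventions and clients.

```python
# Minimal sketch: publish a build artifact to an internal repository over HTTP
# instead of committing the binary to version control.
# The repository URL, credentials, and path layout below are hypothetical.
import base64
import urllib.request

REPO_URL = "https://repo.example.internal/libs-release-local"  # hypothetical
USER, PASSWORD = "ci-bot", "secret"                             # hypothetical

def publish(local_path: str, group: str, name: str, version: str) -> int:
    """Upload one binary to <repo>/<group>/<name>/<version>/<file>."""
    target = f"{REPO_URL}/{group}/{name}/{version}/{name}-{version}.jar"
    with open(local_path, "rb") as fh:
        data = fh.read()
    request = urllib.request.Request(target, data=data, method="PUT")
    token = base64.b64encode(f"{USER}:{PASSWORD}".encode()).decode()
    request.add_header("Authorization", f"Basic {token}")
    with urllib.request.urlopen(request) as response:
        return response.status  # most repositories answer 201 Created

if __name__ == "__main__":
    print(publish("build/libs/app-1.0.0.jar", "com/example", "app", "1.0.0"))
```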

Automated Clean-Up  

Combining socialization with automation will increase productivity. As creatures of habit, many IT pros rely on manual intervention in processes that could be made automatic. One example from the software development side is clean-up. Let's say you're using a continuous integration server. Binaries are constantly being built; it may build 50 versions of a library in one hour, but your team only qualifies one version. Adopting tools that eliminate the parts of the cycle that don't require manual intervention frees your team from that repetitive work.
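
As a rough illustration of that clean-up example, here's a small sketch that prunes continuous-integration build outputs, keeping every build the team has qualified plus the few most recent ones. The directory layout and the QUALIFIED marker file are invented for the example; the point is only that this kind of housekeeping needs no manual intervention.

```python
# Minimal clean-up sketch: prune continuous-integration build outputs,
# keeping every "qualified" build plus the N most recent ones.
# The directory layout and the qualified-marker convention are hypothetical.
import shutil
from pathlib import Path

BUILD_ROOT = Path("/var/ci/builds/my-library")  # hypothetical location
KEEP_RECENT = 5

def prune_builds(root: Path, keep_recent: int = KEEP_RECENT) -> None:
    # Newest first, based on the modification time of each build directory.
    builds = sorted(root.iterdir(), key=lambda p: p.stat().st_mtime, reverse=True)
    for index, build in enumerate(builds):
        if not build.is_dir():
            continue
        qualified = (build / "QUALIFIED").exists()  # marker written by the team
        if index < keep_recent or qualified:
            continue  # keep recent and qualified builds
        shutil.rmtree(build)  # drop everything else automatically

if __name__ == "__main__":
    prune_builds(BUILD_ROOT)
```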

Crowdsourcing has swept professional networks, but most industries are still limited to internal interaction among co-workers.  Software developers and IT professionals are unique in that we converse across companies and even industries on a global scale. With the right tools, we can share consumer feedback and understand risks, successes, and code that form the building blocks for great systems. While it is not yet a standard practice for fields such as marketing or law, crowdsourcing is beneficial for so many fields in IT here and now.

July 18, 2013

Free Tickets for PuppetConf 2013 – Expired

Update: The winners of the free tickets were:
Deepak Jagannath and Tim Hunter. Congrats from DevOps.com!

——–

DevOps.com is giving away 2 free tickets to PuppetConf 2013. PuppetConf 2013 (happening August 22 – 23rd) is set to host 2,000 attendees this year and will include speakers from VMware & Red Hat. It will also take place at the Fairmont Hotel, located in the heart of downtown SF, where a ton of other social events around the conference are set to take place.

To win a ticket, just respond to this post and email us at posting@devops.com with a story about how DevOps has made your life easier or your company more productive. We will respond on Monday the 29th of July with the 2 winning users' handles and email each of you a code to get tickets.

June 4, 2013

Fresh Stats Comparing Traditional IT and DevOps Oriented Productivity

This is a guest post by Krishnan Badrinarayanan (@bkrishz), ZeroTurnaround

The word “DevOps” has been thrown around quite a lot lately. Job boards are awash with requisitions for “DevOps Engineers” with varying descriptions. What is DevOps, really?

In order to better understand what the fuss is all about, we surveyed 620 engineers to examine what they do to keep everything running like clockwork – from day-to-day activities and key processes to the tools and challenges they face. The survey asked for feedback on how much time is spent improving infrastructure and setting up automation for repetitive tasks; how much time is typically spent fighting fires and communicating; and what it takes to keep the lights on. We then compared responses belonging to those from traditional IT and DevOps teams. Here are the results, in time spent each week carrying out key activities:

[Chart: hours spent each week on key activities – Traditional IT vs. DevOps-oriented teams]

Conclusions we can draw from the results

DevOps oriented teams spend slightly more time automating tasks

Writing scripts and automating processes have been part of the Ops playbook for decades now. Shell scripts, Python and Perl are often used to automate repetitive configuration tasks, but with newer tools like Chef and Puppet, Ops folk perform more sophisticated kinds of automation, such as spinning up virtual machines and tailoring them to the app's needs using Chef or Puppet recipes.
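
To picture the kind of repetitive configuration task being automated here, consider a hand-rolled Python sketch that makes sure a setting exists in a config file and does nothing when it already does. It is deliberately tiny and tool-agnostic; Chef and Puppet express the same idempotent intent through their own resources and recipes. The file path and setting are hypothetical.

```python
# A tiny, hand-rolled example of the kind of repetitive configuration task
# that scripts (or Chef/Puppet resources) automate: make sure a setting is
# present in a config file, and do nothing if it already is (idempotence).
# The file path and setting are hypothetical.
from pathlib import Path

def ensure_line(path: Path, line: str) -> bool:
    """Append `line` to `path` unless it is already there. Returns True if changed."""
    existing = path.read_text().splitlines() if path.exists() else []
    if line in existing:
        return False          # already configured, nothing to do
    existing.append(line)
    path.write_text("\n".join(existing) + "\n")
    return True               # file was modified

if __name__ == "__main__":
    changed = ensure_line(Path("/tmp/app.conf"), "max_connections = 200")
    print("changed" if changed else "already in desired state")
```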

Both Traditional IT and DevOps oriented teams communicate actively

Respondents belonging to a DevOps oriented team spend 2 fewer hours communicating each week, possibly because DevOps fosters better collaboration and keeps Dev and Ops teams in sync with each other. However, Dev and Ops folk in Traditional IT teams spend over 7 hours each week communicating. This active dialogue helps them better understand challenges, set expectations and triage issues. How much of this communication can be deemed inefficient is subjective, but it is necessary to get both teams on board. Today, shared tooling, instant messaging, task managers and social tools also help bring everyone closer together in real time.

DevOps oriented teams fight fires less frequently

A key tenet of the DevOps methodology is to embrace the possibility of failure and be prepared for it. With alerts, continuous testing, monitoring and feedback loops that expose vulnerabilities and key metrics, teams can act quickly and proactively. Programmable infrastructure and automated deployments provide quick recovery while minimizing user impact.

DevOps oriented teams spend less time on administrative support

This could be a result of better communication, higher level of automation and the availability of self-service tools and scripts for most support tasks. If there’s a high level of provisioning and automation, there’s no reason why admin support shouldn’t dwindle down to a very small time drain. It could also mean that members of DevOps oriented teams help themselves more often than expecting to be supported by the system administrator.

DevOps oriented teams work fewer days after-hours

We asked our survey takers how many days per week they work outside of normal business hours. Here’s what we learned:

Days worked after hours     Traditional IT     DevOps Oriented
Average                     2.3                1.5
Standard Deviation          1.7                1.7

According to these results, DevOps team members lead a more balanced life: they spend more time on automation and infrastructure improvement, spend less time fighting fires, and work fewer hours (especially outside of normal business hours).

DevOps-related initiatives came out on top in 2012 and 2013, according to our survey. There's a strong need for agility to respond to ever-changing and expanding market needs. Software teams are under pressure to help meet them, and the chart above validates the benefits of a DevOps approach.

Rosy Stats, but hard to adopt

How we got here

IT Organizational structures – typically Dev, QA, and Ops – have come to exist for a reason. The dev team focuses on innovating and creating apps. The QA team ensures that the app behaves as intended. The operations team keeps the infrastructure running – from the apps, network, servers, shared resources to third party services. Each team requires a special set of skills in order to deliver a superior experience in a timely manner.

The challenge

Today’s users increasingly rely on software and expect it to meet their constantly evolving needs 24/7, whether they’re at their desks or on their mobile devices. As a result, IT teams need to respond to change and release app updates quickly and efficiently without compromising on quality. Fail to do so, and they risk driving users to competitors or other alternatives.

However, releasing apps quickly comes with its own drawbacks. It strains functionally siloed teams and often results in software defects, delays and stress. Infrequent communication across teams further exacerbates the issue, leading to a snowball effect of finger-pointing and bad vibes.

Spurring cultural change

Both Dev and Ops teams bring a unique set of skills and experience to software development and delivery. DevOps is simply a culture that brings development and operations teams together so that through understanding each others’ perspectives and concerns, they can build and deliver resilient software products that are production ready, in a timely manner. DevOps is not NoOps. Nor is it akin to putting a Dev in Ops clothing. DevOps is synergistic, rather than cannibalistic.

DevOps is a journey

Instilling a DevOps oriented culture within your organization is not something that you embark on and chalk off as success at the end. Adopting DevOps takes discipline and initiative to bring development and operations teams together. Read up on how other organizations approach adopting DevOps as a culture and learn from their successes and failures. Put to practice what makes sense within your group. Develop a maturity model that can guide you through your journey.

The goal is to make sure that dev and ops are on the same page, working together on everything, toward a common goal: continuous delivery of working software without handoffs, hand-washing, or finger-pointing.

Support the community and the cause

Dev and Ops need to look introspectively to understand their strengths and challenges, and look for ways to contribute towards breaking down silos. Together, they should seek to educate each other, culturally evolve roles, relationships, incentives, and processes and put end user experience first.

The DevOps community is small but burgeoning, and it’s easy to find ways to get involved, like with the community-driven explosion of DevOpsDays conferences that occur around the world.

Set small goals to be awesome

Teams should collaborate to set achievable goals and milestones that can get them on the path to embracing a DevOps culture. Celebrate small successes and focus on continuous improvement. Before you know it, you will surely but gradually reap the benefits of bringing in a DevOps approach to application development and delivery.

Start here

For deeper insights into IT Ops and DevOps Productivity with a focus on people, methodologies and tools, download a 35-page report filled with stats and charts.

April 29, 2013

The State of DevOps: Accelerating Adoption

By James Turnbull (@kartar), VP of Technology Operations, Puppet Labs Inc.

A sysadmin’s time is too valuable to waste resolving conflicts between operations and development teams, working through problems that stronger collaboration would solve, or performing routine tasks that can – and should – be automated. Working more collaboratively and freed from repetitive tasks, IT can – and will – play a strategic role in any business.

At Puppet Labs, we believe DevOps is the right approach for solving some of the cultural and operational challenges many IT organizations face. But without empirical data, a lot of the evidence for DevOps success has been anecdotal.

To find out whether DevOps-attuned organizations really do get better results, Puppet Labs partnered with IT Revolution Press to survey a broad spectrum of IT operations people, software developers and QA engineers.

The data gathered in the 2013 State of DevOps Report proves that DevOps concepts can make companies of any size more agile and more effective. We also found that the longer a team has embraced DevOps, the better the results. That success – along with growing awareness of DevOps – is driving faster adoption of DevOps concepts.

DevOps is everywhere

Our survey tapped just over 4,000 people living in approximately 90 countries. They work for a wide variety of organizations: startups, small to medium-sized companies, and huge corporations.

Most of our survey respondents – about 80 percent – are hands-on: sysadmins, developers or engineers. Break this down further, and we see more than 70 percent of these hands-on folks are actually in IT Ops, with the remainder in development and engineering.

DevOps orgs ship faster, with fewer failures

DevOps ideas enable IT and engineering to move much faster than teams working in more traditional ways. Survey results showed:

  • More frequent and faster deployments. High-performing organizations deploy code 30 times faster than their peers. Rather than deploying every week or month, these organizations deploy multiple times per day. Change lead time is much shorter, too. Rather than requiring lead time of weeks or months, teams that embrace DevOps can go from change order to deploy in just a few minutes – lead times up to 8,000 times shorter.
  • Far fewer outages. Change failure drops by 50 percent, and service is restored 12 times faster.

Organizations that have been working with DevOps the longest report the most frequent deployments, with the highest success rates. To cite just a few high-profile examples, Google, Amazon, Twitter and Etsy are all known for deploying frequently, without disrupting service to their customers.

Version control + automated code deployment = higher productivity, lower costs & quicker wins

Survey respondents who reported the highest levels of performance rely on version control and automation:

  • 89 percent use version control systems for infrastructure management
  • 82 percent automate their code deployments

Version control allows you to quickly pinpoint the cause of failures and resolve issues fast. Automating your code deployment eliminates configuration drift as you change environments. You save time and reduce errors by replacing manual workflows with a consistent and repeatable process. Management can rely on that consistency, and you free your technical teams to work on the innovations that give your company its competitive edge.
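
One small, hypothetical illustration of the drift point: with infrastructure files kept under version control, a script can compare checksums of what is actually deployed against the checked-in copies and flag anything that has wandered. The paths below are placeholders.

```python
# Hypothetical sketch: with infrastructure files under version control, drift
# on a server can be detected by comparing checksums of the deployed copies
# against the checked-in copies. Paths are illustrative only.
import hashlib
from pathlib import Path

def digest(path: Path) -> str:
    return hashlib.sha256(path.read_bytes()).hexdigest()

def find_drift(repo_dir: Path, deployed_dir: Path) -> list[str]:
    """Return relative paths whose deployed content differs from the repo copy."""
    drifted = []
    for tracked in repo_dir.rglob("*"):
        if not tracked.is_file():
            continue
        relative = tracked.relative_to(repo_dir)
        deployed = deployed_dir / relative
        if not deployed.exists() or digest(deployed) != digest(tracked):
            drifted.append(str(relative))
    return drifted

if __name__ == "__main__":
    print(find_drift(Path("config-repo"), Path("/etc/myapp")))
```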

What are DevOps skills?

More recruiters are including the term DevOps in job descriptions. We found a 75 percent uptick in the 12-month period from January 2012 to January 2013. Mentions of DevOps as a job skill increased 50 percent during the same period.

In order of importance, here are the skills associated with DevOps:

  • Coding & scripting. Demonstrates the increasing importance of traditional developer skills to IT operations.
  • People skills. Acknowledges the importance of communication and collaboration in DevOps environments.
  • Process re-engineering skills. Reflects the holistic view of IT and development as a single system, rather than as two different functions.

Interestingly, experience with specific tools was the lowest priority when seeking people for DevOps teams. This makes sense to us: It’s easier for people to learn new tools than to acquire the other skills.

It makes sense on a business level, too. After all, the tools a business needs will change as technology, markets and the business itself shift and evolve. What doesn’t change, however, is the need for agility, collaboration and creativity in the face of new business challenges.

—-
About the author:

A former IT executive in the banking industry and author of five technology books, James has been involved in IT Operations for 20 years and is an advocate of open source technology. He joined Puppet Labs in March 2010.

April 2, 2013

Data Driven Observations on AWS Usage from CloudCheckr's User Survey

This is a guest post by Aaron Klein from CloudCheckr

We were heartened when AWS made Trusted Advisor free for the month of March. This was an implicit acknowledgement of what many have long known: AWS is complex, and it can be challenging for users to provision and control their AWS infrastructure effectively.

We took the AWS announcement as an opportunity to conduct an internal survey of our customers’ usage. We compared the initial assessments of 400 of our users’ accounts against our 125+ best practice checks for proper configurations and policies. Our best practice checks span 3 key categories: Cost, Availability, and Security.  We limited our survey to users with 10 or more running EC2 instances.  In aggregate, the users were running more than 16,000 EC2 instances.

We were surprised to discover that nearly every customer (99%) experienced at least one serious exception.  Beyond this top level takeaway, our primary conclusion was that controlling cost may grab the headlines, but users also need to button up a number of availability and security issues.

When considering availability, there were serious configuration issues that were common across a high percentage of users. Users repeatedly failed to optimally configure Auto Scaling and ELB. The failure to create sufficient EBS snapshots was an almost universal issue.
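
A do-it-yourself flavor of that EBS snapshot check might look like the sketch below, which flags volumes with no recent snapshot. It assumes boto3 is installed and AWS credentials are configured; the 7-day threshold is an arbitrary illustration, not CloudCheckr's actual rule.

```python
# Rough sketch of one availability check: flag EBS volumes that have no
# recent snapshot. Assumes boto3 is installed and AWS credentials are
# configured; the 7-day threshold is an arbitrary illustration.
from datetime import datetime, timedelta, timezone

import boto3

def volumes_without_recent_snapshot(region: str, max_age_days: int = 7) -> list[str]:
    ec2 = boto3.client("ec2", region_name=region)
    cutoff = datetime.now(timezone.utc) - timedelta(days=max_age_days)
    stale = []
    for page in ec2.get_paginator("describe_volumes").paginate():
        for volume in page["Volumes"]:
            snaps = ec2.describe_snapshots(
                OwnerIds=["self"],
                Filters=[{"Name": "volume-id", "Values": [volume["VolumeId"]]}],
            )["Snapshots"]
            # No snapshot at all, or the newest one is older than the cutoff.
            if not snaps or max(s["StartTime"] for s in snaps) < cutoff:
                stale.append(volume["VolumeId"])
    return stale

if __name__ == "__main__":
    print(volumes_without_recent_snapshot("us-east-1"))
```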

Although users passed more of our security checks, the exceptions that did arise were serious. Many of the most common security issues were found in configurations for S3, where nearly 1 in 5 users allowed unfettered access to their buckets through "Upload/Delete" or "Edit Permissions" set to Everyone. As we explained in an earlier whitepaper, anyone using a simple bucket finder tool could locate and access these buckets.
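
In the same spirit, here is a simplified check for the S3 finding: list buckets whose ACL grants any permission to the "Everyone" (AllUsers) group. Again this assumes boto3 and read-only credentials, and it is a sketch of the idea rather than a replacement for a proper audit.

```python
# Simplified check: list buckets whose ACL grants any permission to the
# AllUsers ("Everyone") group. Assumes boto3 is installed and read-only
# credentials are configured.
import boto3

ALL_USERS = "http://acs.amazonaws.com/groups/global/AllUsers"

def publicly_granted_buckets() -> dict[str, list[str]]:
    s3 = boto3.client("s3")
    findings: dict[str, list[str]] = {}
    for bucket in s3.list_buckets()["Buckets"]:
        acl = s3.get_bucket_acl(Bucket=bucket["Name"])
        perms = [
            grant["Permission"]
            for grant in acl["Grants"]
            if grant["Grantee"].get("URI") == ALL_USERS
        ]
        if perms:  # e.g. WRITE is what the console shows as "Upload/Delete"
            findings[bucket["Name"]] = perms
    return findings

if __name__ == "__main__":
    for name, perms in publicly_granted_buckets().items():
        print(name, perms)
```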

Beyond the numbers, we also interviewed customers to gather qualitative feedback from users on some of the more interesting data points.

If the findings of this survey spark questions about how well your AWS account is configured, CloudCheckr offers a free account that you can set up in minutes. Simply enter read-only credentials from your AWS account and CloudCheckr will assess your configurations and policies in just a few minutes: https://app.cloudcheckr.com/LogOn/Registration

Conclusions by Area

Conclusions based upon Cost Exceptions:

As noted, our sample comprised 16,047 instances. The sample group spent a total of $2,254,987 per month on EC2 (and its associated costs), for an average monthly cost per customer of $7,516. Of course, we noted the mismatch between quantity and cost – spot instances represent 8% of the quantity but only 1.4% of the cost. This is because spot instances are significantly less expensive than on-demand instances.

When we looked at the Cost Exceptions, we found that 96% of all users experienced at least 1 exception (with many experiencing multiple exceptions). In total, we found that users who adopted our recommended instance sizing and purchasing types could save an average of $3,974 per month, for an aggregate total of $1,192,212 per month.

This suggests that price optimization remains a large hurdle for AWS users who rely on native AWS tools. Users consistently fail to optimize purchasing and also fail to optimize utilization. Combined, these issues mean that the average customer pays nearly twice as much as necessary for the resources needed to achieve proper performance for their technology.

To further examine this behavior, we interviewed a number of customers.  We interviewed customers who exclusively purchased on-demand and customers who used multiple purchasing types.

Here were their answers (summarized and consolidated):

  • Spot instances worry users – there is a general concern of: “what if the price spikes and my instance is terminated?” This fear exists despite the fact that spikes occur very rarely, warnings are available, and proper configuration can significantly mitigate this “surprise termination” risk.
  • It is difficult and time-consuming to map the cost scenarios for purchasing reserved instances. The customers who did make this transition had cobbled together home-grown spreadsheets as a way of supporting the business decision (a stripped-down version of that arithmetic is sketched just after this list). The ones who didn't make this effort made a gut estimate that it wasn't worth the time: AWS was cost-effective enough, and the time and effort for modeling the transition was an opportunity cost taken away from building and managing their technology.
  • The intricacies of matching the configurations between on demand instances and reserved instances while taking into consideration auto scaling and other necessary configurations were daunting. Many felt it was not worth the effort.
  • Amazon's own process for regularly lowering prices is a deterrent to purchasing RIs. This is especially true for RIs with a 3-year commitment. In fact, among the customers who did purchase RIs, none expressed a desire to purchase RIs with a 3-year commitment. All supported their decision by referencing the regular AWS price drops combined with the fact that they could not accurately predict their business requirements 3 years out.
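
The spreadsheet exercise mentioned above reduces, at its simplest, to a break-even calculation: given an upfront fee and a discounted hourly rate, how many hours per month must an instance run before the reserved instance wins? All prices in the sketch are placeholders, not actual AWS rates.

```python
# Core arithmetic behind the reserved-instance spreadsheets described above:
# how many hours per month does an instance need to run before a reserved
# instance beats on-demand? All prices below are placeholders.

def breakeven_hours_per_month(on_demand_hourly: float,
                              reserved_hourly: float,
                              upfront_fee: float,
                              term_months: int) -> float:
    """Monthly hours above which the reserved instance is cheaper."""
    monthly_upfront = upfront_fee / term_months
    hourly_savings = on_demand_hourly - reserved_hourly
    return monthly_upfront / hourly_savings

if __name__ == "__main__":
    # Hypothetical 1-year reservation: $0.10/hr on-demand vs $0.04/hr reserved
    # with a $300 upfront fee.
    hours = breakeven_hours_per_month(0.10, 0.04, 300.0, 12)
    print(f"Reserved wins above ~{hours:.0f} hours/month "
          f"(a full month is ~730 hours)")
```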

Conclusions based upon Availability Exceptions:

We compared our users against our Availability best practices and found that nearly 98% suffered from at least 1 exception. We hypothesized that this was due to the overall complexity of AWS and interviewed some of our users for confirmation. Here is what we found from those interviews:

  • Users were generally surprised by the exceptions. They believed that they "had done everything right" but then realized that they had underestimated the complexity of AWS.
  • Users were often unsure of exactly why something needed to be remedied. The underlying architecture of AWS continues to evolve and users have a difficult time keeping up to speed with new services and enhancements.
  • AWS dynamism played a large role in the number of exceptions. Users commented that they often fixed exceptions and, after a week of usage, found new exceptions had arisen.
  • Users remained very happy with the overall level of service from AWS. Despite the exceptions which could diminish overall availability, the users still found that AWS offered tremendous functionality advantages.

Conclusions based upon Security Exceptions:

Finally, we looked at security. Here we found that 44% of our users had at least one serious exception present during the initial scan. The most serious and common exceptions occurred within S3 usage and bucket permissioning. Given the differences between cloud and data center architecture, this was not entirely surprising. We interviewed our users about this area and here is what we found:

  • The AWS management console offered little functionality for helping with S3 security. It does not provide a user-friendly means of monitoring and controlling S3 inventory and usage. In fact, we found that most of our users were surprised when the inventory was reported. They often had 300-500% more buckets, objects and storage than they expected.
  • Price = importance: S3 is often an afterthought for users. Because it is so inexpensive, users do not audit it as closely as EC2 and other more expensive services, and rarely create and implement formal policies for S3 usage. The time and effort required to log into each region one by one to collect S3 information and download data through the Management Console was not considered worth it relative to the spend.
  • Given the low cost and lack of formal policies, team members throw up high volumes of objects and buckets knowing that they can store huge amounts of data at a minimal cost.  Since users did not audit what they had stored, they could not determine the level of security.

Author Info: Aaron is the Co-Founder/COO of CloudCheckr Inc. (CCI). With over 20 years of managerial experience and vision, he directs the company's operations.

Aaron has held key leadership roles at diverse organizations ranging from small entrepreneurial start-ups to multi-billion dollar enterprises. Aaron graduated from Brandeis University and holds a J.D. from SUNY Buffalo.

Underlying Data Summary

Cost: Any exception 96%

The total of 16,047 instances was broken down into the following categories:

  • On Demand: 78% (12,517 instances)
  • Reserved: 14% (2,247 instances)
  • Spot: 8% (1,284 instances)

Instance spend by purchasing type was broken down as follows:

  • On Demand: 89.7% ($2,023,623)
  • Reserved: 8.9% ($199,803)
  • Spot: 1.4% ($31,561)

Common Cost Exceptions we found:

  • Idle EC2 Instances: 36%
  • Underutilized EC2 Instances: 84%
  • EC2 Reserved Instance Possible Matching Mistake: 17%
  • Unused Elastic IP: 59%

Availability: Any exception 98%

Here, broken out by service, are some highlights of common and serious exceptions that we found:

Service Type: Customers with Exceptions

EC2: Any exception 95%

  • EBS Volumes That Need Snapshots: 91%
  • Over-Utilized EC2 Instances: 22%

Auto Scaling: Any exception 66%

  • Auto Scaling Groups Not Being Utilized For All EC2 Instances: 57%
  • All Auto Scaling Groups Not Utilizing Multiple Availability Zones: 34%
  • Auto Scaling Launch Configuration Referencing Invalid Security Group: 22%
  • Auto Scaling Launch Configuration Referencing Invalid AMI: 18%
  • Auto Scaling Launch Configuration Referencing Invalid Key Pair: 16%

ELB: Any exception 42%

  • Elastic Load Balancers Not Utilizing Multiple Availability Zones: 37%
  • Elastic Load Balancers With Fewer Than Two Healthy Instances: 21%

Security: Any exception 46%

These were the most common exceptions that we found:

  • EC2 Security Groups Allowing Access To Broad IP Ranges: 36%
  • S3 Bucket(s) With 'Upload/Delete' Permission Set To Everyone: 16%
  • S3 Bucket(s) With 'View Permissions' Permission Set To Everyone: 24%
  • S3 Bucket(s) With 'Edit Permissions' Permission Set To Everyone: 14%

February 14, 2013

DevOps – A Valentine’s Day Fairy Tale

This is a guest post by Matt Watson from Stackify

Once upon a time two people from different sides of the tracks met and fell in love. Never before had the two people found another person who so perfectly complemented them. Society tried to keep them apart – “It’s just not how things are done,” they’d say. But times were changing, and this sort of pairing was becoming more socially acceptable.

They met at the perfect time.

Ops had grown tired of the day-to-day grind of solving other people's problems. Enough was enough and she needed a change in her life. A perfectionist and taskmaster to the highest degree, she tended to be very controlling and possessive in relationships. It became more about commands than conversation, making life miserable for both parties. She began to realize she hated change, and felt like she spent most of her time saying "No." It was time to open up and begin to share to make a relationship work.

Dev, on the other hand, was beginning to mature (a little late in the game, as guys seem to) and trying to find some direction. He had grown tired of communication breakdowns in relationships – angry phone calls in the middle of the night, playing the blame game, and his inability to meet halfway on anything. He began to realize most of those angry phone calls came as a result of making impulsive decisions without considering how they would impact others. His bad decisions commonly led to performance problems and created a mess for his partners. Dev wanted to more actively seek out everything that makes a healthy relationship work.

The timing was right for a match made in heaven: Dev and Ops openly working and living side by side to make sure both contributed equally to making their relationship work. Ops realized she didn't have to be so controlling if she and Dev could build trust with one another. Dev realized that he caused fewer fights if he involved Ops in decisions about the future, since those decisions impacted both of them. It was a growing process that caused a lot of rapid and sudden change, although, like most relationships, they knew it was important not to move too fast, no matter how good it felt.

Dev and Ops dated for about four years before they decided to get married. Now they will be living together and sharing so much more; will their relationship last? How will it need to change to support the additional closeness? But they aren’t worried, they know it is true love and will do whatever it takes to make it work. Relationships are always hard, and they know they can solve most of their problems with a reboot, hotfix, or patch cable.

Will you accept their forbidden love?

7 Reasons the DevOps Relationship is Built to Last

  1. Enables faster development and deployment cycles (but don't move too fast!)
  2. Provides stronger and more flexible automation with repeatable deployment tasks
  3. Lowers the risk and stress of a product deployment by making development more iterative, so small changes are made all the time instead of large changes every so often
  4. Improves interaction and communication between the two parties to keep both sides in the loop and active
  5. Aids in standardizing all development environments
  6. Dramatically simplifies application support because everyone has a better view of the big picture
  7. Improves application testing and troubleshooting

About the author: Matt Watson is the Founder & CEO of Stackify. He has a lot of experience managing high growth and complex technology projects. He is focused on changing the way developers support their production applications with DevOps.

February 11, 2013

Defining the Dev and the Ops in Devops

This is a guest post by Matt Watson from Stackify

So what does DevOps mean, exactly? What is the Dev and what is the Ops in DevOps? The role of Operations can mean a lot of things, and even different things to different people. DevOps is becoming more and more popular, but a lot of people are confused about who does what. So let's make a list of the responsibilities operations traditionally has, then figure out what developers should be doing, and which, if any, responsibilities should be shared.

Operations responsibilities

  • IT buying
  • Installation of server hardware and OS
  • Configuration of servers, networks, storage, etc…
  • Monitoring of servers
  • Respond to outages
  • IT security
  • Managing phone systems, network
  • Change control
  • Backup and disaster recovery planning
  • Manage Active Directory
  • Asset tracking

Shared Development & Operations duties

  • Software deployments
  • Application support

Some of these traditional responsibilities have changed in the last few years. Virtualization and the cloud have greatly simplified buying decisions, installation, and configuration. For example, nobody cares what kind of server we are going to buy anymore for a specific application or project. We buy great big ones, virtualize them, and just carve out what we need and change it on the fly. Cloud hosting simplifies this even more by eliminating the need to buy servers at all.

So what part of the “Ops” duties should developers be responsible for?

  • Be involved in selecting the application stack
  • Configure and deploy virtual or cloud servers (potentially)
  • Deploy their applications
  • Monitor application and system health
  • Respond to application problems as they arise.

Developers who take ownership of these responsibilities can ultimately deploy and support their applications more rapidly. DevOps processes and tools eliminate the walls between the teams and enable more agility for the business. This philosophy can make developers responsible for the entire application stack from the OS level up, in more of a self-service mode.

So what does the operations team do then?

  • Manage the hardware infrastructure
  • Configure and monitor networking
  • Enforce policies around backup, DR, security, compliance, change control, etc
  • Assist in monitoring the systems
  • Manage Active Directory
  • Asset tracking
  • Other non production application related tasks

Depending on the company's size, the workload of these tasks will vary greatly. In large enterprise companies these operations tasks become complex enough to require specialization and dedicated personnel. For small to midsize companies, an IT manager and 1-2 system administrators can typically handle these tasks.

DevOps is evolving into letting the operations team focus on the infrastructure and IT policies while empowering the developers to exercise tremendous ownership from the OS level up. With a solid infrastructure, developers can own the application stack: build it, deploy it, and cover much if not all of its support. This enables development teams to be more self-service and independent of a busy centralized operations team. DevOps enables more agility, better efficiency, and ultimately a higher level of service to customers.

About the author: Matt Watson is the Founder & CEO of Stackify. He has a lot of experience managing high growth and complex technology projects. He is focused on changing the way developers support their production applications with DevOps.

December 10, 2012

Approaches to Application Release Automation

This is a guest post by Phil Cherry from Nolio

A discussion of process-based, package-based, declarative, imperative and generic approaches to application release automation.

Application Release Automation is a relatively new but rapidly maturing area of IT. As with all new areas, there is plenty of confusion around what Application Release Automation really is and the best way to go about it. There are those who come at it with a very developer-centric mind-set, those who embrace the modern DevOps concept, and even those who attempt to apply server-based automation tools to the application space.

Having worked with many companies of various sizes, technologies, cultures and mind-sets, both as they select an ARA (Application Release Automation) tool and as they move on to implement their chosen tool, I have had many opportunities to assess the various approaches. In this short blog I will discuss the pros and cons of each approach.

Package-Based

Package-based automation is a technique that was originally designed for automating the server layer. Due to its success there, some have attempted to adapt it to automate the application layer as well. Packages encapsulate all the changes that need to be performed on a single server, and can include the pre-requisite checks that need to take place, as well as the post-deployment verifications. When patching a server this makes complete sense: there are no dependencies between the patched server and all the others in the same data centre, so applying all the required changes for that patch (or patches) in a bundle in one go is possible. The package can then be applied to all appropriate servers without modification. At this layer there is little difference between one Windows Server 2008 and the next, even though the applications on top may be completely different.

The benefit of this packaging approach is the easy rollback capability: if required, the package can be easily rolled back to the original server state. On the other hand, it treats each server as an island with no dependencies on other servers, and it assumes that all changes on that server can be done in one go. This type of automation is offered by companies like BMC BladeLogic and IBM Tivoli Provisioning Manager (TPM).

Declarative-Based

Declarative-based automation comes from a similar mindset to package-based but takes a different route to the solution. It also originally came out of the need to automate the server layer, followed by an attempt to apply it to the application layer. With declarative-based automation, the desired state of the server is defined down to every individual configuration item (registry key, DLL, config file entry, etc.). Most declarative-based tools require you to describe the desired state by writing what is effectively a piece of 'code'. Some solutions, for example Puppet, offer a simplified proprietary DSL (Domain Specific Language), but this does not allow you to do everything, and so Ruby is kept as a backup. The downside is that the user has to learn at least one programming language (or you have to employ people with that knowledge already), so the automation is not readily open to non-developers. This approach also has the same downside as package-based automation in that it assumes each server can be configured independently and all in one go. But it also has the same benefit, in that automatic rollback is conceptually a lot easier.
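
To show the declarative idea without any particular tool's DSL, here is a toy convergence loop: the desired state is plain data, and the engine's only job is to make the machine match it. This is an illustration of the concept, not how Puppet or any other product is implemented.

```python
# Toy convergence loop: the desired state is data, and the engine's job is to
# make the file system match it. Conceptual illustration only, not how Puppet
# or any other product is implemented.
from pathlib import Path

# Desired state, described as data rather than as a sequence of steps.
DESIRED_FILES = {
    "/tmp/demo/app.conf": "listen_port = 8080\n",
    "/tmp/demo/motd": "Managed by the convergence demo\n",
}

def converge(desired: dict[str, str]) -> list[str]:
    """Bring each file to its declared content; report what was changed."""
    changes = []
    for path_str, content in desired.items():
        path = Path(path_str)
        path.parent.mkdir(parents=True, exist_ok=True)
        if not path.exists() or path.read_text() != content:
            path.write_text(content)
            changes.append(path_str)
    return changes  # an empty list means the system was already in the desired state

if __name__ == "__main__":
    print(converge(DESIRED_FILES))
```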

Imperative-Based

Imperative-based automation is more familiar, as the structure of the language is closer to traditional programming languages (such as Java, C++, Perl, etc.). In this approach a programming language is used to describe what needs to be done to the target servers in a series of steps executed in a specific sequence. Chef is an example of an imperative automation tool (its recipes are written in a language based on Ruby). As with declarative-based automation, the code created (or recipe, as it is called in Chef) is still very much focused on making changes to a single isolated server, and the assumption is that those same changes will be applied to multiple servers of the same type. There is limited support for making dependent changes across multiple servers, because that was not required at the server layer; it only becomes important when you move up to the application layer. And of course, the current offerings still require you to be familiar with, or learn, a programming language to use them.

Generic and Custom-Built Approaches

Often, people try to apply generic approaches (such as PowerShell, DOS batch scripts, Perl, etc.) to the task of automation, or even to write their own automation tool using a language such as Java or C++. As any developer will tell you, they can go and write something that will deploy your application, using their preferred language rather than the language supported by the automation platform being employed. And they are right: a development team can indeed write a fully capable deployment tool. But the question is: does it really benefit the company to take up development time building and maintaining a deployment tool rather than focusing on the development of its own applications? Even then they will have many issues to face in enabling parallel execution, reporting and auditing, access and permissions control, and, importantly, synchronising activity across multiple servers. The original intention of these approaches was once again focused on a specific server and not on the cross-server nature of application deployment.

Process-Based

Process-based automation is a different approach, which was created more recently, to address the needs of application release automation. ARA platforms such as DeployIt and UrbanDeploy and, of course, our own tool Nolio all take this approach. These tools seek to support currently existing application deployment processes, the ones that operators could/would normally step through manually. The focus is on processes, and the tools allow an operator to define them in a visual way, with an understanding of the cross-server nature of application deployments. Let’s say that you need to do something on an application’s web server, then the application server, then update tables in the database, then a second change to the application server and finally another change to the web server. With a server-centric approach (such as those discussed above) it is very hard to orchestrate activities across multiple different servers, to synchronise the changes so they happen at the right time in relation to each other, and even to pass information between those servers. With a process-based system this is straightforward – you define the different server types, drop the relevant steps onto each one and draw links to define the synchronisation points (i.e. only do the db steps once the application server steps have been completed).
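
A stripped-down sketch of that idea: steps are attached to server types, dependency links decide when each step may run, and the engine simply walks the steps in an order that respects those links. The step names echo the example above, but the engine itself is a toy, not how Nolio or any other ARA product works internally.

```python
# Toy process engine: steps belong to server types, and dependency links decide
# when each step may run. Illustrative only; not how any ARA product works.
from graphlib import TopologicalSorter  # Python 3.9+

# step name -> (server type, steps that must finish first)
PROCESS = {
    "Prepare for package distribution": ("Repository node", []),
    "Stop Application":                 ("Application Server", ["Prepare for package distribution"]),
    "Update web tier":                  ("Web Server", ["Prepare for package distribution"]),
    "Update database tables":           ("Database Server", ["Stop Application"]),
    "Start Application":                ("Application Server", ["Update database tables"]),
    "Final web change":                 ("Web Server", ["Start Application"]),
}

def run_process(process: dict[str, tuple[str, list[str]]]) -> None:
    order = TopologicalSorter({name: deps for name, (_, deps) in process.items()})
    for step in order.static_order():      # respects every dependency link
        server_type, _ = process[step]
        print(f"[{server_type}] {step}")   # a real engine would execute the step here

if __name__ == "__main__":
    run_process(PROCESS)
```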

Diagram 1 – Screenshot of a Nolio deployment process, including activity on 4 different server types.

The issues mentioned above, such as parallel execution and cross-server synchronisation, should already have been dealt with by the ARA platform, rather than having to be created on top of a more generic platform. As you can see from the diagram above, with an ARA platform cross-server synchronisation is simply a case of drawing a dependency link from an action on one server type to an action on another server type. Now the "Stop Application" action on the "Application Server Type" will not run until "Prepare for package distribution" on the "Repository node" has been completed.

In addition to this, process-based automation brings the following benefits:

  • There is no need to modify the deployment process to fit an automation tool's inability to synchronise activity across multiple servers, so consistency between the manual and automated approaches is maintained.
  • The defined process can be used as documentation for how the deployment should be done, as it is inherently very readable.
  • There is a detailed and relevant audit trail, because the process is in line with how the deployment would be done manually.
  • It is much easier to diagnose issues, because the process follows the expected steps.

You can read more articles on Application Release Automation on Nolio’s Blog Site.

November 11, 2012

Big Data Problems in Monitoring at eBay

This post is based on a talk by Bhaven Avalani and Yuri Finklestein at QConSF 2012 (slides). Bhaven and Yuri work on the Platform Services team at eBay.

by @mattokeefe

This is a Big Data talk with Monitoring as the context. The problem domain includes operational management (performance, errors, anomaly detection), triaging (Root Cause Analysis), and business monitoring (customer behavior, click stream analytics). Customers of Monitoring include dev, Ops, infosec, management, research, and the business team. How much data? In 2009 it was tens of terabytes per day, now more than 500 TB/day. Drivers of this volume are business growth, SOA (many small pieces log more data), business insights, and Ops automation.

The second aspect is Data Quality. There are logs, metrics, and events with decreasing entropy in that order. Logs are free-form whereas events are well defined. Veracity increases in that order. Logs might be inaccurate.

There are tens of thousands of servers in multiple datacenters generating logs, metrics and events that feed into a data distribution system. The data is distributed to OLAP, Hadoop, and HBase for storage. Some of the data is dealt with in real time, while other processing, such as OLAP for metric extraction, is not.

Logs
How do you make logs less "wild"? Typically there are no schemas, types, or governance. At eBay they impose a log format as a requirement. The log entry types include open and close for transactions, with times for transaction begin and end, a status code, and arbitrary key-value data. Transactions can be nested. Another type is the atomic transaction. There are also types for events and heartbeats. They generate 150TB of logs per day.
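
The talk doesn't publish the exact format, so the line layout below is invented, but it shows what imposing a schema on log entries might look like in practice: typed entries with a name, a status code and a key-value payload.

```python
# The talk describes typed log entries (transaction open/close, atomic
# transactions, events, heartbeats) with status codes and key-value payloads.
# The concrete line layout below is invented for illustration; only the idea
# of imposing a schema on log lines comes from the talk.
from dataclasses import dataclass

@dataclass
class LogEntry:
    timestamp: float
    entry_type: str        # "OPEN", "CLOSE", "ATOMIC", "EVENT", "HEARTBEAT"
    name: str              # transaction or event name
    status: str            # e.g. "0" for success
    data: dict[str, str]   # arbitrary key-value payload

def parse_line(line: str) -> LogEntry:
    """Parse 'ts|type|name|status|k1=v1&k2=v2' into a typed entry."""
    ts, entry_type, name, status, payload = line.rstrip("\n").split("|", 4)
    data = dict(pair.split("=", 1) for pair in payload.split("&") if pair)
    return LogEntry(float(ts), entry_type, name, status, data)

if __name__ == "__main__":
    sample = "1352678400.125|CLOSE|checkout|0|duration_ms=42&pool=web12"
    print(parse_line(sample))
```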

Large Scale Data Distribution
The hardest part of distributing such large amounts of data is fault handling. It is necessary to be able to buffer data temporarily and handle large spikes. Their solution is similar to Scribe and Flume, except the unit of work is a log entry with multiple lines, and the lines must be processed in the correct order. The Fault Domain Manager copies the data into downstream domains. It uses a system of queues to handle the temporary unavailability of a destination domain such as Hadoop or Messaging. Queues can indicate the pressure in the system being produced by the tens of thousands of publisher clients. The queues are implemented as circular buffers so that they can start dropping data if the pressure is too great. Different policies, such as drop head and drop tail, are applied depending on the domain's requirements.
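
The queue behaviour described above can be pictured with a miniature bounded buffer that, under pressure, either drops the oldest entries ("drop head") or refuses new ones ("drop tail"). This is a conceptual sketch, not eBay's implementation.

```python
# Miniature bounded queue: under pressure it either drops the oldest entries
# ("drop head") or rejects new ones ("drop tail"). Conceptual sketch only.
from collections import deque

class BoundedQueue:
    def __init__(self, capacity: int, policy: str = "drop_head") -> None:
        assert policy in ("drop_head", "drop_tail")
        self.buffer: deque = deque()
        self.capacity = capacity
        self.policy = policy
        self.dropped = 0

    def publish(self, item) -> None:
        if len(self.buffer) >= self.capacity:
            if self.policy == "drop_head":
                self.buffer.popleft()   # sacrifice the oldest entry
                self.dropped += 1
            else:
                self.dropped += 1       # drop tail: refuse the new entry
                return
        self.buffer.append(item)

    def pressure(self) -> float:
        """Fill ratio, a simple signal of back-pressure from publishers."""
        return len(self.buffer) / self.capacity

if __name__ == "__main__":
    q = BoundedQueue(capacity=3, policy="drop_head")
    for i in range(5):
        q.publish(f"log-{i}")
    print(list(q.buffer), f"pressure={q.pressure():.2f}", f"dropped={q.dropped}")
```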

Metric Extraction
The raw log data is a great source of metrics and events. The client does not need to know ahead of time what is of interest. The heart of the system that does this is Distributed OLAP. There are multiple dimensions such as machine name, cluster name, datacenter, transaction name, etc. The system maintains counters in memory on hierarchically described data. Traditional OLAP systems cannot keep up with the amount of data, so they partition across layers consisting of publishers, buses, aggregators, combiners, and query servers. The result of the aggregators is OLAP cubes with multidimensional structures with counters. The combiner then produces one gigantic cube that is made available for queries.
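
A miniature version of the hierarchical counting idea: each metric sample carries its dimensions, and a counter is kept for every prefix of that hierarchy so coarser rollups are ready to query. The real system partitions this work across publishers, aggregators and combiners; the sketch only shows the counting.

```python
# Hierarchical counters in miniature: every sample carries dimensions
# (datacenter, cluster, machine, transaction), and a counter is kept for each
# prefix of that hierarchy so coarser rollups are ready to query.
# The partitioned, distributed machinery is omitted.
from collections import Counter

counters: Counter = Counter()

def record(datacenter: str, cluster: str, machine: str, transaction: str, n: int = 1) -> None:
    dims = (datacenter, cluster, machine, transaction)
    # Increment one counter per hierarchy prefix: dc, dc/cluster, dc/cluster/machine, ...
    for depth in range(1, len(dims) + 1):
        counters["/".join(dims[:depth])] += n

if __name__ == "__main__":
    record("phx", "web", "web042", "checkout")
    record("phx", "web", "web043", "checkout")
    record("phx", "api", "api007", "search")
    print(counters["phx"])          # 3: all samples in the datacenter
    print(counters["phx/web"])      # 2: rollup at the cluster level
```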

Time Series Storage
RRD was a remarkable invention when it came out, but it can't deal with data at this scale. One solution is to use a column-oriented database such as HBase or Cassandra. However, you don't know what your row size should be, and handling very large rows is problematic. OpenTSDB, on the other hand, uses fixed row sizes based on time intervals. At eBay's scale, with millions of metrics per second, you need to down-sample based on metric frequency. To solve this, they introduced a concept of multiple row spans for different resolutions.
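
The row-span idea can be sketched like this: a data point's row is its timestamp truncated to a span, and coarser resolutions use wider spans so high-frequency metrics don't blow up row sizes. The span widths and bucket sizes below are arbitrary examples, not the values eBay or OpenTSDB actually use.

```python
# Sketch of resolution-dependent row spans plus simple down-sampling.
# Span widths and bucket sizes are arbitrary examples.

# resolution name -> row span in seconds
ROW_SPANS = {"1s": 3600, "1m": 86400, "1h": 604800}  # 1 hour, 1 day, 1 week

def row_key(metric: str, timestamp: int, resolution: str) -> str:
    span = ROW_SPANS[resolution]
    base = timestamp - (timestamp % span)     # start of the row's time span
    return f"{metric}:{resolution}:{base}"

def downsample(points: list[tuple[int, float]], bucket_seconds: int) -> list[tuple[int, float]]:
    """Average raw points into fixed buckets before writing a coarser resolution."""
    buckets: dict[int, list[float]] = {}
    for ts, value in points:
        buckets.setdefault(ts - (ts % bucket_seconds), []).append(value)
    return [(ts, sum(vs) / len(vs)) for ts, vs in sorted(buckets.items())]

if __name__ == "__main__":
    raw = [(1352678400 + i, float(i % 5)) for i in range(180)]   # 3 minutes of 1-second data
    print(row_key("site.qps", 1352678400, "1m"))
    print(downsample(raw, 60))                                    # three 1-minute averages
```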

Insights
* Entropy is important to look at; remove it as early as possible
* Data distribution needs to be flexible and elastic
* Storage should be optimized for access patterns

Q&A
Q. What are the outcomes in terms of value gained?
A. Insights into availability of the site are important as they release code every day. Business insights into customer behavior are great too.

Q. How do they scale their infrastructure and do deployments?
A. Each layer is horizontally scalable, but they're struggling with auto-scaling at this time. eBay is looking to leverage Cloud automation to address this.

Q. What is the smallest element that you cannot divide?
A. Logs must be processed atomically. It is hard to parallelize metric families.

Q. How do you deal with security challenges?
A. Their security team applies governance. Also there is a secure channel that is encrypted for when you absolutely need to log sensitive data.

November 8, 2012

Release Engineering at Facebook

This post is based on a talk by Chuck Rossi at QConSF 2012. Chuck is the first Release Engineer to work at Facebook.
by @mattokeefe

Chuck tries to avoid the “D” “O” word… DevOps. But he was impressed by a John Allspaw presentation at Velocity 09 “10+ Deploys Per Day: Dev and Ops Cooperation at Flickr“. This led him to set up a bootcamp session at Facebook and this post is based on what he tells new developers.

The Problem
Developers want to get code out as fast as possible. Release Engineers don't want anything to break. So there's a need for a process. "Can I get my rev out?" "No. Go away." That doesn't work. They're all working to make change. Facebook operates at ludicrous speed, at massive scale. No other company on earth moves as fast at their scale.

Chuck has two things at his disposal: tools and culture. He latched on to the culture thing after Allspaw’s talk. The first thing that he tells developers is that they will shepherd their changes out to the world. If they write code and throw it over the wall, it will affect Chuck’s Mom directly. You have to deal with dirty work and it is your operational duty from check-in to trunk to in-front-of-my-Mom. There is no QA group at Facebook to find your bugs before they’re released.

How do you do this? You have to know when and how a push is done. All systems at Facebook follow the same path, and they push every day.

How does Facebook push?
Chuck doesn’t care what your source control system is. He hates them all. They push from trunk. On Sunday at 6p they take trunk and cut a branch called latest. Then they test for two days before shipping. This is the old school part. Tuesday they ship, then Wed-Fri they cherry pick more changes. 50-300 cherry picks per day are shipped.

But Chuck wanted more. “Ship early and ship twice as often” was a post he wrote on the Facebook engineering blog. (check out the funny comments). They started releasing 2x/day in August. This wasn’t as crazy as some people thought, because the changes were smaller with the same number of cherry picks per day.

About 800 developers check in per week, and that number keeps growing as Facebook hires more. There are about 10k commits per month to a 10M LOC codebase, but the rate of cherry picks per day has remained pretty stable. There is a cadence for how things go out, so you should put most of your effort into the big weekly release. Then lots of stuff crowds in on Wednesday as fixes come in. Be careful on Friday. At Google they had "no push Fridays". Don't check in your code and leave. Sunday and Monday are their biggest days, as everyone uploads and views all the photos from everyone else's drunken weekend.

Give people an out. If you can’t remember how to do a release, don’t do anything. Just check into trunk and you can avoid the operational burden of showing up for a daily release.

Remember that you're not the only team shipping on a given day. Coordinate changes for large things so you can see what's planned company-wide. Facebook uses Facebook groups for this.

Dogfooding
You should always be testing. People say it but don't mean it; Facebook takes it very seriously. Employees never go to the real facebook.com because they are redirected to http://www.latest.facebook.com. This is their production Facebook plus all pending changes, so the whole company is seeing what will go out. Dogfooding is important. If there's a fatal error, you get directed to the bug report page.

File bugs when you can reproduce them. Make it easy and low friction for internal users to report an issue. The internal Facebook includes some extra chrome with a button that captures session state, then routes a bug report to the right people.

When Chuck does a push, there's another step: developers' changes are not merged unless they've shown up. You have to reply to a message to confirm that you're online and ready to support the push. So the actual build is http://www.inyour.facebook.com, which has fewer changes than latest.

Facebook.com is not to be used as a sandbox. Developers have to resist the urge to test in prod. If you have a billion users, don’t figure things out in prod. Facebook has a separate complete and robust sandbox system.

On-call duties are serious. They make sure that they have engineers assigned as point of contact across the whole system. Facebook has a tool that allows quick lookup of on-call people. No engineer escapes this.

Self Service
Facebook does everything in IRC. It scales well, with up to 1000 people in a channel. Easy questions are answered by a bot. There is a command to look up the status of any rev, and there's a browser shortcut as well. Bots are your friends and they track you like a dog. A bot will ask a developer to confirm that they want a change to go out.

Where are we?
Facebook has a dashboard with nice graphs showing the status of each daily push. There is also a test console. When Chuck does the final merge, he kicks off a system test immediately. They have about 3500 unit test suites and he can run one on each machine. He reruns the tests after every cherry-pick.

Error tracking
There are thousands and thousands of web servers. There’s good data in the error logs but they had to write a custom log aggregator to deal with the volume. At Facebook you can click on a logged error and see the call stack. Click on a function and it expands to show the git blame and tell you who to assign a bug to. Chuck can also use Scuba, their analysis system, which can show trends and correlate to other events. Hover over any error, and you get a sparkline that shows a quick view of the trend.

Gatekeeper
This is one of Facebook’s main strategic advantages that is key to their environment. It is like a feature flag manager that is controlled by a console. You can turn new features on selectively and restrain the set of users who see the change. Once they turned on “fax your photo” for only Techcrunch as a joke.
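
A toy feature-flag check in the spirit of Gatekeeper might look like the sketch below: a feature is enabled for an explicit allow list plus a stable percentage of users. Facebook's actual Gatekeeper rules and console are far richer; this only shows the shape of the idea.

```python
# Toy feature-flag check in the spirit of Gatekeeper: a feature is turned on
# for an allow list plus a stable percentage rollout. Generic illustration;
# not Facebook's implementation. The flag and user names are hypothetical.
import hashlib

FLAGS = {
    "new_photo_viewer": {"allow_users": {"chuck"}, "percent": 10},  # hypothetical flag
}

def is_enabled(flag: str, user_id: str) -> bool:
    rule = FLAGS.get(flag)
    if rule is None:
        return False
    if user_id in rule["allow_users"]:
        return True                      # explicitly targeted users always see it
    # Stable percentage rollout: hash the user into one of 100 buckets.
    bucket = int(hashlib.sha1(f"{flag}:{user_id}".encode()).hexdigest(), 16) % 100
    return bucket < rule["percent"]

if __name__ == "__main__":
    print(is_enabled("new_photo_viewer", "chuck"))    # True via the allow list
    print(is_enabled("new_photo_viewer", "someone"))  # True for roughly 10% of users
```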

Push karma
Chuck's job is to manage risk. When he looks at the cherry-pick dashboard it shows the size of the change and the amount of discussion in the diff tool (how controversial the change is). If both are high he looks more closely. He can also see push karma, rated up to five stars, for each requestor. He has an unlike button to downgrade your karma. If you get down to two stars, Chuck will just stop taking your changes. You have to come and have a talk with him to get back on track.

Perflab
This is a great tool that does a full performance regression on every change. It will compare perf of trunk against the latest branch.

HipHop for PHP
This generates about 600 highly optimized C++ files that are then linked into a single binary. But sometimes they use interpreted PHP in dev. This is a problem they plan to solve with a PHP virtual machine that they intend to open source.

Bittorrent
This is how they distribute the massive binary to many thousands of machines. Clients contact an Open Tracker server for a list of peers. There is rack affinity, and Chuck can push in about 15 minutes.

Tools alone won’t save you
The main point is that you cannot tool your way out of this. The people coming on board have to be brainwashed so they buy into the cultural part. You need the right company with support from the top all the way down.