Archive for November, 2012

November 11, 2012

Big Data Problems in Monitoring at eBay

This post is based on a talk by Bhaven Avalani and Yuri Finklestein at QConSF 2012 (slides). Bhaven and Yuri work on the Platform Services team at eBay.

by @mattokeefe

This is a Big Data talk with Monitoring as the context. The problem domain includes operational management (performance, errors, anomaly detection), triaging (Root Cause Analysis), and business monitoring (customer behavior, click stream analytics). Customers of Monitoring include dev, Ops, infosec, management, research, and the business team. How much data? In 2009 it was tens of terabytes per day, now more than 500 TB/day. Drivers of this volume are business growth, SOA (many small pieces log more data), business insights, and Ops automation.

The second aspect is Data Quality. Logs, metrics, and events carry decreasing entropy in that order: logs are free-form, whereas events are well defined. Veracity increases in the same order, which means logs are the most likely to be inaccurate.

There are tens of thousands of servers in multiple datacenters generating logs, metrics, and events that feed into a data distribution system. The data is distributed to OLAP, Hadoop, and HBase for storage. Some of the data is processed in real time, while other work, such as OLAP-based metric extraction, is not.

Logs
How do you make logs less “wild”? Typically there are no schemas, types, or governance. At eBay they impose a log format as a requirement. The log entry types include open and close entries for transactions, carrying the transaction begin and end times, a status code, and arbitrary key-value data. Transactions can be nested. Another type is the atomic transaction, and there are also types for events and heartbeats. They generate 150TB of logs per day.
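
The talk did not show the exact wire format, but a minimal sketch of structured entries along these lines conveys the idea; the field names and the JSON encoding are illustrative assumptions, not eBay's actual schema.

```python
import json
import time

def open_transaction(name, **kv):
    """Start of a (possibly nested) transaction; fields are illustrative."""
    return {"type": "T_OPEN", "name": name, "ts": time.time(), "data": kv}

def close_transaction(name, status, **kv):
    """End of a transaction; pairs with the matching open entry."""
    return {"type": "T_CLOSE", "name": name, "ts": time.time(),
            "status": status, "data": kv}

def atomic_transaction(name, status, duration_ms, **kv):
    """Single-entry transaction with no separate open/close."""
    return {"type": "T_ATOMIC", "name": name, "ts": time.time(),
            "status": status, "duration_ms": duration_ms, "data": kv}

def event(name, **kv):
    return {"type": "EVENT", "name": name, "ts": time.time(), "data": kv}

def heartbeat(host):
    return {"type": "HEARTBEAT", "host": host, "ts": time.time()}

if __name__ == "__main__":
    entries = [
        open_transaction("checkout", user="u123"),
        atomic_transaction("db.query", status="0", duration_ms=12, table="items"),
        close_transaction("checkout", status="0"),
    ]
    for e in entries:
        print(json.dumps(e))
```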

Large Scale Data Distribution
The hardest part of distributing such large amounts of data is fault handling. It is necessary to buffer data temporarily and handle large spikes. Their solution is similar to Scribe and Flume, except the unit of work is a log entry with multiple lines, and the lines must be processed in the correct order. The Fault Domain Manager copies the data into downstream domains. It uses a system of queues to handle the temporary unavailability of a destination domain such as Hadoop or Messaging. Queue depth indicates the pressure produced by the tens of thousands of publisher clients. The queues are implemented as circular buffers so that they can start dropping data if the pressure is too great. Different policies, such as drop-head and drop-tail, are applied depending on the domain's requirements.
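
A toy version of such a bounded queue, with the two drop policies and a pressure reading, might look like the sketch below; the class and method names are made up for illustration.

```python
from collections import deque

class BoundedQueue:
    """Fixed-capacity queue that sheds load instead of blocking publishers.

    policy="drop_head": evict the oldest entry to make room (favors fresh data).
    policy="drop_tail": reject the incoming entry (favors already-buffered data).
    """
    def __init__(self, capacity, policy="drop_head"):
        self.buf = deque()
        self.capacity = capacity
        self.policy = policy
        self.dropped = 0  # simple pressure indicator

    def push(self, item):
        if len(self.buf) >= self.capacity:
            self.dropped += 1
            if self.policy == "drop_tail":
                return False          # drop the new entry
            self.buf.popleft()        # drop the oldest entry
        self.buf.append(item)
        return True

    def pop(self):
        return self.buf.popleft() if self.buf else None

    def pressure(self):
        """Fill ratio; a downstream domain falling behind pushes this toward 1.0."""
        return len(self.buf) / self.capacity


# Hypothetical usage: one queue per downstream domain (e.g. Hadoop, messaging).
q = BoundedQueue(capacity=3, policy="drop_head")
for entry in ["e1", "e2", "e3", "e4"]:
    q.push(entry)
print(list(q.buf), "dropped:", q.dropped)   # ['e2', 'e3', 'e4'] dropped: 1
```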

Metric Extraction
The raw log data is a great source of metrics and events, and the client does not need to know ahead of time what is of interest. The heart of the system that does this is distributed OLAP. There are multiple dimensions such as machine name, cluster name, datacenter, transaction name, and so on. The system maintains in-memory counters over hierarchically described data. Traditional OLAP systems cannot keep up with the amount of data, so they partition the work across layers consisting of publishers, buses, aggregators, combiners, and query servers. The aggregators produce OLAP cubes, multidimensional structures of counters, and the combiner then produces one gigantic cube that is made available for queries.
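
As a rough sketch of the aggregation idea (not eBay's implementation), hierarchical counters can be maintained by updating every prefix of a dimension key, and partial cubes can be merged combiner-style; all names here are hypothetical.

```python
from collections import defaultdict

class MiniCube:
    """Toy in-memory aggregator: counters keyed by a tuple of dimension values.

    A real deployment shards this across aggregators and a combiner merges the
    partial cubes; here everything lives in one process for illustration.
    """
    def __init__(self, dimensions):
        self.dimensions = dimensions            # e.g. ("datacenter", "cluster", "txn")
        self.counters = defaultdict(lambda: {"count": 0, "total_ms": 0})

    def add(self, record, duration_ms):
        key = tuple(record[d] for d in self.dimensions)
        # Update every prefix of the key so coarser rollups stay consistent.
        for i in range(1, len(key) + 1):
            c = self.counters[key[:i]]
            c["count"] += 1
            c["total_ms"] += duration_ms

    def query(self, *prefix):
        return self.counters.get(tuple(prefix), {"count": 0, "total_ms": 0})

    def merge(self, other):
        """Combiner step: fold another partial cube into this one."""
        for key, c in other.counters.items():
            mine = self.counters[key]
            mine["count"] += c["count"]
            mine["total_ms"] += c["total_ms"]


cube = MiniCube(("datacenter", "cluster", "txn"))
cube.add({"datacenter": "dc1", "cluster": "web", "txn": "checkout"}, 120)
cube.add({"datacenter": "dc1", "cluster": "web", "txn": "search"}, 45)
print(cube.query("dc1"))                  # rolled up across the whole datacenter
print(cube.query("dc1", "web", "search"))
```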

Time Series Storage
RRD was a remarkable invention when it came out, but it can’t deal with data at this scale. One solution is to use a column-oriented database such as HBase or Cassandra. However, you don’t know what your row size should be, and handling very large rows is problematic. OpenTSDB, on the other hand, uses fixed row sizes based on time intervals. At eBay’s scale, with millions of metrics per second, you need to down-sample based on metric frequency. To solve this, they introduced the concept of multiple row spans for different resolutions.
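
The idea can be sketched as a row-key scheme where each metric gets a resolution and a row span; the key format and the numbers below are assumptions for illustration, not eBay's actual configuration.

```python
def row_key(metric, timestamp, resolution_s, row_span_s):
    """Bucket a data point into a fixed-width row.

    resolution_s: how far apart samples are stored (down-sampling interval).
    row_span_s:   how much wall-clock time one row covers; high-frequency
                  metrics get short spans, sparse metrics get long ones.
    """
    base = timestamp - (timestamp % row_span_s)        # row start time
    offset = (timestamp - base) // resolution_s        # column within the row
    return ("%s:%d:%d" % (metric, row_span_s, base)), offset


# A busy metric: 10-second resolution, one row per hour.
print(row_key("app.checkout.latency", 1352678401, 10, 3600))
# A sparse metric: 5-minute resolution, one row per day.
print(row_key("host.disk.failures", 1352678401, 300, 86400))
```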

Insights
* Entropy is important to look at; remove it as early as possible
* Data distribution needs to be flexible and elastic
* Storage should be optimized for access patterns

Q&A
Q. What are the outcomes in terms of value gained?
A. Insights into availability of the site are important as they release code every day. Business insights into customer behavior are great too.

Q. How do they scale their infrastructure and do deployments?
A. Each layer is horizontally scalable but they’re struggling with auto-scaling at this time. eBay is looking to leverage Cloud automation to address this.

Q. What is the smallest element that you cannot divide?
A. Logs must be processed atomically. It is hard to parallelize metric families.

Q. How do you deal with security challenges?
A. Their security team applies governance. Also there is a secure channel that is encrypted for when you absolutely need to log sensitive data.

November 8, 2012

Release Engineering at Facebook

This post is based on a talk by Chuck Rossi at QConSF 2012. Chuck is the first Release Engineer to work at Facebook.
by @mattokeefe

Chuck tries to avoid the “D” “O” word… DevOps. But he was impressed by a John Allspaw presentation at Velocity 09, “10+ Deploys Per Day: Dev and Ops Cooperation at Flickr”. This led him to set up a bootcamp session at Facebook, and this post is based on what he tells new developers.

The Problem
Developers want to get code out as fast as possible. Release Engineers don’t want anything to break. So there’s a need for a process. “Can I get my rev out?” “No. Go away.” That doesn’t work. They’re all working to make change. Facebook operates at ludicrous speed. They’re at massive scale. No other company on earth moves as fast at their scale.

Chuck has two things at his disposal: tools and culture. He latched on to the culture thing after Allspaw’s talk. The first thing that he tells developers is that they will shepherd their changes out to the world. If they write code and throw it over the wall, it will affect Chuck’s Mom directly. You have to deal with dirty work and it is your operational duty from check-in to trunk to in-front-of-my-Mom. There is no QA group at Facebook to find your bugs before they’re released.

How do you do this? You have to know when and how a push is done. All systems at Facebook follow the same path, and they push every day.

How does Facebook push?
Chuck doesn’t care what your source control system is. He hates them all. They push from trunk. On Sunday at 6p they take trunk and cut a branch called latest. Then they test for two days before shipping. This is the old school part. Tuesday they ship, then Wed-Fri they cherry pick more changes. 50-300 cherry picks per day are shipped.

But Chuck wanted more. “Ship early and ship twice as often” was a post he wrote on the Facebook engineering blog. (check out the funny comments). They started releasing 2x/day in August. This wasn’t as crazy as some people thought, because the changes were smaller with the same number of cherry picks per day.

About 800 developers check in per week, and the number keeps growing as they hire more. There are about 10k commits per month to a 10M LOC codebase, but the rate of cherry picks per day has remained pretty stable. There is a cadence for how things go out, so you should put most of your effort into the big weekly release. Then lots of stuff crowds in on Wed as fixes come in. Be careful on Friday. At Google they had “no push Fridays”. Don’t check in your code and leave. Sunday and Monday are their biggest days, as everyone uploads and views all the photos from everyone else’s drunken weekend.

Give people an out. If you can’t remember how to do a release, don’t do anything. Just check into trunk and you can avoid the operational burden of showing up for a daily release.

Remember that you’re not the only team shipping on a given day. Coordinate large changes so you can see what’s planned company-wide. Facebook uses Facebook groups for this.

Dogfooding
You should always be testing. People say it but don’t mean it; Facebook takes it very seriously. Employees never go to the real facebook.com because they are redirected to http://www.latest.facebook.com. This is production Facebook plus all pending changes, so the whole company is seeing what will go out. Dogfooding is important. If there’s a fatal error, you get directed to the bug report page.

File bugs when you can reproduce them. Make it easy and low friction for internal users to report an issue. The internal Facebook includes some extra chrome with a button that captures session state, then routes a bug report to the right people.

When Chuck does a push, there’s another step: developers’ changes are not merged unless they’ve shown up. You have to reply to a message to confirm that you’re online and ready to support the push. So the actual build is http://www.inyour.facebook.com, which has fewer changes than latest.

Facebook.com is not to be used as a sandbox. Developers have to resist the urge to test in prod. If you have a billion users, don’t figure things out in prod. Facebook has a separate complete and robust sandbox system.

On-call duties are serious. They make sure that they have engineers assigned as point of contact across the whole system. Facebook has a tool that allows quick lookup of on-call people. No engineer escapes this.

Self Service
Facebook does everything in IRC. It scales well with up to 1000 people in a channel. Easy questions are answered by a bot, and there is a command to look up the status of any rev, plus a browser shortcut. Bots are your friends and they track you like a dog. A bot will ask a developer to confirm that they want a change to go out.
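
The talk did not show the bot itself, but a minimal sketch of the kind of command handling described, answering “what is the status of my rev?”, could look like this; the command name, rev identifiers, and statuses are all made up.

```python
# Hypothetical revision database; the real bot would query the release
# tooling rather than an in-memory dict.
REV_STATUS = {
    "rev12345": "merged into latest, shipping in today's push",
    "rev12346": "pending: author has not confirmed they are online",
}

def handle_message(nick, text):
    """Answer the easy questions so humans in the channel don't have to."""
    parts = text.strip().split()
    if len(parts) == 2 and parts[0] == "!status":
        rev = parts[1]
        return "%s: %s" % (nick, REV_STATUS.get(rev, "unknown rev %s" % rev))
    if text.strip() == "!help":
        return "%s: try '!status <rev>'" % nick
    return None  # stay quiet on everything else

print(handle_message("chuckr", "!status rev12345"))
print(handle_message("dev1", "!status rev99999"))
```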

Where are we?
Facebook has a dashboard with nice graphs showing the status of each daily push. There is also a test console. When Chuck does the final merge, he kicks off a system test immediately. They have about 3500 unit test suites and he can run one per machine. He reruns the tests after every cherry pick.

Error tracking
There are thousands and thousands of web servers. There’s good data in the error logs but they had to write a custom log aggregator to deal with the volume. At Facebook you can click on a logged error and see the call stack. Click on a function and it expands to show the git blame and tell you who to assign a bug to. Chuck can also use Scuba, their analysis system, which can show trends and correlate to other events. Hover over any error, and you get a sparkline that shows a quick view of the trend.
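
A tiny sketch of the aggregation idea, grouping errors by a call-stack signature and tracking a per-minute trend (standing in for the hover sparkline), might look like this; the data model is an assumption, not Facebook's.

```python
import hashlib
from collections import Counter, defaultdict

def signature(stack_frames):
    """Collapse a call stack into a stable key so identical errors group together."""
    return hashlib.sha1("|".join(stack_frames).encode()).hexdigest()[:12]

buckets = defaultdict(Counter)   # signature -> counts per minute bucket

def ingest(error):
    """error: {'ts': epoch seconds, 'stack': [frames], 'message': str}"""
    sig = signature(error["stack"])
    buckets[sig][error["ts"] // 60] += 1
    return sig

def sparkline(counts, start_minute, width=10):
    """Tiny text trend, standing in for the hover sparkline in the UI."""
    bars = " .:-=+*#%@"
    vals = [counts.get(start_minute + i, 0) for i in range(width)]
    peak = max(vals) or 1
    return "".join(bars[int(v / peak * (len(bars) - 1))] for v in vals)

sig = ingest({"ts": 1352678400, "stack": ["render()", "photo_grid()"], "message": "oops"})
ingest({"ts": 1352678460, "stack": ["render()", "photo_grid()"], "message": "oops"})
print(sig, "[" + sparkline(buckets[sig], 1352678400 // 60) + "]")
```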

Gatekeeper
This is one of Facebook’s main strategic advantages and is key to their environment. It is like a feature flag manager that is controlled by a console. You can turn new features on selectively and restrict the set of users who see the change. Once they turned on “fax your photo” for only Techcrunch as a joke.
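
A toy gate along these lines shows the idea of selective rollout; the targeting rules here (an explicit allow list plus a percentage) are a simplification of what Gatekeeper actually supports.

```python
import hashlib

class Gate:
    """Toy feature gate: on for listed users, plus an optional percentage rollout."""
    def __init__(self, name, allow_users=(), percent=0):
        self.name = name
        self.allow_users = set(allow_users)
        self.percent = percent

    def is_on(self, user_id):
        if user_id in self.allow_users:
            return True
        # Stable hash so a given user sees a consistent experience.
        digest = hashlib.md5(("%s:%s" % (self.name, user_id)).encode()).hexdigest()
        return int(digest, 16) % 100 < self.percent

fax_your_photo = Gate("fax_your_photo", allow_users={"techcrunch"}, percent=0)
print(fax_your_photo.is_on("techcrunch"))   # True: targeted explicitly
print(fax_your_photo.is_on("alice"))        # False: 0% rollout for everyone else
```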

Push karma
Chuck’s job is to manage risk. When he looks at the cherry pick dashboard it shows the size of the change, and the amount of discussion in the diff tool (how controversial is the change). If both are high he looks more closely. He can also see push karma rated up to five stars for each requestor. He has an unlike button to downgrade your karma. If you get down to two stars, Chuck will just stop taking your changes. You have to come and have a talk with him to get back on track.

Perflab
This is a great tool that does a full performance regression on every change. It will compare perf of trunk against the latest branch.
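
Perflab's methodology wasn't described in detail, but the core comparison could be sketched like this; the sample data and the regression threshold are invented for illustration.

```python
import statistics

def compare_perf(trunk_samples_ms, branch_samples_ms, threshold_pct=2.0):
    """Flag a regression if the branch's median latency is meaningfully worse."""
    trunk = statistics.median(trunk_samples_ms)
    branch = statistics.median(branch_samples_ms)
    delta_pct = (branch - trunk) / trunk * 100
    verdict = "REGRESSION" if delta_pct > threshold_pct else "ok"
    return verdict, round(delta_pct, 2)

print(compare_perf([101, 99, 100, 102], [105, 107, 106, 104]))   # ('REGRESSION', ~5%)
```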

HipHop for PHP
This generates about 600 highly optimized C++ files that are then linked into a single binary. But sometimes they use interpreted PHP in dev. This mismatch is a problem they plan to solve with a PHP virtual machine, which they also plan to open source.

Bittorrent
This is how they distribute the massive binary to many thousands of machines. Clients contact an Open Tracker server for a list of peers. There is rack affinity, and Chuck can push in about 15 minutes.
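
Rack affinity here means preferring peers in your own rack when fetching pieces. A minimal sketch of that preference, assuming a hostname convention that encodes the rack (invented for this example), looks like this.

```python
def prefer_same_rack(my_host, peers):
    """Order peers so same-rack sources come first.

    Assumes hostnames encode the rack as a prefix, e.g. 'rack12-web034';
    that convention is made up for this sketch.
    """
    my_rack = my_host.split("-")[0]
    same = [p for p in peers if p.split("-")[0] == my_rack]
    other = [p for p in peers if p.split("-")[0] != my_rack]
    return same + other

# Peers as returned by the tracker (hypothetical names).
peers = ["rack07-web001", "rack12-web099", "rack12-web101", "rack33-web250"]
print(prefer_same_rack("rack12-web034", peers))
```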

Tools alone won’t save you
The main point is that you cannot tool your way out of this. The people coming on board have to be brainwashed so they buy into the cultural part. You need the right company with support from the top all the way down.

November 8, 2012

Hacking Culture for Continuous Delivery

This post is based on a new talk by @jesserobbins at QConSF 2012 (slides). Jesse is a firefighter, the former Master of Disaster at Amazon, and the Founding CEO of Opscode, the company behind Chef.
by @mattokeefe
photo credit: John Keatley

Jesse Robbins, Firefighter

Operations at web scale is the ability to consistently create and deploy reliable software to an unreliable platform that scales horizontally. Jesse created the Velocity conference to explore how to do this, learning from companies that do it well. Google, Amazon, Microsoft, Yahoo built their own automation and deployment tools. When Jesse left Amazon he was stunned at the lack of mature tooling elsewhere. Many companies considered their tools to be “secret sauce” that gave them a competitive advantage. Opscode was founded to provide Cloud infrastructure automation. Jesse’s experience helping other companies down this road led to a set of culture hacks that will help you adopt Continuous Delivery.

Continuous Delivery
Continuous Delivery is the end state of thinking about and approaching a wide array of problems in a new way. Big changes to software systems that build up over long periods of time suck: a long time and lots of code changes lead to breakage that is hard to solve. The Continuous Deployment way means small amounts of code deployed frequently. Awesome in theory, but it requires organizational change. The effort is worth it, however, as the benefits include faster time to value, higher availability, happier teams, and more cool stuff. Given this, it is surprising that Continuous Delivery has taken so long to be accepted.

Teams that do Continuous Delivery are much happier. Seeing your code live is very gratifying. You have the freedom to experiment with new things because you aren’t stuck dealing with large releases and the challenge of getting everything right in one go.

Learning about Continuous Delivery is very exciting, but the reality is that back at the office things are challenging. Organizational change is hard. Let’s consider a roadmap for cultural change. The first problem is “it worked fine in test, it’s Ops’ problem now.”

Ops likes to punish dev for this.

Tools are not enough (even really great tools like Chef!). In order to succeed you have to convince people that you can be trusted and that you want to work together. The reasons for this are well understood; see, for example, Conway’s law. Teams need to work together continuously, not just at deploy time.

The choice: discourage change in the interest of stability, or allow change to happen as often as it needs to. Asking which you choose works better than just making a statement.

Common Attributes of Web Scale Cultures

  • Infrastructure as Code. This is the most important entry point, providing full-stack automation. Commodity hardware can be used with this approach, as reliability is provided in the software stack. Datacenters must have APIs; you can’t rely on humans to take action. All services including things like DNS have to follow this model. Infrastructure becomes a product, and the app dev team is the customer (see the sketch after this list).
  • Applications as Services. This means SOA with things like loose coupling and versioned APIs. You must also design for failure, and this is where a lot of teams struggle. Database/storage abstraction is important as well. Complexity is pushed up the stack. Deep instrumentation is critical for both infrastructure and apps.
  • Dev / Ops as Teams. Shared metrics and monitoring, incident management. Sometimes it is good to rotate devs through the on-call duties so everyone gets experience. Tight integration means a set of tools that integrates tightly with all of the teams. This leads to Continuous Integration, which leads to Continuous Delivery. The Site Reliability Engineer role is important in this model so you have people that understand the system from top to bottom. Finally, thorough testing is important e.g. GameDay.
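
As a rough illustration of the Infrastructure as Code bullet above (a sketch of the general idea, not Chef itself), desired state can live in version control as plain data, and a converge step drives datacenter APIs instead of a human; the API client below is imagined.

```python
# Desired state as data, checked into version control (hostnames are made up).
DESIRED = {
    "web01": {"packages": ["nginx"], "services": ["nginx"], "dns": "web01.example.com"},
    "web02": {"packages": ["nginx"], "services": ["nginx"], "dns": "web02.example.com"},
}

class FakeDatacenterAPI:
    """Stand-in for the provisioning/DNS APIs the bullet point calls for."""
    def installed_packages(self, host): return []
    def install(self, host, pkg): print("install %s on %s" % (pkg, host))
    def ensure_running(self, host, svc): print("ensure %s running on %s" % (svc, host))
    def ensure_dns(self, name, host): print("point %s at %s" % (name, host))

def converge(api, desired):
    """Compare desired state with reality and close the gap via APIs only."""
    for host, spec in desired.items():
        have = set(api.installed_packages(host))
        for pkg in spec["packages"]:
            if pkg not in have:
                api.install(host, pkg)
        for svc in spec["services"]:
            api.ensure_running(host, svc)
        api.ensure_dns(spec["dns"], host)

converge(FakeDatacenterAPI(), DESIRED)
```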

None of this is new; consider Theory of Constraints, Lean/JIT, Six Sigma, the Toyota Production System, Agile, etc. You need to recognize, however, that it has to be a cultural change to work. Every org will say “we can’t do it that way because…” They’re trying to think about where they are and extrapolate to this new state. It’s like an elephant (Enterprises) trying to fly. You have to give them a way to make incremental, evolutionary changes toward the goal.

Cultural change takes a long time. This is the hardest thing. Jesse’s Rule: Don’t Fight Stupid, Make More Awesome! Pick your battles and do these 5 things:

  • Start small and build on trust and safety. The machinery will resist you if you try sweeping change.
  • Create champions. Attack the least contentious thing first.
  • Use metrics to build confidence. Create something that you can point to to get people excited. Time to value is a good one.
  • Celebrate successes. This builds excitement, even for trivial accomplishments. The point is to create arbitrary milestones where you can look back and see progress.
  • Exploit Compelling Events. When something breaks it is a chance to do something different. “Currency to Make Change” is made available, as John Allspaw puts it.

Start small

  • Small change isn’t a threat and it’s easy to ignore. Too big a change will meet resistance, so start small.
  • Just call it an experiment. Don’t present the change as an all or nothing commitment.

Creating Champions

  • Get executive sponsors, starting with your boss
  • Give everyone else the credit. When people around you succeed, celebrate it.
  • Give “Special Status”. This is magic. Special badges, SRE bomber jackets at Google… these things are cool and you’re giving people something they want.
  • Have people with “Special Status” talk about the new awesome. Make them evangelists and create mentor programs to build an internal structure of advocates.

Metrics

  • Find KPIs that support change. Hacking metrics is important to drive change. Having KPIs around things like time to value is compelling. Relate shipping code to revenue.
  • Track and use KPIs ruthlessly. First you show value, then you show laggards the cost of not making the change. This is the carrot and stick approach.
  • Tell your story with data. Hans Rosling has a great TED talk on this topic. This is the most powerful hack. Include stories about what your competitors are doing. There’s no other way to make this work.

Celebrating Successes

  • Tell a powerful story
  • Always be positive about people and how they overcame a problem. This is especially important with Ops people who tend to be grumpy.
  • Never focus on the people who created the problem. Focus instead on the problem itself.
  • Leave room for people to come to your side. Otherwise you’ll make enemies. Don’t fight stupid.

Compelling Events

  • Just wait, one will come. Things are never stable. Exploit challenges like compliance or moving to Cloud.
  • Don’t say “I told you so”, instead ask “what do we do now?” Make it safe for people to decide to change.

Remember, don’t fight stupid, make more awesome!