Archive for ‘Uncategorized’

August 19, 2013

The Crowdsourcing of Code: What IT Can Learn from Developers

This is a guest post by Yoav Landman (@yoavlandman), Founder & Chief Technology Officer at JFrog


The agile movement is influencing the adoption of new methods of collaboration from developer to consumer throughout the development process. The sharing of resources across companies, communities or even countries is known as crowdsourcing and it is becoming increasingly common. After all, when it comes to coding most of us speak much the same language. The idea of collaboration isn’t exclusive to developers and can also provide benefits to the IT fields. In IT, community knowledge is becoming a huge asset.

Developers and IT professionals often turn to networks outside of their own for information about the artifacts they use. Through the sharing of successes, failures, feedback and updates, the building blocks that make up software are (virtually) crowdsourced. With crowdsourcing becoming the new norm, there’s no shortage of best practices to take away from their community.

With that, let’s explore the practices that’ll make for a crowdsourcing strategy that’s beneficial, efficient and safe for software developers and IT alike:

License Control

When you use a communal tool, such as open source, you must protect your project through licensing. Nothing puts a damper on a project more than a licensing issue—monetary fees, wasted productivity hours and vendor lock-in can become a huge liability. More so than ever, responsibility for larger business initiatives is falling in IT’s lap, and a large portion of honing the role comes from licensing control.

Bring It In-House

Ensure that your original project is stored in-house. The main reason: it guarantees you’re in control and can reliably control accessibility to others within your organization for download. It also doesn’t put you at the mercy of the availability of external software repositories. Be sure to equip your project with internal backup – and keep it up to date.

Access Control and Internal Audit

While sharing is encouraged, be sure to filter who and what’s accessing and updating your organization’s resources. Who and what is allowed on the network? Ensure there are policies and procedures in place. Without proper management, you have no record of the where code or software is coming from or going to which can jeopardize both quality and security.

Free Up Resources: Share Centrally & Adopt Tools that Enable Management

An internal centralized resource for developers to share and pull libraries is a best practice. Not all methods are created equally though and careful tool selection can increase productivity and free up your team’s resources .

For example, using a version control system to store libraries can actually slow down the development process—they lack searchability, proxy facilities and a certain level of permission management. These manage source code (i.e. instructions, text) not binary files (i.e. builds, executable form) and drain storage space and network resources (when using a distributed version control). Pick the right tool for the job.

Automated Clean-Up  

Combining socialization with automation will increase productivity. As creatures of habit, so many IT pros use manual intervention in processes that can be made automatic. One example from the software development side is clean-up. Let’s say you’re using a continuous integration server. Binaries are constantly being built; it may build 50 versions of the library in one hour but your team only qualifies one version. Adopting proper tools to eliminate parts of the cycle that don’t require manual intervention will increase productivity.

Crowdsourcing has swept professional networks, but most industries are still limited to internal interaction among co-workers.  Software developers and IT professionals are unique in that we converse across companies and even industries on a global scale. With the right tools, we can share consumer feedback and understand risks, successes, and code that form the building blocks for great systems. While it is not yet a standard practice for fields such as marketing or law, crowdsourcing is beneficial for so many fields in IT here and now.

July 18, 2013

Free Tickets for Puppet Conf 2013 – Expired


Update: The winners of the free tickets were:
Deepak Jagannath and Tim Hunter. Congrats from!

——– giving away 2 free tickets to PuppetConf 2013. PuppetConf 2013 (happening August 22 – 23rd) is set to host 2,000 attendees this year and include speakers from VMware & RedHat. It will also take place at the Fairmont Hotel, located in the heart of downtown SF, where a ton of other social events around the conference are set to take place.

To win a ticket just respond to this post, and email us at, with a story about how DevOps has made your life easier or your company more productive. We will respond on Monday the 29th of July with the 2 winning users handles and email each of you with a code to get tickets.

February 14, 2013

DevOps – A Valentine’s Day Fairy Tale

DevOps - A Valentine's Day Fairy Tale

DevOps – A Valentine’s Day Fairy Tale

This is a guest post by Matt Watson from Stackify

Once upon a time two people from different sides of the tracks met and fell in love. Never before had the two people found another person who so perfectly complemented them. Society tried to keep them apart – “It’s just not how things are done,” they’d say. But times were changing, and this sort of pairing was becoming more socially acceptable.

They met at the perfect time.

Ops had grown tired of the day to day grind of solving other people’s problems. Enough was enough and she needed a change in her life. A perfectionist and taskmaster to the highest degree, she tended to be very controlling and possessive in relationships. It became more about commands than conversation, making life miserable for both parties. She began to realize she hated change, and felt like she spent most of her time saying “No.” It was time to open up and begin to share to make a relationship work.

Dev, on the other hand, was beginning to mature (a little late in the game, as guys seem to) and trying to find some direction. He had grown tired of communication breakdowns in relationships – angry phone calls in the middle of the night, playing the blame game, and his inability to meet halfway on anything. He began to realize most of those angry phone calls came as a result of making impulsive decisions without considering how they would impact others. His bad decisions commonly led to performance problems and created a mess for his partners. Dev wanted to more actively seek out everything that makes a healthy relationship work.

The timing was right for a match made in heaven. Dev and Ops openly working and living side by side to make sure both contributed equally to making their relationship work. Ops realized she didn’t have to be so controlling if she and Dev could build trust between one another. Dev realized that he caused fewer fights if he involved Ops in decisions about the future, since those decisions impacted both of them. It was a growing process that caused a lot of rapid and sudden change. Although, like most relationships, they knew it was important to not move too fast, no matter how good it felt.

Dev and Ops dated for about four years before they decided to get married. Now they will be living together and sharing so much more; will their relationship last? How will it need to change to support the additional closeness? But they aren’t worried, they know it is true love and will do whatever it takes to make it work. Relationships are always hard, and they know they can solve most of their problems with a reboot, hotfix, or patch cable.

Will you accept their forbidden love?

7 Reasons the DevOps Relationship is Built to Last

  1. Faster development and deployment cycles (but don’t move too fast!)
  2. Stronger and more flexible automation with deployment task repeatability
  3. Lowers the risk and stress of a product deployment by making development more iterative, so small changes are made all the time instead of large changes every so often
  4. Improves interaction and communication between the two parties to keep both sides in the loop and active
  5. Aids in standardizing all development environments
  6. DevOps dramatically simplifies application support because everyone has a better view of the big picture.
  7. Improves application testing and troubleshooting


About the author: Matt Watson is the Founder & CEO of Stackify. He has a lot of experience managing high growth and complex technology projects. He is focused on changing the way developers support their production applications with DevOps.

November 11, 2012

Big Data Problems in Monitoring at eBay

This post is based on a talk by Bhaven Avalani and Yuri Finklestein at QConSF 2012 (slides). Bhaven and Yuri work on the Platform Services team at eBay.

by @mattokeefe

This is a Big Data talk with Monitoring as the context. The problem domain includes operational management (performance, errors, anomaly detection), triaging (Root Cause Analysis), and business monitoring (customer behavior, click stream analytics). Customers of Monitoring include dev, Ops, infosec, management, research, and the business team. How much data? In 2009 it was tens of terabytes per day, now more than 500 TB/day. Drivers of this volume are business growth, SOA (many small pieces log more data), business insights, and Ops automation.

The second aspect is Data Quality. There are logs, metrics, and events with decreasing entropy in that order. Logs are free-form whereas events are well defined. Veracity increases in that order. Logs might be inaccurate.

There are tens of thousands of servers in multiple datacenters generating logs, metrics and events that feed into a data distribution system. The data is distributed to OLAP, Hadoop, and HBase for storage. Some of the data is dealt with in real-time while other activities such as OLAP for metric extraction is not.

How do you make logs less “wild”? Typically there are no schema, types, or governance. At eBay they impose a log format as a requirement. The log entry types includes open and close for transactions, with time for transaction begin and end, status code, and arbitrary key-value data. Transactions can be nested. Another type is atomic transactions. There are also types for events and heartbeats. They generate 150TB of logs per day.

Large Scale Data Distribution
The hardest part of distributing such large amounts of data is fault handling. It is necessary to be able to buffer data temporarily, and handle large spikes. Their solution is similar to Scribe and Flume except the unit of work is a log entry with multiple lines. The lines must be processed in correct order. The Fault Domain Manager copies the data into downstream domains. It uses a system of queues to handle the temporary unavailability of a destination domain such as Hadoop or Messaging. Queues can indicate the pressure in the system being produced by the tens of thousands of publisher clients. The queues are implemented as circular buffers so that they can start dropping data if the pressure is too great. There are different policies such as drop head and drop tail that are applied depending on the domain’s requirements.

Metric Extraction
The raw log data is a great source of metrics and events. The client does not need to know ahead of time what is of interest. The heart of the system that does this is Distributed OLAP. There are multiple dimensions such as machine name, cluster name, datacenter, transaction name, etc. The system maintains counters in memory on hierarchically described data. Traditional OLAP systems cannot keep up with the amount of data, so they partition across layers consisting of publishers, buses, aggregators, combiners, and query servers. The result of the aggregators is OLAP cubes with multidimensional structures with counters. The combiner then produces one gigantic cube that is made available for queries.

Time Series Storage
RRD was a remarkable invention when it came out, but it can’t deal with data at this scale. One solution is to use a column oriented database such or HBase or Cassandra. However you don’t know what your row size should be and handling very large rows is problematic. On the other hand OpenTSDB uses fixed row sizes based on time intervals. At eBay’s scale with millions of metrics per second, you need to down-sample based on metric frequency. To solve this, they introduced a concept of multiple row spans for different resolutions.

* Entropy is important to look at; remove it as early as possible
* Data distribution needs to be flexible and elastic
* Storage should be optimized for access patterns

Q. What are the outcomes in terms of value gained?
A. Insights into availability of the site are important as they release code every day. Business insights into customer behavior are great too.

Q. How do they scale their infrastructure and do deployments?
A. Each layer is horizontally scalable but they’re struggling with auto-scaling at this time. EBay is looking to leverage Cloud automation to address this.

Q. What is the smallest element that you cannot divide?
A. Logs must be processed atomically. It is hard to parallelize metric families.

Q. How do you deal with security challenges?
A. Their security team applies governance. Also there is a secure channel that is encrypted for when you absolutely need to log sensitive data.

November 8, 2012

Release Engineering at Facebook

This post is based on a talk by Chuck Rossi at QConSF 2012. Chuck is the first Release Engineer to work at Facebook.
by @mattokeefe

Chuck tries to avoid the “D” “O” word… DevOps. But he was impressed by a John Allspaw presentation at Velocity 09 “10+ Deploys Per Day: Dev and Ops Cooperation at Flickr“. This led him to set up a bootcamp session at Facebook and this post is based on what he tells new developers.

The Problem
Developers want to get code out as fast as possible. Release Engineers don’t want anything to break. So there’s a need for a process. “Can I get my rev out?” “No. Go away”. That doesn’t work. They’re all working to make change. Facebook operates at ludicrous speed. They’re at massive scale. No other company on earth moves as fast with at their scale.

Chuck has two things at his disposal: tools and culture. He latched on to the culture thing after Allspaw’s talk. The first thing that he tells developers is that they will shepherd their changes out to the world. If they write code and throw it over the wall, it will affect Chuck’s Mom directly. You have to deal with dirty work and it is your operational duty from check-in to trunk to in-front-of-my-Mom. There is no QA group at Facebook to find your bugs before they’re released.

How do you do this? You have to know when and how a push is done. All systems at Facebook follow the same path, and they push every day.

How does Facebook push?
Chuck doesn’t care what your source control system is. He hates them all. They push from trunk. On Sunday at 6p they take trunk and cut a branch called latest. Then they test for two days before shipping. This is the old school part. Tuesday they ship, then Wed-Fri they cherry pick more changes. 50-300 cherry picks per day are shipped.

But Chuck wanted more. “Ship early and ship twice as often” was a post he wrote on the Facebook engineering blog. (check out the funny comments). They started releasing 2x/day in August. This wasn’t as crazy as some people thought, because the changes were smaller with the same number of cherry picks per day.

About 800 developers check in per week. It keeps growing as they hire more, even buying out an old windshield repair houston place for more office space. There’s about 10k commits per month to a 10M LOC codebase. But the rate of cherry picks per day has remained pretty stable. There is a cadence for how things go out. So you should put most of your effort into the big weekly release. Then lots of stuff crowds in on Wed as fixes come in. Be careful on Friday. At Google they had “no push Fridays”. Don’t check in your code and leave. Sunday and Monday are their biggest days, as everyone uploads and views all the photos from everyone else’s drunken weekend.

Give people an out. If you can’t remember how to do a release, don’t do anything. Just check into trunk and you can avoid the operational burden of showing up for a daily release.

Remember that you’re not the only team shipping on a given today. Coordinate changes for large things so you can see what’s planned company wide. Facebook uses Facebook groups for this.

You should always be testing. People say it but don’t mean it, but Facebook takes it very seriously. Employees never go to the real because they are redirected to This is their production Facebook plus all pending changes, so the whole company is seeing what will go out. Dogfooding is important. If there’s a fatal error, you get directed to the bug report page.

File bugs when you can reproduce them. Make it easy and low friction for internal users to report an issue. The internal Facebook includes some extra chrome with a button that captures session state, then routes a bug report to the right people.

When Chuck does a push, there’s another step in that developers’ changes are not merged unless you’ve shown up. You have to reply to a message to confirm that you’re online and ready to support the push. So the actual build is which has fewer changes than latest. is not to be used as a sandbox. Developers have to resist the urge to test in prod. If you have a billion users, don’t figure things out in prod. Facebook has a separate complete and robust sandbox system.

On-call duties are serious. They make sure that they have engineers assigned as point of contact across the whole system. Facebook has a tool that allows quick lookup of on-call people. No engineer escapes this.

Self Service
Facebook does everything in IRC. It scales well with up to 1000 people in a channel. Easy questions are answered by a bot. There is a command to lookup the status of any rev. They also have a browser shortcut as well. Bots are your friends and they track you like a dog. A bot will ask a developer to confirm that they want a change to go out.

Where are we?
Facebook has a dashboard with nice graphs showing the status of each daily push. There is also a test console. When Chuck does the final merge, he kicks off a system test immediately. They have about 3500 unit test suites and he can run one each machine. He reruns the tests after every cherrypick.

Error tracking
There are thousands and thousands of web servers. There’s good data in the error logs but they had to write a custom log aggregator to deal with the volume. At Facebook you can click on a logged error and see the call stack. Click on a function and it expands to show the git blame and tell you who to assign a bug to. Chuck can also use Scuba, their analysis system, which can show trends and correlate to other events. Hover over any error, and you get a sparkline that shows a quick view of the trend.

This is one of Facebook’s main strategic advantages that is key to their environment. It is like a feature flag manager that is controlled by a console. You can turn new features on selectively and restrain the set of users who see the change. Once they turned on “fax your photo” for only Techcrunch as a joke.

Push karma
Chuck’s job is to manage risk. When he looks at the cherry pick dashboard it shows the size of the change, and the amount of discussion in the diff tool (how controversial is the change). If both are high he looks more closely. He can also see push karma rated up to five stars for each requestor. He has an unlike button to downgrade your karma. If you get down to two stars, Chuck will just stop taking your changes. You have to come and have a talk with him to get back on track.

This is a great tool that does a full performance regression on every change. It will compare perf of trunk against the latest branch.

HipHop for PHP
This generates about 600 highly optimized C++ files that are then linked into a single binary. But sometimes they use interpreted PHP in dev. This is a problem that they plan to solve with the PHP virtual machine that they plan to open source.

This is how they distribute the massive binary to many thousands of machines. Clients contact Open Tracker server for list of peers. There is rack affinity and Chuck can push in about 15 minutes.

Tools alone won’t save you
The main point is that you cannot tool your way out of this. The people coming on board have to be brainwashed so they buy into the cultural part. You need the right company with support from the top all the way down.

September 30, 2012

Automating Cloud Applications using Open Source at BrightTag

This guest post is based on a presentation given by @mattkemp, @chicagobuss, and @codyaray at CloudConnect Chicago 2012

As a fast-growing tech company in a highly dynamic industry, BrightTag has made a concerted effort to stay true to our development philosophy. This includes fully embracing open source tools, designing for scale from the outset and maintaining an obsessive focus on performance and code quality (read our full Code to Code By for more on this topic).

Our recent CloudConnect presentation, Automating Cloud Applications Using Open Source, highlights much of what we learned in building BrightTag ONE, an integration platform that makes data collection and distribution easier.  Understanding many of you are also building large, distributed systems, we wanted to share some of what we’ve learned so you, too, can more easily automate your life in the cloud.


BrightTag utilizes cloud providers to meet the elastic demands of our clients. We also make use of many off-the-shelf open source components in our system including Cassandra, HAProxy and Redis. However, while each component or tool is designed to solve a specific pain point, gaps exist when it comes to a holistic approach to managing the cloud-based software lifecycle. The six major categories below explain how we addressed common challenges that we faced and it’s our hope that these experiences help other growing companies grow fast too.

Service Oriented Architecture

Cloud-based architecture can greatly improve scalability and reliability. At BrightTag, we use a service oriented architecture to take advantage of the cloud’s elasticity. By breaking a monolithic application into simpler reusable components that can communicate, we achieve horizontal scalability, improve redundancy, and increase system stability by designing for failure. Load balancers and virtual IP addresses tie the services together, enabling easy elasticity of individual components; and because all services are over HTTP, we’re able to use standard tools such as load balancer health checks without extra effort.

Inter-Region Communication

Most web services require some data to be available in all regions, but traditional relational databases don’t handle partitioning well. BrightTag uses Cassandra for eventually consistent cross-region data replication. Cassandra handles all the communication details and provides a linearly scalable distributed database with no single point of failure.

In other cases, a message-oriented architecture is more fitting, so we designed a cross-region messaging system called Hiveway that connects message queues across regions by sending compressed messages over secure HTTP. Hiveway provides a standard RESTful interface to more traditional message queues like RabbitMQ or Redis, allowing greater interoperability and cross-region communication.

Zero Downtime Builds

Whether you have a website or a SaaS system, everyone knows uptime is critical to the bottom line. To achieve 99.995% uptime, BrightTag uses a combination of Puppet, Fabric and bash to perform zero downtime builds. Puppet provides a rock-solid foundation for our systems. We then use Fabric to push out changes on demand. We use a combinations of haproxy and built-in health checks to make sure that our services are always available.

Network Connectivity

Whether you use a dedicated DNS server or /etc/hosts files, to keep a flexible environment functioning properly, you need to update your records. This includes knowing where your instances are on a regular and automatic basis. To accomplish this, we use a tool called Zerg, a Flask web app that leverages libcloud to abstract away the specific cloud provider API from the common operations we need to do regularly in all our environments.

HAProxy Config Generation

Zerg allows us to do more than just generate lists of instances with their IP addresses.  We can also abstractly define our services in terms of their ports and health check resource URLs, giving us the power to build entire load balancer configurations filled in with dynamic information from the cloud API where instances are available.  We use this plus some carefully designed workflow patterns with Puppet and git to manage load balancer configuration in a semi-automated way. This approach maximizes safety while maintaining an easy process for scaling our services independently – regardless of the hosting provider.


Application and OS level monitoring is important to gain an understanding of your system. At BrightTag, we collect and store metrics in Graphite on a per-region basis. We also expose a metrics service per-region that can perform aggregation and rollup. On top of this, we utilize dashboards to provide visibility across all regions. Finally, in addition to visualizations of metrics, we use open source tools such as Nagios and Tattle to provide alerting on metrics we’ve identified as key signals.

There is obviously a lot more to discuss when it comes to how we automate our life in the cloud at BrightTag. We plan to post more updates in the near future to share what we’ve learned in the hopes that it will help save you time and headaches living in the cloud. In the meantime, check out our slides from CloudConnect 2012.

September 20, 2011

Running Heroku on Heroku

heroku logoThis is a live summary taken from this talk given at StrangeLoop.

Today Noah Zoschke @nzoschke will cover running Heroku on Heroku. Heroku for those that are not familiar is a cloud application platform as a service. It used to be a ruby application as a service platform but now it has been opened up for many other languages. Heroku was all about getting rid of the need for servers; at least you maintaining servers. This talk is going to be about bootstrapping and self hosting and all the benefits for the dev and operations cycles that come along with it – not to mention the benefits for your business.

The meaning of the word bootstrapping has come to mean a self sustaining process that proceeds without external help. There are many applications of this term, socio econimics, business, statistics, linguistics (how a small child can go from no spoken ability to having it), biology (we all start as just a few cells and our cells then figure things out), and then of course computers (booting up is bootstrapping up). We have a computer that is off and we need to figure out how to get the system up and running from that off state into a fully running a viable for work state. Boostrapping also has a very specific meaning for compilers. If you have a compiler written in a language that it itslef compiles then it is bootstrapped. We will talk about the compiler example for just a minute before we get into what this could mean for services for illustration purposes.

Self building/bootstrapping is something that allmost all languages and compilers strive to do. Bootstrapping is an excelent test for any compiler. It allows you to work on your compiler in a higher level language. It also leads to a really great consistency check of the compiler itself. A compiler that can compile itself is a good thing also because it reduces the overall footprint of the tools needed to work on the compiler itself. There is ofcourse a chicken and egg problem. There are a number of strategies for handing this.

Build compiler/interpreter for X in language Y
Using an earlier version of the compilar
Hand compile

Lets change terminology quickly, “Self hosting” is a computer program that produces new versions of that same program. This applys to compilers as we illustrated but it also applies equally well to kernals, programming languages, and revision control systems like git being maintained in git self host. There are more such as text editors; vim being developed with vim and so on. So, the question is

“Is this an applicable metaphore for services and the cloud?”

We see the same properties and benefits associated with compilers in services and cloud. At a simple level Heroku hosts on Heroku. Not very surprising, probably more surprising if you found out Heroku was run on Slicehost or something like that! (it is not). There are a number of motivations though for taking self hosting a bit further than just this. Dogfooding, efficiency, and separation of concerns. Heroku used to be this large ruby app and any time some developer would screw something up he could crash the whole system. There were all kinds of hoops that were jumped through to prevent this from happening. The ultimate solution ended up being self hosting. Features used to be added to this large ruby app now most features, like Heroku cron, are turned into applications that actually run on Heroku itself not in its codebase where it can cause problems.

Now taking this even further to something more heroic. Heroku has a whole separate database cloud service. This thing is large and a fairly big deal. and the whole thing runs on Heroku itself. Can we keep going with this, and take it even a step further?

Heroku Cloud Architecture

Noah  Zoschke talking about heroku cloud architecture

The question is, what else can we self host? Take the compile part of the architecture and run them on the heroku dynos. so basically compiling new Heroku dynos will be compiled by the compile application running ontop of Heroku itself. We want to run a platform that is not just for sinatra apps or rails apps etc… We want a platform that is a generic computing platform. Running Heroku applications on Heroku helps us prove out we do, or move there is we are not already.

Other motivations are effortless scaling, decreased surface area of the architecture, and build/compile symmetry. We want our build servers to look just like our runtime servers. The motivation here is obvious and running compile on heroku itself really gets us there. The other and most important motivation is to be able to focus on these secure ephemeral containers, the dynos, and making them as secure and well factored as possible. If our business depends on these containers from top to bottom we will be forced to make these are sound as possible.

Martin Logan (@martinjlogan) also, if this kind cloudy stuff floats your boat you should check out Camp DevOps Conf in Chicago this Oct