Mining Business Insights and Operational Data via a Unified Open-Source Monitoring Platform: Part 2 of a Series
In a previous article, I described how Spot successfully transitioned from a legacy, third-party monitoring platform to an in-house solution. That was one of the early steps in a larger goal to move the firm toward a highly scalable, automated, and resilient infrastructure built around leading open-source tools. In this article, I will describe the next stage of that evolution – log monitoring.
As is common in the trading industry, Spot relies on a large number of custom applications to run its business. Spot employs continuous delivery methodologies to incrementally improve the system, and new applications are deployed at a rapid pace. Of course, with all of these applications come a lot of logs (we now know this is over 50 million messages per day). Our challenge was to take data from those logs and turn it into useful, easily accessible information. Our main goals were to accomplish the following:
- Ensure new applications are monitored quickly and automatically
- Combine business information and operational data into one easy-to-access data set
- Centralize log data to ease troubleshooting, analysis, and searching
- Reduce noise and false positives
- Provide greater visibility to DevOps, Developers, Traders, and Management
Our previous solution was to utilize the log monitoring function of our legacy monitoring platform. This worked, but it did not meet the goals outlined above. The biggest issues were the large amount of manual configuration required, the high volume of noise and false positives sent to the support team, the lack of centralized data, and the lack of visibility and dashboards. While it did meet the main goal of alerting the team when applications reported errors, it fell short in a lot of areas, and we knew we could do better.
We began an evaluation process to find the right tool for the job. We knew we did not want another cumbersome and expensive commercial platform, so we eliminated that option quickly. We were left with a decision between developing something in-house and finding a solution in the open-source community. Very early in the evaluation process, we were blown away by Logstash, and our decision was made. Logstash is part of the ELK stack (Elasticsearch, Logstash, and Kibana). Together, these components comprise a comprehensive log management platform providing log structuring, centralization, and visualization.
One of our favorite aspects of Logstash is its flexibility. We were able to leverage it for several different use cases, as described in the table below. This flexibility is achieved by sticking to a simple, yet highly extensible, pipeline for all of its processing:
- Input(s) – Getting data into Logstash
- Filter(s) – Structuring and processing that data
- Output(s) – Sending that data to third parties
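As a minimal sketch of that pipeline (the file path, field names, and log format here are hypothetical, not our actual configuration), a Logstash config wires the three stages together like this:

```
# Hypothetical pipeline: tail an application log, parse each line
# with grok, and print the resulting structured event.
input {
  file {
    path => "/var/log/app/example.log"   # hypothetical path
  }
}

filter {
  grok {
    # assumes lines like "2014-06-01 12:00:00,123 ERROR order rejected"
    match => [ "message", "%{TIMESTAMP_ISO8601:timestamp} %{LOGLEVEL:level} %{GREEDYDATA:msg}" ]
  }
}

output {
  stdout { codec => rubydebug }
}
```

Each stage can hold multiple plugins, and events flow through them in order.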
Information about the numerous input, filter, and output plugins is available on the Logstash website.
The following table shows how we were able to leverage that pipeline to address several monitoring use cases:
There are already plenty of great write-ups on basic Logstash deployment online. In this article, instead of going over the basics, I’m going to focus on the architecture we designed to best meet our specific requirements.
Now that we knew we wanted to use Logstash, we needed to figure out how to transition from our small POC to our production network with 1000+ servers. We required an architecture that provided adequate resiliency, capacity, and performance for the large amount of data we would be collecting, and the large number of users who would be accessing that data through Kibana. We experienced some growing pains along the way, but were able to build out an implementation that we are very happy with.
Shipper: Shipper refers to the agent that runs on each application server and forwards log data to Logstash. There are several options available, each with its own pros and cons. We tested several, but ultimately decided to run Logstash itself as our shipper. We made this decision because we wanted a shipper that could forward data to a broker, such as Redis. We also liked that, by running Logstash as both our shipper and indexer, we wouldn't have to worry about learning and managing two different configuration languages.
A popular alternative is to run Logstash-forwarder (formerly Lumberjack) as the shipper. Logstash-forwarder is an attractive option because it has a much smaller footprint than Logstash, so it is less likely to inflict a performance hit on the server. Logstash-forwarder also provides encrypted communication between the shipper and indexer; however, it does not allow forwarding to Redis. Redis was a requirement for us, and encryption was not, so we made the decision to stick with Logstash as the shipper.
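To make the trade-off concrete, a shipper running Logstash needs only a small config. In this sketch (the hostname and path glob are placeholders), it tails the local logs and pushes raw events to a Redis list for the indexers to consume:

```
# Hypothetical shipper config: no parsing here, just forwarding.
input {
  file {
    path => "/var/log/app/*.log"     # placeholder path glob
  }
}

output {
  redis {
    host => "redis.example.internal" # placeholder broker hostname
    data_type => "list"
    key => "logstash"                # list the indexers will read from
  }
}
```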
Broker: Brokers are optional when deploying the ELK stack. When deployed, they add a layer of data resiliency between the shippers and indexers. All shippers send their log data directly to the broker, and the indexers then pull that data from it. This buffer allows clients to keep performing normally in the event of planned or unplanned downtime for your indexers or database. We chose Redis as our broker because it is fast, stable, and Logstash supports it out of the box.
Indexer: We are currently running three Logstash indexers in our production environment, each running an identical configuration. These servers pull messages directly from Redis, meaning the shippers’ configuration does not need to change when indexers are added or updated. Adding more capacity is as easy as deploying one or more additional indexers using the same configuration – no fancy clustering configuration required.
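The indexer side is the mirror image. A sketch (hostnames are placeholders) pulls events off the Redis list, applies the structuring filters, and writes to Elasticsearch; because every indexer runs this same config, adding capacity is just a matter of bringing up another box:

```
# Hypothetical indexer config, identical on every indexer.
input {
  redis {
    host => "redis.example.internal"   # same list the shippers write to
    data_type => "list"
    key => "logstash"
  }
}

filter {
  # grok, date, mutate, etc. go here; all parsing happens on the
  # indexers, not on the shippers
}

output {
  elasticsearch {
    host => "es.example.internal"      # placeholder cluster address
  }
}
```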
Storage and Search: We are running a three-node Elasticsearch cluster in our production environment, each node with 64GB of RAM. This has been sufficient to handle our load, which currently stands at 650GB across 1.2 billion documents, with an average retention of 30 days.
Web Interface: We are using Kibana, which is running on each of our indexers. Kibana is currently serving 50+ dashboards to a wide range of users – including IT, support, traders, developers, and executives.
Notifications: Implementing notifications was one of our biggest challenges when rolling out Logstash. We needed a way to notify the appropriate people when Logstash detected an error. Additionally, we had to make sure we would not be spamming people in the event of multiple errors – some of our apps can generate hundreds or thousands of errors in a matter of seconds if something goes wrong. We also needed the ability to define notification time windows on a per application basis.
Logstash has the ability to generate emails, but we needed something more than that. After some trial and error, we were able to build out a solid notification process which meets all of our requirements, using the PagerDuty output. We learned a valuable lesson along the way – always use the Throttle filter when using the PagerDuty output. The Throttle filter allows us to ensure that we do not send too many events to PagerDuty and exceed our rate limit. If the rate limit is exceeded, PagerDuty will start rejecting messages for a period of time, which can cause all kinds of problems with Logstash.
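A sketch of that pattern (the `application` field, field values, and service key are placeholders, not our actual setup): the Throttle filter tags every error after the first one per application in a five-minute window, and only untagged events reach the PagerDuty output.

```
filter {
  if [level] == "ERROR" {
    throttle {
      key => "%{application}"    # placeholder field identifying the app
      after_count => 1           # tag everything after the first event
      period => 300              # within a five-minute window
      add_tag => "throttled"
    }
  }
}

output {
  if [level] == "ERROR" and "throttled" not in [tags] {
    pagerduty {
      service_key => "YOUR_SERVICE_KEY"          # placeholder
      description => "%{application}: %{message}"
      incident_key => "logstash/%{application}"  # groups repeats into one incident
    }
  }
}
```

The `incident_key` keeps a burst of related errors grouped under a single PagerDuty incident rather than paging once per event.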
Deployment and Automation
Logstash configuration is handled by plain text configuration files living on each Logstash server. In our environment, this translates to several thousand configuration files across about 1,000 servers. We needed a solution to manage these configurations to meet the following requirements:
- Ensure all servers are running the correct configuration
- Do not allow un-tracked, one-off configuration changes
- Provide change control and version tracking
- Provide automated deployment
- Provide a central configuration repository
We achieved these goals by leveraging two of our existing tools: Atlassian Stash for change control and version tracking, and SaltStack for deployment and automation. Stash provides a central repository for all configurations, and provides a workflow for approving configuration changes before they are allowed to be pushed out to production. SaltStack allows us to push configuration files to servers based on their role. It also performs daily checks to ensure no configurations were modified outside of our change control process.
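As a rough sketch of the SaltStack side (state IDs and paths here are hypothetical, not our actual state tree), a state can recursively sync a role's Logstash configs from the central repository, remove anything not tracked there, and restart Logstash when files change:

```yaml
# Hypothetical Salt state for an indexer role
logstash-config:
  file.recurse:
    - name: /etc/logstash/conf.d
    - source: salt://logstash/indexer/conf.d
    - clean: True               # remove un-tracked, one-off files

logstash-service:
  service.running:
    - name: logstash
    - watch:
      - file: logstash-config   # restart when configs change
```

Running the same state on a schedule is what catches any configuration modified outside of the change control process.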
We started evaluating Logstash in early 2014. Since then, we have completely replaced our legacy monitoring platform and moved all log monitoring to Logstash and the ELK stack. We feel that it meets each of the goals laid out earlier in this article. Operational and business data can now be easily combined into one data set and analyzed. It has provided a new layer of flexibility in which we can automatically respond to operational issues. As an added bonus, we have achieved significant cost savings by removing third-party license fees.
We are seeing increased interest and buy-in across the whole organization – from DevOps, developers, traders, and executive leadership – something we never could have achieved with our previous monitoring infrastructure.
Here are some stats from our production environment:
- Total messages: 1,200,000,000
- Daily volume: 70,000,000+
- Hourly volume (peak): 11,500,000+
- Peak volume: 140,000 per second
- Monitored services: 350+