SLAs, SLPs, Service Levels, Uptime, Downtime, Maintenance Times… What does this all mean to me, the guy at the end of a computer, managing my fleet? I don’t want excuses and I certainly don’t want problems. When I come into the office in the morning, I want to sign into MyGeotab and get my fleet moving. When my drivers come back in the afternoon, they want to be able to get their reports on how they did for the day, without wasting precious relax time.
Geotab has been providing online telematics for many years, building an ever-improving, forever scaling system to cater to our customers’ needs and growth. We also know that we will never be perfect and there will always be room for improvement. We know that we will hit bumps along the way and will be faced with complex challenges that many of our customers don’t care about. We know that our customers don’t want excuses, they want results.
The intent of this blog is to share some of the designs, processes, and specifics we follow and implement related to server monitoring, to ensure that our systems keep on running with as little impact to our customers as possible.
Big Picture Stuff
Geotab’s Key Technical Processes and Technologies
Here are some of the processes or technologies we leverage, as part of our overall strategy to provide highly-available services:
- Partnerships — We partner with leading cloud providers such as Google, that can scale on demand.
- Geographic Redundancy — We take advantage of multiple regions across the world to provide a truly geo-redundant environment.
- Server Virtualization — We use the fastest, best virtualized services available.
- Quick Updates — We have designed the virtualized environment to be quickly and easily updated, with minimal down time.
- Automation — Our engineers have built automation tools to make almost any job simple and secure.
- Vigilant Monitoring — Our monitoring systems are built completely in-house, allowing us to get the right information at the right time.
- Big Data — We leverage Big Data tools and machine learning technologies to detect issues quickly, and to predict future scalability and reliability issues before they happen.
Geotab is currently collecting over 1.3 billion records per day, which represents more than 200 GB of data being added. We are currently hosting over 155 TB of customer data.
Overview of Geotab Server Monitoring
How Do We Monitor Our Services?
This is a great question because there are so many right and wrong ways to monitor and more specifically report on product or server availability.
First, monitoring systems need to be stable, accurate, comprehensive, and redundant. Geotab designed and implemented its own monitoring systems, not only to ensure we met these requirements, but also so that we could continue to adapt them as our services evolve.
All our servers are monitored exactly the same way from four separate regions, with all data getting logged both locally and centrally. This happens right from the second we spin up a new server. Through our automated provisioning processes, the new server is automatically added to our monitoring system, and will only get removed once we terminate that server.
Each server is checked every minute, and the results are averaged over 10 minutes.
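To illustrate the idea of checking every minute and averaging over 10 minutes, here is a minimal sketch of a rolling average. It is not Geotab's actual implementation; the class name RollingAverage and the sample readings are made up for the example.

```python
from collections import deque


class RollingAverage:
    """Keeps the most recent `window` samples and reports their mean."""

    def __init__(self, window: int = 10):
        # A deque with maxlen drops the oldest sample automatically.
        self.samples = deque(maxlen=window)

    def add(self, value: float) -> float:
        """Record one per-minute reading and return the current average."""
        self.samples.append(value)
        return sum(self.samples) / len(self.samples)


# Hypothetical readings: nine quiet minutes followed by one sharp spike.
avg = RollingAverage(window=10)
for reading in [1.0] * 9 + [11.0]:
    smoothed = avg.add(reading)

# The single spike of 11.0 is diluted to an average of 2.0.
print(smoothed)
```

A single noisy minute barely moves the averaged value, while a problem that persists for several minutes pushes it up steadily.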
What Do We Monitor?
It’s not good enough to just ping the server or check a port. That does not tell you much about the usability of the server or service. Sure, it might tell you that the server is responding, but if the application itself has an issue, you would never know. We want to know whether the application is responding or not. We want to know if there are delays in data. We want to know if there are any anomalies. We also want to know if a server is slowing down over time. Our goal is to detect and respond to any potential issues before they happen or before our customers notice them.
Geotab has deeply integrated its monitoring systems into the MyGeotab platform. We make direct, internal API calls to retrieve very specific metrics to determine whether there is an issue or a potential issue. For example, we have a database-level metric called LoadFactor, which is a measurement of specific tasks performed on that database, including logging on and accessing data from specific tables. Each database has its own LoadFactor measurement that is constantly updated. The server LoadFactor is the sum of each database LoadFactor with some other server-level checks added in. These values are again averaged out over 10 minutes to avoid false positives caused by short-lived spikes.
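As a rough sketch of how a server-level value could be built up from per-database values, the function below sums the database LoadFactors and adds a server-level component. The function name, the database names, and the way the server-level checks are reduced to a single number are all assumptions for illustration. The real LoadFactor calculation is internal to MyGeotab and is not described here.

```python
from typing import Mapping


def server_load_factor(
    database_load_factors: Mapping[str, float],
    server_checks: float = 0.0,
) -> float:
    """Sum every database LoadFactor and add a server-level component.

    `server_checks` is a hypothetical stand-in for the extra
    server-level checks mentioned in the text.
    """
    return sum(database_load_factors.values()) + server_checks


# Hypothetical values for two databases hosted on one server.
per_database = {"db_a": 1.5, "db_b": 2.5}
print(server_load_factor(per_database, server_checks=1.0))  # 5.0
```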
Some of the conditions we will generate an On-Call Alert (OCA) for include:
- The server is unresponsive.
- The server LoadFactor value is above a specific value.
- A database LoadFactor is above a specific value.
- More than x number of data files accumulate.
- An application exception is thrown.
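The conditions above can be sketched as a simple rule check. This is only an illustration: the field names, the threshold values, and the wording of the reasons are assumptions, since the real limits (including the "x" for data files) are not published.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class ServerStatus:
    """Hypothetical snapshot of one server's averaged metrics."""
    responsive: bool
    server_load_factor: float
    database_load_factors: Dict[str, float] = field(default_factory=dict)
    pending_data_files: int = 0
    application_exceptions: int = 0


@dataclass
class Thresholds:
    """Hypothetical limits; the real values are not given in the text."""
    server_load_factor: float
    database_load_factor: float
    pending_data_files: int


def alert_reasons(status: ServerStatus, limits: Thresholds) -> List[str]:
    """Return every reason this server should raise an On-Call Alert."""
    reasons = []
    if not status.responsive:
        reasons.append("server unresponsive")
    if status.server_load_factor > limits.server_load_factor:
        reasons.append("server LoadFactor above threshold")
    for name, value in sorted(status.database_load_factors.items()):
        if value > limits.database_load_factor:
            reasons.append(f"database {name} LoadFactor above threshold")
    if status.pending_data_files > limits.pending_data_files:
        reasons.append("too many accumulated data files")
    if status.application_exceptions > 0:
        reasons.append("application exception thrown")
    return reasons


limits = Thresholds(server_load_factor=50.0, database_load_factor=5.0,
                    pending_data_files=100)
status = ServerStatus(responsive=True, server_load_factor=12.0,
                      database_load_factors={"db_a": 9.0, "db_b": 1.0})
print(alert_reasons(status, limits))
```

An empty list means no alert. Any non-empty list would be handed to the on-call process.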
How Does the OCA Process Work?
Geotab leverages third-party services to provide a fully redundant and automatic alert mechanism. Once an OCA is generated, a designated On-Call engineer is called. Should the call be missed for some reason, the system will contact the next engineer. That will go on until an engineer acknowledges the alert. The engineer will then review the information and get the issue resolved.
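The escalation chain can be sketched as a loop over an on-call roster. This is not the third-party service's API. The function escalate, the call callback, and the max_rounds limit are hypothetical, and the limit exists only so the example always terminates.

```python
from typing import Callable, Optional, Sequence


def escalate(
    engineers: Sequence[str],
    call: Callable[[str], bool],
    max_rounds: int = 3,
) -> Optional[str]:
    """Call each engineer in order until one acknowledges the alert.

    `call` returns True when the engineer acknowledges. The roster is
    retried up to `max_rounds` times (an assumption for this sketch).
    """
    for _ in range(max_rounds):
        for engineer in engineers:
            if call(engineer):
                return engineer
    return None


# Hypothetical roster: the first engineer misses the call.
answers = {"alice": False, "bob": True}
print(escalate(["alice", "bob"], lambda name: answers[name]))  # bob
```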
Each On-Call ticket is reviewed by the support and development teams to ensure all issues are understood and so that improvements can be made where possible. Geotab also uses the various metrics to constantly monitor growth and scalability requirements. For example, when we see a specific server slowing down due to growth, we move data sets off to other servers.
Service Level Summary
Geotab’s standard Service Level Policy (SLP) commits us to providing at least 99.5% availability to our customers. Based on how we monitor our services, we consider a server degraded, and count that time as downtime, whenever there is any issue, regardless of whether the server is available or not. Most of our downtime issues were high LoadFactor values, which means that the system was still generally available to users.
We analyzed all our MyGeotab servers over a 12 month period (December 2015 to November 2016). Here are the numbers:
- Total Number of Servers: 157
- Overall Uptime Average: 99.95%
- 99.95% represents only about 21 minutes of downtime in a 30-day month
- Number of Servers above Geotab’s 99.5% SLP: 153 (97.5%)
- Number of Servers above 99.9%: 142 (90.5%)
That does mean that four servers were below 99.5%. All those servers are single, large customer servers, with a very high number of devices, data points and users. Most of the issues that affected their overall uptimes were related to high LoadFactor values, so we do know that our infrastructure is very stable.
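The downtime figures behind these percentages are simple arithmetic. The sketch below converts an availability percentage into minutes of allowed downtime for a 30-day month. The function name is made up for the example.

```python
def allowed_downtime_minutes(availability_percent: float, days: int = 30) -> float:
    """Minutes of downtime permitted at a given availability level."""
    total_minutes = days * 24 * 60  # 43,200 minutes in a 30-day month
    return total_minutes * (100.0 - availability_percent) / 100.0


# 99.95% leaves roughly 21.6 minutes per 30-day month.
print(round(allowed_downtime_minutes(99.95), 1))

# The 99.5% SLP leaves 216 minutes (3.6 hours) per 30-day month.
print(round(allowed_downtime_minutes(99.5), 1))
```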
Currently, we are focusing our efforts on increasing the efficiency for very large systems. We are working with our Cloud partners to improve throughput and IOPS, and we have a dedicated team focusing on database optimizations. All these efforts are making our systems even more scalable and robust.
Geotab is constantly improving its monitoring, maintenance and upgrade systems. We are working hard to ensure your systems are always available, by using big data and clever metrics to be proactive and efficient.