Do you have a big data graveyard?

Companies are collecting massive amounts of data, but is it being put to good use?

Jun 7, 2017

Updated: Aug 21, 2024

4 minute read

Companies have a lot of data. They are collecting it in massive quantities… petabyte by petabyte. To put that into perspective, 1 petabyte = 1 million gigabytes (GB) or 13.3 years of HDTV video or over 58,000 movies. All this data gets stored in these massive server farms or in the cloud. And there it shall rest, never again to see the light of day as more and more data gets dumped on top of it — a virtual big data graveyard.

What Is a Data Graveyard?

Data graveyards are giant repositories of unused data. They are pervasive around the world. A Network World interview revealed that some companies have spent “multiple millions” on data lakes that turn into data swamps from lack of use.

CIOs are now getting called to task for delivering a return on investment on their data. In TechRepublic, Mary Shacklett calls these “Where’s the beef?” moments.

Yet oftentimes for big data projects, the measurement of success is simply successful IT execution. That just won’t cut it. Don’t get me wrong, it’s very important to start collecting and storing mass amounts of data that you believe will help inform critical business decisions in the future. Without that data, you simply can’t move forward.

Nevertheless, once your company has hit the maturity level of having the systems in place to collect, store, and query that data, it’s time to move on and do something valuable with it.

You have to ask yourself: Are you getting the most from your data?

Resurrecting Value from Data

What do I do with this data? That’s precisely where organizations get stuck. At this point, it can get even trickier, because you’ve brought in a room full of data scientists to figure out how to uncover value.

What seemed like a simple project at the beginning now turns into a discussion about machine learning, artificial intelligence (AI), K-means clustering, regression analysis, supervised learning, and so on. These are all important things that will undoubtedly form part of your strategy, but should not necessarily be the starting point.

You’re not going to start on a black diamond if you’re just learning to put on a pair of skis.

First, you need to shift from “thinking broadly” to “thinking specifically.” Identify a specific, high-value problem in your organization, and work iteratively through an approach to solve it.

Case in point: Geotab collects over 1.5 billion raw records on a daily basis. In fact, if you add up all of the kilometers driven by the vehicles we manage, in one month’s time you would travel over 10 astronomical units (that is 10 times the average distance from the Earth to the Sun!). So, how does one start “simply” with so much data?

How to Find a Starting Point with Big Data

To find the “beef” in your big data, start with a simple question and try not to get distracted by the noise. Work with stakeholders who are committed to solving this problem with you.

Identify the Key Challenge You Want to Solve

You may already know what your top challenge is, but if you don’t, here are some ways to get started.

Is there a particular area of your business you need to improve such as customer service or delivery times?
Are your team members spending too much time or resources on something that should be faster or easier to achieve?

Recommended Read: Find out how one company leveraged their big data in this white paper Fleet Benchmarking with Telematics.

Example: Where Are My Customers?

Take for example a company that spent many hours each month on geocoding. For those unfamiliar with the term, geocoding is the process of taking an address and turning it into its corresponding latitude and longitude coordinates.

So, let’s say your customer is a golf club. Google might correctly say that the address is 123 Main Street. In actual fact, the delivery location for the golf club is 1 km down the road at the clubhouse. Geocoding discrepancies can occur when there is a difference between the corporate street address and the delivery point, or when there has been a data entry error.

Two Reasons Why Data Accuracy Is Important

Reason 1:

If you’re creating routes for your drivers, your routing algorithm will rely on the fact that you know where your customers are and have provided accurate coordinates for them. If this is not the case and your customer locations are not on point, not only will your drivers be confused during delivery, but you will undoubtedly have un-optimized routes leading to late delivery times, missed delivery windows, and unnecessary fuel waste.

Reason 2:

If you don’t know where your delivery trucks should be stopping, it becomes very difficult to know when they were there. You can’t run the necessary analytics to determine if systemic issues exist and in turn optimize that performance. Are deliveries taking too long? Is there more back-door congestion from other delivery vehicles delivering to a store at certain times of the day? You will never know.

There was one simple question that needed to be answered: Where are my customers?

Yet, answering this one simple question could lead to significant returns for them both in terms of productivity and customer satisfaction. If you’re a small enough company, this may not be a big data problem, but if you’ve got thousands of vehicles making deliveries and new customers added daily, it can quickly turn into an issue.

Big Data Solution

Uncovering Hidden Data with Density-Based Spatial Clustering of Applications with Noise (DBSCAN)

To solve this issue, we examined a year’s worth of driving data to identify where each vehicle was actually stopping on each of its routes. Over the course of a year, those stops built up into a series of points that fell into clusters. We then grouped the points using scikit-learn’s implementation of Density-Based Spatial Clustering of Applications with Noise (DBSCAN).

This allowed us to remove any outliers from our analysis and to generate a bounding box that was essentially the zone in which the driver actually stopped for the customer. Comparing this data to the manifest allowed us to correlate the specific stop to a customer.
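To make that step more concrete, here is a minimal sketch of how stop points could be clustered and turned into bounding boxes with scikit-learn’s DBSCAN. It is an illustration under stated assumptions, not the pipeline described above: the `stop_zones` helper, the 50 m radius, the minimum of 5 stops per cluster, and the coordinates are all hypothetical, and the step that matches zones to the manifest is not shown.

```python
import numpy as np
from sklearn.cluster import DBSCAN

EARTH_RADIUS_M = 6_371_000  # mean Earth radius in metres


def stop_zones(stops_deg, eps_m=50.0, min_samples=5):
    """Cluster stop points given as (lat, lon) in degrees.

    Returns (labels, zones), where zones maps a cluster id to its
    bounding box (min_lat, min_lon, max_lat, max_lon).
    eps_m and min_samples are assumed values chosen for illustration.
    """
    stops_deg = np.asarray(stops_deg, dtype=float)
    # scikit-learn's haversine metric expects [lat, lon] in radians and
    # measures distance as an angle, so convert metres to radians.
    labels = DBSCAN(
        eps=eps_m / EARTH_RADIUS_M,
        min_samples=min_samples,
        metric="haversine",
        algorithm="ball_tree",
    ).fit_predict(np.radians(stops_deg))

    zones = {}
    for cluster_id in np.unique(labels):
        if cluster_id == -1:  # DBSCAN labels noise (outliers) as -1
            continue
        members = stops_deg[labels == cluster_id]
        min_lat, min_lon = members.min(axis=0)
        max_lat, max_lon = members.max(axis=0)
        zones[int(cluster_id)] = (min_lat, min_lon, max_lat, max_lon)
    return labels, zones


# Made-up data: two delivery spots visited 25 times each, plus two stray stops.
offsets = np.array([(i, j) for i in range(5) for j in range(5)]) * 0.00005
clubhouse_stops = np.array([43.5090, -79.7000]) + offsets
second_site_stops = np.array([43.5200, -79.7200]) + offsets
strays = np.array([[43.6000, -79.9000], [43.4000, -79.5000]])
stops = np.vstack([clubhouse_stops, second_site_stops, strays])

labels, zones = stop_zones(stops)
for cluster_id, box in zones.items():
    print(cluster_id, box)
```

The stray stops come back with the label -1 and are left out of every bounding box, which is the outlier removal described above. In practice the radius and minimum stop count would need tuning to the fleet and the size of the delivery sites.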

The Results

We found that for several delivery hubs, over 15% of customers had an actual stop location that differed by 500 metres or more from their planned stop location (some by up to 20 km due to poor geocoding).
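A comparison like this can be sketched in a few lines. The example below is only an illustration: the coordinates are made up, the `haversine_m` helper is hypothetical, and the 500-metre threshold is the only value taken from the results above. It measures the great-circle distance between each customer’s planned location and the observed stop location, then reports the share that is off by 500 metres or more.

```python
import numpy as np

EARTH_RADIUS_M = 6_371_000  # mean Earth radius in metres


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in metres; inputs are arrays of degrees."""
    lat1, lon1, lat2, lon2 = map(np.radians, (lat1, lon1, lat2, lon2))
    a = (np.sin((lat2 - lat1) / 2) ** 2
         + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_M * np.arcsin(np.sqrt(a))


# Made-up data: planned (geocoded) location vs. observed stop per customer.
planned = np.array([[43.5000, -79.7000], [43.5200, -79.7200],
                    [43.5400, -79.7400], [43.5600, -79.7600]])
observed = np.array([[43.5090, -79.7000], [43.5201, -79.7200],
                     [43.5400, -79.7401], [43.5600, -79.7600]])

gap_m = haversine_m(planned[:, 0], planned[:, 1],
                    observed[:, 0], observed[:, 1])
needs_review = gap_m >= 500  # threshold from the results above
share = needs_review.mean()

print(f"{share:.0%} of customers are 500 m or more from their planned stop")
```

Customers flagged this way are the ones whose geocoding is worth correcting first.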

As you can well see, answering a simple question like “Where are my customers?” can have a dramatic effect on your business. Correcting the geocoding is a stepping stone to improving productivity, fuel savings, and customer satisfaction.

This is just one example of how you can find value in your big data. A big data graveyard can actually be a goldmine for your business, but only if you ask the right questions.


Mike Branch

As Geotab’s Vice President of Data and Analytics, Mike Branch leads the development of solutions that enable insight from over 1.4 million connected vehicles and the 30 billion telematics records that Geotab processes daily.