Once you have a good understanding of what has happened (descriptive analytics), it’s time to consider why it happened. Diagnostic analytics refers to the methods used to answer questions about why something has happened. More carefully stated, diagnostic analytics establishes possible answers to why something happened. This distinction is important because it shapes what you can expect from the analysis. Why something occurs is hard to determine definitively until a model is built and extensive testing is performed. At this stage, you are considering what the possible causes could be, not what they are. This is the difference between correlation and causation. Even without determining actual causes, there is still a lot to be learned here. Correlation isn’t a smoking gun for causation, though a lack of correlation usually suggests there is no direct causal link. These findings are the cornerstone of model building for predictive analytics, a topic for a future article.
Techniques of Diagnostic Analytics
There are many different approaches to diagnostic analysis. The standard methods are Data Drilling/Mining, Data Discovery, and Correlations.
- Data Drilling/Mining– This is digging into the subgroups that make up the value being explored. This may not provide the explanation outright but can give insights into what areas should be further explored. With a good dashboard, this type of analysis is easily accomplished.
- Example: Iron sales are down. You dig into the subgroups of sales and look at who you are selling to. You see that a large car manufacturer is not buying as much as it used to. That tells you where the drop came from, but it doesn’t get to the root of why. Is the car manufacturer buying from someone else? Is it buying less in general because of scaled-back production? Or is it buying less because it is making more aluminum cars now? Each scenario would warrant a different response. Sometimes digging into subgroups gives the answer outright; other times it only provides a hint as to where to look further.
- Data Discovery– This is looking at data through the use of visualizations to look for connections between variables. Sometimes just seeing the right plot, pie chart, or bar graph can make the why behind something crystal clear. Knowing the best way to visualize data is a skill unto itself especially when exploring possible relations between variables. Again, a well-designed dashboard should allow for easy production and navigation of plots.
- Correlations– This is a measure of how strongly two variables are related. There are several types of correlation coefficients one could calculate (the linear correlation coefficient, also called Pearson’s product-moment coefficient, is the most common; rank coefficients are another). Each has its own use cases, but the basic idea behind all of them is the same. A correlation coefficient is a number that ranges from -1 to 1: 0 means not (linearly) correlated, 1 means perfectly correlated, and -1 means perfectly correlated in opposite directions, so one value falls as the other rises. You can think of it as a way of measuring how closely one value tends to move with another.
- Example I: Consider ice cream sales. In the summer you see that your sales are much higher than in the winter months. You calculate the correlation between the time of year and ice cream sales and find the correlation coefficient is 0.9. This means that the time of year you look at has a substantial bearing on the ice cream sales you would expect.
- Example II: Now consider two 6-sided dice. You are interested in seeing if rolling a 3 on one die has any impact on the value rolled on the other die. You calculate the correlation coefficient and find there is a 0.05 correlation. A value this close to zero is consistent with the two rolls being independent of each other. It does not, on its own, tell you whether the dice are fair; that is a separate question about how often each face comes up.
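The two correlation examples above can be reproduced with a few lines of Python. This is only a minimal sketch: the monthly figures are invented for illustration, average temperature is used as a numeric stand-in for "time of year" (an assumption, since a calendar month is not a quantity you can correlate directly), and the hot chocolate series is a hypothetical addition to show what a negative coefficient looks like. The dice are simulated rather than rolled.

```python
import math
import random


def pearson(xs, ys):
    """Pearson product-moment correlation of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical monthly figures, January to December (invented for illustration).
avg_temp_c = [2, 4, 9, 14, 19, 24, 27, 26, 21, 15, 8, 3]
ice_cream_sales = [110, 130, 200, 310, 420, 560, 640, 610, 470, 320, 190, 120]
hot_chocolate_sales = [520, 500, 430, 350, 270, 190, 150, 160, 240, 330, 440, 510]

r_ice_cream = pearson(avg_temp_c, ice_cream_sales)          # strongly positive
r_hot_chocolate = pearson(avg_temp_c, hot_chocolate_sales)  # strongly negative

# Two simulated dice; the seed only makes the run repeatable.
rng = random.Random(0)
die_a = [rng.randint(1, 6) for _ in range(10_000)]
die_b = [rng.randint(1, 6) for _ in range(10_000)]
r_dice = pearson(die_a, die_b)                               # close to zero

print(f"temperature vs ice cream:     {r_ice_cream:+.2f}")
print(f"temperature vs hot chocolate: {r_hot_chocolate:+.2f}")
print(f"die A vs die B:               {r_dice:+.2f}")
```

The sign tells you the direction of the relationship and the size tells you how tightly the points follow a straight line. The dice result hovers near zero, which is what you would expect from two unrelated rolls.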
Benefits of Diagnostic Analytics
Diagnostic analytics has several uses. Model building, database design, and targeted responses are but a few of its benefits.
- Predictive Analytics: One major benefit of diagnostic analytics is that it feeds directly into predictive analytics. As we will discuss in the article on predictive analytics, predicting possible outcomes is done by building models. The fastest way to build and test models is to know the relevant independent variables. By considering the possible causes of an event, you are finding candidate independent variables with which to build out your models.
- Database Design: Knowing which variables may explain a changing KPI (Key Performance Indicator), and which don’t, informs how an efficient database should be structured. Once you know certain variables are not correlated with your KPI, you can be more confident that storing them is unnecessary for analytics purposes. Because more advanced analytics needs a lot of data for every variable it uses, trimming the variables this way can save on storage and computation time.
- Targeted Responses: Looking for “why answers” allows for more targeted responses to changing values. As an example, let’s say there are several variables being tracked that could contribute to sales of a product: time of year, location, regional advertisements, and social media mentions. When you know which variables on the list are strongly correlated with sales, you can divert more resources toward those likely to have the biggest effect. Suppose you find that social media mentions are strongly correlated with increased sales. In that case, diverting more resources toward developing a stronger social media presence is a promising response. This idea extends into prescriptive analytics (with questions about the optimal response), but all the branching possibilities begin in diagnostic analytics.
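One way to picture the targeted-response idea is to rank the tracked variables by how strongly each one correlates with sales. The sketch below does that under stated assumptions: every number is invented, the variable names are hypothetical, and location is left out because it is a category rather than a quantity and would need a different treatment.

```python
import math


def pearson(xs, ys):
    """Pearson product-moment correlation of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical weekly observations (invented for illustration).
weekly_sales = [100, 120, 90, 150, 170, 130, 200, 180]
candidates = {
    "social_media_mentions": [20, 26, 18, 33, 37, 27, 45, 40],
    "regional_ad_spend": [5, 5, 6, 5, 6, 5, 6, 6],
    "avg_temperature_c": [15, 14, 16, 15, 13, 16, 14, 15],
}

# Correlate each candidate variable with sales.
correlations = {
    name: pearson(values, weekly_sales) for name, values in candidates.items()
}

# Rank by absolute value: a strong negative correlation is as informative
# as a strong positive one.
ranking = sorted(correlations.items(), key=lambda item: abs(item[1]), reverse=True)

for name, r in ranking:
    print(f"{name:24s} {r:+.2f}")
```

A ranking like this only suggests where to look first. It does not show that spending more on the top variable will raise sales, which is a question for later testing.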
Difficulties with Diagnostic Analytics
There are a few possible hardships that come up when doing diagnostic analytics. A common problem is hidden variables: variables that influence the outcome but are missing from your data. To see what this looks like, let’s consider a simple example. A researcher is tracking two variables, shoe size and reading comprehension, and sees a strong correlation between increased shoe size and increased reading comprehension. One might conclude from this correlation that a larger shoe size means better reading comprehension. If age were available as a variable (in this case, age is the hidden variable), you would see that age is a better explanation for both increased shoe size and better reading comprehension. Very young children have smaller shoe sizes than adults and also weaker reading skills. This is easy to see in an example as simple as this one, where the cause is known. In more complicated systems, where there is no intuition (or worse, misplaced intuition) to guide you, the hidden variable may take considerable time to find, if it is found at all.
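The shoe size example can be simulated to show how a hidden variable produces a correlation on its own. In the sketch below, both measurements are generated from age plus random noise, and neither depends on the other. The coefficients, noise levels, and age range are all assumptions chosen for illustration, not real data.

```python
import math
import random


def pearson(xs, ys):
    """Pearson product-moment correlation of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


rng = random.Random(42)  # fixed seed so the run is repeatable

# Simulated children aged 5 to 15. Shoe size and reading score each depend
# on age only; all coefficients are invented for illustration.
ages = [rng.randint(5, 15) for _ in range(5000)]
shoe_size = [0.8 * age + rng.gauss(0, 1) for age in ages]
reading_score = [6 * age + rng.gauss(0, 8) for age in ages]

# Without the age column, the two variables look strongly related.
r_overall = pearson(shoe_size, reading_score)

# Hold the hidden variable fixed: look only at ten-year-olds.
ten_year_olds = [
    (s, r) for a, s, r in zip(ages, shoe_size, reading_score) if a == 10
]
r_age_10 = pearson(
    [s for s, _ in ten_year_olds],
    [r for _, r in ten_year_olds],
)

print(f"all ages:      {r_overall:+.2f}")
print(f"age 10 only:   {r_age_10:+.2f}")
```

Across all ages the correlation is strong, but within a single age group it largely disappears. Of course, this check is only possible once age is being recorded, which is exactly the difficulty with hidden variables.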
Another problem lies in the math of correlation functions. The ideas behind correlations can be confusing at first brush. Knowing which function is best to use, or where a function breaks down, takes a level of expertise. As a rule of thumb, if the relationship is linear (it looks like a straight line when plotted), the standard correlation coefficient is fine; anything else gets complicated.
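To make the rule of thumb concrete, here is a small sketch of two cases where the standard (Pearson) coefficient understates a relationship. The first is a curve that always rises, where a rank coefficient (Spearman’s, computed here as Pearson on the ranks) picks up what Pearson partly misses. The second is a symmetric curve, where the linear coefficient is zero even though one variable fully determines the other. The data and helper names are assumptions for illustration, and the simple `ranks` helper assumes there are no tied values.

```python
import math


def pearson(xs, ys):
    """Pearson product-moment correlation of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def ranks(values):
    """Rank of each value (1 = smallest). Assumes no tied values."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, index in enumerate(order, start=1):
        result[index] = rank
    return result


def spearman(xs, ys):
    """Spearman rank correlation: Pearson applied to the ranks."""
    return pearson(ranks(xs), ranks(ys))


# Case 1: y always increases with x, but along a curve rather than a line.
x = list(range(1, 11))
y_curved = [2 ** v for v in x]
r_pearson_curved = pearson(x, y_curved)    # noticeably below 1
r_spearman_curved = spearman(x, y_curved)  # 1: the ordering matches perfectly

# Case 2: y is completely determined by x, yet the linear correlation is zero.
x_sym = list(range(-5, 6))
y_parabola = [v ** 2 for v in x_sym]
r_parabola = pearson(x_sym, y_parabola)

print(f"curved, Pearson:   {r_pearson_curved:.2f}")
print(f"curved, Spearman:  {r_spearman_curved:.2f}")
print(f"parabola, Pearson: {r_parabola:.2f}")
```

The second case is a reminder that a coefficient near zero rules out a straight-line relationship, not every relationship. Plotting the data first is usually the quickest way to tell which situation you are in.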
The last major source of frustration lies in timelines. There is no real timeline you can give for this type of analysis. Because of hidden variables, and because a certain level of subject matter expertise is needed, finding the possible whys can be very time-consuming. There are always more why questions you can ask, and knowing when to stop can take real restraint. In some cases, a wrong why answer doesn’t become apparent until errors pop up in the models built using those variables. Healthy skepticism about your results is encouraged until the answers have become more demonstrable.
What does this look like in practice?
- First, use descriptive analytics to track data for many variables. Diagnostic analytics is all about looking for connections between variables, so the data has to be there to look at.
- Next, form the why question. This is usually prompted by seeing abnormalities in the data via visualizations, often through a dashboard, and wanting to search out the cause.
- Once the question is stated, start drilling down/mining your data near the anomaly. What you are looking for is anything that is different in the metadata between the event you are interested in and past records of the data. Do the anomalies in the metadata account for the total anomaly? Does the metadata answer have explanatory power?
- Once you have a candidate explanation, it is time to test it. Calculating the correlation between the two variables is a good first test of whether the explanation holds. The next step is statistical hypothesis testing, a standard but multi-step mathematical process that establishes how much confidence you can place in your proposed explanation.
- Once you have a set of possible why answers, the last step is to make them actionable while continuing to test them to raise confidence.
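The drill-down and testing steps above can be sketched in a short script. This is a simplified illustration under stated assumptions, not a prescribed workflow: the customers, tonnages, and production figures are invented, and a permutation test stands in for "statistical hypothesis testing" because it needs nothing beyond the standard library. Other tests may suit your data better.

```python
import math
import random
from collections import defaultdict


def pearson(xs, ys):
    """Pearson product-moment correlation of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Step 1: drill down. Hypothetical iron sales in tons, per customer and quarter.
sales = [
    ("last_quarter", "car_manufacturer", 500),
    ("last_quarter", "construction_firm", 300),
    ("last_quarter", "appliance_maker", 200),
    ("this_quarter", "car_manufacturer", 320),
    ("this_quarter", "construction_firm", 290),
    ("this_quarter", "appliance_maker", 210),
]
by_customer = defaultdict(dict)
for period, customer, tons in sales:
    by_customer[customer][period] = tons

change = {c: p["this_quarter"] - p["last_quarter"] for c, p in by_customer.items()}
total_change = sum(change.values())
biggest_drop = min(change, key=change.get)
# Does this subgroup account for the total anomaly?
share_of_drop = change[biggest_drop] / total_change

# Step 2: test a candidate explanation, here "the customer is producing less".
# Hypothetical monthly vehicle output (thousands) and iron orders (tons).
car_output = [40, 42, 41, 39, 38, 36, 33, 31, 30, 28, 27, 25]
iron_orders = [172, 180, 171, 165, 166, 150, 140, 136, 128, 118, 120, 105]


def permutation_p_value(xs, ys, n_shuffles=2000, seed=0):
    """Share of random shufflings whose |correlation| matches or beats the real one."""
    rng = random.Random(seed)
    observed = abs(pearson(xs, ys))
    shuffled = list(ys)
    hits = 0
    for _ in range(n_shuffles):
        rng.shuffle(shuffled)
        if abs(pearson(xs, shuffled)) >= observed:
            hits += 1
    return (hits + 1) / (n_shuffles + 1)


r_observed = pearson(car_output, iron_orders)
p_value = permutation_p_value(car_output, iron_orders)

print(f"biggest drop: {biggest_drop} ({share_of_drop:.0%} of the total change)")
print(f"correlation: {r_observed:+.2f}, permutation p-value: {p_value:.4f}")
```

A small p-value says the correlation is unlikely to be a fluke of this particular sample. It still does not settle causation, so the explanation stays a candidate until further testing raises confidence.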
Contact Promethean to start utilizing your current data to help improve your business process.