There is no risk in Pseudo-Risk

The folks at Verizon are doing a great job with their Data Breach Investigations Reports. Their latest edition is their second, and it warrants a thorough review. My biggest concern involves their "pseudo-risk" calculation. Whether or not it is preceded by the word "pseudo", I believe the concept doesn't work.

I posted a brief Primer on Quantifying Risk recently as background to this post.

The 2008 Verizon Report has this to say:

Though the number of records is not an equivalent measure of the overall impact from a data breach, it is certainly an indicator. Thus, a “back of the napkin” calculation of risk (likelihood x impact) finds that partners represent the greatest risk for data compromise, followed closely by insiders.

And 2009 provides a similar calculation:

At this point, those familiar with our pseudo risk calculation (likelihood x impact) and its result in the last report may suspect that it will yield a different outcome this year. That instinct would be correct.

In both cases, the likelihood calculation uses the relative share of incidents attributed to each source (external, internal, partner), and the impact calculation uses the actual number of records lost for each of those three sources.
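To make the arithmetic concrete, here is a minimal sketch of the calculation as I have just described it. The function name and every number are hypothetical placeholders of mine, not figures from either report, and I am assuming one records-lost figure per source.

```python
def pseudo_risk(incidents, records_lost):
    """Multiply each source's share of incidents by its records lost.

    This mirrors the "likelihood x impact" shortcut described above;
    it is an illustration, not Verizon's actual method.
    """
    total_incidents = sum(incidents.values())
    return {
        source: (incidents[source] / total_incidents) * records_lost[source]
        for source in incidents
    }


# Hypothetical inputs, chosen only to show the mechanics.
incidents = {"external": 60, "internal": 25, "partner": 15}
records_lost = {"external": 40_000, "internal": 100_000, "partner": 200_000}

scores = pseudo_risk(incidents, records_lost)
for source, score in sorted(scores.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{source}: {score:,.0f}")
```

With these made-up inputs, the source with the fewest incidents ranks highest simply because its record count is largest, which is the kind of result the shortcut produces.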

There are two problems with this approach to risk. First, at best, using a proportion of breaches constrains the risk statement to the known prior of existing compromise. The statement would have to say something like "of all known breaches, there is a 73% chance that the source was external." I believe this is a base rate fallacy problem. If we want to know the risk that an external party will cause a breach, the proportion of known breaches doesn't help us, because we don't know the total volume of activity (good and bad) attributable to each group. The proportion tells us how likely the source was external given that a breach occurred, not how likely a breach is given that the source is external.
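A small sketch may help show why the base rate matters. Only the 73% external share comes from the statement above; the other breach counts and all of the activity volumes are hypothetical numbers I invented for illustration, since the real volumes are exactly what we do not know.

```python
def share_of_breaches(breaches):
    """P(source | breach): each source's share of known breaches."""
    total = sum(breaches.values())
    return {source: count / total for source, count in breaches.items()}


def breach_rate(breaches, activity):
    """P(breach | source): breaches per unit of activity for each source."""
    return {source: breaches[source] / activity[source] for source in breaches}


# Hypothetical: 100 known breaches, 73 of them external.
breaches = {"external": 73, "internal": 18, "partner": 9}
# Hypothetical total activity (good and bad) per source; unknown in practice.
activity = {"external": 1_000_000, "internal": 10_000, "partner": 3_000}

shares = share_of_breaches(breaches)
rates = breach_rate(breaches, activity)
for source in breaches:
    print(f"{source}: share of breaches = {shares[source]:.2f}, "
          f"breach rate = {rates[source]:.6f}")
```

Under these assumed volumes, the external group has the largest share of breaches but the smallest breach rate. Different volumes would give a different ordering, which is the point: the share alone cannot tell us.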

The second problem is that the likelihood term in a risk calculation operates as a "discount" to the impact: it produces an expected utility or expected value number to account for future uncertainty. The number of records used for impact, however, is not uncertain; these are actual records lost. Therefore, there is no need to discount them. What this really points to is the need to understand the total number of records that were at risk prior to the breaches in question.
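The distinction can be sketched in a few lines. The function names and all figures here are hypothetical, chosen only to contrast a discounted future loss with a loss that has already happened.

```python
def expected_loss(probability, potential_impact):
    """Expected value of an uncertain future loss: impact discounted by likelihood."""
    return probability * potential_impact


def fraction_lost(records_lost, records_at_risk):
    """Share of exposed records actually lost; needs the total that was at risk."""
    return records_lost / records_at_risk


# Future, uncertain event (hypothetical): discounting is meaningful.
forecast = expected_loss(probability=0.25, potential_impact=1_000_000)

# Past, realized event (hypothetical): the loss is a known quantity, so there
# is nothing to discount. What is missing is the denominator.
records_lost = 200_000
records_at_risk = 5_000_000
exposure = fraction_lost(records_lost, records_at_risk)

print(f"expected future loss: {forecast:,.0f} records")
print(f"realized loss: {records_lost:,} records ({exposure:.0%} of those at risk)")
```

Multiplying the realized 200,000 by a likelihood would produce a number with no clear meaning, whereas dividing it by the records at risk at least describes what happened.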

In any case, I believe that neither the likelihood nor the impact sources used are appropriate for a useful risk statement.