There's an old adage in the investment community:

The information you have is not the information you want.
The information you want is not the information you have.
The information you need is not the information you can obtain.
The information you can obtain costs more than you want to pay.

[as printed in: Against the Gods: The Remarkable Story of Risk - Peter L. Bernstein - 1996]

This was originally coined in the framing of what you would want to know to really price a business or potential investment opportunity. The implications of modern insider trading laws make this particularly evident. However, I think there's a very meaningful parallel to the debate around Big Data.

Why Big Data is great.

The excitement around Big Data [in the traditional sense] mainly stems from our ability to add new sources of insight which help plug the gaps implied by the adage above. Facebook doesn't know whether users actually like what they're being shown - but it does know whether they click on it or scroll past it - and it can use that huge dataset (combined with some clever analytics) to infer a reasonable approximation of whether they like it from engagement alone.
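To make that concrete, here's a minimal sketch of that style of inference. The event names, weights and threshold are all illustrative assumptions on my part - nothing here reflects how Facebook actually does it - but it shows the basic shape: score the engagement signals you *do* have, then treat the score as a proxy for the thing you can't measure.

```python
from collections import defaultdict

# Hypothetical engagement log: (user, item, event) tuples.
# The event names are an assumed schema, purely for illustration.
events = [
    ("alice", "post_1", "click"),
    ("alice", "post_1", "dwell_30s"),
    ("alice", "post_2", "scroll_past"),
    ("bob",   "post_1", "scroll_past"),
]

# Crude proxy: positive engagement counts for, scrolling past counts against.
# These weights are assumptions chosen to make the example readable.
WEIGHTS = {"click": 1.0, "dwell_30s": 1.5, "scroll_past": -0.5}

scores = defaultdict(float)
for user, item, event in events:
    scores[(user, item)] += WEIGHTS[event]

def probably_likes(user, item, threshold=1.0):
    """Infer 'likes it' from engagement alone - an approximation,
    not a direct measurement of what the user actually feels."""
    return scores[(user, item)] >= threshold

print(probably_likes("alice", "post_1"))  # True  (clicked and dwelled)
print(probably_likes("bob", "post_1"))    # False (scrolled past)
```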

What's more, in a lot of cases these are datasets which organisations already have. There are massive stores of logs and behavioural data which can be readily analysed and mined for insight - they don't require additional collection or acquisition, just the application of increasingly accessible analytical tooling.

The transformative effect of this on some organisations is huge: the body of data the organisation can use for insight suddenly takes a big leap forward, and that can unlock several key decisions in a short space of time.

Why Big Data is dangerous.

I've already talked about the concept of utility when it comes to dashboarding - that the perceived value of a report decreases as the number of existing reports increases. I believe the same effect applies to raw data too: the more data you have, the lower the perceived value of each additional piece of data. The net effect of the huge wave of data which comes with the first "Big Data" projects is that it can have a huge impact on the total volume of data your organisation has access to - and therefore a massive negative impact on the perceived value of collecting any additional data. Doubly so if that additional data isn't also massive in size.
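As a rough illustration of that effect - and the decay curve here is an assumption picked purely for illustration, not an empirical result - suppose the perceived value of each additional data point decays with the number you already hold, so total perceived value grows only logarithmically:

```python
import math

def perceived_total_value(n_points, scale=1.0):
    # Assumed model: the marginal value of the nth data point decays like
    # 1/n, so total perceived value grows like the harmonic series ~ ln(n).
    # This curve is an illustrative assumption, not an empirical result.
    return scale * math.log(n_points)

def perceived_gain(existing, added):
    # How much a new dataset *appears* to add, given what you already hold.
    return perceived_total_value(existing + added) - perceived_total_value(existing)

# 1,000 new data points when the organisation holds 1,000:
print(perceived_gain(1_000, 1_000))        # ~0.693 -- a clearly visible win

# The same 1,000 points after a Big Data project brought in 10 million:
print(perceived_gain(10_000_000, 1_000))   # ~0.0001 -- barely registers
```

Under that assumption, the same small dataset looks like a big win before the Big Data project lands, and like rounding error after it - which is exactly the incentive problem.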

On top of that, Big Data is, almost by definition, data which is only valuable in massive quantities - and therefore significantly "lower value" as individual data points than the kind of data your team might be used to working with.

All in all, if not handled carefully - a successful "Big Data" effort raises the profile of Big Data, encourages more Big Data projects - and significantly harms the incentives for "small data" projects.

The dilemma of small data.

Is it better to infer a value from mountains of data or to measure it directly? Inferring something from Big Data is useful, but only if it's effectively impossible to measure the thing directly. Given the option, it's more reliable, faster, cheaper and probably easier on your team to use direct measurements where they're available. Nobody is ever going to argue that Big Data is cheap or easy.

This raises an inherent paradox for the forward-leaning data team. Big Data is valuable, and it would be crazy to ignore it - but left unchecked, the success of those efforts can hamper the small data efforts which are certainly cheaper and easier, and probably more accurate and faster.

There are a few approaches we can take to combat this and ensure that the two can happily co-exist.

  1. Don't overhype Big Data. It's very tempting (especially when vying for budget) to make a big deal out of Big Data - it's the future, it's game changing, it's a necessity. Be very mindful that the more you overhype it, the more you're making life difficult for small data. Be balanced and set realistic expectations about the benefits and drawbacks of Big Data, and how it fits into a wider context.
  2. Champion data collection. Data collection is going to have to increase regardless, whether it's small data or big data. While it's easy to get started on the data you already have, if you don't also ramp up data collection - you're going to run out of steam pretty soon.
  3. Always try to turn Big Data projects into small data projects. Some Big Data projects should be Big Data projects - some shouldn't. Sometimes you'll need to extend small data to Big Data (but that happens quite naturally); more often you'll need to turn a Big Data project into a small data project where people have just got a bit excited. The excitement is natural, and to be expected - but someone needs to play the role of small data champion.
  4. Be clear on what data you would ideally have. It's quite common for data teams, especially mature ones, to have a lot of data and get stuck in the mindset of working within the known, existing data. As a data leader, you can occasionally trigger the thought exercise of ignoring what exists and asking instead "what data would we ideally have?" [and how could we get it].

Sometimes the information you have is not the information you want. That doesn't mean it's always safe to assume that the information you need is not the information you can obtain.