I’ve been reading a lot of sci-fi lately, mixed with books about data, which got me thinking about how dumb data is, what the costs of stupid data are, and how smarter data could lower our information costs: for access, for development, and for maintenance.
Why Data are like Tribbles
In a classic original Star Trek episode, The Trouble with Tribbles, gerbil-like organisms reproduce out of control aboard the Enterprise. (You can see 3 minutes of it here, courtesy of CBS.) Tribbles aren’t that smart (cute, but not smart), and they don’t live long, but they multiply and cause problems. While our data doesn’t multiply at the same rate, I’m more surprised than ever at just how many copies of the data there are inside the enterprise, and it certainly lives a long time. Backups, sure, those make sense. Data warehouses, arguably a necessary evil. Let’s back those up. Data marts and operational data stores, well, ok, since we’re going in this direction, and there’s processing that just has to be done on our data to run a real-time enterprise. Our own personal copies? We have to personalize the data, massage it a little more. In Excel. And its presentation, too. In PowerPoint. Let’s back those up too.
The Marginal Cost of a Copy? It’s not zero.
We’re on a slippery slope. We can each argue the absolute necessity of the different pieces. Heck, our jobs depend on some of those pieces! Business speed matters. Disk is cheap. The marginal cost of a digital copy is close to zero. But to be absolutely clear, the marginal cost of a digital copy is greater than zero. There are costs, some direct, some hidden, but definitely real costs, and they may be greater than we suspect. It’s not just disk; it’s that each copy we make is less likely to reflect the ground truth of the data. It’s the increasing ambiguity of the data, and our inability to find what we should be using in the first place, that keeps costs mounting. The data isn’t just more difficult to find because there are copies, it’s more difficult to use…because there are copies.
An Echo of a Reflection of a Shadow
Ever made a copy of a copy of a copy? It’s not supposed to matter in the digital world, but in fact, it does. We combine, we aggregate, we summarize. We necessarily have to gloss over some detail. We lose that detail. We lose the ability to track back to the source. We use the corporate copy. We use the departmental copy. What provenance exists is so difficult to navigate that it goes unused. We use our best (read “most easily accessible”) copy of the data and go with that. Is that good enough? It has to be, because we’ve got deadlines to meet, a business to run. It was good enough last time.
What if I could guarantee I could get to the data, correctly each time? What if every part, every product, every person had its own identity, and we were guaranteed to be able to find it? What if, when I found the data, I also automatically found every piece of data it was known to be related to? And their identities? Would we consider the data “smart” then? I sure would.
If every piece of meaningful data had its own identity to begin with, I might not have to copy it so many times. When I did copy it, I’d know what I was copying from, and why, and be able to verify that at any time. Data would become a lot less ambiguous. Developers would spend less time finding the data, and more time using their creativity to solve problems. So would information users. Simply providing the links to and from data, while not a panacea, makes the data more useful, keeping it alive rather than dead in the application. There are some simple principles we can use to make these possibilities realities, and they don’t require a lot of money or elapsed time. They require a change in our thinking about data, our applications, and how we can add value. If we knew, for instance, that our knowledge about a particular thing would be captured and available, guaranteed, then we would be more likely to take the time to capture our knowledge. These simple principles help us do these things: store data once in a place where we know we can get it every time, capture knowledge about that data in a consistent way, and lower the costs for information access and maintenance.
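To make those principles concrete, here is a minimal, hypothetical Python sketch of what “data with identity” could look like: every record gets a content-derived identity, storing the same data twice resolves to one record instead of a new copy, and every derived record carries links back to its sources. The `Registry` class and its methods are illustrative assumptions, not any real product or standard.

```python
import hashlib
import json

class Registry:
    """Hypothetical store: each record kept once, keyed by a
    content-derived identity, with links back to its sources."""

    def __init__(self):
        self._records = {}

    def put(self, data, sources=()):
        # The identity is a hash of the content itself, so storing
        # the same data twice yields the same record, not a copy.
        identity = hashlib.sha256(
            json.dumps(data, sort_keys=True).encode()
        ).hexdigest()
        if identity not in self._records:
            self._records[identity] = {"data": data,
                                       "sources": list(sources)}
        return identity

    def get(self, identity):
        return self._records[identity]["data"]

    def provenance(self, identity):
        # Walk the source links back to the originals, so any
        # derived value can be verified against what it came from.
        seen, stack = [], [identity]
        while stack:
            current = stack.pop()
            seen.append(current)
            stack.extend(self._records[current]["sources"])
        return seen

registry = Registry()
raw = registry.put({"part": "PN-1138", "qty": 40})
summary = registry.put({"total_qty": 40}, sources=[raw])
```

With links like these, the “echo of a reflection of a shadow” problem shrinks: the summary still glosses over detail, but `provenance(summary)` leads back to the record it was derived from, and re-storing identical data cannot create a second, diverging copy.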