The Trouble with Tribbles: Why Dumb Data Multiplies Like Bunnies

I’ve been reading a lot of sci-fi lately, mixed with books about data, which got me thinking about how dumb our data is, what that dumbness costs us, and how smarter data could lower our information costs: for access, for development, and for maintenance.

Why Data Is Like Tribbles

In a classic original Star Trek episode, “The Trouble with Tribbles,” a gerbil-like organism reproduces out of control. (You can see 3 minutes of it here, courtesy of CBS.) Tribbles aren’t that smart (cute, but not smart), and they don’t live long, but they multiply and cause problems. While our data doesn’t multiply at the same rate, I’m more surprised than ever at just how many copies of the data there are inside the enterprise, and it certainly lives a long time. Backups, sure, those make sense. Data warehouses, arguably a necessary evil. Let’s back those up. Data marts and operational data stores, well, ok, since we’re going in this direction, and there’s processing that just has to be done on our data to run a real-time enterprise. Our own personal copies? We have to personalize the data, massage it a little more. In Excel. And its presentation, too. In PowerPoint. Let’s back those up too.

The Marginal Cost of a Copy? It’s not zero.

We’re on a slippery slope. Each of us can argue the absolute necessity of the different pieces. Heck, our jobs depend on some of those pieces! Business speed matters. Disk is cheap. The marginal cost of a digital copy is close to zero. But to be absolutely clear, the marginal cost of a digital copy is greater than zero. There are costs, some direct, some hidden, but definitely real, and they may be greater than we suspect. It’s not just the disk; it’s that each copy we make is less likely to reflect the ground truth of the data. It’s the increasing ambiguity of the data, and our inability to find what we should be using in the first place, that keeps the costs mounting. The data isn’t just more difficult to find because there are copies, it’s more difficult to use…because there are copies.

An Echo of a Reflection of a Shadow

Ever made a copy of a copy of a copy? It’s not supposed to matter in the digital world, but in fact, it does. We combine, we aggregate, we summarize. We necessarily gloss over some detail. We lose that detail. We lose the ability to track back to the source. We use the corporate copy. We use the departmental copy. What provenance there is proves so difficult to navigate that it doesn’t get used. We use our best (read “most easily accessible”) copy of the data and go with that. Is that good enough? It has to be, because we’ve got deadlines to meet, a business to run. It was good enough last time.

What if I could guarantee I could get to the data, correctly each time? What if every part, every product, every person had its own identity, and we were guaranteed to be able to find it? What if, when I found the data, I also automatically found every piece of data it was known to be related to? And their identities? Would we consider the data “smart” then? I sure would.
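
To make that concrete, here is a minimal sketch in Python of what “identity plus links” could look like. Everything in it (the Item class, the registry, the urn:example identifiers) is an assumption of mine for illustration, not an existing product or standard; the point is only that once every part, product, and person has a stable identifier, finding the data and finding what it relates to becomes one operation.

    # A minimal, hypothetical sketch: every item carries a stable identifier,
    # and resolving an item also surfaces everything it is known to relate to.
    from dataclasses import dataclass, field

    @dataclass
    class Item:
        uri: str                                   # the item's own identity
        attrs: dict = field(default_factory=dict)  # what we know about it
        links: dict = field(default_factory=dict)  # relation -> related URIs

    registry: dict[str, Item] = {}

    def register(item: Item) -> None:
        registry[item.uri] = item

    def resolve(uri: str):
        """Return the item plus every related item we know about."""
        item = registry[uri]
        related = {rel: [registry[u] for u in uris if u in registry]
                   for rel, uris in item.links.items()}
        return item, related

    # Every part, product, and person gets its own identity.
    register(Item("urn:example:person/42", {"name": "J. Smith"}))
    register(Item("urn:example:product/widget-a", {"name": "Widget A"},
                  links={"owned_by": ["urn:example:person/42"]}))
    register(Item("urn:example:part/1001", {"name": "Flange"},
                  links={"used_in": ["urn:example:product/widget-a"]}))

    part, related = resolve("urn:example:part/1001")
    print(part.attrs["name"], "->",
          [p.attrs["name"] for p in related["used_in"]])  # Flange -> ['Widget A']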

If every piece of meaningful data had its own identity to begin with, I might not have to copy it so many times. When I did copy it, I’d know what I was copying from, why, and be able to verify that at any time. Data would become a lot less ambiguous. Developers would spend less time finding the data, and more time using their creativity to solve problems. So would information users. Simply providing the links to and from data, while not a panacea, makes the data more useful, keeping it alive rather than dead in the application. There are some simple principles we can use to make these possibilities realities, and they don’t require a lot of money or elapsed time. They require a change in our thinking about data, our applications, and how we can add value. If we knew, for instance, that our knowledge about a particular thing would be captured and available, guaranteed, then we would be more likely to take the time to capture our knowledge. These simple principles help us do these things: store data once in a place where we know we can get it every time, capture knowledge about that data in a consistent way, and lower the costs for information access and maintenance.
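
And when a copy really is necessary, “know what I was copying from, why, and be able to verify that at any time” can be as simple as writing a provenance record alongside the copy. The sketch below is my own assumption about what such a record might contain (source identifier, reason, timestamp, checksum), not a prescribed format.

    # A hedged sketch of "copy with provenance": the copy remembers what it
    # came from and why, and a checksum lets us re-verify it against the
    # source later. Names and record layout are illustrative only.
    import hashlib
    import json
    from datetime import datetime, timezone

    def checksum(data) -> str:
        return hashlib.sha256(json.dumps(data, sort_keys=True).encode()).hexdigest()

    def copy_with_provenance(source_uri: str, data, reason: str) -> dict:
        return {
            "data": data,
            "provenance": {
                "copied_from": source_uri,
                "reason": reason,
                "copied_at": datetime.now(timezone.utc).isoformat(),
                "source_checksum": checksum(data),
            },
        }

    def still_matches(copy: dict, current_source_data) -> bool:
        """Has the source drifted since we took our copy?"""
        return copy["provenance"]["source_checksum"] == checksum(current_source_data)

    deck_numbers = copy_with_provenance(
        "urn:example:warehouse/orders/2011-q3",
        {"orders": 1284, "revenue": 90210.00},
        reason="quarterly PowerPoint deck",
    )
    print(still_matches(deck_numbers, {"orders": 1284, "revenue": 90210.00}))  # True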

Comments

I think the idea of tracing data to all its points, changes, and back to its origin is interesting. I'm just wondering WHAT data you're referencing. (As a normal person on the street, maybe that's an assumed piece of knowledge that I don't have that someone in the field would have.)

A key definition needed is for the term "every piece of MEANINGFUL data." Who determines what is meaningful? Who generates this data? Where is it stored? Do my babies' pictures count as meaningful data? Is meaningful data within a company or is it the world in general? Does a 16-year-old's crooning on YouTube count as meaningful data? What if that 16-year-old is Justin Bieber? Does a recipe invented by Aunt Sofie that was adapted by cousin George, made gluten-free by neighbor Sally, made vegetarian by someone in Detroit who saw it online, then turned into a beef dish by her neighbor count as meaningful data? If Aunt Sofie is a published cookbook author and it was her original chicken recipe and she finds it used for brisket at a cafe in Detroit, maybe it is meaningful data. And certainly tracing its changes would matter to Aunt Sofie and her attorney.