Data Finds Data – Rethinking Time and Cost in Information Systems

Data finds data. It’s such a simple concept that I overlooked this title for almost two years while it stayed hidden as a chapter in the Creative Commons-licensed book “Beautiful Data: The Stories Behind Elegant Data Solutions.” The chapter is accessibly written by Jeff Jonas and Lisa Sokol. Jonas is chief scientist at IBM’s Entity Analytics group, the creator of NORA, and well-known for his work in casino gaming (and his acquired company’s work in counterterrorism). The bottom line is this: in the near future, many more information systems will determine what they should do with data the instant the data is encountered, because that is the computationally cheapest way to do it. Users will need to ask fewer smart questions; it will be the responsibility of the data to find its place in the enterprise. If this seems obvious, it’s not, and we’ve architected three generations of information systems – not to mention built a mindset! – that are going to be very difficult to change. Jonas’ subtle point is that we can’t afford not to change.

A Brief History of Time

We build systems as if we know the order in which information will appear, and that’s a very costly flaw. Relationships can manifest themselves at different points in time. If we assume the order, the timing, we’ll miss important relationships, and those relationships will never get acted upon. What if, instead, we related information up front, semantically, acted upon it, and then stopped worrying about it? It would change our cost structure for information dramatically, because right now we worry about the same information over and over and over again, while the costs increase. We would have a different set of tradeoffs, to be sure, but we are reaching the point right now where we are constantly disregarding data we can no longer afford to process. We’ve got the wrong mix, a costly one, and it’s going to take some forward-looking thinking to change it. We need a persistent, time-independent context for our information, an institutional memory of how our information relates, and it doesn’t take a self-aware Skynet to get us there.
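To make the cost tradeoff concrete, here is a minimal sketch (the class and field names are my own illustration, not Jonas’s design) of relating information the instant it arrives: each new record probes a cheap index for its relatives once, on ingest, instead of the whole store being re-scanned every time someone asks a question.

```python
# A minimal, hypothetical sketch of ingest-time relating: each record is
# considered exactly once, when it is encountered, rather than over and
# over in later batch queries.
from collections import defaultdict

class IngestTimeIndex:
    """Relates each new record to prior records the moment it arrives."""

    def __init__(self):
        self.by_phone = defaultdict(list)   # shared feature -> record ids
        self.relationships = []             # discovered (id, id) pairs

    def ingest(self, record_id, phone):
        # "Data finds data": the new record looks for its relatives now,
        # while the lookup is a cheap hash probe...
        for other in self.by_phone[phone]:
            self.relationships.append((other, record_id))
        # ...and is then indexed so future arrivals can find it, too.
        self.by_phone[phone].append(record_id)

idx = IngestTimeIndex()
idx.ingest("crm:17", "555-0100")
idx.ingest("hr:42", "555-0100")   # the relationship surfaces immediately
print(idx.relationships)          # [('crm:17', 'hr:42')]
```

The point of the sketch is the cost structure: no matter how late the second record shows up, the relationship is found at arrival time, and neither record needs to be worried about again.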

Semantically Reconciled and Relationship-Aware Directories

This is as technical as Jeff gets in the chapter, and the concepts are actually very simple ones. Think perpetually-updated synonym list, along with synonym identifiers, so that you’ve got a relationship-based index to all enterprise information. The trick is to update these “directories” the moment data is added to the system, and to be willing to incur the overhead of updating entire sets of relationships when necessary. So, for instance, you discover that the “Kevin Lynch” you’ve been pointing to is not the “Kevin Lynch” another system has been pointing to; you need to separate what you know about the two of them right then. You can’t let your ground truth “drift” over time, forcing a now-computationally-impossible update of entire systems. This requires discipline. You do have to incur the cost of creating semantic directories to begin with, and of modifying existing systems to contribute to them and operate with them. We haven’t spent a lot of effort on semantic reconciliation (yet), primarily because we have so much easily reconciled data that is not (yet) linked data, but when we get time, we want to spend it with the SILK (Semantic Inferencing on Large Knowledge) project looking for scalable methods.
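As a rough sketch of what such a directory might look like (every name here is illustrative; the chapter doesn’t prescribe an implementation), the two essential operations are resolving a record to a synonym identifier the moment it is encountered, and splitting an identifier right then when you learn two “Kevin Lynch”es were conflated:

```python
# Hypothetical sketch of a semantically reconciled directory: a
# perpetually updated synonym list with synonym (entity) identifiers,
# updated the moment data arrives.
class SemanticDirectory:
    def __init__(self):
        self.entity_of = {}    # source record -> entity id
        self.members = {}      # entity id -> set of source records
        self.next_id = 0

    def add(self, record, same_as=None):
        """Resolve a record to an entity the instant it is encountered."""
        if same_as is not None and same_as in self.entity_of:
            eid = self.entity_of[same_as]          # join an existing entity
        else:
            eid = self.next_id                     # mint a new identifier
            self.next_id += 1
            self.members[eid] = set()
        self.entity_of[record] = eid
        self.members[eid].add(record)
        return eid

    def split(self, records):
        """Two entities were conflated: separate what we know about them
        right then, rather than letting ground truth drift."""
        new_eid = self.next_id
        self.next_id += 1
        self.members[new_eid] = set()
        for r in records:
            self.members[self.entity_of[r]].discard(r)
            self.entity_of[r] = new_eid
            self.members[new_eid].add(r)
        return new_eid

d = SemanticDirectory()
d.add("crm:KevinLynch")
d.add("hr:KevinLynch", same_as="crm:KevinLynch")  # believed synonymous
d.split(["hr:KevinLynch"])  # new evidence: they are different people
```

The discipline Jonas calls for lives in that `split`: the correction is paid for immediately, record by record, instead of being deferred into a system-wide batch repair that may never be computationally affordable.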

A Divestiture of Responsibility?

If you’ve gotten this far, you’re probably bristling at the thought that users won’t be needed to ask smart questions. No, systems will never get that smart. This approach does not divest anyone of their responsibility, and it is analogous to how the semantic web has been trying to generalize business logic by putting it in the data itself, rather than in specific program code that runs in specific places at specified times. Take that even further now, by collapsing time to zero and moving that logic up front, so that data is smarter, sooner. We’ll be able to capture what we know about something, attach it to that something, and have it considered every time, by machines and people. When data finds data, the data asks the questions first. More than ever, it will be our responsibility to capture what we know, and to add value by linking what we know to our data. This is a short, compelling read, and it has me thinking differently about how we need to architect information systems.
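One way to picture “the data asks the questions first” (my own framing, not the chapter’s) is a store of standing questions that every arriving record is evaluated against at ingest time, so no user has to re-issue the query later:

```python
# A speculative sketch: questions are registered once and live in the
# system; each arriving record finds the questions that apply to it.
class StandingQuestions:
    def __init__(self):
        self.questions = []   # (label, predicate) pairs
        self.answers = []     # (label, matching record) pairs

    def ask(self, label, predicate):
        """Register a question once; it runs against all future data."""
        self.questions.append((label, predicate))

    def ingest(self, record):
        # The question is not re-issued by a user; the record finds it
        # the instant the record is encountered.
        for label, predicate in self.questions:
            if predicate(record):
                self.answers.append((label, record))

sq = StandingQuestions()
sq.ask("watchlist", lambda r: r.get("name") == "Kevin Lynch")
sq.ingest({"name": "Ann Smith"})
sq.ingest({"name": "Kevin Lynch", "source": "hr"})
print(sq.answers)   # [('watchlist', {'name': 'Kevin Lynch', 'source': 'hr'})]
```

The human responsibility hasn’t gone anywhere: someone still has to capture what is known well enough to write the standing question in the first place.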


Instead of the upfront separation, which the post seems to say would be a manual process of sorts, would it be feasible to allow the data for the two "Kevin Lynch" items to remain co-located until such time as further data was added to "Kevin Lynch" that was incompatible with the existing data? And then, as a way to resolve the machine's cognitive dissonance over the two Kevins, let the machine experiment with differing splits of the Kevins until the dissonance was resolved (while keeping the original data organization of the two Kevins as one, so that if further data was added in the future that called for additional reorganization, the machine could go back to the original construct and revise from there). That kind of revision might be an important part of search.
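The commenter's lazy-split idea could be sketched something like this (all names and the compatibility rule are hypothetical, just to make the mechanism concrete): facts stay merged in one group until a new fact contradicts them, at which point the entity splits, while the original arrival order is retained for later re-splitting.

```python
# Speculative sketch of the commenter's proposal: keep the two "Kevin
# Lynch"es co-located until incompatible data forces a split.
def compatible(new, existing):
    """Two fact-sets agree if no shared attribute has conflicting values."""
    return all(existing.get(k, v) == v for k, v in new.items())

class LazyEntity:
    def __init__(self):
        self.groups = []    # each group: a list of mutually compatible facts
        self.history = []   # original arrival order, kept for later revision

    def add(self, facts):
        self.history.append(facts)
        # Place the new facts in the first group they don't contradict;
        # a contradiction ("cognitive dissonance") forces a new group.
        for group in self.groups:
            if all(compatible(facts, g) for g in group):
                group.append(facts)
                return
        self.groups.append([facts])

kl = LazyEntity()
kl.add({"name": "Kevin Lynch", "city": "Boston"})
kl.add({"name": "Kevin Lynch", "birth_year": 1961})  # still co-located
kl.add({"name": "Kevin Lynch", "birth_year": 1978})  # dissonance -> split
print(len(kl.groups))   # 2
```

Because `history` preserves the pre-split organization, a future contradiction could be resolved by discarding the current grouping and re-running the splits from the original construct, as the comment suggests.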