Semantic Link Podcast With Guest Nova Spivack

As many of you may have seen on SemanticWeb.com, we've kicked off 2012 well on the Semantic Link podcast. This month we were joined by veteran semantic entrepreneur Nova Spivack, and our conversation covered privacy, UI/UX design, and whether we've yet found a definition for the word "semantic."

We spent some time talking about one of Nova's newest projects, Bottlenose. If you haven't checked it out yet, you can request access using the invite code "semanticlink". My fellow Linker Eric Hoffer has a great write-up of the session on his blog "Axonomics."

Listen here!

Data Finds Data – Rethinking Time and Cost in Information Systems

Data finds data. It’s such a simple concept that I overlooked this title for almost two years while it stayed hidden as a chapter in the Creative Commons-licensed book “Beautiful Data: The Stories Behind Elegant Data Solutions.” The chapter is accessibly written by Jeff Jonas and Lisa Sokol. Jonas is chief scientist at IBM’s Entity Analytics group, the creator of NORA, and well known for his work in casino gaming (and his acquired company’s work in counterterrorism). The bottom line is this: in the near future, many more information systems will determine what they should do with data the instant the data is encountered, because that is the computationally cheapest way to do it. Users will need to ask smart questions less often; it will be the responsibility of the data to find its place in the enterprise. If this seems obvious, it’s not: we’ve architected three generations of information systems, not to mention built a mindset, that are going to be very difficult to change. Jonas’ subtle point is that we can’t afford not to change.

A Brief History of Time

We build systems as if we know the order in which information will appear, and that’s a very costly flaw. Relationships can manifest themselves at different points in time. If we assume the order, the timing, we’ll miss important relationships, and those relationships will never get acted upon. What if, instead, we related information up front, semantically, acted upon it, and then stopped worrying about it? It would change our cost structure for information dramatically, because right now we worry about the same information over and over and over again, while the costs increase. We would have a different set of tradeoffs, to be sure, but we are reaching the point right now where we are constantly disregarding data we can no longer afford to process. We’ve got the wrong mix, a costly one, and it’s going to take some forward-looking thinking to change it. We need a persistent, time-independent context for our information, an institutional memory of how our information relates, and it doesn’t take a self-aware Skynet to get us there.
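
To make "relate it up front, then stop worrying about it" a little more concrete, here is a minimal Python sketch. It is an illustration only, not code from the Jonas and Sokol chapter: the ContextStore class, its ingest method, the link_all helper, and the sample records are all hypothetical, and "related" is assumed to mean nothing more than "shares an attribute value."

```python
from collections import defaultdict


class ContextStore:
    """Toy persistent context: relates each record to earlier ones on arrival."""

    def __init__(self):
        self.records = {}                    # record id -> attributes
        self.by_feature = defaultdict(set)   # (field, value) -> record ids
        self.related = defaultdict(set)      # record id -> related record ids

    def ingest(self, record_id, attributes):
        """Store a record and link it to every earlier record sharing a value."""
        matches = set()
        for feature in attributes.items():   # feature is a (field, value) pair
            matches |= self.by_feature[feature]
            self.by_feature[feature].add(record_id)
        matches.discard(record_id)           # a re-ingested record never links to itself
        self.records[record_id] = attributes
        for other in matches:
            self.related[record_id].add(other)
            self.related[other].add(record_id)
        return matches


def link_all(arrivals):
    """Ingest records in the given order and return the resulting relationships."""
    store = ContextStore()
    for record_id, attributes in arrivals:
        store.ingest(record_id, attributes)
    return {rid: frozenset(ids) for rid, ids in store.related.items()}


arrivals = [
    ("hr-1", {"phone": "555-0100", "name": "K. Lynch"}),
    ("crm-7", {"phone": "555-0100"}),
    ("web-3", {"email": "kl@example.com"}),
]
forward = link_all(arrivals)
backward = link_all(list(reversed(arrivals)))
```

In this sketch the relationships come out the same whichever order the records arrive in, because each new record is checked against everything already known at the moment it is stored. The work is paid once, at ingest, and no later query has to rediscover the link.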

Semantically Reconciled and Relationship-Aware Directories

This is as technical as Jeff gets in the blog entry, and the concepts are actually very simple ones. Think perpetually-updated synonym list, along with synonym identifiers, so that you’ve got a relationship-based index to all enterprise information. The trick is to update these “directories” the moment data is added to the system, and be willing to incur the overhead of updating entire sets of relationships when necessary. So, for instance, you discover that the “Kevin Lynch” you’ve been pointing to is not the “Kevin Lynch” another system has been pointing to; you need to separate what you know about the two of them right then. You can’t let your ground truth “drift” over time, forcing a now-computationally-impossible update of entire systems. This requires discipline. You do have to incur the cost of creating semantic directories to begin, and modify existing systems to contribute to them and operate with them. We haven’t spent a lot of effort on semantic reconciliation (yet), primarily because we have so much easily reconciled data that is not (yet) linked data, but when we get time, we want to spend it with the SILK (Semantic Inferencing on Large Knowledge) project looking for scalable methods.
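
A rough sketch of what such a directory could look like follows, again in Python and again purely illustrative. The SemanticDirectory class, its register and split methods, and the "crm" and "hr" system names are assumptions made for the example; a real reconciliation engine would decide matches from evidence, where this toy simply accepts an asserted match.

```python
import itertools


class SemanticDirectory:
    """Toy directory mapping (system, key) references to shared entity ids."""

    def __init__(self):
        self._next_id = itertools.count(1)
        self.entity_of = {}   # (system, key) -> entity id
        self.members = {}     # entity id -> set of (system, key) references

    def _new_entity(self):
        entity = f"entity-{next(self._next_id)}"
        self.members[entity] = set()
        return entity

    def register(self, system, key, same_as=None):
        """Add a reference at write time; same_as names an existing matching reference."""
        ref = (system, key)
        entity = self.entity_of[same_as] if same_as is not None else self._new_entity()
        self.entity_of[ref] = entity
        self.members[entity].add(ref)
        return entity

    def split(self, refs):
        """Move references to a new entity as soon as they prove to be someone else."""
        new_entity = self._new_entity()
        for ref in set(refs):
            self.members[self.entity_of[ref]].discard(ref)
            self.entity_of[ref] = new_entity
            self.members[new_entity].add(ref)
        return new_entity


directory = SemanticDirectory()
crm_entity = directory.register("crm", "Kevin Lynch")
hr_entity = directory.register("hr", "Kevin Lynch", same_as=("crm", "Kevin Lynch"))

# Later we learn the HR record is a different Kevin Lynch: separate them right then.
other_kevin = directory.split([("hr", "Kevin Lynch")])
```

The point of the split method is the discipline described above: the correction is applied to the directory the moment it is discovered, so every system that resolves a reference through the directory sees the two people as distinct from then on.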

A Divestiture of Responsibility?

If you’ve gotten this far, you’re probably seething at the thought that users will no longer be needed to ask smart questions. No, systems will never get that smart. This approach does not divest anyone of their responsibility, and is analogous to how the semantic web has been trying to generalize business logic by putting it in the data itself, rather than in specific program code that runs in specific places at specified times. Take that even further now, by collapsing time to zero, and moving that logic up front, so that data is smarter, sooner. We’ll be able to capture what we know about something and attach it to that something and have it considered every time, by machines and people. When data finds data, the data asks the questions first. More than ever, it will be our responsibility to capture what we know, and add value by linking what we know to our data. This is a short, compelling read, and has me thinking differently about how we need to architect information systems.

The Trouble with Tribbles: Why Dumb Data Multiplies Like Bunnies

I’ve been reading a lot of sci-fi lately, mixed with books about data, which got me thinking about how dumb data is, what the costs of stupid data are, and how smarter data could lower our information costs: for access, for development, and for maintenance.

Why Data are like Tribbles

In a classic original Star Trek episode, “The Trouble with Tribbles,” a gerbil-like organism reproduces rapidly. (You can see 3 minutes of it here, courtesy of CBS.) Tribbles aren’t that smart (cute, but not smart), and they don’t live long, but they multiply and cause problems. While our data doesn’t multiply at the same rate, I’m more surprised than ever at just how many copies of the data there are inside the enterprise, and it certainly lives a long time. Backups, sure, those make sense. Data warehouses, arguably a necessary evil. Let’s back those up. Data marts and operational data stores, well, ok, since we’re going in this direction, and there’s processing that just has to be done on our data to run a real-time enterprise. Our own personal copies? We have to personalize the data, massage it a little more. In Excel. And its presentation, too. In PowerPoint. Let’s back those up too.

The Marginal Cost of a Copy? It’s not zero.

We’re on a slippery slope. Each of us can argue the absolute necessity of the different pieces. Heck, our jobs depend on some of those pieces! Business speed matters. Disk is cheap. The marginal cost of a digital copy is close to zero. But to be absolutely clear, the marginal cost of a digital copy is greater than zero. There are costs, some direct, some hidden, but definitely real costs, and they may be greater than we suspect. It’s not just disk; it’s that each copy we make is less likely to reflect the ground truth of the data. It’s the increasing ambiguity of the data, and our inability to find what we should be using in the first place, that keeps costs mounting. The data isn’t just more difficult to find because there are copies, it’s more difficult to use…because there are copies.

An Echo of a Reflection of a Shadow

Ever made a copy of a copy of a copy? It’s not supposed to matter in the digital world, but in fact, it does. We combine, we aggregate, we summarize. We necessarily have to gloss over some detail. We lose that detail. We lose the ability to track back to the source. We use the corporate copy. We use the departmental copy. What provenance there is is so difficult to navigate that it does not get used. We use our best (read “most easily accessible”) copy of the data and go with that. Is that good enough? It has to be, because we’ve got deadlines to meet, a business to run. It was good enough last time.

What if I could guarantee I could get to the data, correctly each time? What if every part, every product, every person had its own identity, and we were guaranteed to be able to find it? What if, when I found the data, I also automatically found every piece of data it was known to be related to? And their identities? Would we consider the data “smart” then? I sure would.

If every piece of meaningful data had its own identity to begin with, I might not have to copy it so many times. When I did copy it, I’d know what I was copying from and why, and I’d be able to verify that at any time. Data would become a lot less ambiguous. Developers would spend less time finding the data, and more time using their creativity to solve problems. So would information users. Simply providing the links to and from data, while not a panacea, makes the data more useful, keeping it alive, rather than dead in the application. There are some simple principles we can use to make these possibilities realities, and they don’t require a lot of money or elapsed time. They require a change in our thinking about data, our applications, and how we can add value. If we knew, for instance, that our knowledge about a particular thing would be captured and available, guaranteed, then we would be more likely to take the time to capture our knowledge. These simple principles help us do these things: store data once in a place where we know we can get it every time, capture knowledge about that data in a consistent way, and lower the costs for information access and maintenance.
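
As one possible illustration of "know what I was copying from, why, and be able to verify that at any time," here is a small Python sketch. Everything in it is hypothetical: the DataItem class, its copy, is_current, and lineage methods, the fingerprint helper, and the sample part record are invented for the example, and a SHA-256 hash of the JSON payload is assumed to be a good enough stand-in for "the state of the source."

```python
import hashlib
import json
import uuid
from dataclasses import dataclass, field


def fingerprint(payload):
    """Stable hash of a JSON-serializable payload."""
    return hashlib.sha256(json.dumps(payload, sort_keys=True).encode()).hexdigest()


@dataclass
class DataItem:
    payload: dict
    identity: str = field(default_factory=lambda: uuid.uuid4().hex)
    derived_from: "DataItem | None" = None   # the item this one was copied from
    reason: str = ""                          # why the copy was made
    source_fingerprint: str = ""              # state of the source at copy time

    def copy(self, reason):
        """Make a copy that remembers its source, the reason, and the source's state."""
        return DataItem(
            payload=dict(self.payload),
            derived_from=self,
            reason=reason,
            source_fingerprint=fingerprint(self.payload),
        )

    def is_current(self):
        """True if the immediate source is unchanged since this copy was taken."""
        if self.derived_from is None:
            return True
        return fingerprint(self.derived_from.payload) == self.source_fingerprint

    def lineage(self):
        """Identities from this item back to the original source."""
        item, chain = self, []
        while item is not None:
            chain.append(item.identity)
            item = item.derived_from
        return chain


source = DataItem({"part": "A-100", "price": 12.5})
mart = source.copy("monthly summary")
sheet = mart.copy("personal spreadsheet")

fresh_before = mart.is_current()      # source untouched so far
source.payload["price"] = 13.0        # the ground truth moves on
fresh_after = mart.is_current()       # the copy can now tell it is stale
```

Even in this toy version, the spreadsheet copy can be traced back through the data mart to the original record, and each copy can check whether the thing it was copied from has changed since.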
