The Personal Data Store
October 3, 2010
As summer began, I left off blogging in the middle of a discussion about the Personal Data Locker and the user’s bill of rights. Now I want to continue that discussion by reproducing Phil Windley’s excellent blog post from September here (with permission) and annotating it with my comments. I’m putting his text here because I want to be sure all my readers read it. His words are in blue. Mine are in black.
The term Personal Data Locker is a problem. When you say “store” or “locker” people assume that this is a place to put things (not surprisingly). While there will certainly be data stored in the PDS, that really misses its primary purpose: acting as a broker for all the data you’ve got stored all over the place and managing the metadata about that data. That is, it is a single place, but a place of indirection, not storage. The PDS is the place where services that need access to your data will come for permission, metadata, and location. The same goes for services that need to give you data. Consequently, some have taken to calling it a PDX, where “x” stands for the “variable x.” That is, we don’t know what to call the last thing, so we’ll say “x” and leave it at that.
I chose the word “locker” carefully, to help consumers understand that their information will be under their control. But Phil is absolutely right that it’s not a “locker” at all – this again shows that the more old-world analogies we bring to the new world of cloud computing, the more we prevent true innovation. I’m sticking with the word “locker” for now, but it’s definitely no better than “x.”
Here’s a list of a few things that I think distinguish a PDX from just places where your personal data is stored:
user-controlled – the user needs to be in control of the data, who has access, and how it is used. Once that data is in my PDX, I make decisions about it. That doesn’t mean the data might not also be somewhere else. For example, data about my purchases from Amazon will certainly be stored at Amazon and not under my control. But I might also be emailing the receipts to a service that parses them and puts the data in my PDX for my use.
Exactly. Some information is mine; some is yours; and some is shared.
federated – there isn’t one place where your data is stored, but multiple places that the data needs to be able to flow between, in a permissioned way. There’s no center, just a lot of cooperating systems with my PDX orchestrating the interactions. While Amazon might not give my PDX access to and control over my transactions, my phone company might provide a PDX-capable contact service where I choose to store my contact information.
As I say in my book, it doesn’t matter. Whether 90% of the data is on a single server available through a single log-in or whether 10% of it is on your server and the rest is scattered among the clouds – you won’t notice the difference. That’s the key thing. Ideally, the user interface will be the same, no matter where the data is, and the permissions will allow either consolidated or confederated models to work equally. In fact, if the architecture is sufficiently modular, you should be able to choose your user interface separately.
interoperable – various PDX services and brokers have to be able to operate together according to standards to perform their roles. When I take money out of my account at Wells Fargo and deposit it at Chase, I don’t lose part of the value because Chase doesn’t know how to handle some part of the transaction. The monetary system is interoperable with standards and, sometimes, shims that connect it all together.
Interoperability is the key. As I discuss in my book, the standards make things work, both for customer-facing and back-office software. I hope we’ll soon have standards for identity and calendars that help form the basis for an interoperable PDX.
semantic – a PDX knows more about the data that it holds than existing data stores do. Consider Dropbox. I can put all kinds of things in my Dropbox, but it’s syntactic, not semantic. By that I mean that if I want to put healthcare data in Dropbox and control who uses it, I create a folder and put the data in it with specific permissions. The fact that there is a folder with a certain name located at a particular place in the folder hierarchy is purely syntactic. In a semantic world, the data itself is tagged as healthcare data and no matter where it is, it’s protected according to the policies I’ve put in place.
Very important. As people read my book, I get emails every week from someone who has “already built” the personal data locker; some even have patents on their approaches to storing data. But none of them has the semantic keys to make the information work, because that meaning is baked into the standards we’ll adopt.
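To make the syntactic-versus-semantic distinction concrete, here is a minimal sketch in Python. It is an illustration only, not anything Phil or an existing PDX has specified: the `Record` class, the `POLICIES` table, and the party names are all hypothetical. The point it tries to show is that the access decision depends on the tags the data carries, not on the folder or service where the data happens to sit.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Record:
    location: str      # where the data happens to live (ignored by the policy)
    tags: frozenset    # semantic labels that travel with the data itself
    payload: str


# Hypothetical policy table: semantic tag -> parties allowed to read that kind of data.
POLICIES = {
    "healthcare": {"my_doctor"},
    "contacts": {"my_doctor", "phone_company"},
}


def can_read(party: str, record: Record) -> bool:
    """Allow access only if every tag on the record permits the party.

    Untagged data is denied by default in this sketch.
    """
    if not record.tags:
        return False
    return all(party in POLICIES.get(tag, set()) for tag in record.tags)


# The same kind of data stored in two different places gets the same answer.
in_dropbox = Record("dropbox:/misc/scan.pdf", frozenset({"healthcare"}), "lab results")
at_phone_co = Record("phoneco:/uploads/17", frozenset({"healthcare"}), "lab results")

print(can_read("my_doctor", in_dropbox), can_read("my_doctor", at_phone_co))
print(can_read("phone_company", in_dropbox))
```

Note that `location` never appears in `can_read`. In a folder-based scheme the location would be the only thing the permission check could look at.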
portability – a PDX doesn’t trap data in proprietary formats. If my phone company is storing my contact data in the cloud and I decide that I want to move it to my own server or another service, I can, from a technical as well as a policy standpoint. Note that this doesn’t mean we have to wait until thousands upon thousands of data format specifications get hammered out. Semantic metadata can provide a means of translating from one format to another.
As I keep saying, to anyone who will listen, the less translation the better. But Phil is a realist, and he knows translation will always be necessary to some degree. If it helps get things started, great. If we can eventually consolidate formats, even better. Everyone working on this problem has essentially agreed to portability, which to me says we should all get behind the new standards immediately.
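One way to picture translation through semantic metadata is the rough Python sketch below. Everything in it is assumed for illustration: the two field layouts, the shared terms such as `person.full_name`, and the `translate` helper are invented here and do not come from any real standard. Each format declares what its fields mean, and a record moves between formats by way of those shared meanings.

```python
# Hypothetical metadata: each format maps its own field names to shared semantic terms.
PHONE_CO_FORMAT = {
    "fn": "person.full_name",
    "tel": "person.phone",
    "mail": "person.email",
}
OTHER_SERVICE_FORMAT = {
    "displayName": "person.full_name",
    "phoneNumber": "person.phone",
    "emailAddress": "person.email",
}


def translate(record: dict, source_map: dict, target_map: dict) -> dict:
    """Rename fields from the source format to the target format via shared terms.

    Fields with no declared meaning, or no counterpart in the target, are dropped.
    """
    term_to_target = {term: name for name, term in target_map.items()}
    translated = {}
    for name, value in record.items():
        term = source_map.get(name)
        target_name = term_to_target.get(term)
        if target_name is not None:
            translated[target_name] = value
    return translated


contact = {"fn": "Ada Example", "tel": "555-0100", "internal_id": 42}
print(translate(contact, PHONE_CO_FORMAT, OTHER_SERVICE_FORMAT))
```

With a shared vocabulary, each new format needs one mapping to the common terms rather than a converter for every other format, which is why the less translation we need, the better, but some translation is still workable.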
metadata management – one of the primary roles of the PDX is managing data about my data. What are the roles I’ve created? What permissions have I granted as exceptions to the defaults? What semantics surround the various data fields? What data sharing, encoding, and encrypting policies have I created? All of this has to be kept and managed on my behalf in the PDX.
I want to manage my histories – where I’ve surfed online, where I’ve driven, where I’ve walked, what I’ve bought, where I’ve eaten, who I’ve called, etc. Because we will start to consolidate all our data streams, our data stores will be big. We’ll need sharp tools for managing large amounts of data, permissions, groups, etc.
broker services – the PDX is a place where the user manages a federated network of data stores. As an example of why this is important, consider the shortcomings of OAuth. If I use an application that needs access to four OAuth-mediated APIs, I have to go through the OAuth ceremony with each API provider separately. Now consider that I might have dozens of apps that use a popular API. I have to go through the OAuth ceremony for each of them separately. In short, a broker saves us from the N x M explosion of permissioning ceremonies. The same holds for various data services.
That’s getting a bit technical, but I agree that data brokers should work with the information in our personal data lockers, not just give us access to services on a case-by-case basis. Brokers are going to be so important in the coming years. VCs haven’t started placing big bets on them yet, but they will. There will be a wave of investing in brokers, and I hope it’s just around the corner.
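For readers who want to see the N x M arithmetic, here is a small Python sketch. It rests on an assumption that goes beyond Phil's text: that with a broker, each API provider is connected to the broker once and each app is authorized at the broker once, so the count becomes N + M. The `Broker` class and its methods are hypothetical and do not model the real OAuth protocol.

```python
class Broker:
    """Toy permission broker that counts how many ceremonies the user sits through."""

    def __init__(self):
        self.connected_apis = set()
        self.authorized_apps = {}
        self.ceremonies = 0

    def connect_api(self, api: str) -> None:
        # One ceremony per API provider, done once with the broker.
        if api not in self.connected_apis:
            self.connected_apis.add(api)
            self.ceremonies += 1

    def authorize_app(self, app: str, apis: list) -> None:
        # One ceremony per app, no matter how many APIs it needs.
        self.authorized_apps[app] = set(apis) & self.connected_apis
        self.ceremonies += 1

    def may_access(self, app: str, api: str) -> bool:
        return api in self.authorized_apps.get(app, set())


apis = [f"api_{i}" for i in range(4)]
apps = [f"app_{i}" for i in range(12)]

# Without a broker: every app runs a ceremony with every API it uses.
direct_ceremonies = len(apps) * len(apis)

broker = Broker()
for api in apis:
    broker.connect_api(api)
for app in apps:
    broker.authorize_app(app, apis)

print(direct_ceremonies, broker.ceremonies)  # 48 16
```

With a dozen apps and four APIs, the user goes from 48 ceremonies to 16 under these assumptions, and the gap widens as either number grows.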
discoverable – a PDX should provide discoverability for its APIs and schemas so that any application I’m interested in knows how to interact with it. Discoverability protects users from having to completely specify addresses, mappings, and schemas to every application that comes along.
This is why I talk so often about name spaces. The solution to the findability problem is in the naming and descriptions, and this is where we need more standards. For more on this topic, see John Wilbanks’ video on ScienceCommons.org.
automatable and scriptable – a PDX without automation is worse than no PDX at all because it burdens the user rather than saving effort. A PDX will be a player in a larger ecosystem of services. … The PDX is an active participant in the greater ecosystem of services that are cooperating on the user’s behalf.
This is where Phil Windley, one of the architects of the new web, really adds value. He has already designed a new scripting language for mashing up services and is building demos to show how they will work. People will use their (and our) data in ways we can’t even imagine today. We’ll have control over who sees what, but the ability to write scripts and programs that do many small things quickly will usher in a new era of software – one that continually adapts to our changing needs rather than solidifying yesterday’s solutions.
Thanks to Phil for getting that ball rolling. I highly recommend you read his blog, Technometria. Phil is an active participant in IDCommons, which now has started a new home for all these standards and discussions at PersonalDataStore.info. I recommend you visit the site right now, and be sure to watch Markus Sabadello’s excellent video explaining how a personal data locker solves most of the world’s problems.