Andy Powell (@andypowe11) shared the text of Ramakrishnan & Tomkins (2007) “Toward a PeopleWeb”. According to the authors, “Attentional metadata is increasingly sought after and is beginning to accumulate in significant volume, suggesting a paradigm shift – and simultaneously raising serious questions about user privacy.” (63) A shift from what to what, I wonder? They argue, “As people and objects acquire metadata while moving across Web sites, a new kind of interwoven community fabric will emerge.” (64)
There is a disturbing passivity to this assertion. People and objects will not simply acquire metadata. They will largely be given it by sites that have been engineered to do so for particular purposes.
They observe (correctly imo) that, “… the theme of a centralized versus a distributed infrastructure will arise frequently. While the former approach has the appeal of technical simplicity, the Web has repeatedly shown itself to be anarchistic, and distributed solutions are viable for many of the problems we consider.” (65) This may be the paradigm shift. But the implications are extensive.
In the case of a user adopting multiple personas, they suggest that, “… conflating these identities is not permissible without the user’s explicit opt-in.” (65) Well, yes, but who polices or engineers this benign autonomist Web? They suggest that it will be the market: people will abandon irresponsible sites. Maybe this article just shows its age. Although published in August ’07, the thinking and writing will have been done a year or so before, reflecting a world before Facebook, Twitter and, in the UK, the national identity register.
They suggest a global or canonical reference scheme for global data objects (anything, including people) and a metadata scheme for global data objects having four parts (STAT): stars, tags, attention and text. (67)
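Ramakrishnan & Tomkins don’t give a concrete serialisation for STAT, but a record along these lines is roughly what they seem to have in mind (a minimal sketch; the field names, types and the example identifier are my own assumptions, not theirs):

```python
from dataclasses import dataclass, field

@dataclass
class StatRecord:
    """Hypothetical STAT metadata for a globally identified object.

    S - stars:     numeric ratings left by users
    T - tags:      free-form labels applied by users
    A - attention: counts of views, clicks, dwell time, etc.
    T - text:      free text (comments, reviews) attached to the object
    """
    object_uri: str                                # canonical global identifier
    stars: list = field(default_factory=list)      # e.g. [4, 5, 3]
    tags: list = field(default_factory=list)       # e.g. ["privacy", "web"]
    attention: dict = field(default_factory=dict)  # e.g. {"views": 1200}
    text: list = field(default_factory=list)       # e.g. ["Interesting paper"]

    def mean_stars(self):
        """Average star rating, or None if no one has rated the object."""
        return sum(self.stars) / len(self.stars) if self.stars else None

# Illustrative use with a made-up identifier:
record = StatRecord("urn:example:peopleweb-paper", stars=[4, 5, 3],
                    tags=["metadata", "privacy"])
```

The interesting design question — which the paper leaves open — is where such records live: in one central store, or scattered across the sites that generated them.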
When they get to the numbers around text data, it strikes me they are out by a couple of orders of magnitude: 60 billion emails per day? Every human sends 10 emails per day? Half of them have never heard a dial tone, and of the other half, half are just getting on with mobile telephony. But I am not sure precision is what is needed. The numbers are big, and the metadata numbers are bigger. But there is still a problem in three parts:
– access to the “highway”
– access to the data storage media
– describing and locating something (metadata for global objects)
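The arithmetic behind my scepticism about the email figure goes like this, using round 2007-ish numbers (all figures illustrative, not from the paper):

```python
# Back-of-envelope check on the 60-billion-emails-a-day figure.
# Every number here is a rough, illustrative assumption.
emails_per_day = 60e9          # the figure quoted by Ramakrishnan & Tomkins
world_population = 6.5e9       # roughly the 2007 world population

# Naively spread over everyone alive, that is about 9 emails
# per person per day -- including the half with no dial tone.
per_capita = emails_per_day / world_population

# If, say, only a quarter of the world actually uses email, the
# real senders would have to average nearer 40 messages a day.
online_fraction = 0.25
per_user = emails_per_day / (world_population * online_fraction)
```

Either the figure is inflated (spam, machine-generated mail), or a small minority is doing an implausible amount of typing. Either way, the headline number says little about human textual output.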
I suggest the paradigm shift will be in two parts: a move away from restricted vocabularies *and* from centralised to distributed infrastructure.
The Internet is largely a toll road. Access depends on attachment to financial, workplace or educational institutions (you need a bank account), to an ISP, a mobile data provider and so on. It is this that allows Marc Andreessen to assert that there is “big money to be made in web infrastructure” http://gigaom.com/2009/07/05/marc-andreessen-sees-gold-mine-in-building-web%E2%80%99s-innards/
While data storage costs are falling and, Ramakrishnan & Tomkins suggest, “… any company that could afford to hire 10 more workers for a business-critical purpose could choose instead to store the planet’s entire textual output going forward to eternity”, nevertheless the provision of such storage is unlikely to be for altruistic purposes. Google and Yahoo might aspire to approach this kind of capacity, but in reality a number of organisations will each hold a part of this data for their own purposes: eBay for market trading, Amazon for selling stuff, Google because they can – oh yes – and to sell advertising, Yahoo because they wannabe Google. Access to global objects will not be unlimited. Valued objects will be restricted in order to create or protect revenue streams (as in current academic publishing practice). And governments will hold masses of text data on people, their movements, their health, their education and so on.
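Whether the “10 workers” claim holds depends entirely on what counts as the planet’s textual output. A quick sketch (every figure below is my own illustrative assumption, not a number from the paper) shows how sensitive it is:

```python
# Rough check of the "10 workers vs. the planet's text" comparison.
# All figures are illustrative assumptions with round late-2000s values.
worker_cost_per_year = 100_000           # fully loaded cost per worker, USD
budget = 10 * worker_cost_per_year       # $1M per year to spend on storage

storage_cost_per_tb = 500                # USD per terabyte of raw disk
affordable_tb = budget / storage_cost_per_tb   # 2,000 TB = 2 PB per year

# Taking the paper's 60e9 emails/day at ~2 KB of text each:
text_per_year_tb = 60e9 * 2e3 * 365 / 1e12     # ~44,000 TB per year

# On these numbers the email stream alone outruns the budget ~20x,
# so the claim only works if "textual output" means something much
# smaller -- unique human-authored text, say, with spam and
# duplication stripped out.
ratio = text_per_year_tb / affordable_tb
```

Which is rather my point: the big numbers are doing rhetorical work, and the conclusion flips depending on definitions nobody states.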
The third part of the problem – big as the other two are – is probably bigger.
Ontologies and restricted vocabularies have at least two inter-related problems. The first is that decisions need to be made and the decision rules “owned”. What is included and what is excluded in the information model will shape or weight the model of the data object in particular ways. That is, restricted vocabularies are necessarily a part of, and co-constituted with, a centralised infrastructure. The second problem arises from this. The shaping and weighting of the model will be used to advantage the owners of the decision rules. The recent uproar in the gay community about Amazon excluding books with a positive message about homosexuality from their ranking algorithms is a signal instance of this problem. Even inadvertently, any restricted vocabulary will embody tacit understandings and values.
So this leaves me back where I always end up: mesh networks, distributed databases and natural language processing. How this plays out, I do not know (clearly!). I suspect neural networks are part of this scenario, and ultimately the self-replicating Turing machine. But it strikes me all other approaches are barking up the wrong tree. The big leap will be from the centralised, highly aggregated services which facilitate hierarchies and the extraction of surplus value (or wealth creation: take your choice) to a radically distributed architecture using natural language as metadata.
Ramakrishnan, R. & Tomkins, A. (2007). Toward a PeopleWeb. Computer, 40(8), 63-72.