What follows is a thought experiment I've been toying with for the last few days, a real work-in-progress. It may not be interesting until there is real software to play with, but in the meantime I'm posting it as a project log.
Part 1: Tags and Trees
Tags and Searches
"Tags", to put it mildly, are the new hotness on the web. The most popular social web services use "tags" (or "labels" or "keywords") as a quick way for users to categorize information, and in most cases they work quite nicely. Plug a tag like "Seattle" into del.icio.us or flickr and you will get a mish-mash of links and pictures related to the city with the world's first Starbucks. More sophisticated systems will let you include multiple tags to find content and suggest related tags that can be incorporated in a search. Tags are most useful when you know what you want and simply need to locate something quickly from a lengthy list. After all -- search engines specialize in automatically creating tags or keywords from existing content.
Google recently released a Desktop Search tool to let users search the contents of their hard drives in the same way they would search a web page. Clearly this is a useful tool, but the fact it exists at all is indicative of a larger problem with how users interact with modern PCs. It's possible to store so much data now that we could never remember the exact location of every file, directory, or program that we need.
The earliest computer file systems used flat spaces or single directories to store files. Tape was one of the reasons for this, but there were also many fewer users and many fewer files to work with. Multimedia as we know it today was nonexistent. No MP3 collections or browser caches. Crude hierarchies could be achieved by file naming if necessary.
Hierarchical file systems developed as a way to separate and protect the data of different users on multiuser systems. At first these were only a single level deep, but they soon grew to the multi-level directory structures we know and use today. (And have they grown!) My work and home computers average 20 folders in their root directories, plus another 10 folders that reside on the desktop that are automatically expanded by Windows Explorer. Not to mention the "My Documents" folders which themselves have about a dozen sub-directories, many created by programs looking for a place to stash stuff. A sample of one is usually not a good idea, but I have a feeling I'm not alone in this sort of organizational spillage.
The machine I'm using at the moment has just shy of 150,000 files and 15,000 folders, and the funny thing is that it's only one third full. Even if I created only 1/50th of these files, I still have a major organizational problem on my hands. These are files of every stripe, from assembly code to video clips. Many are in directories that I cannot easily relocate, such as my 'Inetpub' server root directory. Many are in places that I would not want to relocate, like library documentation. Files I need are often stuck in directories that are 4 or more levels deep. Even the tool I use to defragment my drive is placed in a menu that is 5 levels deep. (More if I haven't used it recently.)
Back to the Google Desktop Search. It's one solution to my problem, and much of the time it works quite admirably -- by stripping out bits and pieces of data about my files in addition to the contents of the files themselves, it gives me a way to key in on files that may be what I'm looking for. If I can guess enough of the phrases in an old Word document to narrow my search, I can usually find it. It's handy in a pinch (not to mention the fact that the local caching has saved my tail on a couple of occasions), but in my mind it is still just a stopgap. There will come a point where I will fill GDS's indexes with enough clutter that its utility will die off; lack of 'PageRank' for local files will only hasten this.
User-Created Hierarchies vs Tagged Sets and the Path of Least Resistance
If I were diligent, I would have a nice system for sorting and filing away all the bits and pieces that end up on my machine. As it stands, I have a Project directory for all the things I'm working on, a Utilities directory for the small programs I use and a Program Files directory for the larger ones. There's Downloads for all the joke mp3 and blooper avi files I accumulate. Firefox has not been helpful to me in this regard. It shunts everything I save to the Desktop, which has begun to form its own primitive organization system: Files on the right, shortcuts and folders on the left. I lack the discipline to even turn this Firefox "feature" off and have it prompt me for locations each time I download.
Nonetheless, my primitive hierarchy used to hold up pretty well, that is, until my Project list stretched off the bottom of the screen and began digging a hole to China. I thought about creating additional directory groupings like Web or Photo but that would mean additional drill-down clicks for every single project file I wanted to get to. So now, there's an Old directory for Least Recently Used projects. This scheme appears to be doing okay, but has been complicated by the fact that many programs I use automatically place project files in the My Documents folder, which now has its own Old sub grouping. Every time I want to locate older, but perhaps still useful code, I now have to decide whether it wound up in my original Project directory or the My Documents folder. Then I have to scan the list to see if the project has wound up in the Old group. I could move all my Projects stuff into My Documents, but that might break any hard-coded paths in a given project. (Same problem applies when moving to the Old group, so there's resistance there too.)
So, I want to try something: apply tags to my problem. Let's say I was able to tag all of my old projects as such. For that matter, I could tag all my projects as projects, regardless of which location the folders end up. Last, let's say I consider each project directory name as a tag. This scheme essentially merges two parts of the hierarchy, Projects and My Documents, and yields immediate benefits: If I search for the tag 'project', I will receive a list of all the projects. 'project old' will give me a list of old projects. Better yet, all I have to do is type in the project name if I can remember it.
Obviously this example is not quite fair to hierarchies; if all my projects were in one directory I would simply have to scan the list of names. The path Project\Old is logically equivalent to the result of the search 'Project Old'. (To be more specific, the intersection of the sets of directories tagged Project and Old.) On the other hand, there is that long project list again: the kind of list that tags have done a good job of making manageable.
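To make the set logic concrete, here is a rough Python sketch of a multi-tag search as an intersection of tagged sets. It is not the eventual tool: the folder paths, the function names build_index and search, and the in-memory dictionary are all assumptions invented for illustration. Only the tags 'project' and 'old' come from the example above.

```python
from collections import defaultdict

def build_index(folder_tags):
    """Map each tag to the set of folders that carry it."""
    index = defaultdict(set)
    for folder, tags in folder_tags.items():
        for tag in tags:
            index[tag.lower()].add(folder)
    return index

def search(index, query):
    """Return the folders that carry every tag in the query (logical AND)."""
    tags = query.lower().split()
    if not tags:
        return set()
    # Start from a copy of the first tag's set, then intersect with the rest.
    result = set(index.get(tags[0], set()))
    for tag in tags[1:]:
        result &= index.get(tag, set())
    return result

# Hypothetical folders scattered across two parts of the hierarchy.
folder_tags = {
    r"C:\Project\BlogBlather": {"project", "blogblather"},
    r"C:\Project\Old\SoldierModel": {"project", "old", "soldiermodel"},
    r"C:\My Documents\Old\Invoices": {"project", "old", "invoices"},
    r"C:\My Documents\Taxes": {"taxes"},
}
index = build_index(folder_tags)

for folder in sorted(search(index, "project old")):
    print(folder)
```

The search for 'project old' picks up folders from both Project and My Documents, because where a folder physically lives no longer matters once it carries the right tags.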
Let me try another example: Suppose all my projects are in one long directory, two screen lengths long. Each project has a set of tags associated with it. SoldierModel, part of a mod I'm working on for a World War 2 action game, has the tags 'model ww2 mod 3d lua script'. BlogBlather has the tags 'blog php script' and so on. I have long forgotten about the specifics of each project and what their directories contain, but I remember that I've written an algorithm at some point in the distant past that is related to something new.
I could open up the projects list and start scrolling, alternately glancing back and forth from the directory list to the directory contents pane. I see 10 projects that look promising, and on average it'll take me 5 clicks to get the right project, from which I can attempt to locate the file I'm looking for.
Alternately, I can type in the tag 'script'. It pulls all the results tagged 'script' from the list. It narrows my search to the same 10 I found promising just by scanning the directory list, but it does so immediately and without the volatility of my memory. Let's say as an added feature it lists 'lua' as a related tag, and I remember that the algorithm was in that particular language. 'script lua' leads me to the correct project (or very close if I had other old mods) right off the bat.
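One plausible way to produce that 'related tags' list is to count which other tags appear on the current results. The Python sketch below assumes exactly that; it is a guess at the mechanism, not a description of how any existing service does it. The two projects and their tags are the ones from the example, while the names PROJECTS, search, and related_tags are made up for illustration.

```python
from collections import Counter

# The two example projects and their tags, as given above.
PROJECTS = {
    "SoldierModel": {"model", "ww2", "mod", "3d", "lua", "script"},
    "BlogBlather": {"blog", "php", "script"},
}

def search(projects, query):
    """Return the names of projects that carry every tag in the query."""
    wanted = set(query.split())
    return sorted(name for name, tags in projects.items() if wanted <= tags)

def related_tags(projects, query):
    """List tags that co-occur with the query tags on the matching projects."""
    wanted = set(query.split())
    counts = Counter()
    for name in search(projects, query):
        counts.update(projects[name] - wanted)
    # Most frequent first; ties are broken alphabetically for a stable order.
    return sorted(counts, key=lambda tag: (-counts[tag], tag))

print(search(PROJECTS, "script"))        # both projects match
print(related_tags(PROJECTS, "script"))  # includes 'lua' among the suggestions
print(search(PROJECTS, "script lua"))    # narrows to SoldierModel
```

With a real collection the counts would do more work: tags shared by many of the current results would float to the top of the suggestions.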
Towards a Working Example
So I've presented a couple of stilted examples so far; skepticism over the benefits of tagging directories is definitely warranted because it comes with a good number of drawbacks. For starters, you have to enter the tags, which could take some time and may need repeating if the contents change. The quality of the searches depends on the quality and comprehensiveness of the tagging. Next, tags can be ambiguous: is 'script' referring to interpreted source code or the next big Hollywood blockbuster? What about plural forms or abbreviations? Consistency? The tag search interface must be easy to learn and integrate well with the directory viewer you already use. The advantages of using tags must be obvious, even when compared to a well-formed directory structure.
I'm going to try to convince myself this is worth doing. So, as my own guinea pig, I'm going to demand these specifications:
- Each directory (folder) in the tree can have a set of tags and a textual description associated with it.
- Ability to tag multiple folders at once.
- Fast tag searches using logical AND operator
- Tag search autocompletes tag currently being typed (with previous tags) to improve consistency of use.
- Results list with folder path, tags, and description visible for each result
- 'Related tags' list for currently selected result
- Easy to use search dialog
- Click to launch folder
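A few of these items (tags plus a description per folder, tagging several folders at once, and autocompletion against previously used tags) can be roughed out in Python. Treat this as a sketch under assumptions rather than a design: the Folder and TagStore classes, their method names, and the in-memory storage are all hypothetical, and nothing here touches the real file system.

```python
from dataclasses import dataclass, field

@dataclass
class Folder:
    """A folder path with its description and set of tags."""
    path: str
    description: str = ""
    tags: set = field(default_factory=set)

class TagStore:
    """In-memory store of tagged folders (hypothetical; no persistence)."""

    def __init__(self):
        self.folders = {}

    def tag_many(self, paths, tags):
        """Apply the same tags to several folders at once."""
        for path in paths:
            folder = self.folders.setdefault(path, Folder(path))
            folder.tags.update(tag.lower() for tag in tags)

    def known_tags(self):
        """Every tag used so far, in alphabetical order."""
        return sorted({tag for f in self.folders.values() for tag in f.tags})

    def autocomplete(self, typed):
        """Suggest existing tags for the word currently being typed."""
        words = typed.lower().split()
        if not words or typed.endswith(" "):
            return []  # nothing is being typed right now
        prefix, earlier = words[-1], set(words[:-1])
        # Skip tags already present earlier in the query.
        return [tag for tag in self.known_tags()
                if tag.startswith(prefix) and tag not in earlier]

store = TagStore()
store.tag_many(["SoldierModel", "BlogBlather"], ["script"])
store.tag_many(["SoldierModel"], ["lua", "mod"])
store.tag_many(["Sketches"], ["scratch"])

print(store.autocomplete("sc"))
print(store.autocomplete("script l"))
```

Completing only against tags that already exist is what nudges toward consistency: typing 'sc' offers 'script' before a new variant like 'scripts' gets invented.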
I'm also going to have to force myself to tag most or all of the folders I use regularly and make it harder, at least temporarily, to do things the old way. I think I'll name it Loquacious (cleverest[?] fit of: 'local' tags + del.icio.us).
More when I get some free time...
Hybrid-Search and Storage of Semi-structured Information
http://haystack.lcs.mit.edu/papers/Adar.thesis/main.html
GNOME Storage
http://www.gnome.org/~seth/storage/associative-interfaces.pdf
WinFS Data Model
http://www.c-sharpcorner.com/Longhorn/WinFS/WinFSDataModel.asp
Metadata for the Masses
http://www.adaptivepath.com/publications/essays/archives/000361.php
Tagging and Expression Languages
http://ianso.blogspot.com/2005/01/tagging-and-expression-languages.html
Tags != folksonomies && Tags != Flat name spaces
http://www.corante.com/many/archives/2005/01/24/tags_folksonomies_tags_flat_name_spaces.php
Posted by eric at March 23, 2005 11:43 PM
Comments
Right on. Hierarchy is a very specialized solution that has a lot of problems.
I wonder if the OS could extract a set of tags from your content automatically? Google has the huge advantage of links, which don't really exist on the desktop-- and links help prioritize keywords, so I'm not sure how effective the Google approach can be on the desktop.
Posted by: Jeff Atwood at March 24, 2005 09:49 AM
It seems to me that the essential difference between the tagging scheme and what GDS does is that element of user input. There's a big difference in terms of confidence and quality of results between an item someone has created, worked with, and tagged as opposed to what the GDS algorithm infers about the contents.
On the other hand, tagging is really no substitute for full-text search. So, the schemes end up being complementary.
Posted by: Eric at March 24, 2005 10:11 AM
(I have been meaning to blog about this for a while, but you beat me to it :)
I guess the obvious question is- why tag directories? Why not tag files directly?
The epiphany bookmark editor looks like this.
Every time you add a bookmark, you can select existing tags, or add a new tag. [They aren't called tags, but categories, since they've had this style of bookmarking since before tags were in. Anyway.] I don't see why the file save dialog couldn't look like that, and the 'file open' dialog just be a search interface prompting you for the tags of files you want to look for.
Posted by: Luis Villa at March 24, 2005 11:51 AM
Good point, files are obvious targets for tagging. For the purposes of the post I assumed that files in the same directory are somehow related to each other, which is not necessarily the case.
For now, though, I'm going to run with that assumption and use folders as the targets for my categorization efforts, mostly for reasons of interface simplification. (And implementation -- where do we store these tags, after all?) This will change as I refine the design of my ideal program... and get some code rolling.
Posted by: Eric at March 24, 2005 12:06 PM
every single tagging style from a-z
Posted by: rolando at April 9, 2005 08:25 PM
