Since Matt Knox talked about ThruDB at last Tuesday’s NYC.rb meeting, I have been thinking about document-oriented databases, about how tired I am of SQL, about how tired I am of trying to scale database servers, about how tempting it is to have more flexible models and data structures, and about how tempting it is to have a clear and simple scalability path.
The samples included in the ThruDB tutorial are, to be honest, ugly. But they are designed to show how Thrift provides language-agnostic data types and how ThruDB can be accessed from different languages.
However, I have several ideas in my head about how to implement something I’m calling, for the time being, ActiveDocument. It won’t be a direct replacement for ActiveRecord, but it will have similar features (such as validations and callback hooks) and it will allow for very simple usage of ThruDB. I might later add support for CouchDB, SimpleDB and other similar technologies, but just as Rails doesn’t try to be a full database server abstraction, your ActiveDocument code will not work on different servers unless it’s limited to very simple operations. The world of document-oriented databases is even less standardized than that of relational database servers.
Here’s a quick look at how it might work:
class User < ActiveDocument::Model
  attribute :login, :string, :indexed, :sortable
  attribute :email, :string, :indexed
  attribute :created_on, :datetime
  attribute :password, :string
  has_many :bookmarks
end

class Bookmark < ActiveDocument::Model
  attribute :title, :string, :indexed
  attribute :url, :string, :indexed
  belongs_to :user
end

# Dynamic finder on an indexed attribute
User.find_by_login("sd")

# Lucene-style query: logins starting with "s", created within a date range
User.find(:all, :conditions => "login:s* AND created_on:[20071201 TO 20080115]")
As you can see, the two biggest differences from plain old ActiveRecord are that the model will have to define its own schema, and that queries will use the Lucene query syntax.
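To make the "model defines its own schema" idea a bit more concrete, here is a minimal sketch of how an attribute class macro could record the schema. ActiveDocument doesn’t exist yet, so everything here (the schema method, the way options are stored, the constructor) is an assumption for illustration, not the final design:

```ruby
# Hypothetical sketch only: none of this is a real library.
module ActiveDocument
  class Model
    # Each model class keeps its own schema:
    #   attribute name => { :type => ..., :options => [...] }
    def self.schema
      @schema ||= {}
    end

    # Class macro used inside model definitions, e.g.
    #   attribute :login, :string, :indexed, :sortable
    def self.attribute(name, type, *options)
      schema[name] = { :type => type, :options => options }
      attr_accessor name
    end

    # Accept only attributes that were declared in the schema.
    def initialize(values = {})
      values.each do |name, value|
        unless self.class.schema.key?(name)
          raise ArgumentError, "unknown attribute: #{name}"
        end
        public_send("#{name}=", value)
      end
    end
  end
end

class User < ActiveDocument::Model
  attribute :login, :string, :indexed, :sortable
  attribute :email, :string, :indexed
end

user = User.new(:login => "sd", :email => "sd@example.com")
```

Keeping the options (:indexed, :sortable) in the schema hash means the code that later talks to the index can decide which fields to send to it, without the model itself knowing anything about Lucene.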
Relationships would be defined using fields with lists of IDs, and queried using Lucene’s fast indexes. This might make models too big when they have a large number of related objects, but that’s a problem to be solved later.
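A rough sketch of the ID-list idea, assuming an in-memory hash as a stand-in for the document store (a real version would go through ThruDB and its index instead). The class and method names are placeholders I made up for this example:

```ruby
# In-memory stand-in for the document store, keyed by document ID.
# This is an assumption for the example; it is not how ThruDB is accessed.
STORE = {}

class Bookmark
  attr_reader :id, :title

  def initialize(id, title)
    @id = id
    @title = title
    STORE[id] = self
  end
end

class User
  attr_reader :login, :bookmark_ids

  def initialize(login)
    @login = login
    @bookmark_ids = []  # the relationship is just a list of IDs on the document
  end

  def add_bookmark(bookmark)
    @bookmark_ids << bookmark.id
  end

  # Resolve the IDs one by one; IDs with no stored document are skipped.
  def bookmarks
    @bookmark_ids.map { |id| STORE[id] }.compact
  end
end

user = User.new("sd")
user.add_bookmark(Bookmark.new("b1", "ThruDB tutorial"))
user.add_bookmark(Bookmark.new("b2", "NYC.rb"))
titles = user.bookmarks.map { |b| b.title }
```

The size problem mentioned above is visible here: bookmark_ids grows with every related object, and the whole list travels with the user document.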
Since document-oriented databases have no concept of joins, some queries will definitely be slower than their SQL counterparts, because they have to make multiple calls to the server to retrieve individual objects. However, each one of those calls would be simpler and easier to cache, which I hope will reduce the performance impact. And as long as it’s not 100 times slower, I’m willing to trade off some performance for the promise of infinite scalability.
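To illustrate why simple by-ID calls are easy to cache, here is a small sketch of a caching wrapper. The backend can be anything that responds to fetch(id); I’m using a plain Hash as a stand-in for the server, and CachingFetcher is a made-up name:

```ruby
# Hypothetical caching wrapper around a by-ID document lookup.
class CachingFetcher
  attr_reader :backend_calls

  def initialize(backend)
    @backend = backend   # anything that responds to #fetch(id), e.g. a Hash
    @cache = {}
    @backend_calls = 0   # counts round trips that actually reach the backend
  end

  # Return the cached document if we have it, otherwise ask the backend once.
  def get(id)
    return @cache[id] if @cache.key?(id)
    @backend_calls += 1
    @cache[id] = @backend.fetch(id)
  end

  # A "join" becomes a series of simple, individually cacheable lookups.
  def get_all(ids)
    ids.map { |id| get(id) }
  end
end

backend = { 1 => "Ruby home page", 2 => "ThruDB tutorial" }
fetcher = CachingFetcher.new(backend)
first  = fetcher.get_all([1, 2])
second = fetcher.get_all([1, 2, 1])  # served entirely from the cache
```

This sketch never expires or invalidates entries, which a real cache would need to do when documents change.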
And since the models will be more flexible, you can probably skip a lot of traditional SQL tables and store the data directly in the model itself. For example, users can have preference arrays or hashes, which would have been separate tables in SQL but are just additional attributes in ThruDB.
Speaking of attributes: ThruDB uses Thrift for its own API, and the tutorials suggest using it to encode the documents themselves, but the API doesn’t require that. I’ve been trying to figure out how to encode a Thrift object along with its own class name, to make it easier to decode afterwards, especially when performing polymorphic queries. Perhaps I’ll have to use double encoding, with an envelope Thrift object containing the class name and the encoded string. Or perhaps I’ll use YAML to encode an attribute hash. YAML is tempting because it will allow for more complex objects and for dynamic schemas (e.g. an attribute that’s a hash of hashes containing values of different types).
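Here is a minimal sketch of the YAML option, assuming an envelope hash with a "class" key and an "attributes" key. The helper names (encode_document, decode_document) and the KNOWN_MODELS registry are hypothetical, and the model classes are bare placeholders:

```ruby
require "yaml"

# Bare placeholder models that just wrap an attribute hash.
class User
  attr_reader :attributes

  def initialize(attributes = {})
    @attributes = attributes
  end
end

class Bookmark
  attr_reader :attributes

  def initialize(attributes = {})
    @attributes = attributes
  end
end

# Registry of classes we are willing to instantiate, keyed by class name.
# Looking names up here avoids turning arbitrary strings into constants.
KNOWN_MODELS = [User, Bookmark].each_with_object({}) do |klass, registry|
  registry[klass.name] = klass
end

# Wrap the attribute hash in an envelope that records the class name.
def encode_document(model)
  YAML.dump("class" => model.class.name, "attributes" => model.attributes)
end

# Read the envelope, pick the class from the registry, rebuild the object.
def decode_document(string)
  envelope = YAML.safe_load(string)
  klass = KNOWN_MODELS.fetch(envelope["class"])
  klass.new(envelope["attributes"])
end

user = User.new(
  "login" => "sd",
  "preferences" => { "theme" => "dark", "per_page" => 20 }
)
encoded = encode_document(user)
decoded = decode_document(encoded)
```

Because the class name travels with the document, a query that returns a mix of users and bookmarks can decode each result into the right class. Note that this sketch uses string keys throughout, since YAML.safe_load rejects symbols by default.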
Anyway, I’m starting to write the code, and it looks like it might be possible to have a working prototype a lot sooner than I thought possible at first.
If you’re interested, just drop me a note, leave a comment, send me an email or look for me as ‘sd’ on Freenode’s #nyc.rb.