One of my “unfinished businesses” is a semantic dictionary. Basically it will be a gigantic database of all words in all their forms (as entities), and they’ll have a set of attributes and relations between them, defining what kind of word it is (noun, verb...) and what its relations to other words are (synonym, antonym, the same word in a different language...).
I expect the primary use to be semantic search. The search engine will use some kind of universal metalanguage, into which all queries in natural languages will be transformed.
I can’t say exactly how (and if) it will work; it’s just a thing I’ve been thinking about for quite a long time. I’m aware there are some nuances and pickles (like the same word having different meanings, and not only in different languages), or how to deal with phrases in search queries (my former thought was just to analyze them word by word, but it’s not that easy).
These are some of the expected unknowns I haven’t done any further research on, so I don’t have a comprehensive overview of them.
And, last, but not least, it will be a nice performance test :)
For a few years I had a vision of opening QPDB to the public – not as a single particle database, but more like a farm of databases. My inspiration was the concept of Wikia.com, where everyone can start their own wiki. I decided to do a similar thing, only with semantic/structured databases instead of wikis.
The crucial part is unfortunately my biggest weakness – making it easy for the user. I began by stripping down the user service and ultimately reduced the offered functionality to the thinnest possible core. My target was to make it “just enough”: to keep the power in the backend and offer a bunch of presets, so to speak.
Application layers are: Databases > Classifications > Records (entities) > Attributes and relations.
In a database, the user will be able to create classifications (in fact entity types; the only available choice will be the name), attribute types (name, unit for numeric values, and data type) and relation types (name only).
Additionally, the user can pick everything from preset templates, one set for home and the other for business. Templates can offer settings beyond the default available ones, like a value range (e.g. 0–120 for the age of a person).
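The template idea above can be sketched roughly like this – the class and field names here are my own illustration, not QPDB’s actual code; the point is that a template can carry extra settings (like a value range) on top of the plain name/unit/data-type trio:

```python
from dataclasses import dataclass
from typing import Optional, Tuple

@dataclass
class AttributeType:
    name: str
    data_type: str = "text"                     # e.g. "text", "number", "datetime"
    unit: Optional[str] = None                  # unit for numeric values
    value_range: Optional[Tuple[float, float]] = None  # template-only extra setting

    def validate(self, value) -> bool:
        """Check a value against the optional range a template may define."""
        if self.data_type == "number" and self.value_range:
            lo, hi = self.value_range
            return lo <= float(value) <= hi
        return True

# A "home" preset template might predefine an age attribute with a range:
age = AttributeType("age", data_type="number", unit="years", value_range=(0, 120))
print(age.validate("35"))   # True
print(age.validate("150"))  # False
```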
For relations, I had to deal with unwanted entity types in the suggest list – but how to get rid of them, when no such filter is available in this installation? Well, I made a little hack in the suggest algorithm: when the name of a relation type matches the name of any classification (entity type), only entities of that type are shown.
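A minimal sketch of that suggest-list hack, with entities simplified to (name, entity type) tuples – the real QPDB records and function names will differ:

```python
def suggest_entities(relation_type, entities, classifications):
    """If the relation type's name matches a classification (entity type)
    name, offer only entities of that type; otherwise offer everything."""
    if relation_type in classifications:
        return [e for e in entities if e[1] == relation_type]
    return entities

entities = [("Prague", "city"), ("Alice", "person"), ("Berlin", "city")]
classifications = {"city", "person"}

# "city" matches a classification name, so only cities are suggested:
print(suggest_entities("city", entities, classifications))
# "owner" matches no classification, so the full list comes back:
print(suggest_entities("owner", entities, classifications))
```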
It’s the first public release of QB, aimed particularly at end users as content creators. QB had a public release two years ago, as Particle Database, but that was mainly to receive some feedback (which it did). Although I kinda like the current style of QB, I feel this isn’t “it”.
I can imagine it’s still far off end users’ expectations and may be confusing and hard to comprehend. But I’m still trying to think outside the box, and I at least figured out some nifty stuff this time.
OK, so I decided to create a semantic database. From the relational database schema I had the basics – there’s a row, a row can have some columns, and those may store the row’s own values, or a reference to some other row (in the same or a different table).
Years later I determined the exact names I’d use for it: row = entity, column = attribute, reference = relation. And yes, any resemblance to the ER model is purely intentional ;-)
My database model was quite obvious at this point: entity, attrib, relation, entity-attrib (m:n), entity-relation (m:n).
Then I started to think about what structure those tables should have. I started with the “data” tables – entity, attrib and relation.
The first two columns were quite obvious – an integer primary key (id) plus a tinyint for some kind of “type”. Now I needed something for the data.
My first thought was: one column for a text value + one for a numeric value. And maybe one for datetime. Such a model is good for performance and indexing, but I didn’t like that the code would have to decide all the time where to put or look for a value, and that one of the columns would always be NULL. A single varchar column for the value would be much better, but then I’d lose the performance advantage. And what about decimal numbers? Or dual values, like coordinates?
It felt like a Sophie’s choice, and after weighing all the pros and cons for quite a long time, I decided to sacrifice performance. I was going to store everything as text, but the model would be pure.
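The “everything as text” decision then implies an encode/decode layer somewhere in the code: the attrib’s declared data type drives the conversion, so the value column itself can stay a single varchar. This is a hypothetical sketch – the type names and the coordinate separator are my assumptions, not QPDB’s:

```python
from datetime import datetime

def encode_value(value) -> str:
    """Everything goes into the single text column as a string."""
    if isinstance(value, datetime):
        return value.isoformat()
    return str(value)

def decode_value(text: str, data_type: str):
    """The attrib type's declared data type decides how to read it back."""
    if data_type == "int":
        return int(text)
    if data_type == "decimal":
        return float(text)
    if data_type == "datetime":
        return datetime.fromisoformat(text)
    if data_type == "coords":                 # dual values like coordinates
        lat, lon = text.split(";")
        return (float(lat), float(lon))
    return text                               # plain text by default

print(decode_value(encode_value(42), "int"))      # 42
print(decode_value("50.08;14.43", "coords"))      # (50.08, 14.43)
```

The trade-off is exactly the one described above: the schema stays pure and uniform, while typed comparisons and range queries get slower because the database only ever sees text.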
Then I started to expand my thoughts about the whole schema a little more. Relations should be able to contain attribs as well – so another m:n table emerges. And all rels, ents and attrs should have their own type tables as well... there we go, another 3 tables. That’s 9 now, and that’s it!
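Those nine tables could look roughly like this – a sketch in SQLite rather than the original MySQL, with table and column names being my guesses at the model described above. Values are stored as TEXT, matching the “sacrifice performance, keep the model pure” decision:

```python
import sqlite3

ddl = """
-- the three type tables
CREATE TABLE entity_type   (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE attrib_type   (id INTEGER PRIMARY KEY, name TEXT, unit TEXT, data_type TEXT);
CREATE TABLE relation_type (id INTEGER PRIMARY KEY, name TEXT);

-- the three data tables
CREATE TABLE entity   (id INTEGER PRIMARY KEY, type INTEGER REFERENCES entity_type(id));
CREATE TABLE attrib   (id INTEGER PRIMARY KEY, type INTEGER REFERENCES attrib_type(id),
                       value TEXT);  -- everything stored as text
CREATE TABLE relation (id INTEGER PRIMARY KEY, type INTEGER REFERENCES relation_type(id));

-- the three m:n link tables
CREATE TABLE entity_attrib   (entity INTEGER REFERENCES entity(id),
                              attrib INTEGER REFERENCES attrib(id));
CREATE TABLE entity_relation (entity INTEGER REFERENCES entity(id),
                              relation INTEGER REFERENCES relation(id));
CREATE TABLE relation_attrib (relation INTEGER REFERENCES relation(id),
                              attrib INTEGER REFERENCES attrib(id));
"""

con = sqlite3.connect(":memory:")
con.executescript(ddl)
tables = [r[0] for r in con.execute(
    "SELECT name FROM sqlite_master WHERE type='table' ORDER BY name")]
print(len(tables), tables)  # 9 tables
```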
Those tables are just for data, nothing else. Sure, pretty much everything in QetriX is “data”, but I needed more tables for specific purposes – like translations, large text storage, secure file storage, activity log, change log, statistics, feedback or various caches. The schema grew big once again.
I understand everything in the world is somehow interconnected. I can buy a candy bar, manufactured by some company, whose HQ building has been opened the same day I graduated. If I want to cover all this by the data platform, the model must be really versatile.
I can imagine two approaches: an extensive and extensible database model, or a universal data structure. For performance reasons I decided to go with the first option. My proof of concept was a lightweight encyclopedia for mobile devices called “Kilopedia” (the max size of an article was 1 kB – hence the name).
In MySQL I created an “advanced database structure”, where each entity type has its own table. Every column (domain) in a table stored either a value, as an attribute, or a foreign key (a reference to a row in a different table), as a relation. I had tables like “city”, “person”, “phone”, “airport”, “car”, “fruit”, etc.
The user service worked closely with the table structure and data types. When I rendered a form for data entry, I called DESCRIBE on the database table and used the Field and Type columns to create HTML INPUT elements with appropriate data validation. When I wanted to add a new attribute to an entity type, I simply entered its name and the system executed ALTER TABLE ... ADD COLUMN in the background.
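Kilopedia’s trick in miniature: derive the data-entry form from the table schema itself. The original used MySQL’s DESCRIBE; SQLite’s PRAGMA table_info is the closest analog here, and the form-rendering function is my own illustration:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE person (id INTEGER PRIMARY KEY, name TEXT, age INTEGER)")

def render_form(table):
    """Build HTML inputs from the table's own column definitions."""
    inputs = []
    for cid, name, ctype, notnull, default, pk in con.execute(
            f"PRAGMA table_info({table})"):
        if pk:
            continue  # the primary key is generated, not entered by the user
        input_type = "number" if ctype.upper() == "INTEGER" else "text"
        inputs.append(f'<input name="{name}" type="{input_type}">')
    return inputs

# Adding a new attribute = adding a column; the form follows automatically:
con.execute("ALTER TABLE person ADD COLUMN city TEXT")
for html in render_form("person"):
    print(html)
```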
It worked like a charm; I was quite happy with it. It was nice, easy, fast, reliable and efficient. Until I noticed I had hundreds of tables in my schema, which is OK in general, but not for me :) All along I’d been on a quest to create a simple, clean and compact schema.
The downside of the alternative was that it would require some unpopular database denormalization – maybe a universal table with a lot of NULL columns. It would be nice to have some of those columns as a number, some as a varchar, datetime, bool etc.
I tried to design such a table, but even before I started working on a prototype, I already knew it was a dead end. So I scrapped the whole idea. But even a bad idea can move your thinking towards the final solution, or teach you something you’ll be able to use in the future.
And there was one more thing I didn’t like about Kilopedia: the content of the articles was language-dependent, with no chance of being automatically translated. Like in Wikipedia, it would require creating separate text for each language, the content might vary between languages, and every change would have to be made in all of them.
For me this was unacceptable, and I knew the article part had to go. I wanted all changes to happen at once, in all languages, with a single modification. Only semantic values would do the trick.