Saturday, December 28, 2013

Structured database of everything

In May 2011 I wrote publicly about QB for the first time, in a forum of the Czech online magazine Lupa.cz, which focuses on the internet. I wrote the post in my native Czech, so I have translated it here for you to enjoy:

Hello,

I have designed a universal system for structured data management to fulfill my frequent need for quickly retrieving specific information on the go. Services like Wikipedia don't suit me due to their complex nature, lack of data availability for the Czech Republic, or reluctance to record "every little thing."

The basic implementation can be found at http://q.q3x.net/ and, besides the database itself, it includes a few useful tools (unit and currency converter, whois, distances, etc.) that are further utilized by the system itself. For example, the data is stored in standard units and converted according to the selected language (e.g., English uses dollars, miles, ounces, etc.) when displayed, and additional calculations are performed if needed (population density, for instance). Instead of a traditional API, data can be obtained by selecting a rendering module for a specific format (e.g., TSV).
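
The store-in-standard-units, convert-at-display-time approach described above can be sketched roughly like this. This is a minimal illustration, not the actual QB code; the locale table, unit names, and conversion factors are my own assumptions:

```python
# Minimal sketch of locale-aware unit rendering: values are stored
# in SI units and converted only when displayed. Hypothetical data,
# not the actual QB implementation.

# For each locale: stored quantity -> (display unit, factor from SI).
UNIT_PREFS = {
    "en-US": {"length_m": ("mi", 1 / 1609.344),
              "mass_kg": ("oz", 1000 / 28.349523125)},
    "cs-CZ": {"length_m": ("km", 0.001),
              "mass_kg": ("kg", 1.0)},
}

def render(quantity: str, value: float, locale: str) -> str:
    """Convert a stored SI value into the unit preferred by the locale."""
    unit, factor = UNIT_PREFS[locale][quantity]
    return f"{value * factor:.2f} {unit}"

print(render("length_m", 5000.0, "en-US"))  # miles for English
print(render("length_m", 5000.0, "cs-CZ"))  # kilometres for Czech
```

Derived values such as population density would be computed the same way at render time, from the stored SI figures.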

Personally, I utilize information about aircraft based on their registration (typically age), information about people (typically age), and places (typically population count). The system can also record the content of a library or books read with ratings, the same for movies, TV series, or music. Products can be found using the EAN, and if data is available, it shows where the product is cheapest nearby, how far it is, and how quickly I need to go to catch the bus heading there. The system can even handle a complete CMDB (Configuration Management Database); that's why it was designed in a decentralized manner with the option for interconnection.

The ultimate goal is to synthesize Wikipedia, Freebase, Wikia, WolframAlpha, along with local databases and something extra. There is no underlying business plan; it's not meant to be a means to get rich but primarily for entertainment, relaxation, and an opportunity to avoid stagnation. However, I don't want to be the only one benefiting from the system, so I welcome questions, opinions, advice, feedback, and criticism... Thank you.

I'm aware that data is the alpha and omega, and I gather it in every possible way. Ideally, crowdsourcing would be the solution, but without a "crowd," it functions poorly. So please take the data with a grain of salt. Essentially, it's there mainly for testing purposes, to evaluate performance and gradually refine the concept. And finally, I apologize in advance for any parse errors; it's a work in progress.

> Quote: Petr Hejl 25th May 2011, 18:57:27

> The plan is quite ambitious. The problem will be with the data because each database or website has almost a different format. I don't want to discourage you, but WolframAlpha has tried this and failed miserably. Unless you invent a universal parser...

I don't assume I would crawl the web with a crawler and extract structured data. That never generates a reasonable level of data cleanliness. In the beginning, I tried to write a DOM parser for Wikipedia, and even on a single project, there are so many exceptions that I eventually gave up. So the assumption is that data is populated with pre-prepared batches (there are tons of them online, typically XLS or tables in PDF) or manually (using suitable tools, which is not such a hassle).

Honestly, this whole endeavor was initially provoked by disappointment with the level of data on the internet, aiming to create a purely data-oriented platform with solid boundaries. I built a fully customizable application on top of the data model, which formed the basis of this entire concept. So from the very beginning, I assume that the primary source of data will be a keyboard or defined formats (VCF, GPX, XML, XLS, etc.) for which I already have import algorithms.
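
The batch-import path described above can be sketched like this. The column names, the TSV sample, and the normalization step are illustrative assumptions, not the actual import algorithms mentioned in the post:

```python
import csv
import io

# Hypothetical pre-prepared batch: a TSV export with a header row,
# the kind of file the post describes. On import, values are
# normalized into standard (SI) units before storage.
BATCH = (
    "name\tpopulation\tarea_km2\n"
    "Prague\t1280000\t496\n"
    "Brno\t379000\t230\n"
)

def import_batch(tsv_text: str) -> list[dict]:
    """Parse a TSV batch and normalize units for storage."""
    rows = []
    for rec in csv.DictReader(io.StringIO(tsv_text), delimiter="\t"):
        rows.append({
            "name": rec["name"],
            "population": int(rec["population"]),
            "area_m2": float(rec["area_km2"]) * 1_000_000,  # store in SI
        })
    return rows

cities = import_batch(BATCH)
print(cities[0]["name"], cities[0]["area_m2"])
```

The same shape works for the other defined formats (VCF, GPX, XML, XLS): a per-format reader, followed by one shared normalization step.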

> Quote: Ondra 25th May 2011, 20:18:37

> 1) Not everyone is a geek. (Is it for "normal people"?) ;-)

> 2) http://www.uoou.cz

No, it's not :) Currently, I'm working on adapting it for "normal people," but I'm already quite warped myself...

How does it differ from similar systems? I have given a lot of thought to personal data, but the information I worked with regarding this topic didn't contradict the current content. If you have more specific information, I would appreciate it and take the necessary steps. Otherwise, the data comes from publicly available sources.

Monday, December 9, 2013

Sliky Quiky

Well, I've had many brilliant ideas for Quiky over the past few days. Most of them stay as TODOs so I can release the first public version as soon as possible, but among them are a few killer features I can't wait to implement.

I tried Quiky in a real-life use case yesterday. I attended yet another great event, DevFest Praha 2013 by GUG CZ, and I took some of my notes using Quiky. Right there I figured out one of the killer features, which I don't want to write about yet because I've never seen such a feature anywhere else.

I also found that I quite miss spell checking, since its absence tends to push me toward other text editors. Therefore I integrated NHunspell, but shortly afterwards I realized it would be nice to underline misspelled words. Quiky used a plain multiline TextBox, which can't underline anything. I already had a RichTextBox, but only for future WYSIWYG use. After about a second of thinking about it, I got rid of the plain TextBox and kept solely the RichTextBox.
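
Platform specifics aside, the underlying step is the same everywhere: find the character spans of misspelled words so the editor can turn them into underline ranges. A language-agnostic sketch, where a tiny word set stands in for a real Hunspell dictionary (the actual app uses NHunspell's check against .aff/.dic files):

```python
import re

# Stand-in for a real Hunspell dictionary lookup.
DICTIONARY = {"the", "quick", "brown", "fox", "jumps"}

def misspelled_spans(text: str) -> list[tuple[int, int, str]]:
    """Return (start, end, word) for every word not in the dictionary,
    ready to be mapped onto underline ranges in a rich-text control."""
    spans = []
    for m in re.finditer(r"[A-Za-z]+", text):
        if m.group().lower() not in DICTIONARY:
            spans.append((m.start(), m.end(), m.group()))
    return spans

print(misspelled_spans("the quik brown fox jmups"))
```

In the RichTextBox this boils down to applying an underline format to each returned (start, end) range, which is exactly what the plain TextBox could not do.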

I was quite excited about the “switcheroo”, because I didn't like the fact that I had two different components for the same purpose.

The web side also got a lot of attention over the last few days. I focused on security and built a few independent layers, especially for access from outside through the Quiky API (e.g., from the Quiky app).

My goal is to unify the GUI across all platforms. The web will be responsive, and it's not just for mobile usage. I still like Firefox's concept of a Sidebar, so I may use Quiky there as well.