Wednesday, July 24, 2013

Creeper Crawler

As I wrote earlier, I was confident I can collect data for the Particle Database by myself, but shortly I realized it’s just too much. So after a struggle I decided to make a web crawler for harvesting semantic data.

The decision of target platform was quick – PHP is not suitable, so .NET. At first I tried to customize some of those existing open source crawlers, but it was a pain. So ultimately I wrote my own. I don’t need any GUI, so simple console app was suitable.

It downloads plain text data (e.g. HTML page), using HttpWebRequest class, and extracts all links to other pages. For every use I can define a set of boundaries, so links out of these boundaries are thrown away. The rest are put into queue. Robots.txt and meta tags are a part of those boundaries as well.

Then crawler prepares data for parsing – strips off all unimportant parts and replaces or deletes some parts (like whitespaces). Then it checks for presence of defined string, indicating there are some data to extract. And finally, using regular expressions, structured data are extracted and send to server-side for further processing.


On server side is PHP script, which post-processes and saves data into database, finds relevant links to another entities and stores URL of data source to avoid duplicates.

Then crawler politely waits a while before loading next URL from queue, because I don’t want to overload these servers. And... that’s it! Simple, yet powerful.

Wednesday, July 17, 2013

MobiletriX

There are two main approaches for mobile version of a website or a web application (well, three, if you consider “none” as an option): responsive webdesign and separate mobile version.

I decided to do both. I really like the concept of responsive design, but I hate when I have to download something I won’t use (graphics, HTML elements etc.).

Mobile version should be lightweight and may be little uglier by not having all the fancy bells and whistles, which on mobile devices just consumes bandwidth and processor time anyway.

Mobile users also usually want just to hop in, do anything quickly and hop out, especially when the mobile version isn’t the main way of using your site/app. Therefore the original use cases might be little crooked and it’s ok. Developer should not throw logs under user’s legs, but stick with him instead.

Users mostly expect similar, but slightly different behavior between desktop, web and mobile app. Each of them is used in different situations, where user would appreciate different things. It’s even determined by time of the day, but it would get too far :)

It’s also a bad habit to emulate his device’s interface a you can never satisfy all of them. And believe me, iPhone UI on Android phone looks extremely stupid.

Wednesday, July 10, 2013

Reengineering AGAIN?

And again :) I don’t have to worry about anything right now, as everything is doing just fine with the previous version and nobody rely on the new one.

I wanted to add more functionality to the code and realized I pasted a lot of code in the core without further investigation what it really does and how good it is in sake of performance. And because I did some thinking about the object model, I decided to start over with no old code.

I know it will take a lot of time, but from my point it’s time well wasted. Reengineering is kinda my way of “watching TV” – just relaxation without much thinking – I did all the thinking before, the thinking brought me to the point I decided to make it better.

This time I’ll introduce:

  • Simple module model, along with the current model, which proved itself right.
  • Improved class inherition schema.
  • Better way of module loading.
  • Better way of using Renderers (former Templates) in Components.
  • Improved DataStores.
  • Better way of using DataStores in Components.

I was very cautious about using different renderers in the past, because I knew it might cause a problem. This time I’ll start using multiple renderers right away to make it work immediately without struggling in the future.

Wednesday, July 3, 2013

Working with foreign tables

I consider this blog as an inspiration for others, not just as a promotion of QetriX ;-) so let me write a post about one of my former ideas for 3rd generation of my CMS (year 2007).

I was on a quest for automated database layer, so I came up with automated table joining without usage of “advanced” database features (designed for MyISAM engine).

From my former job I adopted a system of three-letter prefix for column names according to table name. The motivation for this was to have unique column names throughout the whole database schema. If I have table “users” (I use plural for table names), the prefix would be “usr”, so primary key would be “usr_id”. For PK in table “companies” it would be “com_id” and so on. If the table had similar name and the prefix was already taken, I’d use the next letter in order, so for “computers” along with “companies” it would be “com” and “cop” respectively.

For foreign keys I used even more specific approach, I used both table and prefix names from foreign table in the column name (plus “fk” suffix), so if user had relation to company, the column name in table “users” would be: “usr_companies_com_fk”.

Thanks to this I was able to specify $db->getDataFrom("users") and it would return data from all joined tables as well. For this purpose I had cached database schema (like “information_schema.tables” in MySQL).

And that’s it :)

Fun fact: The system was so automated, that when you deleted a row in one table, all referenced rows in all foreign tables has been deleted as well (incl. domains, users, locations etc). facepalm