Chronicles of Swallowbush – An exercise in user tracking

Users – we have all been one at some point in our lives. The concept is straightforward. We navigate to a site and unknowingly become part of a case study conducted by the person or group of people controlling the website we have chosen to land on. Our habits, search phrases, and personal information (I’m not talking credit cards…) are all subject to the developer’s needs and intelligence; if they choose to keep track of who I am and what I like on their site, it will be very beneficial to them. This is the first in a possible series of more technical journeys into Swallowbush, and PHP in general. I plan to open the door and show you how and why things are tracked, and what benefit you can take from that.

In a previous post, CoS: Popularity Contest, I mentioned that I tracked the majority of the users who hit my site and tallied their habits. What purpose did it serve for me to see what was touched on my server, and more importantly, what did I choose to do with that information once I had accumulated it? The short answer is that I used it to improve my site. The long answer follows.

Blogs

When it comes to blogs, they are often rants. Popularity doesn’t really drive them too much, or does it? From what I have put together, they are a conduit into the brain of the writer. It has been said that never in human history have there been so many people whose thoughts and opinions are published and available to be seen, and I think it’s important to make clear that this does not refer to an accumulation of writings over time; it refers purely to living, publicly available writers. What this means is that there is a lot of shit out there, because the average reading/writing level is still below 10th grade; in other words – popularity allows you to know what’s good and what’s not.

On Swallowbush I took a lot of time to make sure that my posts would be viable and (occasionally) necessary, and I would go out of my way to link important things that I had come across along the way. There was, however, the whole issue of what I should blog about. Topics ranged from PHP, obviously, to programming, politics, the people I interacted with during a day, or … life in general, and I was always wondering what people really wanted to read about on my site. For what it’s worth, this is my first blog that actually got me email responses, and they help a bunch.

The dilemma was pretty straightforward. I needed to see what people were looking at on my blog. Views of entries from before the tracking started were never recorded, so the data would be skewed to a degree, but from that point forward I was able to actually see what was popular with my viewers.

Side note – if you are going to undertake this, please make sure you have enough disk space for your database. The fact that you can have unlimited databases won’t matter if they take over your hard drives. Just a word to the wise: data mining is expensive, and if you don’t cover your bases you will have a website that’s full of statistics and nothing to really look at beyond that. At its peak, Swallowbush was getting about 1100 unique hits a day during my initial HL2 Development cycle. When I was idling during the summer it dropped off to around 600 a day, but that was plenty to keep the view counters rolling, to the point that a drive crashed on my server and I lost a considerable amount of the information. Go figure.

The code was simple. I created a table in my database to hold the data that was going to start accumulating:

CREATE TABLE `views` (
`View_ID` INT( 11 ) UNSIGNED ZEROFILL NOT NULL AUTO_INCREMENT PRIMARY KEY ,
`View_Timestamp` INT( 11 ) UNSIGNED NOT NULL ,
`View_Type` INT( 3 ) UNSIGNED NOT NULL ,
`View_IP` VARCHAR( 15 ) NOT NULL ,
`View_Key` INT( 11 ) UNSIGNED NOT NULL
) ENGINE = MYISAM ;
  • View_ID
    Unique integer ID for each entry. Unsigned, auto-incrementing primary key.
  • View_Timestamp
    Integer Unix timestamp from the time() or mktime() function, for dates & other bookkeeping.
  • View_Type
    I used the view type as a key to keep the various parts of my site segregated. The value was determined by the section that was being tallied, and new values could be added as needs required.
  • View_IP
    The user’s IP address. I used this to match entries against my server’s log of web requests, which gave me a map of traffic: where it came from and who sent it to me. It was also very helpful in tracking ill-intentioned people, or so-called hackers.
  • View_Key
    The index key of the element they looked at.

If a user (IP = 64.54.155.28) came to my site, went to the blog (Type index 1), and looked at post number 42, an entry would be made in the database as the page finished loading:

INSERT INTO `views`
(`View_Timestamp`, `View_Type`,  `View_IP`,  `View_Key`)
VALUES
(1169534515, 1, '64.54.155.28', 42);
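For anyone who wants to try this, here is a minimal sketch of the logging step. It is not the code that ran on Swallowbush: it uses Python's built-in sqlite3 module as a stand-in for PHP and MySQL, so the column types are SQLite approximations of the table above. The function names (create_views_table, record_view) and the VIEW_TYPE_BLOG constant are made up for illustration; only the value 1 for the blog comes from the example above.

```python
import sqlite3
import time

# Hypothetical name for the section index; only "1 = blog" comes from the post.
VIEW_TYPE_BLOG = 1


def create_views_table(conn):
    """Create a SQLite approximation of the views table described above."""
    conn.execute(
        """
        CREATE TABLE IF NOT EXISTS views (
            View_ID        INTEGER PRIMARY KEY AUTOINCREMENT,
            View_Timestamp INTEGER NOT NULL,
            View_Type      INTEGER NOT NULL,
            View_IP        TEXT    NOT NULL,
            View_Key       INTEGER NOT NULL
        )
        """
    )


def record_view(conn, view_type, ip, key, timestamp=None):
    """Insert one row per page view.

    The values are passed as parameters, so the IP string never becomes
    part of the SQL text itself.
    """
    if timestamp is None:
        timestamp = int(time.time())  # Unix timestamp in seconds
    conn.execute(
        "INSERT INTO views (View_Timestamp, View_Type, View_IP, View_Key) "
        "VALUES (?, ?, ?, ?)",
        (timestamp, view_type, ip, key),
    )
    conn.commit()


# Replay the example from the post against an in-memory database.
conn = sqlite3.connect(":memory:")
create_views_table(conn)
record_view(conn, VIEW_TYPE_BLOG, "64.54.155.28", 42, timestamp=1169534515)
print(conn.execute("SELECT * FROM views").fetchall())
```

In a real page handler you would leave the timestamp argument off and call record_view once per request, with the visitor's address taken from whatever your web framework exposes.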

As you can already see, this is very impersonal. I cannot track you back to your house and hack into your computer to watch your baby sleep or anything crazy like that, not that I would if I could. This information is a mixture of your IP address and my site data, and all of it is freely available to anyone who is developing web applications anyway. After the first month of this code running on every page view, the table held hundreds of thousands of rows and was tracking users from every part of the globe. And just to drive the point home, I could be doing the same thing here.

INSERT INTO `views`
(`View_Timestamp`, `View_Type`,  `View_IP`,  `View_Key`)
VALUES
(1193866863, 1, '216.31.219.19', 62);

Putting value to the data

Accumulating the values is the hard part. Once you have the data you can pass it through whatever reporting software you wish and find any and all patterns, including but not limited to:

  • Daily/Monthly Travels
    Take the current timestamp and subtract 24 hours (or 24 * 30 hours for a month) from it; since the timestamps are in seconds, that is 24 * 3600 (or 24 * 30 * 3600). Grab every entry with a timestamp after that number and group by IP (remember to count the entries per group and put the count in its own column, or the information may be skewed).
  • Active/Inactive Users
    Run an SQL query that groups by IP address and counts the entries. Sorted by count, ascending results give you the least active users; descending results give you the most active.
  • Digg/Slashdot Effects
    Write a script to track the hits to your site over the period of an hour. Establish this as your baseline, and if anything happens on the site that triggers a spike, have a throttling or shutdown script ready to avoid getting your bandwidth siphoned.
  • Referrer Statistics
    Every time someone clicks a link on another site and comes to yours they will bring a tiny breadcrumb of where they came from. You can track this information to keep track of influential referrers and reward them if you see fit.
  • Yearly/All-time Great Entries
    One of the most important things I did. One of my poems, which I may publish here when I feel it’s safe to do so, was extremely popular. If it weren’t for writing this script I would never have known. It works identically to the daily/monthly travels, but you are looking at a much larger portion of data.
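A few of the reports in the list above can be sketched as plain queries against the views table. Again, this is only an illustration under assumptions, not the code I ran: it uses Python's sqlite3 module in place of PHP and MySQL, the function names (views_per_ip, top_entries, is_spike) are invented, the sample rows and the 10.0.0.7 address are made up, and the spike test is just "more than some multiple of a baseline", with the multiple picked arbitrarily.

```python
import sqlite3

HOUR = 60 * 60
DAY = 24 * HOUR  # timestamps are in seconds, so one day is 86,400


def views_per_ip(conn, since):
    """Daily/monthly travels: views per IP since a Unix timestamp, busiest first."""
    return conn.execute(
        "SELECT View_IP, COUNT(*) AS hits FROM views "
        "WHERE View_Timestamp >= ? "
        "GROUP BY View_IP ORDER BY hits DESC, View_IP",
        (since,),
    ).fetchall()


def top_entries(conn, view_type, since=0, limit=10):
    """All-time (or windowed) greats: most viewed items within one section."""
    return conn.execute(
        "SELECT View_Key, COUNT(*) AS hits FROM views "
        "WHERE View_Type = ? AND View_Timestamp >= ? "
        "GROUP BY View_Key ORDER BY hits DESC, View_Key LIMIT ?",
        (view_type, since, limit),
    ).fetchall()


def is_spike(conn, now, baseline_per_hour, factor=5):
    """Digg/Slashdot check: did the last hour exceed baseline * factor?"""
    (hits,) = conn.execute(
        "SELECT COUNT(*) FROM views WHERE View_Timestamp >= ?",
        (now - HOUR,),
    ).fetchone()
    return hits > baseline_per_hour * factor


# Made-up sample data, using the timestamp from the earlier example as "now".
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE views (View_ID INTEGER PRIMARY KEY, View_Timestamp INTEGER, "
    "View_Type INTEGER, View_IP TEXT, View_Key INTEGER)"
)
now = 1169534515
conn.executemany(
    "INSERT INTO views (View_Timestamp, View_Type, View_IP, View_Key) "
    "VALUES (?, ?, ?, ?)",
    [
        (now - 60, 1, "64.54.155.28", 42),
        (now - 120, 1, "64.54.155.28", 42),
        (now - 300, 1, "10.0.0.7", 42),
        (now - 2 * DAY, 1, "10.0.0.7", 7),
    ],
)

print(views_per_ip(conn, now - DAY))   # the last 24 hours
print(top_entries(conn, view_type=1))  # all-time, blog section
print(is_spike(conn, now, baseline_per_hour=1, factor=2))
```

Swapping now - DAY for now - 30 * DAY gives the monthly version, and flipping DESC to ASC in views_per_ip lists the least active visitors first.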

It didn’t take me long to find that the data showed a clear need for me to keep writing HL2 tutorials, because my forums were getting about fifteen times the hits of the rest of my site. I also applied the techniques above to many other sections of my site, and the data they supplied led me to add the gallery and eventually the filters for viewed items.

Above all else it is extremely important to keep in mind that there are a number of pieces of information that are known about you when you traverse the web. This shouldn’t scare you in the slightest; hell, if you are just now finding out you are probably new to the web and shouldn’t be building apps in it.

Note: The information above is just there for shock value. It is not being logged by this script; Drupal handles all of that for me. The information is not static and will not be seen by anyone other than you on your computer.