Creating Tag Clouds? The nuts and bolts...


JoeyDaly
07-24-2008, 07:27 AM
Okay, Tag Clouds - How do we go from an article to a tag cloud?

Here's a great Tag Cloud generator... but I want to pull it apart and know how it works. Anyone got any links or tutorials on how this works?

http://www.tag-cloud.de/

Fill in any website address, put something like 6 as the word count, set the size to something like 100x100, and click the link provided at the bottom to receive the tag cloud.

Horus_Kol
07-24-2008, 08:39 AM
that site seems to be a little broken - i'm clicking on the "recently created" links and nothing is happening...

now - your basic tag cloud has the tags as separate keywords to the content of an article (or whatever your unit item is) - some sites let the public add tags - others are less free in their input...

i will be describing a skeleton system - i am very tired - results may vary, and any code is under a negative warranty

now, for this basic model, all you need is a text input, and specify that tags are separated by a specific character (comma is a good one - this allows you to have two-word tags)...

then, in your backend, you explode/split the tag string, and now you have an array of tags for that article.
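The split step could look roughly like this in JavaScript. It's only a sketch: the function name parseTags is made up, and lowercasing the tags and dropping duplicates are assumptions about how you'd want to clean them, not something the setup above requires.

```javascript
// Hypothetical helper: turn "php, tag clouds, ,PHP" into a clean list of tags.
function parseTags(input, separator = ',') {
  const seen = new Set();
  const tags = [];
  for (const piece of input.split(separator)) {
    // Trim, collapse inner whitespace, and lowercase so "PHP" and "php" match.
    const tag = piece.trim().replace(/\s+/g, ' ').toLowerCase();
    // Skip empty pieces (e.g. from ", ,") and tags we have already kept.
    if (tag !== '' && !seen.has(tag)) {
      seen.add(tag);
      tags.push(tag);
    }
  }
  return tags;
}
```

Splitting on the comma rather than on whitespace is what keeps two-word tags like "tag clouds" in one piece.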

the database is really where all the action is at...
near as I can tell, the best thing is to have a table where the primary key is the article ID combined with the tag

easily done in MySQL:


CREATE TABLE tag (
    article_id BIGINT UNSIGNED NOT NULL,
    tag VARCHAR(255) NOT NULL,

    PRIMARY KEY (article_id, tag)
);

note: i know that MyISAM allows FULLTEXT indexes, but I'm not sure how that works for primary keys, so I specified VARCHAR with the maximum allocation


so, that array of tags for an article, just run through it with a REPLACE query for each tag:

REPLACE INTO tag SET article_id=$article_id, tag='$tag';

after sufficient processing and cleaning of input, obviously

MySQL REPLACE (http://dev.mysql.com/doc/refman/5.0/en/replace.html) is great :D


Want the tags for a specific article?

SELECT tag FROM tag WHERE article_id=$article_id


Want a cloud?

SELECT tag, COUNT(*) AS uses FROM tag GROUP BY tag;

gets you a list of all tags and how often they're used - then some maths and manipulation of relative font-sizes, and you have a nice looking tag cloud...
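One way the "maths" part might go, as a sketch in JavaScript: scale each tag's font size linearly between a smallest and a largest size. The function name cloudSizes, the row shape (a tag plus a count, here called uses) and the 80%-200% range are all assumptions for illustration.

```javascript
// rows: [{ tag: 'php', uses: 12 }, ...], one entry per tag with its usage count.
// minSize and maxSize are font sizes in percent; both are arbitrary choices.
function cloudSizes(rows, minSize = 80, maxSize = 200) {
  const counts = rows.map((row) => row.uses);
  const lowest = Math.min(...counts);
  const highest = Math.max(...counts);
  const spread = highest - lowest;
  return rows.map((row) => ({
    tag: row.tag,
    // If every tag is used equally often, fall back to the smallest size.
    size: spread === 0
      ? minSize
      : Math.round(minSize + ((row.uses - lowest) / spread) * (maxSize - minSize)),
  }));
}
```

Each size can then go straight into a style attribute, e.g. font-size: 140%. If a handful of very popular tags dwarf everything else, scaling on the logarithm of the count instead is a common alternative.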

changing, deleting, and adding are all just as easy with this db setup, really


possible enhancements
well, I haven't really covered presentation...

auto-tagging (from your content) - now that would be complex

tag hints - start tagging, and then suggest related tags that could also be used

JoeyDaly
07-24-2008, 07:43 PM
Thanks Horus,

I was reading and reading and reading... got excited about the SQL... BUT! You finally tickled my funny bone!

auto-tagging (from your content) - now that would be complex

YES! This is what I want... I already have a nice CMS in place, I don't want to go ripping apart code and adding stuff into it... since the whole bloody thing is dynamic and would be a headache to debug. I want to be able to call a function and pass through all the content... and then spit out a cloud tag. That would be nice.

I'll do some experimenting at lunch with your example and see what ideas I can conjure.

Horus_Kol
07-24-2008, 07:47 PM
well, i've been thinking about the auto-tags in some backwater of my mind - but I haven't formed anything clear out of it yet...

i'll see what I come up with over the day, and post back later

JoeyDaly
07-24-2008, 10:35 PM
I'm just going to brainstorm...

We really need something that... records a string temporarily.

$myString = "The big brown fox jumped over the fence, but the fox died on the metal gate";

Then we need to grab every word and store it with the number of times it's been used, so maybe a two-dimensional array of words and counts.

1 [ 'The' ] [ 4 ]
2 [ 'big' ] [ 1 ]
3 [ 'brown' ] [ 1 ]
4 [ 'fox' ] [ 2 ]

etc.

Loop through the string, and add to the array - depending if it exists or not in the array already.
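A minimal sketch of that loop in JavaScript, using a Map from word to count instead of a two-dimensional array. The name countWords and the rule for what counts as a word (letters, digits and apostrophes, case ignored) are assumptions.

```javascript
function countWords(text) {
  const counts = new Map();
  // Lowercase first so "The" and "the" are counted together.
  const words = text.toLowerCase().match(/[a-z0-9']+/g) || [];
  for (const word of words) {
    // Add the word if it is new, otherwise bump its existing count.
    counts.set(word, (counts.get(word) || 0) + 1);
  }
  return counts;
}
```

For the fox sentence this gives 4 for "the" and 2 for "fox". Everything stays in memory, and a 1,000 word article is tiny for this kind of counting, so no file or database is needed just for the tally.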

From that we can then create our tag cloud... but. Will it work? Or is there a better way? Storing in a file? Database? How will this go if we had a massive 1,000 word article... Do we want to store multiple words... Like you said above using comma delimiters.

Horus_Kol
07-25-2008, 12:39 AM
well, we certainly want a stop word list for common words like "the", "and", etc...

but, how to pick keywords from the thing?

Maybe only select the most common words from text (after the stop).
Also, give extra weight to any words used in the title.
extra weight based on which paragraph?
subsections and subtitles?
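A rough sketch of the first two ideas (stop words, extra weight for title words) in JavaScript. The stop word list here is deliberately tiny, and the weight of 3 per title word is a pure guess; the names suggestTags and tokenize are made up.

```javascript
// A deliberately tiny stop word list; a real one would be much longer.
const STOP_WORDS = new Set(['the', 'and', 'a', 'an', 'of', 'on', 'but', 'over', 'to', 'in']);

// Lowercase, split into words, and drop anything on the stop list.
function tokenize(text) {
  return (text.toLowerCase().match(/[a-z0-9']+/g) || [])
    .filter((word) => !STOP_WORDS.has(word));
}

// titleWeight is an arbitrary guess: each title word counts as three body words.
function suggestTags(title, body, limit = 10, titleWeight = 3) {
  const scores = new Map();
  for (const word of tokenize(body)) {
    scores.set(word, (scores.get(word) || 0) + 1);
  }
  for (const word of tokenize(title)) {
    scores.set(word, (scores.get(word) || 0) + titleWeight);
  }
  // Highest score first; ties are broken alphabetically to keep the order stable.
  return [...scores.entries()]
    .sort((a, b) => b[1] - a[1] || a[0].localeCompare(b[0]))
    .slice(0, limit)
    .map(([word]) => word);
}
```

Weighting by paragraph or subtitle would follow the same pattern: tokenize that part separately and add a different weight.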

I'd say the storage would be the same with the linked table I created above, though.

¥åßßå
07-25-2008, 03:45 AM
Something you may find interesting ;) wordle (http://wordle.net/create)

¥

Horus_Kol
07-25-2008, 08:02 PM
hmmm... interesting... but it highlights what I see as a deficiency with the automated tagging system...
there's a bunch of words that I just wouldn't normally tag (get, set, one) - and it doesn't do multiple words (such as "back injury" which would be more useful than just "back" since that word has multiple meanings) - and sometimes, a salient tag just isn't in the actual content (example, an item on back injury might not mention the word "health" anywhere, but it would be a useful tag to link it to similar articles).
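For the two-word case, one possible approach (a sketch only, not what wordle does) is to count adjacent word pairs and skip any pair containing a stop word. The name countPhrases and the short stop list are assumptions, and this does nothing for tags like "health" that never appear in the text.

```javascript
const STOP = new Set(['the', 'a', 'an', 'and', 'of', 'to', 'in', 'on', 'my', 'is']);

function countPhrases(text) {
  const counts = new Map();
  // Split on punctuation first so a pair never spans two clauses.
  for (const clause of text.toLowerCase().split(/[.,;:!?]+/)) {
    const words = clause.match(/[a-z0-9']+/g) || [];
    for (let i = 0; i + 1 < words.length; i++) {
      // Ignore pairs like "my back" or "injury is".
      if (STOP.has(words[i]) || STOP.has(words[i + 1])) continue;
      const phrase = words[i] + ' ' + words[i + 1];
      counts.set(phrase, (counts.get(phrase) || 0) + 1);
    }
  }
  return counts;
}
```

Pairs that show up more than once are candidates for two-word tags; a person would still need to pick the sensible ones.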

¥åßßå
07-26-2008, 02:10 AM
There is mention on the FAQ page about using a tilde ( ~ ) to make multiple words/phrases ... obviously that's no good for producing tags for real pages ... it is just for making up "pretty pictures" of your text though, and not really a tag generator as such.

The blog software that I use has a tagging system, so I whipped together a tag cloud plugin ... ooops, nope, someone else whipped up the tag cloud plugin, I whipped up a search cloud plugin :p ... but the idea's the same.

I got fed up of trying to remember which tags I'd used previously, so I coded an auto-suggest tag plugin which works pretty well ... when I get the chance I really need to add it to the core.

I agree that any automated tagging system is going to be flawed, about the best you can really get is a system that can tell you what keywords you've used on a page, and even that will have the same "single words only" limitations.

¥

JoeyDaly
07-26-2008, 04:02 AM
Great stuff above.

I like the tag suggestion - I don't mind having to write up an article and then run a script that analyzes the content and suggests, let's say, 10 tags that the user can correct.

In regards to double and triple words? Can it be done automatically?

¥åßßå
07-26-2008, 04:15 AM
The plugin I coded works off previously used tags, it doesn't analyse your content in any way. If you want to have a play with it you can download it here ( AM Auto Tags V 1.0 (http://waffleson.co.uk/pastel-palace/call_plugin.php?plugin_ID=98&method=download&am_plug=am_autotags) ).

It uses javascript to ask the server for all tags beginning with whatever letter you've just typed ( it doesn't use ajax though, so it can make cross-domain requests ) and then "suggests" tags based on what you're typing in the tags field and what's been used in the past. Double/multi word tags are no problem.

Obviously the majority of the code/input elements and server side stuff etc are specific to our blog software, but it should be a doddle to convert to any other platform.

¥

Horus_Kol
07-26-2008, 10:10 PM
okay, this is gonna come off as naive - but you can have JS/Server communication without using XHR?

¥åßßå
07-27-2008, 02:55 AM
Yep ;)

Basically it works by appending a script tag to the <body> to make a request. The server then processes the request and replies with javascript.

Advantages :
Request goes over http, which all browsers understand
No need to do browser detection to work out which XMLHTTP connection you need to make.
Cross domain communication is no problem ( useful when you're running a multi-(sub)domain / blog system where the admin side can be on a different (sub)domain from the front end ;) )
The request is still asynchronous and you can detect success/failure

Disadvantages :
The reply must be javascript
You have to be aware of any security holes you may open
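A bare-bones sketch of the idea in JavaScript, not the plugin's actual code: the endpoint URL, the "start" parameter and the receiveTags callback are all invented for illustration. The document is passed in as an argument only so the function can be tried outside a browser.

```javascript
// Hypothetical callback that the server's reply is expected to call, e.g. the
// reply body could be:  receiveTags(["tag_4", "tag_5"]);
const knownTags = [];
function receiveTags(tags) {
  for (const tag of tags) knownTags.push(tag);
}

// Build the request URL; the endpoint and the "start" parameter are made up.
function buildTagUrl(endpoint, letters) {
  return endpoint + '?start=' + encodeURIComponent(letters);
}

// Append a <script> element so the browser fetches the URL and runs the reply.
// doc is normally window.document.
function requestTags(doc, endpoint, letters, onError) {
  const script = doc.createElement('script');
  script.src = buildTagUrl(endpoint, letters);
  // Fires if the request fails, which is how failure gets detected.
  script.onerror = () => onError(letters);
  doc.body.appendChild(script);
  return script;
}
```

In a page this would be called as requestTags(document, url, 't', handler). Whatever the server sends back runs with the full privileges of the page, so it should only ever point at a server you control.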

¥

JoeyDaly
07-27-2008, 04:20 AM
That may be an issue.. Security.

Last month we had a fair few security breaches, none that actually got through since we have plenty of Client Side Protection and a lot of Server Side Protection... Since we are now expanding our client base for our CMS software - with new features & technologies like this auto tagging - we need to ensure it's 110% secure.

I'll download the script at work and have a look, see if I can get a lightbulb in my head.

Horus_Kol
07-27-2008, 08:47 PM
Basically it works by appending a script tag to the <body> to make a request. The server then processes the request and replies with javascript.
Still fuzzy on the mechanics here...

could you post a simple example (probably in a new thread?)

¥åßßå
07-28-2008, 05:21 AM
Still fuzzy on the mechanics here...

could you post a simple example (probably in a new thread?)

If you can wait a tad I'll do a write up on our blogs with a working example and I'll create a new thread here with pretty much the same content just without the example ( pretty sure they won't let me include a 3rd party javascript file on the forums ;) ). I'll close the comments on the blog post to restrict all feedback/questions etc to the forum thread so it's all in one place and not scattered all over the web.

¥

Horus_Kol
07-28-2008, 07:46 AM
sounds good...

speaking of your blog - i don't seem to get RSS from innervisions anymore - what's going on?

¥åßßå
07-28-2008, 06:03 PM
speaking of your blog - i don't seem to get RSS from innervisions anymore - what's going on?


Apart from being concerned about the sanity of anyone who actually follows my feeds .... what feed url are you using and when did it stop working?

¥

Horus_Kol
07-28-2008, 07:26 PM
i'll have a look for the URL when I get home... but I do remember resetting it at least once, and it's been about 4-6 weeks...
oddly enough, I got a feed error from my reader just last night after I'd posted on here

¥åßßå
07-28-2008, 07:31 PM
I quite often break everything on my blog as I have a bad habit of using it as a live testbed, so it wouldn't surprise me if I've broken them ..... ahhh well, it's my corner of the web and I can break it if I want to ;)

I'm still worried about your mental health :|

¥

Horus_Kol
07-28-2008, 07:51 PM
hehe - I do the same with my own site... it's the prototype for a lot of my work :)

what's to worry about my mental state - being unbalanced isn't really a problem, these days...

¥åßßå
07-28-2008, 08:00 PM
"what's that skippy? there's a madman loose on the internet? ......... what's the internet skippy?"

Yer in the right country at least :rolleyes:

¥

JoeyDaly
07-29-2008, 11:28 PM
Apart from being concerned about the sanity of anyone who actually follows my feeds .... what feed url are you using and when did it stop working?

¥

Lmao!

Looking forward to reading the post Yabba, let me know when it's up.

¥åßßå
07-30-2008, 05:40 AM
It's the post below this one in the forums ;) ( AJAX without the AX (http://www.htmlforums.com/web-20-websites-technology-discussion/t-ajax-without-the-ax-105934.html) )

¥

JoeyDaly
07-30-2008, 11:56 PM
Great article!

But now back on topic, we want to automate or auto suggest tagging to just make life easier...

I don't get what you mean by it asking the server for all the tags. I don't think I read far enough into your code to understand all that, which btw is brilliant :) - If only it could be re-used as a standalone script.

¥åßßå
07-31-2008, 05:21 AM
Huge reply alert ... probably several "whore a blog" alerts as well :rolleyes:

Ok, let's start at the beginning with a typical post work flow ( for our blog software ) :

First you hit the write screen ... kinda an obvious first step .... and slap in a really cool title ( laden with keywords if you play the seo game ) for your post, choose a couple of categories and then you type your expertly crafted article that's going to blow the minds of the millions of adoring fans that follow your every word with bated breath and a star-struck look in their glazing eyes.... or, if it's my blog ... for the psychotic few that can't tell genius from blonde :D

And then you stare at the tags field for a while, wondering 2 things
1) How can I sum up such genius into tags?
2) What tags have I used previously ( kinda important, because we also have a related posts plugin that uses tags to function [ Playing with relations (http://waffleson.co.uk/2007/11/playing-with-relations) ] )

That's where this plugin comes in. As you start typing tags it scurries off to the server and prods it to come up with a list of tags that have been used in previous posts that start with the same letter. Once the server stops sulking at having to do some work it spits back a list of each tag and wanders off muttering nasty things under its breath .... it's never really forgiven me for the time I accidentally replaced the PHP process with a BASH script and promptly fried its brains with load levels in excess of 600 :rolleyes:

When the script gets the reply it adds all the answers ( if any ) to the tags array. We chose to use a huge memory intensive array to make it quicker for the code to "predict" tags as you typed, rather than a smaller array which would be kinder to your PC's memory but would take slightly longer to search .... it's probably easier to show you by example ... free-typed, so don't complain if they fry your cpu when testing ;)

Small array :
var tags = new Array('a_tag', 'another_tag', 'another_tag_2', 'tag_4', 'tag_5');
var current_tag = 'ta'; // this is the tag that's being "typed"

var suggest_tags = new Array();
for( var i = 0; i < tags.length; i++ )
{
    if( tags[i].substr( 0, current_tag.length ) == current_tag )
    { // starts with same letters as current tag
        suggest_tags[ suggest_tags.length ] = tags[i];
    }
}

suggest_tags is now an array of potential tags. Now let's look at the memory intensive version :

var tags = {}; // prefix => array of tags that start with that prefix
tags['a'] = new Array( 'a_tag', 'another_tag', 'another_tag_2' );
tags['t'] = new Array( 'tag_4', 'tag_5' );

tags['a_'] = new Array( 'a_tag' );
tags['a_t' ] = new Array( 'a_tag' );
tags['a_ta' ] = new Array( 'a_tag' );

tags['an' ] = new Array( 'another_tag', 'another_tag_2' );
tags['ano' ] = new Array( 'another_tag', 'another_tag_2' );
tags['anot' ] = new Array( 'another_tag', 'another_tag_2' );
tags['anoth' ] = new Array( 'another_tag', 'another_tag_2' );
// etc etc etc, until all tags are indexed like the above

var current_tag = 'ta'; // this is the tag that's being "typed"

var suggest_tags = ( typeof( tags[current_tag] ) == 'undefined' ? Array( current_tag ) : tags[current_tag] );

As you can see, the second example is a lot more memory intensive, but getting suggested tags is a one liner ;)
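Nobody would type that index out by hand, so here is a sketch of how it could be generated from a flat list of tags. The names buildPrefixIndex and suggest are made up, and this is not how the plugin itself builds its array.

```javascript
function buildPrefixIndex(tags) {
  const index = new Map();
  for (const tag of tags) {
    // Register the tag under every prefix: "t", "ta", "tag", ...
    for (let length = 1; length <= tag.length; length++) {
      const prefix = tag.slice(0, length);
      if (!index.has(prefix)) index.set(prefix, []);
      index.get(prefix).push(tag);
    }
  }
  return index;
}

function suggest(index, typed) {
  // Same fallback as the one liner: with no match, suggest what was typed.
  return index.get(typed) || [typed];
}
```

The cost is one entry per prefix of every tag, which is where the extra memory goes; the lookup itself is a single get().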

Your code can now take the suggested tags ( whichever way you decide to code the arrays ) and display a list to the user, our plugin also "auto-completes" the first suggested tags so the user can just press the right-arrow and move on with their tags.

Once you've picked all your tags, and ticked a few other boxes, you hit submit ... and the server, begrudgingly, slaps your new post into the database, creates entries for any unknown tags and then links all post tags to the new post in the "item_tags" table, ready for when the droves flock to read your latest masterpiece.

If only it could be re-used as a standalone script.
It's actually quite easy to convert this into a standalone script with very few changes, although you will need to recode the database bit to match your own tables/methods as the plugin, obviously, uses the DB class that's already available and tag insertion etc are handled by the core.

First off let's meander through the javascript file because that's where most of the changes are required

/* line 6 : change item_tags to the name of your input box for tags*/
var amTagsField = document.getElementById( "item_tags" );
/* add the following line and change separator to suit */
var amTagsSeparator = ',';

/* line 22 : itemform_tags needs to be the id of the parent container for your item_tags <input>*/
document.getElementById( 'itemform_tags' ).appendChild( ourBox );


/* line 64 : htsrv_url needs to be the url to your php page that will return the tags */
var script = amTagsCreateElement( 'script', 'type="text/javascript" src="' + htsrv_url +'&am_tags_start='+amTagsLetter + '"' );


You can actually take most of the code in the php file and throw it out the window, it's merely there to inform our blog software that it's a plugin and can do some stuff with events. The only bit you'll want is this section, and even that's going to need a re-write to suit your setup :

if( $startLetter = param( 'am_tags_start', 'string' ) )
{ // we have a start letter
    if( $startLetter = substr( preg_replace( '~[^a-z]~i', '', $startLetter ), 0, 1 ) )
    { // single letter only ;)
        global $DB;
        $sql = 'SELECT tag_name from T_items__tag WHERE tag_name LIKE "'.$DB->escape( $startLetter ).'%" ORDER BY tag_name ASC';
        if( $tagList = $DB->get_results( $sql, ARRAY_A ) )
        { // we have matching tags
            foreach( $tagList as $aTag )
            { // json_encode() quotes and escapes the tag so a stray quote can't break the javascript
                echo 'amTagsAddTag( '.json_encode( $aTag['tag_name'] ).' );'."\n";
            }
        }
        echo 'amTagsFetchTags(0);'."\n";
        exit;
    }
}

All this does is find out which start letter you're interested in and then it hits the database and asks for all currently known tags that begin with that letter. Then it simply wanders through the list and adds a javascript call for each item ( note to self : forgot the javascript headers :p ). It wouldn't be hard to recode that to use your own database.

As you can see from the code it doesn't do any analysis of your post to find/suggest tags, it just works off ones used in previous posts to make life easier when trying to tag up your pearls of wisdom ;)

¥