For the last day or so I’ve been playing with moving my simple word-count analysis of blogs over to actually creating a database of manually ranked training data and extrapolating from that. There were some hiccups and I’ve still got to go back and replace a lot of code, but it’s effectively categorizing new blog entries based on previous rankings. YAY! I’ve been using the perl Algorithm::NaiveBayes CPAN module, and it’s pretty great, although the documentation is really poor. The main thing to get is that it returns a hash reference, which means you end up referring to your result as something like:
<pre>print "Sports ranking: \t ${$result}{sports} \n";</pre>
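For anyone following along at home, the basic train/predict cycle looks roughly like this. A minimal sketch: word_counts and the training snippets are made-up placeholders, but add_instance, train, and predict are the module’s actual methods, and predict really does hand back that hash reference of label => score.

<pre>
use strict;
use warnings;
use Algorithm::NaiveBayes;

# Turn raw text into a word-count attribute hash -- the same
# word-count analysis as before, now feeding the classifier.
# (word_counts is a hypothetical helper, not part of the module.)
sub word_counts {
    my ($text) = @_;
    my %counts;
    $counts{ lc $_ }++ for $text =~ /(\w+)/g;
    return \%counts;
}

my $nb = Algorithm::NaiveBayes->new;

# Manually ranked training data (labels and text are just examples).
$nb->add_instance( attributes => word_counts('the home team won the game'),
                   label      => 'sports' );
$nb->add_instance( attributes => word_counts('what a great, happy day'),
                   label      => 'positive' );
$nb->train;

# predict() returns a hash reference of label => score.
my $result = $nb->predict( attributes => word_counts('happy about the team') );
print "Sports ranking: \t ${$result}{sports} \n";
</pre>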
Which, let’s face it, is kinda ugly (the arrow form $result->{sports} interpolates in strings too and is a little easier on the eyes), but it should really be better documented. Aside from that, as long as you get the back-end math, you’re pretty OK. Just because you’re doing AI stuff doesn’t mean you’re automatically familiar with how perl handles references to hashes, though. One of my to-do items is to go back and update the perldoc on it. Anyway, with that in effect I’ve gone and updated the database and I’m now able to get positivity over time (there’s a sketch of that query after the list below). This means I’m actually getting closer to building an internet happiness index, and predicting how “happy” the internet is as a whole. The next steps are:
- incorporate new bayes functions into existing codebase
- add more sources to the rss feedlist scripts
- optimize the blogparser
- put a nice (fusioncharts?) front end together
- add more hardware for doing the categorization
- get more people to do more training data
- ???
- profit!
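And since I mentioned positivity over time: the query behind it is nothing fancy. It boils down to something like this, simplified here with made-up table and column names rather than my actual schema:

<pre>
use strict;
use warnings;
use DBI;

# Hypothetical schema: entries(posted_on DATE, positive_score REAL),
# where positive_score is the classifier's 'positive' output per entry.
my $dbh = DBI->connect('dbi:SQLite:dbname=blogs.db', '', '',
                       { RaiseError => 1 });

# Average the score per day to get the happiness-index time series.
my $rows = $dbh->selectall_arrayref(q{
    SELECT posted_on, AVG(positive_score) AS happiness
      FROM entries
  GROUP BY posted_on
  ORDER BY posted_on
});

printf "%s\t%.3f\n", @$_ for @$rows;
</pre>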
Actually, the “???” is pretty well defined, but honestly, this project will have been fun even if it doesn’t make a dime. Anyway, one step closer.