More correlation madness


I’m having a blast playing with the Pearson correlation calculation. This time I’ve written an app that pulls the top stories off of Reddit. The app downloads those stories, does a quick word count on each one, and stores the results in MongoDB (which is really simple and handy in this case). I think I need to do some more tuning… I probably need to drop the most and least used words from the comparisons… but here is a preliminary correlation sample.


First impressions? It seems to be reasonably accurate. It matches the text I gave it (the previous post from this blog) more closely with other articles than with pages like GitHub, and nothing is really a close match.
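For anyone who wants to try the same comparison, here is a minimal sketch of a Pearson correlation over word counts. It is not the app's actual code: the function names `word_counts` and `pearson` are made up for illustration, the tokenizer is a deliberately crude regex, and it assumes that a word missing from one document simply counts as zero there.

```python
import math
import re
from collections import Counter


def word_counts(text):
    """Count lowercase words in a block of text (crude regex tokenizer)."""
    return Counter(re.findall(r"[a-z']+", text.lower()))


def pearson(counts_a, counts_b):
    """Pearson correlation between two word-count mappings.

    Assumption: the comparison runs over the union of both vocabularies,
    and a word absent from one document has a count of 0 in it.
    """
    vocab = sorted(set(counts_a) | set(counts_b))
    n = len(vocab)
    if n == 0:
        return 0.0

    xs = [counts_a.get(word, 0) for word in vocab]
    ys = [counts_b.get(word, 0) for word in vocab]

    mean_x = sum(xs) / n
    mean_y = sum(ys) / n

    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)

    # If every word has the same count in a document, the correlation
    # is undefined; this sketch reports 0.0 in that case.
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)


if __name__ == "__main__":
    post = word_counts("the cat sat on the mat")
    article = word_counts("the dog sat on the log")
    print(pearson(post, article))
```

A score near 1 means the two texts use words in similar proportions, a score near 0 means there is little relationship, and a negative score means words common in one text tend to be rare in the other.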

Like I said, I think I need to do a bunch more tuning…

First off, it would be nice if the scraper I’ve built were truly grabbing the core text of the pages it downloads. Second, like I said, I need to toss out words with fewer occurrences. Finally… I need a better (read: longer) article to compare against than the text from the previous post. Basically, there just isn’t enough of it.
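The word-trimming step could look something like the sketch below. Everything here is an assumption for illustration: the helper name `trim_counts`, the `min_total` and `drop_top` parameters, and their default values are all hypothetical, and sensible cutoffs would need to be found by experiment.

```python
from collections import Counter


def trim_counts(all_counts, min_total=2, drop_top=1):
    """Drop the rarest and the most common words from every document.

    all_counts: list of word -> count mappings, one per document.
    min_total:  words seen fewer than this many times overall are dropped.
    drop_top:   this many of the most frequent words overall are dropped.
    """
    # Total occurrences of each word across all documents.
    totals = Counter()
    for counts in all_counts:
        totals.update(counts)

    too_common = {word for word, _ in totals.most_common(drop_top)}
    keep = {
        word
        for word, total in totals.items()
        if total >= min_total and word not in too_common
    }

    return [
        {word: count for word, count in counts.items() if word in keep}
        for counts in all_counts
    ]


if __name__ == "__main__":
    docs = [
        {"the": 5, "cat": 2, "rare": 1},
        {"the": 4, "cat": 1, "dog": 3},
    ]
    print(trim_counts(docs))
```

Trimming against totals for the whole collection, rather than per document, keeps the vocabulary consistent across every pair of documents being compared.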

