More correlation madness


I’m having a blast playing with the Pearson correlation calculation. This time I’ve written an app that pulls the top stories off of Reddit, downloads them, runs a quick word count on each, and stores the results in MongoDB (which is really super simple and handy in this case). I think I need to do some more tuning… probably need to wipe out the most and least used words from the comparisons… but here is a preliminary correlation sample.
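The post doesn’t include the app’s code, but a minimal sketch of the comparison step might look something like this. The tokenizer and function names here are my own illustration, not the actual app; it counts words in two documents and computes Pearson’s r over their combined vocabulary:

```python
from collections import Counter
from math import sqrt

def word_counts(text):
    # Naive tokenizer: lowercase, split on whitespace, strip edge punctuation.
    return Counter(word.strip(".,!?\"'()").lower() for word in text.split())

def pearson(counts_a, counts_b):
    # Line the two documents up over their combined vocabulary, using 0
    # for any word one document never uses.
    vocab = sorted(set(counts_a) | set(counts_b))
    xs = [counts_a.get(w, 0) for w in vocab]
    ys = [counts_b.get(w, 0) for w in vocab]
    n = len(vocab)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    sum_y2 = sum(y * y for y in ys)
    num = n * sum_xy - sum_x * sum_y
    den = sqrt(n * sum_x2 - sum_x ** 2) * sqrt(n * sum_y2 - sum_y ** 2)
    # Identical word distributions give 1.0; unrelated ones hover near 0.
    return num / den if den else 0.0
```

Storing each `word_counts` result as a MongoDB document keyed by URL (say, via pymongo’s `insert_one`) would make re-running comparisons cheap, which is presumably why it’s handy here.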


  • 0.102800100988
  • 0.472552880944
  • 0.425154672208
  • 0.463452264456
  • 0.23856726673
  • 0.394875530066
  • 0.356758219203
  • 0.378502926029
  • 0.275317859192
  • 0.376670056001
  • 0.283753036092
  • 0.362785569257
  • -0.155582848044
  • -0.00684673283255
  • 0.0370534449745
  • -0.266162461616
  • 0.00708437321615
  • 0.154161227371
  • 0.319834105137


First impressions? It seems reasonably accurate. It matches the text I gave it (the previous post from this blog) more closely with other articles than with pages like GitHub, and nothing is a really close match.

Like I said, I think I need to do a bunch more tuning…

First off, it would be nice if the scraper I’ve built actually grabbed the core text of the pages it downloads. Second, like I said, I need to toss out words with too few occurrences. Finally… I need a better (read: longer) article to compare against than the text from the previous post. There just isn’t enough of it.
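The second tuning step above, dropping the most and least used words before comparing, could be sketched like this. The thresholds and the tiny stopword set are hypothetical placeholders, not values from the app:

```python
from collections import Counter

# A stand-in stopword list; a real one would be much longer.
STOPWORDS = frozenset({"the", "a", "an", "and", "of", "to", "in", "is"})

def trim_vocabulary(counts, min_count=2, stopwords=STOPWORDS):
    # Drop very common filler words (stopwords) and one-off rare words,
    # so the correlation is driven by the meaningful middle of the vocabulary.
    return Counter({word: n for word, n in counts.items()
                    if n >= min_count and word not in stopwords})
```

Running both documents through `trim_vocabulary` before the correlation step should keep filler words from inflating every score.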
