More correlation madness

 

I’m having a blast playing with the Pearson correlation calculation. This time I’ve written an app that pulls the top stories off Reddit, downloads the linked pages, does a quick word count on each one, and stores the counts in MongoDB (which is really simple and handy in this case). I think I need to do some more tuning, probably wiping out the most and least used words from the comparisons, but here is a preliminary correlation sample.

 

  • http://leapgamer.com/blog/14/browsing_reddit_with_the_leap_motion_and_greasemonkey 0.102800100988
  • http://www.infoq.com/news/2013/02/MongoDB-Fault-Tolerance-Broken 0.472552880944
  • http://lwn.net/Articles/534735/ 0.425154672208
  • http://programminggroundup.blogspot.com/ 0.463452264456
  • http://www.doxsey.net/blog/go-and-assembly 0.23856726673
  • http://blog.getprismatic.com/blog/2013/2/1/graph-abstractions-for-structured-computation 0.394875530066
  • http://comoyo.github.com/blog/2013/02/06/the-inverse-of-ioc-is-control/ 0.356758219203
  • http://shopkick.github.com/flawless/ 0.378502926029
  • http://ericlippert.com/2013/02/06/static-constructors-part-one/ 0.275317859192
  • http://swizec.com/blog/first-impressions-of-rails-as-a-javascripter/swizec/5948 0.376670056001
  • http://solarianprogrammer.com/2013/02/07/sorting-data-in-parallel-cpu-gpu-2/ 0.283753036092
  • http://blogs.jetbrains.com/dotnet/2013/02/using-resharper-with-monotouch-applications/ 0.362785569257
  • http://blog.etapix.com/2013/02/hacking-liferay-securing-against-online.html -0.155582848044
  • http://shuklan.com/haskell/index.html -0.00684673283255
  • http://channel9.msdn.com/Series/Developing-HTML-5-Apps-Jump-Start 0.0370534449745
  • http://forthfreak.net/jsforth80x25.html -0.266162461616
  • https://gist.github.com/AdrianGaudebert/4708381 0.00708437321615
  • http://weblogs.asp.net/gunnarpeipman/archive/2013/02/07/using-database-unit-tests-in-visual-studio.aspx 0.154161227371
  • http://mindref.blogspot.com/2013/02/sql-vs-orm.html 0.319834105137
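
Scores like the ones above can be produced by treating each page as a vector of word counts and computing Pearson's r between two such vectors. The sketch below is a minimal illustration of that idea, not the app's actual code: the function names `word_counts` and `pearson`, the simple regex tokenizer, and the choice to compare over the union of both vocabularies (with a count of 0 for missing words) are all assumptions.

```python
import math
import re
from collections import Counter


def word_counts(text):
    """Count lowercase word occurrences (assumed tokenizer: letters and apostrophes)."""
    return Counter(re.findall(r"[a-z']+", text.lower()))


def pearson(counts_a, counts_b):
    """Pearson's r between two word-count mappings.

    Assumption: the vectors are built over the union of both vocabularies,
    so a word missing from one document counts as 0 there.
    """
    vocab = sorted(set(counts_a) | set(counts_b))
    n = len(vocab)
    if n == 0:
        return 0.0

    xs = [counts_a.get(w, 0) for w in vocab]
    ys = [counts_b.get(w, 0) for w in vocab]
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n

    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)

    denom = math.sqrt(var_x * var_y)
    if denom == 0:
        # One of the vectors is constant, so r is undefined; report 0.0 here.
        return 0.0
    return cov / denom
```

With this setup, comparing a document against itself gives a score of 1, and two documents that share no words at all come out negative, which may be part of why some of the pages above dip below zero.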

 

First impressions? It seems reasonably accurate. The text I gave it (the previous post from this blog) correlates more strongly with other articles than it does with pages like the GitHub gist, and nothing is really a close match.

Like I said, I think I need to do a bunch more tuning…

First off, it would be nice if the scraper I’ve built were truly grabbing the core text of the pages it downloads. Second, like I said, I need to toss out the words with very few occurrences. Finally, I need a better (read: longer) article to compare against than the text from the previous post. Basically, there just isn’t enough of it.
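
The word-trimming step could look something like the sketch below. It is only one possible approach, under stated assumptions: `trim_counts` is a hypothetical helper, each document is assumed to already be a `Counter` of word counts, and the cutoffs `top_n` and `min_count` are made-up defaults that would need tuning against real data.

```python
from collections import Counter


def trim_counts(doc_counts, top_n=50, min_count=2):
    """Drop the most and least used words from every document.

    doc_counts: list of Counter objects, one per document.
    top_n:      how many of the most frequent words (corpus-wide) to discard.
    min_count:  words with fewer total occurrences than this are discarded.
    """
    # Total occurrences of each word across all documents.
    totals = Counter()
    for counts in doc_counts:
        totals.update(counts)

    too_common = {word for word, _ in totals.most_common(top_n)}
    keep = {
        word
        for word, total in totals.items()
        if total >= min_count and word not in too_common
    }

    # Rebuild each document's counts using only the surviving words.
    return [
        Counter({word: n for word, n in counts.items() if word in keep})
        for counts in doc_counts
    ]
```

Trimming corpus-wide rather than per document keeps every document on the same vocabulary, so the correlation is still comparing like with like.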
