More correlation madness
I’m having a blast playing with Pearson’s correlation. This time I’ve written an app that pulls the top stories off of Reddit, downloads them, does a quick word count on each, and stores the results in MongoDB (which is really simple and handy for this). I think I need to do some more tuning; I probably need to wipe out the most and least used words before comparing. But here is a preliminary correlation sample.
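The word-count-and-correlate step is the heart of it. Here’s a minimal sketch in plain Python (stdlib only) of how that comparison could work; the function names are my own, the Reddit fetch and the MongoDB write are left out, and this isn’t necessarily how the app itself does it:

```python
import re
from collections import Counter

def word_counts(text):
    """Lower-case the text and count occurrences of each word."""
    return Counter(re.findall(r"[a-z']+", text.lower()))

def pearson(counts_a, counts_b):
    """Pearson correlation of two word-count vectors over their combined vocabulary."""
    vocab = sorted(set(counts_a) | set(counts_b))
    xs = [counts_a.get(w, 0) for w in vocab]
    ys = [counts_b.get(w, 0) for w in vocab]
    n = len(vocab)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant vector has no defined correlation
    return cov / (var_x * var_y) ** 0.5
```

Comparing a reference article against each downloaded page is then just `pearson(word_counts(reference), word_counts(page))`, which yields values in the same −1 to 1 range as the numbers below.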
- http://leapgamer.com/blog/14/browsing_reddit_with_the_leap_motion_and_greasemonkey 0.102800100988
- http://www.infoq.com/news/2013/02/MongoDB-Fault-Tolerance-Broken 0.472552880944
- http://lwn.net/Articles/534735/ 0.425154672208
- http://programminggroundup.blogspot.com/ 0.463452264456
- http://www.doxsey.net/blog/go-and-assembly 0.23856726673
- http://blog.getprismatic.com/blog/2013/2/1/graph-abstractions-for-structured-computation 0.394875530066
- http://comoyo.github.com/blog/2013/02/06/the-inverse-of-ioc-is-control/ 0.356758219203
- http://shopkick.github.com/flawless/ 0.378502926029
- http://ericlippert.com/2013/02/06/static-constructors-part-one/ 0.275317859192
- http://solarianprogrammer.com/2013/02/07/sorting-data-in-parallel-cpu-gpu-2/ 0.283753036092
- http://blogs.jetbrains.com/dotnet/2013/02/using-resharper-with-monotouch-applications/ 0.362785569257
- http://blog.etapix.com/2013/02/hacking-liferay-securing-against-online.html -0.155582848044
- http://shuklan.com/haskell/index.html -0.00684673283255
- http://channel9.msdn.com/Series/Developing-HTML-5-Apps-Jump-Start 0.0370534449745
- http://forthfreak.net/jsforth80x25.html -0.266162461616
- https://gist.github.com/AdrianGaudebert/4708381 0.00708437321615
- http://weblogs.asp.net/gunnarpeipman/archive/2013/02/07/using-database-unit-tests-in-visual-studio.aspx 0.154161227371
- http://mindref.blogspot.com/2013/02/sql-vs-orm.html 0.319834105137
First impressions? It seems reasonably accurate. It’s matching the text I gave it (the previous post from this blog) more closely with other articles than with pages like the GitHub ones, and nothing is really a close match.
Like I said, I think I need to do a bunch more tuning…
First, it would be nice if the scraper I’ve built truly grabbed the core text of the pages being downloaded. Second, as I said, I need to toss out words with too few occurrences. Finally, I need a better (read: longer) article to compare against than the text of the previous post; there just isn’t enough of it.
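That frequency pruning could be as simple as dropping words outside a count window before correlating. A sketch, with entirely made-up thresholds (the right values would need experimentation):

```python
from collections import Counter

def prune(counts, min_count=2, max_count=50):
    """Drop the least used words (below min_count) and the most used
    words (above max_count) from a word-count vector.
    The thresholds here are arbitrary placeholders."""
    return Counter({w: c for w, c in counts.items()
                    if min_count <= c <= max_count})
```

Running both word-count vectors through `prune` before the correlation step would keep one-off words and boilerplate-level common words from dominating the comparison.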