Our blog. We live and breathe this stuff. Here we write musings on the subjects that matter.
Mon 15th March
UPDATE: Since writing this I have found that StackOverflow makes all of its data downloadable under a Creative Commons licence as the StackOverflow Data Dump, so my fun little screen scraper is of even less use.
For this weekend's project, I decided I wanted to make a graph of the distribution
of reputation within the StackOverflow community. I also felt the urge to play
with JavaScript, so I tried to do it all in the browser. After being reminded of
the cross-site restriction I realised this was impossible, so I created a quick
little IHttpHandler that screen scrapes the StackOverflow users page and returns
JSON with the rank, reputation and name of all the users on a given page.
Here is a sample:
Libraries Used:
Client Side:
The page is loaded with no data. The main JavaScript object "sof" is created and we bind the function drawGraph to its data changed event.
We then start a series of getPage calls, which use AJAX to fetch UserSummaries serialised as JSON. When an AJAX call completes, sof.data is updated
and drawGraph is called.
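As a rough sketch of that flow, the snippet below wires a data changed callback to a series of getPage calls. Only the names sof, getPage and drawGraph come from the page itself; createSof, onDataChanged and fakeFetchPage are made up for illustration, and the AJAX call is replaced by a stand-in that generates fake users so the sketch runs outside a browser.

```javascript
// Illustrative only: createSof, onDataChanged and fakeFetchPage are assumed names.
function createSof(fetchPage) {
  const listeners = [];
  const sof = {
    data: [],
    // Register a callback to run whenever sof.data changes.
    onDataChanged(listener) {
      listeners.push(listener);
    },
    // Fetch one page of user summaries and merge it into sof.data.
    getPage(page) {
      fetchPage(page, (users) => {
        sof.data = sof.data.concat(users).sort((a, b) => a.rank - b.rank);
        listeners.forEach((listener) => listener(sof.data));
      });
    },
  };
  return sof;
}

// Stand-in for the AJAX request; it invents 35 users per page.
function fakeFetchPage(page, callback) {
  const firstRank = (page - 1) * 35 + 1;
  const users = [];
  for (let i = 0; i < 35; i++) {
    const rank = firstRank + i;
    users.push({ rank: rank, reputation: 100000 - rank * 10, name: "user" + rank });
  }
  callback(users);
}

const sof = createSof(fakeFetchPage);
let redraws = 0;
sof.onDataChanged(() => { redraws += 1; }); // drawGraph would be bound here
[2, 1, 3].forEach((page) => sof.getPage(page)); // pages may complete in any order
```

Sorting by rank on every update means the graph data stays ordered even when the pages come back out of sequence.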
The function drawGraph shapes the data into what jqplot expects, removes any existing graph (as jqplot does not seem to be able to do this itself)
and draws the graph with the current data. As the various getPage calls complete, the graph is updated.
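Something along these lines would do the shaping and redraw. This is a sketch rather than the code from the page: toSeries and the targetId parameter are assumed names, and jQuery is passed in as an argument so the shaping step can be tried without a browser. jqplot takes an array of series, each series being an array of [x, y] points, so here x is the rank and y is the reputation.

```javascript
// Shape user summaries into one jqplot series: [[rank, reputation], ...].
function toSeries(users) {
  return users
    .slice() // copy first so the caller's array is not reordered
    .sort((a, b) => a.rank - b.rank)
    .map((user) => [user.rank, user.reputation]);
}

// Redraw from scratch. `$` is assumed to be jQuery with the jqplot plugin loaded.
function drawGraph($, targetId, users) {
  $("#" + targetId).empty(); // throw away the previous plot's DOM
  return $.jqplot(targetId, [toSeries(users)]); // a single series
}

const exampleSeries = toSeries([
  { rank: 2, reputation: 150000, name: "second" },
  { rank: 1, reputation: 170000, name: "first" },
]);
```

Emptying the target element before plotting is what stops each update from stacking a new graph on top of the old one.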
Server Side:
The hosting page does nothing fancy; in fact it can be a static page. Each call
of the JavaScript function getPage makes a call through to an IHttpHandler,
SOFPagedUserSummary, with two optional parameters: page and format.
SOFPagedUserSummary parses the parameters, then delegates to SOFPageFetch.
SOFPageFetch checks whether this page/format is in the cache (written to disk
to outlive worker process recycles). If it is present on disk we output the
cached version. The heavy use of caching here is to prevent this project
becoming a pain for StackOverflow. If it is not present in the cache we delegate
to SOFNetPageFetcher, which returns a List<UserSummary>. These are then formatted
as HTML or JSON, and this formatted output is cached on disk, then returned to
the client.
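The real handler is C#, but the check-the-cache-then-fetch logic is easy to sketch in JavaScript to match the client code. Everything here is an assumption made for illustration: the function names, the defaults of page 1 and JSON when a parameter is missing, and the use of a Map where the real thing writes files to disk.

```javascript
// Parse the two optional parameters. The defaults are assumptions.
function parseParams(query) {
  const page = Number.parseInt(query.page, 10);
  return {
    page: Number.isInteger(page) && page > 0 ? page : 1,
    format: query.format === "html" ? "html" : "json",
  };
}

function escapeHtml(text) {
  return String(text)
    .replace(/&/g, "&amp;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;")
    .replace(/"/g, "&quot;");
}

// Format a list of user summaries as HTML or JSON.
function formatUsers(users, format) {
  if (format === "html") {
    const items = users.map((user) =>
      "<li>" + user.rank + ". " + escapeHtml(user.name) + " (" + user.reputation + ")</li>");
    return "<ul>" + items.join("") + "</ul>";
  }
  return JSON.stringify(users);
}

// cache: anything with has/get/set (a Map here, files on disk in the original).
// fetchFromNetwork: the scraper, only called on a cache miss.
function createPageFetch(cache, fetchFromNetwork) {
  return function (query) {
    const params = parseParams(query);
    const key = params.page + "." + params.format;
    if (cache.has(key)) {
      return cache.get(key); // cache hit: the remote site is not contacted
    }
    const output = formatUsers(fetchFromNetwork(params.page), params.format);
    cache.set(key, output); // the formatted output is cached, not the raw list
    return output;
  };
}
```

Caching the formatted output per page/format pair means a repeat request never touches the remote site, at the cost of one extra scrape if the same page is requested in both formats.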
SOFNetPageFetcher connects to the StackOverflow users page, passing the page
number through, and parses the returned HTML using the HtmlAgilityPack. We then
proceed to scrape the page using XPath to extract data on each of the 35 users
returned; currently we are gathering rank, reputation and name. This is returned
as a List<UserSummary>. SOFPageFetch
HtmlWeb hw = new HtmlWeb();
HtmlDocument doc = hw.Load(
    String.Format(UserPage, pageNumber), "GET");
int rank = ((pageNumber - 1) * 35);
foreach (HtmlNode userInfo in
    doc.DocumentNode.SelectNodes("//div[@class='user-info']"))