Chess-like player ratings (Was: Scraping policy) |
tilps Kwon-Tom Obsessive Puzzles: 6720 Best Total: 18m 37s | Posted - 2007.01.12 12:52:33 I have been wanting to do some analysis of player times and things for my own personal interest, so I wrote a scraper, which can collect times from the website, as they appear (well somewhat, it's currently set to check once a minute on the main scores page, and check the daily leaderboards anytime it notices a change on the main scores page). However, using scrapers against the wishes of the webmaster would be bad form, so I am here, enquiring as to what foilman thinks on this matter.
Last edited by tilps - 2007.01.15 21:06:37 |
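A minimal sketch of the polling scheme described above: check the main scores page on a fixed interval and only fetch the daily leaderboards when the main page has changed. The function names, the hashing approach, and the injected `fetch_main` / `fetch_daily` callables are assumptions for illustration, not tilps's actual code.

```python
import hashlib
import time
from typing import Callable, List, Optional


def page_fingerprint(html: str) -> str:
    """Return a stable digest of the page so changes are cheap to detect."""
    return hashlib.sha256(html.encode("utf-8")).hexdigest()


def poll(
    fetch_main: Callable[[], str],
    fetch_daily: Callable[[], str],
    interval_seconds: float = 60.0,
    max_polls: Optional[int] = None,
    sleep: Callable[[float], None] = time.sleep,
) -> List[str]:
    """Poll the main page; fetch the daily leaderboard only when it changed.

    fetch_main and fetch_daily are hypothetical callables that return HTML.
    Returns every daily page that was fetched (useful for testing).
    """
    last_seen: Optional[str] = None
    daily_pages: List[str] = []
    polls = 0
    while max_polls is None or polls < max_polls:
        current = page_fingerprint(fetch_main())
        if current != last_seen:
            # The main scores page changed, so the leaderboards may have too.
            daily_pages.append(fetch_daily())
            last_seen = current
        polls += 1
        if max_polls is None or polls < max_polls:
            sleep(interval_seconds)
    return daily_pages
```

Passing the fetchers in as callables keeps the loop independent of any particular HTTP library and makes it possible to exercise it without touching the site.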
foilman Kwon-Tom Admin Puzzles: 3614 Best Total: 24m 6s | Posted - 2007.01.12 13:58:17 I don't mind if you want to run something like that, but if there are any specific stats you're after, I can probably extract them for you which might be easier! |
tilps Kwon-Tom Obsessive Puzzles: 6720 Best Total: 18m 37s | Posted - 2007.01.12 15:01:13 I was wanting to experiment with ranking systems. I find the leaderboard somewhat ... non-representative due to the significant number of good players who do 4 puzzles in a row on the weekends, or similar.
I was thinking of seeing if I could collect a bunch of times for people and apply a variant of the chess ranking system, and see how it behaves. I'm sure you could extract every time for every person ever recorded and give them to me if you wanted, but I think I would get a better feel for how well the rankings play if I experience them at the same time as observing the current rankings. So while historic data can be useful for ensuring the system works even vaguely right, the subjective viewpoint would work best if I had access to continuous data.
If you had a simple page which was 'recent puzzle solvings' and just listed all the puzzles solved in the last half hour (or longer, depending on taste...), my scraper would become trivial, and much more reliable (and less frequent in its visits). But I'll live with whatever I can get.
Thanks though! |
foilman Kwon-Tom Admin Puzzles: 3614 Best Total: 24m 6s | Posted - 2007.01.12 16:13:57 Have a look at this page and see if it's doing roughly what you want:
Puzzles Solved In Last Hour |
drnull Kwon-Tom Obsessive Puzzles: 1053 Best Total: 23m 25s | Posted - 2007.01.12 16:48:52
Quote: Originally Posted by foilman |
Quote: You are not authorised to view this page! |
Pttbtbtbtbbt!!!
Meany!! |
tilps Kwon-Tom Obsessive Puzzles: 6720 Best Total: 18m 37s | Posted - 2007.01.12 22:20:34 As drnull has discovered, the page only shows if you are logged in. (Correction: logged in as me, it would seem. I assume foilman can also see the page.) Which I guess means maybe I should learn how to write a scraper which can impersonate myself rather than just acting as an anonymous user. But as far as I can tell so far, the page looks perfect.
Except for one minor thing, there appears to be no way to determine (short of assuming timezone difference relative to my own time, and accuracy of my own clock relative to yours) which day the page considers to be 'today'. Previously I was stealing that info out of the hrefs for the individual days listed on the leaderboard page. (First href matching given regex gave me a number which I could use to represent today, also told me what day name today is.) One way to make this easy enough to do, would be to make the 'day' column a hyperlink to the appropriate single day leaderboard. Or just stick with what you've got and I'll simply make my program smarter to compensate.
Last edited by tilps - 2007.01.13 02:01:43 |
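The "first href matching a regex" trick can be sketched as follows. The thread confirms that the day number is called `scday` and appears in the leaderboard link URLs, but the exact URL and markup shape used here (`scores.php?scday=812`) is an assumption made up for the example.

```python
import re
from typing import Optional, Tuple

# Assumed link shape: <a href="...?scday=812">Friday</a>
# Only the parameter name "scday" comes from the thread; the rest is a guess.
DAY_LINK = re.compile(
    r'href="[^"]*[?&]scday=(\d+)[^"]*"[^>]*>\s*([A-Za-z]+)\s*<'
)


def find_today(html: str) -> Optional[Tuple[int, str]]:
    """Return (scday number, day name) from the first matching link, if any."""
    match = DAY_LINK.search(html)
    if match is None:
        return None
    return int(match.group(1)), match.group(2)
```

This relies on the leaderboard listing the current day first, which is what the post describes; if the ordering ever changed, the function would silently return the wrong day.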
foilman Kwon-Tom Admin Puzzles: 3614 Best Total: 24m 6s | Posted - 2007.01.12 23:36:36 Try it now...
oh and by the way, I can change "last hour" to anything you want really...
Last edited by foilman - 2007.01.12 23:41:16 |
tilps Kwon-Tom Obsessive Puzzles: 6720 Best Total: 18m 37s | Posted - 2007.01.12 23:38:12 Excellent! - I just finished working out how to deal with cookies too
Last hour is a nice clean round number. I've upped my script to every 10 minutes for the moment, will push that higher once I gain confidence everything is working smoothly. Once I'm really happy, I will change the script from a Windows app to a Windows service to ensure I don't forget to run it whenever I have to reboot my machine. It would be a rare case that my computer is off for more than an hour.
Last edited by tilps - 2007.01.13 00:04:52 |
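For readers wondering what "dealing with cookies" involves, here is a rough sketch using only the Python standard library. The login URL and the form field names (`username`, `password`) are placeholders; the thread does not say how the site's login form actually works.

```python
import http.cookiejar
import urllib.parse
import urllib.request
from typing import Tuple


def make_session() -> Tuple[urllib.request.OpenerDirector, http.cookiejar.CookieJar]:
    """Build an opener that remembers cookies between requests."""
    jar = http.cookiejar.CookieJar()
    opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(jar))
    return opener, jar


def build_login_request(login_url: str, username: str, password: str) -> urllib.request.Request:
    """Prepare a form POST; the field names are hypothetical."""
    form = urllib.parse.urlencode({"username": username, "password": password})
    return urllib.request.Request(login_url, data=form.encode("ascii"))


def fetch_logged_in(opener: urllib.request.OpenerDirector,
                    login_request: urllib.request.Request,
                    page_url: str) -> str:
    """Log in once, then fetch a members-only page with the session cookie."""
    opener.open(login_request).close()  # server sets the session cookie here
    with opener.open(page_url) as response:
        return response.read().decode("utf-8", errors="replace")
```

Because the cookie jar lives on the opener, every later request made through the same opener carries the session cookie automatically.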
tilps Kwon-Tom Obsessive Puzzles: 6720 Best Total: 18m 37s | Posted - 2007.01.13 02:40:56 My rating system appears to be basically functional. I have two output files, one is 'absolute ratings as of a week ago' - based only on daily puzzles which are no longer open for competition and 'preliminary ratings as of right now' which are based on all of the puzzles, including this week's. Absolute ratings therefore change once a day, preliminary ratings change every time someone completes a puzzle. Given I've only been collecting data since yesterday - the stable ratings are purely based on the top 10, for that day, with everyone else being given middle of the road ratings, despite the fact they all did worse (or didn't compete).
Therefore it is way too early to tell if the rating system is 'enjoyable' (too early to tell if the rating system is 'accurate' too for that matter!) but if anyone wants to see my ratings (and foilman says okay!), I'll post them up on my web page or something. |
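The thread never spells out which variant of the chess system tilps used, so the following is only one plausible reading: treat every pair of solvers of the same puzzle as a single chess game in which the faster time wins, and apply the standard Elo update to each pair. The starting rating of 1500, the K-factor of 16, and the pairwise treatment are all assumptions.

```python
from typing import Dict, List, Tuple


def expected_score(rating_a: float, rating_b: float) -> float:
    """Standard Elo expectation that A beats B."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def rate_puzzle(
    ratings: Dict[str, float],
    results: List[Tuple[str, float]],
    k: float = 16.0,       # assumed K-factor
    default: float = 1500.0,  # assumed rating for unseen players
) -> Dict[str, float]:
    """Return new ratings after one puzzle.

    results is a list of (player, solve time in seconds). Each pair of
    players is scored as one game: faster time wins, equal times draw.
    All pairs use the pre-puzzle ratings, so result order does not matter.
    """
    deltas = {player: 0.0 for player, _ in results}
    for i, (player_a, time_a) in enumerate(results):
        for player_b, time_b in results[i + 1:]:
            rating_a = ratings.get(player_a, default)
            rating_b = ratings.get(player_b, default)
            if time_a < time_b:
                score_a = 1.0
            elif time_a > time_b:
                score_a = 0.0
            else:
                score_a = 0.5
            expect_a = expected_score(rating_a, rating_b)
            deltas[player_a] += k * (score_a - expect_a)
            deltas[player_b] += k * ((1.0 - score_a) - (1.0 - expect_a))
    updated = dict(ratings)
    for player, delta in deltas.items():
        updated[player] = ratings.get(player, default) + delta
    return updated
```

Under this reading, the "stable" list would come from running the update only over puzzles that are closed for competition, and the "preliminary" list from running it over everything collected so far.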
tilps Kwon-Tom Obsessive Puzzles: 6720 Best Total: 18m 37s | Posted - 2007.01.13 03:26:37 Hmm, realised there was a situation I wasn't handling, in my code.
If someone does Sunday's puzzle just before we turn over to Sunday's puzzle tomorrow, I don't know what either of the 'last solved puzzle' or 'last hour page' will say for that solving once the day flips over. Does it say Sunday (as opposed to today (which is also a Sunday))?
The reason why I thought of this was because I was going to ask (since you offered) if you could jack the page up to 1 week for an hour, so my program filled in the last week - but then realised there would be solving times for the week before's puzzles, which I would have no way of working out which week they were from.
I noticed momentarily earlier that you put the scday number in brackets next to Today - and then changed it to give the actual day name. The scday number would actually work better for me, and if you put it next to Every entry on that page - I could stop parsing the day name entirely and it would work in all situations regardless of how far back the page shows.
Given that I'm the only person who can see the page so far - unless you have plans to make this page generally usable, showing the scday numbers would seem the way to go. Up to you of course. If it is going to be generally usable, making the day names into a link to the scday page with the scday number in the href url, is trivial enough for my regex to handle just fine. |
foilman Kwon-Tom Admin Puzzles: 3614 Best Total: 24m 6s | Posted - 2007.01.13 10:21:30 The numbers for the days are now in there, but hidden from general display. If the last puzzle completed was a week ago, it says "last sunday" (or whatever day it was) but you don't need to worry about that now I guess! |
tilps Kwon-Tom Obsessive Puzzles: 6720 Best Total: 18m 37s | Posted - 2007.01.13 12:13:09 Nice use of html comments
My scraper has been updated and understands the new format.
If you don't mind, my ratings could do with some more data to use behind them, so if you could turn 'lasthour' into 'lastweek' for half an hour (or however long until you turn it back) my script is currently checking every 30 minutes and will pick up and process everything, which would be great.
Last edited by tilps - 2007.01.13 13:16:30 |
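Reading day numbers out of HTML comments can be sketched with the standard library's `HTMLParser`, which calls `handle_comment` for each comment it encounters. The comment format shown here (`<!-- scday=812 -->`) is invented for the example; the thread only says that the numbers are present but hidden from general display.

```python
import re
from html.parser import HTMLParser
from typing import List

# Assumed comment body, e.g. " scday=812 " or "scday: 806".
SCDAY_COMMENT = re.compile(r"\s*scday\s*[:=]?\s*(\d+)\s*")


class ScdayCollector(HTMLParser):
    """Collect scday numbers found inside HTML comments, in page order."""

    def __init__(self) -> None:
        super().__init__()
        self.scdays: List[int] = []

    def handle_comment(self, data: str) -> None:
        match = SCDAY_COMMENT.fullmatch(data)
        if match:
            self.scdays.append(int(match.group(1)))


def extract_scdays(html: str) -> List[int]:
    """Return every scday number hidden in the page's comments."""
    collector = ScdayCollector()
    collector.feed(html)
    collector.close()
    return collector.scdays
```

Keying results by day number rather than day name sidesteps the "which Sunday?" ambiguity raised earlier in the thread, since "last sunday" and "Today" map to different numbers.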
foilman Kwon-Tom Admin Puzzles: 3614 Best Total: 24m 6s | Posted - 2007.01.13 20:18:11 It's actually not very easy to change "last hour" to "last week", as the database is constantly clearing itself out - I modified it to store the last hour's results before clearing, but if you want a week's worth you'd have to wait a week!! |
tilps Kwon-Tom Obsessive Puzzles: 6720 Best Total: 18m 37s | Posted - 2007.01.13 23:24:11 Aha I see - oh well, never mind then! I can just wait a week for mine to collect it then
Thanks for all the help.
Now to the other question, do you mind if I share my ratings on my website? |
foilman Kwon-Tom Admin Puzzles: 3614 Best Total: 24m 6s | Posted - 2007.01.15 10:20:19 No, please go ahead, it'll be interesting to see! |
tilps Kwon-Tom Obsessive Puzzles: 6720 Best Total: 18m 37s | Posted - 2007.01.15 11:45:03 Latest Ratings Stable ratings
I wouldn't even bother looking at the stable ratings for now, they suffer from extremely incomplete data. Even the latest ratings will take a while to recover from the initial incomplete seed data. Over the next few weeks, I expect that the top end will keep going up and the bottom end going down, relatively rapidly at first, and then slower and eventually people's ratings will wiggle back and forth as they have bad days and good days. I'll try to remember to update these files each day before I go to work (which is about 3 hours before end of 'kwontom day', different time zones and all).
Last edited by Tilps - 2007.01.23 03:44:43 |
foilman Kwon-Tom Admin Puzzles: 3614 Best Total: 24m 6s | Posted - 2007.01.15 11:50:39 It'll be interesting to see how your rankings handle the case where I have to go back and modify someone's time when they cheat... it doesn't happen often, only two or three times in the past year, but it does happen. |
tilps Kwon-Tom Obsessive Puzzles: 6720 Best Total: 18m 37s | Posted - 2007.01.15 12:02:42 I had considered that, I will need to add a program for manually fiddling with the results - of course I would have to notice them first, or be told about them. I also probably need a system where people who haven't done any puzzles in the last two weeks are no longer shown in the ratings list. If they start playing again, they will show up again, but in the meantime they disappear.
I regenerate all of my ratings from the beginning of time from scratch every time I run the analysis program, it doesn't take long - so changing past results, or importing additional data for the last 2 years, is not a problem.
btw - for detecting cheaters, I think keeping track of dlmc fully and returning just that number, could be useful for making the decision, keep the limit on the lists, but track the number accurately.
Updated the styling to make it look a bit more like the leaderboard. ... Only vaguely though.
Last edited by tilps - 2007.01.15 13:09:15 |
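Because the ratings are regenerated from scratch on every run, both manual corrections and the two-week inactivity rule can be applied as plain filters around the replay. The sketch below assumes results are stored as `(scday, player, seconds)` tuples, that scday numbers increase by one per day, and that some per-day rating function `rate_day` exists; all three are assumptions for illustration.

```python
from typing import Callable, Dict, List, Optional, Set, Tuple

Result = Tuple[int, str, float]  # (scday, player, seconds): assumed layout
Corrections = Dict[Tuple[int, str], Optional[float]]  # None voids a result
DayRater = Callable[[Dict[str, float], List[Tuple[str, float]]], Dict[str, float]]


def apply_corrections(results: List[Result], corrections: Corrections) -> List[Result]:
    """Replace or drop individual results, e.g. after a cheat is found."""
    cleaned: List[Result] = []
    for scday, player, seconds in results:
        key = (scday, player)
        if key in corrections:
            fixed = corrections[key]
            if fixed is None:
                continue  # result voided entirely
            seconds = fixed
        cleaned.append((scday, player, seconds))
    return cleaned


def rebuild(results: List[Result], corrections: Corrections,
            rate_day: DayRater) -> Dict[str, float]:
    """Replay all corrected results in day order, starting from no ratings."""
    by_day: Dict[int, List[Tuple[str, float]]] = {}
    for scday, player, seconds in apply_corrections(results, corrections):
        by_day.setdefault(scday, []).append((player, seconds))
    ratings: Dict[str, float] = {}
    for scday in sorted(by_day):
        ratings = rate_day(ratings, by_day[scday])
    return ratings


def visible_players(results: List[Result], today: int,
                    max_idle_days: int = 14) -> Set[str]:
    """Players with at least one solve in the last max_idle_days days."""
    last_seen: Dict[str, int] = {}
    for scday, player, _ in results:
        last_seen[player] = max(scday, last_seen.get(player, scday))
    return {p for p, day in last_seen.items() if today - day <= max_idle_days}
```

Hidden players keep their rating in the full table; they are only filtered out of the displayed list, so they reappear unchanged as soon as they solve another puzzle.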
tilps Kwon-Tom Obsessive Puzzles: 6720 Best Total: 18m 37s | Posted - 2007.01.15 20:43:32 Well, I remembered this morning.
I think I should also once a week do up a 'diff sheet', i.e. 'biggest movers'. But I'm unsure as to whether I should compare back to the stable at the same time, or keep a copy of the latest around, and diff against a week ago's latest. |
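Either comparison comes down to diffing two rating snapshots. A small sketch, assuming each snapshot is a dictionary from player name to rating and that players missing from either snapshot are skipped (both are assumptions, since the thread does not describe the file format):

```python
from typing import Dict, List, Tuple


def biggest_movers(
    old: Dict[str, float],
    new: Dict[str, float],
    top: int = 10,
) -> List[Tuple[str, float]]:
    """Return (player, rating change) pairs, largest absolute change first.

    Only players present in both snapshots are compared. Ties are broken
    alphabetically so the output is deterministic.
    """
    common = old.keys() & new.keys()
    changes = [(player, new[player] - old[player]) for player in common]
    changes.sort(key=lambda change: (-abs(change[1]), change[0]))
    return changes[:top]
```

Diffing against last week's saved "latest" snapshot compares like with like, whereas diffing against the stable list mixes in the effect of puzzles that are still open.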
tilps Kwon-Tom Obsessive Puzzles: 6720 Best Total: 18m 37s | Posted - 2007.01.16 06:52:34 This morning's ratings were bogus, I changed my save file format to conserve space, but hadn't updated the analysis code to understand the new format.
I updated again, so the new ones are correct.
Last edited by tilps - 2007.01.16 06:57:54 |