puzzle rating systems? |
ferkel Kwon-Tom Obsessive Puzzles: 536 Best Total: 39m 18s | Posted - 2008.02.25 00:14:49 Tilps,
I notice your puzzle #301 says "Rated SIC2(SI6)". What does this mean? Is there a good way of rating puzzles somehow, or is this just a subjective rating?
In a similar vein, Naivoj said recently "as per my testing it is even 50% harder than the 2nd hardest posted." - what kind of testing is that, Naivoj?
Thanks!
ferkel |
Tilps Kwon-Tom Obsessive Puzzles: 6722 Best Total: 18m 37s | Posted - 2008.02.25 02:41:32 The rating has little to no meaning to anyone who hasn't used my program.
SIC2 means S, simple iterative localized trial and error solver. I, Interactions between 'cells' and intersections are considered. This is sort of a limited form of edge coloring for doing '3 in a corner'. C, edge coloring logic is enabled, the solver will keep track of the 'same/opposite' nature of edges. 2 is a measure of how localized the simple solver is. 2 means that localized consequences up to 2 levels deeper then the trial are considered. Depending on the type of consequences, each level may go as far as two cells away from the edge being trialed or edges one cell away from the ends of the edge being trialed. The rating system gives the Minimum localization factor required to solve the puzzle for a given set of settings - the settings are indicated by the letters.
Other letters I have include N - solve puzzle without avoiding closing the loop too early. If this is selected and returns a rating, then there are no solutions to the puzzle which have multiple loops. R - allow the simple solver to perform nested guesses, having this on effectively doubles the localization factor used, but adds some additional complexity as well. O - track whether a cell is on the inside or outside and use that information to deduce edges. (Currently - I'll probably be adding an O+ soon which performs more advanced 'cell' shading operations) Coming soon - M - when neither a trial or its opposite lead to a localized contradiction, keep the common changes. (Currently this is a non-configurable default which is always on - I am finding however that turning it off generates more 'playable' puzzles. I'll be adding an M+ which performs advanced merging - generatings edge coloring information, more accurately merges common edge coloring changes)
Also, there is F which is to use the full brute force recusion solver. For performance reasons, the full recursion solver first uses the iterative solver to solve as much as possible (Equivelent to SIRCOM(Infinity)) If a puzzle is rated F (and F alone) it cannot be solved using the localized iterative solver and is either crazy hard or its easy solution relies on techniques not currently used by my simple solver, techniques like highlander or advanced cell shading (or advanced coloring discovered in merge).
My current favourite rating system is SIRCO(n) SIRCO(0) puzzles are comparable to easy and medium kwon-toms. SIRCO(1) are closer to hards. (Haven't experimented with SIRCO(2) yet, as both R and O are new features added yesterday and M used to always on by default until this morning.) |
Naivoj Kwon-Tom Addict Puzzles: 314 Best Total: 33m 50s | Posted - 2008.02.25 03:33:09
Quote: Originally Posted by ferkel Tilps, ... - what kind of testing is that, Naivoj? |
Basically the statistic I used to rank a puzzle is the #of explorations (or #of trial and error, or #of FP=(Fix Position)) the solver is forced to do after it ran out applying rules. Obviously this is subjective to the amount of rules coded: more rules coded = less exploration required. - The easy puzzles I have posted require no exploration to the solver. The testing I did was to run the solver for all 34 KT beasts posted and then sort them by this statistic "#of explorations". |
Tilps Kwon-Tom Obsessive Puzzles: 6722 Best Total: 18m 37s | Posted - 2008.02.25 04:14:42 I have in the past used statistics like number of trial and errors, but have found them to not correspond well to how the puzzle behaved when I played it.
If you have a puzzle which has 5 'almost patterns' which are trivially obvious to a simple trial/error, but aren't in your pattern database, vs a puzzle with 1 trial and error where it only leads to error if you are checking for closing the loop/coloring and shading and fill in 50 edges first. I'll take the first puzzle any day.
Based on my experience with sudoku ratings I found that rating by 'how smart the solver has to be to solve the puzzle at all' corresponded alot better with how hard the puzzle seemed to me. It may not correspond to how long it takes to solve the puzzle, since the puzzle can be consistanty medium difficulty or just one part of medium difficulty, but it does seem to me to be a better rating of the difficulty of the puzzle.
Additionally, number of explorations is a very variable number. If you are trying explorations top to bottom left to right - try running the same code with randomized exploration location, or bottom to top, right to left, or spiraling in outside to middle, or middle to outside , or exploring near recent success before else where.
On the other hand, a rating such as number of explorations is quick and easy to generate. With my sudoku solver I had a huge problem with generating ratings, a puzzle which could be solved in 1 second by my brute force solver (the brute force solver for sudoku is very fast), but the rating solver would take minutes, as it performed multiple nested binary searches of solvability on variaous 'smartness' parameters and pruned discovered details to the minimum required to make progress to try and produce more accurate ratings. |
Naivoj Kwon-Tom Addict Puzzles: 314 Best Total: 33m 50s | Posted - 2008.02.25 12:16:21
Quote: Originally Posted by tilps I have ... ratings. |
Last fall we actually tested the effect of starting resolving puzzles from different corners/directions on the #of.explorations stat and found out that when the #of.explo is really low there are variations otherwise it is pretty stable.
In my opinion most of what you are saying is truer for smaller puzzles (like 10x10) or for easier puzzles, but not that much for larger or harder puzzles. I am not sure where is the threshold but maybe something like 5 = if so then puzzles requiring more that 5 explorations are best rated by the #of.exploration (doesn't mean it is a perfect rating system); otherwise I agree the difficulty of the rules used by the solver is the best rating feature.
For my "EASY" user beasts I will only indicate if highlander or path deductions are required, because the 34 KT Beasts and the ones I have posted can all use hard patterns and/or color rules before going to trial and error(if required). None of those beasts are actually really easy, like the Monday/Tuesday KT 10x10 puzzles are: an "EASY" beast only means it is easier then the average beast posted.
I agree that the example you giving in the 2nd paragraph is a flaw in this rating system, but I would say it probably a flaw in any rating system. Another such example is the solver easily apply highlander patterns but have a hard time recognizing highlander situations that a human player can (should) take advantage of.
The #of.explo is not too good to rate most of the 10x10 KT puzzles because they usually required less than "5" explo, but I say it is good for beasts and user-puzzles(which tend to be really harder than the KT ones). Now for really really hard puzzles the best rating might be the CPU time required by the solver (where brutal force could be used). Again I was really impressed last fall when I tested LoopDeLoop against several of the hardest 10x10 puzzles: it was significantly faster than our solver. |
Tilps Kwon-Tom Obsessive Puzzles: 6722 Best Total: 18m 37s | Posted - 2008.02.25 14:39:50 I think you are counting 'number of explorations' in a way different to I expected - are you only counting them if they achieve anything? |
Tilps Kwon-Tom Obsessive Puzzles: 6722 Best Total: 18m 37s | Posted - 2008.02.26 00:50:38 I've been thinking about this a bit more and have come up with what I think is an interesting approach to generating a puzzle of a certain difficulty level. (Rating a random puzzle however becomes a process with alot more user input and less automation (or considerably more complicated))
The idea is that perceived difficulty is a compromise between number of explorations and the complexity of the explorations. Specifically, the more open space in a puzzle, the harder an operation of a specific difficulty is perceived to be. Therefore with the puzzle generation process being. 1) Generate loop with all clues. 2) Prune clues at random while loop is still solveable with the target solver difficulty. This is extended so that rather than a simple solvability test with the target solver we use an iterative approach First apply target solver without trial and error, just static observation. Next apply 'first difficulty constraint' solver followed by static observation with the target solver again. If percent of puzzle solved is less than first cutoff percent bail. Repeat for each subsequent difficulty constraint and their specified bail out percentage.
Using an extension of my rating notation a user would specify the puzzle to be generated as S2>20%, SI2>40%, SIC1>60%, SIRC1>80%, SIRCO1 For even better quality puzzles it would be ideal to put upper bounds as well, but they extend the generating time considerably since upper bounds can only be checked at the end of the pruning process and require you to restart the process from scratch.
Last edited by Tilps - 2008.02.26 00:50:58 |
Naivoj Kwon-Tom Addict Puzzles: 314 Best Total: 33m 50s | Posted - 2008.02.26 06:08:14
Quote: Originally Posted by tilps I think you are counting 'number of explorations' in a way different to I expected - are you only counting them if they achieve anything? |
No, all attempted explorations are counted, but only selective ones are attempted. The selection is emulating to a certain point how many humans are doing it, but good human players will choose better spots to start exploring, and will have a better sequence on the subsequent explorations following inconclusive ones.
Without going into too much details, mainly - only squares that are partially solved are explored: basically this correspond to exploring the edge of solved areas; - but not all possible explorations are done for each of those chosen squares = only the more effective ones are done.
For example for a chosen clue 3 where only his color is known the following exploration is done : - One cross on an unknown edge (for 4 possible explorations), forcing up to 3 lines which tends to be more conclusive. But the opposite exploration is not done: - One line on an unknown edge, because usually little is forced out of this, which tends to be more inconclusive.
Another example: if exactly one color of the 4 "cross squares" surrounding a clue 2 is known, it is good to explore an uncolored "cross square" as being of the same color (for 3 possible explorations), because it is forcing 2 squares to be of the opposite color, which tends to be more conclusive. But it not worth exploring an uncolored "cross square" as being of the opposite color, as it is not forcing the color of any remaining "cross square".
This I hope give a general idea of this selective exploration process. |
Tilps Kwon-Tom Obsessive Puzzles: 6722 Best Total: 18m 37s | Posted - 2008.02.26 07:05:28 Interesting that you are doing shading explorations as well as edge explorations, something for me to consider once I get the more advanced cell shading logic implemented. (Was writing the logic last night for shadings which touch by a corner and thinking how much simpler it would be if I didn't support arbitrary meshes )
My solver explores Everywhere - places which are of little consequence are discarded quickly anyway so have little effect on the processing time. On the other hand - edge trials on 3's and 1's in the middle of nowhere can have massive consequences leading to contradictions which allow for significant progress to be made. But that also serves to make the puzzles generated by my solver significantly harder even though the logic used is simple since the correct starting points for the contradictions are hard for players to spot
Last edited by Tilps - 2008.02.26 07:05:53 |
Naivoj Kwon-Tom Addict Puzzles: 314 Best Total: 33m 50s | Posted - 2008.02.26 07:08:40 The following demonstrate (but does not prove) the effectiveness of using this #of explorations stat for rating harder puzzles: on Dec12, 2007 I posted nine 20x14 user puzzles called "Largest- Progressively Harder" (#257-#265) in order of difficulty where the 4 easiest were based on the type of rules required to solve them without any T&E, and the 5 hardest were based on this #of explo stat. You can see the #of solvers decreasing as the difficulty is increasing with only one exception:
#257 1/9- 34 solvers - 0 Explo - Basic color rules required #258 2/9- 32 solvers - 0 Explo - Harder color rules required #259 3/9- 27 solvers - 0 Explo - Harder color rules & path deductions required #260 4/9- 25 solvers - 0 Explo - Harder color rules, path deductions & Highlander required #261 5/9- 25 solvers - 32 Explorations #262 6/9- 15 solvers - 82 Explorations #263 7/9- 17 solvers - 197 Explorations #264 8/9- 14 solvers - 273 Explorations #265 9/9- 13 solvers - 381 Explorations |
Naivoj Kwon-Tom Addict Puzzles: 314 Best Total: 33m 50s | Posted - 2008.02.26 08:09:46
Quote: Originally Posted by tilps Interesting ... (Was writing the logic last night for shadings) ... |
Since you are adding color rules, about adding the square coloring feature like in the Kwon-Tom loop puzzle solving page, where one click in the center of a square change his color this way: - when Blank : Left.Click=Red Right.Click=Yellow - when Red : Left.Click=Yellow Right.Click=Blank - when Yellow: Left.Click=Blank Right.Click=Red This would help you validate the new color rules you are coding. Furthermore LoopDeloop would become the best place to solve beast puzzles using color.
Quote: Originally Posted by tilps My solver explores Everywhere... But that also serves to make the puzzles generated by my solver significantly harder ... |
Exploring everywhere is fine, but then you can not use the #of explorations stat to rate a puzzle the way humans tend to solve them.
I would like to see some harder LoopDeLoop puzzle, can you post in the KT user puzzle page a very hard 10x10 and a 20x14 one?
Last edited by Naivoj - 2008.02.26 08:29:54 |
Tilps Kwon-Tom Obsessive Puzzles: 6722 Best Total: 18m 37s | Posted - 2008.02.26 12:01:23
Quote: Originally Posted by naivoj The following demonstrate (but does not prove) the effectiveness of using this #of explorations stat for rating harder puzzles: on Dec12, 2007 I posted nine 20x14 user puzzles called "Largest- Progressively Harder" (#257-#265) in order of difficulty where the 4 easiest were based on the type of rules required to solve them without any T&E, and the 5 hardest were based on this #of explo stat. You can see the #of solvers decreasing as the difficulty is increasing with only one exception:
|
Interesting stats - although the controlled study purest in me states that a) posting them in order of increasing difficulty and b) stating that they are as such - could bias your results
On the other hand, I think your explorations count is much better then I initially expected, because the explorations locations are so controlled. I'll see about posting some hardish puzzles (I won't sit here all day trying to generate them as hard as possible ) |
Tilps Kwon-Tom Obsessive Puzzles: 6722 Best Total: 18m 37s | Posted - 2008.02.26 12:40:44
Quote: Originally Posted by naivoj Since you are adding color rules, about adding the square coloring feature like in the Kwon-Tom loop puzzle solving page, where one click in the center of a square change his color this way: - when Blank : Left.Click=Red Right.Click=Yellow - when Red : Left.Click=Yellow Right.Click=Blank - when Yellow: Left.Click=Blank Right.Click=Red This would help you validate the new color rules you are coding. Furthermore LoopDeloop would become the best place to solve beast puzzles using color.
|
For validation purposes the current ability to display the colors as numbers (see option show inside/outside colors (advanced) on the appearance entry in the settings page) has been sufficient. I will definitely be adding cell coloring capabilities for the end user, but the "determining what the user intended to click on" code will need some significant work.
I only had time to generate one 20x14 puzzle (and its symmetrical to improve the generation time even then.) but I made a couple of 10x10, one of which should be very hard (At least, without using highlander that is...)
Last edited by Tilps - 2008.02.26 20:34:03 |
Naivoj Kwon-Tom Addict Puzzles: 314 Best Total: 33m 50s | Posted - 2008.02.27 05:03:15
Quote: Originally Posted by tilps ...a) posting them in order of increasing difficulty and b) stating that they are as such - could bias your results |
Good points, I will redo this right sometimes in the near future. At the time I was more trying to show I could rate them in the right order. But I think it is still a pretty good demonstration, If you solve them yourselves I believe you will agree with the increasing difficulty.
Quote: Originally Posted by tilps ...I'll see about posting some hardish puzzles (I won't sit here all day trying to generate them as hard as possible) |
The advantage of the #explo stat is you can ask the solver to generate hardish puzzles simply by specifying a minimum of #explo, so it is the computer that stay up all night (generating puzzles randomly until one comply), while I am sleeping
Quote: Originally Posted by tilps I only had time to generate one 20x14 puzzle (and its symmetrical to improve the generation time even then.) but I made a couple of 10x10, one of which should be very hard (At least, without using highlander that is...) |
Thanks for posting those. Just a quick comment on the 1st one (#302): everything is forced for me (without highlanders) but the solver need 2 explorations, which means coding improvements are coming! I will comment in more details in a new "spoilers" topic hopefully tomorrow when I have more time. |
Tilps Kwon-Tom Obsessive Puzzles: 6722 Best Total: 18m 37s | Posted - 2008.02.27 08:59:26 I could easily make my program generate puzzles until it is above a minimum - but I usually find that for a puzzle solvers of moderate to small powers, the first puzzle will almost always be exactly that level - and the puzzle solver is for people to generate and play - not leave generating all night to play some other time, so I feel the compromise on responsiveness instead of accuracy is in the users interest. |
Naivoj Kwon-Tom Addict Puzzles: 314 Best Total: 33m 50s | Posted - 2008.02.28 08:37:15 Because this "min #of explo" parameter is end-user configurable there is no downside, it's just giving him more flexibility: the user decides, if he wants to wait all night to get a hardish one, or if doesn't want to wait and get a regular one. He can also launch the program once for an all night hardish creation, while he play on a 2nd launch regular puzzles.
The hardest puzzle you have post for me is definitely the 20x14 #304, but for our solver it is 10x10 #305. Personally in #305 I am making one "new" path deduction (too difficult to code) of roughly the same type that I explained for #302 (path deduction #2), but that is only making it a little bit easier.
The density of lines in these 4 puzzles is low, I supposed you set it this way to get a faster response. Is this controlled by the option "Target line length of max"? In the B30 version I downloaded It seems I can not changed or even see the value of this option: - The last word (max) of the label text is actually in top of the input box (font problem maybe?) - and tabbing is skipping that input box field |
Tilps Kwon-Tom Obsessive Puzzles: 6722 Best Total: 18m 37s | Posted - 2008.02.28 10:42:19 It sounds to me that you have non-standard font size yes. - the dialog looks just fine for me. Its the second last text box on the puzzle creation settings page.
The tab does stop there, but its been a while since I normalized the tab stop order - it goes there after consider cell intersection interaction.
As for the line length, in actuality I just forgot to put it higher, the line generation phase takes about a second max - because I only try to generate the loop at most 100 times - although when it does timeout like that the loop generated is often very small. That is one disadvantage of generating a loop the way I do. (Edit, turns out its 10thousand times, and when you set it so high it doesn't reach it, it takes a good 30 seconds to timeout. I noticed this tonight because I had moved my setting up to 0.8 which is extremely hard to reach on hexagon2 and some of the other complicated multishape cell designs)
Last edited by Tilps - 2008.02.28 12:46:40 |
Naivoj Kwon-Tom Addict Puzzles: 314 Best Total: 33m 50s | Posted - 2008.02.29 05:57:44 Could you move the input box for option "Target line length of max" one inch to the right, in a future version? This would solve the font problem You are right, tabbing is not skipping that input box field, it is just tabbing in an unexpected sequence. But this field value is not visible. |
Tilps Kwon-Tom Obsessive Puzzles: 6722 Best Total: 18m 37s | Posted - 2008.02.29 06:46:06 With respect to the font issue - I am interested in a few details - what OS are you running and do you have a custom DPI selected? (Large fonts option for instance) |
Naivoj Kwon-Tom Addict Puzzles: 314 Best Total: 33m 50s | Posted - 2008.02.29 07:14:06 XP Professional Service Pack 2. I am not observing a large font. Do you mean "screen resolution" by DPI, If so I am using 1680x1050. |