Message boards : Server and website : More detailed server status page
Author | Message |
---|---|
Is it possible to display the batches (with the number of workunits) in the queues, not just the total number of the tasks?
Application                             unsent  in progress  valid  invalid  error
Short runs (2-3 hours on fastest card)       1           48    130        5     17
NOELIA_467x                                  1           48    130        5     17
Long runs (8-12 hours on fastest card)     114        2,295   1400       34    420
GERARD_FXCXCL12_LIG                         34        1,865    970       39     40
GERARD_PTCL2_CTL_IPZ1                       15          320    294       22     34
GERARD_PTCL2_CTL_PRZ1                       10          201    231       12     22
GERARD_PTCL_CTL_IPZ2                        11          181    175       10     12
GERARD_PTCL_CTL_PRZ1                         9          117    111        7      9
GERARD_PTCTL_LFE_AIN3                       14          104     94        5      8
GERARD_PTCTL_LFE_IBP2                        7           97     55        2      5
GERARD_PTCTL_PLA2_AIN3                      12           86     64        3      6
GERARD_PTCTL_PLA2_IBP1                      67           34     73        4      9
GERARD_VACXCL12_LIG                         42          309    211       20     19
SDOERR_ntl9evSSXX3                           3           15     15        0      1
These are *not* the actual numbers, so they won't add up. | |
ID: 41895 | Rating: 0 | |
Beta version released in the server_status page. I tweaked the original idea a bit, but you can see the information you wanted. I was a bit surprised by the error rate at first; later I realised how many errors a single client can produce, so it is not that bad after all. Hopefully this data will make it easier to detect corrupted batches. | |
ID: 41904 | Rating: 0 | |
"Beta version released in the server_status page."
Thank you!
"I tweaked the original idea a bit, but you can see the information you wanted."
Well, that's the point of it. :) Another purpose of this list is to "announce" new batches.
"I was a bit surprised by the error rate at first; later I realised how many errors a single client can produce, so it is not that bad after all."
Oh, another idea popped into my mind while I read your words: there should be a top list of the worst hosts (most errors per day) on the performance chart. Is there a way to make a "normalized" error rate column by filtering out these worst hosts? Would such a column be more conclusive?
"Hopefully this data will make it easier to detect corrupted batches."
Hopefully you've already had something like this for internal use before, right? :) | |
ID: 41905 | Rating: 0 | |
Nice work!!! | |
ID: 41906 | Rating: 0 | |
"There should be a top list of the worst hosts (most errors per day) on the performance chart."
Before you start listing the worst hosts, it would be a good idea to set up proper criteria for this. Errors can be put into 2 categories: host errors and non-host errors (like a bad batch of WUs, or the server canceling the units), so make sure hosts are labeled with host errors only. OK, this is obvious, but I don't want to be labeled with a scarlet letter because of a bad batch. Also, errors that happened a while back (which are mostly bad batch errors) should not count either. I would think that cleaning these errors out of the database should be a prerequisite. | |
ID: 41907 | Rating: 0 | |
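The criteria suggested above (count only host errors, and only recent ones) could be sketched roughly as follows. This is a minimal illustration, not the project's actual implementation: the category names, the tuple layout, and the `worst_hosts` function are all hypothetical, since the thread does not say how the server records error causes.

```python
from collections import Counter
from datetime import datetime, timedelta

# Hypothetical error categories; the real server's outcome codes are not given in the thread.
HOST_ERROR = "host_error"
BATCH_ERROR = "bad_batch"
SERVER_CANCEL = "server_cancel"


def worst_hosts(errors, now, window=timedelta(hours=24), top=10):
    """Rank hosts by host-attributable errors reported within `window` of `now`.

    `errors` is an iterable of (host_id, category, reported_at) tuples.
    Bad-batch errors, server cancellations, and old errors are ignored.
    """
    cutoff = now - window
    counts = Counter(
        host_id
        for host_id, category, reported_at in errors
        if category == HOST_ERROR and reported_at >= cutoff
    )
    return counts.most_common(top)


# Made-up sample data, purely for illustration.
now = datetime(2015, 1, 2, 12, 0)
errors = [
    (1, HOST_ERROR, now - timedelta(hours=1)),
    (1, HOST_ERROR, now - timedelta(hours=2)),
    (2, BATCH_ERROR, now - timedelta(hours=1)),     # not the host's fault
    (3, HOST_ERROR, now - timedelta(days=3)),       # too old to count
    (4, SERVER_CANCEL, now - timedelta(hours=5)),   # not the host's fault
]
print(worst_hosts(errors, now))  # [(1, 2)]
```

Because the window slides with `now`, a host that has been fixed drops off the list on its own once its old errors age out.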
"Before you start listing the worst hosts, it would be a good idea to set up proper criteria for this."
At first, I thought to exclude from the statistics only the obviously bad hosts, which fail on every task (for example: host 255774, 180977) or only occasionally finish a task (their error rate is above, say, 90%; for example: host 179830, 74100). But it could be a more sophisticated statistical algorithm.
"Errors can be put into 2 categories: host errors and non-host errors (like a bad batch of WUs, or the server canceling the units), so make sure hosts are labeled with host errors only. OK, this is obvious, but I don't want to be labeled with a scarlet letter because of a bad batch."
I meant "most errors in the past 24 hours" by "most errors per day", so this list would be refreshed automatically and fixed hosts would be cleared. The purpose of filtering out the worst hosts is to avoid putting a scarlet letter on a batch because the worst hosts fail workunits from a more demanding batch (presumably because the host's GPU is overclocked above its maximum), which results in misleading percentages. A "scarlet letter" on a batch could be dangerous, as it could lead some crunchers to selectively cancel workunits from the (mistakenly labeled) worst batches, making the whole process worse.
"Also, errors that happened a while back (which are mostly bad batch errors) should not count either. I would think that cleaning these errors out of the database should be a prerequisite."
That's a good idea anyway. | |
ID: 41913 | Rating: 0 | |
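One way to picture the "normalized" error rate column discussed above is to compute each batch's error rate twice: once over all results, and once ignoring hosts whose overall error rate is above a threshold. The sketch below assumes the simple 90% cutoff mentioned in the post; the record layout and both function names are hypothetical, and a more sophisticated statistical filter could replace the cutoff.

```python
from collections import defaultdict


def host_error_rates(results):
    """Return {host_id: fraction of that host's finished tasks that errored}.

    `results` is a list of (host_id, batch, ok) tuples, with ok=False for an error.
    """
    totals = defaultdict(int)
    errors = defaultdict(int)
    for host_id, _batch, ok in results:
        totals[host_id] += 1
        if not ok:
            errors[host_id] += 1
    return {h: errors[h] / totals[h] for h in totals}


def batch_error_rates(results, host_threshold=0.9):
    """Return {batch: (raw_rate, normalized_rate)}.

    The normalized rate skips results from hosts whose overall error rate
    is above `host_threshold` (assumed cutoff; 90% as floated in the thread).
    It is None when every result for a batch came from an excluded host.
    """
    rates = host_error_rates(results)
    raw = defaultdict(lambda: [0, 0])    # batch -> [errors, total]
    norm = defaultdict(lambda: [0, 0])
    for host_id, batch, ok in results:
        failed = 0 if ok else 1
        raw[batch][0] += failed
        raw[batch][1] += 1
        if rates[host_id] <= host_threshold:
            norm[batch][0] += failed
            norm[batch][1] += 1

    def ratio(pair):
        errors, total = pair
        return errors / total if total else None

    return {batch: (ratio(raw[batch]), ratio(norm[batch])) for batch in raw}
```

For example, if one healthy host returns 9 valid results and 1 error for a batch while a broken host fails 10 tasks from the same batch, the raw rate is 11/20 = 55% but the normalized rate is 1/10 = 10%.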
I am not sure how the Server Status page calculates the error rate, but it looks to me like units in progress are counted as errors until they complete successfully. Is my observation correct? | |
ID: 42012 | Rating: 0 | |
"I am not sure how the Server Status page calculates the error rate, but it looks to me like units in progress are counted as errors until they complete successfully. Is my observation correct?"
I think the high error rate comes from the fact that workunits which run into an error immediately (or early) are returned much sooner than those which run to completion (which takes 5, 8, 12, 24 or even more hours), so the error rate normalizes only after this period. That's why I wanted to exclude from this calculation the hosts which fail every single workunit: they do not contribute to this project but actually spam it, making these statistics misleading. | |
ID: 42013 | Rating: 0 | |
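The early-return effect described above can be shown with a small worked example. It is only a sketch under simplifying assumptions: every task is sent at batch launch, failures come back after 6 minutes, successes after 8 hours, and the error rate is taken over returned tasks only. The numbers and the `observed_error_rate` function are made up for illustration.

```python
def observed_error_rate(tasks, elapsed_hours):
    """Error rate among tasks returned within `elapsed_hours` of the batch launch.

    `tasks` is an iterable of (return_after_hours, ok) pairs.
    Returns None if nothing has been returned yet.
    """
    returned = [ok for hours, ok in tasks if hours <= elapsed_hours]
    if not returned:
        return None
    return returned.count(False) / len(returned)


# Hypothetical batch: 10 tasks fail after 0.1 hours, 90 succeed after 8 hours.
batch = [(0.1, False)] * 10 + [(8.0, True)] * 90

for t in (1, 8, 24):
    print(t, observed_error_rate(batch, t))
# 1 1.0
# 8 0.1
# 24 0.1
```

In this toy case the batch looks 100% broken after one hour, although its true error rate is 10%; the figure only settles once the long-running tasks have had time to finish.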
You are right, Retvari. I have also observed this behavior. In the first days after a new batch is launched, the error rate is pretty high. Presumably some hosts receive WUs and either cancel them manually or abort them for many other reasons. I believe a user stops receiving WUs from a certain batch after x failures, and as the successful WUs complete, the error rate corrects itself. | |
ID: 42019 | Rating: 0 | |