Message boards : Graphics cards (GPUs) : One of my GPUs stopped crunching

Scalextrix[Gridcoin]
Message 37224 - Posted: 5 Jul 2014 | 10:06:04 UTC

Hello, I have a Win 7 x64 machine with a GTX 780 Ti (GPU 0) and a GTX 670 (GPU 1). I don't run 24/7. Yesterday both cards were crunching; this morning only GPU 0 is crunching after startup. I tried restarting, but it made no difference.

Event Log:
05/07/2014 10:23:30 | | CUDA: NVIDIA GPU 0: GeForce GTX 780 Ti (driver version 337.88, CUDA version 6.0, compute capability 3.5, 3072MB, 2926MB available, 6022 GFLOPS peak)
05/07/2014 10:23:30 | | CUDA: NVIDIA GPU 1: GeForce GTX 670 (driver version 337.88, CUDA version 6.0, compute capability 3.0, 2048MB, 1958MB available, 2845 GFLOPS peak)
05/07/2014 10:23:30 | | OpenCL: NVIDIA GPU 0: GeForce GTX 780 Ti (driver version 337.88, device version OpenCL 1.1 CUDA, 3072MB, 2926MB available, 6022 GFLOPS peak)
05/07/2014 10:23:30 | | OpenCL: NVIDIA GPU 1: GeForce GTX 670 (driver version 337.88, device version OpenCL 1.1 CUDA, 2048MB, 1958MB available, 2845 GFLOPS peak)
05/07/2014 10:23:30 | | OpenCL: Intel GPU 0: Intel(R) HD Graphics 4000 (driver version 10.18.10.3621, device version OpenCL 1.2, 1195MB, 1195MB available, 173 GFLOPS peak)
05/07/2014 10:23:30 | | OpenCL CPU: Intel(R) Core(TM) i7-3770K CPU @ 3.50GHz (OpenCL driver vendor: Intel(R) Corporation, driver version 3.0.1.10878, device version OpenCL 1.2 (Build 76413))
....
05/07/2014 10:23:30 | Poem@Home | Found app_config.xml
05/07/2014 10:23:30 | GPUGRID | Found app_config.xml
05/07/2014 10:23:30 | | Config: use all coprocessors

My GPUGRID app_config.xml has not been changed recently and reads:
<app_config>
  <app>
    <name>acemdlong</name>
    <max_concurrent>2</max_concurrent>
    <fraction_done_exact/>
    <gpu_versions>
      <gpu_usage>1</gpu_usage>
      <cpu_usage>1</cpu_usage>
    </gpu_versions>
  </app>
  <app>
    <name>acemdshort</name>
    <max_concurrent>2</max_concurrent>
    <fraction_done_exact/>
    <gpu_versions>
      <gpu_usage>1</gpu_usage>
      <cpu_usage>1</cpu_usage>
    </gpu_versions>
  </app>
</app_config>

Any idea why GPU 1 just stopped running GPUGRID work today?

Thanks.

mikey
Message 37226 - Posted: 5 Jul 2014 | 10:48:10 UTC - in response to Message 37224.

Hello, I have a Win 7 x64 machine with a GTX780Ti (GPU 0) and a GTX670 (GPU 1), i DONT RUN 24/7. Yesterday both cards were crunching, this morning only GPU 0 is crunching after startup, tried restarting, no difference.

Any idea why GPU 1 just stopped running GPUGRID work today?

Thanks.


There is a bug in BOINC that occurs when you have two Nvidia cards in one machine, but since it worked before, that is unlikely unless you changed BOINC versions. There is another bug where BOINC SOMETIMES feeds the cache of one GPU but not the cache of a second GPU in the same machine. The only way I know of to fix either is to put one GPU on one project and the other GPU on a different project. In your case a sub-project just might work; maybe you could put the faster GPU on the long units and the slower GPU on the short units?

The way to do that is through some exclude lines in a cc_config.xml file, something like this, which excludes the project POEM from GPU 1:
<exclude_gpu>
  <url>http://boinc.fzk.de/poem/</url>
  <device_num>1</device_num>
</exclude_gpu>

Now the actual details on how to exclude gpu zero from the short units, or gpu 1 from the long units, is beyond my knowledge, but I do believe it can be done. Maybe PM Jacob Klein as he is a wiz with those things.
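
For what it's worth, the per-sub-project split described above might look something like the sketch below. It is only an illustration under a few assumptions: that the optional <app> element of exclude_gpu accepts the same short names used in the GPUGRID app_config earlier in the thread (acemdlong, acemdshort), that the project URL matches the one your client shows for GPUGRID (http://www.gpugrid.net is a guess here), and that GPU 0 is the faster card. It has not been tested on this machine.

```xml
<!-- cc_config.xml: hypothetical sketch, not a tested configuration -->
<cc_config>
  <options>
    <!-- Keep both cards enabled -->
    <use_all_gpus>1</use_all_gpus>

    <!-- Keep long runs off the slower card (GPU 1) -->
    <exclude_gpu>
      <url>http://www.gpugrid.net</url>   <!-- assumed project URL -->
      <device_num>1</device_num>
      <app>acemdlong</app>
    </exclude_gpu>

    <!-- Keep short runs off the faster card (GPU 0) -->
    <exclude_gpu>
      <url>http://www.gpugrid.net</url>   <!-- assumed project URL -->
      <device_num>0</device_num>
      <app>acemdshort</app>
    </exclude_gpu>
  </options>
</cc_config>
```

The file would go in the BOINC data directory, and restarting the client is probably the safest way to make sure the exclusions are picked up.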

Scalextrix[Gridcoin]
Message 37227 - Posted: 5 Jul 2014 | 10:49:15 UTC

Err, I just found the problem, though I'm very confused. I added a POEM@HOME app_config file yesterday, as that project is having issues when multiple GPUs are used, so I added an exclude_gpu block. I excluded GPU 1 because GPU 0 produced the fewest errors on POEM. Here is my app_config for POEM:
<app_config>
  <app>
    <name>poemcl</name>
    <max_concurrent>0</max_concurrent>
  </app>
  <exclude_gpu>
    <url>http://boinc.fzk.de/poem/</url>
    <device_num>1</device_num>
    <type>NVIDIA</type>
    <app>poemcl</app>
  </exclude_gpu>
</app_config>

Switching the exclude_gpu device_num in the POEM file appears to affect whether 1 or 2 GPUs crunch for GPUGRID....! ???

Scalextrix[Gridcoin]
Message 37228 - Posted: 5 Jul 2014 | 10:53:10 UTC - in response to Message 37227.

I deleted the POEM app_config and I'm running 2 GPUGRID tasks again.

captainjack
Message 37229 - Posted: 5 Jul 2014 | 12:53:16 UTC

Hi Scalextrix,

A couple of suggestions on your app_config for POEM.

The first suggestion is to change <max_concurrent> from "0" to "1". That would say POEM could run a maximum of 1 task at a time.

If that doesn't get you going, try moving the <exclude_gpu> parameters to your cc_config.xml file. That is the way I have mine set up and it is working fine with two NVIDIA cards.

Hope that helps.

Scalextrix[Gridcoin]
Message 37230 - Posted: 5 Jul 2014 | 14:57:35 UTC - in response to Message 37229.

Yes, max_concurrent was set to 0 temporarily. I didn't get time to say earlier that when I created the app_config file and tried to get POEM to run only on GPU 0, BOINC didn't respect that and still ran tasks on GPU 1 as well.
So I added max_concurrent and tried setting it to 1. That didn't make a difference, so I finally set it to 0 in an attempt to see if POEM would stop crunching; it didn't.
Unfortunately POEM ran out of GPU tasks anyway, so I didn't get a chance to look into it further, and then today GPUGRID was impacted.

I recreated the POEM app_config from scratch some hours ago, and so far GPUGRID seems ok.

Scalextrix[Gridcoin]
Message 37231 - Posted: 5 Jul 2014 | 16:12:47 UTC - in response to Message 37230.

captainjack, can you post how your cc_config.xml is set up for POEM, please? I just got a GPU task for POEM and it's using GPU 1 even though my app_config is set to exclude that GPU.

I realise this is off-topic for the GPUGRID forum; many thanks.

captainjack
Message 37232 - Posted: 5 Jul 2014 | 16:53:08 UTC

Sorry, I am on the road and can't get to my pc. You should be able to go to the POEM message boards and find a cc_config from either skgiven or Jacob Klein that will work for you.

Jacob Klein
Message 37233 - Posted: 5 Jul 2014 | 23:02:18 UTC - in response to Message 37232.
Last modified: 5 Jul 2014 | 23:06:21 UTC

exclude_gpu is a cc_config.xml option, and should not be used within an app_config.xml file.
See: http://boinc.berkeley.edu/wiki/client_configuration

For your reference, here is my:
- cc_config.xml (including exclude_gpu references to block out POEM@Home on GPU devices 1 and 2)
- app_config.xml for GPUGrid (I have mine set up using 0.667 cpu currently, so when 2 GPUGrid units run, I reserve 1 core)
- app_config.xml for POEM@Home (I have mine set up using 0.333 gpu currently, so up-to-3 tasks-at-a-time can run on my device 0)

--------------------------------------
cc_config.xml
--------------------------------------

<cc_config>
  <log_flags>
    <!-- The 3 flags that are on by default are: file_xfer, sched_ops, task -->
    <file_xfer>1</file_xfer>
    <file_xfer_debug>0</file_xfer_debug>
    <sched_ops>1</sched_ops>
    <sched_op_debug>0</sched_op_debug>
    <task>1</task>
    <task_debug>0</task_debug>
    <unparsed_xml>1</unparsed_xml>
    <work_fetch_debug>1</work_fetch_debug>
    <rr_simulation>0</rr_simulation>
    <rrsim_detail>0</rrsim_detail>
    <cpu_sched>0</cpu_sched>
    <cpu_sched_debug>0</cpu_sched_debug>
    <cpu_sched_status>0</cpu_sched_status>
    <coproc_debug>0</coproc_debug>
    <mem_usage_debug>0</mem_usage_debug>
    <checkpoint_debug>0</checkpoint_debug>
    <http_debug>0</http_debug>
    <http_xfer_debug>0</http_xfer_debug>
    <network_status_debug>0</network_status_debug>
    <scrsave_debug>1</scrsave_debug>
    <notice_debug>0</notice_debug>
    <android_debug>0</android_debug>
    <app_msg_receive>0</app_msg_receive>
    <app_msg_send>0</app_msg_send>
    <async_file_debug>0</async_file_debug>
    <benchmark_debug>0</benchmark_debug>
    <dcf_debug>0</dcf_debug>
    <disk_usage_debug>0</disk_usage_debug>
    <priority_debug>0</priority_debug>
    <gui_rpc_debug>0</gui_rpc_debug>
    <heartbeat_debug>0</heartbeat_debug>
    <poll_debug>0</poll_debug>
    <proxy_debug>0</proxy_debug>
    <slot_debug>0</slot_debug>
    <state_debug>0</state_debug>
    <statefile_debug>0</statefile_debug>
    <suspend_debug>0</suspend_debug>
    <time_debug>0</time_debug>
    <trickle_debug>0</trickle_debug>
  </log_flags>
  <options>
    <!-- =================================================== TESTING OPTIONS =================================================== -->
    <!--
    <start_delay>20</start_delay>
    <ncpus>12</ncpus>
    <exclusive_app>NotepadTest01.exe</exclusive_app>
    <exclusive_gpu_app>NotepadTest02.exe</exclusive_gpu_app>
    -->

    <!-- =================================================== REGULAR OPTIONS =================================================== -->
    <report_results_immediately>0</report_results_immediately>
    <fetch_on_update>0</fetch_on_update>
    <max_event_log_lines>50000</max_event_log_lines>
    <max_file_xfers>10</max_file_xfers>
    <max_file_xfers_per_project>4</max_file_xfers_per_project>
    <exclusive_app>iRacingSim.exe</exclusive_app>
    <exclusive_app>iRacingSim64.exe</exclusive_app>
    <exclusive_app>Aces.exe</exclusive_app>
    <exclusive_app>TmForever.exe</exclusive_app>
    <exclusive_app>TmForeverLauncher.exe</exclusive_app>

    <!-- ===================================================== SETUP GPUS ====================================================== -->
    <use_all_gpus>1</use_all_gpus>

    <!-- =========================================== SETUP GPU 0: GeForce GTX 660 Ti [eVGA FTW] =========================================== -->
    <!-- <ignore_nvidia_dev>0</ignore_nvidia_dev> -->

    <!-- Exclude World Community Grid's "Help Conquer Cancer" GPU app (hcc1) on main display - makes graphics slow, even on 660 Ti -->
    <!-- Commenting out, for now, since this round of hcc1 is completed, and next round may not exhibit the issue. -->
    <!--
    <exclude_gpu>
      <url>http://www.worldcommunitygrid.org</url>
      <device_num>0</device_num>
      <app>hcc1</app>
    </exclude_gpu>
    -->

    <!-- Exclude several projects, since work from other GPU projects should give enough work to keep this GPU busy. -->
    <!-- Commenting out, because POEM is often out of work, and GPUGrid sometimes does run out. -->
    <!-- Commenting back in, because 7.4.0 work fetch erroneously fetches work for backup projects even when no GPUs are idle. -->
    <!-- Commenting out, since all 3 GPUs can now work on main projects. -->
    <!--
    <exclude_gpu>
      <url>http://einstein.phys.uwm.edu/</url>
      <device_num>0</device_num>
    </exclude_gpu>
    <exclude_gpu>
      <url>http://albert.phys.uwm.edu/</url>
      <device_num>0</device_num>
    </exclude_gpu>
    <exclude_gpu>
      <url>http://setiathome.berkeley.edu/</url>
      <device_num>0</device_num>
    </exclude_gpu>
    <exclude_gpu>
      <url>http://setiweb.ssl.berkeley.edu/beta/</url>
      <device_num>0</device_num>
    </exclude_gpu>
    <exclude_gpu>
      <url>http://milkyway.cs.rpi.edu/milkyway/</url>
      <device_num>0</device_num>
    </exclude_gpu>
    -->

    <!-- =========================================== SETUP GPU 1: GeForce GTX 460 =========================================== -->
    <!-- <ignore_nvidia_dev>1</ignore_nvidia_dev> -->

    <!-- Exclude POEM's "POEM++ OpenCL version" GPU app (poemcl) from a second heterogeneous GPU, since it does not work properly -->
    <!-- Also exclude POEM's Test Project, which has the same issue -->
    <!-- Note: Although 320.18 drivers successfully run smalltest_3, the drivers still do not work right with POEM. -->
    <!-- Note: Also, it appears that running POEM only on the GTX 460, does not work. So, it must run on the GTX 660 Ti! -->
    <!-- Note: Tested their new OpenCL application on 3/22/2014 - still does not start when running only on the GTX 460. So, it must run on the GTX 660 Ti! -->
    <!-- Commenting out, to more easily test how the issue affects my new arrangment of 3 GPUs -->
    <!-- 20140624 Commenting back in, as it's still bugged, even in the new 3-GPU system -->
    <exclude_gpu>
      <url>http://boinc.fzk.de/poem/</url>
      <device_num>1</device_num>
      <app>poemcl</app>
    </exclude_gpu>
    <exclude_gpu>
      <url>http://int-boinctest.int.kit.edu/poem/</url>
      <device_num>1</device_num>
      <app>poemcl</app>
    </exclude_gpu>

    <!-- Exclude World Community Grid's "Help Conquer Cancer" GPU app (hcc1) on main display - makes graphics slow, even on 660 Ti -->
    <!-- Commenting out, for now, since this round of hcc1 is completed, and next round may not exhibit the issue. -->
    <!--
    <exclude_gpu>
      <url>http://www.worldcommunitygrid.org</url>
      <device_num>1</device_num>
      <app>hcc1</app>
    </exclude_gpu>
    -->

    <!-- Reminder: For GPUGrid.net, if going to run 2-tasks-on-1-GPU, exclude this GPU (it only has 1 GB memory) -->
    <!-- Commenting out, decided to include this GPU and run 1 task per GPU. -->
    <!--
    <exclude_gpu>
      <url>http://www.gpugrid.net</url>
      <device_num>1</device_num>
    </exclude_gpu>
    -->

    <!-- Exclude several projects, since work from other GPU projects should give enough work to keep this GPU busy. -->
    <!-- Commenting out, because POEM is often out of work, and GPUGrid sometimes does run out. -->
    <!-- Commenting back in, because 7.4.0 work fetch erroneously fetches work for backup projects even when no GPUs are idle. -->
    <!-- Commenting out, since all 3 GPUs can now work on main projects. -->
    <!--
    <exclude_gpu>
      <url>http://einstein.phys.uwm.edu/</url>
      <device_num>1</device_num>
    </exclude_gpu>
    <exclude_gpu>
      <url>http://albert.phys.uwm.edu/</url>
      <device_num>1</device_num>
    </exclude_gpu>
    <exclude_gpu>
      <url>http://setiathome.berkeley.edu/</url>
      <device_num>1</device_num>
    </exclude_gpu>
    <exclude_gpu>
      <url>http://setiweb.ssl.berkeley.edu/beta/</url>
      <device_num>1</device_num>
    </exclude_gpu>
    <exclude_gpu>
      <url>http://milkyway.cs.rpi.edu/milkyway/</url>
      <device_num>1</device_num>
    </exclude_gpu>
    -->

    <!-- =========================================== SETUP GPU 2: GeForce GTX 660 Ti [MSI OC] =========================================== -->
    <!-- <ignore_nvidia_dev>2</ignore_nvidia_dev> -->

    <!-- Exclude POEM's "POEM++ OpenCL version" GPU app (poemcl) from a second heterogeneous GPU, since it does not work properly -->
    <!-- Also exclude POEM's Test Project, which has the same issue -->
    <!-- Note: Although 320.18 drivers successfully run smalltest_3, the drivers still do not work right with POEM. -->
    <!-- Note: Also, it appears that running POEM only on the GTX 460, does not work. So, it must run on the GTX 660 Ti! -->
    <!-- Note: Tested their new OpenCL application on 3/22/2014 - still does not start when running only on the GTX 460. So, it must run on the GTX 660 Ti! -->
    <!-- Commenting out, to more easily test how the issue affects my new arrangment of 3 GPUs -->
    <!-- 20140624 Commenting back in, as it's still bugged, even in the new 3-GPU system -->
    <exclude_gpu>
      <url>http://boinc.fzk.de/poem/</url>
      <device_num>2</device_num>
      <app>poemcl</app>
    </exclude_gpu>
    <exclude_gpu>
      <url>http://int-boinctest.int.kit.edu/poem/</url>
      <device_num>2</device_num>
      <app>poemcl</app>
    </exclude_gpu>

    <!-- Exclude several projects, since work from other GPU projects should give enough work to keep this GPU busy. -->
    <!-- Commenting out, because POEM is often out of work, and GPUGrid sometimes does run out. -->
    <!-- Commenting back in, because 7.4.0 work fetch erroneously fetches work for backup projects even when no GPUs are idle. -->
    <!-- Commenting out, since all 3 GPUs can now work on main projects. -->
    <!--
    <exclude_gpu>
      <url>http://einstein.phys.uwm.edu/</url>
      <device_num>2</device_num>
    </exclude_gpu>
    <exclude_gpu>
      <url>http://albert.phys.uwm.edu/</url>
      <device_num>2</device_num>
    </exclude_gpu>
    <exclude_gpu>
      <url>http://setiathome.berkeley.edu/</url>
      <device_num>2</device_num>
    </exclude_gpu>
    <exclude_gpu>
      <url>http://setiweb.ssl.berkeley.edu/beta/</url>
      <device_num>2</device_num>
    </exclude_gpu>
    <exclude_gpu>
      <url>http://milkyway.cs.rpi.edu/milkyway/</url>
      <device_num>2</device_num>
    </exclude_gpu>
    -->
  </options>
</cc_config>

--------------------------------------
app_config.xml for GPUGrid
--------------------------------------
<!-- GPUGrid.net -->
<!-- GPU tasks do properly use higher process and thread priorities, compared to CPU tasks. -->
<!-- GPU tasks sometimes use CPU sometimes don't, based on type of GPU task runs on. -->
<!-- Recommend 1 gpu_usage, if user also has CPU projects. -->
<!-- Recommend 0.001 cpu_usage, but might try 0.5, since if 2 are running, I KNOW the Kepler is using CPU -->
<!-- Also might try 1 cpu_usage, so as not to overcommit per Task Manager's CPU Utilization -->
<!-- Although x-at-a-time provides the best per-task-throughput, it ends up using a lot more CPU -->
<!-- Switching to 0.4995, such that if an 8-CPU MT job is running, 2 GPUGrid jobs and 1 0.001 GPU job can all run together -->
<!-- 0.5 cpu_usage so that 2+ GPU tasks will intentionally reserve at least 1 core -->
<!-- 1.0 cpu_usage because, when SETI tasks run on 3rd GPU reserving a core, they still aren't getting enough CPU -->
<!-- 0.5 cpu_usage because REC calculations and Process Explorer agree that CPU projects can get more cycles this way (3150cyc * 6inst, vs 2900cyc * 7inst)-->
<!-- 0.2 cpu_usage because 334.67 drivers make the Kepler no longer utilize a full core. Not sure if bug or not. -->
<!-- 0.4 cpu_usage to reflect what I actually see, and to reserve a core when 0.4 + 0.4 + 0.3 > 1.0 -->
<!-- 0.5 cpu_usage so, when a 1-CPU task is running on GTS 240, 2 GPUGrid tasks will still reserve core, keeping CPU always slightly undercommitted -->
<!-- 1.0 cpu_usage, in attempt to bolster and increase throughput for GPUGrid tasks, and keep GPU clocked at maximum boost -->
<!-- 0.5 cpu_usage, to better load CPU -->
<!-- 1.0 cpu_usage, for better throughput -->
<!-- 0.5 cpu_usage, to better load CPU -->
<!-- 1.0 cpu_usage, for better throughput -->
<!-- 0.2 cpu_usage, to match what I actually see getting used -->
<!-- 1.0 cpu_usage, for better throughput -->
<!-- 0.4 cpu_usage, now that 3 GPUs are running this project, as a better compromise between throughput and CPU load -->
<!-- 1.0 cpu_usage, for better throughput -->
<!-- 0.666 cpu_usage, to reserve at least 1 CPU when 2+ tasks are running -->
<!-- 0.666 gpu_usage, to not allow 2-GPUGrid-on-1-GPU, but to allow GPUGrid+Poem on 1 GPU -->
<!-- 1 gpu_usage, because otherwise work fetch fetches too much, making my low-cache settings irrelevant -->
<!-- 0.667 cpu_usage, so that when 3 tasks are running, 2 CPUs are reserved, as a better compromise -->
<app_config>
  <!-- Short runs (2-3 hours on fastest card) -->
  <app>
    <name>acemdshort</name>
    <max_concurrent>0</max_concurrent>
    <gpu_versions>
      <gpu_usage>1</gpu_usage>
      <cpu_usage>0.667</cpu_usage>
    </gpu_versions>
  </app>
  <!-- Long runs (8-12 hours on fastest card) -->
  <app>
    <name>acemdlong</name>
    <max_concurrent>0</max_concurrent>
    <gpu_versions>
      <gpu_usage>1</gpu_usage>
      <cpu_usage>0.667</cpu_usage>
    </gpu_versions>
  </app>
  <!-- ACEMD beta version -->
  <app>
    <name>acemdbeta</name>
    <max_concurrent>0</max_concurrent>
    <gpu_versions>
      <gpu_usage>1</gpu_usage>
      <cpu_usage>0.667</cpu_usage>
    </gpu_versions>
  </app>
</app_config>

--------------------------------------
app_config.xml for POEM@Home
--------------------------------------
<!-- POEM@Home -->
<app_config>
  <!--
  POEM++ OpenCL version
  My research indicates that increasing to 5-at-a-time provides the best per-task-throughput.
  But, since each task utilizes a full core during its increased length, x-at-a-time ends up using a lot more CPU.
  So... I thought about doing less at a time. I'm now using 3-at-a-time as a happy middle-ground.
  3/22/2014: With new OpenCL release, the GPU Usage is almost high enough to do only 2-at-a-time. Still recommend 3-at-a-time.
  Actually, 3 now causes a stutter in the display. But even 1 does. So, staying at 3-at-a-time.
  Additional testing shows that, when only running POEM OpenCL GPU, 1 task does not stutter, but 2 does. Switching to 1-at-a-time.
  6/17/2014: GPU Usage indicates that 2-at-a-time will appropriately saturate the GPU with no other tasks, but need 3-at-a-time to saturate with other tasks.
  -->
  <app>
    <name>poemcl</name>
    <max_concurrent>0</max_concurrent>
    <gpu_versions>
      <gpu_usage>0.333</gpu_usage>
      <cpu_usage>1</cpu_usage>
    </gpu_versions>
  </app>
</app_config>

Scalextrix[Gridcoin]
Message 37236 - Posted: 6 Jul 2014 | 7:15:14 UTC - in response to Message 37233.

Thanks Jacob, I had not seen that exclude_gpu must be in cc_config, not app_config. I'll try that today and I'm certain that will resolve my issues. Thanks also to captainjack.

'A little knowledge is a dangerous thing.'
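
For anyone landing on this thread with the same setup, a minimal sketch of the corrected arrangement might look like the following: the exclude_gpu block from the POEM app_config posted earlier, moved unchanged into cc_config.xml. This is an assumption-laden illustration rather than a tested file. It assumes the POEM URL and the poemcl app name from the earlier post are what the client reports, and that GPU 1 is still the card to exclude.

```xml
<!-- cc_config.xml (BOINC data directory): hypothetical minimal sketch -->
<cc_config>
  <options>
    <!-- Matches the "Config: use all coprocessors" line in the event log -->
    <use_all_gpus>1</use_all_gpus>

    <!-- Moved here from the POEM app_config.xml, where it does not belong -->
    <exclude_gpu>
      <url>http://boinc.fzk.de/poem/</url>
      <device_num>1</device_num>
      <type>NVIDIA</type>
      <app>poemcl</app>
    </exclude_gpu>
  </options>
</cc_config>
```

The POEM app_config.xml would then keep only its <app> section (name, max_concurrent and, if wanted, gpu_versions). Restarting the BOINC client after the change is probably the safest way to have the exclusion take effect.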
