
Crash "DX12 device removed"



Posted (edited)

Hello everyone,

As a follow-up to one of my support tickets: we still have very problematic DX12 "device removed" crashes since our migration to Unigine 2.19.1.2.

Some of our computers (RTX 2000 Ada Laptop) have never crashed. Some of our applications (GeForce RTX 5070) crash every 10 minutes, on the same world.

Some (RTX A5000) crash once per week. Others (GeForce RTX 5060 Ti) don't crash at all.

We monitored VRAM usage and we are not running out. I tried Unigine's default "low" settings and it still crashed. Disabling DLSS didn't stop the problem either.

Do you have any advice on a Unigine configuration that would ensure 100% stability, even at the cost of performance? Should I disable CPU multi-threading? Use the lowest texture resolution?

 

Our crashes can happen while the world is loading, or after 10 minutes of running. They can also happen when we click on another window while the application is running.

The same world configurations/applications worked fine in 2.18 without crashes... on the same setups...

By the way, we updated to the latest NVIDIA drivers, and we are on Windows 11.

Any help, advice or hint would be appreciated.

 

Edited by K.Wagrez
Posted

Okay, an update after some more tests.

I don't know if this is the same crash we have, but while stress-testing our platforms to obtain a reproducible scenario, I found a way to crash our GPU nearly 100% of the time. If I set render_supersampling to 2.0 while running and immediately switch back to 1.0, the GPU just folds, and Unigine produces a DX12 device removed error.

We set TdrDelay so that Windows waits at least 60 seconds before throwing in the towel, but to no avail.
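For readers who want to reproduce the TdrDelay change mentioned above, here is a minimal sketch that generates a .reg file for it. It assumes the standard Windows TDR settings: a REG_DWORD named TdrDelay, expressed in seconds, under the GraphicsDrivers key. The function name `tdr_reg_file` is made up for this illustration. The file still has to be imported with administrator rights, and the change normally only takes effect after a reboot.

```python
def tdr_reg_file(delay_seconds: int) -> str:
    """Return the text of a .reg file that sets TdrDelay (REG_DWORD, in seconds)."""
    if not 0 < delay_seconds <= 0xFFFFFFFF:
        raise ValueError("delay_seconds must be a positive value that fits in a DWORD")
    # Standard location of the Windows TDR (Timeout Detection and Recovery) settings.
    key = r"HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\GraphicsDrivers"
    lines = [
        "Windows Registry Editor Version 5.00",
        "",
        f"[{key}]",
        # .reg files expect DWORD values as 8 hexadecimal digits.
        f'"TdrDelay"=dword:{delay_seconds:08x}',
        "",
    ]
    return "\r\n".join(lines)


if __name__ == "__main__":
    # 60 seconds, as in the post above; redirect the output to a .reg file.
    print(tdr_reg_file(60))
```

A longer delay only helps when the GPU is genuinely busy with a long-running workload; it would not be expected to hide a device removal caused by something else, which seems consistent with the result reported here.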

I'm thinking about two strategies if we don't find a way out:

  • Switching to Vulkan, which costs us some performance improvements and DLAA (extremely bad news). It worked on the rig of one of our engineers, which often crashed with a DX12 device removed error in the editor. It doesn't crash anymore.
  • Migrating back to Unigine 2.18 (no DLAA for our platforms, still bad news).

Does Unigine have any way of migrating assets back to an earlier version of the engine? I can revert the API changes and the fixes I made for the migration, but redoing the assets would be too painful.

There are still some tests to run: we will try the video_debug options, and at some point run the microprofiler on our target platforms.

Trying to migrate to 2.20 is out of the question. There are identified regressions that would take time to work through, and besides, we have no assurance that this won't keep happening. If necessary we can try to downgrade the 3D assets and reduce the world's weight, but I would rather understand why it's happening first.

And in case you're wondering, our crashes can happen even when the GPU and the VRAM are barely used. With supersampling pushed to 2.0, we could abuse our engine without a crash (swapping focus, letting it run) while reaching 100% GPU usage and 11.2 GB of 12 GB VRAM in use. It only crashed when we asked it to reduce the supersampling.

Posted

Hi Kevin,

Unfortunately, there is no simple way to debug the device removed crashes.

Without a stable reproduction on our side there is very little we can suggest, I'm afraid. Once we have some content for reproduction and instructions on how to trigger the crashes, I think we can provide a fix within several weeks.

Setting supersampling to 2 and then immediately switching it back to 1 may cause some additional instability, but there is no guarantee that it's the same issue you are having on a regular setup.

Quote

Switching to Vulkan, which costs us some performance improvements and DLAA (extremely bad news). It worked on the rig of one of our engineers, which often crashed with a DX12 device removed error in the editor. It doesn't crash anymore.

You can also consider switching to DX11 for better performance or disabling the multithreaded rendering in DX12 (render_multithreaded 0). You can stick with the API that provides better overall performance.

Quote

Does Unigine have any way of migrating assets back to an earlier version of the engine? I can revert the API changes and the fixes I made for the migration, but redoing the assets would be too painful.

No, no such backward migration is possible for content. You can copy it as-is and hope that it will render as expected.

 

Quote

With supersampling pushed to 2.0, we could abuse our engine without a crash (swapping focus, letting it run) while reaching 100% GPU usage and 11.2 GB of 12 GB VRAM in use. It only crashed when we asked it to reduce the supersampling.

I just tried to reproduce this in several built-in projects: no crashes on my PC (RTX 3060) when rapidly switching between different supersampling modes. But I always have some memory free:

[Attached screenshot: image.png]

Once you exhaust the committed memory (or your GPU driver starts allocating memory in shared rather than dedicated VRAM), you may observe any kind of crash, including the device removed error.

You can also try to catch some specific GPU errors using the attached NVIDIA Aftermath library. Just copy GFSDK_Aftermath_Lib.x64.dll to the bin directory of your project and run your app with these additional parameters:
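To check how close an application gets to the dedicated VRAM limit, one option is to poll nvidia-smi while it runs. The sketch below is only an illustration: the helper names (`parse_smi`, `near_limit`, `sample`) and the 90% threshold are arbitrary choices made for this example. It assumes nvidia-smi is on the PATH and reports memory in MiB. As far as I know, nvidia-smi only reports dedicated memory, so spill-over into shared GPU memory has to be checked elsewhere, for example in the Windows Task Manager.

```python
import subprocess

# Query GPU name plus used/total dedicated memory as plain CSV.
QUERY = [
    "nvidia-smi",
    "--query-gpu=name,memory.used,memory.total",
    "--format=csv,noheader,nounits",
]


def parse_smi(output: str) -> list[dict]:
    """Parse the CSV lines printed by the QUERY command (one line per GPU)."""
    rows = []
    for line in output.strip().splitlines():
        # Split from the right so a comma inside the GPU name cannot break parsing.
        name, used, total = (part.strip() for part in line.rsplit(",", 2))
        rows.append({"name": name, "used_mib": int(used), "total_mib": int(total)})
    return rows


def near_limit(row: dict, threshold: float = 0.9) -> bool:
    """True when dedicated VRAM usage is at or above the given fraction of the total."""
    return row["used_mib"] >= threshold * row["total_mib"]


def sample() -> list[dict]:
    """Run nvidia-smi once and return the parsed rows."""
    result = subprocess.run(QUERY, capture_output=True, text=True, check=True)
    return parse_smi(result.stdout)


if __name__ == "__main__":
    for gpu in sample():
        flag = "NEAR LIMIT" if near_limit(gpu) else "ok"
        print(f'{gpu["name"]}: {gpu["used_mib"]}/{gpu["total_mib"]} MiB ({flag})')
```

Running it in a loop while toggling supersampling would show whether the crash coincides with dedicated memory filling up.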

  • -video_debug_shader 1 -video_debug_crash_dump 1

If Aftermath is able to detect the crash (unfortunately, it cannot always do so), it will also create a GPU crash dump in the <Project>/bin/gpu_crash_dumps directory. You can collect some GPU dumps and send them to us for further investigation. However, please note that even with these dumps, it is not guaranteed that we will be able to easily identify the root cause or provide a fix.
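The steps above can be wrapped in a small launcher script. This is only a sketch: the two command-line flags, the DLL name and the gpu_crash_dumps directory come from this post, while the helper names and the executable name my_app.exe are hypothetical placeholders.

```python
import subprocess
from pathlib import Path

# Debug flags quoted in the post above.
DEBUG_ARGS = ["-video_debug_shader", "1", "-video_debug_crash_dump", "1"]
AFTERMATH_DLL = "GFSDK_Aftermath_Lib.x64.dll"


def build_command(exe: Path, extra_args=()) -> list[str]:
    """Command line for the application with the GPU crash dump flags added."""
    return [str(exe), *DEBUG_ARGS, *extra_args]


def aftermath_ready(bin_dir: Path) -> bool:
    """True if the Aftermath DLL has been copied into the project's bin directory."""
    return (bin_dir / AFTERMATH_DLL).is_file()


def list_dumps(bin_dir: Path) -> list[Path]:
    """Files found in <bin_dir>/gpu_crash_dumps, oldest first."""
    dump_dir = bin_dir / "gpu_crash_dumps"
    if not dump_dir.is_dir():
        return []
    files = [p for p in dump_dir.iterdir() if p.is_file()]
    return sorted(files, key=lambda p: p.stat().st_mtime)


if __name__ == "__main__":
    bin_dir = Path("bin")           # assumption: script is run from the project root
    exe = bin_dir / "my_app.exe"    # hypothetical executable name
    if not aftermath_ready(bin_dir):
        raise SystemExit(f"{AFTERMATH_DLL} not found in {bin_dir}")
    before = set(list_dumps(bin_dir))
    subprocess.run(build_command(exe))  # returns when the app exits or crashes
    for dump in list_dumps(bin_dir):
        if dump not in before:
            print("new GPU crash dump:", dump)
```

Comparing the dump directory before and after the run makes it obvious whether Aftermath caught the crash at all, which, as noted above, is not guaranteed.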

Thank you!

GFSDK_Aftermath_Lib.x64.dll


How to submit a good bug report
---
FTP server for test scenes and user uploads:

Posted

In the latest two releases, we have made significant efforts to improve the stability of the DX12 renderer and have resolved many potential and known “Device Removed” errors. Therefore, we would strongly encourage you to consider migrating your project to the most recent SDK version.

While there are still some differences between versions 2.19.x and 2.21.x, the migration process should be smoother than your initial upgrade to version 2.18.x. We believe that this approach will provide a much greater improvement in stability compared to continuing attempts to address these issues in version 2.19.x. We would also be happy to assist you throughout the migration process. We will do our best to help resolve any difficulties you may encounter and make the transition as smooth and efficient as possible.


Posted

We will keep on investigating. A few things:
 

  • I didn't want to migrate to 2.20 and 2.21 because of API modifications (like procedural meshes in 2.20 or skinned meshes in 2.21) that could cost us weeks or months of work, with possible regressions. I'm very conservative about the upgrade process.

I agree with you that nothing indicates that the crash I trigger myself by playing with the supersampling is related to the DX12 crashes we see in normal use. However, I'm curious to see whether this "supersampling" crash can be reproduced on sample projects and on other machines. I'm curious to see what makes the application fail. It might be due to some aspect of our decades-old app or the size of our world. I will keep you posted on every test worth mentioning.

By the way, our crashes happen even when our committed memory is relatively low.

Posted

@silent I know it's not normal use, but I managed to crash the oil platform sample on 2.19.1 very easily by raising the supersampling to 3, then lowering it to 2.9 immediately after, on my RTX 2000 Ada laptop, Full HD, on battery. Strangely, it's when I lower the supersampling that it crashes. I suspect that on your RTX 3060 you might need higher settings to break it.

It might be completely irrelevant to my problems, but it's good to know. And I would actually be interested to know why it crashes in that case, just out of curiosity.

Posted

Hi Kevin,

Have you been able to capture any GPU crash dumps using the NVIDIA Aftermath library? Also, could you please confirm whether the supersampling 3 → supersampling 1 workaround functions correctly on version 2.18.x without causing any crashes?

We have observed that if shared VRAM starts being used after enabling supersampling 3, the likelihood of crashes may increase. In such cases, some critical render buffers might be moved from shared memory back to dedicated memory. At the moment, this is one of the possible explanations we are considering.

We will also attempt to reproduce these crashes on our side. However, please note that we currently do not have access to laptops equipped with Quadro GPUs, so our testing environment may not fully match your setup.

Regarding the API changes:

  • Version 2.21 does not require any major migration effort for skinned meshes. In code, the only required change is to replace ObjectMeshSkinned with ObjectMeshSkinnedLegacy. Mesh and animation files will be migrated automatically using the content migration scripts. You can also combine both approaches within your application: keep all legacy animation functionality where needed while developing new features using the new animation graph system.

  • The functionality of procedural meshes in 2.20 was also designed with existing use cases in mind. We have included a sample project that demonstrates how to work with all of the new modes. If you are experiencing any specific difficulties with the new approach, we would be happy to help and do our best to provide a suitable solution.

Thank you!


Posted

Still testing.

I used -video_debug_shader 1 and -video_debug_crash_dump 1, along with the Aftermath DLL you provided, but it does not intercept the crash or generate any crash dumps when we manage to make our application crash...

We managed to make it crash by minimizing/maximizing our window once, but we could not repeat this crash afterwards.

I wanted to test -video_debug, but it requires the DirectX Graphics Tools.

Going to do that next.
