NAV/SQL is completely „stalled“ because of too many CPU Threads

Well, I was confronted with the following issue about a dozen times just in the past two months, so I decided that’s probably worth a BLOG article (hey, it’s “Santa Claus” today :) )!

Symptom:

NAV and SQL Server stop responding, the system completely hangs, you cannot even access SQL Management Studio anymore. IF you can access SSMS you’ll see lots of processes/connections waiting on ASYNC_NETWORK_IO, seemingly blocking themselves.
With some luck, the system recovers within several minutes; if not, you need to reboot the Dynamics NAV services.

This only happens on NAV 2015 or 2016, running on builds released after April 2016 and before November 2016. So far I‘ve seen that, the problem only(?) happens on VMWare running the network-adapter “vmxnet3”, but I can imagine other adapters could have the same issue.

Cause:

There are more CPU Threads running than Worker Threads are available.

Now you might ask: Huh???

Explanations:

OK, this is going to be a longer shot; starting from the beginning …

The CPU is the component of a NAV/SQL system which is actually doing all the work. In terms of SQL Server, it’s the thingy processing the queries, the “questions”, so to speak, asked by NAV. Hence, the number of required CPU depends on the number of questions asked which depends on the number of Users, asking those questions. The more Users, the more CPU. A common and pretty good Best Practice for a SQL Server running NAV is:

1 logical CPU (Core/Thread) per 25 Users/Sessions

For example, you run NAV with 100 Users you’ll need 4 CPU (as it is with Best Practices, that’s just an educated guess, which may need to be adjusted to the individual requirements).

(BTW: the same formula applies to the CPU sizing of a NST; details not discussed here)

Now there’s something called “Expensive Queries” (XQ); that’s queries which put pressure on RAM and CPU, taking long time, giving bad User Experience. If such an XQ is processed, SQL Server might decide to invoke multiple CPU to process the query with parallel CPU Threads.

The SQL Instance setting “Max. Degree Of Parallelism” (MAXDOP) defines, how many CPU could be invoked. The default value 0 (zero) means “all” CPU. Thus, this means, that just one XQ could invoke “all” CPU of that server, which has the risk, that this single query could become a “global killer” by occupying too many CPU so other tasks could not be processed anymore. That’s why MAXDOP always (on any SQL Server) must be limited! Again, there are Best Practices, but we have to distinguish several things:

Scenario MAXDOP
Rest-Of-The-SQL-World-(non-NAV) Half number of CPU
NAV 2009 R2 and lower 1
NAV 2013 and higher 2

To make a long story short: NAV queries are not really capable to run on parallel threads.
If the setting is OK could be seen from the “Wait Statistics”, e.g. the CXPACKET waits.

Then there’s another setting, affecting the CPU: “Max. Worker Threads” (WT). Imagine this as some kind of “entry channel” to the CPU. The default setting is 0 (zero) which means, SQL Server is running on a built-in matrix, defining the number of WT created:

Number of CPU 32bit Computer 64bit Computer
<= 4 256 512
8 288 576
16 352 704
32 480 960

Usually this is absolutely OK and there’s no need to change anything.

For example:

We have 100 Users/Session, thus 4 CPU, then having 512 WT. Even assuming MAXDOP is still at default 0 (zero), assuming that all 100 Users at the very same time are executing such XQ, so even if all these 100 XQ are invoking “parallelism” with 4 threads each, this would generate a total of 400 CPU Threads, which is still below the “Max. Worker Threads” setting of 512. Even this would not completely kill the CPU. Hence, in real life hardly ever a problem will arise from this.

But when our problem occurs – everything’s stalled – you will see, that way more CPU Threads are running! Actually, there are more Threads running than WT are configured!
If that happens, thus all “entry channels” to the CPU are occupied, the CPU is overwhelmed, does not accept any further processes/task and gets totally unresponsive. The system is completely stalled, you cannot even connect SSMS or stuff.

You can check the configured WT and the Current Threads with this query:

SELECT max_workers_count as [Configured_Worker_Threads] FROM sys.dm_os_sys_info
SELECT SUM(current_workers_count) as [Running_Worker_Threads] FROM sys.dm_os_schedulers

So, how come? Assuming the SQL Server is properly sized and configured, how can there be more Threads than Workers defined??? Shouldn’t SQL Server limit the number of Threads?

Well, reason is, that also within the network-layer multiple CPU threads can be invoked (remember the good old OSI-model!?). If large result-sets are communicated over the network, the net-adapter could also invoke multiple CPU to handle the traffic quicker – the feature is called “Large Receive Offload” (LRO).
So it looks like – and the exact reasons are not fully clear to me, as this happens really deep in the communication layer – that a certain combination of NAV build (and the used “Multiple Active Result-Set” (MARS) technology) and network adapter is generating this huge number of threads (MARS ==> ASYN_NETWORK_IO), if large result-sets are transmitted between NST and SQL. As a result, all these threads totally overwhelm the CPU and the whole system goes “south” …

So, what can we do about this?

Workaround:

Completely suppress parallelism of the SQL Server instance:

Max. Degree Of Parallelism = 1
Cost Threshold For Parallelism = 60

Increase the number of configured Worker Threads:

Max. Worker Threads = 2048

EXEC sys.sp_configure N'max degree of parallelism', N'1'
GO
EXEC sys.sp_configure N'cost threshold for parallelism', N'60'
GO
EXEC sys.sp_configure N'max worker threads', N'2048'
GO
RECONFIGURE WITH OVERRIDE
GO

Plan B:
Change your network adapter; e.g. in VMWare this only seem to happen with “vmxnet3”; using the older “E1000” should not cause that problem.

This is dealing with the symptoms, not the cause. So additionally you need to identify the “Expensive Queries” which are triggering this mess! Hence, query- and index-tuning is required; maybe programming changes.

Resolution (supposedly):

Well, as this problem only affects certain NAV builds, we could already guess this whole mess is yet another bug in the NST. Well, it is … uh … was: it should be fixed with these Cumulative Updates:

CU25 for NAV 2015 (Build 47254) of 07.11.2016
https://blogs.msdn.microsoft.com/nav/2016/11/07/cumulative-update-25-for-microsoft-dynamics-nav-2015-has-been-released/

CU13 for NAV 2016 (Build 47256) of 07.11.2016
https://blogs.msdn.microsoft.com/nav/2016/11/07/cumulative-update-13-for-microsoft-dynamics-nav-2016-has-been-released/

Hope this could help you!

No warranties, no guaranties, no support.

Cheers & Merry Christmas!

Author: Jörg Stryk

I'm working with MS Dynamics NAV (formerly Navision) since 1997, with SQL Server since 2003. I have a 100% focus on "Performance Optimization & Troubleshooting" (and things in touch with this). Since 2006 I'm supporting Dynamics customers and partners within their NAV projects as a freelancer; so far I have supported more than 500 projects in 25 countries on 5 continents - and counting! I'm developing & distributing the optimization utility the "SSI Performance Toolbox" and I am the author of "The NAV/SQL Performance Field Guide" (ISBN 978-3-8370-1442-6). Microsoft "Most Valuable Professional" (MVP) 2007-2017.

3 thoughts on “NAV/SQL is completely „stalled“ because of too many CPU Threads”

  1. Thanks for this information, can you comment on NAV 2013 and 2013 R2 with respect to the above

  2. So far I haven’t seen this happen with NAV 2013/R2 …

    Recently I got some e-mail, someone telling me this problem happens on Hyper-V systems (NAV 2015/2016), too.

  3. Thanks for this, I’m having poor performance on a NAV2016 azzure implementation with only 10 users – 4gb ram and cpu usage is maxing out.

Comments are closed.