NAV/SQL is completely “stalled” because of too many CPU Threads

Well, I ran into the following issue about a dozen times in the past two months alone, so I decided it’s probably worth a BLOG article (hey, it’s “Santa Claus” today :) )!


NAV and SQL Server stop responding and the system hangs completely; you cannot even connect with SQL Server Management Studio (SSMS) anymore. If you can still get into SSMS, you’ll see lots of processes/connections waiting on ASYNC_NETWORK_IO, seemingly blocking themselves.
With some luck, the system recovers within several minutes; if not, you need to restart the Dynamics NAV services.
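
If you still get a connection, a query along the following lines lists the requests currently stuck on that wait type. This is just a minimal sketch based on the standard DMVs sys.dm_exec_requests and sys.dm_exec_sessions; the column selection and the filter on user processes are my own choice, not part of any NAV tooling.

```sql
-- Sketch: show user requests currently waiting on ASYNC_NETWORK_IO
SELECT r.session_id,
       r.status,
       r.wait_type,
       r.wait_time AS wait_time_ms,      -- duration of the current wait in milliseconds
       r.blocking_session_id,
       s.program_name,
       s.host_name
FROM sys.dm_exec_requests AS r
JOIN sys.dm_exec_sessions AS s
  ON s.session_id = r.session_id
WHERE s.is_user_process = 1              -- ignore system sessions
  AND r.wait_type = N'ASYNC_NETWORK_IO'
ORDER BY r.wait_time DESC;
```

A long list with high wait times, all at the same time, would match the picture described above.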

This only happens on NAV 2015 or 2016, running on builds released after April 2016 and before November 2016. So far I’ve seen the problem only(?) on VMware with the “vmxnet3” network adapter, but I can imagine other adapters could have the same issue.


There are more CPU Threads running than there are Worker Threads available.

Now you might ask: Huh???


OK, this is going to be a longer story, so let’s start from the beginning …

The CPU is the component of a NAV/SQL system which actually does all the work. In terms of SQL Server, it’s the thingy processing the queries, the “questions”, so to speak, asked by NAV. Hence, the number of CPUs required depends on the number of questions asked, which in turn depends on the number of Users asking those questions. The more Users, the more CPUs. A common and pretty good Best Practice for a SQL Server running NAV is:

1 logical CPU (Core/Thread) per 25 Users/Sessions

For example, if you run NAV with 100 Users you’ll need 100 / 25 = 4 CPUs (as always with Best Practices, that’s just an educated guess which may need to be adjusted to the individual requirements).
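
To get a rough feeling for where a running server stands against this rule of thumb, something like the following could be used. It is only a sketch: the variable @users_per_cpu is my own name for the “25” from the rule above, and the number of SQL user sessions is just a proxy for the number of NAV Users (with a service tier in between, sessions and Users do not necessarily map one-to-one).

```sql
-- Sketch: compare current user sessions with the "1 CPU per 25 Users" rule of thumb
DECLARE @users_per_cpu int = 25;   -- assumption: the Best Practice value from the text

SELECT si.cpu_count                                  AS logical_cpus,
       COUNT(*)                                      AS user_sessions,
       CEILING(COUNT(*) * 1.0 / @users_per_cpu)      AS suggested_cpus
FROM sys.dm_exec_sessions AS s
CROSS JOIN sys.dm_os_sys_info AS si
WHERE s.is_user_process = 1                          -- ignore system sessions
GROUP BY si.cpu_count;
```

Run it during peak hours; a snapshot taken at night does not tell you much.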

(BTW: the same formula applies to the CPU sizing of a NAV Service Tier (NST); details not discussed here)

Now there’s something called “Expensive Queries” (XQ): queries which put pressure on RAM and CPU, take a long time and give a bad User Experience. When such an XQ is processed, SQL Server might decide to involve multiple CPUs and process the query with parallel CPU Threads.

The SQL instance setting “Max. Degree Of Parallelism” (MAXDOP) defines how many CPUs can be involved. The default value 0 (zero) means “all” CPUs. So a single XQ could grab “all” CPUs of that server and become a “global killer”, occupying so many CPUs that other tasks cannot be processed anymore. That’s why MAXDOP must always be limited (on any SQL Server)! Again, there are Best Practices, but we have to distinguish several scenarios:

Scenario                           MAXDOP
Rest of the SQL world (non-NAV)    Half the number of CPUs
NAV 2009 R2 and lower              1
NAV 2013 and higher                2

To make a long story short: NAV queries are not really capable of running on parallel threads.
Whether the setting is OK can be seen from the “Wait Statistics”, e.g. the CXPACKET waits: a high share of CXPACKET waits indicates that too much parallelism is going on.
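
A quick look at the CXPACKET waits could be done like this. It is a minimal sketch on top of sys.dm_os_wait_stats; the percentage is calculated against all wait types, including harmless background waits, so take it as a rough indicator only (a proper Wait Statistics script filters those out).

```sql
-- Sketch: share of CXPACKET in the accumulated wait statistics since the last restart/reset
SELECT ws.wait_type,
       ws.waiting_tasks_count,
       ws.wait_time_ms,
       CAST(100.0 * ws.wait_time_ms
            / NULLIF(t.total_wait_time_ms, 0) AS decimal(5, 2)) AS pct_of_all_waits
FROM sys.dm_os_wait_stats AS ws
CROSS JOIN (SELECT SUM(wait_time_ms) AS total_wait_time_ms
            FROM sys.dm_os_wait_stats) AS t
WHERE ws.wait_type = N'CXPACKET';
```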

Then there’s another setting affecting the CPU: “Max. Worker Threads” (WT). Imagine this as some kind of “entry channel” to the CPU. The default setting is 0 (zero), which means SQL Server uses a built-in matrix to define the number of WT created:

Number of CPUs    32-bit computer    64-bit computer
<= 4              256                512
8                 288                576
16                352                704
32                480                960

Usually this is absolutely OK and there’s no need to change anything.
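
The rows of that matrix follow a simple pattern: 512 plus 16 per CPU beyond the fourth on 64-bit (256 plus 8 on 32-bit). The sketch below applies that pattern to the CPU count of the current server and shows it next to the value SQL Server actually uses. Mind that the formula is only derived from the rows listed above; for CPU counts outside that range it is an assumption, and very large machines may be treated differently.

```sql
-- Sketch: default Worker Threads derived from the matrix above vs. the actual value
SELECT si.cpu_count,
       CASE WHEN si.cpu_count <= 4 THEN 512
            ELSE 512 + (si.cpu_count - 4) * 16
       END                  AS default_wt_64bit,   -- 8 CPUs: 576, 16: 704, 32: 960
       CASE WHEN si.cpu_count <= 4 THEN 256
            ELSE 256 + (si.cpu_count - 4) * 8
       END                  AS default_wt_32bit,   -- 8 CPUs: 288, 16: 352, 32: 480
       si.max_workers_count AS actual_max_workers
FROM sys.dm_os_sys_info AS si;
```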

For example:

We have 100 Users/Sessions, thus 4 CPUs, and therefore 512 WT. Even assuming MAXDOP is still at its default of 0 (zero), and assuming that all 100 Users execute such an XQ at the very same time, each invoking “parallelism” with 4 threads, this would generate a total of 100 × 4 = 400 CPU Threads, which is still below the “Max. Worker Threads” setting of 512. Even this would not completely kill the CPU. Hence, in real life this will hardly ever cause a problem.

But when our problem occurs, and everything is stalled, you will see that way more CPU Threads are running! Actually, there are more Threads running than WT are configured!
If that happens, all “entry channels” to the CPU are occupied; the CPU is overwhelmed, does not accept any further processes/tasks and becomes totally unresponsive. The system is completely stalled; you cannot even connect with SSMS.

You can check the configured WT and the currently running Threads with these queries:

SELECT max_workers_count as [Configured_Worker_Threads] FROM sys.dm_os_sys_info
SELECT SUM(current_workers_count) as [Running_Worker_Threads] FROM sys.dm_os_schedulers
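
If you prefer to see everything in one row, the two queries above can be combined as in the following sketch. The two additional columns (active workers and tasks queued for a worker) are my own addition, taken from sys.dm_os_schedulers, and are not part of the original check.

```sql
-- Sketch: configured vs. running Worker Threads in a single result row
SELECT si.max_workers_count            AS configured_worker_threads,
       SUM(sch.current_workers_count)  AS running_worker_threads,
       SUM(sch.active_workers_count)   AS active_worker_threads,
       SUM(sch.work_queue_count)       AS tasks_waiting_for_a_worker
FROM sys.dm_os_schedulers AS sch
CROSS JOIN sys.dm_os_sys_info AS si
GROUP BY si.max_workers_count;
```

If the running threads get close to (or exceed) the configured value and tasks start queuing up, you are looking at the situation described in this article.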

So, how come? Assuming the SQL Server is properly sized and configured, how can there be more Threads than Workers defined??? Shouldn’t SQL Server limit the number of Threads?

Well, the reason is that multiple CPU threads can also be invoked within the network layer (remember the good old OSI model!?). If large result sets are sent over the network, the network adapter can also involve multiple CPUs to handle the traffic quicker; the feature is called “Large Receive Offload” (LRO).
So it looks like a certain combination of NAV build (with its use of the “Multiple Active Result Sets” (MARS) technology) and network adapter generates this huge number of threads (MARS ==> ASYNC_NETWORK_IO) when large result sets are transmitted between NST and SQL. The exact reasons are not fully clear to me, as this happens really deep in the communication layer. As a result, all these threads totally overwhelm the CPU and the whole system goes “south” …

So, what can we do about this?


Plan A:
Completely suppress parallelism on the SQL Server instance:

Max. Degree Of Parallelism = 1
Cost Threshold For Parallelism = 60

Increase the number of configured Worker Threads:

Max. Worker Threads = 2048

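EXEC sys.sp_configure N'show advanced options', 1;  -- the three settings below are advanced options
RECONFIGURE;
EXEC sys.sp_configure N'max degree of parallelism', 1;
EXEC sys.sp_configure N'cost threshold for parallelism', 60;
EXEC sys.sp_configure N'max worker threads', 2048;
RECONFIGURE;  -- activate the new values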
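
To double-check that the three settings have actually been applied, you could read them back from sys.configurations, for example like this (a small sketch; “value” is the configured value, “value_in_use” is what the instance is currently running with):

```sql
-- Sketch: read back the instance settings discussed above
SELECT name,
       value,          -- configured value
       value_in_use,   -- value the instance is currently running with
       is_dynamic      -- 1 = takes effect with RECONFIGURE, no restart needed
FROM sys.configurations
WHERE name IN (N'max degree of parallelism',
               N'cost threshold for parallelism',
               N'max worker threads');
```

If “value” and “value_in_use” differ, the change has not been activated yet.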

Plan B:
Change your network adapter; e.g. on VMware this only seems to happen with “vmxnet3”; using the older “E1000” should not cause the problem.

All of this only deals with the symptoms, not the cause. So additionally you need to identify the “Expensive Queries” which trigger this mess! Hence, query and index tuning is required, and maybe programming changes.

Resolution (supposedly):

Well, as this problem only affects certain NAV builds, we could already guess this whole mess is yet another bug in the NST. Well, it is … uh … was: it should be fixed with these Cumulative Updates:

CU25 for NAV 2015 (Build 47254), dated 7 November 2016

CU13 for NAV 2016 (Build 47256), dated 7 November 2016

Hope this could help you!

No warranties, no guaranties, no support.

Cheers & Merry Christmas!

Troubleshooting Essentials

This is the recording of my presentation at NAV TechDays 2015 in Antwerp, about “Tips, Tricks & Tools for Troubleshooters”.
It sort of wraps up all the areas we need to investigate in terms of “Performance Optimization” and highlights how to approach the issues.
Further, I’m introducing several tools which are IMHO quite handy to get the job done.

A similar article was also posted on my old BLOG space (it is actually the last article there), but I’d like to have it available here as well, since I consider it most important and a good starting point for troubleshooters.

Attached to this article you’ll find the download package for the session. Here is a brief description of the scripts etc. it includes:

File/Folder
    Description

NAVTD 2015 NAV SQL Troubleshooting – STRYK – Session 1_4.pptx
    The PowerPoint slides.

PerfMon Templates
    Templates for PERFMON.EXE. In case of a SQL “Named Instance” you’ll have to modify the template (see Readme).
    To process the output data with PAL: PAL basically only works with US localizations. I have attached a modified PAL.ps1 PowerShell script which is supposed to run on any localization. Further, I have removed/modified some output according to my needs; that’s the PAL script I use personally. Those of you running a non-English PERFMON can translate the counter names into English using PLT.

Profiler Templates
    Templates for SQL Profiler; SQL Server 2005, 2008, 2008R2, 2012 and 2014.
    CAUTION: these templates are not pre-filtered! Using Profiler without filters could create an immense amount of trace data which kills your disk! Apply filters as needed, e.g. Reads >= 1000 AND Duration >= 20

Scripts
    All kinds of TSQL Scripts and Examples for NAV/SQL Troubleshooting:

Wait Statistics (Paul Randal).sql
    Displays the SQL Wait Statistics, incl. comments.

QEP_MissingIndexes.sql
    Investigates the SQL Procedure Cache to find/analyze Expensive Queries; displays “Missing Index” proposals (naming “ssi%”).
    CAUTION: as mentioned during the session: NEVER EVER simply APPLY these proposals without VERIFICATION and CLEAN UP! Else you will cause other problems instead of resolving them!

ReScript_SSI_Indexes.sql
    Displays and documents custom-built indexes; creates code for CREATE and DROP (based on naming “ssi%”).

VerifyIndexUsage_SSI.sql
    Displays the Index Usage of custom-built indexes (based on naming “ssi%”).
    CAUTION: the statistics are reset on restart of the SQL Server! You need a reliable uptime of the server for sufficient statistics!

template_TraceCheck(generic2).sql
    Script to analyze/group SQL Profiler Traces.

template_NavSQLTraceAnalysis.sql
    Script to analyze/group SQL Profiler Traces; incl. NAV Call Stack (by “SQL Trace”).

Block_1_{}.sql to Block_6_{}.sql
    Scripts to establish event-triggered Block Detection (incl. analysis).

Deadlock_1_{}.sql to Deadlock_3_{}.sql
    Scripts to establish background tracing of Deadlock Graphs (incl. analysis).
    Well, in the session I only showed how to export the Deadlock Graphs to XML to be opened in Excel. I decided to share another script (not shown in the session) which is way more convenient: “Deadlock_2_TraceCheck_ssi_dlg_check.sql” creates a Table and a Stored Procedure; this SP reads the Deadlock Trace file directly, uses some SQL magic to parse the XML and saves the result into the table. “Deadlock_3_TraceCheck_template.sql” shows how to use it.

NAV2013
    Some special features only for NAV 2013 and higher. The problem is that without “User Delegation” we can hardly identify an individual user involved in blocking from the SQL side.

NAV2013_SessionTrace.sql, NAV2013_SessionTrace.tdf
    SQL Profiling (GUI template or TSQL script) to record the “NAV Call Stack”. This requires the “SQL Trace” feature to be up and running on the designated NAV Service Tiers. The trace data can be used to assign the NAV User ID to the Block Detection recordings.
    CAUTION: running this profiling could create an immense amount of trace data, which potentially kills the disk! NEVER run this unattended; watch the trace-file size/growth carefully. Use this feature only temporarily!

GetSessions_NAV2013.sql, GetSessionsBlocks_NAV2013.sql
    Templates showing how the “Session Trace” can be used and assigned to the Block Recordings.
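
To give an idea of what such an index usage check looks like, here is a minimal sketch. It is NOT the VerifyIndexUsage_SSI.sql script from the package, just my illustration of the principle using the standard DMV sys.dm_db_index_usage_stats; it assumes the custom indexes follow the “ssi%” naming and that it is run in the NAV database.

```sql
-- Sketch: usage of custom indexes named "ssi%" in the current database
SELECT OBJECT_NAME(i.object_id)  AS table_name,
       i.name                    AS index_name,
       us.user_seeks,
       us.user_scans,
       us.user_lookups,
       us.user_updates,          -- write cost of maintaining the index
       si.sqlserver_start_time   -- usage counters only cover the time since this restart
FROM sys.indexes AS i
LEFT JOIN sys.dm_db_index_usage_stats AS us
       ON us.object_id   = i.object_id
      AND us.index_id    = i.index_id
      AND us.database_id = DB_ID()
CROSS JOIN sys.dm_os_sys_info AS si
WHERE i.name LIKE N'ssi%'
ORDER BY table_name, index_name;
```

An index showing NULLs or zero seeks/scans but lots of updates after a long uptime is a candidate for removal; with a short uptime the numbers do not prove anything.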

Please note that most of these features are generic, i.e. independent of any application. That means you can use them not only with NAV, but with any other SQL database as well!

Everything you use, you


Hope this could help you to troubleshoot your NAV!