For nearly a decade, I’ve been occasionally blogging on the Exchange Team Blog, and later on my personal TechNet blog. Those platforms are stable, easy to use, and perfectly acceptable. But they’re not much fun. I want something I can tweak, break, and put back together again.
Now that cloud hosting has become so cheap (free web sites on Windows Azure!) and managing/updating a web site has become so easy (deployment from GitHub or a local Git repository!), I’ve decided to try blogging on a platform that is basically the complete opposite of every other major blogging platform.
It’s called Jekyll, and it’s the platform used for GitHub Pages. What makes it so different is that your blog is a static site - it’s just html and css files sitting on disk, which are served up to the browser as-is. No controllers, no server-side view engine, and no database. To add a new blog post, you literally just drop a text file in a folder, and run Jekyll to update the html files. Done.
A complex content management system with an underlying database, such as Wordpress, is more user-friendly as a hosted solution. However, when you’re running the site yourself, all that complexity can make for a lot of extra work. Being able to manage my blog posts by just altering text files in a folder is pretty amazing.
Did I mention it also has code highlighting for practically every language under the sun, including Powershell? Now when I post a script that is a hundred lines long, it might actually be somewhat readable.
```powershell
if ($ExchangeServer -eq "")
{
    # Choose a PF server
    $pfdbs = Get-PublicFolderDatabase
    if ($pfdbs.Length -ne $null)
    {
        $ExchangeServer = $pfdbs[0].Server.Name
    }
}
```
Alright, I’ve gushed about Jekyll enough. If you’re interested in a different kind of blogging platform, go check it out. Otherwise, stay tuned for more Exchange-related posts.
Someone recently posted a question on an old blog post of mine:
Bill,
We have eliminated our public folders, and I would like to clean out the MESO folder. There are still hundreds of objects that probably serve no purpose, but I don’t see a way of determining which are still necessary.
Some examples:
EventConfig_Servername (where the server is long gone)
globalevents (also globalevents-1 thru 29)
internal (also internal-1 thru 29)
OAB Version 2-1 (and 2-2 and 2-3)
Offline Address Book Storage group name
OWAScratchPad{GUID} (30 of them)
Schedule+ Free Busy Information Storage group name
StoreEvents{GUID} (31 of them)
SystemMailbox{GUID} (over 700 of them)
Most of the SystemMailboxes are Type: msExchSystemMailbox, but 3 are Type: User. I found one that was created last month. Apart from the SystemMailboxes, most everything else has a whenChanged date of 2010. What to do?
Thanks, Mike
When it comes to public folders, you only need MESO objects for mail-enabled folders, and a folder only needs to be mail-enabled if people are going to send email to it. No one ever needs to send email to any of the system folders that are part of your public folder tree.
Everything in Mike’s list except the very last item is a directory object for a system folder, and even if the public folders were still present in the environment, these objects would serve absolutely no purpose. It is fine to delete them whenever you want, though if the folders themselves are still present, you might want to do it gracefully with Disable-MailPublicFolder.
The SystemMailbox objects are trickier. Each SystemMailbox corresponds to a database, and the database is identified by the GUID between the curly braces. To determine if the SystemMailbox object can be safely deleted, you need to determine if that database still exists. This is easy to do with a simple Powershell command:
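The command isn’t shown above, but the check boils down to resolving the GUID to a database. Something along these lines should do it (the GUID here is just a placeholder copied from between the braces of a SystemMailbox{GUID} name):

```powershell
# Placeholder GUID - copy the value between the braces of the SystemMailbox object's name
Get-MailboxDatabase "1a2b3c4d-5e6f-7a8b-9c0d-1e2f3a4b5c6d" -ErrorAction SilentlyContinue |
    Format-List Name, DistinguishedName
```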
Here’s an example from one of my labs. You can see that the first command I typed returned nothing, because the GUID didn’t resolve (I purposely changed the last digit). The second one did resolve, returning the DN of the database.
You could also use a simple script to check all the SystemMailbox objects in a particular MESO container and tell you which ones don’t resolve:
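The original script isn’t reproduced here, but a rough sketch of the approach looks like this (the MESO container DN is an assumption - adjust it for your domain):

```powershell
# Enumerate SystemMailbox objects in the MESO container and flag any whose GUID
# no longer resolves to a mailbox database
$meso = [ADSI]"LDAP://CN=Microsoft Exchange System Objects,DC=contoso,DC=com"
$searcher = New-Object System.DirectoryServices.DirectorySearcher($meso, "(cn=SystemMailbox{*})")
$searcher.PageSize = 1000
foreach ($result in $searcher.FindAll()) {
    $cn = $result.Properties["cn"][0]
    $guid = $cn -replace '^SystemMailbox\{(.+)\}$', '$1'
    if (-not (Get-MailboxDatabase $guid -ErrorAction SilentlyContinue)) {
        "No database found for: $cn"
    }
}
```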
Edit 2014-06-02: For an update on this issue, please see this post.
Edit 2015-06-09: For the most recent update, please see this post. We now have a much better solution.
In Exchange 2010, you may find that public folder replication is failing between two servers. If you enable Content Conversion Tracing as described in my Replication Troubleshooting Part 4 post, you may discover the following error:
Microsoft.Exchange.Data.Storage.ConversionFailedException: The message content has become corrupted. ---> Microsoft.Exchange.Data.Storage.ConversionFailedException: Content conversion: Failed due to corrupt TNEF (violation status: 0x00008000)
There are other types of TNEF errors, but in this case we’re specifically interested in 0x00008000. This means UnsupportedPropertyType.
What we’ve found is that certain TNEF properties that are not supposed to be transmitted are making it into public folder replication messages anyway. These properties are 0x12041002 and 0x12051002.
To fix the problem, you can manually remove those properties from the problem items using MFCMapi, or you can use the following script.
The script accesses the public folder via EWS, so you must have client permissions to the folder in order for this to work (just being an administrator is not sufficient). Also, it requires EWS Managed API 2.0. Be sure to change the path of the Import-Module command if you install the API to a different path.
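The exact command isn’t reproduced here; a typical invocation looks roughly like this (the positional folder path is an assumption - check the script itself for the actual parameter names):

```powershell
# Check a single public folder for problem items (hypothetical folder path)
.\Delete-TNEFProps.ps1 "\Engineering\Specs"
```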
With this syntax, the script only checks for problem items in the specified folder. If you want it to fix those items, you must add -Fix $true to the command. Optionally, you can also add the -Verbose switch if you want it to output the name of every item as it checks it.
Edit: Moved the script to gist.github.com - easier to maintain that way
Edit: Updated the script to automatically recurse subfolders if desired. To do so, add -Recurse $true. For example, to process every single public folder, pass -Recurse $true with no folder path:
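A sketch of that invocation, using the same assumed syntax as above:

```powershell
# Check every public folder, recursing from the root
.\Delete-TNEFProps.ps1 -Recurse $true
```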
Today, I want to talk about another public folder replication problem we see repeatedly. Aren’t you glad PF replication is gone in Exchange 2013?
This is one of the rarer public folder replication issues that we see, and it’s caused by the attributes on the database. Actually, a database in this state sometimes causes a problem and sometimes does not, and I want to explain why that is.
The way this problem surfaces is that you see an event 3085 stating that outgoing replication failed with error 0x8004010f. If you try something like Update Content, you’ll get some error output with a diagnostic context that looks like this:
There are many problems that could cause some diagnostic output that looks similar to this. For this particular problem the error must be MapiExceptionNotFound, and the sequence of Lids will usually be pretty close to what you see here.
This error occurs when the replica list on a public folder contains the GUID of a public folder database which does not have an msExchOwningPFTree value. It’s easy to find a database in this state with an ldifde command to dump the properties of any public folder database objects where this value is not set:
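The command isn’t reproduced above, but an ldifde query along these lines will find them (the configuration partition DN is a placeholder for your forest):

```powershell
ldifde -f PFDatabasesWithoutTree.txt -d "CN=Configuration,DC=contoso,DC=com" -r "(&(objectClass=msExchPublicMDB)(!(msExchOwningPFTree=*)))" -l msExchOwningPFTree
```

If that query returns any public folder database objects, there are three ways to deal with the situation: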
Delete the folder, if you can figure out which one it is.
Populate the msExchOwningPFTree value.
Delete the database in question from the Active Directory.
Option 1 is usually not desirable, but I included it to illustrate the fact that a database in this state only causes a problem if existing folders ever had replicas on it. Keep in mind that the replica list you see in the management tools only shows you the current active replicas. The internal replica list tracks every replica that has ever existed, forever. Even if you remove all replicas from the database in question using the management tools, the GUID of that database is still present in the internal replica list, and it always will be. Thus, you cannot unlink a database from the hierarchy if any existing folder has ever had replicas on it - at least, not without breaking replication.
This is important, because certain third-party software will purposely keep public folder databases around that are not linked to the hierarchy. And that works fine, as long as they don’t have replicas, and never did.
Option 2 is the proper approach to fixing this situation if the database is still alive. Perhaps someone manually cleared the msExchOwningPFTree while troubleshooting or trying to affect the routing of emails to public folders. Just set the value to contain the DN of the hierarchy object. You can check your other PF databases to see what it should look like, as they should all have the same value. A few minutes after setting the value, replication should start working again.
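If you’d rather do it from Powershell than ADSI Edit, a rough sketch looks like this (the database names are placeholders; it just copies the value from a healthy public folder database object):

```powershell
# Copy msExchOwningPFTree from a healthy PF database object to the broken one
$good = [ADSI]("LDAP://" + (Get-PublicFolderDatabase "PFDB-Good").DistinguishedName)
$bad  = [ADSI]("LDAP://" + (Get-PublicFolderDatabase "PFDB-Broken").DistinguishedName)
$bad.Put("msExchOwningPFTree", [string]$good.Get("msExchOwningPFTree"))
$bad.SetInfo()
```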
If the database has been decommissioned, perhaps ungracefully, and it no longer exists, then you can go with option 3 and simply delete the Active Directory object for the database using ADSI Edit. When the GUID in the replica list does not resolve to an object in the AD, that’s fine - that’s the normal state for a folder that once had replicas on databases that aren’t around anymore, so it doesn’t cause any problem.
In Exchange 2013, the built-in Delegated Setup role group allows users to install new Exchange 2013 servers after those servers have been provisioned with the /NewProvisionedServer switch. However, you may find that even after provisioning the server, when a member of Delegated Setup attempts to install the server, it fails. The setup log from the delegated setup attempt shows:
[11/07/2013 21:11:33.0015] [1] Failed [Rule:GlobalServerInstall] [Message:You must be a member of the 'Organization Management' role group or a member of the 'Enterprise Admins' group to continue.]
[11/07/2013 21:11:33.0031] [1] Failed [Rule:DelegatedBridgeheadFirstInstall] [Message:You must use an account that's a member of the Organization Management role group to install or upgrade the first Mailbox server role in the topology.]
[11/07/2013 21:11:33.0031] [1] Failed [Rule:DelegatedCafeFirstInstall] [Message:You must use an account that's a member of the Organization Management role group to install the first Client Access server role in the topology.]
[11/07/2013 21:11:33.0031] [1] Failed [Rule:DelegatedFrontendTransportFirstInstall] [Message:You must use an account that's a member of the Organization Management role group to install the first Client Access server role in the topology.]
[11/07/2013 21:11:33.0031] [1] Failed [Rule:DelegatedMailboxFirstInstall] [Message:You must use an account that's a member of the Organization Management role group to install or upgrade the first Mailbox server role in the topology.]
[11/07/2013 21:11:33.0031] [1] Failed [Rule:DelegatedClientAccessFirstInstall] [Message:You must use an account that's a member of the Organization Management role group to install or upgrade the first Client Access server role in the topology.]
[11/07/2013 21:11:33.0031] [1] Failed [Rule:DelegatedUnifiedMessagingFirstInstall] [Message:You must use an account that's a member of the Organization Management role group to install the first Mailbox server role in the topology.]
This occurs if legacy Exchange administrative group objects exist from when Exchange 2003 was still present in the organization. Unfortunately, setup does not handle this gracefully in the delegated setup scenario.
To fix the problem, you could delete the legacy administrative groups, but we don’t recommend this. Instead, a safer approach is to simply add an explicit deny for the Delegated Setup group on the legacy administrative groups. This prevents setup from seeing those admin groups, and it proceeds as normal. After setup is finished, you can remove the explicit deny to put the permissions back in their normal state.
Setting the explicit deny is fairly easy to do in ADSI Edit, but I’ve also written a simple script to make this easier when you have a lot of legacy admin groups. The script takes no parameters. Run it once to add the Deny, and run it again to remove the Deny:
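The script itself isn’t reproduced here, but the core of the “add the deny” half looks something like this sketch. It assumes the ActiveDirectory module, denies read so setup can’t see the object, and uses a hypothetical admin group DN (the real script enumerates the legacy admin groups for you and takes no parameters):

```powershell
Import-Module ActiveDirectory
$delegatedSetup = Get-ADGroup "Delegated Setup"

# Hypothetical legacy administrative group DN - yours will live under
# CN=Administrative Groups in the configuration partition
$adminGroupDN = "CN=First Administrative Group,CN=Administrative Groups,CN=Contoso,CN=Microsoft Exchange,CN=Services,CN=Configuration,DC=contoso,DC=com"

$acl  = Get-Acl "AD:\$adminGroupDN"
$rule = New-Object System.DirectoryServices.ActiveDirectoryAccessRule(
    $delegatedSetup.SID,
    [System.DirectoryServices.ActiveDirectoryRights]::GenericRead,
    [System.Security.AccessControl.AccessControlType]::Deny)
$acl.AddAccessRule($rule)
Set-Acl "AD:\$adminGroupDN" $acl
```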
In Exchange Server, when a call into the Information Store fails, we often report a diagnostic context. This information is extremely useful for those of us in support, because we can often use it to see exactly where the call failed without having to do any additional data collection. Unfortunately, diagnostic context info is mostly useless to customers, because it’s impossible to make sense of it without the source code. In this post, I’ll describe one specific thing you can look for in a diagnostic context to identify calls that are failing due to contention for the mailbox lock.
In Exchange 2013, changing something in a mailbox usually involves acquiring a lock so that other changes cannot be made at the same time. If an operation has grabbed the mailbox lock, any other operations that want to change things have to wait. They will line up and wait for the mailbox lock, and will eventually time out if they don’t get it within a reasonable amount of time. However, there’s a limit to how long the line itself is allowed to get. Once 10 operations are already waiting for the lock, any additional operations fail instantly with MAPI_E_TIMEOUT (0x80040401).
If you have a diagnostic context from Exchange 2013, perhaps from an event that was logged in the Application Log, then you can check for this situation by looking for LID 53152 with dwParam 0xA. Here is an example:
You’ll notice the LID we’re interested in is at the top of the remote context. The fact that LID 53152 shows a dwParam of 0xA means that we already have 0xA (decimal 10) operations waiting on the mailbox lock, so we purposely make this call fail instantly, without waiting at all. This usually results in a MapiExceptionTimeout and a StorageTransientException in all sorts of different places.
Once you’ve identified that mailbox contention is causing the error, there’s still the question of why there is so much contention for the mailbox lock. Are there dozens of clients all trying to make changes in the mailbox at the same time? Is an application hammering the mailbox with requests? You still need to investigate to find the root cause, but after understanding this piece, you can at least begin to ask the right questions.
Today, I want to highlight a behavior that isn’t really called out anywhere in any existing documentation I can find. This is the behavior that occurs when Offline Address Book generation on Exchange 2010 logs an event 9414, such as this one:
Event ID: 9414
Source: MSExchangeSA
One or more properties cannot be read from Active Directory for
recipient '/o=Contoso/ou=Administrative Group/cn=Recipients/cn=User 1'
in offline address book for '\Offline Address List - Contoso'.
When we stumble across a bad object like this, the OAB generation process will often skip a few good objects (in addition to the bad object) due to the way we handle the bookmark. As a result, User 1, from the event above, won’t be the only thing missing from your Offline Address Book. If you turn up logging to the maximum so that OABGen logs every object it processes, you can figure out which objects are being skipped by observing which objects do not appear in the event log.
The bottom line is: If you want your OAB to be complete, you must fix the objects that are causing 9414s, even if the objects in the 9414s aren’t ones you particularly care about.
So, why does it work this way, you ask?
The 9414 event was born in Exchange 2010 SP2 RU6. Before that, one of these bad objects would make OABGen fail completely and log the chain of events in KB 2751581 - most importantly, the 9339:
Event ID: 9339
Source: MSExchangeSA
Description:
Active Directory Domain Controller returned error 8004010e while
generating the offline address book for '\Global Address List'. The
last recipient returned by the Active Directory was 'User 9'. This
offline address book will not be generated.
- \Offline Address Book
Unfortunately, the old 9339 event didn’t know what the actual problem object was. OABGen was working on batches of objects (typically 50 at a time), and when there was a problem with one object in the batch, the whole batch failed. All that OABGen could point to was the last object from the last successful group, which didn’t really help much.
Thus, the OABValidate tool was born. The purpose of this tool is to scour the Active Directory looking for lingering links, lingering objects, and other issues that would trip up OABGen. As Exchange and Windows both changed the way they handled these calls, the behavior would often vary slightly between versions, so OABValidate just flags everything that could possibly be a problem. Which object was actually causing the 9339 wasn’t certain, but if you fixed everything OABValidate highlighted, you would usually end up with a working OAB.
In large environments with hundreds of thousands of mail-enabled objects, cleaning up everything flagged by OABValidate could be a huge, time-consuming process. On top of that, residual AD replication issues could introduce new bad objects even as you were cleaning up the old bad objects.
Finally, thanks to a significant code change in Exchange 2010 SP2 RU6, Exchange was able to identify the actual problem object and point it out in a brand new event, the 9414. In addition, OABGen would skip the object and continue generating the OAB, so that it wasn’t totally broken by a single bad object anymore. This was a huge step forward that not only made OABValidate obsolete for most scenarios, but resulted in a situation where these OABGen errors can often go unnoticed for quite some time.
When someone finally does notice that the OAB is missing stuff, and you go look at your application log, you might think you can ignore these 9414s since they don’t mention the object you’re looking for. However, OABGen does still process objects in batches, and when it trips over that one bad object, the rest of the batch typically gets skipped.
So if you find that your OAB is missing objects, the first thing to do is check for 9414s and resolve the problems with those objects. While this does take a bit of work, it’s much better than the methods you had to use to resolve this sort of issue before SP2 RU6.
In a post last month, called Cleaning up Microsoft Exchange System Objects (MESO), I described how to determine which objects can be eliminated from the MESO container if you have completely removed public folders from your environment. But what if you still have public folders?
As I mentioned in my previous post, you only need MESO objects for mail-enabled public folders. When you mail-enable a public folder, Exchange creates a directory object for it, and when you mail-disable or delete the folder, Exchange is supposed to delete the directory object. Unfortunately, that doesn’t always work like it should, and you can end up with a lot of public folder objects in the MESO container that don’t point to any existing folder.
To make matters worse, it’s not very easy to figure out which directory objects point to an actual folder. You can’t assume much from the name itself - you could have dozens of public folders all named “Team Calendar” in different parts of the hierarchy, so which directory object points to which folder?
When you send email to a mail-enabled public folder, Exchange uses the legacyExchangeDN attribute on the directory object to look up the folder in the public folder database (or public folder mailbox in the case of Exchange 2013). However, the legacyExchangeDN property on the public folder in the database is an internal property - you can’t see it, even using tools like MFCMapi. So matching them up that way is not an option.
However, you can go in the other direction. Rather than taking a directory object and trying to find the store object, you can start with the store object and find the corresponding directory object easily. This is because if you look at the MAPI property PR_PF_PROXY on the folder, the store finds the correct directory object and returns its objectGUID. This is essentially what happens when you run Get-PublicFolder \Some\Folder | Get-MailPublicFolder in Exchange Management Shell.
Thus, in order to figure out which public folder directory objects are not linked to anything, you would need to retrieve all the directory objects that exist and then determine which ones are linked to folders based on PR_PF_PROXY or the Powershell cmdlets. After you eliminate those, you know that any public folder directory objects left over are not linked to anything, and they can be deleted.
There are a few ways you could go about this. One would be to use a client API such as Exchange Web Services to enumerate the public folders and check the property that way. While I do use EWS in a lot of my scripts, there is one big drawback to using it for this sort of operation - the fact that there is no way to use admin rights via EWS. As I explained in an old post called Public Folder Admin Permissions Versus Client Permissions, it doesn’t matter what admin rights your user account has when you’re using a client like Outlook. Outlook never attempts to pass admin flags at logon, so if you don’t have client permissions to a public folder, you won’t be able to see that public folder, even if you’re logged on as an org admin. EWS works the same way - there is no way to pass admin flags via EWS. This means that if you use EWS, you might not see all the public folders, so you might erroneously delete public folder directory objects that are actually still in use.
You could work around this limitation by granting yourself client permissions to all the public folders. Another option is to use MAPI, where you can pass admin flags. Of course, writing a MAPI tool is not trivial.
A better approach is to just use Exchange Management Shell. While this can be slower than EWS, the management shell uses your admin rights, so you will be able to see all public folders in the hierarchy, even if you don’t have client permissions to them.
However, there is one other caveat to be aware of. Sometimes, public folders can have directory objects when the public folder is not flagged as mail-enabled. This is described in KB 977921. If the folder is in this state, email sent to the folder will succeed, even though the management shell says the folder is not mail-enabled. You should be sure your folders are not in this state before you start making decisions about what to delete based on what Exchange Management Shell says, or else you might delete a directory object for a folder that is actually functioning as a mail-enabled folder.
That said, I created a simple script that demonstrates how you can check for unneeded public folder directory objects using Exchange Management Shell. Note that this script only identifies the unneeded directory objects. I’ll leave the actual deletion of them as an exercise for the reader. Hint: The $value in the loop at the end is the distinguishedName of the directory object. It’s probably a good idea to sanity check the results, and you might want to export the directory objects before you start deleting things.
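The original script isn’t reproduced here, but the approach boils down to this sketch: build the set of directory objects that are actually linked to folders, then flag every other public folder object in the MESO container. The MESO container DN is an assumption - adjust it for your domain.

```powershell
$linked = @{}
Get-PublicFolder "\" -Recurse -ResultSize Unlimited |
    Get-MailPublicFolder -ErrorAction SilentlyContinue |
    ForEach-Object { $linked[$_.DistinguishedName] = $true }

$meso = [ADSI]"LDAP://CN=Microsoft Exchange System Objects,DC=contoso,DC=com"
$searcher = New-Object System.DirectoryServices.DirectorySearcher($meso, "(objectClass=publicFolder)")
$searcher.PageSize = 1000
foreach ($result in $searcher.FindAll()) {
    $value = $result.Properties["distinguishedname"][0]
    if (-not $linked.ContainsKey($value)) {
        $value   # candidate for deletion - sanity check before removing anything
    }
}
```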
I recently worked on an issue where the domain controllers kept intentionally disconnecting the Exchange servers. The error messages that described the reason for the disconnect were rather misleading, and we ended up wasting quite a bit of time taking steps that had no chance of improving the situation. In this blog post, I’m going to document this behavior in detail, in hopes of saving anyone else who runs into this a lot of time and effort.
The Problem
The behavior we observed was that Exchange would lose its connection to its config DC. Then, it would change DCs and lose connection to the new one as well. This would repeat until it exhausted all in-site DCs, generated an event 2084, and started hitting out-of-site DCs, often returning the same error. Usually, the error we saw was a 0x51 indicating the DC was down:
Description: Process w3wp.exe () (PID=10860). Exchange Active Directory Provider lost contact with domain controller dc1.bilong.test. Error was 0x51 (ServerDown) (Active directory response: The LDAP server is unavailable.). Exchange Active Directory Provider will attempt to reconnect with this domain controller when it is reachable.
Network traces revealed that the DC was intentionally closing the LDAP connection. Once we discovered that, we set the following registry value to 2 in order to increase the logging level on the DC:
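The specific value isn’t shown above; the setting that controls this logging is the LDAP Interface Events diagnostic level under NTDS\Diagnostics on the DC. Something like this, run on the domain controller, sets it:

```powershell
# Raise "LDAP Interface Events" diagnostic logging to 2 (Basic)
Set-ItemProperty -Path "HKLM:\SYSTEM\CurrentControlSet\Services\NTDS\Diagnostics" `
    -Name "16 LDAP Interface Events" -Value 2
```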
With that set to 2, the DC started generating a pair of Event ID 1216 events every time it disconnected Exchange. The second 1216 event it generated wasn’t particularly helpful:
Log Name: Directory Service
Source: Microsoft-Windows-ActiveDirectory_DomainService
Event ID: 1216
Task Category: LDAP Interface
Level: Warning
Description:
Internal event: An LDAP client connection was closed because of an error.
Client IP: 192.168.0.190:8000
Additional Data
Error value: 1236 The network connection was aborted by the local system.
Internal ID: c0602f1
But the first one gave us something to go on:
Log Name: Directory Service
Source: Microsoft-Windows-ActiveDirectory_DomainService
Event ID: 1216
Task Category: LDAP Interface
Level: Warning
Description:
Internal event: An LDAP client connection was closed because of an error.
Client IP: 192.168.0.190:8000
Additional Data
Error value: 8616 The LDAP server's network send queue has filled up because the client is not processing the results of its requests fast enough. No more requests will be processed until the client catches up. If the client does not catch up then it will be disconnected.
Internal ID: c060561
The LDAP client, in this case, is Exchange. So this error means the Exchange server isn’t processing the results of the LDAP query fast enough, right? With this information, we started focusing on the network, and we spent days poring over network traces trying to figure out where the network bottleneck was, or if the Exchange server itself was just too slow. We also found that sometimes, the 2070 event would show a 0x33 error, indicating the same send queue problem that was usually masked by the 0x51 error:
Description: Process w3wp.exe () (PID=10860). Exchange Active Directory Provider lost contact with domain controller dc1.bilong.test. Error was 0x33 (Busy) (Additional information: The LDAP server's network send queue has filled up because the client is not processing the results of its requests fast enough. No more requests will be processed until the client catches up. If the client does not catch up then it will be disconnected.
Active directory response: 000021A8: LdapErr: DSID-0C06056F, comment: The server is sending data faster than the client has been receiving. Subsequent requests will fail until the client catches up, data 0, v1db1). Exchange Active Directory Provider will attempt to reconnect with this domain controller when it is reachable.
We removed antivirus, looked at NIC settings, changed some TCP settings to try to improve performance, all to no avail. Also, we weren’t able to reproduce the error using various LDAP tools. No matter what we did with Powershell, LDP, ldifde, or ADFind, the DC would not terminate the connection. It was only terminating the Exchange connections.
We eventually found out that this error had nothing to do with how fast the LDAP client was processing results, and it is possible to reproduce it. In fact, you can reproduce this LDAP error at will in any Active Directory environment, and I will show you exactly how to do it.
LDAP Send Queue 101
Here’s how Active Directory’s LDAP send queue limit works. The send queue limit is a per-connection limit, and is roughly 23 MB. When a DC is responding to an LDAP query, and it receives another query over the same LDAP connection, it first checks to see how much data it is already pushing over that connection. If that amount exceeds 23 MB, it terminates the connection. Otherwise, it generates the response to the second query and sends it over the same connection.
Think about that for a minute - it has to receive another LDAP query over the same LDAP connection while it’s responding to other queries. You can do that? Yep. As noted in the wldap32 documentation on MSDN:
The rules for multithreaded applications do not depend on whether each thread shares a connection or creates its own connection. One thread will not block while another thread is making a synchronous call over the same connection. By sharing a connection between threads, an application can save on system resources. However, multiple connections give faster overall throughput.
Until now, I had always thought of LDAP as a protocol where you send one request and wait for the response before sending your next request over that connection. As it turns out, you can have multiple different threads all submitting different requests over the same connection at the same time. The API does the work of lining up the requests and responses and getting the right responses back to the right threads, and LDAP has no problem with this - at least, not until you hit the send queue limit.
This is why we could never reproduce this issue with other LDAP tools. Every single one of those tools issues one request and waits for the response, and in that case, it is impossible to get disconnected due to the send queue limit.
The Solution
In the case of Exchange, we share the config DC connection between multiple threads. One thread would kick off a complete topology rediscovery, which involves querying for all the virtual directories in the environment. In this particular environment, there were thousands of virtual directories, and the properties on the OWA virtual directories can be relatively large. The DC would generate a response containing a page of virtual directory objects (we were using a page size of 1,000), and due to the number of properties on those objects, this response exceeded the 23 MB limit.
By itself, that wasn’t enough to cause a problem. The problem happened when some other thread came along and used the same LDAP connection to ask for something else - maybe it just needed to read a property from a server object. When that second query hit the DC while the DC was still sending us the response to the virtual directory query, the DC killed the connection due to the send queue limit.
So, how can you avoid this? As a user of software, there’s not much you can do except delete objects until the LDAP response is small enough to be under the send queue limit, or reduce the MaxPageSize in the Active Directory LDAP policies to force everything to use a smaller page size.
As a developer of software, there are a few approaches you can take to avoid this problem. One is to not submit multiple queries at the same time over a single connection; either wait for the previous query to return, or open a new connection. Another approach is to reduce the page size used by your query so that the response size doesn’t exceed the send queue limit. That’s the approach we’re taking here, and the page size used for topology rediscovery is being reduced in Exchange so that the LDAP response to the virtual directory query doesn’t exceed the send queue limit in large environments.
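As an illustration of that developer-side fix (just a sketch, not Exchange’s actual code), a paged DirectorySearcher query with a deliberately small page size keeps each individual LDAP response comfortably small:

```powershell
# Hypothetical query for virtual directory objects using a small page size
$configNC = ([ADSI]"LDAP://RootDSE").configurationNamingContext.Value
$searcher = New-Object System.DirectoryServices.DirectorySearcher([ADSI]"LDAP://$configNC")
$searcher.Filter   = "(objectClass=msExchVirtualDirectory)"
$searcher.PageSize = 100   # smaller pages mean smaller responses on the wire
$results = $searcher.FindAll()
"Found $($results.Count) virtual directories"
```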
Note that this update to Exchange will fix one very specific scenario where you’re hitting this error due to the size of the virtual directory query in an environment with hundreds of CAS servers. Depending on your environment, there may be other ways to produce this error that are unrelated to the virtual directories.
Let’s Break It On Purpose
After I thought I understood what was happening, I wanted to prove it by writing some code that would intentionally hit the send queue limit and cause the DC to disconnect it. This turned out to be fairly easy to do, and the tool is written in such a way that you can use it to reproduce a send queue error in any environment, even without Exchange. Note that causing a send queue error doesn’t actually break anything - it just makes the DC close that particular LDAP connection to that particular application.
In order to produce a send queue error, you need a bunch of big objects. In my lab, I used a Powershell script to create 500 user objects and filled those user objects with multiple megabytes of totally bogus proxyAddress values. Here’s the script:
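Here is a rough sketch of the idea (not necessarily the original script): create the test users and bloat their proxyAddresses with junk so that each object is a couple of megabytes when returned by LDAP. Lab use only, and the OU is a placeholder - create it or point the script somewhere else.

```powershell
Import-Module ActiveDirectory
$ou   = "OU=LdapTest,DC=contoso,DC=com"
$junk = "x" * 1000

for ($i = 1; $i -le 500; $i++) {
    $name = "LdapTestUser$i"
    # A couple thousand bogus ~1 KB proxyAddresses makes each object a few MB;
    # dial the count down if your forest rejects that many values
    $addresses = 1..2000 | ForEach-Object { "smtp:$name.$_.$junk@contoso.test" }
    New-ADUser -Name $name -Path $ou -PassThru |
        Set-ADUser -Add @{ proxyAddresses = $addresses }
}
```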
If you run this script, you’ll end up with some objects that look like this:
Lovely, isn’t it? I needed a way to make these user objects really big, and stuffing a bunch of meaningless data into the proxyAddresses attribute seemed like a good way to do it.
Now that you have enough big objects that you can easily exceed the send queue limit by querying for them, all you need is a tool that will query for them on one thread while another thread performs other queries on the same LDAP connection. To accomplish that, I wrote some C# code and called it LdapSendQueueTest. Find the code on GitHub here: https://github.com/bill-long/LdapSendQueueTest.
Once you compile it, you can use it to query those big objects and reproduce the send queue error:
In this example, 1 is the number of threads to spawn (not counting the main thread, which hammers the DC with tiny queries), and 50 is the page size. Apparently I went a little overboard with the amount of data I’m putting in proxyAddresses, because with these objects, the error reproduces even with just 1 thread and a relatively small page size of 50 or even 30. The only way I can get the tool to complete against these test users is to make the page size truly tiny - about 15 or less.
In any real world scenario, you can probably get away with a larger page size, because your objects probably aren’t as big as the monsters created by this Powershell script. The tool lets you point to whatever container and filter you want, so you can always just test it against a set of real objects and see.
Conclusion
The bottom line is this: When you see this error from Active Directory telling you the client isn’t keeping up, the error doesn’t really mean what it says. If you take a closer look at what the application is doing, you may find that it’s sharing an LDAP connection between threads while simultaneously asking for a relatively large set of data. If that’s what the application is doing, you can reduce the MaxPageSize in the LDAP policies, which will affect all software in your environment, or you can delete some objects or delete some properties from those objects to try to get the size of that particular query down. Ideally, you want the software that’s performing the big query to be updated to use a more appropriate page size, but that isn’t always possible.
It’s my job to solve difficult problems involving Exchange Server, and this often involves a lot of various types of tracing. Almost daily, I find myself needing to parse through huge amounts of text to find the relevant information. For one issue alone, I currently have over 20 GB of traces in the form of text files.
Usually, I can get by with findstr. This handy little tool is included with Windows, is very fast, and supports regular expressions… sort of. Running findstr /? produces this quick reference:
It tells us to refer to the online documentation for full information on findstr regular expressions, but if you go there, you’ll find the same ten options listed. If you’ve ever looked at any regexp documentation, you know there are a lot more options than this. The regexp quick reference on MSDN lists over 70.
Eventually, I ran into an issue where the lack of full regexp support in findstr was a showstopper. I really, really needed to OR two regular expressions and have all the results combined in one set of results, chronologically from the top of the trace to the bottom of the trace. With findstr, there is apparently no way to do this, because it doesn’t support the bar character which represents an OR in a regexp.
A quick search led to a helpful StackOverflow thread (which was closed as “not constructive” for some reason), but it seems the tools of choice for most people are GUI tools - grepWin or PowerGREP. A few people mentioned using plain old grep via Cygwin or GnuWin32. I have Cygwin installed on one of my machines, but that seems like a lot of stuff to install just to search a text file.
Maybe it’s just me, but when it looks like I need to run an installer to accomplish a very basic task, I start cringing and looking for other options. When you support Windows software for a living, you spend a significant chunk of your life clicking through setup screens. That’s one of the reasons I’m in love with Chocolatey. If you haven’t tried Chocolatey, you should. It will significantly reduce the number of setup screens in your life. Check out Boxstarter too, while you’re at it.
Via Chocolatey, I stumbled across a nice little utility called BareGrep, which hit almost every checkbox on my wish list: It’s a tiny exe, no installer, and it accepts command line parameters. Unfortunately, it displays the results in a GUI, which is a deal-breaker.
Finally, I decided the best option for me was to reinvent the wheel using Powershell and .NET. My initial script was very simple and just did this:
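It was roughly this (a minimal sketch of the idea, not the exact original) - stream the file and emit any line that matches a .NET regular expression:

```powershell
param(
    [string]$Path,
    [string]$Pattern
)

$regex  = New-Object System.Text.RegularExpressions.Regex($Pattern)
$reader = New-Object System.IO.StreamReader((Resolve-Path $Path).Path)
while ($null -ne ($line = $reader.ReadLine())) {
    if ($regex.IsMatch($line)) {
        $line
    }
}
$reader.Dispose()
```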
I ended up making it a little fancier in order to validate parameters and support multiple files and recursion. It would probably run faster if I converted it to C#, but so far the Powershell version has been fast enough that it doesn’t matter much.
Maybe one day findstr will get updated with better regexp support. Until then, I’m using this script. I still have BareGrep in my path, as well. The GUI results aren’t that bad when I don’t need to pipe the output to a new file and process them further.
There are a lot of little problems I run across that I never investigate, simply because there’s no impact and no one seems to care. I have my hands full investigating issues that are impacting people, so investing time to chase down something else is usually not a good use of my time.
One of those errors is something that MfcMapi returns when you open public folders. In many environments, including some of my own lab environments, if you open MfcMapi and double-click Public Folders, you get a dialog box stating that error 0x8004010f (MAPI_E_NOT_FOUND) was encountered when trying to get a property list.
If you click OK, a second error indicates that GetProps(NULL) failed with the same error.
After clicking OK on that error, and then double-clicking on the public folders again, you get the same two errors, but then it opens. At this point you can see the folders and everything appears normal.
I’ve been seeing this error for at least five years - maybe closer to ten. It’s hard to say at this point, but I’ve been seeing it for so long, I considered it normal. I never looked into it, because no one cared.
That is, until I got a case on it recently.
Some folks use MfcMapi as the benchmark to determine if things are working. If MfcMapi doesn’t work, then the problem is with Exchange, and their own product can’t be expected to work.
This was the basis for a recent case of mine. A third-party product wasn’t working, so they tried to open the public folders with MfcMapi, and got this error. Therefore, they could not proceed with troubleshooting until we fixed this error.
Of course, as far as I knew, this error was totally normal, and I told them so, but they still wanted us to track it down. Fortunately, this provided a perfect opportunity to chase down one of those little errors that has bothered me for years, but that I never investigated.
By debugging MfcMapi (hey, it’s open source, anyone can debug it) and taking an ExTRA trace on the Exchange side, we discovered that MfcMapi was trying to call GetPropList on an OAB that did not exist. Looking in the NON_IPM_SUBTREE, we only saw the EX: OAB, which Exchange hasn’t used since Exchange 5.5.
In Exchange 2000 and later, we use the various OABs created through the Exchange management tools. The name will still have a legacy DN, but it won’t start with EX:, so it’s easy to distinguish the real OABs from an old unused legacy OAB folder. Here’s what a real OAB looks like in the public folders, when it’s present:
In this case, we didn’t see the real OAB. We only saw the site-based OAB from the Exchange 5.5 days.
It turned out that the real OAB was set to only allow web-based distribution, not PF distribution. That explained why the OAB could not be seen in the NON_IPM_SUBTREE. Despite that fact, MfcMapi was still trying to call GetPropList on it. Since the folder didn’t exist, it failed with MAPI_E_NOT_FOUND.
Thus, one of the great mysteries of the universe (or at least my little Exchange Server universe) is finally solved!
In the customer environment, we fixed the error by enabling PF distribution for the OAB. I doubt this had anything to do with the issue the third-party tool was having, but who knows? At the very least, we were able to move the troubleshooting process forward by solving this, and maybe this blog post will save people from chasing their tails over this error in the future.
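For reference, that change is a one-liner in the management shell (the OAB name here is assumed):

```powershell
Set-OfflineAddressBook "Default Offline Address Book" -PublicFolderDistributionEnabled $true
```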
Back in January, I wrote a blog post about PF replication failing due to corrupt TNEF. The problem is caused by the presence of a couple of properties that have been deprecated and shouldn’t be present on items anymore. At the time I wrote that post, we thought you could run the cleanup script to remove the properties and live happily ever after. So much for that idea.
We found that, in some environments, the problem kept coming back. Within hours of running the script, public folder replication would break again, and we would discover new items with the deprecated properties.
We recently discovered how that was happening. It turns out that there is a code path in Exchange 2013 where one of the properties is still being set. This means messages containing that property will sometimes get delivered to an Exchange 2013 mailbox. The user can then copy such an item into a public folder. If the public folders are still on Exchange 2010 or 2007, replication for that folder breaks with the corrupt TNEF error:
Microsoft.Exchange.Data.Storage.ConversionFailedException: The message content has become corrupted. ---> Microsoft.Exchange.Data.Storage.ConversionFailedException: Content conversion: Failed due to corrupt TNEF (violation status: 0x00008000)
Now that we know how this is happening, an upcoming release of Exchange 2013 will include a fix that stops it from setting this property. You’ll need to continue using the script from the previous post to clean up affected items for now, but there is light at the end of the tunnel.
Over the holiday weekend, I was deleting some old projects out of my coding projects folder when Powershell returned an error stating, “The specified path, file name, or both are too long. The fully qualified file name must be less than 260 characters, and the directory name must be less than 248 characters.” I found that attempting to delete the folder from explorer or a DOS prompt also failed.
This error occurred while I was trying to remove a directory structure that was created by the yeoman/grunt/bower web development tools. Apparently npm or bower, or both, have no problem creating these deep directory structures on Windows, but when you later try to delete them, you can’t.
A little searching turned up several blog posts and a Stack Overflow question. The workaround of prefixing the path with “\\?\” didn’t seem to work for me.
I found some tools that claimed to be able to delete these files, but as usual, I was annoyed at the idea of having to install a tool or even just download an exe to delete some files.
Edit: Thanks to AlphaFS, this is much easier now. I’ve removed the old script. With AlphaFS, you can delete the folder with a single Powershell command. First, you need to install the AlphaFS module into Powershell, and the easiest way to do that is with PsGet.
So first, if you don’t have PsGet, run the command shown on their site:
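At the time, the bootstrap command documented on psget.net looked something like this - verify against the site before running it:

```powershell
(New-Object Net.WebClient).DownloadString("http://psget.net/GetPsGet.ps1") | Invoke-Expression
```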
Once it’s installed, import the PsGet module, and use it to install AlphaFS. Note the following command refers to what is currently the latest release of AlphaFS, but you might want to check for a later one:
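Check the AlphaFS releases page for the current package URL. Once the module is loaded, the deletion really is a single command - AlphaFS exposes a recursive Directory.Delete that handles long paths (the folder path below is a placeholder):

```powershell
Import-Module AlphaFS   # assumes the module was installed via PsGet as described
[Alphaleonis.Win32.Filesystem.Directory]::Delete("C:\Projects\SomeOldApp\node_modules", $true)
```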
One of the challenges with analyzing complex Exchange issues is data collection. Once the server goes into the failed state, any data collection at that point only shows us what the failed state looks like. It doesn’t show us how it went from working to failing, and sometimes, that’s what we need to see in order to solve the problem.
Certain types of data collection are fairly easy to just leave running all the time so that you can capture this transition from the working state to the failing state. For instance, you can typically start a perfmon and let it run for days until the failure occurs. Similarly, event logs can easily be set to a size that preserves multiple days worth of events.
Other types of data are not so easy to just leave running. Network traces produce so much data that the output needs to be carefully managed. You can create a circular capture, but then you have to be sure to stop the trace quickly at the right time before it wraps. The same applies to ExTRA traces, LDAP client traces, etc.
In several cases over the past year, I’ve solved this problem with a Powershell script. My most recent iteration of the script appears below, but I usually end up making small adjustments for each particular case.
In its current version, running the script will cause it to:
Start a chained nmcap. Note that it expects Network Monitor 3.4 to be present so it can use the high performance capture profile.
Start a circular ExTRA trace.
Start a circular LDAP client trace.
Wait for the specified events to occur.
While it waits, it watches the output folder and periodically deletes any cap files beyond the most recent 5. When the event in question occurs, it then:
Collects a procdump.
Stops the nmcap.
Stops the LDAP client trace.
Stops the ExTRA trace.
Saves the application and system logs.
All of these features can be toggled at the top of the script. You can also change the number of cap files that it keeps, the NIC you want to capture, the PID you want to procdump, etc.
The script will almost certainly need some slight adjustments before you use it for a particular purpose. I’m not intending this to be a ready-made solution for all your data collection needs. Rather, I want to illustrate how you can use Powershell to make this sort of data collection a lot easier, and to give you a good start on automating the collection of some common types of logging that we use for Exchange.
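To give a flavor of the core loop, here is a stripped-down sketch of just the “wait and prune” portion (the paths, trigger event, and retention count are placeholders, and the actual trace start/stop commands are omitted):

```powershell
$captureFolder = "C:\Data\Capture"
$eventLog      = "Application"
$eventId       = 2070            # hypothetical trigger event
$startTime     = Get-Date

while ($true) {
    # Keep only the 5 most recent chained capture files
    Get-ChildItem $captureFolder -Filter *.cap |
        Sort-Object LastWriteTime -Descending |
        Select-Object -Skip 5 |
        Remove-Item

    # Has the trigger event been logged since we started?
    $hit = Get-WinEvent -FilterHashtable @{ LogName = $eventLog; Id = $eventId; StartTime = $startTime } `
        -MaxEvents 1 -ErrorAction SilentlyContinue
    if ($hit) { break }

    Start-Sleep -Seconds 30
}

# At this point: collect the procdump, stop the traces, save the event logs, etc.
```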
In Exchange 2010 and older, when you mount a public folder database, the Information Store service asks Active Directory for the costs from this site to every other site that contains a public folder database. This is repeated about every hour in order to pick up changes. If a client tries to access a public folder which has no replica in the local site, Exchange uses the site cost information to decide where to send the client. This means that, as with so many other features, public folder referrals will not work properly if something is wrong with AD.
There are several steps involved in determining these costs.
Determine the name of the site we are in, via DsGetSiteName.
Determine the names of all other sites that contain PFs.
Bind to the Inter-Site Topology Generator, via DsBindToISTG.
Send the list of site names for which we want cost info, via DsQuerySitesByCost.
From the sites in the response, we will only refer clients to those between cost 0 and 500.
This gives us a lot of opportunities to break. For example:
Can’t determine the local site name.
Can’t bind to the ISTG.
The costs returned are either infinite (-1) or greater than 500.
I recently had a case where we were fighting one of these issues, and I could not find a tool that would let me directly test DsQuerySitesByCost. So, I created one. The code lives on GitHub, and you can download the binary by going to the Release tab and clicking DsQuerySitesByCost.zip:
Typically, you would want to run this from an Exchange server with a public folder database, so that you can see the site costs from that database’s point of view. The tool calls DsGetSiteName, DsBindToISTG, and DsQuerySitesByCost, so it should expose any issues with these calls and make it easy to test the results of configuration changes.
You can run the tool with no parameters to return costs for all sites, or you can pass each site name you want to cost as a separate command-line argument.
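For example (site names are hypothetical):

```powershell
# Cost every site the ISTG knows about
.\DsQuerySitesByCost.exe

# Cost only the named sites
.\DsQuerySitesByCost.exe "Redmond" "Charlotte"
```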
I keep deciding not to write this post, because Exchange 2010 is out of mainstream support. And yet, we are still getting these cases from time to time, so I suppose I will finally write it, and hopefully it helps someone.
In Exchange 2010, we had several bugs that led to the database leaking space. These bugs had to do with the cleanup of deleted items. When a client deletes items, those items go into a cleanup queue. A background process is supposed to come through and process the items out of the cleanup queue to free up that space. Unfortunately, that didn’t consistently work, and sometimes the space would be leaked.
There was an initial attempt to fix this, which did resolve many of the open cases I had at the time, but not all of them. Later, another fix went in, and this resolved the issue for all my remaining cases at the time. Both of those fixes were included in Exchange 2010 SP3 RU1.
After that, we occasionally still see a case where space is leaking even with these fixes in place. But every time we try to trace it so we can fix the problem, the act of turning on tracing fixes the behavior. I’ve been back and forth through that code, and there’s no apparent reason that the tracing should affect the way cleanup actually behaves. Nonetheless, in these rare cases where the fixes didn’t totally fix the problem, tracing fixes it every time. I wish I knew why.
The tracing workaround has its limitations, though. The cleanup queue is not persisted in the database, so tracing only works for an active leak where the database has not yet been dismounted. After the database is dismounted, any leaked space is effectively permanent at that point, and your best bet is to move the mailboxes off. When the entire mailbox is moved, that leaked space will be freed, since it was still associated with the mailbox.
So, how can you tell if you’re being affected by this problem? One option is to just turn on tracing:
Launch ExTRA.
Choose Trace Control. You’ll get a standard warning. Click OK.
Choose a location for the ETL file and choose the option for circular logging. You can make the file as large or as small as you want. It doesn’t really matter, since our goal here isn’t to look at the trace output.
Click the Set manual trace tags button.
At the top, check all eight Trace Type boxes.
Under Components to Trace, highlight the Store component (but don’t check it).
In the Trace Tags for store on the right, check the box next to tagCleanupMsg. We only need this one tag.
Click Start Tracing at the bottom.
Let the trace run for a day or two and observe the effect on database whitespace. If you see significant amounts of space being freed with tracing on, then you’re hitting this problem. Again, this only works if the database has not been dismounted since the space leaked.
Another option is to analyze the database space to see if you’re hitting this problem. Here’s how you do that.
Dismount the database and run
eseutil /ms /v "C:\databases\somedatabase.edb" > C:\spacereport.txt
For the same database, launch Exchange Management Shell and run
Get-MailboxStatistics -Database SomeDatabase | Export-Csv C:\mailboxstatistics.csv
Use my Analyze-SpaceDump.ps1 script to parse the spacereport.txt:
.\Analyze-SpaceDump.ps1 C:\spacereport.txt
Look for the “Largest body tables” at the bottom of the report. These are the largest mailboxes in terms of the actual space they use in the database. These numbers are in megabytes, so if it reports that a body table owns 7000, that means that mailbox owns 7 GB of space in the database.
Grab the ID from the body table. For example, if the table is Body-1-ABCD, then the ID is 1-ABCD. This will correspond to the MailboxTableIdentifier in the mailboxstatistics.csv.
Find that mailbox in the statistics output and add up the TotalItemSize and TotalDeletedItemSize. By comparing that against how much space the body table is using in the database, you know how much space has leaked.
It’s often normal to have small differences, but when you see that a mailbox has leaked gigabytes, then you’re hitting this problem.
You can also compare the overall leaked size with some quick Powershell scripting. When I get these files from a customer, I run the following to add up the mailbox size from the mailbox statistics csv:
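It’s roughly this sort of thing (a sketch, not necessarily the exact one-liner). Export-Csv stores TotalItemSize and TotalDeletedItemSize as strings like “1.23 GB (1,316,134,342 bytes)”, so pull the byte count out of the parentheses and total it up:

```powershell
$stats = Import-Csv C:\mailboxstatistics.csv
$totalBytes = 0
foreach ($row in $stats) {
    foreach ($value in @($row.TotalItemSize, $row.TotalDeletedItemSize)) {
        if ($value -match '\(([\d,]+) bytes\)') {
            $totalBytes += [long]($matches[1] -replace ',', '')
        }
    }
}
"{0:N0} MB" -f ($totalBytes / 1MB)
```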
This gives you the size in megabytes as reported by Get-MailboxStatistics. Then, you can go look at the Analyze-SpaceDump.ps1 output and compare this to the “Spaced owned by body tables”, which is also in megabytes. The difference between the two gives you an idea of how much total space has leaked across all mailboxes.
Ultimately, the resolution is usually to move the mailboxes. If the database has not been dismounted, you can turn on tagCleanupMsg tracing to recover the space.
The SP3 RU1 fixes made this problem extremely rare in Exchange 2010, and the store redesign in Exchange 2013 seems to have eliminated it completely. As of this writing, I haven’t seen a single case of this on Exchange 2013.
I’ve written a couple of previous posts on the corrupt TNEF issue that causes this error:
Microsoft.Exchange.Data.Storage.ConversionFailedException: The message content has become corrupted. ---> Microsoft.Exchange.Data.Storage.ConversionFailedException: Content conversion: Failed due to corrupt TNEF (violation status: 0x00008000)
Previously, the solution was the Delete-TNEFProps.ps1 script. Unfortunately, that script has some limitations. Most notably, it cannot fix attachments. This is a big problem for some environments where we have a lot of items with these properties on them.
I attempted to find a way to make the script remove the problem properties from attachments, but I could not figure out how to do it. Either this is impossible with EWS, or I’m missing an obscure trick. I finally gave up and went a different route.
For some time, I’ve been (slowly) working on a new tool called MAPIFolders. It is intended as a successor to PFDAVAdmin and ExFolders, though it is still fairly limited compared to those tools. It is also a command-line tool, unlike the older tools. However, it does have some advantages, such as the fact that it uses MAPI. This means it is not tied to deprecated APIs and frameworks like PFDAVAdmin was, and it doesn’t rely on directly loading the Exchange DLLs like ExFolders does. It can be run from any client machine against virtually any version of Exchange, just like any other MAPI client.
Also, because it’s MAPI, I can make it do almost anything, such as blowing away the properties on nested attachments and saving those changes.
Thanks to a customer who opened a case on the TNEF problem, I was able to test MAPIFolders in a significantly large public folder environment with a lot of corrupted TNEF items. After a bit of debugging and fixing things, MAPIFolders is now a far better solution to the TNEF issue than the Delete-TNEFProps script. It can remove the properties from attachments and even nested attachments.
In this series of posts, I’m going to discuss three basic approaches to searching the content of Exchange mailboxes, and the tradeoffs that come with them. This series is for developers who are writing applications that talk to Exchange, or scripters who are using EWS Managed API from Powershell. I’m not going to be talking about New-MailboxSearch or searching from within Outlook, because in that case, the client code that executes the search is already written. This series is for people writing their own Exchange clients.
There are three basic ways to search a mailbox in Exchange Server:
1. Sort a table and seek to the items you’re interested in. This approach is called a sort-and-seek.
2. Hand the server a set of criteria and tell it to only return items that match. This is the Restrict method in MAPI and FindItems in EWS (see the sketch after this list).
3. Create a search folder with a set of criteria, and retrieve the contents of that folder to see the matching items.
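Since this series is aimed partly at EWS Managed API scripters, here is a rough sketch of approach 2 - a FindItems restriction - from Powershell. The DLL path and mailbox address are assumptions; adjust them for your environment.

```powershell
# Load the EWS Managed API (path is an assumption)
Add-Type -Path "C:\Program Files\Microsoft\Exchange\Web Services\2.2\Microsoft.Exchange.WebServices.dll"

$service = New-Object Microsoft.Exchange.WebServices.Data.ExchangeService
$service.UseDefaultCredentials = $true
$service.AutodiscoverUrl("user1@contoso.com")   # hypothetical mailbox

# Approach 2: hand the server the criteria and let it return only the matches
$filter = New-Object Microsoft.Exchange.WebServices.Data.SearchFilter+ContainsSubstring(
    [Microsoft.Exchange.WebServices.Data.ItemSchema]::Subject, "invoice")
$view  = New-Object Microsoft.Exchange.WebServices.Data.ItemView(100)
$inbox = New-Object Microsoft.Exchange.WebServices.Data.FolderId(
    [Microsoft.Exchange.WebServices.Data.WellKnownFolderName]::Inbox)

$results = $service.FindItems($inbox, $filter, $view)
$results | ForEach-Object { $_.Subject }
```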
For most of Exchange Server’s history, approaches 2 and 3 were implemented basically the same way. Using either approach caused a table to be created in the database. These tables contained a small amount of information for each item that matched the search, and the tables would hang around in the database for some amount of time. These tables were called cached restrictions or cached views. I’m going to call them cached restrictions, because that was the popular terminology when I started supporting Exchange.
Recorded history basically starts with Exchange 5.5, so let’s start there. Exchange 5.5
saved every single restriction for a certain amount of time. This meant that the first time
you performed an IMAPITable::Restrict()
on a certain folder, you would observe a delay while
Exchange built the table. The second time you performed an IMAPITable::Restrict() on the
same folder with the same restriction criteria, it was fast, because the restriction had been
cached - that is, we now had a table for that restriction in the database, ready to be reused.
Exchange 5.5 continued keeping the cached restriction up to date as the content of the mailbox
changed, just in case the client asked for that same search again. Every time a new item came into
the Inbox, Exchange would update every cached restriction which was scoped to that folder.
Unfortunately, this created a problem. If you had a lot of users sharing a mailbox, or you had
an application that performed searches for lots of different criteria, you ended up with lots
of different cached restrictions - possibly hundreds. Updating hundreds of cached restrictions
every time a new email arrived got expensive and caused significant performance issues.
As Exchange matured, changes were introduced to deal with this issue.
In Exchange 2003, a limit was put in place so Exchange would only cache 11 restrictions for a given folder
(adjustable with msExchMaxCachedViews or PR_MAX_CACHED_VIEWS). This prevented hundreds of
cached restrictions from accumulating for a folder, and neatly avoided that perf hit.
However, this meant that if
you had a user or application creating a bunch of one-off restrictions, the cache would keep
cycling and no search would ever get satisfied from a cached restriction unless you adjusted these values.
If you set the limit too high, then you reintroduced the performance problems that the limit had fixed.
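For the curious, the directory side of that knob can be inspected without touching Exchange at all. Here is a hedged sketch, assuming the attribute lives on the mailbox database (mailbox store) objects in the configuration container:

Import-Module ActiveDirectory
# List mailbox database objects from the configuration container and show the cached-view
# limit, if one has been set. msExchPrivateMDB is assumed to be the relevant object class.
$configNC = (Get-ADRootDSE).configurationNamingContext
Get-ADObject -SearchBase $configNC -LDAPFilter "(objectClass=msExchPrivateMDB)" -Properties msExchMaxCachedViews |
    Select-Object Name, msExchMaxCachedViews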
In Exchange 2010, cached restrictions were changed to use dynamic updates instead of updating every time
the mailbox changed. This made it less expensive to cache lots of restrictions, since they didn’t all have
to be kept up to date all the time. However, you could still run into situations where an
application performed a bunch of one-off searches which were only used once but were then cached.
When it came time to clean up those cached restrictions, the cleanup task could impact performance. We
saw a few cases where Exchange 2010 mailboxes would be locked out for hours while the Information Store tried
to clean up restrictions that were created across hundreds of folders.
In Exchange 2013 and 2016, the Information Store is selective about which restrictions it caches.
As a developer of a client, you can’t really predict whether your restriction is going to
get cached, because this is a moving target. As Exchange 2013 and 2016 continue to evolve, they may cache
something tomorrow that they don’t cache today. If you’re going to use the same search repeatedly
in modern versions of Exchange, the only way to be sure the restriction is cached is to create a search
folder. This is the behavior change described in KB 3077710.
In all versions of Exchange, it was always important to think about how you were searching and try
to use restrictions responsibly. Exchange 2013 and 2016 are unique in that they basically
insist that you create a search folder if you want your restriction to be cached.
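To make that concrete, here is a minimal sketch of approach 3 - creating a search folder with the EWS Managed API from PowerShell. It reuses the $service object from the FindItems sketch above, and the display name and filter are placeholder choices.

# Same unread-items criteria as before, but persisted as a search folder so the store
# maintains the restriction for us.
$filter = New-Object "Microsoft.Exchange.WebServices.Data.SearchFilter+IsEqualTo" ([Microsoft.Exchange.WebServices.Data.EmailMessageSchema]::IsRead, $false)

$searchFolder = New-Object Microsoft.Exchange.WebServices.Data.SearchFolder($service)
$searchFolder.DisplayName = "Unread Inbox items"  # placeholder name
$searchFolder.SearchParameters.Traversal = [Microsoft.Exchange.WebServices.Data.SearchFolderTraversal]::Shallow
$searchFolder.SearchParameters.RootFolderIds.Add((New-Object Microsoft.Exchange.WebServices.Data.FolderId([Microsoft.Exchange.WebServices.Data.WellKnownFolderName]::Inbox)))
$searchFolder.SearchParameters.SearchFilter = $filter
$searchFolder.Save([Microsoft.Exchange.WebServices.Data.WellKnownFolderName]::SearchFolders)

# The matching items are then retrieved by querying the search folder's contents,
# for example with FindItems against the new folder's Id.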
The next post in this series will explore some sample code that illustrates differences between
Exchange 2013 and Exchange 2010 restriction behavior.
Two years ago, I dove into the wonderful world of static blog generators when I
left my TechNet blog behind and started using Jekyll
to generate an Azure web site. With my newfound freedom from complex content
management systems, I raved about Jekyll in a blog post. But once the honeymoon
was over, some cracks started to appear in the relationship.
Jekyll does not officially support Windows, so you have to jump through some
hoops to get it up and running. This didn’t seem so bad at first, but I’m one
of those people who is constantly tinkering with my PC, buying new hardware,
and upgrading things, so I end up doing a clean install of my OS several times
a year.
Back in the day, a clean install of Windows was a daunting prospect, but these days,
it only takes minutes. The Windows install itself is pretty fast, and I
have several Boxstarter scripts that use
Chocolatey to install all the software I use. This
means getting back up and running is fairly painless - except for Jekyll.
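For what it's worth, those scripts are just PowerShell: Boxstarter runs a package script and Chocolatey does the installs. A trimmed-down, hypothetical example (not my actual script, and the package names are only illustrative):

# Hypothetical Boxstarter package script
Enable-RemoteDesktop        # Boxstarter helper
choco install git -y
choco install nodejs-lts -y
choco install vscode -y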
It seemed like every time I got a clean install, that fresh new clean OS feeling
was soon soured by errors from Jekyll. The hoops I had to jump through to get it
up and running would change slightly each time due to changes in
Ruby or problems
with gems. For a while, I dealt with this issue by blogging from one of my Ubuntu
VMs.
Finally, I started shopping around for something not based on Jekyll and preferably
with no Ruby dependency at all. There are a lot of options,
but for now, I’ve settled on Hexo.
Hexo is powered by Node.js, and since I’m a big fan of
JavaScript and a big fan of npm, this seems like a natural fit. Maybe this will
be enough motivation to continue the series I left off with, or at least to write
a new technical post of some kind.
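Getting started with Hexo is about as simple as static blogging gets. A quick sketch, assuming Node.js and npm are already installed (the folder name is a placeholder):

npm install -g hexo-cli    # install the Hexo command line
hexo init my-blog          # scaffold a new site (placeholder folder name)
cd my-blog
npm install                # pull in the site's dependencies
hexo new "Hello Hexo"      # create a new post as a Markdown file
hexo generate              # build the static site into the public folder
hexo server                # preview it locally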
It’s interesting how fairly obvious settings can break things in very non-obvious ways.
We recently had a case where the customer was not able to create new mailboxes on Exchange 2007. This had worked fine prior to applying some updates. After the updates, the New-Mailbox cmdlet began failing with an error indicating that an address generator DLL was missing.
That error was a little misleading. The application log showed a very different error:
ID: 2030
Level: Error
Source: MSExchangeSA
Message: Unable to find the e-mail address 'smtp:someone@contoso.com SMTP:someone@contoso.com' in the directory. Error '80072020'.
That error code is ERROR_DS_OPERATIONS_ERROR - not very specific. After a lot of tracing, we eventually found that when we created the new mailbox, Exchange was generating the new email addresses and then searching to see if they exist. The expected result is that the search returns 0 results, so we know the new addresses are unique. But in this case, wldap32 was returning code 1, LDAP_OPERATIONS_ERROR.
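You can reproduce the uniqueness check Exchange performs with a simple proxyAddresses lookup. A hedged sketch using the ActiveDirectory module, with the example address from the event above:

Import-Module ActiveDirectory
# No results means the address is unique; any result means it is already in use.
Get-ADObject -LDAPFilter "(proxyAddresses=smtp:someone@contoso.com)" -Properties proxyAddresses |
    Select-Object Name, DistinguishedName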
We used psexec -i -s ldp.exe to launch ldp as localsystem on the Exchange server, and then connected to the DCs. Choosing to bind as Current logged-on user showed that we bound as the computer account, as expected, and the searches worked fine. Then, some additional tracing revealed that we were not connecting to the DCs by name - we were connecting to the domain name, contoso.com in this example.
When we used ldp to connect to the domain name, something interesting happened - we were no longer able to bind as the computer account. The bind would succeed, but would return NT AUTHORITY\Anonymous Logon. Attempting to search while in that state produced:
***Searching...
ldap_search_s(ld, "(null)", 2, "(proxyAddresses=smtp:user@contoso.com)", attrList, 0, &msg)
Error: Search: Operations Error. <1>
Server error: 000004DC: LdapErr: DSID-0C0906E8, comment: In order to perform this operation a successful bind must be completed on the connection., data 0, v1db1
Error 0x4DC The operation being requested was not performed because the user has not been authenticated.
Result <1>: 000004DC: LdapErr: DSID-0C0906E8, comment: In order to perform this operation a successful bind must be completed on the connection., data 0, v1db1
Getting 0 entries:
That was exactly what we were looking for! Operations error, code 1, which is LDAP_OPERATIONS_ERROR. At this point, we turned our attention to understanding why we could not authenticate to the domain name, when authenticating to the server name worked fine. After all, connecting to the domain name just connected us to one of the DCs that we had already tested directly - we could see that by observing the dnsHostName value. So why would the name we used to connect matter?
The Active Directory engineers eventually discovered that the _sites container, _tcp container, and other related DNS entries were all missing. Dynamic DNS had been disabled in this environment. Once it was enabled, everything worked.
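A quick way to confirm whether those records exist is to query the SRV records directly. A minimal sketch with Resolve-DnsName, reusing the contoso.com placeholder from earlier:

# These SRV records are what clients (and Exchange) use to locate domain controllers;
# if dynamic DNS registration is disabled, they may be missing.
Resolve-DnsName -Name "_ldap._tcp.contoso.com" -Type SRV
Resolve-DnsName -Name "_ldap._tcp.dc._msdcs.contoso.com" -Type SRV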
The moral of this story is to be careful when you disable a setting that is, at face value, a simple and obvious thing. The effects can ripple out in very unexpected ways.