At work, we’ve been applying the full press on getting over to a containerized world where our DevOps and production machines are definable as code (Infrastructure-as-Code) and treated like cattle instead of pets. But every once in a while, we get a reality check on Microsoft’s container support and *cough* reliability.
Containers are the new hotness. Microsoft’s use of containers is still maturing. The combination is forcing our hand into accepting the risk associated with bleeding-edge technology, and with that come the trials and tribulations that make you want to pull your hair out from time to time.
Microsoft is playing catch-up to the maturity level already enjoyed by the Linux ecosystem. Hopefully they get there quickly. I figured I’d highlight a couple of roadblocks that we encountered on the way to production in the hopes that someone out there might find this information relevant and be able to apply these findings to the problems they are seeing.
Watch out for the Hyper-V Crash
We had the brilliant idea of leveraging our already established VMware infrastructure, provisioning processes, DR/HA and IT staff to administer all of it. Surely, it should have been a no-brainer to leverage what we already had instead of buying new physical hardware, finding rack space, applying a backup strategy, and so on.
Well, getting containers to work on VMs running Windows Server 2019 was a breeze (3 easy PowerShell commands: https://docs.microsoft.com/en-us/virtualization/windowscontainers/deploy-containers/deploy-containers-on-server).
What we didn’t expect was the trouble we had dealing with Hyper-V adapters on the Windows OS. Hyper-V was crashing and causing containers to restart every 30 seconds. All we saw was the vmcompute.exe process (Hyper-V) dropping and restarting on that exact interval with no crash dump file in sight. We tried using process isolation instead of Hyper-V isolation, but it appeared Hyper-V networking was still getting leveraged…and crashing.
It turned out that the version of VMTools that we were using changed how Docker was working with Hyper-V. Once we reverted to an older VMTools within the virtual machine, the containers were rock solid again and would run for weeks without a single restart needed.
In the end, we didn’t need to run our containers inside of Hyper-V at all. We used process isolation to keep it simple. This, plus the reversion of VMTools, gave us the SLAs we were looking for around reliability.
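For anyone who wants to script this, here is a minimal sketch of how a container launch with process isolation could be assembled. It relies on the `--isolation` option of `docker run` on Windows (which accepts `default`, `process` or `hyperv`). The helper name `docker_run_command` and the image name in the comment are made up for illustration and are not what we actually ran.

```python
# Isolation modes accepted by `docker run --isolation` on Windows hosts.
VALID_ISOLATION = ("default", "process", "hyperv")


def docker_run_command(image, isolation="process", extra_args=()):
    """Build the argv list for `docker run` with an explicit isolation mode.

    `docker_run_command` is a hypothetical helper; it only builds the
    command and does not start anything.
    """
    if isolation not in VALID_ISOLATION:
        raise ValueError(f"unknown isolation mode: {isolation!r}")
    # Options must come before the image name in `docker run`.
    return ["docker", "run", "--rm", f"--isolation={isolation}", *extra_args, image]


# To actually launch the container (requires Docker on a Windows host):
#   import subprocess
#   subprocess.run(docker_run_command("example/build-agent:1.0"), check=True)
```

Spelling out the isolation mode on every launch, rather than relying on the host default, at least makes it obvious in the logs which mode a container was started with.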
Word on the street is that VMware is looking to improve their VMTools offering to overcome this issue.
Use Caution When Referencing Generic Tags
Version-agnostic tags (like “latest”) in Docker have their place. If you do not care that the image behind a tag can change on a whim, or about the stability you give up when it does, these generic tags can make the overall maintenance of your containers and their deployments easier.
This does come at a cost. In most enterprise software scenarios, the goal is to control as many of the variables of the environment as you can. This allows for repeatability and predictability in the development, testing and deployment process, which I’ve found to drive up the team’s velocity.
Generic tags (like “latest”) throw risk into the equation because you never know what you’re going to get when it comes time to create that container.
A recent example of this that we experienced was leveraging Microsoft’s .NET Framework SDK image as the base image for our build server containers. We were using “mcr.microsoft.com/dotnet/framework/4.8”, which assumed the tag of “latest”. When Microsoft published a newer version of this SDK image, that version broke some build tools. It left us scratching our heads for days. We just couldn’t pinpoint why a container generated one day stopped working the next even though the Dockerfile never changed. We immediately assumed infrastructure or a bad configuration setting was the culprit.
What happened is that the latest 4.8 SDK image that got pulled into our Docker image could no longer run certain 32-bit applications. We definitely had a couple of 32-bit apps that needed to get launched (including…ironically…Microsoft’s own msbuild.exe).
Once we identified that it was the SDK base image causing the problem, we reverted to an older version of the image using a more specific tag. In our case, we used the “4.8-20200114-windowsservercore-ltsc2019” tag and everything was back in a working state.
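One way to catch this earlier is a small check that scans a Dockerfile for base images with no tag or with a known generic tag. The sketch below is just that, a sketch: the function names, the list of “floating” tags, and the full image references in the sample (including the `sdk` repository path) are assumptions for illustration, not something lifted from our build pipeline.

```python
# Tags assumed to float to whatever was published most recently
# (an illustrative list, not an official one).
FLOATING_TAGS = {"latest", "stable", "lts"}


def image_tag(reference):
    """Return the tag of an image reference, or None when no tag is given."""
    name = reference.split("@", 1)[0]        # drop a digest if present
    last_segment = name.rsplit("/", 1)[-1]   # ignore a registry port like host:5000
    if ":" in last_segment:
        return last_segment.split(":", 1)[1]
    return None


def unpinned_images(dockerfile_text):
    """List FROM references that have no tag or use a floating tag.

    Simplification: stage aliases and `scratch` are not special-cased,
    so `FROM build` in a multi-stage file would be flagged as well.
    """
    flagged = []
    for line in dockerfile_text.splitlines():
        parts = line.split()
        if len(parts) < 2 or parts[0].upper() != "FROM":
            continue
        # Skip options such as --platform=... that precede the image.
        args = [p for p in parts[1:] if not p.startswith("--")]
        if not args:
            continue
        ref = args[0]
        if "@" in ref:
            continue  # pinned by digest
        tag = image_tag(ref)
        if tag is None or tag in FLOATING_TAGS:
            flagged.append(ref)
    return flagged


sample = """\
FROM mcr.microsoft.com/dotnet/framework/4.8
FROM mcr.microsoft.com/dotnet/framework/sdk:4.8-20200114-windowsservercore-ltsc2019 AS build
"""
print(unpinned_images(sample))
```

Keep in mind that a check like this only catches missing or obviously generic tags. A version-only tag can still be republished with new contents, so a date-stamped tag like the one we ended up on (or a digest) is the safer pin.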
After an hour of licking our paws, we learned our lesson about when it’s appropriate (and inappropriate) to leverage generic tags.
Hopefully these two lessons can help someone out there who might be wrestling with these same types of problems.