In software development, the running joke has always been, "it works on my machine." Whether it's broken builds, configuration confusion, or a botched command against the repository – it's easy to get into a bad state where no one else can seem to figure out why things just don't work. Even with careful planning and considerable testing in multiple environments, sometimes you publish your shiny, polished code to Production and nothing seems to work. I now present you with such a tale.
A critical component in the point-of-sale (POS) system that Workstate helped to develop for a multibillion-dollar retailer was a tiny websocket server that lived on the client machine. This component was built before we were engaged, it was tightly coupled to much of the application, and not many people knew it well. What everyone seemed to agree on was that it just worked. It worked on the dev machines, worked in the QA environment, and worked in the lab that simulated an actual production environment.
The week before production release, it was determined that all connections needed to be secured, including the websocket server. This involved having a certificate issued and loaded, changing the scheme for multiple endpoints, and testing all of the integrations related to the change. Fairly routine, except that SSL surfaced some interesting timing issues in the legacy JavaScript that needed to be addressed (that's a story for another day).
A few days before release, the secured build made it to the property lab and just would not run. I was asked to investigate, so I set out to learn exactly what our inherited socket-handler was doing. I went to work, quickly spinning up a major logging effort and adding trace logs throughout the major pieces of the application. I added some "sane" error handling logic, then turned back to the lab to figure out what was happening. Immediately it was clear – the local user didn't have permission to open the local machine certificate. This is where things got much, much worse.
The infrastructure team was brought in to help, as they were the only ones who had access to make changes to the system. The head of the network team logged in and ran the app – the cert loaded and things seemed to work (for the most part) as expected. Additional access was granted to the user account that would be used, but the cert still didn't load. Due to the imminent release, an executive decision was made to create a domain account with an extremely high permission level and run the application as that user. Problem solved (for now).
As the first site went live, things started to get weird. Most notably, the thoroughly vetted print process was failing and throwing an error (actually, several different errors). Enter more logging! The lab PC was shipped to the site and the logs were captured. What we found was that we couldn't retrieve the default printer. If you went to the Printers dialog, the default printer displayed. The local user had rights to modify the default printer, as well as add/remove devices, but the application running as the highest-order domain admin could not pull the default printer.
At first, I thought that perhaps the winspool driver was kicking off on a thread that ignored the account override and therefore didn't have permission to perform the operation. We ran through some scenarios to check thread identities and force additional impersonation, but nothing worked. No one was ready to give up, but we were very perplexed. Everything worked perfectly when logging onto the machine as the domain account: certificates loaded, and printing worked flawlessly. But when we logged on as the local user and ran the app as the same domain account, printing failed.
Finally, an aha moment struck me. The code was trying to find the user's default printer, not the default printer for the machine! Because the app was running as a disconnected domain account, it wasn't loading the default printer information that the logged-on local user saw, but rather whatever default printer might be set for that account. This account didn't have a default printer (let alone one on that machine), so the app could not load the printer to be used. We added permissions to the certificate that allowed access by the local user, stripped all administrative access, and did one final test on all peripherals. Mystery solved – everything worked as it did in the lab.
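For the curious, here is a minimal Python sketch of why that lookup came back empty. It is not the code from the POS application. It assumes the common Windows behavior where each user's default printer is stored in that user's own registry hive (`HKEY_CURRENT_USER`) as a `Device` value shaped like `Name,winspool,Port`, and the function names are hypothetical.

```python
import sys
from typing import Optional


def parse_device_value(device: str) -> Optional[str]:
    """Return the printer name from a 'Name,winspool,Port' string, or None if empty."""
    name = device.split(",", 1)[0].strip()
    return name or None


def current_user_default_printer() -> Optional[str]:
    """Look up the default printer for the account this process runs as.

    Assumption: the per-user default lives in the 'Device' value under
    HKEY_CURRENT_USER. Returns None off Windows or when no default is set.
    """
    if sys.platform != "win32":
        return None

    import winreg  # Windows-only standard library module

    key_path = r"Software\Microsoft\Windows NT\CurrentVersion\Windows"
    try:
        with winreg.OpenKey(winreg.HKEY_CURRENT_USER, key_path) as key:
            device, _value_type = winreg.QueryValueEx(key, "Device")
    except OSError:
        # Key or value is missing: this account has no default printer.
        return None
    return parse_device_value(str(device))


if __name__ == "__main__":
    print("Default printer for this account:", current_user_default_printer())
```

Run from the interactive user's session, a lookup like this would be expected to report the printer shown in the Printers dialog. Launched under a different account, the same code reads that account's profile instead, which in our case had no default printer at all.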
So what did we learn?
- Foremost, when it comes to printing – do not run an application via a domain service account if it needs access to profile-related details such as the user's default printer. It seems obvious in hindsight, but it had not occurred to any of the tenured engineers and developers working on the problem at the time.
- Pressure and timing can lead to poor decisions. The domain account seemed like a silver bullet, which would buy everyone more time to get to the root of the issue. It turns out that our silver bullet was actually the cause of a much harder problem to solve.
- If you get a lot of smart people in a room, everyone will still reach for their biggest hammer to solve the problem. Sometimes that hammer smashes things other than the nails. Oops.
All in all, it could have been a lot worse; we learned some great lessons and had an uncompromised outcome. We were also very lucky to have a team of dedicated and intelligent partners on the infrastructure team that really helped us with the troubleshooting process.
We love working with partners who enjoy collaboration and sharing in mutual success. Reach out to us today if you are interested in building a great relationship with a trusted professional services provider.