On Monday, January 11th, one of our software delivery teams will write the first business application for our white-glove .NET Kubernetes internal developer platform. To support this application running in production-grade fashion by late January or early February, there are three major tasks to complete.
- Enterprise object storage for the platform
- Logging aggregation and monitoring
- Deployment to a secondary site and disaster resiliency verification
Enterprise object storage for the platform
Our most challenging constraint is procuring an enterprise object storage system that our storage administrators feel comfortable operating. As I've written before, the organization I work for is extremely hesitant to put any workload into the public cloud unless it absolutely must. Our storage administrators are focused on providing excellent, reliable service. Their focus is on running and maintaining systems, and any solution that does not come with an excellent user interface is not going to be maintainable in the long run. Kubernetes tooling assumes that object storage is one of the primary storage options available, and the velocity of the devlab platform project presents a challenge to our storage administrators. The amount of data we need stored in object storage is minimal, maybe 50GB. Over time that could grow, but what we are really after is resilient, replicated storage across data centers for our container images (Harbor) and etcd backups. You might be reading this and thinking, umm, why not use MinIO and be done with it for such a paltry amount of data? Unfortunately, it's not an option our storage administrators can support. I'm hopeful we can brainstorm an interim solution until we find an enterprise object storage system that we feel comfortable running.
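Whatever backend we land on, the platform components mostly speak the S3 API, so any interim solution really just needs to expose an S3-compatible endpoint. As a rough sketch of what wiring Harbor to such an endpoint looks like (field names follow the goharbor/harbor Helm chart and may differ by chart version; the endpoint and bucket names are placeholders):

```yaml
# values.yaml fragment for the goharbor/harbor Helm chart.
# Hypothetical endpoint/bucket names -- verify exact field names
# against the chart version in use.
persistence:
  imageChartStorage:
    type: s3
    s3:
      region: us-east-1            # many S3-compatible systems ignore this
      bucket: harbor-registry
      regionendpoint: https://objectstore.dc1.example.internal
      secure: true
      v4auth: true
      # accesskey/secretkey are better supplied via a Kubernetes Secret
      # than inlined in values
```

The same endpoint-plus-bucket pattern applies to the etcd backup tooling, which is part of why a single S3-compatible target covers all of our ~50GB of needs.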
Logging Aggregation and Monitoring
One of the benefits of working on a new platform is that we can walk away from many of the old log aggregation standards we are used to. In this new solution, we are walking away from a very expensive log aggregation tool that is supposed to be focused on security but that we've used for everything else as well. We're now looking at other options on the market that can give us a single place to view our logs, set up alerts, and do many of the fancy tracing tricks we're used to. If we're lucky, we can also have a place to collect Prometheus metric data and create additional Grafana dashboards. Along with that, we are working to set up normal alerting and monitoring of our Kubernetes clusters across data centers. There's nothing too difficult here; with Kubernetes, thankfully, it's easy to redirect logs using Fluentd.
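The redirection itself is just a Fluentd config: a DaemonSet tails container logs on each node and a match block forwards them wherever we land. A minimal sketch, assuming a Fluentd DaemonSet in a `logging` namespace; the aggregator host name is a placeholder, and the parse section depends on the container runtime's log format:

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: fluentd-config
  namespace: logging
data:
  fluent.conf: |
    # Tail container logs from each node (Fluentd runs as a DaemonSet)
    <source>
      @type tail
      path /var/log/containers/*.log
      pos_file /var/log/fluentd-containers.pos
      tag kubernetes.*
      <parse>
        @type json
      </parse>
    </source>
    # Forward everything to the aggregation tool we eventually pick.
    # Hypothetical host; a vendor-specific output plugin would replace
    # @type forward once we choose a product.
    <match kubernetes.**>
      @type forward
      <server>
        host logs.dc1.example.internal
        port 24224
      </server>
    </match>
```

Because the output is a single match block, swapping vendors later means changing one stanza rather than re-plumbing every cluster.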
Deployment to a Secondary Site
We’re in the process of updating our GitLab pipeline to push Helm deployments to different environments and multiple data centers. We also need to figure out how to keep our Rancher Kubernetes clusters in sync across data centers. We only expect to have one Kubernetes cluster per data center, so keeping those clusters in sync with each other is important to automate. Rancher Fleet might be an option there, or there might be other solutions on the market. Once we complete these tasks, there will be work to figure out how specific disaster scenarios would be handled and to either automate that failover or write out the manual process for it. It seems that with Kubernetes' built-in resiliency, it will more often than not be better to wait out most problems rather than spend the time starting workloads in another data center.
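Fleet's model fits the one-cluster-per-data-center layout well: a GitRepo resource on the Rancher management cluster targets downstream clusters by label, so every data center converges on the same manifests from Git. A sketch, assuming Fleet is installed; the repo URL, paths, and labels are placeholders:

```yaml
# Rancher Fleet GitRepo: applies the manifests under `base/` to every
# downstream cluster whose labels match the selector.
apiVersion: fleet.cattle.io/v1alpha1
kind: GitRepo
metadata:
  name: platform-base
  namespace: fleet-default
spec:
  repo: https://gitlab.example.internal/platform/cluster-config
  branch: main
  paths:
    - base
  targets:
    - name: all-datacenters
      clusterSelector:
        matchLabels:
          env: production
```

With this approach the GitLab pipeline only has to commit rendered changes to the config repo; Fleet handles fan-out to both data centers, which keeps cluster drift from creeping in between deploys.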
After these three challenges are completed, I'll feel comfortable declaring our white-glove platform production-grade. The first use case will be a proxy API triggered by our Camunda BPMN workflow tool. The platform will then graduate to beta status, and further adoption by select teams will continue. It's possible that after six months to a year we can graduate the platform to general availability and begin to allow onto the platform any team or project that fits within the buckets of API, worker, or scheduled job.
I’m excited to share the new challenges we will face in the future. We have a solid team of people with different strengths and abilities collaborating on an exciting platform. Technology is not a barrier for us; the challenges ahead are training and onboarding teams.
However, there is one area I regret: the speed of development of this platform. It places a large burden on the coworkers in charge of staffing the systems that must now be maintained. I continue to reflect on what I could do differently to make this sudden change in the enterprise less jarring and more acceptable for everyone involved.