Managing Machine Learning Workloads Using Kubeflow on AWS with D2iQ Kaptain
There are four main impediments to successful adoption of AI/ML in the cloud-native enterprise:
- Novelty: Cloud-native ML technologies have only been developed in the last five years.
- Complexity: The market offers a bewildering number of cloud-native and AI/ML tools, making it difficult to select and combine the right ones.
- Integration: Only a small fraction of a production ML system is model code; the rest is the glue code needed to make the overall process repeatable, reliable, and resilient.
- Security: Data privacy and security are often afterthoughts during the process of model creation but are critical in production.
Kubernetes would seem to be an ideal way to address some of the obstacles to getting AI/ML workloads into production. It is inherently scalable, which suits the varying capacity requirements of training, tuning, and deploying models.
Kubernetes is also hardware agnostic, working across a wide range of infrastructure platforms, and Kubeflow, the self-described ML toolkit for Kubernetes, builds on it to provide a Kubernetes-native platform for developing and deploying ML systems.
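To make the scalability point concrete, here is a minimal sketch of a Kubernetes Job that requests a GPU for a training run. The job name, container image, and entry point below are hypothetical placeholders, not part of this article:

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: train-model                                  # hypothetical job name
spec:
  template:
    spec:
      containers:
      - name: trainer
        image: registry.example.com/ml/train:latest  # placeholder image
        command: ["python", "train.py"]              # placeholder entry point
        resources:
          limits:
            nvidia.com/gpu: 1                        # ask the scheduler for a GPU node
      restartPolicy: Never
  backoffLimit: 2                                    # retry a failed training run twice
```

Because the manifest only declares what the workload needs, the same Job runs unchanged on any cluster that has GPU nodes, which is what makes Kubernetes hardware agnostic for training, tuning, and serving.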
Unfortunately, Kubernetes can introduce complexities as well, particularly for data scientists and data engineers who may not have the bandwidth or desire to learn how to manage it. Kubeflow has its own challenges, too, including difficulties with installation and with integrating its loosely coupled components, as well as poor documentation.
In this post, we’ll discuss how D2iQ Kaptain on Amazon Web Services (AWS) directly addresses the challenges of moving machine learning workloads into production, the steep learning curve for Kubernetes, and the particular difficulties Kubeflow can introduce.
D2iQ is an AWS Containers Competency Partner, and D2iQ Kaptain is an enterprise Kubeflow product that enables organizations to develop and deploy machine learning workloads at scale. It satisfies an organization's security and compliance requirements, minimizing operational friction and meeting the needs of all teams involved in a successful ML project.