Simplify AI Operations to Surface Insights More Quickly and Easily
Develop and deploy entire ML workloads at scale with consistency and reliability
Benefits of Using AI/ML on Kubernetes
Kubernetes has a large, rapidly growing ecosystem of hundreds of open-source and cloud-native services, applications, and tools. When properly integrated into your engineering workflows, Kubernetes can dramatically shorten release cycles, standardize deployment and testing workflows, and more.
Because the Cloud Native Computing Foundation (CNCF) landscape serves as a vendor-neutral home for many fast-growing open-source technologies, the open-source code is free to use and modify. If you have the technical know-how and expertise, you can tailor everything to support different needs.
AI/ML models require massive compute power and can be difficult to scale from one machine to another. With Kubernetes, you can automatically scale up workloads and distribute them across a cluster of servers to ensure they have enough resources to run.
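As an illustration of how Kubernetes distributes a workload across a cluster, here is a minimal sketch of a Deployment spec, expressed as the Python dictionary that would be serialized to YAML. All names (the `ml-trainer` app, the image registry) are hypothetical placeholders, not part of any specific product.

```python
# Illustrative sketch: a Kubernetes Deployment for an ML workload.
# The scheduler places each of the replicas on a node with enough
# free CPU, memory, and GPU capacity to satisfy the requests below.
deployment = {
    "apiVersion": "apps/v1",
    "kind": "Deployment",
    "metadata": {"name": "ml-trainer"},  # hypothetical name
    "spec": {
        "replicas": 3,  # Kubernetes spreads these across the cluster
        "selector": {"matchLabels": {"app": "ml-trainer"}},
        "template": {
            "metadata": {"labels": {"app": "ml-trainer"}},
            "spec": {
                "containers": [{
                    "name": "trainer",
                    "image": "registry.example.com/ml/trainer:latest",
                    "resources": {
                        # Pods are only scheduled where capacity exists
                        "requests": {"cpu": "2", "memory": "8Gi"},
                        "limits": {"nvidia.com/gpu": "1"},  # one GPU each
                    },
                }],
            },
        },
    },
}
```

Raising `replicas` (manually or via a HorizontalPodAutoscaler) is all it takes to scale the workload out, which is the mechanism the paragraph above alludes to.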
Challenges Using AI/ML on Kubernetes
Getting ML Models Into Production
Going from prototype to production is perilous when it comes to machine learning: many organizations struggle to move from a prototype on a single machine to a scalable, production-grade deployment. In fact, research has found that the vast majority of AI projects (87%) never make it into production, and for the few models that are deployed, it takes 90 days or more to get there. Because of the disconnect between IT and data science teams, organizations might not see positive returns on their AI/ML investments.
Jupyter Notebooks are popular with data scientists for a reason: they provide a fully interactive development environment that combines data exploration, documentation, and the actual code needed to build and train machine learning models. However, as soon as you need to execute them in production at scale, notebooks become incredibly challenging to work with because they lack effective version control, testing and debugging, modularity, and extensibility. Without the tools with which they are already familiar, data scientists are required to switch contexts and learn Kubernetes, which can impede productivity and time-to-market.
Opinionated Set of Services
For enterprise-grade production, there is a need for reliability, scalability, and security. Building a modern platform from scratch requires picking the right technologies from among 880 cloud-native and more than 270 machine learning tools. To add to the complexity, because the cloud-native and machine learning stacks have evolved independently of each other, not all technologies work well together. Managing the compatibility of all the required tools, while also integrating enterprise-grade capabilities such as observability, authentication, and logging, takes time and requires expertise that data science teams are unlikely to have. This leads to significant wait times for data science teams as they, or other teams, define, build, and maintain complex environments.
Data Access and Security
Data scientists require full access to enterprise data for modeling to ensure the accuracy of their models. However, a lack of governance for deployed workloads, misconfigured access policies, and misplaced laptops can lead to significant data security risks and inefficient use of resources. As a result, data scientists are often limited to smaller, less useful snapshots of the data instead of the full data lake. Organizations must find the right balance between data access and data protection that doesn’t impede productivity or expose the business to potential cryptomining attacks and security breaches.
Develop and Deploy AI/ML on Kubernetes With DKP Kaptain
Deliver Models to Production with Speed and Agility
Go from Prototype to Production in Minutes, Not Months
Kaptain breaks down operational barriers for data scientists to seamlessly move machine learning model prototypes to full-scale deployment with all hyperparameters tuned in a matter of minutes, not months. The platform provides a seamless Python-native user experience across training, tuning, deploying, and tracking of models, increasing production success rates while accelerating time to value. When data scientists no longer have to rely on DevOps Engineers and Operations teams to build a production-ready environment, organizations can go to market faster and at a significantly reduced cost.
Provide Full Support for Jupyter Notebooks
Train, Tune, and Deploy Models from within Notebooks and Kaptain SDK
Kaptain provides data scientists with a familiar notebook-first environment that is preinstalled with the best libraries, workflow tools, and frameworks, with out-of-the-box CPU or GPU support. This makes it easy for data scientists to use the tools with which they are already familiar and take full ownership of the machine learning lifecycle. Everything can be done from within Jupyter Notebooks and the Kaptain SDK without having to switch contexts or learn Kubernetes. With Kaptain, data scientists can train models on GPUs and CPUs, experiment in parallel, and deploy auto-scaling web services with preconfigured load balancers, out-of-the-box canary deployments, and monitoring already set up.
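The train-then-deploy flow described above can be pictured with a short sketch. To be clear, the class and method names below are invented for illustration only and are not the actual Kaptain SDK API; they merely show the shape of a notebook-driven workflow in which training and deployment are single method calls.

```python
# Hypothetical sketch of a notebook-driven train/tune/deploy flow.
# Every name here (Model, train, deploy) is invented for illustration;
# consult the product's SDK documentation for the real interface.

class Model:
    """Stand-in for an SDK wrapper around a training script."""

    def __init__(self, image, entrypoint, gpus=0):
        self.image, self.entrypoint, self.gpus = image, entrypoint, gpus
        self.trained = False

    def train(self, workers=1):
        # On a real platform this would launch a (distributed) training
        # job on the cluster; here we only record that it happened.
        self.trained = True
        return self

    def deploy(self, replicas=2, canary_traffic_percent=10):
        # A real deploy would create an auto-scaling inference service
        # behind a load balancer, with a canary receiving a traffic slice.
        assert self.trained, "train before deploying"
        return {"replicas": replicas, "canary": canary_traffic_percent}

# One fluent chain from a notebook cell: train on 4 workers, then deploy.
endpoint = Model("trainer:latest", "train.py", gpus=1).train(workers=4).deploy()
```

The point of the sketch is the ergonomics: the data scientist stays in Python the whole time, and cluster concerns (workers, replicas, canary traffic) appear only as keyword arguments.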
Ensure Day 2 Readiness and Enterprise-Grade Robustness
Provide Unique Functionality for Enterprise Data Science Use Cases
Kaptain is built for success using an integrated, opinionated subset of Kubeflow together with the enterprise-grade capabilities needed for Day 2 production: model tracking, hyperparameter tuning, self-service data mounts in notebooks, volume manager support, TensorBoard integration, security, real-time cost management, and more, all at no additional cost. We validate the stack regularly with mixed-workload testing on large clusters that simulates realistic enterprise environments, so the entire stack is guaranteed to work and scale. Our tutorials show you how to use each of the included components so you don’t have to waste valuable time on trial and error.
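Hyperparameter tuning, one of the Day 2 capabilities listed above, is conceptually simple. Here is a minimal random-search sketch in plain Python; the objective function is a toy stand-in for "train a model and return its validation error", not any particular framework's API.

```python
import random

# Toy stand-in for training a model and measuring validation error.
# It is minimized near learning_rate = 0.1 and batch_size = 64.
def validation_error(learning_rate, batch_size):
    return (learning_rate - 0.1) ** 2 + ((batch_size - 64) / 64) ** 2

def random_search(trials=50, seed=0):
    """Sample random hyperparameters and keep the best-scoring set."""
    rng = random.Random(seed)
    best = None
    for _ in range(trials):
        params = {
            "learning_rate": rng.uniform(0.001, 0.5),
            "batch_size": rng.choice([16, 32, 64, 128, 256]),
        }
        score = validation_error(**params)
        if best is None or score < best[0]:
            best = (score, params)
    return best

score, params = random_search()
```

A tuning service on Kubernetes does essentially this, but runs the trials as parallel jobs across the cluster instead of a sequential loop.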
Balance Data Access with Data Protection
Enhance Model Accuracy Without Sacrificing Security and Compliance
Kaptain ships with a fully integrated security stack, enabled by default, based on tight integration with D2iQ Konvoy. All exposed endpoints are secured with authentication, authorization, and end-to-end encryption via Dex and Istio to prevent unauthorized access. Kaptain supports multi-tenancy with fine-grained Role-Based Access Control (RBAC), so ML teams can access shared GPUs in their own isolated workspaces and scale their environments. With data access mechanisms in place to protect intellectual property, data scientists can enjoy full, but controlled, access to enterprise data lakes to enhance model accuracy.
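Under the hood, a per-team isolated workspace boils down to namespaced Kubernetes RBAC. The sketch below shows a Role and RoleBinding as the Python dicts that would serialize to YAML; the `team-a` namespace, the `ml-team-a` group, and the choice of the Kubeflow `notebooks` resource are illustrative assumptions, not taken from the source.

```python
# Illustrative Kubernetes RBAC objects granting one team access to
# notebooks only inside its own namespace. Names are hypothetical.
role = {
    "apiVersion": "rbac.authorization.k8s.io/v1",
    "kind": "Role",
    "metadata": {"namespace": "team-a", "name": "notebook-user"},
    "rules": [{
        "apiGroups": ["kubeflow.org"],   # Kubeflow Notebook CRD group
        "resources": ["notebooks"],
        "verbs": ["get", "list", "create", "delete"],
    }],
}
role_binding = {
    "apiVersion": "rbac.authorization.k8s.io/v1",
    "kind": "RoleBinding",
    "metadata": {"namespace": "team-a", "name": "notebook-user-binding"},
    # Bind the role to a group asserted by the identity provider (e.g. Dex).
    "subjects": [{"kind": "Group", "name": "ml-team-a",
                  "apiGroup": "rbac.authorization.k8s.io"}],
    "roleRef": {"kind": "Role", "name": "notebook-user",
                "apiGroup": "rbac.authorization.k8s.io"},
}
```

Because both objects are namespaced, members of `ml-team-a` can manage notebooks in `team-a` but see nothing in other teams' namespaces, which is the isolation the paragraph above describes.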
Key Features and Benefits
Leverage Jupyter-as-a-Service as the primary interface with the best frameworks (TensorFlow, PyTorch, and MXNet), workflow tools (Spark and Horovod), and libraries (Seaborn, statsmodels, SciPy, Keras, scikit-learn, PySpark, gensim, NLTK, and spaCy) for machine learning.
D2iQ Kaptain SDK
Hide the complexities of Kubernetes and allow data scientists to perform distributed training on a cluster’s resources, conduct experiments in parallel, and deploy auto-scaling services with load balancers, canary deployments, and monitoring.
Provide unique functionality that makes sense for enterprise data science use cases, including real-time cost management, monitoring, logging, volume manager support, self-service data mounts, and more.
ML Pipeline Automation and Portability
Achieve greater reproducibility and productivity with built-in lifecycle management and operational expertise.
Out-of-the-Box CPU and GPU Support
Access GPUs and CPUs in a safe and stable environment with all the necessary drivers properly configured.
Enterprise Security and Multi-Tenancy
Run entire ML pipelines securely and efficiently with fine-grained RBAC, as well as authentication, authorization, and end-to-end encryption with Dex and Istio.
ML in a Box with NVIDIA
Leverage a powerful ML platform, a robust Kubernetes platform, and the NVIDIA DGX in a certified and tested solution.
Provide RBAC-based visibility into the resource consumption and state of Kaptain workloads and easily identify and debug any issues.