Kubeflow Series: Introduction to the basic components of kubeflow

Kubeflow Series (1): One-click installation of Kubeflow in China based on Alibaba Cloud mirrors

To build a more intuitive and in-depth understanding of Kubeflow, this post gives a brief introduction to its components, starting from how a machine learning task is implemented in practice.

Machine Learning Task Engineering Implementation Process

A modeling task can be divided into four main stages:

  • Business Understanding
  • Data Acquisition and Data Understanding
  • Feature Engineering, Model Training, Model Evaluation
  • Model Deployment, providing model services

From start to finish, a machine learning task moves through these four stages, and Kubeflow's functionality can be said to be built around them.

kubeflow

Kubeflow was originally built around tf-operator, but as the project evolved it grew into a large collection of tools for running machine learning tasks on the cloud. From data collection and validation to model training and service publishing, Kubeflow provides components for almost every step.

kubeflow features:

  • Based on Kubernetes, so it inherits cloud-native characteristics: elastic scaling, high availability, DevOps workflows, and so on
  • Integrates a large number of machine learning tools

structure

The complete structure of Kubeflow can be seen in its kustomize installation files:

kustomize/
├── ambassador.yaml
├── api-service.yaml
├── argo.yaml
├── centraldashboard.yaml
├── jupyter-web-app.yaml
├── katib.yaml
├── metacontroller.yaml
├── minio.yaml
├── mysql.yaml
├── notebook-controller.yaml
├── persistent-agent.yaml
├── pipelines-runner.yaml
├── pipelines-ui.yaml
├── pipelines-viewer.yaml
├── pytorch-operator.yaml
├── scheduledworkflow.yaml
├── tensorboard.yaml
└── tf-job-operator.yaml

  • ambassador: microservice gateway
  • argo: task workflow orchestration
  • centraldashboard: the central Kubeflow dashboard page
  • tf-job-operator: deep learning framework engine; a CRD built on TensorFlow with resource kind TFJob
  • tensorboard: visualization UI for TensorFlow training
  • katib: hyperparameter tuning service
  • pipeline: machine learning workflow component
  • jupyter: interactive IDE/coding environment

TFJob

TFJob is a Kubernetes CRD that models TensorFlow's distributed training architecture:

  • Chief: coordinates the training task
  • PS (Parameter Server): provides distributed storage for the model parameters
  • Worker: performs the actual model training; in some setups worker 0 takes on the Chief's role
  • Evaluator: evaluates model performance during training

An example TFJob manifest:

apiVersion: kubeflow.org/v1beta2
kind: TFJob
metadata:
  name: mnist-train
  namespace: kubeflow
spec:
  tfReplicaSpecs:
    Chief: # Scheduler
      replicas: 1
      template:
        spec:
          containers:
            - command:
              - /usr/bin/python
              - /opt/model.py
              env:
              - name: modelDir
                value: /mnt
              - name: exportDir
                value: /mnt/export
              image: mnist-test:v0.1
              name: tensorflow
              volumeMounts:
              - mountPath: /mnt
                name: local-storage
              workingDir: /opt
          restartPolicy: OnFailure
          volumes:
          - name: local-storage
            persistentVolumeClaim:
              claimName: local-path-pvc
    PS: # Parameter Server
      replicas: 1
      template:
        spec:
          containers:
            - command:
              - /usr/bin/python
              - /opt/model.py
              env:
              - name: modelDir
                value: /mnt
              - name: exportDir
                value: /mnt/export
              image: mnist-test:v0.1
              name: tensorflow
              volumeMounts:
              - mountPath: /mnt
                name: local-storage
              workingDir: /opt
          restartPolicy: OnFailure
          volumes:
          - name: local-storage
            persistentVolumeClaim:
              claimName: local-path-pvc
    Worker: # Compute Node
      replicas: 2
      template:
        spec:
          containers:
            - command:
              - /usr/bin/python
              - /opt/model.py
              env:
              - name: modelDir
                value: /mnt
              - name: exportDir
                value: /mnt/export
              image: mnist-test:v0.1
              name: tensorflow
              volumeMounts:
              - mountPath: /mnt
                name: local-storage
              workingDir: /opt
          restartPolicy: OnFailure
          volumes:
          - name: local-storage
            persistentVolumeClaim:
              claimName: local-path-pvc
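
How does each replica know which role it plays? tf-operator injects a TF_CONFIG environment variable into every pod it creates, containing the cluster layout plus the replica's own type and index. A minimal sketch of how a training script such as the model.py above might read it (the branching logic here is only illustrative, not part of the manifest):

import json
import os

# tf-operator injects TF_CONFIG into every replica it creates; it holds the
# cluster spec plus this replica's own task type and index.
tf_config = json.loads(os.environ.get("TF_CONFIG", "{}"))

cluster = tf_config.get("cluster", {})  # e.g. {"chief": [...], "ps": [...], "worker": [...]}
task = tf_config.get("task", {})        # e.g. {"type": "worker", "index": 1}

if task.get("type") == "ps":
    # Parameter servers join the cluster and serve variables.
    print("running as parameter server", task.get("index"))
else:
    # Chief / worker / evaluator replicas run the training or evaluation loop.
    print("running as", task.get("type"), task.get("index"))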

tensorboard: training visualization interface

Mount the log directory to create a TensorBoard visualization service:

apiVersion: v1
kind: Service
metadata:
  name: tensorboard-tb
  namespace: kubeflow
spec:
  ports:
  - name: http
    port: 8080
    targetPort: 80
  selector:
    app: tensorboard
    tb-job: tensorboard
---
apiVersion: apps/v1beta1
kind: Deployment
metadata:
  name: tensorboard-tb
  namespace: kubeflow
spec:
  replicas: 1
  template:
    metadata:
      labels:
        app: tensorboard
        tb-job: tensorboard
      name: tensorboard
      namespace: kubeflow
    spec:
      containers:
      - command:
        - /usr/local/bin/tensorboard
        - --logdir=/mnt
        - --port=80
        env:
        - name: logDir
          value: /mnt
        image: tensorflow/tensorflow:1.11.0
        name: tensorboard
        ports:
        - containerPort: 80
        volumeMounts:
        - mountPath: /mnt
          name: local-storage
      serviceAccount: default-editor
      volumes:
      - name: local-storage
        persistentVolumeClaim:
          claimName: mnist-test-pvc
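
For the dashboard to display anything, the training code must write event files under the same path that --logdir points to. A minimal TF 1.x sketch (matching the tensorflow:1.11.0 image above), assuming the training container mounts the same PVC at /mnt:

import tensorflow as tf

# Write scalar summaries into the directory tensorboard reads via --logdir=/mnt.
writer = tf.summary.FileWriter("/mnt")
for step in range(3):
    summary = tf.Summary(value=[tf.Summary.Value(tag="loss", simple_value=1.0 / (step + 1))])
    writer.add_summary(summary, global_step=step)
writer.flush()
writer.close()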

tf-serving

TensorFlow Serving provides a stable interface for callers to invoke the model; it turns the exported model files directly into a service:

apiVersion: v1
kind: Service
metadata:
  labels:
    app: mnist
  name: mnist-service-local
  namespace: kubeflow
spec:
  ports:
  - name: grpc-tf-serving
    port: 9000
    targetPort: 9000
  - name: http-tf-serving
    port: 8500
    targetPort: 8500
  selector:
    app: mnist
  type: ClusterIP
---
apiVersion: extensions/v1beta1
kind: Deployment
metadata:
  labels:
    app: mnist
  name: mnist-service-local
  namespace: kubeflow
spec:
  template:
    metadata:
      labels:
        app: mnist
        version: v1
    spec:
      containers:
      - args:
        - --port=9000
        - --rest_api_port=8500
        - --model_name=mnist
        - --model_base_path=/mnt/export
        command:
        - /usr/bin/tensorflow_model_server
        env:
        - name: modelBasePath
          value: /mnt/export
        image: tensorflow/serving:1.11.1
        imagePullPolicy: IfNotPresent
        livenessProbe:
          initialDelaySeconds: 30
          periodSeconds: 30
          tcpSocket:
            port: 9000
        name: mnist
        ports:
        - containerPort: 9000
        - containerPort: 8500
        resources:
          limits:
            cpu: "4"
            memory: 4Gi
          requests:
            cpu: "1"
            memory: 1Gi
        volumeMounts:
        - mountPath: /mnt
          name: local-storage
      volumes:
      - name: local-storage
        persistentVolumeClaim:
          claimName: local-path-pvc  # assumption: the PVC holding the model exported by the TFJob

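Once the Deployment is running, the REST port (8500 above) serves TensorFlow Serving's standard /v1/models/<name>:predict endpoint. A minimal sketch in Python; the host assumes in-cluster access to the mnist-service-local Service, and the input shape is only an illustration, so adjust it to the model's real signature:

import requests

# Hypothetical input: a 28x28 image of zeros; the real MNIST model may expect
# a different shape (e.g. a flattened 784-float vector).
instance = [[0.0] * 28 for _ in range(28)]

resp = requests.post(
    "http://mnist-service-local.kubeflow:8500/v1/models/mnist:predict",
    json={"instances": [instance]},
)
resp.raise_for_status()
print(resp.json()["predictions"])
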
pipeline

Pipeline is Kubeflow's visual task-workflow component: a pipeline is described as a directed acyclic graph, and each step of the flow is a component defined by a container.

Run steps:

  • First define an Experiment
  • Then define a Pipeline for the task and start it
  • Run a Pipeline instance

Introduction to pipeline structure

Pipelines are mainly divided into eight parts:

  • Python SDK: the DSL used to define Kubeflow pipelines
  • DSL compiler: converts the Python DSL code into a static YAML configuration
  • Pipeline web server: the pipeline front-end service
  • Pipeline Service: the pipeline back-end service
  • Kubernetes resources: the CRDs created to run the pipelines
  • Machine learning metadata service: stores the data exchanged between the containers in a workflow (inputs/outputs)
  • Artifact storage: stores metadata, pipeline packages, and views
  • Orchestration controllers: task scheduling, e.g. Argo Workflow

example

import kfp
from kfp import dsl

def gcs_download_op(url):
    return dsl.ContainerOp(
        name='GCS - Download',
        image='google/cloud-sdk:272.0.0',
        command=['sh', '-c'],
        arguments=['gsutil cat $0 | tee $1', url, '/tmp/results.txt'],
        file_outputs={
            'data': '/tmp/results.txt',
        }
    )


def echo2_op(text1, text2):
    return dsl.ContainerOp(
        name='echo',
        image='library/bash:4.4.23',
        command=['sh', '-c'],
        arguments=['echo "Text 1: $0"; echo "Text 2: $1"', text1, text2]
    )


@dsl.pipeline(
  name='Parallel pipeline',
  description='Download two messages in parallel and prints the concatenated result.'
)
def download_and_join(
    url1='gs://ml-pipeline-playground/shakespeare1.txt',
    url2='gs://ml-pipeline-playground/shakespeare2.txt'
):
    """A three-step pipeline with first two running in parallel."""

    download1_task = gcs_download_op(url1)
    download2_task = gcs_download_op(url2)

    echo_task = echo2_op(download1_task.output, download2_task.output)

if __name__ == '__main__':
    kfp.compiler.Compiler().compile(download_and_join, __file__ + '.yaml')
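
Compiling only produces the pipeline package; to run it you still need to submit it to the pipeline backend. A sketch using the KFP v1 SDK client, assuming the pipeline API is reachable (the host URL below is an assumption, e.g. via a port-forward to the ml-pipeline-ui Service):

import kfp

# Assumed endpoint, e.g. after: kubectl port-forward svc/ml-pipeline-ui -n kubeflow 8080:80
client = kfp.Client(host="http://localhost:8080")

# Creates the experiment if needed and starts a run directly from the pipeline function.
client.create_run_from_pipeline_func(
    download_and_join,
    arguments={},               # the default url1/url2 values are used
    experiment_name="demo",
)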

jupyter-notebook

Jupyter is the most interactive part of the workflow; its main purpose is to let users explore data interactively and quickly test and evaluate models.

There are two main modules: jupyter-web-app and notebook-controller.

You can also replace Jupyter with JupyterHub, which provides more functionality.
