For Python developers: ten essential skills for machine learning

As data scientists, we sometimes forget what we are here for. We are developers first, researchers second, and mathematicians third. Our first responsibility is to deliver bug-free solutions quickly.

Being able to build models doesn't make us gods, and it is no excuse for writing junk code.

I have made many mistakes since I started learning machine learning, so I want to share the skills I think are used most often in machine learning engineering. In my opinion, they are also the skills this industry lacks most.

I call people without them software-illiterate data scientists, because many of them never studied computer science formally. I am one of them.

If I had to choose between hiring a great data scientist and a great machine learning engineer, I would hire the latter.

Let's get started.

Learn to write abstract classes

Once you start writing abstract classes, you'll see the benefits they bring. An abstract class forces its subclasses to implement the same methods under the same names. When many people work on the same project and everyone defines different methods, the result is needless confusion.

import os
from abc import ABCMeta, abstractmethod


class DataProcessor(metaclass=ABCMeta):
    """Base processor to be used for all preparation."""
    def __init__(self, input_directory, output_directory):
        self.input_directory = input_directory
        self.output_directory = output_directory

    @abstractmethod
    def read(self):
        """Read raw data."""

    @abstractmethod
    def process(self):
        """Processes raw data. This step should create the raw dataframe with all the required features. Shouldn't implement statistical or text cleaning."""

    @abstractmethod
    def save(self):
        """Saves processed data."""


class Trainer(metaclass=ABCMeta):
    """Base trainer to be used for all models."""

    def __init__(self, directory):
        self.directory = directory
        self.model_directory = os.path.join(directory, 'models')

    @abstractmethod
    def preprocess(self):
        """This takes the preprocessed data and returns clean data. This is more about statistical or text cleaning."""

    @abstractmethod
    def set_model(self):
        """Define model here."""

    @abstractmethod
    def fit_model(self):
        """This takes the vectorised data and returns a trained model."""

    @abstractmethod
    def generate_metrics(self):
        """Generates metric with trained model and test data."""

    @abstractmethod
    def save_model(self, model_name):
        """This method saves the model in our required format."""


class Predict(metaclass=ABCMeta):
    """Base predictor to be used for all models."""

    def __init__(self, directory):
        self.directory = directory
        self.model_directory = os.path.join(directory, 'models')

    @abstractmethod
    def load_model(self):
        """Load model here."""

    @abstractmethod
    def preprocess(self):
        """This takes the raw data and returns clean data for prediction."""

    @abstractmethod
    def predict(self):
        """This is used for prediction."""


class BaseDB(metaclass=ABCMeta):
    """ Base database class to be used for all DB connectors."""
    @abstractmethod
    def get_connection(self):
        """This creates a new DB connection."""
    @abstractmethod
    def close_connection(self):
        """This closes the DB connection."""
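
To make the benefit concrete, here is a minimal sketch of how a subclass interacts with such a base class. It repeats a cut-down DataProcessor so that it runs on its own. CsvLineProcessor, IncompleteProcessor, and the in-memory input are hypothetical names invented for illustration, not part of the original code.

```python
from abc import ABCMeta, abstractmethod


class DataProcessor(metaclass=ABCMeta):
    """Cut-down base processor, repeated so the sketch runs on its own."""

    @abstractmethod
    def read(self):
        """Read raw data."""

    @abstractmethod
    def process(self):
        """Turn raw data into features."""


class CsvLineProcessor(DataProcessor):
    """Hypothetical subclass: parses comma-separated lines held in memory."""

    def __init__(self, lines):
        self.lines = lines
        self.rows = []

    def read(self):
        self.rows = [line.split(',') for line in self.lines]

    def process(self):
        # Convert every field to int as a stand-in for real feature creation.
        self.rows = [[int(value) for value in row] for row in self.rows]
        return self.rows


class IncompleteProcessor(DataProcessor):
    """Hypothetical subclass that forgets to implement process()."""

    def read(self):
        pass


processor = CsvLineProcessor(['1,2', '3,4'])
processor.read()
result = processor.process()

try:
    IncompleteProcessor()
    instantiation_failed = False
except TypeError:
    # Python refuses to instantiate a class with unimplemented abstract methods.
    instantiation_failed = True
```

The second half shows the enforcement: a teammate who skips a required method finds out as soon as the class is instantiated, not deep inside a training run.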

Fix the random seed

Reproducibility of experiments is very important, and an unset random seed is our enemy. Set the seed explicitly. Otherwise each run produces a different train/test split and a different weight initialization in the neural network, and the results will be inconsistent.

import random
import numpy as np
import torch

def set_seed(args):
    # Seed every random number generator in use.
    random.seed(args.seed)
    np.random.seed(args.seed)
    torch.manual_seed(args.seed)
    if args.n_gpu > 0:
        torch.cuda.manual_seed_all(args.seed)
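
To see why the seed matters for data splits, the sketch below uses only the standard library's random module. shuffled_split is a hypothetical helper written for this illustration. The same idea carries over to the NumPy and PyTorch generators seeded above.

```python
import random


def shuffled_split(items, seed, test_fraction=0.25):
    """Shuffle a copy of items with a seeded generator and split it in two."""
    rng = random.Random(seed)  # local generator, leaves global state untouched
    shuffled = list(items)
    rng.shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


data = list(range(20))
train_a, test_a = shuffled_split(data, seed=42)
train_b, test_b = shuffled_split(data, seed=42)

# The same seed gives the same split, so two runs are comparable.
same_split = (train_a, test_a) == (train_b, test_b)
```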

Load a small amount of data first

If your dataset is large and you are still writing later stages of the code, such as data cleaning or modeling, use nrows to avoid loading everything each time. It is useful when you only want to test the code, not run the whole pipeline.

It is especially handy when your local machine cannot handle the full dataset but you like developing in Jupyter, VS Code, or Atom.

import pandas as pd
df_train = pd.read_csv('train.csv', nrows=1000)

Anticipate failures (the sign of a mature developer)

Always check for NA (missing) values in your data, because they will cause problems later. Even if your current data has none, that doesn't mean they won't show up in a future training cycle. So check for them anyway.
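
As a dependency-free sketch of such a check, the example below counts missing cells per column using only the standard library. The in-memory CSV, the column names, and the MISSING_MARKERS set are assumptions made for illustration. With pandas, the usual equivalent is df.isna().sum() followed by df.dropna().

```python
import csv
import io

# Hypothetical in-memory CSV standing in for a real training file.
raw_csv = io.StringIO("age,city\n31,Pune\n,Delhi\n45,\n28,NA\n")

# Strings treated as missing: an assumption, adjust to your data.
MISSING_MARKERS = {'', 'NA', 'NaN', 'null'}


def is_missing(value):
    """Return True when a CSV cell is absent or holds a missing-value marker."""
    return value is None or value.strip() in MISSING_MARKERS


reader = csv.DictReader(raw_csv)
rows = list(reader)
columns = reader.fieldnames

# Count missing cells per column before training.
missing = {c: sum(is_missing(row[c]) for row in rows) for c in columns}

# Keep only the rows where every column has a value.
complete_rows = [row for row in rows if not any(is_missing(row[c]) for c in columns)]

print(missing)
print(len(rows), len(complete_rows))
```

Printing the row count before and after dropping incomplete rows makes it obvious when a new batch of data suddenly contains gaps.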


Show processing progress

When you work with big data, it helps a great deal to see how far along a job is and how long it will take to finish.

Option 1: tqdm

from tqdm import tqdm
import time

# Register progress_apply on pandas objects; df is assumed to be an existing DataFrame.
tqdm.pandas()

df['col'] = df['col'].progress_apply(lambda x: x**2)

text = ""
for char in tqdm(["a", "b", "c", "d"]):
    time.sleep(0.25)
    text = text + char

Option 2: fastprogress

from fastprogress.fastprogress import master_bar, progress_bar
from time import sleep

mb = master_bar(range(10))
for i in mb:
    for j in progress_bar(range(100), parent=mb):
        sleep(0.01)
        mb.child.comment = 'second bar stat'
    mb.first_bar.comment = 'first bar stat'
    mb.write(f'Finished loop {i}.')

Speed up slow pandas

If you have used pandas, you know how slow it can get, especially on group-by operations. Instead of racking your brains for clever speedups, try modin, which only requires changing one line of code.

1import modin.pandas as pd

Record function execution time

Not all functions are created equal.

Even when all of your code works, that doesn't mean it is good code. Some soft bugs make it slower without breaking it, so you need to find them. Use this decorator to log how long a function takes.

import time
from functools import wraps


def timing(f):
    """Decorator for timing functions
    Usage:
    @timing
    def function(a):
        pass
    """

    @wraps(f)
    def wrapper(*args, **kwargs):
        start = time.time()
        result = f(*args, **kwargs)
        end = time.time()
        print('function:%r took: %2.2f sec' % (f.__name__, end - start))
        return result

    return wrapper

Don't burn money on the cloud

No one likes an engineer who wastes cloud resources.

Some of our experiments run for hours. It is hard to keep track of them and shut down the cloud instance when they finish. I have made this mistake myself, and I have seen people leave instances running for days.

This typically happens when we start something on Friday and only notice it is still running when we come back on Monday.

Just call the function below at the end of execution and your ass will never be on fire again!

Wrap the main function in try and call this function again in except, so that the server is not left running when an error occurs. I have dealt with such cases too.

Let's be a little more responsible. Cutting carbon emissions starts with us.

import math
import os


def run_command(cmd):
    return os.system(cmd)


def shutdown(seconds=0, os_name='linux'):
    """Shutdown system after seconds given. Useful for shutting EC2 to save costs."""
    if os_name == 'linux':
        # Linux `shutdown` takes its delay in minutes (+m), so round the seconds up.
        run_command('sudo shutdown -h +%d' % math.ceil(seconds / 60))
    elif os_name == 'windows':
        run_command('shutdown -s -t %s' % seconds)
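
The wrapping advice can be sketched as follows. This variant uses try/finally, not try/except, so the cleanup runs once whether the experiment succeeds or fails. run_with_cleanup, failing_experiment, and fake_shutdown are hypothetical stand-ins that keep the sketch safe to run anywhere. In real use the cleanup callable would be the shutdown function.

```python
def run_with_cleanup(main, cleanup):
    """Run main(); call cleanup() whether main succeeds or raises."""
    try:
        return main()
    finally:
        # Runs on success and on failure, so the instance is never left on.
        cleanup()


# Hypothetical stand-ins so the sketch does not shut anything down.
calls = []


def failing_experiment():
    raise RuntimeError('training crashed')


def fake_shutdown():
    calls.append('shutdown')


try:
    run_with_cleanup(failing_experiment, fake_shutdown)
except RuntimeError as error:
    # The original error still propagates after the cleanup has run.
    calls.append('error: %s' % error)
```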

Create and save reports

After a certain point in modeling, all the useful insights come from analyzing errors and metrics. Make sure you create and save well-formatted reports for yourself and your boss.

Management likes reports anyway, doesn't it?

import json
import os

from sklearn.metrics import (accuracy_score, classification_report,
                             confusion_matrix, f1_score, fbeta_score)


def get_metrics(y, y_pred, beta=2, average_method='macro', y_encoder=None):
    if y_encoder:
        y = y_encoder.inverse_transform(y)
        y_pred = y_encoder.inverse_transform(y_pred)
    return {
        'accuracy': round(accuracy_score(y, y_pred), 4),
        'f1_score_macro': round(f1_score(y, y_pred, average=average_method), 4),
        # beta must be passed by keyword in recent scikit-learn releases.
        'fbeta_score_macro': round(fbeta_score(y, y_pred, beta=beta, average=average_method), 4),
        'report': classification_report(y, y_pred, output_dict=True),
        'report_csv': classification_report(y, y_pred, output_dict=False).replace('\n', '\r\n')
    }


def save_metrics(metrics: dict, model_directory, file_name):
    path = os.path.join(model_directory, file_name + '_report.txt')
    # classification_report_to_csv is not defined in this article; it is assumed
    # to write the text report to `path`.
    classification_report_to_csv(metrics['report_csv'], path)
    metrics.pop('report_csv')
    path = os.path.join(model_directory, file_name + '_metrics.json')
    # Use a context manager so the file is closed after writing.
    with open(path, 'w') as metrics_file:
        json.dump(metrics, metrics_file, indent=4)
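
The listing calls classification_report_to_csv without showing it. The sketch below is one possible minimal version, assuming the helper only needs to write the already formatted text report to disk. The body, the sample report, and the file name are guesses for illustration, not the author's implementation.

```python
import os
import tempfile


def classification_report_to_csv(report_text, path):
    """Hypothetical helper: write the plain-text report to path unchanged."""
    # newline='' keeps the '\r\n' line endings already present in the text.
    with open(path, 'w', newline='') as report_file:
        report_file.write(report_text)


# Made-up report text, used only to exercise the helper.
sample_report = 'label precision recall\r\ncat 0.90 0.80\r\n'

with tempfile.TemporaryDirectory() as directory:
    report_path = os.path.join(directory, 'demo_report.txt')
    classification_report_to_csv(sample_report, report_path)
    with open(report_path, newline='') as report_file:
        written = report_file.read()
```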

Write a good API

All is bad that ends badly.

You can do a great job of data cleaning and modeling and still create a huge mess at the end. In my experience, many people don't know how to write good APIs, documentation, and server setups. I'll write another article on this soon, but let me share a short part of it first.

The approach below suits classical machine learning and deep learning deployments under a light load (around 1,000 requests per minute).

Meet this combination: fastapi + uvicorn + gunicorn

  • Fastest: write the API with fastapi, because it is one of the fastest Python web frameworks.
  • Documentation: writing the API in fastapi gives us free documentation and test endpoints at the /docs path of the server URL. fastapi generates and updates them automatically as we change the code.
  • Workers: deploy the API with the gunicorn server, because gunicorn can start more than one worker. Keep at least two workers.

Run these commands to deploy with four workers. You can optimize the number of workers through load testing.

pip install fastapi uvicorn gunicorn
gunicorn -w 4 -k uvicorn.workers.UvicornH11Worker main:app

Original release time: March 9, 2020
By Pratik Bhavsar
This article comes from the official account "AI technology base".

Posted on Mon, 09 Mar 2020 22:51:40 -0700 by kedarnath