train ¤

Training and optimization module, called after extracting, loading, and transforming raw data.

train ¤

train(args: Namespace, df: DataFrame) -> dict[str, Any]

Preprocess the data, binarize the labels, and split the data using functions from the data module. Then initialize a model, train it, predict the labels on all three splits using the trained model, and evaluate the predictions. This function accepts args and may add an extra attribute threshold to it before returning, where threshold is a list of the best decision threshold tuned for each class.

Parameters:

  • args (Namespace) –

    Arguments containing booleans for preprocessing the posts and hyperparameters for the modeling pipeline.

  • df (DataFrame) –

    Raw data containing posts and their corresponding tags.

Returns:

  • dict[str, Any]

    Artifacts containing arguments, label binarizer, and the trained model.

Source code in tagolym/train.py
def train(args: Namespace, df: DataFrame) -> dict[str, Any]:
    """Preprocess the data, binarize the labels, and split the data using 
    functions from [data][] module. Then, initialize a model, train it, 
    predict the labels on all three splits using the trained model, and 
    evaluate the predictions. This function accepts arguments, to which an 
    additional argument `threshold` may be added before being returned. 
    Basically, `threshold` is a list of the best threshold tuned for each 
    class.

    Args:
        args (Namespace): Arguments containing booleans for preprocessing the 
            posts and hyperparameters for the modeling pipeline.
        df (DataFrame): Raw data containing posts and their corresponding tags.

    Returns:
        Artifacts containing arguments, label binarizer, and the trained model.
    """
    # setup
    df = data.preprocess(df, args.nocommand, args.stem)
    tags, mlb = data.binarize(df["tags"])
    classes = mlb.classes_
    X_train, X_val, X_test, y_train, y_val, y_test = data.split_data(
        df[["token", "tags"]], tags, random_state=config.SEED
    )

    # model
    model = Pipeline([
        ("tfidf", TfidfVectorizer(ngram_range=(1, args.ngram_max))),
        ("multilabel", MultiOutputClassifier(
            SGDClassifier(
                penalty="elasticnet",
                random_state=config.SEED,
                early_stopping=True,
                class_weight="balanced",
                loss=args.loss,
                alpha=args.alpha,
                l1_ratio=args.l1_ratio,
                learning_rate=args.learning_rate,
                eta0=args.eta0,
                power_t=args.power_t,
            ),
            n_jobs=-1,
        )),
    ])

    # fit, predict, and evaluate
    model.fit(X_train["token"], y_train)

    y_pred, args = predict.custom_predict(X_val["token"], model, args, y_true=y_val)
    val_metrics = evaluate.get_metrics(y_val, y_pred, classes, df=X_val)

    y_pred, args = predict.custom_predict(X_train["token"], model, args)
    train_metrics = evaluate.get_metrics(y_train, y_pred, classes, df=X_train)

    y_pred, args = predict.custom_predict(X_test["token"], model, args)
    test_metrics = evaluate.get_metrics(y_test, y_pred, classes, df=X_test)

    return {
        "args": args,
        "label_encoder": mlb,
        "model": model,
        "train_metrics": train_metrics,
        "val_metrics": val_metrics,
        "test_metrics": test_metrics,
    }
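
A minimal usage sketch, assuming the package layout implied by tagolym/train.py; the file name labeled_data.json and all hyperparameter values below are illustrative, not taken from the project:

from argparse import Namespace

import pandas as pd

from tagolym import train

# hypothetical raw dump of posts and their tags
df = pd.read_json("labeled_data.json")

# illustrative arguments matching the names used in the pipeline above
args = Namespace(
    nocommand=True, stem=False, ngram_max=3,
    loss="log_loss", alpha=1e-4, l1_ratio=0.15,
    learning_rate="optimal", eta0=0.1, power_t=0.5,
)

artifacts = train.train(args=args, df=df)
print(artifacts["val_metrics"]["overall"]["f1"])  # overall F1 on the validation split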

objective ¤

objective(args: Namespace, df: DataFrame, trial: Trial, experiment: int = 0) -> float

The F1 score is the metric optimized during hyperparameter tuning. Using the arguments suggested in an Optuna trial, this function trains the model using train and returns the F1 score on the validation split. It also sets user attributes on the trial for the precision, recall, and F1 score on all three splits.

Parameters:

  • args (Namespace) –

    Arguments containing booleans for preprocessing the posts and hyperparameters for the modeling pipeline.

  • df (DataFrame) –

    Raw data containing posts and their corresponding tags.

  • trial (Trial) –

    Process of evaluating an objective function. This object is passed to an objective function and provides interfaces to get parameter suggestion, manage the trial's state, and set/get user-defined attributes of the trial.

  • experiment (int, default: 0 ) –

    Index of the two-step optimization: step 0 tunes hyperparameters for preprocessing, vectorization, and modeling; step 1 tunes hyperparameters of the learning-rate schedule. Defaults to 0.

Raises:

  • ValueError

    Experiment index is neither 0 nor 1.

Returns:

  • float

    F1 score of the validation split.

Source code in tagolym/train.py
def objective(args: Namespace, df: DataFrame, trial: Trial, experiment: int = 0) -> float:
    """F1 score is a metric chosen to be optimized in hyperparameter tuning. 
    Using arguments chosen in an optuna trial, this function trains the model 
    using [train][train.train] and returns the f1 score of the validation 
    split. It also sets additional attributes to the trial, including 
    precision, recall, and the f1 score on all three splits.

    Args:
        args (Namespace): Arguments containing booleans for preprocessing the 
            posts and hyperparameters for the modeling pipeline.
        df (DataFrame): Raw data containing posts and their corresponding tags.
        trial (Trial): Process of evaluating an objective function. This 
            object is passed to an objective function and provides interfaces 
            to get parameter suggestion, manage the trial's state, and set/get 
            user-defined attributes of the trial.
        experiment (int, optional): Index of the two-step optimization: step
            0 tunes hyperparameters for preprocessing, vectorization, and
            modeling; step 1 tunes hyperparameters of the learning-rate
            schedule. Defaults to 0.

    Raises:
        ValueError: Experiment index is neither 0 nor 1.

    Returns:
        F1 score of the validation split.
    """
    # parameters to tune
    if experiment == 0:
        args.nocommand = trial.suggest_categorical("nocommand", [True, False])
        args.stem = trial.suggest_categorical("stem", [True, False])
        args.ngram_max = trial.suggest_int("ngram_max", 2, 4)
        args.loss = trial.suggest_categorical("loss", ["hinge", "log_loss", "modified_huber"])
        args.l1_ratio = trial.suggest_float("l1_ratio", 0.0, 1.0)
        args.alpha = trial.suggest_float("alpha", 1e-5, 1e-2, log=True)
    elif experiment == 1:
        args.learning_rate = trial.suggest_categorical(
            "learning_rate", ["constant", "optimal", "invscaling", "adaptive"]
        )
        if args.learning_rate != "optimal":
            args.eta0 = trial.suggest_float("eta0", 1e-2, 1e-0, log=True)
        if args.learning_rate == "invscaling":
            args.power_t = trial.suggest_float("power_t", 0.1, 0.5)
    else:
        raise ValueError("Experiment not recognized. Try 0 or 1.")

    # train
    artifacts = train(args=args, df=df)

    # set additional attributes
    for split in ["train", "val", "test"]:
        metrics = artifacts[f"{split}_metrics"]["overall"]
        for score in ["precision", "recall", "f1"]:
            trial.set_user_attr(f"{split}_{score}", metrics[f"{score}"])

    return artifacts["val_metrics"]["overall"]["f1"]
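
A sketch of how the two-step optimization could be driven with Optuna, reusing args and df from the previous sketch; the study names and trial counts are assumptions, not values from the project:

from argparse import Namespace
from functools import partial

import optuna

from tagolym.train import objective

# step 0: preprocessing, vectorization, and modeling hyperparameters
study = optuna.create_study(study_name="step-0", direction="maximize")
study.optimize(partial(objective, args, df, experiment=0), n_trials=50)
args = Namespace(**{**vars(args), **study.best_params})

# step 1: learning-rate hyperparameters, with the step-0 choices fixed
study = optuna.create_study(study_name="step-1", direction="maximize")
study.optimize(partial(objective, args, df, experiment=1), n_trials=50)
args = Namespace(**{**vars(args), **study.best_params})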

tune_threshold ¤

tune_threshold(y_true: ndarray, y_score: ndarray) -> list

The default decision boundary for a binary classification problem is 0.5, which may not be optimal for every class. So, besides tuning the arguments, the threshold for each class is also tuned by optimizing the F1 score: candidate thresholds on a grid from 0 to 1 are evaluated and, for each class, the one with the maximum F1 score is picked.

Parameters:

  • y_true (ndarray) –

    Ground truth (correct) target values.

  • y_score (ndarray) –

    Prediction probability of the model.

Returns:

  • list

    List of the best threshold for each class.

Source code in tagolym/train.py
def tune_threshold(y_true: ndarray, y_score: ndarray) -> list:
    """The default decision boundary for a binary classification problem is 
    0.5, which may not be optimal depending on the problem. So, besides tuning 
    arguments, the threshold for each class is also tuned by optimizing the f1 
    score. What it does is try all possible values of the threshold in a grid 
    from 0 to 1 and pick the one that has the maximum f1 score.

    Args:
        y_true (ndarray): Ground truth (correct) target values.
        y_score (ndarray): Prediction probability of the model.

    Returns:
        List of the best threshold for each class.
    """
    # initialize threshold grid
    grid = np.linspace(0, 1, 101)
    threshold = []

    # find best threshold for each class
    for yt, ys in zip(y_true.T, y_score.T):
        f1 = {}
        for th in grid:
            yp = (ys > th).astype(int)
            f1[th] = f1_score(yt, yp)
        best_th = max(f1, key=f1.__getitem__)
        threshold.append(best_th)

    return threshold
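
A quick sanity check of tune_threshold on synthetic multilabel data; the shapes and scores are purely illustrative, with both arrays laid out as (n_samples, n_classes) to match the column-wise iteration above:

import numpy as np

from tagolym.train import tune_threshold

rng = np.random.default_rng(0)
y_true = rng.integers(0, 2, size=(200, 3))                          # three binary labels
y_score = np.clip(0.6 * y_true + 0.5 * rng.random((200, 3)), 0, 1)  # noisy scores

thresholds = tune_threshold(y_true, y_score)
print(thresholds)  # one best threshold per class, each on the 0-1 grid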