train
Training and optimization module, called after extracting, loading, and transforming raw data.
train
train(args: Namespace, df: DataFrame) -> dict[str, Any]
Preprocess the data, binarize the labels, and split the data using
functions from the data module. Then initialize a model, train it,
predict the labels on all three splits with the trained model, and
evaluate the predictions. The function accepts arguments, to which an
additional argument, threshold, may be added before they are returned.
Here, threshold is a list of the best thresholds, one tuned for each
class.
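To illustrate how a per-class threshold list like the one described above could be used, here is a minimal sketch that turns predicted probabilities into binary labels. The helper name `apply_thresholds` and the `>=` comparison are assumptions made for illustration; they are not taken from tagolym/train.py.

```python
import numpy as np


def apply_thresholds(y_score: np.ndarray, thresholds: list) -> np.ndarray:
    """Binarize class probabilities with one threshold per class.

    Hypothetical helper: y_score has shape (n_samples, n_classes) and
    thresholds has one entry per class (column).
    """
    # Broadcasting compares every column against its own threshold.
    # Treating score == threshold as positive is an assumption.
    return (y_score >= np.asarray(thresholds)).astype(int)
```

With thresholds `[0.5, 0.3]`, a score of 0.4 counts as positive for the second class but would be negative for the first.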
Parameters:
- args (Namespace) – Arguments containing booleans for preprocessing the posts and hyperparameters for the modeling pipeline.
- df (DataFrame) – Raw data containing posts and their corresponding tags.
Returns:
Source code in tagolym/train.py
objective
objective(args: Namespace, df: DataFrame, trial: Trial, experiment: int = 0) -> float
The F1 score is the metric optimized during hyperparameter tuning. Using the arguments chosen in an Optuna trial, this function trains the model with train and returns the F1 score on the validation split. It also stores additional attributes on the trial, including the precision, recall, and F1 score on all three splits.
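The flow described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the code in tagolym/train.py: the names `objective_sketch`, `TrialLike`, and `run_training`, the layout of the returned dictionary, and the attribute keys are all hypothetical. The only trial method used is `set_user_attr`, which Optuna's `Trial` provides.

```python
from typing import Any, Callable, Protocol


class TrialLike(Protocol):
    """Minimal trial interface assumed here; Optuna's Trial offers set_user_attr."""

    def set_user_attr(self, key: str, value: Any) -> None: ...


def objective_sketch(
    trial: TrialLike,
    run_training: Callable[[], dict[str, Any]],
    experiment: int = 0,
) -> float:
    # Only the two optimization steps (indices 0 and 1) are valid.
    if experiment not in (0, 1):
        raise ValueError("experiment must be 0 or 1")

    # Hypothetical result layout: {"performance": {split: {metric: value}}}.
    performance = run_training()["performance"]

    # Record precision, recall, and F1 for every split on the trial.
    for split in ("train", "val", "test"):
        for metric in ("precision", "recall", "f1"):
            trial.set_user_attr(f"{split}_{metric}", performance[split][metric])

    # The validation F1 score is the value the study would optimize.
    return performance["val"]["f1"]
```

In an actual study, a function of this shape would typically be wrapped in a lambda and passed to `study.optimize`, with the study created using `direction="maximize"` so that a higher F1 score is preferred.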
Parameters:
- args (Namespace) – Arguments containing booleans for preprocessing the posts and hyperparameters for the modeling pipeline.
- df (DataFrame) – Raw data containing posts and their corresponding tags.
- trial (Trial) – Process of evaluating an objective function. This object is passed to an objective function and provides interfaces to get parameter suggestion, manage the trial's state, and set/get user-defined attributes of the trial.
- experiment (int, default: 0) – Index for two-step optimization: optimizing hyperparameters in preprocessing, vectorization, and modeling; and hyperparameters in the learning algorithm. Defaults to 0.
Raises:
- ValueError – Experiment index is neither 0 nor 1.
Returns:
- float – F1 score of the validation split.
Source code in tagolym/train.py
tune_threshold
tune_threshold(y_true: ndarray, y_score: ndarray) -> list
The default decision threshold for a binary classification problem is 0.5, which may not be optimal for a given problem. So, besides tuning the arguments, the threshold for each class is also tuned by optimizing the F1 score: every value in a grid from 0 to 1 is tried as the threshold, and the one with the maximum F1 score is picked.
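A minimal sketch of this grid search is shown below. It assumes a grid of 101 evenly spaced values, a `>=` comparison, and a small NumPy F1 helper written here for self-containment; the names `f1_binary` and `tune_threshold_sketch` are hypothetical, and the actual function in tagolym/train.py may use a different grid or a library F1 implementation.

```python
import numpy as np


def f1_binary(y_true: np.ndarray, y_pred: np.ndarray) -> float:
    """F1 score for one binary label column (0.0 when undefined)."""
    tp = np.sum((y_true == 1) & (y_pred == 1))
    fp = np.sum((y_true == 0) & (y_pred == 1))
    fn = np.sum((y_true == 1) & (y_pred == 0))
    denom = 2 * tp + fp + fn
    return float(2 * tp / denom) if denom > 0 else 0.0


def tune_threshold_sketch(y_true: np.ndarray, y_score: np.ndarray) -> list:
    """Pick, for each class, the grid threshold with the highest F1 score."""
    # Assumed grid resolution: 0.00, 0.01, ..., 1.00.
    grid = np.linspace(0.0, 1.0, 101)
    best = []
    for k in range(y_true.shape[1]):
        # Score every candidate threshold for class k.
        scores = [
            f1_binary(y_true[:, k], (y_score[:, k] >= t).astype(int))
            for t in grid
        ]
        # np.argmax returns the first maximum, so ties go to the lowest threshold.
        best.append(float(grid[int(np.argmax(scores))]))
    return best
```

Because each class is tuned independently, the cost grows linearly with both the number of classes and the number of grid points.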
Parameters:
- y_true (ndarray) – Ground truth (correct) target values.
- y_score (ndarray) – Prediction probability of the model.
Returns:
- list – List of the best threshold for each class.
Source code in tagolym/train.py