main
The main module that runs everything end-to-end (a usage sketch follows this list):
- Extract, load, and transform raw data
- Utilize preprocessed data to optimize the pipeline
- Train a model with the best arguments
- Predict on new data using the trained model
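As a rough sketch of how these steps might be chained from Python: the file paths, experiment names, and the choice to train on the optimized arguments are assumptions for illustration, not part of the documented API.

```python
from pathlib import Path

from tagolym import main

# Hypothetical paths and names, chosen for illustration only.
main.elt_data(Path("credentials/bigquery-key.json"))           # extract raw posts from BigQuery
main.optimize(Path("config/args.json"), "optimization", 20)    # tune preprocessing and pipeline arguments
main.train_model(Path("config/args_opt.json"), "experiment", "run")  # train with the best arguments
preds = main.predict_tag(["Prove that n^3 - n is divisible by 3 for every integer n."])
```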
elt_data
elt_data(key_path: FilePath) -> None
Query raw data from BigQuery and save it to the data folder in JSON format.
Parameters:
- key_path (FilePath) – Path to the Google service account private key JSON file used for creating credentials.
Source code in tagolym/main.py
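A minimal call might look like the following sketch; the key location is an assumption and should point to your own service account file.

```python
from pathlib import Path

from tagolym import main

# Hypothetical path; use your own Google service account JSON key.
main.elt_data(Path("credentials/bigquery-key.json"))
# Raw query results are written as JSON into the data folder.
```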
train_model
train_model(args_fp: FilePath, experiment_name: str, run_name: str) -> None
Load raw data from the data folder and pass it to train.train to preprocess the data and train a model with it. Log the metrics, artifacts, and parameters using MLflow. Also save the MLflow run ID and metrics to the config folder.
Parameters:
- args_fp (FilePath) – Path to the arguments used. Arguments include booleans for preprocessing the posts (whether to exclude command words and to implement word stemming), hyperparameters for the modeling pipeline, and the best threshold for each class.
- experiment_name (str) – User input experiment name for MLflow.
- run_name (str) – User input run name for MLflow.
Source code in tagolym/main.py
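For illustration, assuming an arguments file at config/args.json and placeholder MLflow names, a training run could be started like this:

```python
from pathlib import Path

from tagolym import main

# args.json is assumed to hold the preprocessing flags, pipeline
# hyperparameters, and per-class thresholds described above.
main.train_model(
    args_fp=Path("config/args.json"),
    experiment_name="baseline",       # placeholder MLflow experiment name
    run_name="first-run",             # placeholder MLflow run name
)
# Metrics, artifacts, and parameters are logged to MLflow; the run ID
# and metrics are also saved to the config folder.
```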
optimize
optimize(args_fp: FilePath, study_name: str, num_trials: int) -> None
Load raw data from the data folder and optimize the given arguments by maximizing the F1 score on the validation split. For search efficiency, the optimization is done in two steps:
- for hyperparameters in preprocessing, vectorization, and modeling; and
- for hyperparameters in the learning algorithm.
Also save the best arguments to the config folder as args_opt.json.
Parameters:
- args_fp (FilePath) – Path to the initial arguments for the entire process. Arguments include booleans for preprocessing the posts and hyperparameters for the modeling pipeline.
- study_name (str) – User input study name for MLflow.
- num_trials (int) – Number of trials for arguments tuning, at minimum 1.
Source code in tagolym/main.py
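A sketch of launching the two-step search; the study name and trial budget below are placeholders:

```python
from pathlib import Path

from tagolym import main

# Placeholder study name and number of trials.
main.optimize(
    args_fp=Path("config/args.json"),
    study_name="optimization",
    num_trials=20,
)
# The best arguments found are saved to config/args_opt.json.
```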
load_artifacts
load_artifacts(run_id: Optional[str] = None) -> dict[str, Any]
Load the artifacts of a specific MLflow run ID into memory, including arguments, metrics, model, and label binarizer.
Parameters:
- run_id (Optional[str], default: None) – MLflow run ID. Defaults to None.
Returns:
- dict[str, Any] – Loaded artifacts of the run, including arguments, metrics, model, and label binarizer.
Source code in tagolym/main.py
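As a sketch, assuming a valid MLflow run ID from a previous training run is at hand:

```python
from tagolym import main

# Replace with an actual MLflow run ID; the value here is made up.
artifacts = main.load_artifacts(run_id="abc123")
print(artifacts.keys())  # covers arguments, metrics, model, and label binarizer
```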
predict_tag
predict_tag(texts: list[str], run_id: Optional[str] = None) -> list[dict]
Given a specific MLflow run ID and some posts, predict their labels using preloaded artifacts by calling predict.predict.
Parameters:
- texts (list[str]) – List of posts.
- run_id (Optional[str], default: None) – MLflow run ID. If None, the run ID will be set from run_id.txt inside the config folder. Defaults to None.
Returns:
- list[dict] – Predictions for the given posts.
Source code in tagolym/main.py
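A usage sketch; the example posts are made up, and omitting run_id falls back to the stored run ID as described above:

```python
from tagolym import main

texts = [
    "Prove that the sum of the angles of a triangle is 180 degrees.",
    "Find all functions f: R -> R such that f(x + y) = f(x) + f(y).",
]
# With run_id=None, the run ID is taken from run_id.txt in the config folder.
predictions = main.predict_tag(texts)
print(predictions)  # one prediction per post
```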