
evaluate ¤

Given true labels and model predictions, this module computes the model's precision, recall, F1 score, and number of samples. Performance is computed on the overall data, per class, and per slice; a usage sketch follows the list below. Eight slices are considered:

  • short tokens, i.e. those with fewer than 5 words,
  • six slices in which posts are tagged with a subtopic but not with the broader topic that covers it, and
  • tokens that contain none of the frequent words longer than 3 letters.
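
A minimal end-to-end sketch of the module's entry point, get_metrics. It assumes the package is importable as tagolym and that labels are already binarized into multilabel indicator arrays; the posts, tags, and label values are made up for illustration.

import numpy as np
import pandas as pd

from tagolym import evaluate

# toy preprocessed data: one token (text) and one list of tags per post
df = pd.DataFrame({
    "token": ["prove inequality holds for positive reals", "find all functions"],
    "tags": [["inequality", "algebra"], ["function"]],
})

# multilabel indicator arrays: rows align with df, columns with `classes`
classes = np.array(["algebra", "function", "inequality"])
y_true = np.array([[1, 0, 1], [0, 1, 0]])
y_pred = np.array([[1, 0, 1], [0, 0, 0]])

performance = evaluate.get_metrics(y_true, y_pred, classes, df=df)
print(performance["overall"])  # weighted precision, recall, f1, num_samples
print(performance["slices"])   # micro-averaged metrics per non-empty slice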

short_post ¤

short_post(x: Series) -> bool

Confirm whether a data point has a token with fewer than 5 words.

Parameters:

  • x (Series) –

    Data point containing a token.

Returns:

  • bool

    Whether the data point has a token with fewer than 5 words.

Source code in tagolym/evaluate.py
@slicing_function()
def short_post(x: Series) -> bool:
    """Confirm whether a data point has a token with less than 5 words.

    Args:
        x (Series): Data point containing a token.

    Returns:
        Whether the data point has a token with fewer than 5 words.
    """
    return len(x["token"].split()) < 5
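
Because @slicing_function() wraps short_post in a Snorkel SlicingFunction, the object can still be called directly on a single row, which is handy for a quick sanity check; the example rows are made up:

import pandas as pd

print(short_post(pd.Series({"token": "prove the inequality"})))         # True: 3 words
print(short_post(pd.Series({"token": "let a b c be positive reals"})))  # False: 7 words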

inequality_not_algebra ¤

inequality_not_algebra(x: Series) -> bool

Confirm whether a data point has "inequality" but not "algebra" as one of its labels.

Parameters:

  • x (Series) –

    Data point containing a list of labels.

Returns:

  • bool

    Whether the data point has "inequality" but not "algebra" as one of its labels.

Source code in tagolym/evaluate.py
@slicing_function()
def inequality_not_algebra(x: Series) -> bool:
    """Confirm whether a data point has `"inequality"` but not `"algebra"` as 
    one of its labels.

    Args:
        x (Series): Data point containing a list of labels.

    Returns:
        Whether the data point has `"inequality"` but not `"algebra"` as one 
            of its labels.
    """
    inequality = "inequality" in x["tags"]
    algebra = "algebra" in x["tags"]
    return (inequality and not algebra)
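
Like short_post, the decorated function can be invoked on a hand-made row; the tags here are illustrative. The five sibling slicing functions below follow the same pattern:

import pandas as pd

print(inequality_not_algebra(pd.Series({"tags": ["inequality"]})))             # True
print(inequality_not_algebra(pd.Series({"tags": ["inequality", "algebra"]})))  # False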

function_not_algebra ¤

function_not_algebra(x: Series) -> bool

Confirm whether a data point has "function" but not "algebra" as one of its labels.

Parameters:

  • x (Series) –

    Data point containing a list of labels.

Returns:

  • bool

    Whether the data point has "function" but not "algebra" as one of its labels.

Source code in tagolym/evaluate.py
@slicing_function()
def function_not_algebra(x: Series) -> bool:
    """Confirm whether a data point has `"function"` but not `"algebra"` as 
    one of its labels.

    Args:
        x (Series): Data point containing a list of labels.

    Returns:
        Whether the data point has `"function"` but not `"algebra"` as one of 
            its labels.
    """
    function = "function" in x["tags"]
    algebra = "algebra" in x["tags"]
    return (function and not algebra)

polynomial_not_algebra ¤

polynomial_not_algebra(x: Series) -> bool

Confirm whether a data point has "polynomial" but not "algebra" as one of its labels.

Parameters:

  • x (Series) –

    Data point containing a list of labels.

Returns:

  • bool

    Whether the data point has "polynomial" but not "algebra" as one of its labels.

Source code in tagolym/evaluate.py
@slicing_function()
def polynomial_not_algebra(x: Series) -> bool:
    """Confirm whether a data point has `"polynomial"` but not `"algebra"` as 
    one of its labels.

    Args:
        x (Series): Data point containing a list of labels.

    Returns:
        Whether the data point has `"polynomial"` but not `"algebra"` as one 
            of its labels.
    """
    polynomial = "polynomial" in x["tags"]
    algebra = "algebra" in x["tags"]
    return (polynomial and not algebra)

circle_not_geometry ¤

circle_not_geometry(x: Series) -> bool

Confirm whether a data point has "circle" but not "geometry" as one of its labels.

Parameters:

  • x (Series) –

    Data point containing a list of labels.

Returns:

  • bool

    Whether the data point has "circle" but not "geometry" as one of its labels.

Source code in tagolym/evaluate.py
@slicing_function()
def circle_not_geometry(x: Series) -> bool:
    """Confirm whether a data point has `"circle"` but not `"geometry"` as one 
    of its labels.

    Args:
        x (Series): Data point containing a list of labels.

    Returns:
        Whether the data point has `"circle"` but not `"geometry"` as one of 
            its labels.
    """
    circle = "circle" in x["tags"]
    geometry = "geometry" in x["tags"]
    return (circle and not geometry)

trigonometry_not_geometry ¤

trigonometry_not_geometry(x: Series) -> bool

Confirm whether a data point has "trigonometry" but not "geometry" as one of its labels.

Parameters:

  • x (Series) –

    Data point containing a list of labels.

Returns:

  • bool

    Whether the data point has "trigonometry" but not "geometry" as one of its labels.

Source code in tagolym/evaluate.py
@slicing_function()
def trigonometry_not_geometry(x: Series) -> bool:
    """Confirm whether a data point has `"trigonometry"` but not `"geometry"` 
    as one of its labels.

    Args:
        x (Series): Data point containing a list of labels.

    Returns:
        Whether the data point has `"trigonometry"` but not `"geometry"` as 
            one of its labels.
    """
    trigonometry = "trigonometry" in x["tags"]
    geometry = "geometry" in x["tags"]
    return (trigonometry and not geometry)

modular_arithmetic_not_number_theory ¤

modular_arithmetic_not_number_theory(x: Series) -> bool

Confirm whether a data point has "modular arithmetic" but not "number theory" as one of its labels.

Parameters:

  • x (Series) –

    Data point containing a list of labels.

Returns:

  • bool

    Whether the data point has "modular arithmetic" but not "number theory" as one of its labels.

Source code in tagolym/evaluate.py
@slicing_function()
def modular_arithmetic_not_number_theory(x: Series) -> bool:
    """Confirm whether a data point has `"modular arithmetic"` but not 
    `"number theory"` as one of its labels.

    Args:
        x (Series): Data point containing a list of labels.

    Returns:
        Whether the data point has `"modular arithmetic"` but not `"number 
            theory"` as one of its labels.
    """
    modular_arithmetic = "modular arithmetic" in x["tags"]
    number_theory = "number theory" in x["tags"]
    return (modular_arithmetic and not number_theory)

keyword_lookup ¤

keyword_lookup(x: Series, keywords: list) -> bool

Confirm whether a data point's token contains none of the frequent words with more than 3 letters.

Parameters:

  • x (Series) –

    Data point containing a token.

  • keywords (list) –

    Frequent words of four or more letters derived from all tokens.

Returns:

  • bool

    Whether the data point's token contains none of the frequent words with more than 3 letters.

Source code in tagolym/evaluate.py
def keyword_lookup(x: Series, keywords: list) -> bool:
    """Confirm whether a token of a data point doesn't have frequent words 
    with more than 3 characters.

    Args:
        x (Series): Data point containing a token.
        keywords (list): Frequent words of four or more letters derived from
            all tokens.

    Returns:
        Whether the data point's token contains none of the frequent words
            with more than 3 letters.
    """
    return all(word not in x["token"].split() for word in keywords)
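
A quick check with a hypothetical keyword list; in the pipeline the list is built from the data by make_keyword_sf below:

import pandas as pd

keywords = ["prove", "triangle", "integer"]  # hypothetical list for illustration
print(keyword_lookup(pd.Series({"token": "find all real solutions"}), keywords))  # True
print(keyword_lookup(pd.Series({"token": "prove the main claim"}), keywords))     # False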

make_keyword_sf ¤

make_keyword_sf(df: DataFrame) -> SlicingFunction

Create a SlicingFunction object to use the keyword_lookup function.

Parameters:

  • df (DataFrame) –

    Preprocessed data containing tokens and their corresponding labels.

Returns:

  • SlicingFunction

    Python class for slicing functions, i.e. functions that take a data point as input and produce a boolean that states whether or not the data point satisfies some predefined conditions.

Source code in tagolym/evaluate.py
def make_keyword_sf(df: DataFrame) -> SlicingFunction:
    """Create a `SlicingFunction` object to use the [keyword_lookup]
    [evaluate.keyword_lookup] function.

    Args:
        df (DataFrame): Preprocessed data containing tokens and their 
            corresponding labels.

    Returns:
        Python class for slicing functions, i.e. functions that take a data 
            point as input and produce a boolean that states whether or not 
            the data point satisfies some predefined conditions.
    """
    frequent_words = df["token"].str.split(expand=True).stack().value_counts().index[:20]
    keywords = [word for word in frequent_words if len(word) > 3]
    return SlicingFunction(
        name="without_frequent_words",
        f=keyword_lookup,
        resources=dict(keywords=keywords),
    )
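
The returned object is applied alongside the built-in slicing functions via Snorkel's PandasSFApplier, as get_metrics does below. A minimal sketch; the tiny stand-in frame is made up, and real data has many more rows:

import pandas as pd
from snorkel.slicing import PandasSFApplier

# stand-in for the preprocessed frame with a "token" column
df = pd.DataFrame({"token": ["prove the inequality", "find all functions"]})

sf = make_keyword_sf(df)
slices = PandasSFApplier([sf]).apply(df)
print(slices.dtype.names)  # ('without_frequent_words',)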

average_performance ¤

average_performance(y_true: ndarray, y_pred: ndarray, average: Optional[Literal["micro", "macro", "weighted"]] = "weighted") -> dict[str, Union[float, int]]

Compute precision, recall, F-measure, and number of samples from model predictions and true labels.

Parameters:

  • y_true (ndarray) –

    Ground truth (correct) target values.

  • y_pred (ndarray) –

    Estimated targets as returned by the model.

  • average (Optional[Literal], default: "weighted") –

    If None, the scores for each class are returned. Otherwise, this determines the type of averaging performed on the data:

    Average      Description
    "micro"      Calculate metrics globally by counting the total true positives, false negatives and false positives.
    "macro"      Calculate metrics for each label, and find their unweighted mean. This does not take label imbalance into account.
    "weighted"   Calculate metrics for each label, and find their average weighted by support (the number of true instances for each label). This alters "macro" to account for label imbalance; it can result in an F-score that is not between precision and recall.

    Defaults to "weighted".

Returns:

  • dict[str, Union[float, int]]

    Dictionary containing precision, recall, F-measure, and number of samples.

Source code in tagolym/evaluate.py
def average_performance(y_true: ndarray, y_pred: ndarray, average: Optional[Literal["micro", "macro", "weighted"]] = "weighted") -> dict[str, Union[float, int]]:
    """Compute precision, recall, F-measure, and number of samples from model 
    predictions and true labels.

    Args:
        y_true (ndarray): Ground truth (correct) target values.
        y_pred (ndarray): Estimated targets as returned by the model.
        average (Optional[Literal], optional): If `None`, the scores for each 
            class are returned. Otherwise, this determines the type of 
            averaging performed on the data:

            | Average      | Description                                      |
            | ------------ | ------------------------------------------------ |
            | `"micro"`    | Calculate metrics globally by counting the total \
                             true positives, false negatives and false        \
                             positives.                                       |
            | `"macro"`    | Calculate metrics for each label, and find their \
                             unweighted mean. This does not take label        \
                             imbalance into account.                          |
            | `"weighted"` | Calculate metrics for each label, and find their \
                             average weighted by support (the number of true  \
                             instances for each label). This alters `"macro"` \
                             to account for label imbalance; it can result in \
                             an F-score that is not between precision and     \
                             recall.                                          |

            Defaults to `"weighted"`.

    Returns:
        Dictionary containing precision, recall, F-measure, and number of 
            samples.
    """
    metrics = precision_recall_fscore_support(y_true, y_pred, average=average)
    return {
        "precision": metrics[0],
        "recall": metrics[1],
        "f1": metrics[2],
        "num_samples": len(y_true),
    }
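
A toy multilabel example with 3 samples and 2 classes; the indicator arrays are made up:

import numpy as np

y_true = np.array([[1, 0], [1, 1], [0, 1]])
y_pred = np.array([[1, 0], [1, 0], [0, 1]])

print(average_performance(y_true, y_pred))           # default "weighted" average
print(average_performance(y_true, y_pred, "micro"))  # micro average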

get_slice_metrics ¤

get_slice_metrics(y_true: ndarray, y_pred: ndarray, slices: ndarray) -> dict[str, dict]

Apply average_performance with "micro" average to different slices of data.

Parameters:

  • y_true (ndarray) –

    Ground truth (correct) target values.

  • y_pred (ndarray) –

    Estimated targets as returned by the model.

  • slices (ndarray) –

    Slices of data defined by slicing functions.

Returns:

  • dict[str, dict]

    Dictionary containing dictionaries of average performances across slices.

Source code in tagolym/evaluate.py
def get_slice_metrics(y_true: ndarray, y_pred: ndarray, slices: ndarray) -> dict[str, dict]:
    """Apply [average_performance][evaluate.average_performance] with 
    `"micro"` average to different slices of data.

    Args:
        y_true (ndarray): Ground truth (correct) target values.
        y_pred (ndarray): Estimated targets as returned by the model.
        slices (ndarray): Slices of data defined by slicing functions.

    Returns:
        Dictionary containing dictionaries of average performances across 
            slices.
    """
    slice_metrics = {}
    for slice_name in slices.dtype.names:
        mask = slices[slice_name].astype(bool)
        if sum(mask):
            slice_metrics[slice_name] = average_performance(y_true[mask], y_pred[mask], "micro")

    return slice_metrics
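
The slices argument is the structured NumPy array produced by PandasSFApplier.apply, with one 0/1 field per slicing function. A hand-built stand-in with illustrative field names and values:

import numpy as np

slices = np.array(
    [(1, 0), (0, 1), (1, 1)],
    dtype=[("short_post", np.int64), ("without_frequent_words", np.int64)],
)
y_true = np.array([[1, 0], [1, 1], [0, 1]])
y_pred = np.array([[1, 0], [1, 0], [0, 1]])

print(get_slice_metrics(y_true, y_pred, slices))  # one metrics dict per non-empty slice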

get_metrics ¤

get_metrics(y_true: ndarray, y_pred: ndarray, classes: ndarray, df: Optional[DataFrame] = None) -> dict[str, dict]

Compute model performance for the overall data (using "weighted" average), across classes, and across slices (using "micro" average).

Parameters:

  • y_true (ndarray) –

    Ground truth (correct) target values.

  • y_pred (ndarray) –

    Estimated targets as returned by the model.

  • classes (ndarray) –

    The complete set of class labels.

  • df (Optional[DataFrame], default: None) –

    Preprocessed data containing tokens and their corresponding labels. Defaults to None.

Returns:

  • dict[str, dict]

    Dictionary containing dictionaries of average performances for the overall data, across classes, and across slices.

Source code in tagolym/evaluate.py
def get_metrics(y_true: ndarray, y_pred: ndarray, classes: ndarray, df: Optional[DataFrame] = None) -> dict[str, dict]:
    """Compute model performance for the overall data (using "weighted" 
    average), across classes, and across slices (using "micro" average).

    Args:
        y_true (ndarray): Ground truth (correct) target values.
        y_pred (ndarray): Estimated targets as returned by the model.
        classes (ndarray): The complete set of class labels.
        df (Optional[DataFrame], optional): Preprocessed data containing 
            tokens and their corresponding labels. Defaults to None.

    Returns:
        Dictionary containing dictionaries of average performances for the 
            overall data, across classes, and across slices.
    """
    # performance
    performance = {"overall": {}, "class": {}}

    # overall performance
    performance["overall"] = average_performance(y_true, y_pred, "weighted")

    # per-class performance
    metrics = precision_recall_fscore_support(y_true, y_pred, average=None)
    for i in range(len(classes)):
        performance["class"][classes[i]] = {
            "precision": metrics[0][i],
            "recall": metrics[1][i],
            "f1": metrics[2][i],
            "num_samples": metrics[3][i],
        }

    # per-slice performance
    if df is not None:
        slices = PandasSFApplier([
            short_post,
            inequality_not_algebra,
            function_not_algebra,
            polynomial_not_algebra,
            circle_not_geometry,
            trigonometry_not_geometry,
            modular_arithmetic_not_number_theory,
            make_keyword_sf(df),
        ]).apply(df)
        performance["slices"] = get_slice_metrics(y_true, y_pred, slices)

    return performance