data

All data-related functions live in this module, including data splitting, preprocessing, and transformation.

Definitions

| Term | Definition |
| ---- | ---------- |
| Post | String explaining a math problem, written in LaTeX. |
| Token | Preprocessed post. |
| Tag | User-input string suggesting which category a post belongs to. A post can have multiple tags. |
| Label | Preprocessed tag. Only 10 labels are defined. |

create_tag_mapping

create_tag_mapping(tags: Series) -> defaultdict[str, list]

Create a dictionary in which each key is a tag and each value is a sublist of the complete labels. A tag is mapped only if its lowercased form contains an element of the partial labels as a substring.

Partial labels are defined as

```python
["algebra", "geometr", "number theor", "combinator", "inequalit", 
 "function", "polynomial", "circle", "trigonometr", "modul"]
```

and complete labels are defined as

```python
["algebra", "geometry", "number theory", "combinatorics", "inequality", 
 "function", "polynomial", "circle", "trigonometry", "modular arithmetic"]
```

For example, the tag "combinatorial geometry" will give the key-value pair `{"combinatorial geometry": ["combinatorics", "geometry"]}`.

Parameters:

  • tags (Series) –

    Collection of lists of tags annotated by users.

Returns:

  • defaultdict[str, list]

    Mapping from tag to sublist of complete labels.

Source code in tagolym/data.py
````python
def create_tag_mapping(tags: Series) -> defaultdict[str, list]:
    """Create a dictionary in which each key is a tag and each value is a 
    sublist of complete labels. The mapping is defined if the lowercased tag 
    contains an element of partial labels as its substring.

    Partial labels are defined as 
    ```python
    ["algebra", "geometr", "number theor", "combinator", "inequalit", 
     "function", "polynomial", "circle", "trigonometr", "modul"]
    ```
    and complete labels are defined as
    ```python
    ["algebra", "geometry", "number theory", "combinatorics", "inequality", 
     "function", "polynomial", "circle", "trigonometry", "modular arithmetic"]
    ```

    For example, the tag `["combinatorial geometry"]` will give a key-value 
    pair `{"combinatorial geometry": ["combinatorics", "geometry"]}`.

    Args:
        tags (Series): Collection of list of tags annotated by users.

    Returns:
        Mapping from tag to sublist of complete labels.
    """
    mappings = []
    for plb, clb in zip(config.PARTIAL_LABELS, config.COMPLETE_LABELS):
        similar_tags = set([t.lower() for tag in tags for t in tag if plb in t.lower()])
        mappings.append({tag: clb for tag in similar_tags})

    mapping = defaultdict(list)
    for mpg in mappings:
        for key, value in mpg.items():
            mapping[key].append(value)

    return mapping
````
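To see the substring rule in action, the core of the mapping logic can be sketched standalone. The label lists are copied from the definitions above; `map_tag` is a hypothetical helper, not part of the module:

```python
PARTIAL_LABELS = ["algebra", "geometr", "number theor", "combinator", "inequalit",
                  "function", "polynomial", "circle", "trigonometr", "modul"]
COMPLETE_LABELS = ["algebra", "geometry", "number theory", "combinatorics", "inequality",
                   "function", "polynomial", "circle", "trigonometry", "modular arithmetic"]

def map_tag(tag: str) -> list:
    # Collect every complete label whose partial label occurs in the lowercased tag
    return [clb for plb, clb in zip(PARTIAL_LABELS, COMPLETE_LABELS) if plb in tag.lower()]

print(sorted(map_tag("Combinatorial Geometry")))  # ['combinatorics', 'geometry']
print(map_tag("calculus"))                        # []
```

A tag matching none of the partial labels simply yields an empty list, which is why unmapped tags never appear as keys in the returned dictionary.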

preprocess_tag

preprocess_tag(x: list, mapping: defaultdict[str, list]) -> list

Preprocess a list of tags, including: lowercasing, mapping to complete labels, dropping duplicates, and sorting.

Parameters:

  • x (list) –

    List of tags annotated by users.

  • mapping (defaultdict[str, list]) –

    Mapping from tag to sublist of complete labels.

Returns:

  • list

    Preprocessed list of tags.

Source code in tagolym/data.py
```python
def preprocess_tag(x: list, mapping: defaultdict[str, list]) -> list:
    """Preprocess a list of tags, including: lowercasing, mapping to complete 
    labels, dropping duplicates, and sorting.

    Args:
        x (list): List of tags annotated by users.
        mapping (defaultdict[str, list]): Mapping from tag to sublist of 
            complete labels.

    Returns:
        Preprocessed list of tags.
    """
    x = [tag.lower() for tag in x]       # lowercase all
    x = map(mapping.get, x)              # map tags
    x = filter(None, x)                  # remove None
    x = [t for tag in x for t in tag]    # flattened tags
    x = sorted(list(set(x)))             # drop duplicates and sort
    return x
```
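For a runnable illustration, the same logic can be restated without the package imports and applied to a hypothetical mapping of the shape `create_tag_mapping` returns:

```python
from collections import defaultdict

def preprocess_tag(x: list, mapping) -> list:
    x = [tag.lower() for tag in x]       # lowercase all
    x = map(mapping.get, x)              # unknown tags map to None
    x = filter(None, x)                  # remove None
    x = [t for tag in x for t in tag]    # flatten sublists
    return sorted(set(x))                # drop duplicates and sort

# Hypothetical mapping, standing in for create_tag_mapping's output
mapping = defaultdict(list, {
    "algebra": ["algebra"],
    "combinatorial geometry": ["geometry", "combinatorics"],
})

print(preprocess_tag(["Algebra", "Combinatorial Geometry", "calculus"], mapping))
# ['algebra', 'combinatorics', 'geometry']
```

Note that the unrecognized tag "calculus" is silently dropped by the `filter(None, ...)` step rather than raising an error.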

extract_features

extract_features(equation_pattern: str, x: str) -> str

Extract LaTeX commands inside math modes from a given text.

For example, this rendered post

Find all functions \(f:(0,\infty)\rightarrow (0,\infty)\) such that for any \(x,y\in (0,\infty)\), $$ xf(x^2)f(f(y)) + f(yf(x)) = f(xy) \left(f(f(x^2)) + f(f(y^2))\right). $$

will become

Find all functions \infty \infty such that for any \in \infty , \left

Parameters:

  • equation_pattern (str) –

    Regex pattern for finding math modes.

  • x (str) –

    Input text written in LaTeX.

Returns:

  • str

    Text with extracted LaTeX commands.

Source code in tagolym/data.py
```python
def extract_features(equation_pattern: str, x: str) -> str:
    r"""Extract LaTeX commands inside math modes from a given text.

    For example, this render
    > Find all functions $f:(0,\infty)\rightarrow (0,\infty)$ such that for 
    > any $x,y\in (0,\infty)$, 
    > $$
    > xf(x^2)f(f(y)) + f(yf(x)) = f(xy) \left(f(f(x^2)) + f(f(y^2))\right).
    > $$

    will become
    > Find all functions  \infty \infty  such that for any  \in \infty ,  \left

    Args:
        equation_pattern (str): Regex pattern for finding math modes.
        x (str): Input text written in LaTeX.

    Returns:
        Text with extracted LaTeX commands.
    """
    pattern = re.findall(equation_pattern, x)
    ptn_len = [len(ptn) for ptn in pattern]
    pattern = ["".join(ptn) for ptn in pattern]
    syntax = [" ".join(re.findall(r"\\(?:[^a-zA-Z]|[a-zA-Z]+[*=']?)", ptn)) for ptn in pattern]
    split = ["" if s is None else s for s in re.split(equation_pattern, x)]

    i = 0
    for ptn, length, cmd in zip(pattern, ptn_len, syntax):
        while "".join(split[i : i + length]) != ptn:
            i += 1
        split[i : i + length] = [cmd]

    return " ".join(split)
```

preprocess_post

preprocess_post(x: str, nocommand: bool = False, stem: bool = False) -> str

Deep clean a post, using extract_features as one of the steps.

Parameters:

  • x (str) –

    Post written in LaTeX.

  • nocommand (bool, default: False ) –

    Whether to remove command words, i.e. ["prove", "let", "find", "show", "given"].

  • stem (bool, default: False ) –

    Whether to apply word stemming.

Returns:

  • str

    Cleaned post.

Source code in tagolym/data.py
```python
def preprocess_post(x: str, nocommand: bool = False, stem: bool = False) -> str:
    """Deep clean a post, using [extract_features][data.extract_features] as 
    one of the steps.

    Args:
        x (str): Post written in LaTeX.
        nocommand (bool, optional): Whether to remove command words, i.e. 
            `["prove", "let", "find", "show", "given"]`.
        stem (bool, optional): Whether to apply word stemming.

    Returns:
        Cleaned post.
    """
    x = x.lower()                                       # lowercase all
    x = re.sub(r"http\S+", "", x)                       # remove URLs
    x = x.replace("$$$", "$$ $")                        # separate triple dollars
    x = x.replace("\n", " ")                            # remove new lines
    x = extract_features(config.EQUATION_PATTERN, x)    # extract latex
    x = re.sub(config.ASYMPTOTE_PATTERN, "", x)         # remove asymptote

    # remove stopwords
    x = x.replace("\\", " \\")
    x = " ".join(word for word in x.split() if word not in config.STOPWORDS)

    x = re.sub(r"([-;.,!?<=>])", r" \1 ", x)            # separate filters from words
    x = re.sub("[^A-Za-z0-9]+", " ", x)                 # remove non-alphanumeric chars

    # clean command words
    if nocommand:
        x = " ".join(word for word in x.split() if word not in config.COMMANDS)

    # stem words
    if stem:
        stemmer = PorterStemmer()
        x = " ".join(stemmer.stem(word) for word in x.split())

    # remove spaces at the beginning and end
    x = x.strip()

    return x
```
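A trimmed-down, dependency-free sketch of a few of the steps above (the config-driven steps, i.e. LaTeX extraction, asymptote removal, and stopword/command filtering, are omitted; `clean_lite` is a hypothetical name):

```python
import re

def clean_lite(x: str) -> str:
    x = x.lower()                              # lowercase all
    x = re.sub(r"http\S+", "", x)              # remove URLs
    x = x.replace("\n", " ")                   # remove new lines
    x = re.sub(r"([-;.,!?<=>])", r" \1 ", x)   # separate filters from words
    x = re.sub("[^A-Za-z0-9]+", " ", x)        # remove non-alphanumeric chars
    return x.strip()

print(clean_lite("Prove that x^2 >= 0.\nSee https://example.com"))
# prove that x 2 0 see
```

The URL disappears entirely, and all punctuation and math symbols collapse into single spaces before the final strip.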

preprocess

preprocess(df: DataFrame, nocommand: bool, stem: bool) -> DataFrame

End-to-end data preprocessing on all posts and their corresponding tags, dropping all data points with an empty preprocessed post.

Parameters:

  • df (DataFrame) –

    Raw data containing posts and their corresponding tags.

  • nocommand (bool) –

    Whether to remove command words, i.e. ["prove", "let", "find", "show", "given"].

  • stem (bool) –

    Whether to apply word stemming.

Returns:

  • DataFrame

    Preprocessed data used for modeling.

Source code in tagolym/data.py
```python
def preprocess(df: DataFrame, nocommand: bool, stem: bool) -> DataFrame:
    """End-to-end data preprocessing on all posts and their corresponding 
    tags, then drop all data points with an empty preprocessed post afterward.

    Args:
        df (DataFrame): Raw data containing posts and their corresponding tags.
        nocommand (bool): Whether to remove command words, i.e. `["prove", 
            "let", "find", "show", "given"]`.
        stem (bool): Whether to apply word stemming.

    Returns:
        Preprocessed data used for modeling.
    """
    mapping = create_tag_mapping(df["tags"])
    df["token"] = df["post_canonical"].apply(preprocess_post, args=(nocommand, stem))
    df["tags"] = df["tags"].apply(preprocess_tag, args=(mapping,))
    df = df[df["token"] != ""].reset_index(drop=True)
    return df
```
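The shape of this pipeline can be sketched without pandas: tokenize every post, then drop rows whose token came out empty. Here `toy_preprocess_post` is a hypothetical stand-in for the real `preprocess_post`:

```python
import re

def toy_preprocess_post(x: str) -> str:
    # Stand-in cleaner: lowercase, then collapse non-alphanumeric runs to spaces
    return re.sub("[^A-Za-z0-9]+", " ", x.lower()).strip()

rows = [
    {"post_canonical": "Prove that $x^2 \\ge 0$.", "tags": ["Algebra"]},
    {"post_canonical": "$$ $$", "tags": ["geometry"]},   # cleans to nothing
]

for row in rows:
    row["token"] = toy_preprocess_post(row["post_canonical"])

# Drop rows whose token is empty, mirroring the last step above
rows = [row for row in rows if row["token"] != ""]
print([row["token"] for row in rows])  # ['prove that x 2 ge 0']
```

The empty-token filter matters because a post consisting only of math delimiters would otherwise contribute a blank document to the model.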

binarize

binarize(labels: Series) -> tuple[ndarray, Transformer]

Convert labels into a binary matrix of size (n_samples, n_labels) indicating the presence of a complete label. For example, the labels ["algebra", "inequality"] will be transformed into [1, 0, 0, 0, 0, 1, 0, 0, 0, 0]. Besides returning the transformed labels, it also returns the MultiLabelBinarizer object used later in downstream processes for converting the matrix back to labels.

Parameters:

  • labels (Series) –

    Collection of lists of preprocessed tags.

Returns:

  • label_indicator ( ndarray ) –

    Binary matrix representation of labels.

  • mlb ( Transformer ) –

    Transformer that converts labels to label_indicator.

Source code in tagolym/data.py
```python
def binarize(labels: Series) -> tuple[ndarray, Transformer]:
    """Convert labels into a binary matrix of size `(n_samples, n_labels)` 
    indicating the presence of a complete label. For example, the labels 
    `["algebra", "inequality"]` will be transformed into `[1, 0, 0, 0, 0, 1, 
    0, 0, 0, 0]`. Besides returning the transformed labels, it also returns 
    the `MultiLabelBinarizer` object used later in downstream processes for 
    converting the matrix back to labels.

    Args:
        labels (Series): Collection of list of preprocessed tags.

    Returns:
        label_indicator: Binary matrix representation of `labels`.
        mlb: Transformer that converts `labels` to `label_indicator`.
    """
    mlb = MultiLabelBinarizer()
    label_indicator = mlb.fit_transform(labels)
    return label_indicator, mlb
```
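The indicator layout can be mimicked without scikit-learn to see why `["algebra", "inequality"]` lands in columns 0 and 5: `MultiLabelBinarizer` sorts its classes, and this hand-rolled sketch does the same (the real transformer is `sklearn.preprocessing.MultiLabelBinarizer`):

```python
# The 10 complete labels, in the sorted order MultiLabelBinarizer would use
CLASSES = sorted(["algebra", "geometry", "number theory", "combinatorics",
                  "inequality", "function", "polynomial", "circle",
                  "trigonometry", "modular arithmetic"])

def binarize_row(labels: list) -> list:
    # One indicator column per class, in sorted class order
    return [1 if c in labels else 0 for c in CLASSES]

print(binarize_row(["algebra", "inequality"]))
# [1, 0, 0, 0, 0, 1, 0, 0, 0, 0]
```

After sorting, "algebra" is class 0 and "inequality" is class 5, which reproduces the matrix row quoted in the description above.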

split_data

split_data(X: DataFrame, y: ndarray, train_size: float = 0.7, random_state: Optional[RandomState] = None) -> Iterable[Union[DataFrame, ndarray]]

Using utils.IterativeStratification, split the tokens and their corresponding labels into 3 parts with (customizable) 70/15/15 proportions, each respectively for model training, validation, and testing.

Parameters:

  • X (DataFrame) –

    Preprocessed posts.

  • y (ndarray) –

    Binarized labels.

  • train_size (float, default: 0.7 ) –

    Fraction of training data. Defaults to 0.7.

  • random_state (Optional[RandomState], default: None ) –

    Controls the shuffling applied to the data before applying the split. Pass an int for reproducible output across multiple function calls. Defaults to None.

Returns:

  • Iterable[Union[DataFrame, ndarray]]

    Tuple containing train-validation-test split of tokens and labels.

Source code in tagolym/data.py
```python
def split_data(X: DataFrame, y: ndarray, train_size: float = 0.7, random_state: Optional[RandomState] = None) -> Iterable[Union[DataFrame, ndarray]]:
    """Using [utils.IterativeStratification][], split the tokens and their 
    corresponding labels into 3 parts with (customizable) 70/15/15 
    proportions, each respectively for model training, validation, and testing.

    Args:
        X (DataFrame): Preprocessed posts.
        y (ndarray): Binarized labels.
        train_size (float, optional): Fraction of training data. Defaults to 
            0.7.
        random_state (Optional[RandomState], optional): Controls the shuffling 
            applied to the data before applying the split. Pass an int for 
            reproducible output across multiple function calls. Defaults to 
            None.

    Returns:
        Tuple containing train-validation-test split of tokens and labels.
    """
    stratifier = utils.IterativeStratification(
        n_splits=3,
        order=2,
        sample_distribution_per_fold=[train_size, (1-train_size)/2, (1-train_size)/2],
        random_state=random_state,
    )

    indices = []
    for _, idx in stratifier.split(X, y):
        indices.append(idx.tolist())

    X_train, y_train = X.iloc[indices[0]], y[indices[0]]
    X_val, y_val = X.iloc[indices[1]], y[indices[1]]
    X_test, y_test = X.iloc[indices[2]], y[indices[2]]

    return X_train, X_val, X_test, y_train, y_val, y_test
```
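A quick sanity check of the fold proportions passed to `IterativeStratification`: the training fraction plus two equal halves of the remainder always sums to 1, regardless of `train_size`:

```python
# Fold proportions as constructed in split_data above
train_size = 0.7
fold_distribution = [train_size, (1 - train_size) / 2, (1 - train_size) / 2]

assert abs(sum(fold_distribution) - 1.0) < 1e-9
print([round(f, 2) for f in fold_distribution])  # [0.7, 0.15, 0.15]
```

So passing `train_size=0.8`, for instance, would yield an 80/10/10 split instead of the default 70/15/15.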