In this section we introduce the competition dataset, its format, and the collection process.

Offline Handwritten Formula: For Task 1, we use scanned offline images of handwritten mathematical formulas. These formulas are written on documents of different materials and scanned into offline images. Unlike images in other computer vision tasks, handwritten formula images vary greatly in both width and height. We provide ground truth in both LaTeX format and Symbol Label Graph format for the images in the training and validation sets. In addition, we provide the bounding boxes of all math symbols in each formula image, so that methods such as symbol detection can be applied.
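To illustrate what a symbol-level annotation might look like, here is a minimal sketch in Python. The field names, file names, and bounding-box convention are hypothetical, since the concrete annotation format is not specified above; only the kinds of information (LaTeX transcript, per-symbol labels, and bounding boxes) come from the description.

```python
# Hypothetical symbol-level annotation record for one formula image.
# Field names and the bbox convention [x_min, y_min, x_max, y_max]
# are assumptions for illustration, not the official format.
annotation = {
    "image": "formula_00001.png",   # hypothetical file name
    "latex": r"x^{2}+1",
    "symbols": [
        {"label": "x", "bbox": [12, 30, 40, 72]},
        {"label": "2", "bbox": [44, 10, 60, 32]},
        {"label": "+", "bbox": [70, 35, 95, 60]},
        {"label": "1", "bbox": [105, 30, 125, 72]},
    ],
}

# A simple sanity check: every bounding box should be well-formed.
for sym in annotation["symbols"]:
    x0, y0, x1, y1 = sym["bbox"]
    assert x0 < x1 and y0 < y1
```

Records of this kind make both sequence-level training (from the LaTeX transcript) and detection-style training (from the per-symbol boxes) possible on the same data.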

For Task 1, we will collect two subsets for training. The first subset contains HMEs with the same symbol classes as those used in the ICDAR 2019 CROHME competition, which facilitates comparison with prior research. This subset will contain about 10k images. The second subset will be collected using another set of transcripts and will also contain about 10k images. No explicit training/validation split will be imposed, so participants can use the 20k images flexibly to train their models. For the test set, we will collect another 1,500 expression images. The first subset will be annotated at the symbol level: the LaTeX transcripts, the corresponding label graphs, and the bounding boxes of symbols will all be provided. For the second subset, the LaTeX transcripts and the label graphs will be provided, but the bounding boxes of symbols will not.
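Since no official training/validation split is provided, participants must partition the ~20k images themselves. A minimal sketch of one way to do this, assuming the image identifiers from both subsets have been gathered into a single list (the IDs below are placeholders):

```python
import random

# Placeholder IDs standing in for the combined ~20k training images
# from both subsets; replace with the actual identifiers.
image_ids = [f"img_{i:05d}" for i in range(20000)]

random.seed(0)              # fix the seed so the split is reproducible
random.shuffle(image_ids)

val_ids = image_ids[:2000]   # e.g. hold out 10% for validation
train_ids = image_ids[2000:]
```

The 10% holdout ratio is only an example; participants are free to choose any split, or to train on all 20k images and validate differently.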

For Task 2, formulas in the training set are annotated at the symbol level to support a wider range of methods. The validation set consists of two parts: a query set and a formula image set. For the second subtask, the online strokes of both the training set and the query set are provided. Participants can use the validation set to measure the precision and recall of their systems. The validation sets of Task 2 provide the basic ground truth but do not mark the positions of symbols, so participants can flexibly design training sample combinations and model learning methods.
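The precision and recall mentioned above can be computed per query in the standard way: precision is the fraction of retrieved expressions that are relevant, and recall is the fraction of relevant expressions that are retrieved. A minimal sketch, assuming retrieval results and ground truth are given as collections of expression IDs (the IDs below are placeholders):

```python
def precision_recall(retrieved, relevant):
    """Compute precision and recall for one query.

    retrieved: iterable of retrieved expression IDs (system output)
    relevant:  set of ground-truth relevant expression IDs
    """
    retrieved_set = set(retrieved)
    true_positives = len(retrieved_set & relevant)
    precision = true_positives / len(retrieved_set) if retrieved_set else 0.0
    recall = true_positives / len(relevant) if relevant else 0.0
    return precision, recall

# Example: the system returns 4 candidates, 3 of which are relevant,
# out of 5 relevant expressions in total.
p, r = precision_recall(["e1", "e2", "e3", "e9"],
                        {"e1", "e2", "e3", "e4", "e5"})
# p == 0.75, r == 0.6
```

Averaging these per-query values over the whole query set gives a single summary figure for a retrieval system.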

For Task 2, we will collect about 300 document images from real scenarios, each containing about 10 mathematical expressions, as the database. These images will be annotated at the line/expression level: the bounding boxes of expressions will be annotated, and the LaTeX transcripts and label graphs will be provided, but the bounding boxes of symbols will not be, so these can be viewed as weakly annotated samples. Regarding the queries, we will collect about 200 offline expression images and 200 online expressions for validation, and 300 offline and 300 online samples for testing.