MNIST... Again?

MNIST is a great dataset for classification tasks, but if you want an alternative, check out Kuzushiji-MNIST!
August 17, 2019 - 20:18:22

Header image courtesy of Clanuwat et al. "Deep Learning for Classical Japanese Literature".


If you're in the process of getting into machine learning, you've almost certainly come across the (in?)famous MNIST dataset of hand-written digits. It is a very commonly used introductory dataset for learning about classification tasks (as seen in articles and resources such as the free e-book Neural Networks and Deep Learning), and appropriately so: it's an interesting dataset with features that are easy to work with, especially for beginners.

However visually pleasing these digits are, it can quickly get boring using the same dataset to practice classification tasks with different machine learning models.

This is where Kuzushiji-MNIST comes in: a dataset with the same structure as MNIST, but very different content.

Kuzushiji-MNIST

Kuzushiji-MNIST is a dataset that is structured similarly to the popular MNIST dataset for hand-written digit recognition. It is one of three datasets introduced in "Deep Learning for Classical Japanese Literature" by Tarin Clanuwat et al., and it is regarded as a more challenging classification task than MNIST because of the many variations of each of the 10 characters.

Kuzushiji-MNIST consists of 70,000 examples (28x28 pixel grayscale images) evenly distributed across the 10 character classes. The dataset is already divided into 60,000 training examples and 10,000 test examples.

| Division        | Number of examples | Examples per class |
|-----------------|--------------------|--------------------|
| Training images | 60,000             | 6,000              |
| Testing images  | 10,000             | 1,000              |
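
As a quick sanity check on the numbers above, here is a minimal sketch of how you might verify that a split is balanced across classes. It assumes the labels are integers from 0 to 9, and it uses synthetic label lists in place of the real data. The helper names `class_counts` and `is_balanced` are made up for this illustration and are not part of any Kuzushiji-MNIST tooling.

```python
from collections import Counter

NUM_CLASSES = 10  # Kuzushiji-MNIST has 10 character classes


def class_counts(labels, num_classes=NUM_CLASSES):
    """Return the number of examples per class, indexed by class label."""
    counts = Counter(labels)
    return [counts.get(c, 0) for c in range(num_classes)]


def is_balanced(labels, num_classes=NUM_CLASSES):
    """Return True if every class has exactly the same number of examples."""
    return len(set(class_counts(labels, num_classes))) == 1


# Synthetic stand-ins for the real label arrays. Assumption: labels are the
# integers 0-9. Only the split sizes (60,000 and 10,000) come from the post.
train_labels = [i % NUM_CLASSES for i in range(60_000)]
test_labels = [i % NUM_CLASSES for i in range(10_000)]

for name, labels in [("train", train_labels), ("test", test_labels)]:
    counts = class_counts(labels)
    # Print the split name, its size, the size of class 0, and the balance check.
    print(name, len(labels), counts[0], is_balanced(labels))
```

With the real dataset you would pass the loaded label arrays in place of the synthetic lists. A balanced split matters here because it lets you read plain accuracy as a fair metric without reweighting classes.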

Each of these examples is an image of a cursive hand-written (Kuzushiji) character in Japanese Hiragana.

Hiragana is one of the three scripts of the Japanese writing system (the others being Katakana and Kanji).

Kuzushiji refers to the cursive hand-written variations of Hiragana and Kanji characters. However, the Kuzushiji-MNIST dataset consists only of Kuzushiji Hiragana characters.


The 10 Kuzushiji Hiragana characters - each row represents a single class.
The first column depicts each character's modern Hiragana counterpart. (Source: Clanuwat et al.)

If we look at a two-dimensional projection of the MNIST digits and the Kuzushiji-MNIST characters, we can see that the decision boundaries between the Kuzushiji-MNIST classes are more complicated.


Scatter plot of the UMAP embeddings for the training examples of the MNIST and Kuzushiji-MNIST datasets.
This plot visualizes the difficulty of separating the classes of the Kuzushiji-MNIST dataset when compared to MNIST.

These two-dimensional projections are embeddings generated by the Uniform Manifold Approximation and Projection (UMAP) dimensionality reduction technique, which will be explained in further detail in my next blog post.


For more information on Kuzushiji-MNIST and example classification using multinomial logistic regression and neural network approaches, please have a look at my GitHub repository eonu/kuzushiji-mnist.