
Deep learning and toposes

Judging from this and that paper, deep learning is the string theory of the 2020s for geometers and representation theorists.

If you want to know quickly what neural networks really are, I can recommend the post demystifying deep learning.

The typical layout of a deep neural network has an input layer $L_0$ allowing you to feed $N_0$ numbers to the system (a vector $\vec{v_0} \in \mathbb{R}^{N_0}$), an output layer $L_p$ spitting $N_p$ numbers back (a vector $\vec{v_p} \in \mathbb{R}^{N_p}$), and $p-1$ hidden layers $L_1,\dots,L_{p-1}$ where all the magic happens. The hidden layer $L_i$ has $N_i$ virtual neurons, their states giving a vector $\vec{v_i} \in \mathbb{R}^{N_i}$.



Picture taken from Logical informations cells I

For simplicity let's assume all neurons in layer $L_i$ are wired to every neuron in layer $L_{i+1}$, the relevance of these connections given by a matrix of weights $W_i \in M_{N_{i+1} \times N_i}(\mathbb{R})$.

If at any given moment the 'state' of the neural network is described by the state-vectors $\vec{v_1},\dots,\vec{v_{p-1}}$ and the weight-matrices $W_0,\dots,W_{p-1}$, then an input $\vec{v_0}$ will typically result in new states of the neurons in layer $L_1$ given by

$\vec{v_1}' = c_0(W_0.\vec{v_0}+\vec{v_1})$

which will then give new states in layer $L_2$

$\vec{v_2}' = c_1(W_1.\vec{v_1}'+\vec{v_2})$

and so on, rippling through the network, until we get as the output

$\vec{v_p} = c_{p-1}(W_{p-1}.\vec{v_{p-1}}')$

where all the $c_i$ are fixed smooth activation functions $c_i~:~\mathbb{R}^{N_{i+1}} \rightarrow \mathbb{R}^{N_{i+1}}$.
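
Here is a minimal NumPy sketch of this forward pass (the layer sizes, weights and state-vectors below are made up purely for illustration, and $\tanh$ stands in for all the activation functions $c_i$):

```python
import numpy as np

rng = np.random.default_rng(0)

# hypothetical layer sizes N_0, ..., N_p (here p = 3 transitions)
sizes = [4, 5, 3, 2]

# weight matrices W_i in M_{N_{i+1} x N_i}(R), for i = 0, ..., p-1
weights = [rng.normal(size=(m, n)) for n, m in zip(sizes, sizes[1:])]
# state-vectors v_1, ..., v_{p-1} of the hidden layers
states = [rng.normal(size=n) for n in sizes[1:-1]]

def forward(v0, weights, states, c=np.tanh):
    """v_{i+1}' = c_i(W_i . v_i' + v_{i+1}) on the hidden layers,
    v_p = c_{p-1}(W_{p-1} . v_{p-1}') at the output layer."""
    v = v0
    for W, s in zip(weights[:-1], states):
        v = c(W @ v + s)
    return c(weights[-1] @ v)

print(forward(rng.normal(size=sizes[0]), weights, states))
```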

This is just the dynamic, or forward, working of the network.

The learning happens by comparing the computed output with the expected output, and working backwards through the network to alter slightly the state-vectors in all layers, and the weight-matrices between them. This process is called back-propagation, and involves the gradient descent procedure.
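
For a single linear layer with a squared-error loss (a toy choice of mine, not the precise setup of either paper), one such gradient-descent step might look like:

```python
import numpy as np

def gradient_step(W, a, b, lr=0.05):
    """One gradient-descent update of a single linear layer W,
    decreasing the squared error |W.a - b|^2 for the training pair (a, b)."""
    error = W @ a - b
    return W - lr * np.outer(error, a)   # gradient of 0.5*|W.a - b|^2 w.r.t. W

W = np.zeros((2, 3))
a, b = np.array([1.0, 2.0, 3.0]), np.array([1.0, -1.0])
for _ in range(50):
    W = gradient_step(W, a, b)
print(W @ a)   # close to b by now
```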

Even from this (over)simplified picture it seems doubtful that set-valued (!) toposes are suitable to describe deep neural networks, as the Paris-Huawei-topos-team claims in their recent paper Topos and Stacks of Deep Neural Networks.

Still, there is a vast generalisation of neural networks: learners, developed by Brendan Fong, David Spivak and Remy Tuyeras in their paper Backprop as Functor: A compositional perspective on supervised learning (which btw is an excellent introduction for mathematicians to neural networks).

For any two sets $A$ and $B$, a learner $A \rightarrow B$ is a tuple $(P,I,U,R)$ where

  • $P$ is a set, a parameter space of some functions from $A$ to $B$.
  • $I$ is the interpretation map $I~:~P \times A \rightarrow B$ describing the functions in $P$.
  • $U$ is the update map $U~:~P \times A \times B \rightarrow P$, part of the learning procedure. The idea is that $U(p,a,b)$ is a map which sends $a$ closer to $b$ than the map $p$ did.
  • $R$ is the request map $R~:~P \times A \times B \rightarrow A$, the other part of the learning procedure. The idea is that the new element $R(p,a,b)=a'$ in $A$ is such that $p(a')$ will be closer to $b$ than $p(a)$ was.
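
Packaged as a Python type, such a tuple could look as follows (a sketch with my own naming; the parameter space $P$ appears only implicitly, as the type of the parameter $p$):

```python
from dataclasses import dataclass
from typing import Callable, Generic, TypeVar

A, B, P = TypeVar("A"), TypeVar("B"), TypeVar("P")

@dataclass
class Learner(Generic[P, A, B]):
    """A learner A -> B, packaged as the maps (I, U, R)."""
    I: Callable[[P, A], B]       # interpretation: which function A -> B does p encode?
    U: Callable[[P, A, B], P]    # update: a new parameter sending a closer to b
    R: Callable[[P, A, B], A]    # request: a new input a' with p(a') closer to b
```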

The request map is also crucial in defining the composition of two learners $A \rightarrow B$ and $B \rightarrow C$. $\mathbf{Learn}$ is the (symmetric, monoidal) category with objects all sets and morphisms equivalence classes of learners (defined in the natural way).
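
Concretely, the composite of learners $A \rightarrow B$ and $B \rightarrow C$ has as parameter space the product of the two parameter spaces, and the request map of the second learner supplies the intermediate target for the first. Here is a sketch of this composition, using the Learner class above and paraphrasing (not quoting) the paper's definition:

```python
def compose(f: Learner, g: Learner) -> Learner:
    """Composite learner A -> C of f : A -> B and g : B -> C.
    The parameter is a pair (p, q); g's request supplies the intermediate
    target that f is updated towards."""
    def I(pq, a):
        p, q = pq
        return g.I(q, f.I(p, a))

    def U(pq, a, c):
        p, q = pq
        b = f.I(p, a)
        return (f.U(p, a, g.R(q, b, c)), g.U(q, b, c))

    def R(pq, a, c):
        p, q = pq
        b = f.I(p, a)
        return f.R(p, a, g.R(q, b, c))

    return Learner(I, U, R)
```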

In this way we can view a deep neural network with $p$ layers, as before, to be the composition of $p$ learners
$\mathbb{R}^{N_0} \rightarrow \mathbb{R}^{N_1} \rightarrow \mathbb{R}^{N_2} \rightarrow \dots \rightarrow \mathbb{R}^{N_p}$
where the learner describing the transition from the $i$-th to the $(i+1)$-th layer is given by the equivalence class of data $(A_i,B_i,P_i,I_i,U_i,R_i)$ with
$A_i = \mathbb{R}^{N_i},~B_i = \mathbb{R}^{N_{i+1}},~P_i = M_{N_{i+1} \times N_i}(\mathbb{R}) \times \mathbb{R}^{N_{i+1}}$
and interpretation map for $p = (W_i, \vec{v}_{i+1}) \in P_i$
$I_i(p,\vec{v_i}) = c_i(W_i.\vec{v_i}+\vec{v}_{i+1})$
The update and request maps (encoding back-propagation and gradient-descent in this case) are explicitly given in theorem III.2 of the paper, and they behave functorially (whence the title of the paper).
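
As a rough illustration, one layer of the network can be packaged as a learner like this (taking $c_i = \tanh$; the update and request maps below are only a plain gradient-descent stand-in on a squared error, not the exact maps of theorem III.2):

```python
import numpy as np

def layer_learner(lr=0.05):
    """The learner R^{N_i} -> R^{N_{i+1}} of a single layer, with c_i = tanh.
    Its parameter is p = (W_i, v_{i+1}); I is the interpretation map above,
    while U and R are a gradient-descent sketch on the squared error."""

    def I(p, v):
        W, s = p
        return np.tanh(W @ v + s)

    def delta(p, v, b):          # gradient of 0.5*|I(p,v) - b|^2 through tanh
        y = I(p, v)
        return (y - b) * (1.0 - y ** 2)

    def U(p, v, b):              # nudge the parameter (W_i, v_{i+1}) downhill
        W, s = p
        d = delta(p, v, b)
        return (W - lr * np.outer(d, v), s - lr * d)

    def R(p, v, b):              # request an input that fits b better
        W, _ = p
        return v - lr * (W.T @ delta(p, v, b))

    return Learner(I, U, R)
```

Composing these layer learners with the compose function above then recovers the whole network $\mathbb{R}^{N_0} \rightarrow \dots \rightarrow \mathbb{R}^{N_p}$ as a single learner.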

More generally, we will now associate objects of a topos (actually just sheaves over a simple topological space) to a network of $p$ learners
$A_0 \rightarrow A_1 \rightarrow A_2 \rightarrow \dots \rightarrow A_p$
inspired by section I.2 of Topos and Stacks of Deep Neural Networks.

The underlying category will be the poset-category (the opposite of the ordering of the layers)
0โ†1โ†2โ†โ‹ฏโ†p
Presheaves on a poset form a topos, and in this case it is even the topos of sheaves on the topological space with $p+1$ nested open sets
$X = U_0 \supseteq U_1 \supseteq U_2 \supseteq \dots \supseteq U_p = \emptyset$
If the learner $A_i \rightarrow A_{i+1}$ is (the equivalence class of) the tuple $(A_i,A_{i+1},P_i,I_i,U_i,R_i)$ we will now describe two sheaves $\mathcal{W}$ and $\mathcal{X}$ on the topological space $X$.

$\mathcal{W}$ has as sections $\Gamma(\mathcal{W},U_i) = \prod_{j=i}^{p-1} P_j$ and the obvious projection maps as the restriction maps.

$\mathcal{X}$ has as sections $\Gamma(\mathcal{X},U_i) = A_i \times \Gamma(\mathcal{W},U_i)$ and restriction map to the next smaller open
$\rho^i_{i+1}~:~\Gamma(\mathcal{X},U_i) \rightarrow \Gamma(\mathcal{X},U_{i+1}) \qquad (a_i,(p_i,p')) \mapsto (p_i(a_i),p')$
and all other restriction maps by composition.
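
In code, a section of $\mathcal{X}$ over $U_i$ can be modelled as a pair $(a_i,(p_i,\dots,p_{p-1}))$, and the restriction map as a function consuming the first parameter (a sketch with hypothetical names, reusing the layer learners above):

```python
def restrict(learners, i, section):
    """Restriction map rho^i_{i+1} : Gamma(X, U_i) -> Gamma(X, U_{i+1}).
    A section over U_i is modelled as a pair (a_i, (p_i, ..., p_{p-1}));
    restriction feeds a_i through the i-th learner and drops p_i."""
    a_i, params = section
    p_i, rest = params[0], params[1:]
    return (learners[i].I(p_i, a_i), rest)
```

Restricting step by step down the chain of opens then reproduces the forward propagation of the input through the successive layers.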

A major result in Topos and Stacks of Deep Neural Networks is that back-propagation is a natural transformation, that is, a sheaf-morphism $\mathcal{X} \rightarrow \mathcal{X}$.

In this general setting of layered learners we can always define a map on the sections of $\mathcal{X}$ (for every open $U_i$), $\Gamma(\mathcal{X},U_i) \rightarrow \Gamma(\mathcal{X},U_i)$
$(a_i,(p_i,p')) \mapsto (R(p_i,a_i,p_i(a_i)),(U(p_i,a_i,p_i(a_i)),p'))$
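
Spelled out on the section model used above (again just a sketch with my own helper names), this map reads:

```python
def backprop_on_sections(learners, i, section):
    """The map Gamma(X, U_i) -> Gamma(X, U_i) above: replace the input by the
    requested one and the first parameter by the updated one, with b = p_i(a_i)
    as target, leaving the remaining parameters untouched."""
    a_i, params = section
    p_i, rest = params[0], params[1:]
    b = learners[i].I(p_i, a_i)
    return (learners[i].R(p_i, a_i, b),
            (learners[i].U(p_i, a_i, b),) + tuple(rest))
```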
But, in order for this to define a sheaf-morphism, compatible with the restrictions, we will in general have to impose conditions on the update and request maps of the learners.

Still, in the special case of deep neural networks, this compatibility follows from the functoriality property of Backprop as Functor: A compositional perspective on supervised learning.

To be continued.
