

Learners and Poly

Brendan Fong, David Spivak and Remy Tuyeras cooked up a vast generalisation of neural networks in their paper Backprop as Functor: A compositional perspective on supervised learning.

Here's a nice introduction to neural networks for category theorists by Bruno Gavranovic. At 1.49m he tries to explain supervised learning with neural networks in one slide. Learners show up later in the talk.

Poly is the category of all polynomial functors, that is, things of the form
$$p = \sum_{i \in p(1)} y^{p[i]}\ :\ \mathbf{Sets} \rightarrow \mathbf{Sets} \qquad S \mapsto \bigsqcup_{i \in p(1)} Maps(p[i],S)$$
with $p(1)$ and all $p[i]$ sets.

Last time I gave Spivak's 'corolla' picture to think about such functors.

I prefer to view $p \in$ Poly as a horribly discrete 'sheaf' $\mathcal{P}$ over the 'space' $p(1)$ with stalk $p[i] = \mathcal{P}_i$ at the point $i \in p(1)$.



A morphism $p \rightarrow q$ in Poly is a map $\varphi_1 : p(1) \rightarrow q(1)$, together with, for all $i \in p(1)$, a map $\varphi^{\#}_i : q[\varphi_1(i)] \rightarrow p[i]$.

In the sheaf picture, this gives a map of sheaves over the space $p(1)$ from the inverse image sheaf $\varphi_1^{\ast} \mathcal{Q}$ to $\mathcal{P}$.



But, unless you dream of sheaves in the night, by all means stick to Spivak's corolla picture.

A learner $A \rightarrow B$ between two sets $A$ and $B$ is a complicated tuple of things $(P,I,U,R)$:

  • $P$ is a set, a parameter space of some maps from $A$ to $B$.
  • $I$ is the interpretation map $I : P \times A \rightarrow B$ describing the maps in $P$.
  • $U$ is the update map $U : P \times A \times B \rightarrow P$, the learning procedure. The idea is that $U(p,a,b)$ is a map which sends $a$ closer to $b$ than the map $p$ did.
  • $R$ is the request map $R : P \times A \times B \rightarrow A$.
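
To make the tuple concrete, here is a minimal Python sketch of such a learner between two small finite sets. All names and the toy maps are mine, chosen only for illustration; they are not taken from the paper.

```python
from dataclasses import dataclass
from typing import Callable, FrozenSet

@dataclass(frozen=True)
class Learner:
    """A learner A -> B as a tuple (P, I, U, R); an illustrative sketch."""
    P: FrozenSet          # parameter space
    I: Callable           # I : P x A -> B, interpretation
    U: Callable           # U : P x A x B -> P, update
    R: Callable           # R : P x A x B -> A, request

# A toy learner {0,1} -> {0,1}: parameters are the two constant maps and the identity.
A = B = frozenset({0, 1})
P = frozenset({"const0", "const1", "id"})

def interpret(p, a):
    return {"const0": 0, "const1": 1, "id": a}[p]

def update(p, a, b):
    # if the current map misses the target, switch to the constant map hitting b
    return p if interpret(p, a) == b else f"const{b}"

def request(p, a, b):
    # ask for an input on which the current parameter already produces b, if any
    return next((x for x in A if interpret(p, x) == b), a)

toy = Learner(P, interpret, update, request)
print(toy.I("id", 1), toy.U("const0", 1, 1), toy.R("const0", 0, 1))  # -> 1 const1 0
```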

Here's a nice application of Poly's set-up:

Morphisms $Py^P \rightarrow Maps(A,B) \times Maps(A \times B,A)\ y^{A \times B}$ in Poly coincide with learners $A \rightarrow B$ with parameter space $P$.

This follows from unpacking the definition of morphism in Poly and the process CT-ers prefer to call Currying.

The space-map $\varphi_1 : P \rightarrow Maps(A,B) \times Maps(A \times B,A)$ gives us the interpretation and request-map, whereas the sheaf-map $\varphi^{\#}$ gives us the more mysterious update-map $P \times A \times B \rightarrow P$.
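
Here is a short sketch of that unpacking in Python (the function names are mine): the space-map sends $p$ to the pair of curried maps $(I(p,-), R(p,-,-))$, and the backward map on directions is exactly the update.

```python
# Currying back and forth between a learner (I, U, R) and a Poly-morphism
# (phi1, phi_sharp) : P y^P -> C y^(A x B); an illustrative sketch.
def learner_to_poly_morphism(I, U, R):
    def phi1(p):
        # a point of C = Maps(A,B) x Maps(A x B, A): curried interpretation and request
        return (lambda a: I(p, a), lambda a, b: R(p, a, b))
    def phi_sharp(p, ab):
        # backwards on directions: an element (a,b) of A x B lands in the stalk P over p
        a, b = ab
        return U(p, a, b)
    return phi1, phi_sharp

def poly_morphism_to_learner(phi1, phi_sharp):
    I = lambda p, a: phi1(p)[0](a)
    R = lambda p, a, b: phi1(p)[1](a, b)
    U = lambda p, a, b: phi_sharp(p, (a, b))
    return I, U, R
```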

$\mathbf{Learn}(A,B)$ is the category with objects all the learners $A \rightarrow B$ (for all parameter-sets $P$), and with morphisms defined naturally, that is, maps between the parameter-sets, compatible with the structural maps.

A surprising result from David Spivak's paper Learners' Languages is

$\mathbf{Learn}(A,B)$ is a topos. In fact, it is the topos of all set-valued representations of a (huge) directed graph $G_{AB}$.

This will take some time.

Let's bring some dynamics in. Take any polynomial functor $p \in$ Poly and fix a morphism in Poly
$$\varphi = (\varphi_1, \varphi[-])\ :\ p(1) y^{p(1)} \rightarrow p$$
with space-map $\varphi_1$ the identity map.

We form a directed graph:

  • the vertices are the elements of $p(1)$,
  • vertex $i \in p(1)$ is the source vertex of exactly one arrow for every $a \in p[i]$,
  • the target vertex of that arrow is the vertex $\varphi[i](a) \in p(1)$.

Here's one possibility from Spivak's paper for $p = 2y^2 + 1$, with the coefficient $2$-set {green dot, yellow dot}, and with $1$ the singleton {red dot}.



Start at one vertex and move, after a minute, along a directed edge to the next (possibly the same) vertex. The potential evolutions in time will then form a tree, with each node given a label in $p(1)$.
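
Here is a small Python sketch of this unrolling for $p = 2y^2 + 1$. The positions, direction sets and the map $\varphi[-]$ below are my own illustrative choices (they need not match the arrows in Spivak's picture).

```python
# Dynamics of a morphism p(1) y^{p(1)} -> p for p = 2y^2 + 1, with illustrative data.
positions = ["green", "yellow", "red"]
directions = {"green": ["a", "b"], "yellow": ["a", "b"], "red": []}

# phi[i] : p[i] -> p(1), the target of each arrow (an arbitrary choice for this sketch)
phi = {
    ("green", "a"): "green", ("green", "b"): "yellow",
    ("yellow", "a"): "red",  ("yellow", "b"): "green",
}

edges = [(i, a, phi[(i, a)]) for i in positions for a in directions[i]]
print(edges)

def evolution_tree(i, depth):
    """The p(1)-labelled tree of potential time-evolutions starting at vertex i."""
    if depth == 0 or not directions[i]:
        return (i,)
    return (i, tuple(evolution_tree(phi[(i, a)], depth - 1) for a in directions[i]))

print(evolution_tree("green", 3))
```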

If we start at the green dot, we get this tree of potential time-evolutions



There are exactly $\# p[i]$ branches leaving a node labeled $i \in p(1)$, and all subtrees emanating from equally labelled nodes are isomorphic.

If we had started at the yellow dot we would have obtained a labelled tree isomorphic to the subtree emanating here from any yellow dot.

We can do the same thing for any morphism in Poly of the form
$$\varphi = (\varphi_1, \varphi[-])\ :\ S y^S \rightarrow p$$
Now we have a directed graph with vertices the elements $s \in S$, with as many edges leaving vertex $s$ as there are elements $a \in p[\varphi_1(s)]$, and with the target vertex of the edge labeled $a$ starting at $s$ the vertex $\varphi[\varphi_1(s)](a)$.

Once we have this directed graph on $\# S$ vertices we can label vertex $s$ with the label $\varphi_1(s)$ from $p(1)$.

In this way, the time evolutions starting at a vertex $s \in S$ will give us a $p(1)$-labelled rooted tree.

But now, it is possible that two distinct vertices have the same $p(1)$-labeled tree of evolutions, and also that trees corresponding to equally labeled vertices are different.

Right, I guess we're ready to define the graph $G_{AB}$ and prove that $\mathbf{Learn}(A,B)$ is a topos.

In the case of learners, we have the target polynomial functor $p = C y^{A \times B}$ with $C = Maps(A,B) \times Maps(A \times B,A)$, that is
$$p(1) = C \quad \text{and all} \quad p[i] = A \times B$$

Start with the free rooted tree $T$ having exactly $\#(A \times B)$ branches growing from each node.

Here's the directed graph $G_{AB}$:

  • vertices $v_{\chi}$ correspond to the different $C$-labelings of $T$, one $C$-labeled rooted tree $T_{\chi}$ for every map $\chi : vtx(T) \rightarrow C$,
  • there is an arrow $v_{\chi} \rightarrow v_{\omega}$ if and only if $T_{\omega}$ is the rooted $C$-labelled tree isomorphic to the subtree of $T_{\chi}$ rooted at one step from the root.

A learner $A \rightarrow B$ gives a set-valued representation of $G_{AB}$.

We saw that a learner $A \rightarrow B$ is the same thing as a morphism in Poly
$$\varphi = (\varphi_1, \varphi[-])\ :\ P y^P \rightarrow C y^{A \times B}$$
with $P$ the parameter set of maps.

Here's what we have to do:

1. Draw the directed graph on vertices $p \in P$ giving the dynamics of the morphism $\varphi$. This graph describes how the learner can cycle through the parameter-set.

2. Use the map $\varphi_1$ to label the vertices with elements from $C$.



3. For each vertex draw the rooted $C$-labeled tree of potential time-evolutions starting at that vertex.

In this example the time-evolutions of the two green vertices are the same, but in general they can be different.



4. Find the vertices in $G_{AB}$ determined by these $C$-labeled trees and note that they span a full subgraph of $G_{AB}$.



5. The vertex-set $P_v$ consists of all elements from $P$ whose ($C$-labeled) vertex has evolution-tree $T_v$. If $v \rightarrow w$ is a directed edge in $G_{AB}$ corresponding to an element $(a,b) \in A \times B$, then the map on the vertex-sets corresponding to this edge is
$$f_{v,(a,b)}\ :\ P_v \rightarrow P_w \qquad p \mapsto \varphi[\varphi_1(p)](a,b)$$
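
For finite $A$, $B$ and $P$ this construction can be carried out mechanically, provided we truncate the (infinite) evolution-trees at some finite depth. A sketch, with names of my own choosing:

```python
from itertools import product

def vertex_sets(P, A, B, I, U, R, depth=4):
    """Group parameters by their depth-truncated C-labelled evolution tree and
    return the blocks P_v together with the edge maps f_{v,(a,b)}; a sketch only."""
    def label(p):          # the C-label phi_1(p), tabulated since A and B are finite
        return (tuple(I(p, a) for a in A),
                tuple(R(p, a, b) for a, b in product(A, B)))
    def tree(p, d):        # truncated tree of potential evolutions starting at p
        if d == 0:
            return label(p)
        return (label(p), tuple(tree(U(p, a, b), d - 1) for a, b in product(A, B)))
    blocks = {}            # truncated evolution tree  ->  vertex-set P_v
    for p in P:
        blocks.setdefault(tree(p, depth), set()).add(p)
    edge_maps = {(v, (a, b)): {p: U(p, a, b) for p in Pv}
                 for v, Pv in blocks.items() for a, b in product(A, B)}
    return blocks, edge_maps
```

For the toy learner from the first sketch, `vertex_sets(sorted(P), sorted(A), sorted(B), interpret, update, request)` groups the three parameters according to their (truncated) evolution-trees.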



A set-valued representation of $G_{AB}$ gives a learner $A \rightarrow B$.

1. Take a set-valued representation of $G_{AB}$, that is, the finite or infinite collection of vertices $V$ in $G_{AB}$ where the vertex-set $P_v$ is non-empty. Note that these vertices span a full subgraph of $G_{AB}$.

And, for each directed arrow $v \rightarrow w$ in this subgraph, labeled by an element $(a,b) \in A \times B$, we have a map
$$f_{v,(a,b)}\ :\ P_v \rightarrow P_w$$

2. The parameter set of our learner will be $P = \bigsqcup_v P_v$, the disjoint union of the non-empty vertex-sets.

3. The space-map $\varphi_1 : P \rightarrow C$ will send an element in $P_v$ to the $C$-label of the root of the tree $T_v$. This already gives us the interpretation and request maps
$$I : P \times A \rightarrow B \quad \text{and} \quad R : P \times A \times B \rightarrow A$$

4. The update map $U : P \times A \times B \rightarrow P$ follows from the sheaf-map, which we can define stalk-wise as
$$\varphi[\varphi_1(p)](a,b) = f_{v,(a,b)}(p)$$
if $p \in P_v$.

That's all folks!

$\mathbf{Learn}(A,B)$ is equivalent to the category of (covariant) functors $G_{AB} \rightarrow \mathbf{Sets}$.

Changing the directions of all arrows in $G_{AB}$, any covariant functor $G_{AB} \rightarrow \mathbf{Sets}$ becomes a contravariant functor $G_{AB}^{o} \rightarrow \mathbf{Sets}$, making $\mathbf{Learn}(A,B)$ an honest-to-Groth topos!

Every topos comes with its own logic, so we have a 'learners' logic'. (to be continued)


Poly

Following up on the deep learning and toposes-post, I was planning to do something on the logic of neural networks.

Prepping for this I saw David Spivak's paper Learners' Languages doing exactly that, but in the more general setting of 'learners' (see also the deep learning post).

And then … I fell under the spell of Poly.

Spivak is a story-telling talent. A long time ago I copied his short story (actually his abstract for a talk) "Presheaf, the cobbler" in the Children have always loved colimits-post.

Last week, he did post Poly makes me happy and smart on the blog of the Topos Institute, which is another great read.

If this is way too 'fluffy' for you, perhaps you should watch his talk Poly: a category of remarkable abundance.

If you like (applied) category theory and have some days to waste, you can binge-watch all 15 episodes of the Poly-course Polynomial Functors: A General Theory of Interaction.

If you are more the reading-type, the 273 pages of the Poly-book will also kill a good number of your living hours.

Personally, I have no great appetite for category theory; I prefer to digest it in homeopathic doses. And, I'm allergic to co-terminology.

So then, how to define Poly for the likes of me?

Poly, you might have surmised, is a category. So, we need 'objects' and 'morphisms' between them.

Any set $A$ has a corresponding 'representable functor' sending a given set $S$ to the set of all maps from $A$ to $S$
$$y^A\ :\ \mathbf{Sets} \rightarrow \mathbf{Sets} \qquad S \mapsto S^A = Maps(A,S)$$
This looks like a monomial in a variable y (y for Yoneda, of course), but does it work?

What is $y^1$, where $1$ stands for the one-element set $\{ \ast \}$? $Maps(1,S) = S$, so $y^1$ is the identity functor sending $S$ to $S$.

What is $y^0$, where $0$ is the empty set $\emptyset$? Well, for any set $S$ there is just one map $\emptyset \rightarrow S$, so $y^0$ is the constant functor sending any set $S$ to $1$. That is, $y^0 = 1$.

Going from monomials to polynomials we need an addition. We add such representable functors by taking disjoint unions (finite or infinite), that is
$$\sum_{i \in I} y^{A_i}\ :\ \mathbf{Sets} \rightarrow \mathbf{Sets} \qquad S \mapsto \bigsqcup_{i \in I} Maps(A_i,S)$$
If all $A_i$ are equal (meaning, they have the same cardinality) we use the shorthand $Iy^A$ for this sum.

The objects in Poly are exactly these 'polynomial functors'
$$p = \sum_{i \in I} y^{p[i]}$$
with all $p[i] \in \mathbf{Sets}$. Remark that $p(1) = I$ as for any set $A$ there is just one map to $1$, that is, $y^A(1) = Maps(A,1) = 1$, and we can write
$$p = \sum_{i \in p(1)} y^{p[i]}$$
An object $p \in$ Poly is thus described by the couple $(p(1), p[-])$ with $p(1)$ a set, and a functor $p[-] : p(1) \rightarrow \mathbf{Sets}$ where $p(1)$ is now a category with objects the elements of $p(1)$ and no morphisms apart from the identities.
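
For finite data one can play with this description directly. Here is a small Python sketch (the encoding is mine) storing $p$ as the couple $(p(1), p[-])$ and evaluating $p(S) = \bigsqcup_{i \in p(1)} Maps(p[i],S)$, using the example $p = y^2 + 2y + 1$ that reappears below.

```python
from itertools import product

# A finite polynomial functor, encoded as the couple (p(1), p[-]):
# a dict sending each position i in p(1) to its direction set p[i].
p = {"i1": ["x", "y"], "i2": ["x"], "i3": ["x"], "i4": []}    # p = y^2 + 2y + 1

def evaluate(p, S):
    """p(S): the disjoint union over i in p(1) of the set of maps p[i] -> S."""
    return [(i, dict(zip(p[i], f)))
            for i in p for f in product(S, repeat=len(p[i]))]

S = ["s", "t"]
print(len(evaluate(p, S)))    # 2^2 + 2 + 2 + 1 = 9, as y^2 + 2y + 1 predicts at #S = 2
```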

We can depict $p$ by a trimmed-down forest, Spivak calls it the corolla of $p$, where the tree roots are the elements of $p(1)$ and the tree with root $i \in p(1)$ has one branch from the root for every element in $p[i]$. The corolla of $p = y^2 + 2y + 1$ looks like



If $M$ is an $m$-dimensional manifold, then you might view its tangent bundle $TM$ set-theoretically as the 'corolla' of the polynomial functor $M y^{\mathbb{R}^m}$, the tree-roots corresponding to the points of the manifold, and the branches to the different tangent vectors at these points.

Morphisms in Poly are a bit strange. For two polynomial functors $p = (p(1), p[-])$ and $q = (q(1), q[-])$ a map $p \rightarrow q$ in Poly consists of

  • a map $\varphi_1 : p(1) \rightarrow q(1)$ on the tree-roots in the right direction, and
  • for any $i \in p(1)$ a map $q[\varphi_1(i)] \rightarrow p[i]$ on the branches in the opposite direction
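
In the finite encoding of the previous sketch such a morphism is a forward map on positions plus, position-wise, a backward map on directions. A sketch (with data of my own choosing) of a morphism $y^2 + 2y + 1 \rightarrow y + 1$, together with the induced map $p(S) \rightarrow q(S)$:

```python
q = {"j1": ["u"], "j0": []}                                   # q = y + 1

phi1 = {"i1": "j1", "i2": "j1", "i3": "j1", "i4": "j0"}       # on positions, forwards
phi_sharp = {"i1": {"u": "x"}, "i2": {"u": "x"},              # on directions, backwards:
             "i3": {"u": "x"}, "i4": {}}                      # q[phi1(i)] -> p[i]

def apply_morphism(phi1, phi_sharp, element):
    """The induced map p(S) -> q(S): (i, f) |-> (phi1(i), f composed with phi_sharp at i)."""
    i, f = element
    return (phi1[i], {d: f[phi_sharp[i][d]] for d in phi_sharp[i]})

print(apply_morphism(phi1, phi_sharp, ("i1", {"x": "s", "y": "t"})))   # ('j1', {'u': 's'})
```

Note that the root i4, having no branches, can only be sent to a root of $q$ that also has no branches, which is why the target in this sketch is $y+1$ rather than $y$.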

In our manifold/tangent bundle example, a morphism $M y^{\mathbb{R}^m} \rightarrow y^1$ sends every point $p \in M$ to the unique root of $y^1$ and the unique branch in $y^1$ picks out a unique tangent-vector for every point of $M$. That is, vector fields on $M$ are very special (smooth) morphisms $M y^{\mathbb{R}^m} \rightarrow y^1$ in Poly.

A smooth map between manifolds $M \rightarrow N$ does not determine a morphism $M y^{\mathbb{R}^m} \rightarrow N y^{\mathbb{R}^n}$ in Poly because tangent vectors are pushed forward, not pulled back.

If instead we view the cotangent bundle $T^{\ast} M$ as the corolla of the polynomial functor $M y^{\mathbb{R}^m}$, then everything works well.

But then, I promised not to use co-terminology…

Another time I hope to tell you how Poly helps us to understand the logic of learners.


Deep learning and toposes

Judging from this and that paper, deep learning is the string theory of the 2020s for geometers and representation theorists.

If you want to know quickly what neural networks really are, I can recommend the post demystifying deep learning.

The typical layout of a deep neural network has an input layer $L_0$ allowing you to feed $N_0$ numbers to the system (a vector $\vec{v_0} \in \mathbb{R}^{N_0}$), an output layer $L_p$ spitting $N_p$ numbers back (a vector $\vec{v_p} \in \mathbb{R}^{N_p}$), and $p-1$ hidden layers $L_1,\dots,L_{p-1}$ where all the magic happens. The hidden layer $L_i$ has $N_i$ virtual neurons, their states giving a vector $\vec{v_i} \in \mathbb{R}^{N_i}$.



Picture taken from Logical informations cells I

For simplicity let's assume all neurons in layer $L_i$ are wired to every neuron in layer $L_{i+1}$, the relevance of these connections given by a matrix of weights $W_i \in M_{N_{i+1} \times N_i}(\mathbb{R})$.

If at any given moment the 'state' of the neural network is described by the state-vectors $\vec{v_1},\dots,\vec{v_{p-1}}$ and the weight-matrices $W_0,\dots,W_{p-1}$, then an input $\vec{v_0}$ will typically result in new states of the neurons in layer $L_1$ given by

$$\vec{v_1}' = c_0(W_0.\vec{v_0} + \vec{v_1})$$

which will then give new states in layer $L_2$

$$\vec{v_2}' = c_1(W_1.\vec{v_1}' + \vec{v_2})$$

and so on, rippling through the network, until we get as the output

$$\vec{v_p} = c_{p-1}(W_{p-1}.\vec{v_{p-1}}')$$

where all the $c_i$ are fixed smooth activation functions $c_i : \mathbb{R}^{N_{i+1}} \rightarrow \mathbb{R}^{N_{i+1}}$.

This is just the dynamics, or forward working, of the network.
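
A NumPy sketch of this forward pass, with layer sizes and activations chosen arbitrarily; the stored state-vectors $\vec{v_i}$ act as the bias-like terms added before each activation, exactly as in the formulas above.

```python
import numpy as np

rng = np.random.default_rng(0)
sizes = [4, 5, 3, 2]                    # N_0, ..., N_p  (so p = 3 in this sketch)

# weight matrices W_i in M_{N_{i+1} x N_i}(R) and layer state-vectors v_i in R^{N_i}
W = [rng.normal(size=(sizes[i + 1], sizes[i])) for i in range(len(sizes) - 1)]
v = [rng.normal(size=n) for n in sizes]
c = [np.tanh] * (len(sizes) - 1)        # fixed smooth activation functions c_i

def forward(v0):
    """Ripple an input through the network: v_{i+1}' = c_i(W_i . v_i' + v_{i+1})."""
    out = v0
    for i in range(len(W)):
        bias = v[i + 1] if i < len(W) - 1 else 0.0   # no state added at the output layer
        out = c[i](W[i] @ out + bias)
    return out

print(forward(rng.normal(size=sizes[0])))
```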

The learning happens by comparing the computed output with the expected output, and working backwards through the network to alter slightly the state-vectors in all layers, and the weight-matrices between them. This process is called back-propagation, and involves the gradient descent procedure.

Even from this (over)simplified picture it seems doubtful that set-valued (!) toposes are suitable to describe deep neural networks, as the Paris-Huawei-topos-team claims in their recent paper Topos and Stacks of Deep Neural Networks.

Still, there is a vast generalisation of neural networks: learners, developed by Brendan Fong, David Spivak and Remy Tuyeras in their paper Backprop as Functor: A compositional perspective on supervised learning (which btw is an excellent introduction for mathematicians to neural networks).

For any two sets $A$ and $B$, a learner $A \rightarrow B$ is a tuple $(P,I,U,R)$ where

  • $P$ is a set, a parameter space of some functions from $A$ to $B$.
  • $I$ is the interpretation map $I : P \times A \rightarrow B$ describing the functions in $P$.
  • $U$ is the update map $U : P \times A \times B \rightarrow P$, part of the learning procedure. The idea is that $U(p,a,b)$ is a map which sends $a$ closer to $b$ than the map $p$ did.
  • $R$ is the request map $R : P \times A \times B \rightarrow A$, the other part of the learning procedure. The idea is that the new element $R(p,a,b) = a'$ in $A$ is such that $p(a')$ will be closer to $b$ than $p(a)$ was.

The request map is also crucial in defining the composition of two learners $A \rightarrow B$ and $B \rightarrow C$. $\mathbf{Learn}$ is the (symmetric, monoidal) category with objects all sets and morphisms equivalence classes of learners (defined in the natural way).
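
The composition hidden behind 'defined in the natural way' can be sketched as follows: feed the input forward through the first interpretation, let the second learner request a corrected intermediate value, and update both parameters with it. This is a sketch of the usual recipe, not a verbatim transcription of the paper's definition.

```python
def compose(L1, L2):
    """Compose learners L1 : A -> B and L2 : B -> C into a learner A -> C.
    Each learner is a dict with keys 'I', 'U', 'R'; composite parameters are pairs (p, q)."""
    def I(pq, a):
        p, q = pq
        return L2["I"](q, L1["I"](p, a))
    def U(pq, a, c):
        p, q = pq
        b = L1["I"](p, a)                 # current intermediate value in B
        b_req = L2["R"](q, b, c)          # what the second learner asks the first to produce
        return (L1["U"](p, a, b_req), L2["U"](q, b, c))
    def R(pq, a, c):
        p, q = pq
        b = L1["I"](p, a)
        return L1["R"](p, a, L2["R"](q, b, c))
    return {"I": I, "U": U, "R": R}
```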

In this way we can view a deep neural network with $p$ layers as before as the composition of $p$ learners
$$\mathbb{R}^{N_0} \rightarrow \mathbb{R}^{N_1} \rightarrow \mathbb{R}^{N_2} \rightarrow \dots \rightarrow \mathbb{R}^{N_p}$$
where the learner describing the transition from the $i$-th to the $(i+1)$-th layer is given by the equivalence class of data $(A_i,B_i,P_i,I_i,U_i,R_i)$ with
$$A_i = \mathbb{R}^{N_i}, \quad B_i = \mathbb{R}^{N_{i+1}}, \quad P_i = M_{N_{i+1} \times N_i}(\mathbb{R}) \times \mathbb{R}^{N_{i+1}}$$
and interpretation map, for $p = (W_i, \vec{v}_{i+1}) \in P_i$,
$$I_i(p, \vec{v_i}) = c_i(W_i.\vec{v_i} + \vec{v}_{i+1})$$
The update and request maps (encoding back-propagation and gradient-descent in this case) are explicitly given in theorem III.2 of the paper, and they behave functorially (whence the title of the paper).
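
To give the flavour of those maps without copying the theorem, here is a sketch of one layer as a learner, where update and request are one explicit gradient step on the squared error for $I(p,\vec{v}) = \tanh(W.\vec{v} + h)$. The step size $\varepsilon$ and the plain squared-error loss are my simplifications; for the exact formulas and conventions, see theorem III.2 of the paper.

```python
import numpy as np

eps = 0.1   # learning rate, an arbitrary choice for this sketch

def I(p, v):
    """Interpretation: p = (W, h), I(p, v) = c(W.v + h) with c = tanh."""
    W, h = p
    return np.tanh(W @ v + h)

def _delta(p, v, target):
    # gradient of 1/2 ||tanh(W.v + h) - target||^2 with respect to the pre-activation
    W, h = p
    z = W @ v + h
    return (np.tanh(z) - target) * (1.0 - np.tanh(z) ** 2)

def U(p, v, target):
    """Update: one gradient-descent step on the parameters (W, h)."""
    W, h = p
    d = _delta(p, v, target)
    return (W - eps * np.outer(d, v), h - eps * d)

def R(p, v, target):
    """Request: nudge the input in the direction that decreases the error."""
    W, _ = p
    return v - eps * (W.T @ _delta(p, v, target))
```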

More generally, we will now associate objects of a topos (actually just sheaves over a simple topological space) to a network of $p$ learners
$$A_0 \rightarrow A_1 \rightarrow A_2 \rightarrow \dots \rightarrow A_p$$
inspired by section I.2 of Topos and Stacks of Deep Neural Networks.

The underlying category will be the poset-category (the opposite of the ordering of the layers)
$$0 \leftarrow 1 \leftarrow 2 \leftarrow \dots \leftarrow p$$
Presheaves on a poset form a topos, and in this case it is even the topos of sheaves on the topological space with $p+1$ nested open sets
$$X = U_0 \supseteq U_1 \supseteq U_2 \supseteq \dots \supseteq U_p = \emptyset$$
If the learner $A_i \rightarrow A_{i+1}$ is (the equivalence class of) the tuple $(A_i,A_{i+1},P_i,I_i,U_i,R_i)$ we will now describe two sheaves $\mathcal{W}$ and $\mathcal{X}$ on the topological space $X$.

$\mathcal{W}$ has as sections $\Gamma(\mathcal{W},U_i) = \prod_{j=i}^{p-1} P_j$ and the obvious projection maps as the restriction maps.

$\mathcal{X}$ has as sections $\Gamma(\mathcal{X},U_i) = A_i \times \Gamma(\mathcal{W},U_i)$ and restriction map to the next smaller open
$$\rho^i_{i+1}\ :\ \Gamma(\mathcal{X},U_i) \rightarrow \Gamma(\mathcal{X},U_{i+1}) \qquad (a_i,(p_i,p')) \mapsto (p_i(a_i),p')$$
and the other restriction maps by composition.

A major result in Topos and Stacks of Deep Neural Networks is that back-propagation is a natural transformation, that is, a sheaf-morphism $\mathcal{X} \rightarrow \mathcal{X}$.

In this general setting of layered learners we can always define a map on the sections of $\mathcal{X}$ (for every open $U_i$), $\Gamma(\mathcal{X},U_i) \rightarrow \Gamma(\mathcal{X},U_i)$
$$(a_i,(p_i,p')) \mapsto (R(p_i,a_i,p_i(a_i)),(U(p_i,a_i,p_i(a_i)),p'))$$
But, in order for this to define a sheaf-morphism, compatible with the restrictions, we will in general have to impose conditions on the update and request maps of the learners.
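
Concretely (reusing the per-layer maps sketched above, and with names of my own), the restriction and the candidate back-propagation map on sections look like this; the compatibility question is whether the square formed by these two maps commutes for every open.

```python
def restrict(section, I_i):
    """rho^i_{i+1} : Gamma(X, U_i) -> Gamma(X, U_{i+1}): (a_i, (p_i, rest)) |-> (I_i(p_i, a_i), rest)."""
    a_i, (p_i, *rest) = section
    return (I_i(p_i, a_i), tuple(rest))

def backprop_map(section, I_i, U_i, R_i):
    """The candidate sheaf-morphism on Gamma(X, U_i):
    (a_i, (p_i, rest)) |-> (R_i(p_i, a_i, p_i(a_i)), (U_i(p_i, a_i, p_i(a_i)), rest))."""
    a_i, (p_i, *rest) = section
    b = I_i(p_i, a_i)                     # p_i(a_i) in the blog's shorthand
    return (R_i(p_i, a_i, b), (U_i(p_i, a_i, b), *rest))
```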

Still, in the special case of deep neural networks, this compatibility follows from the functoriality property of Backprop as Functor: A compositional perspective on supervised learning.

To be continued.
