This allowed us to view the text-snippets as points in a Lawvere pseudoquasi metric space, and to define a ‘topos’ of enriched presheaves on it, including the Yoneda-presheaves containing semantic information of the snippets.
In Obsidian, a vault is a collection of notes (with their tags and other meta-data), together with all links between them.
The vault of the language-poset will have one note for every text-snippet, and a link from note $n$ to note $m$ if $m$ is a text-fragment of $n$.
In their paper, Bradley, Terilla and Vlassopoulos use the enrichment structure where the value assigned to the link from $n$ to $m$ is the conditional probability of the fragment $m$ to be extended to the larger text $n$.
Most Obsidian vaults are a lot more complicated, possibly having oriented cycles in their internal link structure.
Still, it is always possible to turn the notes of the vault into a category enriched over $[0,1]$, in multiple ways, depending on whether we want to focus on the internal link-structure or rather on the semantic similarity between notes, or any combination of these.
Let $S$ be a set of searchable data from your vault. Elements of $S$ may be, for example, words, tags, or links occurring in your notes.
Assign a positive real number $r(s)$ to every $s \in S$. We see $r(s)$ as the ‘relevance’ we attach to the search term $s$. So, it is possible to emphasise certain key-words or tags, find certain links more important than others, and so on.
For this relevance function $r$, we have a function $R$ defined on all subsets $A$ of $S$
$$R(A) = \sum_{s \in A} r(s)$$
Take a note $n$ from the vault $V$ and let $S_n$ be the set of search terms from $S$ contained in $n$.
We can then define a (generalised) Jaccard distance for any pair of notes $n$ and $m$ in $V$:
$$d(n,m) = 1 - \frac{R(S_n \cap S_m)}{R(S_n \cup S_m)}$$
(taking $d(n,m) = 0$ when $R(S_n \cup S_m) = 0$).
This distance is symmetric, $d(n,n) = 0$ for all notes $n$, and the crucial property is that it satisfies the triangle inequality, that is, for all triples of notes $k$, $l$ and $m$ we have
$$d(k,m) \leq d(k,l) + d(l,m)$$
How does this help to make the vault $V$ into a category enriched over $[0,1]$?
The poset $([0,1],\leq)$ is the category with objects all numbers $0 \leq a \leq 1$, and a unique morphism $a \rightarrow b$ between two numbers iff $a \leq b$. This category has limits (infs) and colimits (sups), has a monoidal structure $a \otimes b = a \cdot b$ with unit object $1$, and an internal hom
$$Hom_{[0,1]}(a,b) = \begin{cases} 1 & \text{if } a \leq b \\ b/a & \text{if } a > b \end{cases}$$
We say that the vault $V$ is an enriched category over $[0,1]$ if for every pair of notes $n$ and $m$ we have a number $\mu(n,m) \in [0,1]$ satisfying
$$\mu(n,n) = 1$$
for all notes $n$, and
$$\mu(k,l) \cdot \mu(l,m) \leq \mu(k,m)$$
for all triples of notes $k$, $l$ and $m$.
Starting from any relevance function $r$ we define for every pair of notes $n$ and $m$ the distance function $d(n,m)$ satisfying the triangle inequality. If we now take
$$\mu(n,m) = e^{-d(n,m)}$$
then the triangle inequality translates for every triple of notes $k$, $l$ and $m$ into
$$\mu(k,l) \cdot \mu(l,m) = e^{-d(k,l)} \cdot e^{-d(l,m)} \leq e^{-d(k,m)} = \mu(k,m)$$
That is, every relevance function $r$ makes $V$ into a category enriched over $[0,1]$.
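To make this concrete, here is a minimal Python sketch of the whole recipe; the notes, search terms and relevance values are made up for illustration, and the names `relevance`, `terms`, `R`, `d` and `mu` are just labels I chose:

```python
import math

# Made-up searchable data with relevance values r(s) > 0.
relevance = {"#topos": 3.0, "#obsidian": 1.0, "grothendieck": 2.0, "lawvere": 2.0}

# For every note n, the set S_n of search terms from S contained in n (made-up).
terms = {
    "note1": {"#topos", "grothendieck", "lawvere"},
    "note2": {"#topos", "#obsidian", "lawvere"},
    "note3": {"#obsidian"},
}

def R(A):
    """Total relevance R(A) of a set A of search terms."""
    return sum(relevance[s] for s in A)

def d(n, m):
    """Generalised (weighted) Jaccard distance between notes n and m."""
    union = terms[n] | terms[m]
    if not union:
        return 0.0
    return 1.0 - R(terms[n] & terms[m]) / R(union)

def mu(n, m):
    """Enrichment value mu(n,m) = exp(-d(n,m)), a number in (0,1]."""
    return math.exp(-d(n, m))

# The triangle inequality d(k,m) <= d(k,l) + d(l,m) becomes, after exponentiating,
# mu(k,l) * mu(l,m) <= mu(k,m) -- exactly the enrichment condition.
for k in terms:
    for l in terms:
        for m in terms:
            assert mu(k, l) * mu(l, m) <= mu(k, m) + 1e-12
```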
Two simple relevance functions, and their corresponding distance and enrichment functions are available from Obsidian’s Graph Analysis community plugin.
To get structural information on the link-structure, take as $S$ the set of all incoming and outgoing links in your vault, with relevance function the constant function $r(s) = 1$.
‘Jaccard’ in Graph Analysis computes for the current note $n$ the value of $1 - d(n,m)$ for all notes $m$, so if this value is $j(n,m)$, then the corresponding enrichment value is $\mu(n,m) = e^{j(n,m)-1}$.
To get semantic information on the similarity between notes, let $S$ be the set of all words in all notes and take again as relevance function the constant function $r(s) = 1$.
To access ‘BoW’ (Bags of Words) in Graph Analysis, you must first install the (non-community) NLP plugin which enables various types of natural language processing in the vault. The install is best done via the BRAT plugin (perhaps I’ll do a couple of posts on Obsidian someday).
If it gives for the current note $n$ the value $b(n,m)$ for a note $m$, then again we can take as the enrichment structure $\mu(n,m) = e^{b(n,m)-1}$.
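In both cases the plugin returns a similarity score between $0$ and $1$, and the recipe above turns it into an enrichment value by exponentiating. A tiny sketch (the scores below are invented, not actual plugin output):

```python
import math

def enrichment(score):
    """Turn a similarity score in [0,1] (Jaccard or BoW) into mu = e^(score - 1)."""
    return math.exp(score - 1.0)

# Invented similarity scores of the current note against three other notes.
scores = {"noteA": 1.0, "noteB": 0.4, "noteC": 0.0}
mu = {m: enrichment(s) for m, s in scores.items()}
# A score of 1 gives mu = 1 (distance 0), a score of 0 gives mu = 1/e.
```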
Graph Analysis offers more functionality, and a good introduction is given in this clip:
Calculating the enrichment data for custom designed relevance functions takes a lot more work, but is doable. Perhaps I’ll return to this later.
Mathematically, it is probably more interesting to start with a given enrichment structure on the vault $V$, describe the category of all enriched presheaves on it, and find out what we can do with it.
In the topology of dreams we looked at Sibony’s idea to view dream-interpretations as sections in a fibered space.
The ‘points’ in the base-space and the fibers consist of chunks of text, perhaps connected by links. The topology and shape of this fibered space are still shrouded in mystery.
Let’s look at a simple approach to turn a large number of texts into a topos, and define a loose metric on it, based on the paper ‘An enriched category theory of language: from syntax to semantics’ by Tai-Danae Bradley, John Terilla and Yiannis Vlassopoulos.
You may also want to watch Tai-Danae Bradley’s Categories for AI talk: ‘Category Theory Inspired by LLMs’:
Let’s start with a collection of notes. In the paper, they consider all possible texts written in some language, but it may be a set of webpages to train a language model, or a set of recollections by someone.
Next, shred these notes into chunks of text, and point each of these to all the texts obtained by deleting some words at the start and/or end of it. For example, the note ‘a red rose’ will point to ‘a red’, ‘red rose’, ‘a’, ‘red’ and ‘rose’ (but not to ‘a rose’).
You may call this a category, to me it is just a poset $L$. The maximal elements are the individual words, the minimal elements are the notes, or websites, we started from.
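If you want to play with this poset in code, here is a small sketch (word-level shredding only, as in the ‘a red rose’ example; the function names are mine):

```python
def fragments(text):
    """All proper contiguous word-fragments of a text, obtained by deleting
    words at the start and/or end of it."""
    words = text.split()
    out = set()
    for i in range(len(words)):
        for j in range(i + 1, len(words) + 1):
            frag = " ".join(words[i:j])
            if frag != text:
                out.add(frag)
    return out

def leq(s, t):
    """s <= t in the poset L iff t is a contiguous fragment of s (or s == t)."""
    return s == t or t in fragments(s)

print(fragments("a red rose"))
# {'a red', 'red rose', 'a', 'red', 'rose'}  -- but not 'a rose'
```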
A down-set of this poset is a subset $D$ of $L$ closed under taking smaller elements, that is, if $x \in D$ and $y \leq x$, then $y \in D$.
The intersection of two down-sets is again a down-set (possibly empty), and any union of down-sets is again a down-set. That is, down-sets define a topology on our collection of text-snippets, or if you want, on language-fragments.
For example, the open determined by the word ‘red’ is the collection of all text-fragments containing this word.
The corresponding presheaf topos $\widehat{L}$ is then just the category of all (set-valued) presheaves on this topological space.
As an example, the Yoneda-presheaf $\mathbf{y}(t)$ of a text-snippet $t$ is the contravariant functor
$$\mathbf{y}(t)~:~L \rightarrow \mathbf{Sets}$$
sending any $s \leq t$ to the one-element set given by the unique map from $s$ to $t$, and if $s \not\leq t$ then we map it to the empty set $\emptyset$. If $D$ is a down-set (an open of our topological space) then the sections of $\mathbf{y}(t)$ over $D$ are $\{ \ast \}$ if for all $s \in D$ we have $s \leq t$, and $\emptyset$ otherwise.
The presheaf $\mathbf{y}(t)$ already contains some semantic information about the snippet $t$ as it gives all contexts in which $t$ appears.
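Here is a sketch of the open determined by a word, and of the sections of the Yoneda presheaf over such an open, on a small made-up list of snippets (again, the helper names are mine):

```python
def leq(s, t):
    """s <= t in the poset L: t is a contiguous word-fragment of s (or s == t)."""
    ws, wt = s.split(), t.split()
    return any(ws[i:i + len(wt)] == wt for i in range(len(ws) - len(wt) + 1))

# A made-up collection of text-snippets.
snippets = ["a red rose", "a red", "red rose", "a", "red", "rose", "red dwarf", "dwarf"]

def down_set(t):
    """The open determined by the snippet t: all snippets containing t."""
    return {s for s in snippets if leq(s, t)}

def yoneda_sections(t, D):
    """Sections of the Yoneda presheaf y(t) over the open D: a one-element set
    if every s in D satisfies s <= t, and the empty set otherwise."""
    return {"*"} if all(leq(s, t) for s in D) else set()

U_red = down_set("red")                    # all snippets containing 'red'
print(yoneda_sections("red", U_red))       # {'*'}
print(yoneda_sections("red rose", U_red))  # set(): not every snippet in U_red contains 'red rose'
```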
Perhaps interesting is that the ‘points’ of the topos are the notes we started from.
Recall that Connes and Gauthier-Lafaye want to construct a topos describing someone’s unconscious, and the points of that topos should be the connection with that person’s consciousness.
Suppose you want to unravel your unconscious. You start by writing down a large set of notes containing all relevant facts of your life. Then you construct from these notes the above collection of snippets and its corresponding presheaf topos. Clearly, you wrote your notes consciously, but probably the exact phrasing of these notes, or recurrent themes in them, or some text-combinations are ruled by your unconscious.
Ok, it’s not much, but perhaps it’s a germ of a potential approach…
Now we come to the interesting part of the paper, the ‘enrichment’ of this poset.
Surely, some of these text-snippets will occur more frequently than others. For example, in your starting notes the snippet ‘red rose’ may appear ten times more often than the snippet ‘red dwarf’, but this is not visible in the poset-structure. So how can we bring in this extra information?
If we have two text-snippets $s$ and $t$ with $s \leq t$, that is, $t$ is a connected sub-string of $s$, we can compute the conditional probability $\pi(s|t)$ which tells us how likely it is that if we spot an occurrence of $t$ in our starting notes, it is part of the larger sentence $s$. These numbers can be easily computed and from the rules of probability we get that for snippets $s \leq t \leq u$ we have that
$$\pi(s|u) = \pi(s|t) \times \pi(t|u)$$
so these numbers (all between $0$ and $1$) behave multiplicatively along paths in the poset.
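This is not how the paper estimates these probabilities (they point to language models for that, see the quote below), but as a naive illustration one can simply count occurrences in a made-up corpus; with $\pi(s|t) = \#s/\#t$ the multiplicativity along paths even holds on the nose:

```python
def count(snippet, notes):
    """Number of word-level occurrences of `snippet` across the starting notes."""
    target = snippet.split()
    total = 0
    for note in notes:
        words = note.split()
        for i in range(len(words) - len(target) + 1):
            if words[i:i + len(target)] == target:
                total += 1
    return total

def pi(s, t, notes):
    """Naive estimate of the probability that an occurrence of the fragment t
    is part of an occurrence of the larger snippet s (t must be a fragment of s)."""
    return count(s, notes) / count(t, notes)

notes = ["a red rose", "a red rose", "a red dwarf"]   # made-up corpus
print(pi("a red rose", "red", notes))                 # 2/3

# Multiplicativity along paths: pi(s|u) = pi(s|t) * pi(t|u).
assert abs(pi("a red rose", "red", notes)
           - pi("a red rose", "a red", notes) * pi("a red", "red", notes)) < 1e-12
```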
Nice in theory, but it requires an awful lot of computation. From the paper:
The reader might think of these probabilities as being most well defined when $y$ is a short extension of $x$. While one may be skeptical about assigning a probability distribution on the set of all possible texts, it’s reasonable to say there is a nonzero probability that ‘cat food’ will follow ‘I am going to the store to buy a can of’ and, practically speaking, that probability can be estimated.
Indeed, existing LLMs successfully learn these conditional probabilities using standard machine learning tools trained on large corpora of texts, which may be viewed as providing a wealth of samples drawn from these conditional probability distributions.
It may be easier to have an estimate of this conditional probability for immediate successors (that is, if $s$ is obtained from $t$ by adding one word at the beginning or end of it), and then extend this measure to all arrows in the poset by taking the maximum of products along paths. In this way we have for all $s \leq t \leq u$ that
$$\mu(s|u) \geq \mu(s|t) \times \mu(t|u)$$
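A sketch of this extension step, with invented one-word-extension estimates (the `step` values and helper names are mine): the maximum of products along paths can be computed recursively.

```python
from functools import lru_cache

# Invented estimates for immediate successors: step[(s, t)] is the probability
# that an occurrence of t is part of s, where s adds exactly one word to t.
step = {
    ("a red", "red"): 0.5,
    ("red rose", "red"): 0.3,
    ("a red rose", "a red"): 0.4,
    ("a red rose", "red rose"): 0.6,
}

def one_word_smaller(s):
    """The snippets obtained from s by deleting its first or its last word."""
    words = s.split()
    return {" ".join(words[1:]), " ".join(words[:-1])} if len(words) > 1 else set()

def contains(s, t):
    """True if t is a contiguous word-fragment of s (or equal to it)."""
    ws, wt = s.split(), t.split()
    return any(ws[i:i + len(wt)] == wt for i in range(len(ws) - len(wt) + 1))

@lru_cache(maxsize=None)
def mu(s, t):
    """Extend the one-step estimates to all pairs by taking the maximum of
    products of one-step values along paths from t up to the larger snippet s."""
    if s == t:
        return 1.0
    best = 0.0
    for m in one_word_smaller(s):
        if (s, m) in step and contains(m, t):
            best = max(best, step[(s, m)] * mu(m, t))
    return best

print(mu("a red rose", "red"))   # max(0.4 * 0.5, 0.6 * 0.3) = 0.2
```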
The upshot is that this measure turns our poset (or category) into a category ‘enriched’ over the unit interval (suitably made into a monoidal category).
I’ll spare you the details, I just want to flesh out the corresponding notion of ‘enriched presheaves’, which are the objects of the semantic category in the paper, the enriched version of the presheaf category $\widehat{L}$.
An enriched presheaf is a function (not functor)
$$F~:~L \rightarrow [0,1]$$
satisfying the condition that for all text-snippets $s \leq t$ we have that
$$\mu(s|t) \times F(t) \leq F(s)$$
Note that the enriched (or semantic) Yoneda presheaf $\mathbf{y}(t)(s) = \mu(s|t)$ satisfies this condition, and now this data not only records the contexts in which $t$ appears, but also measures how likely it is for $t$ to appear in a certain context.
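As a small sanity check, here is the enriched presheaf condition in code, on a handful of made-up enrichment values; the Yoneda presheaf of ‘red’ passes the test:

```python
# Made-up enrichment values mu[(s, t)] for pairs s <= t (t a fragment of s).
mu = {
    ("a red rose", "a red rose"): 1.0,
    ("a red rose", "red rose"): 0.6,
    ("a red rose", "red"): 0.2,
    ("red rose", "red rose"): 1.0,
    ("red rose", "red"): 0.3,
    ("red", "red"): 1.0,
}

def is_enriched_presheaf(F):
    """Check the condition mu(s|t) * F(t) <= F(s) for all comparable pairs s <= t."""
    return all(m * F[t] <= F[s] + 1e-12 for (s, t), m in mu.items())

# The enriched Yoneda presheaf of 'red': F(s) = mu(s | 'red').
yoneda_red = {s: mu.get((s, "red"), 0.0) for s in ["a red rose", "red rose", "red"]}
print(is_enriched_presheaf(yoneda_red))   # True
```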
Another cute application of the condition on the measure is that it allows us to define a ‘distance function’ (satisfying the triangle inequality) on all text-snippets in $L$ by
$$d(s,t) = -\log \mu(s|t)$$
(with $d(s,t) = \infty$ when $\mu(s|t) = 0$).
So, the higher $\mu(s|t)$, the smaller the distance $d(s,t)$, and now the snippet $t$ (for example ‘red’) not only defines the open set in $L$ of all texts containing $t$, but we can also structure the snippets in this open set with respect to this ‘distance’.
In this way we can turn any language, or a collection of texts in a given language, into what Lawvere called a ‘generalized metric space’.
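In code, on the same kind of made-up enrichment values as before, the distance and the resulting ordering of the snippets in the open set of ‘red’ look like this:

```python
import math

# Made-up enrichment values mu[(s, t)] for s <= t.
mu = {
    ("red", "red"): 1.0,
    ("red rose", "red"): 0.3,
    ("a red rose", "red"): 0.2,
}

def distance(s, t):
    """d(s,t) = -log mu(s|t); infinite when mu(s|t) = 0, i.e. when s does not contain t."""
    m = mu.get((s, t), 0.0)
    return math.inf if m == 0.0 else -math.log(m)

# Rank the snippets in the open set determined by 'red' by their distance to 'red'.
open_red = [s for (s, t) in mu if t == "red"]
for s in sorted(open_red, key=lambda s: distance(s, "red")):
    print(s, round(distance(s, "red"), 3))
# red 0.0
# red rose 1.204
# a red rose 1.609
```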
It looks as if we are progressing slowly in our, probably futile, attempt to understand Alain Connes’ and Patrick Gauthier-Lafaye’s claim that ‘the unconscious is structured like a topos’.
Even if we accept the fact that we can start from a collection of notes, there are a number of changes we need to make to the above approach:
there will be contextual links between these notes
we only want to retain the relevant snippets, not all of them
between these ‘highlights’ there may also be contextual links
texts can be related without having to be concatenations
we need to implement changes when new notes are added
… (much more)
Perhaps, we should try to work on a specific ‘case’, and explore all technical tools that may help us to make progress.
Olivia Caramello, who also contributed to the seminar, posts on her blog Around Toposes that the proceedings of this lecture series are now available from the SMF.
Olivia’s blogpost also links to the YouTube channel of the seminar. Several of these talks are well worth watching.
In 1973, Grothendieck gave three lecture series at the Department of Mathematics of SUNY at Buffalo, the first on ‘Algebraic Geometry’, the second on ‘The Theory of Algebraic Groups’ and the third one on ‘Topos Theory’.
This MathOverflow (soft) question links to this page stating:
“The copyright of all these recordings is that of the Department of Mathematics of SUNY at Buffalo to whose representatives, in particular Professors Emeritus Jack DUSKIN and Bill LAWVERE exceptional thanks are due both for the preservation and transmission of this historic archive, the only substantial archive of recordings of courses given by one of the greatest mathematicians of all time, whose work and ideas exercised arguably the most profound influence of any individual figure in shaping the mathematics of the second half od the 20th Century. The material which it is proposed to make available here, with their agreement, will form a mirror site to the principal site entitled “Grothendieck at Buffalo” (url: ).”
Sadly, the URL is still missing.
Fortunately, another answer links to the Grothendieck project Thèmes pour une Harmonie by Mateo Carmona. If you scroll down to the 1973-section, you’ll find all the recordings of these three Grothendieck lecture series!
To whet your appetite, here’s the first part of his talk on topos theory on April 4th, 1973:
For all subsequent recordings of his talks in the Topos Theory series on May 11th, May 18th, May 25th, May 30th, June 4th, June 6th, June 20th, June 27th, July 2nd, July 10th, July 11th and July 12th, please consult Mateo’s website (under section 1973).