doc/internals.tex


1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163

\documentclass{article}

\setlength{\textwidth}{6.75in} % 1 inch side margins
\setlength{\textheight}{9in} % ~1 inch top and bottom margins

\setlength{\headheight}{0in}
\setlength{\topmargin}{0in}
\setlength{\headsep}{0in}

\setlength{\oddsidemargin}{0in}
\setlength{\evensidemargin}{0in}

\title{Botan Internals}
\author{Jack Lloyd (lloyd@randombit.net)}
\date{August 20, 2006}

\newcommand{\filename}[1]{\texttt{#1}}
\newcommand{\manpage}[2]{\texttt{#1}(#2)}

\newcommand{\function}[1]{\textbf{#1}}
\newcommand{\type}[1]{\texttt{#1}}
\renewcommand{\arg}[1]{\textsl{#1}}

\begin{document}

\maketitle

\tableofcontents

\parskip=5pt

\section{Introduction}

This document is intended to document some of the trickier and/or more
complicated parts of Botan. This is not going to be terribly useful if
you just want to use the library, but for people wishing to understand
how it works, or contribute new code to it, it will hopefully prove
helpful.

I've realized that a lot of things Botan does internally are pretty
hard to understand, and that a lot of things are only inside my head,
which is a bad place for them to be (things tend to get lost in there,
not to mention the possibility that I'll get hit by a truck next
week).

This document is currently very incomplete. I'll be working on it as I
have time.

\pagebreak

\section{Filter}

Filter is one of the core abstractions of the library. It represents

\section{Pipe}

Pipe is, conceptually, a tree structure of Filter objects. There is a
single unique top, and an arbitrary number of leaves (which are
SecureQueue objects).  SecureQueue is a simple Filter that buffers its
input.

Writing into the pipe writes into the top of the tree. The filter at
the top of the tree writes its output into the next Filter, and so on
until eventually data trickles down into the bottommost Filters, where
the data is stored for later retrieval.

When a new message is started, Pipe searches through the tree of
Filters and finds places where the \arg{next} field of the Filter is
NULL. This implies that it was the lowest layer of the Filter tree
that the user added. It then adds SecureQueue objects onto these
Filters. These queues are also stored in a \type{std::vector} called
\arg{messages}. This is how the Pipe knows how to read from them later
without doing a tree traversal every time.

Pipe will, if asked, destroy the existing tree structure, in order to
create a new one. However, the queue objects are not deleted, because
Pipe might need to read from them later.

An optimization in future versions will involve deleting empty queues
that we ``know'' can't be written to, and then replace their field in
\arg{messages} with NULL. On reading, Pipe will know that this means
that the queue is empty, and act as if such a queue was really
there. This is relatively minor, because in recent versions an empty
queue only takes up a few dozen bytes (previous to 0.8.4 or so, an
empty queue still took up 4 kilobytes of memory).

\section{Library Initialization}

A lot of messy corner cases.

\section{Lookup Mechanism}

Most objects know their name, and they know how to create a new copy
of themselves. We build mapping tables that map from an algorithm name
into a single instance of that algorithm. The tables themselves can be
found in \filename{src/lookup.cpp}.

There are a set of functions named \function{add\_algorithm} that can
be used to populate the tables. We get something out of the table with
\function{retrieve\_x}, where x is the name of a type
(\texttt{block\_cipher}, \texttt{hash}, etc). This returns a const
pointer to the single unique instance of the algorithm that the lookup
tables know about. If it doesn't know about it, it falls back on
calling a function called \function{try\_to\_get\_x}. These functions
live in \filename{src/algolist.cpp}. They are mostly used to handle
algorithms which need (or at least can have) arguments passed to them,
like \type{HMAC} and \type{SAFER\_SK}. It will return NULL if it can't
find the algorithm at all.

When it's asked for an algorithm it doesn't know about (ie, isn't in
the mapping tables), the retrieval functions will ask the try-to-get
functions if \emph{they} know about it. If they do, then the object
returned will be stored into the table for later retrieval.

The functions \function{get\_x} call the retrieval functions. If we
get back NULL, an exception is thrown. Otherwise it will call the
\function{clone} method to get a new copy of the algorithm, which it
returns.

The various functions like \function{output\_length\_of} call the
retrieval function for each type of object that the parameter in
question (in this case, \texttt{OUTPUT\_LENGTH}) might be meaningful
for. If it manages to get back an object, it will return (in this
case) the \texttt{OUTPUT\_LENGTH} field of the object. No allocations
are required to call this function: all of its operations work
directly on the copies living in the lookup tables.

\section{Allocators}

A big (slow) mess.

\section{BigInt}

Read ``Handbook of Applied Cryptography''.

\section{PEM/BER Identification}

We have a specific algorithm for figuring out if something is PEM or
BER. Previous versions (everything before 1.3.0) requried that the
caller specify which one it was, and they had to be right. Now we use
a hueristic (aka, an algorithm that sometimes doesn't work right) to
figure it out. If the first character is not 0x30 (equal to ASCII
'0'), then it can't possibly be BER (because everything we care about
is enclosed in an ASN.1 SEQUENCE, which for BER/DER is encoded as
beginning with 0x30). Roughly 99.9% of PEM blocks \emph{won't} have a
random 0 character in front of them, so we are mostly safe (unless
someone does it on purpose, in which case, please hit them for me).
But to be sure, if there is a 0, then we search the first \emph{N}
bytes of the block for the string ``-----BEGIN ``, which marks the
typical start of a PEM block. The specific \emph{N} depends on the
variable ``base/pem\_search'', which defaults to 4 kilobytes.

So, you can actually fool it either way: that a PEM file is really
BER, or that a BER file is actually PEM. To fool it that a BER file is
PEM, just have the string ``-----BEGIN `` somewhere (I can't imagine
this string shows up in certificates or CRLs too often, so if it is
there it means somebody is being a jerk). If a file starts with 0 and
has at least ``base/pem\_search'' byte more junk in the way, it won't
notice that its PEM at all. In either case, of course, the loading
will fail, and you'll get a nice exception saying that the decoding
failed.

\end{document}