---
GLEP: 44
Title: Manifest2 format
Author: Marius Mauch <genone@gentoo.org>
Type: Standards Track
Status: Replaced
Version: 1
Created: 2005-12-04
Last-Modified: 2022-07-08
Post-History: 2005-12-06, 2006-01-23, 2006-09-03
Content-Type: text/x-rst
Replaced-By: 74
---


Abstract
========

This GLEP proposes a new format for the Portage Manifest and digest file system
by unifying both filetypes into one to improve functional and non-functional
aspects of the Portage Tree.


Motivation
==========

Please see [#reorg-thread]_ for a general overview.
The main long term goals of this proposal are to:

- Remove the tiny digest files from the tree. They are a major annoyance: on a
  typical configuration they waste a lot of disk space, and merely transmitting
  the names of all digest files during an ``emerge --sync`` needs a substantial
  amount of bandwidth.
- Reduce redundancy when multiple hash functions are used
- Remove potential for checksum collisions if a file is recorded in more than one
  digest file
- Differentiate between filetypes for a more flexible verification system


Specification
=============

The new Manifest format would change the existing format in the following ways:

- Addition of a filetype specifier, currently planned are

  * ``AUX`` for files directly used by ebuilds (e.g. patches or initscripts),
    located in the ``files/`` subdirectory

  * ``EBUILD`` for all ebuilds

  * ``MISC`` for files not directly used by ebuilds like ``ChangeLog`` or
    ``metadata.xml`` files

  * ``DIST`` for release tarballs recorded in the ``SRC_URI`` variable of an ebuild;
    these were previously recorded in the digest files

  Future portage improvements might extend this list (for example with types
  relevant for eclasses or profiles).

- Only have one line per file listing all information instead of one line per 
  file and checksum type

- Remove the separate digest-* files in the ``files/`` subdirectory

Each entry in the new format has the following layout:

::

	<filetype> <filename> <filesize> <chksumtype1> <chksum1> ... <chksumtypen> <chksumn>


However, these entries will be stored in the existing Manifest files.
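
The layout above can be read with a few lines of Python. This is only a
minimal sketch, not the portage implementation: the names ``Manifest2Entry``
and ``parse_manifest2_line`` are hypothetical, and it assumes that filenames
contain no whitespace and that every entry carries at least one checksum.

```python
from typing import NamedTuple

FILETYPES = {"AUX", "EBUILD", "MISC", "DIST"}  # types listed in this GLEP


class Manifest2Entry(NamedTuple):
    filetype: str
    filename: str
    size: int
    checksums: dict  # maps checksum type -> checksum string


def parse_manifest2_line(line: str) -> Manifest2Entry:
    """Split one Manifest2 line into its fields (hypothetical helper)."""
    fields = line.split()
    # Need type, name, size and at least one <type> <checksum> pair;
    # everything after the size must come in pairs.
    if len(fields) < 5 or (len(fields) - 3) % 2 != 0:
        raise ValueError(f"malformed Manifest2 line: {line!r}")
    filetype, filename, size = fields[0], fields[1], int(fields[2])
    if filetype not in FILETYPES:
        raise ValueError(f"unknown filetype: {filetype}")
    checksums = dict(zip(fields[3::2], fields[4::2]))
    return Manifest2Entry(filetype, filename, size, checksums)
```

For example, ``parse_manifest2_line("DIST foo-1.0.tar.gz 1024 SHA1 abc")``
yields an entry of type ``DIST`` with size 1024 and a single ``SHA1`` checksum
(the checksum value here is a placeholder, not a real digest).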

An `actual example`__ for a (pure) Manifest2 file.

.. __: glep-0044-extras/manifest2-example.txt


Compatibility Entries
---------------------

To maintain compatibility with existing portage versions, a transition period
is required after the introduction of the Manifest2 format. During this period
portage will not only have to be capable of using existing Manifest and digest
files but also generate them in addition to the new entries.
Fortunately this can be accomplished by simply mixing old and new style entries
in one file for the Manifest files; existing portage versions will simply ignore
the new style entries. For the digest files there are no new entries to care
about.
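
A mixed file can be separated by looking at the first token of each line.
The sketch below assumes that old-style entries start with a checksum type
(for example ``MD5 <chksum> <filename> <filesize>``), which is not spelled out
in this GLEP, and that new-style entries start with one of the filetypes
listed above. The function name ``split_mixed_manifest`` is hypothetical.

```python
MANIFEST2_TYPES = {"AUX", "EBUILD", "MISC", "DIST"}


def split_mixed_manifest(lines):
    """Separate Manifest2 entries from all other lines by their first token."""
    new_style, old_style = [], []
    for line in lines:
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        if fields[0] in MANIFEST2_TYPES:
            new_style.append(line)
        else:
            # Assumed to be an old-style entry; a real implementation
            # would also have to handle signature blocks.
            old_style.append(line)
    return new_style, old_style
```

A Manifest2 capable portage version would read the first list, while the
second list corresponds to what existing versions continue to use.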

Scope
-----

It is important to note that this proposal only deals with a change of the
format of the digest and Manifest system.

It does not expand the scope of that system to cover eclasses, profiles or
anything else not already covered by it. It also doesn't affect
the Manifest signing efforts in any way (though the implementations of both
might be coupled).

Also, while multiple hash functions will become standard with the proposed
implementation, they are not a specific feature of this format [#multi-hash-thread]_.

Number of hashes
----------------

While using multiple hashes for each file is a major feature of this proposal,
we have to make sure that the number of hashes listed is limited to avoid
an explosion of the Manifest size that would negate the main benefit of this proposal
(reducing tree size). Therefore the number of hashes that will be generated
will be limited to three different hash functions. For compatibility, though, we
have to rely on at least one hash function always being present. This proposal
suggests using SHA1 for this purpose: it is supposed to be more secure than MD5,
only SHA1 and MD5 are currently directly available in python, and MD5 doesn't
have any benefit in terms of compatibility.
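
The two constraints of this section (at most three hashes, SHA1 always
present) can be sketched as follows. The function ``make_entry`` and the
default choice of SHA256 and SHA512 as additional hashes are assumptions made
for illustration only; the GLEP does not name the other hash functions.

```python
import hashlib

MAX_HASHES = 3          # limit proposed in this section
REQUIRED_HASH = "SHA1"  # hash suggested to be always present


def make_entry(filetype, filename, data,
               hash_names=("SHA1", "SHA256", "SHA512")):
    """Build one Manifest2 line for the given file content (bytes)."""
    if len(hash_names) > MAX_HASHES:
        raise ValueError("too many hash functions requested")
    if REQUIRED_HASH not in hash_names:
        raise ValueError(f"{REQUIRED_HASH} must always be present")
    parts = [filetype, filename, str(len(data))]
    for name in hash_names:
        # hashlib uses lowercase algorithm names such as "sha1".
        digest = hashlib.new(name.lower(), data).hexdigest()
        parts += [name, digest]
    return " ".join(parts)
```

Calling ``make_entry("AUX", "fix.patch", data)`` with the content of a patch
returns a single line holding the size and all three checksums.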

Rationale
=========

The main goals of the proposal have been listed in the `Motivation`_ section.
This section explains why they are improvements and how the proposed format
will accomplish them.

Removal of digest files
-----------------------

Normal users that don't use a "tuned" filesystem for the portage tree are
wasting several dozen to a few hundred megabytes of disk space with the current
system, largely caused by the digest files.
This is due to the overhead present in most filesystems, which
have a standard blocksize of four kilobytes while most digest files are under
one kilobyte in size, resulting in a waste of approximately three kilobytes
per digest file (likely even more). At the time of this writing the tree contains
roughly 22,000 digest files, so the overall waste caused by digest files is
estimated at about 70-100 megabytes.
Furthermore it is assumed that this will also reduce the disk space wasted by
the Manifest files as they now contain more content, but this hasn't been
verified yet.
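
The estimate can be reproduced with a short calculation. The sketch below
assumes that every file occupies whole blocks and that the average digest
file is 1000 bytes; both are simplifications, and smaller digest files or
additional per-file metadata would push the result towards the upper end of
the range given above.

```python
import math

BLOCK_SIZE = 4096      # typical filesystem block size cited above
DIGEST_FILES = 22_000  # approximate number of digest files cited above


def slack_bytes(file_size, block_size=BLOCK_SIZE):
    """Bytes allocated but unused when a file occupies whole blocks."""
    blocks = math.ceil(file_size / block_size)
    return blocks * block_size - file_size


# Assumed average digest size of 1000 bytes ("under one kilobyte").
waste = DIGEST_FILES * slack_bytes(1000)
print(f"{waste / 1e6:.1f} MB")  # prints "68.1 MB"
```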

By unifying the digest files with the Manifest these tiny files are eliminated
(in the long run), reducing the apparent tree size by about 20%, benefiting
both users and the Gentoo infrastructure.

Reducing redundancy
-------------------

When multiple hashes are used with the current system,
both the filename and filesize are repeated for every checksum type used, as each
checksum is standalone. This repetition doesn't add any functionality,
so the new format removes it.
This is a theoretical improvement at the moment as only one hash function is in
use, but that is expected to change soon (see [#multi-hash-thread]_).
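
The difference is easy to see when the same file is written in both styles.
In this sketch the old-style layout ``<chksumtype> <chksum> <filename>
<filesize>`` is an assumption, the helper names are hypothetical and the
digests are placeholders.

```python
def old_style_lines(filename, size, checksums):
    """One line per checksum type (assumed old layout)."""
    return [f"{kind} {digest} {filename} {size}"
            for kind, digest in checksums.items()]


def manifest2_line(filetype, filename, size, checksums):
    """One line per file, listing every checksum."""
    pairs = " ".join(f"{kind} {digest}" for kind, digest in checksums.items())
    return f"{filetype} {filename} {size} {pairs}"


checksums = {"SHA1": "a" * 40, "MD5": "b" * 32}  # placeholder digests
old = "\n".join(old_style_lines("foo-1.0.tar.gz", 1024, checksums))
new = manifest2_line("DIST", "foo-1.0.tar.gz", 1024, checksums)
# The filename appears once per checksum in the old style, once in the new.
print(old.count("foo-1.0.tar.gz"), new.count("foo-1.0.tar.gz"))  # prints "2 1"
```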

Removal of checksum collisions
------------------------------

The current system theoretically allows for a ``DIST`` type file to be recorded 
in multiple digest files with different sizes and/or checksums. In such a case
one version of a package would report a checksum violation while another one 
would not. This could create confusion and uncertainty among users.
So far this case hasn't been observed, but it can't be ruled out with the
existing system.
As the new format lists each file exactly once, this would no longer be possible.
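
A check for this situation in the existing system could look like the sketch
below. It assumes digest lines of the form ``<chksumtype> <chksum> <filename>
<filesize>``; the function ``find_conflicts`` and the sample data are
hypothetical.

```python
from collections import defaultdict


def find_conflicts(digest_files):
    """Return names of files recorded with differing checksums or sizes.

    digest_files maps a digest file name to the list of its lines.
    """
    seen = defaultdict(set)  # (filename, checksum type) -> {(checksum, size)}
    for lines in digest_files.values():
        for line in lines:
            kind, checksum, filename, size = line.split()
            seen[(filename, kind)].add((checksum, int(size)))
    return sorted({name for (name, _kind), records in seen.items()
                   if len(records) > 1})


digests = {
    "digest-foo-1.0": ["MD5 aaa foo-src.tar.gz 100"],
    "digest-foo-1.0-r1": ["MD5 bbb foo-src.tar.gz 100"],
    "digest-foo-2.0": ["MD5 ccc foo-2.tar.gz 200"],
}
print(find_conflicts(digests))  # prints "['foo-src.tar.gz']"
```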

Flexible verification system
----------------------------

Right now portage verifies the checksum of every file listed in the Manifest
before using any file of the package and all ``DIST`` files of an ebuild 
before using that ebuild. This is unnecessary in many cases:

- During the "depend" phase (when the ebuild metadata is generated) only 
  files of type ``EBUILD`` are used, so verifying the other types isn't
  necessary. Theoretically it is possible for an ebuild to include other
  files like those of type ``AUX`` at this phase, but that would be a
  major QA violation and should never occur, so it can be ignored here.
  It is also not a security concern as the ebuild is verified before parsing
  it, so any manipulation would show up.

- Generally files of type ``MISC`` don't need to be verified as they are 
  only used in very specific situations, aren't executed (just parsed at most)
  and don't affect the package build process.

- Files of type ``DIST`` only need to be verified directly after fetching and
  before unpacking them (which often will be one step), not every time their 
  associated ebuild is used.
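
The cases above amount to a mapping from the current operation to the
filetypes that need verification. The following is a sketch under
assumptions: only the "depend" phase is named in this GLEP, so the phase names
``fetch`` and ``build`` as well as the table ``TYPES_TO_VERIFY`` are
hypothetical, and ``MISC`` is left out because it generally doesn't need to be
verified.

```python
# Hypothetical phase names; only "depend" is named in the text above.
TYPES_TO_VERIFY = {
    "depend": {"EBUILD"},        # metadata generation only reads ebuilds
    "fetch": {"DIST"},           # verify tarballs right after fetching
    "build": {"EBUILD", "AUX"},  # assumed: ebuild plus its patches/initscripts
}


def entries_to_verify(phase, entries):
    """Select the (filetype, filename) entries relevant for a phase."""
    wanted = TYPES_TO_VERIFY.get(phase, set())
    return [entry for entry in entries if entry[0] in wanted]


entries = [
    ("EBUILD", "foo-1.0.ebuild"),
    ("AUX", "fix.patch"),
    ("MISC", "ChangeLog"),
    ("DIST", "foo-1.0.tar.gz"),
]
print(entries_to_verify("depend", entries))
# prints "[('EBUILD', 'foo-1.0.ebuild')]"
```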


Backwards Compatibility
=======================

Switching the Manifest system is a task that will need a long transition period 
like most changes affecting both portage and the tree. In this case the 
implementation will be rolled out in several phases:

1. Add support for verification of Manifest2 entries in portage

2. Enable generation of Manifest2 entries in addition to the current system

3. Ignore digests during ``emerge --sync`` to get the size-benefit clientside. 
   This step may be omitted if the following steps are expected to follow soon.

4. Disable generation of entries for the current system

5. Remove all traces of the current system from the tree (serverside)

Each step has its own issues. While 1) and 2) can be implemented without any
compatibility problems, all later steps have a major impact:

- Step 3) can only be implemented when the whole tree is Manifest2 ready.
  That is the ideal; in practice the requirement will be more like 95% coverage,
  with the expectation that for the remaining 5% either bugs will be filed after
  step 3) is completed or they'll be updated at step 5).

- Steps 4) and 5) will render all portage versions without Manifest2 support
  basically useless (users would have to regenerate the digest and Manifest
  for each package before being able to merge it), so this requires an almost
  100% coverage of the userbase with Manifest2 capable portage versions
  (with step 1) completely implemented).

Another problem is that some steps affect different targets:

- Steps 1) and 3) target portage versions used by users

- Steps 2) and 4) target portage versions used by devs

- Step 5) targets the portage tree on the cvs server

While it is relatively easy to get all devs to use a new portage version, this is
practically impossible with users as some don't update their systems regularly.
While six months are probably sufficient to reach 95% coverage, one year is
estimated to be needed to reach almost-complete coverage. All times are relative
to the stable-marking of a compatible portage version.

No timeframe for implementation is presented here as it is highly dependent
on the completion of each step.

In summary, a full conversion will take over a year to complete due to the
compatibility issues mentioned above, but some benefits of the
system can be used selectively as soon as step 2) is completed.


Other problems
==============

Impacts on infrastructure
-------------------------

While one long term goal of this proposal is to reduce the size of the tree
and therefore make life for the Gentoo Infrastructure easier, this will only
take effect once the implementation is rolled out completely. In the meantime,
however, it will increase the tree size due to keeping checksums in both formats.
It's not possible to give a usable estimate of the degree of the increase as
it depends on many variables such as the exact implementation timeframe,
propagation of Manifest2 capable portage versions among devs or the update
rate of the tree. It has been suggested that Manifest files that are not gpg
signed could be mass converted in one step. This could certainly help, but only
to some degree (according to recent research [#gpg-numbers]_ about 40% of
all Manifests in the tree are signed, but this number hasn't been verified).


Reference Implementation
========================

A patch for a prototype implementation of Manifest2 verification and partial
generation has been posted at [#manifest2-patch]_. It will be reworked before
being considered for inclusion in portage. However, it shows that adding support
for verification is quite simple, while generation is a bit tricky and will
therefore be implemented later.


Options
=======

Some things have been considered for this GLEP but aren't part of the proposal 
yet for various reasons:

- timestamp field: the author has considered adding a timestamp field for 
  each entry to list the time the entry was created. However so far no practical
  use for such a feature has been found.

- convert size field into checksum: Another idea was to treat the size field
  like any other checksum. But so far no real benefit (other than a slightly
  more modular implementation) for this has been seen while it has several
  drawbacks: For one, unlike checksums, the size field is definitely required
  for all ``DIST`` files. Also, it would slightly increase the length of
  each entry by adding a ``SIZE`` keyword.

- removal of the ``MISC`` type: It has been suggested to completely drop
  entries of type ``MISC``. This would result in a minor space reduction
  (it's rather unlikely to free any blocks) but completely remove the ability
  to check these files for integrity. While they don't influence portage
  or packages directly, they can contain valuable information for users, so
  the author is of the opinion that at least the option for integrity checks
  should be kept.

Credits
=======

Thanks to the following persons for their input on or related to this GLEP
(even though they might not have known it):
Ned Ludd (solar), Brian Harring (ferringb), Jason Stubbs (jstubbs),
Robin H. Johnson (robbat2), Aron Griffis (agriffis)

Also thanks to Nicholas Jones (carpaski) for making the current Manifest system
resilient enough to handle this change without too many transition
problems.

References
==========

.. [#reorg-thread] Marius Mauch. "Digest reorganization and enhancements".
   gentoo-dev mailing list, 2004-10-08,
   Message-ID 20041008184325.316fc227\@andy.genone.homeip.net,
   https://archives.gentoo.org/gentoo-dev/message/4afc65da379b8570a4cec654976da862

.. [#multi-hash-thread] Marius Mauch. "Multi hash support in portage - status".
   gentoo-dev mailing list, 2005-11-24,
   Message-ID 20051124010432.33eecead\@sven.genone.homeip.net,
   https://archives.gentoo.org/gentoo-dev/message/f97ff5732872ffe44ef05627b7a19cc1

.. [#gpg-numbers] Robin H. Johnson. "Gentoo key signing practices and official
   Gentoo keyring". gentoo-core mailing list, 2005-11-17,
   Message-ID 20051117075838.GB15734\@curie-int.vc.shawcable.net

.. [#manifest2-patch] https://archives.gentoo.org/gentoo-portage-dev/message/f2b5be6629510343bd50418429912b1d

Copyright
=========

This work is licensed under the Creative Commons Attribution-ShareAlike 3.0
Unported License.  To view a copy of this license, visit
https://creativecommons.org/licenses/by-sa/3.0/.