author: Michał Górny <mgorny@gentoo.org>, 2018-02-07 14:20:45 +0100
committer: Michał Górny <mgorny@gentoo.org>, 2018-02-07 14:22:08 +0100
commit: e4dc2627c8107339b13e20709125e2d9fc91ffde
tree: 91db033ddfee0185c390017e03c57b14f6252c71 /glep-0075.rst
parent: glep-0075: Clarify structure key description
glep-0075: Extend rationale for splitting algorithm
Extend and refactor the rationale for splitting algorithm. Explicitly state the goals, list all the options that occurred during the ml discussion.
Diffstat (limited to 'glep-0075.rst')
-rw-r--r-- glep-0075.rst 116
1 files changed, 91 insertions, 25 deletions
diff --git a/glep-0075.rst b/glep-0075.rst
index 157514e..00d14c3 100644
--- a/glep-0075.rst
+++ b/glep-0075.rst
@@ -187,43 +187,98 @@ Rationale
=========
Algorithm for splitting distfiles
---------------------------------
-In the original debate that occurred in bug #534528 [#BUG534528]_,
-three possible solutions for splitting distfiles were listed:
+The possible algorithms were considered with the following goals
+in mind:
-a. using initial portion of filename,
+- the number of files in a single directory should not exceed 1000,
-b. using initial portion of file hash,
+- the total size of files in a single directory is not considered
+ relevant,
-c. using initial portion of filename hash.
+- the solution should preferably be future-proof,
-The significant advantage of the filename option was simplicity. With
-that solution, the users could easily determine the correct subdirectory
-themselves. However, it's significant disadvantage was very uneven
-shuffling of data. In particular, the TeΧ Live packages alone count
-almost 23500 distfiles and all use a common prefix, making it impossible
-to split them further.
+- moving distfiles should be avoided once it is deployed.
-The alternate option of using file hash has the advantage of having
-a more balanced split. Furthermore, since hashes are stored
-in Manifests using them is zero-cost. However, this solution has three
-significant disadvantages:
+It should also be noted that at the time of writing, the package with
+the most distfiles in Gentoo is dev-texlive/texlive-latexextra,
+with 8556 distfiles. All of them start with the common prefix
+``texlive-module-``. This specific prefix is used by a total
+of 23435 distfiles.
-1. The hash values are unknown for newly-downloaded distfiles, so
- ``repoman`` (or an equivalent tool) would have to use a temporary
- directory before locating the file in appropriate subdirectory.
+In the original debate that occurred in bug #534528 [#BUG534528]_
+and the mailing list review of the initial version of this GLEP [#ML1]_,
+four fundamental ideas for splitting distfiles were listed:
+
+a. using initial portion of filename,
+
+b. using initial portion of file hash,
+
+c. using initial portion of filename hash,
+
+d. using package category (and package name).
+
+The initial filename idea was to use the first character of the filename,
+possibly followed by a longer prefix, a scheme historically used
+e.g. by the PyPI Python package hosting. Its main advantage is
+simplicity: users can easily determine the correct subdirectory
+by just looking at the distfile name. Sadly, this solution is not only
+very uneven but also does not solve the problem. As mentioned above,
+the TeΧ Live packages share a long common prefix that makes it impossible
+to split them properly using fixed-length prefixes.
+
+This idea was followed by an adaptive proposal by Andrew Barchuk
+[#ADAPTIVE_FILENAME]_. In this proposal, the filenames are not strictly
+mapped to groups by a common prefix; instead, each group contains
+all files that sort between two boundary prefixes (as in a dictionary).
+However, it has been pointed out that while this option can provide
+very even results initially, it is impossible to predict how it would
+be affected by future distfile changes, and there would be a risk of
+needing to change the groups in the future. Furthermore, it is
+relatively complex and requires explicitly listing or obtaining
+the groups in use.
+
+Another option was to use an initial portion of distfile hashes. Its
+main advantage is that cryptographic hash algorithms can provide
+a more balanced split, as their output behaves like random data.
+Furthermore, since hashes are stored in Manifests, using them has
+no cost for users. However, this solution has three disadvantages:
+
+1. Not all files in the distfile tree are covered by package Manifests.
+ Additional files are injected into the mirrors, and those will
+ not have a clearly-defined location.
2. User-provided distfiles (e.g. for fetch-restricted packages) with
hash mismatches would be placed in the wrong subdirectory,
potentially causing confusing errors.
-3. Not all files in the distfiles tree are covered by package Manifests
- --- there are additional files that are injected into distfiles.
+3. The hash values are unknown for newly-downloaded distfiles, so
+ ``repoman`` (or an equivalent tool) would have to use a temporary
+ directory before placing the file in the appropriate subdirectory.
-Using filename hashes has proven to provide a similar balance
-to using file hashes. Furthermore, since filenames are known up front
-this solution does not suffer from the both listed problems. While
-hashes need to be computed manually, hashing short string should not
-cause any performance problems.
+Using filename hashes has proven to provide a similar balance to using
+file hashes. Furthermore, since filenames are known up front, this
+solution does not suffer from the problems listed above. While hashes
+need to be computed manually, hashing a short string should not cause
+any performance problems.
+
+Jason Zaman suggested using package categories (and package names)
+[#PKGNAME]_. However, this solution has multiple problems:
+
+a. it does not solve the problem for large packages such as TeΧ Live,
+
+b. it introduces many unnecessarily small directories,
+
+c. it requires an explicit knowledge of which package distfiles
+ belong to,
+
+d. it does not provide an explicit solution to the problem of distfiles
+ shared by multiple packages,
+
+e. it does not provide a solution to the problem of injected distfiles.
+
+All the options considered, the filename hash solution was selected
+as one that solves all the aforementioned problems while introducing
+relatively low complexity and being reasonably future-proof.
.. figure:: glep-0075-extras/by-filename.png
@@ -327,6 +382,17 @@ References
of DISTDIR
(https://bugs.gentoo.org/534528)
+.. [#ML1] [gentoo-dev] [pre-GLEP] Split distfile mirror directory structure
+ (https://archives.gentoo.org/gentoo-dev/message/cfc4f8595df2edf9a25ba9ecae2463ba)
+
+.. [#ADAPTIVE_FILENAME] Andrew Barchuk's reply on 'using character ranges
+ for each directory computed in a way to have the files distributed evenly'
+ (https://archives.gentoo.org/gentoo-dev/message/611bdaa76be049c1d650e8995748e7b8)
+
+.. [#PKGNAME] Jason Zaman's reply including 'using the same dir layout
+ as the packages themselves'
+ (https://archives.gentoo.org/gentoo-dev/message/f26ed870c3a6d4ecf69a821723642975)
+
Copyright
=========
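
The difference between option (a), filename prefix, and option (c), filename hash, in the rationale above can be illustrated with a minimal Python sketch. It is not the GLEP's reference implementation: the choice of BLAKE2b, the two-hex-digit cutoff, the helper names ``prefix_bucket`` and ``filename_hash_bucket``, and the generated file names are all assumptions made for illustration only.

```python
import hashlib
from collections import Counter


def prefix_bucket(filename: str, length: int = 1) -> str:
    """Option (a): directory named after the first `length` characters."""
    return filename[:length]


def filename_hash_bucket(filename: str, hex_digits: int = 2) -> str:
    """Option (c): directory named after the leading hex digits of a hash
    of the file *name* (not the file contents).

    BLAKE2b and the 2-digit cutoff are assumptions for this sketch.
    """
    digest = hashlib.blake2b(filename.encode("utf-8")).hexdigest()
    return digest[:hex_digits]


# Hypothetical distfile names sharing the ``texlive-module-`` prefix.
names = [f"texlive-module-pkg{i}.tar.xz" for i in range(1000)]

# A 15-character prefix still puts every file into the same directory...
by_prefix = Counter(prefix_bucket(n, 15) for n in names)

# ...while hashing the name spreads the files over up to 256 directories.
by_hash = Counter(filename_hash_bucket(n) for n in names)

print("prefix directories:", len(by_prefix), "largest:", max(by_prefix.values()))
print("hash directories:  ", len(by_hash), "largest:", max(by_hash.values()))
```

Because only the file name is hashed, the target directory can be computed before the file is downloaded, which avoids the temporary-directory problem listed for content hashes.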