[Exherbo-dev] [RFC] Rethinking fetchers

Alex Elsayed eternaleye at gmail.com
Sun Mar 4 02:54:47 UTC 2012


In thinking about how downloading distfiles currently works, I had an
idea that might improve the situation as it stands.

In the past, Paludis has had requests to support BitTorrent for
distfiles for a variety of reasons. Some of these made sense; some did
not.

I'm going to suggest a full revamp of how fetchers work to make the
entire system more flexible.

First of all, the current system of putting an executable in
$SHAREDIR/paludis/fetchers/, named for the URI scheme it handles, and
simply calling it for such URIs resembles the original .bash hooks to
some degree. I propose taking fetchers in the same direction hooks
went: supporting .fetch fetchers to start with, and potentially adding
.py and other fetcher types later. For .fetch, the syntax would be very
close to .hook. I have included a copy of 'docurl' redone in this
format at the end of this email. It may be necessary to add some sort
of 'priority' rating as well, but I'll leave that for later discussion.

One nice result of this is that it should be much easier to support a
new download manager in the future. Also, multiple handlers for each
scheme could be installed at once - perhaps we could even include a
check_working_$URI set of calls in the API, so that if the underlying
program for one handler is broken or missing, Paludis could fall back
to the next handler until one works.
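To make that concrete, here is a rough sketch of what such a check
might look like in a .fetch file. The check_working_* names just follow
the naming I'm suggesting above, and the actual test is only
illustrative:

# Hypothetical per-scheme checks; Paludis would skip this fetcher and
# try the next handler for the scheme if the check fails.
check_working_http() {
    # Usable only if the underlying program actually exists.
    type -P "${LOCAL_CURL:-curl}" >/dev/null
}

check_working_https() {
    check_working_http
}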

This would also make it much easier to support another idea I had.

I was reading up on various file transfer protocols today, and ran
across aria2. It supports taking multiple URIs (of varying protocols)
which point to the same file, and using them to download the file in
the best manner possible. Depending on what options are set, it may
simply pick the fastest mirror, or pick several mirrors and download
chunks from them in parallel, and so on. Using the new .hook-style
fetcher system, we could add an indicator of whether a downloader
supports this usage (being passed multiple sources for the same file).
I think the best way to do this
would be via a 'multisource' response in fetch_auto_names(). Then,
fetch_multisource() could be called with a list of alternate sources
(specifically, only those which are of a URI scheme accepted by the
other entries in fetch_auto_names()). Such a feature would potentially
provide major performance gains.
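As a rough sketch of what I mean (the fetch_multisource() calling
convention is only what I'm proposing here, and LOCAL_ARIA2/EXTRA_ARIA2
are hypothetical analogues of LOCAL_CURL/EXTRA_CURL), a
multisource-capable .fetch fetcher built on aria2 might look something
like:

fetch_auto_names() {
    echo http https ftp multisource
}

# Proposed convention: destination file first, then every alternate URI
# for that file. aria2c accepts multiple URIs for a single download and
# splits it across them.
fetch_multisource() {
    local dest=${1}
    shift
    ${LOCAL_ARIA2:-aria2c} ${EXTRA_ARIA2} \
        --dir="$(dirname "${dest}")" \
        --out="$(basename "${dest}")" \
        "${@}"
}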

In order to make best use of this, I have a multi-stage plan.

1.) Introduce the new fetcher format and port the default and demo
fetchers to it.
2.) Introduce at least one fetcher with such features.
3.) If a fetcher advertises multisource capability, pass it *all*
possible mirrors for mirror://foo/ DOWNLOADS.

The stages listed above require no changes to repositories at all.
This next stage does:

4.) For one-off mirroring, or mirrored files that don't have mirror://
entries configured, allow some sort of 'grouping' syntax for DOWNLOADS
to specify 'these are all the same file, from different sources' (see
the sketch below).
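Purely as an illustration of the idea, and not a concrete syntax
proposal (whether plain parentheses would clash with the existing
conditional syntax is exactly the sort of detail that needs
discussion), it might look something like:

DOWNLOADS="( http://example.com/releases/${PNV}.tar.bz2
             ftp://mirror.example.org/pub/${PNV}.tar.bz2 )"

with the group meaning 'all of these URIs are sources for one and the
same file'.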

We may also be able to use this to resolve the manifests problem.
Metalink version 4 [RFC 5854] is a download description format that
includes provisions for hashes, PGP signatures, multiple sources, and
more. It is directly supported by a number of download managers
(including aria2), supports magnet URIs as a source, and allows
assigning priorities to sources, so we could give our own mirrors a low
priority and have them used only as fallbacks.
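For reference, a trimmed-down Metalink 4 file carrying a hash, a PGP
signature and two prioritised sources looks roughly like this (adapted
from the examples in RFC 5854; lower priority values are preferred, so
our own mirror gets the larger number):

<?xml version="1.0" encoding="UTF-8"?>
<metalink xmlns="urn:ietf:params:xml:ns:metalink">
  <file name="foo-1.0.tar.bz2">
    <hash type="sha-256">...</hash>
    <signature mediatype="application/pgp-signature">
      -----BEGIN PGP SIGNATURE-----
      ...
      -----END PGP SIGNATURE-----
    </signature>
    <url priority="1">http://example.com/foo-1.0.tar.bz2</url>
    <url priority="100">http://ourmirror.example.org/foo-1.0.tar.bz2</url>
  </file>
</metalink>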

Considering that it also supports multiple files (with the name to save
the file as being independent of the name at the URI), it may make a
decent replacement for DOWNLOADS altogether if we can find a good way
to integrate it with exheres.

One option might be to have a server that does something similar to
unavailable. It would read all repositories, download the relevant
distfiles, make a Metalink per exheres encompassing that exheres'
DOWNLOADS (with hashes), delete the downloaded files, commit the
Metalinks to git, and publish the result to a repository that gets
synced. If we do this, we could potentially replace our implicit
mirror:// with a metamirror://, which expands to a list of URIs that
point to Metalink files.

To support this, we would probably need either another entry in
fetch_auto_names() to indicate support for it, or another function
(fetch_multisource_formats()?) that says which kinds of multisource
input are supported. I would prefer the latter, as I can think of three
off the top of my head: argv, Metalink, and magnet.
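Following on from the earlier sketch, that could be as simple as:

# Which kinds of multisource input this fetcher understands: a plain
# list of URIs on the command line ('argv'), Metalink files, or magnet
# URIs.
fetch_multisource_formats() {
    echo argv metalink magnet
}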

After we have that, there are two reasonable ways to implement the
actual downloading based on those Metalinks:
1.) Download the Metalink, then point the downloader at it in a
two-step process.
2.) Let the downloader handle the whole process (see aria2's
--follow-metalink option).

Personally I'd prefer option 2, but as I said either seems reasonable
to me.
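As an illustration of option 2 (the host name and ${distdir} below are
made-up placeholders, and the exact flags would be up to the fetcher),
aria2 can be handed the Metalink URI directly and will fetch, parse and
follow it itself:

# --follow-metalink=mem parses the fetched Metalink in memory instead
# of saving it to disk first; ${distdir} stands for wherever distfiles
# end up.
aria2c --follow-metalink=mem --dir="${distdir}" \
    "http://metamirror.example.org/${CATEGORY}/${PNV}.meta4"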

----- docurl.fetch -----
#!/usr/bin/env bash
# vim: set sw=4 sts=4 et :

# Curl fetcher for paludis
# Set EXTRA_CURL in paludis' bashrc for extra options for curl.

export PATH="$(${PALUDIS_EBUILD_DIR}/utils/canonicalise ${PALUDIS_EBUILD_DIR}/utils/ ):${PATH}"
source ${PALUDIS_EBUILD_DIR}/echo_functions.bash

old_set=$-
set -a
for f in ${PALUDIS_BASHRC_FILES}; do
    [[ -f "${f}" ]] && source "${f}"
done
[[ "${old_set}" == *a* ]] || set +a

# As with the standalone fetchers, ${1} is the source URI and ${2} is
# the file to save it to.
docurl() {
    if [[ -n "${PALUDIS_USE_SAFE_RESUME}" ]] ; then
        docurl_saferesume "$@"
    else
        docurl_noresume "$@"
    fi
}

check_partial() {
    if [[ -f "${2}.-PARTIAL-" ]] ; then
        if [[ $(wrapped_getfsize "${2}".-PARTIAL- ) -ge 123456 ]] ; then
            einfo_unhooked "Attempting resume using ${2}.-PARTIAL-"
        else
            einfo_unhooked "Not attempting resume using ${2}.-PARTIAL- (too small)"
            echo rm -f "${2}".-PARTIAL-
            rm -f "${2}".-PARTIAL-
        fi
    fi
}

docurl_saferesume() {
    check_partial "$@"
    echo ${CURL_WRAPPER} ${LOCAL_CURL:-curl} ${EXTRA_CURL} --connect-timeout 30 --retry 1 --fail -L -C - -o "${2}".-PARTIAL- "${1}" 1>&2
    if ${CURL_WRAPPER} ${LOCAL_CURL:-curl} ${EXTRA_CURL} --connect-timeout 30 --retry 1 --fail -L -C - -o "${2}".-PARTIAL- "${1}" ; then
        echo mv -f "${2}".-PARTIAL- "${2}"
        mv -f "${2}".-PARTIAL- "${2}"
        exit 0
    else
        rm -f "${2}"
        exit 1
    fi
}

docurl_noresume() {
    echo ${CURL_WRAPPER} ${LOCAL_CURL:-curl} ${EXTRA_CURL} --connect-timeout 30 --retry 1 --fail -L -o "${2}" "${1}" 1>&2
    if ${CURL_WRAPPER} ${LOCAL_CURL:-curl} ${EXTRA_CURL} --connect-timeout 30 --retry 1 --fail -L -o "${2}" "${1}" ; then
        exit 0
    else
        rm -f "${2}"
        exit 1
    fi
}

fetch_ftp() {
    docurl "$@"
}

fetch_http() {
    docurl "$@"
}

fetch_https() {
    docurl "$@"
}

fetch_auto_names() {
    echo ftp http https
}



