[Exherbo-dev] Integrity checking

Niels Ole Salscheider niels_ole at salscheider-online.de
Fri Feb 20 21:15:14 UTC 2015

On Thursday 19 February 2015, 15:38:23, Alex Elsayed wrote:
> Niels Ole Salscheider wrote:
> > Hi,
> > 
> > this topic has been discussed several times but Exherbo still lacks
> > integrity checking for distfiles. The main reason why it has never been
> > implemented is that it is said to break the development workflow.
> > 
> > We had some proposals to use git-annex for distfiles but there has not
> > been any work on it. Also, one problem with git-annex is that it is
> > written in Haskell which we do not want in the system set.
> True, there hasn't been any work on it - but git-annex being written in
> Haskell is less of a problem than it could be.
> Git-annex' actual data structures are very well-defined, and writing a (very
> minimal) implementation over libgit2 for the sake of fetching wouldn't be
> especially onerous so far as I can see. I'm interested in doing it (and was
> one of the primary people bringing up git-annex in the past discussions),
> but simply don't have time due to being swamped at work.
> We might still require a full git-annex to _generate_ the appropriate
> metadata, but that is a.) less common and b.) could be phased out as the
> 'minimal' one gained features.

Ok, sure. I am not opposed to using git-annex but it would require somebody to 
do the work - and it seems that there is no one who is able and willing to do 
so.

> > Yet, I think that integrity checking is important: it might not completely
> > protect against e. g. the NSA but without it you become an easy target -
> > for example, if you have to install something when only some untrusted and
> > shared wireless connection is available. Integrity checking also helps to
> > detect corrupt downloads or when upstream silently modifies a released
> > tarball which can be annoying.
> There's a difference between "protects against accidental corruption" and
> "protects against malicious compromise."
> One of the big problems with various suggested strategies is designing for
> the former, and then talking about it doing the latter.

So let's define the threat model...

For my current proposal, we would have to trust
a) all Exherbo developers and their machines
b) that the servers hosting the git repositories are not compromised
c) that the initially fetched tarballs (and the checksums published by 
upstream) do not contain malicious code

It would most notably protect against
1) corrupted downloads
2) (most) MITM attacks
3) compromised file mirrors
4) compromised upstream file servers after the initial fetch

I admit that a-c are huge assumptions but I guess that we cannot do much about 
a and c.
It would however be nice to avoid b, especially since this does not only 
include Exherbo's servers but also services like Github where unofficial 
repositories are hosted. But that might complicate things...

Having (only) a protection against scenarios 1-4 is still a good thing since 
these are somewhat likely to occur.
1 has happened to me several times (and is just annoying). 2 can be a real 
threat because it is just too easy to do for someone (including "script 
kiddies") who has control over one of the intermediate hops. 3 and 4 are 
popular for non-targeted attacks since it might allow to compromise many 

> > I would therefore like to discuss what a simple checksum-based approach
> > could look like and in which way it might break someone's workflow.
> Sure, sounds good to me.
> > Such an approach could for example add a [[ checksum = SHA256 ]]
> > annotation to each file in DOWNLOADS.
> Immediately, we have a problem: Simple 'git mv' bumps aren't viable anymore.
> > We could have a tool that automatically fetches the files in the exheres
> > and that adds / updates the  checksum annotations.
> This makes it slightly better, but
> a.) We still have churn in the diffs (a SHA256 isn't tiny)
> b.) Exlibs can no longer declare DOWNLOADS based on exparams (since the url
> takes input from the exheres, but is actually formatted in the explib)
> without taking a checksum exparam for _each_ generated URL
> c.) We now have a tool making modifications to a file that should be written
> by humans, and is in a notoriously difficult format to parse correctly
> (Bash scripts). If it's just a sed, then we're going to have /problems/.
> See the discussions way back when about why making EAPI a variable in
> ebuilds is *insane*.

Ok, I admit that there are problems with putting the checksum in the exheres 
itself. So let's put all checksums in a defined location in the repository and 
use some file format that can be parsed easily.
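
As a rough illustration of what such a separate checksum file could look like, 
here is a minimal Python sketch. It assumes a hypothetical manifest format with 
one "<sha256-hex>  <filename>" entry per line (the same layout that sha256sum 
prints). The function names and the format are assumptions made for this 
example; nothing like this exists in Paludis today.

```python
import hashlib
from pathlib import Path


def parse_manifest(text):
    """Parse lines of the form '<sha256-hex>  <filename>' into a dict."""
    checksums = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        digest, name = line.split(None, 1)
        checksums[name] = digest.lower()
    return checksums


def sha256_of(path, chunk_size=1 << 20):
    """Hash a file in chunks so large distfiles need not fit in memory."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def verify_distfile(path, checksums):
    """Return True only if the file is listed and its digest matches."""
    expected = checksums.get(Path(path).name)
    return expected is not None and sha256_of(path) == expected
```

A file that is not listed in the manifest is treated as a failure rather than 
silently accepted. Because the exheres itself is never touched, "git mv" bumps 
stay possible and no tool has to parse Bash.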

> > It would also present the computed checksums to the
> > user so that he can compare them with published checksums from other
> > sources. Maybe we could even run this tool as a git commit hook so that
> > the process becomes mostly transparent...
> "Transparent" and "mindfulness" are somewhat contradictory things - and if
> you're using checksums to make sure something hasn't been backdoored (rather
> than simply corrupted), then you _need_ mindfulness on the part of
> whoever's setting the checksums.

Agreed, we would have to find the right trade-off here.

> > We do not only have to check the downloaded files but also our
> > repositories.
> Perhaps - although it depends on the guarantees you want to make. Figure out
> your threat model _first_, because if you design without a clear threat
> model in mind you're going to get a nonfunctional mess.
> While it may not work for our needs, The Update Framework is a good example
> of how to do it _right_, by figuring out a solid threat model and then
> taking a principled approach to handling it. It also can be read as an
> illustrative text on exactly how hard the problem is. (Hint: the answer is
> "very".)

I was not aware of The Update Framework and will have to take a look at it 
during the weekend.

> > Git helps here a bit since the SHA1 hashes change when history is
> > rewritten.
> They do, but there's a real limit on how much this helps.
> > Rewriting history is however something that occurs often during
> > development.
> Arguable - with repositories, it's actually quite rare as I understand it.
> Gerrit never does so to my knowledge, and a repository generally doesn't
> _need_ it anyway.
> Even for large things like multibuild, historically they've simply been
> merged into master. No rewriting.

That's what I tried to express. There is (or at least should be) _no_ 
situation in which the history of public repositories is rewritten.
But this can happen during development on the developer's machine: You commit 
some new exheres, execute cave sync --local, see that you made a mistake and 
amend your commit. Now the hash changes and cave must not refuse to sync to 
your local repository afterwards.

> > Therefore I suggest that we use forced pulls only from some configurable
> > and trusted suffixes (so that "cave sync arbor --suffix local" continues to
> > work when rewriting history during development).
> Suffixes are arbitrary strings [drawn from the valid characters]. This isn't
> going to fly.

This would not be based on arbitrary strings. I would expect there to be some 
switch in /etc/paludis/repositories/yourrepo.conf that a developer can use to 
indicate that there is a trusted suffix (e. g. local). He would then be 
responsible for guaranteeing that it is not compromised and cave would accept 
forced pulls from this location.
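
The decision logic behind such a switch could be sketched as follows. This is 
only an illustration in Python: the SyncPolicy class and the trusted_suffixes 
setting are hypothetical names, not existing Paludis configuration keys.

```python
from dataclasses import dataclass, field


@dataclass
class SyncPolicy:
    # Hypothetical setting that a developer would list in the repository
    # configuration; no such key exists in Paludis today.
    trusted_suffixes: frozenset = field(default_factory=frozenset)

    def allow_update(self, suffix, is_fast_forward):
        """Decide whether a sync may replace the local branch tip."""
        if is_fast_forward:
            return True  # history only grew, so there is nothing to force
        # History was rewritten: accept only for explicitly trusted suffixes.
        return suffix is not None and suffix in self.trusted_suffixes
```

With SyncPolicy(frozenset({"local"})), a rewritten history would be accepted 
for "--suffix local" but rejected for a sync without a suffix.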

> > But we should make git complain if we pull from the official
> > repositories and history is rewritten. Of course, it is still possible to
> > add malicious commits to the top but these are hopefully a bit easier to
> > spot.
> Spotting a malicious commit when the "malice" is choosing a SHA256 that
> matches a backdoored tarball? Something people will just look at and say
> "Eh, that long string of hex is in the right place"?
> No problem, I'm sure we'll catch it with the first glance.

Point taken. But I still think that it would be good to warn if history is 
rewritten, even if it is only because it is not supposed to happen.
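
Detecting a rewrite boils down to checking whether the previously synced tip is 
still an ancestor of the newly fetched tip. The Python sketch below shows the 
idea on a toy commit graph (a dict mapping each commit to its parents, which is 
an assumption made for this example). In a real checkout, 
"git merge-base --is-ancestor <old> <new>" answers the same question.

```python
def is_ancestor(parents, ancestor, descendant):
    """Return True if `ancestor` is reachable from `descendant` via parent links."""
    seen = set()
    stack = [descendant]
    while stack:
        commit = stack.pop()
        if commit == ancestor:
            return True
        if commit in seen:
            continue
        seen.add(commit)
        stack.extend(parents.get(commit, ()))
    return False


def history_rewritten(parents, old_tip, new_tip):
    """A pull rewrites history when the old tip is missing from the new history."""
    return not is_ancestor(parents, old_tip, new_tip)
```

An amended commit shows up as a new tip that shares a parent with the old tip 
but does not descend from it, so history_rewritten returns True in that case.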

> All that does is push the issue we have today ("Does this file match the
> _correct_ upstream file") to sync time, rather than fetch time. No benefit
> to security, save for them needing to get the bad checksum into git, rather
> than just the bad file onto the mirrors.
> To be honest, the only thing preventing that is our relative obscurity as a
> distro. Nobody's going to consider it worth the effort.
> > Also, we can prevent most MITM attacks by using https for our
> > repositories.
> Sure, HTTPS is great! Shame about how easy it is to get BS certificates.

Right, but it makes things at least a bit harder. We might also distribute 
certificate fingerprints in the repository and complain if they don't match. 
This would limit the attack surface for MITM to the first fetch (given the new 
fingerprints are announced "early enough" for certificate updates which would 
be a problem for repositories on third party services).
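
A fingerprint check of this kind could look roughly like the Python sketch 
below. It assumes that the pinned values are SHA-256 fingerprints of the 
DER-encoded server certificate and that the repository ships a list of them; 
both are assumptions for this example. Allowing several pins at once is what 
would make it possible to announce a new fingerprint before the certificate 
changes.

```python
import hashlib
import hmac


def fingerprint(der_cert):
    """SHA-256 fingerprint of a DER-encoded certificate, as lowercase hex."""
    return hashlib.sha256(der_cert).hexdigest()


def normalize(pin):
    """Accept pins written either as 'AB:CD:...' or as plain hex."""
    return pin.replace(":", "").strip().lower()


def matches_pin(der_cert, pinned):
    """True if the certificate's fingerprint equals any pinned fingerprint."""
    actual = fingerprint(der_cert)
    return any(hmac.compare_digest(actual, normalize(p)) for p in pinned)
```

The DER bytes could be obtained from Python's ssl module, for example via 
SSLSocket.getpeercert(binary_form=True). An empty pin list matches nothing, so 
a repository without pins would have to be handled explicitly by the caller.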

However, I agree that this might not be enough to protect the repositories, 
depending on the threat model (cf. "b" from above). In order to get rid of "b" 
we would probably also need to sign the commits and have a way to establish 
which public keys are trusted.

> > What do you think?
> Well-intentioned, but with insufficient precision in modeling the underlying
> problem.
> > Ole