[Exherbo-dev] Integrity checking

Alex Elsayed eternaleye at gmail.com
Thu Feb 19 23:38:23 UTC 2015

Niels Ole Salscheider wrote:

> Hi,
> this topic has been discussed several times but Exherbo still lacks
> integrity checking for distfiles. The main reason why it has never been
> implemented is that it is said to break the development workflow.
> We had some proposals to use git-annex for distfiles but there has not
> been any work on it. Also, one problem with git-annex is that it is
> written in Haskell which we do not want in the system set.

True, there hasn't been any work on it - but git-annex being written in 
Haskell is less of a problem than it could be.

Git-annex's actual data structures are very well-defined, and writing a (very 
minimal) implementation over libgit2 for the sake of fetching wouldn't be 
especially onerous so far as I can see. I'm interested in doing it (and was 
one of the primary people bringing up git-annex in the past discussions), 
but simply don't have time due to being swamped at work.
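
As a rough illustration of how small the verification side of such a fetcher could be, here is a Python sketch that parses a git-annex key and checks downloaded bytes against it. It assumes the documented key layout (BACKEND, optional fields such as -sSIZE, then "--" and the name) and handles only the SHA256 and SHA256E backends. The function names are hypothetical, and it does not touch libgit2 or the git-annex branch at all.

```python
import hashlib
import re
from typing import NamedTuple, Optional


class AnnexKey(NamedTuple):
    backend: str          # e.g. "SHA256E"
    size: Optional[int]   # value of the -s field, if present
    name: str             # digest, plus extension for the *E backends


# BACKEND, then zero or more "-<letter><digits>" fields, then "--", then the name.
_KEY_RE = re.compile(
    r"^(?P<backend>[A-Z0-9_]+)(?P<fields>(?:-[a-zA-Z][0-9]+)*)--(?P<name>.+)$"
)


def parse_key(key: str) -> AnnexKey:
    """Split a key string into backend, size (if any) and name."""
    m = _KEY_RE.match(key)
    if m is None:
        raise ValueError(f"not a recognisable annex key: {key!r}")
    size = None
    for letter, value in re.findall(r"-([a-zA-Z])([0-9]+)", m.group("fields")):
        if letter == "s":
            size = int(value)
    return AnnexKey(m.group("backend"), size, m.group("name"))


def verify(key: str, data: bytes) -> bool:
    """Check data against the size and SHA-256 digest recorded in the key."""
    parsed = parse_key(key)
    if parsed.backend not in ("SHA256", "SHA256E"):
        raise ValueError(f"unsupported backend: {parsed.backend}")
    if parsed.size is not None and parsed.size != len(data):
        return False
    # A SHA-256 hex digest is 64 characters; SHA256E appends the extension after it.
    expected = parsed.name[:64]
    return hashlib.sha256(data).hexdigest() == expected
```

A fetcher built this way would still need to look up the key for each distfile and find a remote that has it; only the final integrity check is shown here.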

We might still require a full git-annex to _generate_ the appropriate 
metadata, but that is a.) less common and b.) could be phased out as the 
'minimal' one gained features.

> Yet, I think that integrity checking is important: it might not completely
> protect against e. g. the NSA but without it you become an easy target -
> for example, if you have to install something when only some untrusted and
> shared wireless connection is available. Integrity checking also helps to
> detect corrupt downloads or when upstream silently modifies a released
> tarball which can be annoying.

There's a difference between "protects against accidental corruption" and 
"protects against malicious compromise."

One of the big problems with various suggested strategies is designing for 
the former, and then talking about it doing the latter.
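
To make the distinction concrete, here is a small Python sketch; the byte strings and names are made up for illustration. A bare checksum catches corruption, and it catches a swapped file only as long as the attacker cannot also swap the checksum.

```python
import hashlib


def checksum_ok(data: bytes, expected_hex: str) -> bool:
    """Return True if the SHA-256 of data equals the expected hex digest."""
    return hashlib.sha256(data).hexdigest() == expected_hex


# What upstream actually released, and the checksum recorded for it.
original = b"release-1.0 tarball contents"
published = hashlib.sha256(original).hexdigest()

# Case 1: accidental corruption in transit. The checksum catches this.
corrupted = b"release-1.0 tarball c0ntents"

# Case 2: an attacker replaces the file but cannot touch the recorded checksum.
backdoored = b"release-1.0 tarball contents + backdoor"

# Case 3: the attacker can also replace the recorded checksum.
forged = hashlib.sha256(backdoored).hexdigest()

print(checksum_ok(original, published))    # intact download
print(checksum_ok(corrupted, published))   # corruption
print(checksum_ok(backdoored, published))  # swapped file, honest checksum
print(checksum_ok(backdoored, forged))     # swapped file, swapped checksum
```

Only the last case passes verification with a bad file, which is why it matters where the checksum comes from and who was paying attention when it was recorded.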

> I would therefore like to discuss how a simple checksum based approach
> could look like and in which way it might break someone's workflow.

Sure, sounds good to me.

> Such an approach could for example add a [[ checksum = SHA256 ]]
> annotation to each file in DOWNLOADS.

Immediately, we have a problem: Simple 'git mv' bumps aren't viable anymore.

> We could have a tool that automatically fetches the files in the exheres
> and that adds / updates the  checksum annotations.

This makes it slightly better, but
a.) We still have churn in the diffs (a SHA256 isn't tiny)
b.) Exlibs can no longer declare DOWNLOADS based on exparams (since the url 
takes input from the exheres, but is actually formatted in the exlib) 
without taking a checksum exparam for _each_ generated URL
c.) We now have a tool making modifications to a file that should be written 
by humans, and is in a notoriously difficult format to parse correctly (Bash 
scripts). If it's just a sed, then we're going to have /problems/. See the 
discussions way back when about why making EAPI a variable in ebuilds is 

> It would also present the computed checksums to the
> user so that he can compare them with published checksums from other
> sources. Maybe we could even run this tool as a git commit hook so that
> the process becomes mostly transparent...

"Transparent" and "mindfulness" are somewhat contradictory things - and if 
you're using checksums to make sure something hasn't been backdoored (rather 
than simply corrupted), then you _need_ mindfulness on the part of whoever's 
setting the checksums.

> We do not only have to check the downloaded files but also our
> repositories.

Perhaps - although it depends on the guarantees you want to make. Figure out 
your threat model _first_, because if you design without a clear threat 
model in mind you're going to get a nonfunctional mess.

While it may not work for our needs, The Update Framework is a good example 
of how to do it _right_, by figuring out a solid threat model and then 
taking a principled approach to handling it. It also can be read as an 
illustrative text on exactly how hard the problem is. (Hint: the answer is 

> Git helps here a bit since the SHA1 hashes change when history is
> rewritten.

They do, but there's a real limit on how much this helps.

> Rewriting history is however something that occurs often during
> development.

Arguable - with repositories, it's actually quite rare as I understand it. 
Gerrit never does so to my knowledge, and a repository generally doesn't 
_need_ it anyway.

Even for large things like multibuild, historically they've simply been 
merged into master. No rewriting.

> Therefore I suggest that we use forced pulls only from some configurable
> and trusted suffixes (so that "cave sync arbor --suffix local" continues to
> work when rewriting history during development).

Suffixes are arbitrary strings [drawn from the valid characters]. This isn't 
going to fly.

> But we should make git complain if we pull from the official
> repositories and history is rewritten. Of course, it is still possible to
> add malicious commits to the top but these are hopefully a bit easier to
> spot.

Spotting a malicious commit when the "malice" is choosing a SHA256 that 
matches a backdoored tarball? Something people will just look at and say 
"Eh, that long string of hex is in the right place"?

No problem, I'm sure we'll catch it with the first glance.

All that does is push the issue we have today ("Does this file match the 
_correct_ upstream file") to sync time, rather than fetch time. No benefit 
to security, save for them needing to get the bad checksum into git, rather 
than just the bad file onto the mirrors.

To be honest, the only thing preventing that is our relative obscurity as a 
distro. Nobody's going to consider it worth the effort.

> Also, we can prevent most MITM attacks by using https for our
> repositories.

Sure, HTTPS is great! Shame about how easy it is to get BS certificates.

> What do you think?

Well-intentioned, but with insufficient precision in modeling the underlying 

> Ole
