A hash160 Collision

This is a collision finders pool, because its main purpose is to find a hash160 collision. It may be necessary to describe in detail what a hash160 collision is. In order to understand the nomenclature here, you should have at least a basic knowledge about the bitcoin address generation process, i.e. how a BTC address is generated from a private key. This article in the Bitcoin wiki explains the process quite nice.

As you can see, a hash160 is basically the public key (the EC coordinates) value hashed with SHA256 and the result of that hashed with RIPEMD160. Because the output value of RIPEMD160 is 20 bytes in size, whereas the input value is 32 bytes in size, it is inevitable that there do exist the same output values for different input values. RIPEMD160 in this case may or may not be a Surjective function, but because its domain is 296 times bigger than its codomain, there must be collisions. In fact, this is the very reason why there should be - on average - about 296 private keys for each bitcoin address. RIPEMD160 definitely is NOT an Injective function.

In cryptography, it is considered good property of every hash function if it evenly distributes the values in its codomain. Because the codomain of the SHA256 function is the domain of the RIPEMD160, we can expect the 296 valid private keys to every address also to be evenly distributed in the 256bit search space, each on average in a block of 160bit size. Because each such block is as good as any other, we could inspect the first 2160private keys and should generate all possible bitcoin addresses out of these.

What We Don't Do

Many people believe a collision is the accidental creation of a private key that is identical to some already existing/used private key. This would be similar to the following code:

while(1) {
  if (rand(2^256) == rand(2^256) {
    print "We got ourselves a re-used private key!\n";

Obviously, the probability for this to happen is 1 in 2512, which is such a ridiculously small number, that there is nothing in the physical world to describe it by example. As a collision is the finding of a different private key to a given address this process would be pointless for the pool even if it were feasible.

What We Do

The general process of this pool can be described best with the following pseudo-code:

adr1 = ripemd160(sha256(pubkey(rand(2^256-2^160)+2^160)))
for (a = 0 to 2^160) {
  adr2 = ripemd160(sha256(pubkey(a)))
  if (adr1 == adr2) {
    print "We got ourselves a collision!\n";

Meaning "Given a bitcoin address 'adr1' from a random(unknown) private key of numeric value between 2160 and 2256: find another private key in the interval between 0 and 2160 which will evaluate to the same bitcoin address."

The interval 0 to 2160 is still a pretty big pile of numbers. In fact, way too big for some single machine to process and this is where usually a pool of machines/clients is used. The LBC pool does some more things to make this search for a collision more feasible:

Interval Partitioning

The pool solves the problem of work distribution among many different clients, who promise to do a certain amount of searching. Depending on their proclaimed speed and promised time to work on the problem they are assigned a small interval out of the huge 0 to 2160 search space.

It keeps track which client was given when what interval to work on and when this promised work is due for PoW (proof of work). The huge 160bit search space is therefore partitioned into smaller search space intervals and those already searched are reassembled again in the pool DB. This way, the pool makes sure very fast clients can co-work side-by-side with way slower clients. The pool also makes sure that work issued to the clients is not done twice, so in the general case your client gets to work on an interval no other client has seen before.

Instant Check-All

At the moment, the pool is looking for private key collisions of P2PKH addresses. We are looking only at addresses with funds on them (see below for the why) and at the moment there are around 9 million of these in use. The pool uses a probabilistic data structure called Bloom filter to do the checking for each newly generated hash160 against all of the known hash160 with funds on them.

So when the pool has a performance of - say - 25 MKeys/s, this actually means that every second, the pool generates 50 million hash160 (for each private key the compressed and uncompressed hash160 is generated) and checks these against the ~ 9 million existing hash160 with funds on them. This is equivalent to 450 trillion checks per second.

Because we assume the private keys to addresses with funds on them are also uniformly distributed in the search space, this means that the effective search space until any collision for these is found is 160bit - log2(9000000)bit = ~137bit.

Fun fact: The current Top 500 combined performance, could generate and check around 13 TKeys/s. With this performance, it would require 424971148634821481904 years to sweep through a 137bit search space. However, it's projected performance is such that 5 years later, it would require only 42497114863482148190 years to complete the task (382474033771339333714 years less).

The Case For Funds

As the pool is checking the generated hash160 against those existing hash160 with funds on them, there were comments that the pools purpose is not to find a collision, but to "crack private keys to get hold of funds". The process described above should make it abundantly clear that we are searching for collisions. However, searching for collisions is only one half of the task of finding collisions.

In order to find a collision, it would indeed be sufficient to find any collision between any generated private key against any other private key. The problem with this approach is, that in order to be able to check for such a collision, you would have to store all the generated private keys so far in a central place and if you had a distributed search infrastructure, you would have to make this updated database available to all clients for each key they have generated.

Clearly, this approach is not feasible, because the overhead for doing that would use a multiple times the ressources as if you had a single client working on the task. So this approach would render the whole problem unparallelizable.

Storing all addresses with funds on them in a single data structure on the other hand is perfectly feasible. And this data structure can be distributed among the clients. What's even more important, that the funds on these addresses themself are a very high motivation for the rightful owner to claim them back in case an alternate private key has been found.

Addresses without funds on them are getting out of focus. If you had used some address previously and this address had now no funds on it, even if someone found an alternate private key to it, you would neither notice and probably not even care. If - on the other hand - funds of some of your addresses in use was suddenly transferred to another address, you'd most certainly would want to have it back.

Claiming to be the rightful owner is actually very easy with the pool. Because is searches only in the first 160bit search space, it is very unlikely to find normal private keys that have been generated by a regular wallet software (meaning the first 96bits of the PK being 0). Therefore, as the rightful owner all you have to do is to present your private key to the address which by a chance of 296 to 1 will be a different private key than the one the pool has found.