nucypher/doc/architecture.rst

NuCypher
========

Depencencies / technologies
=============================

* Python 3.5+
* rpcudp - python3.5 branch
* kademlia - python3.5 branch
* lmdb for persistence
* Rekeys and metadata represented as Python dicts, msgpacked and encrypted,
  stored in lmdb
* C bindings to OpenSSL for encryption (?)
* PyCryptodome / PyCrypto for symmetric block ciphers
* buildout for building (more convenient when using custom git dependencies?)

Decentralized network
========================

`Kademlia <https://github.com/bmuller/kademlia>`_ by default (see kademlia.network.server) saves data in multiple nodes,
and also clients are servers there.

We need to split up client and server (that is, get and set methods of the
client don't save data in the current node).

In the first version of the protocol, we will use m-of-n threshold re-encryption
for ECIES. It means, that instead of one re-encryption key, we will generate
n re-encryption keys and store each with one node in the network.

By default, Kademlia stores data *copied* to *several* closest nodes. Instead,
we want find n closest and responding nodes and store rekeys with them, w/o
duplicating. The methods get() and set() in ``kademlia.network.Server`` are to
be used only as documentation. We will have to write our own ClientServer class.

The protocol (``kademlia.protocol.KademliaProtocol``) is also to be re-written for
reencryption rather than returning data.
When connections are established with nodes, they should tell their pubkeys
(or rather the pubkeys should be used as public nodeids).

New methods should include: ``store_rekey`` (with policy), ``reencrypt``,
``remove_rekey``.

Nodes should be able to have information on how long they can store
re-encryption keys for (this information will come from metadata written
on blockchain). Clients will be able to knows in advance.
Each node is identified by its pubkey, and clients will be able to know
in advance which node is available to store the policy for long enough.

Another feature to be implemented here is replicating all the rekeys to a
different node is the node is going to be offline for a long time
(complete shutdown). If this happens, the node passes all its rekeys
to node(s) which are capable to handle them for long enough, and write
this information on blockchain.

When a node start, a key which will be used to decrypt the persisted
data can be generated, read from a file (not very safe!), made from
passphrase (safe if the passphrase is long enough and generated),
or stored + delegated access using NuCypher itself.

This kademlia-based protocol is *not* intended to be anonymous, we hope for
split-key reencryption properties (e.g. that < m random nodes will be corrupt).

Persistence layer
====================

The persistence layer to be used is lmdb. Rekeys and metadata can be represented
as Python dictionaries. And when persisted - serialized via msgpack and stored in
lmdb (in an encrypted form).

API
=====
First, we create a Python API. This API should allow to:

* generate a new random symmetric key (this is usually implicit)
* encrypt (off-chain, but store meta-information with files)
* grant and revoke access (on chain)
* decrypt_key (query the network)
* decrypt (data using a key from decrypt_key)

also we can have similar functions for signing rather than just
encryption/decryption in the next versions.

The API should be implemented for: Python (native client),
JSON server (localhost, similar to bitcoind), Javascript (native).

Encryption
=============
We should be able to have algorithms pluggable, so we will note which algo
did we use for pubkey encryption / reencryption in a rekey meta-information.
The choices are:

* Normal BBS98 (1-of-n) (debug only);
* Normal ECIES (1-of-n);
* AFGH (n-of-n) (debug only);
* Split-key ECIES (m-of-n, production ready).

As soon as split-key ECIES is available, we immediately switch to it.
The curve should also be specified. Makes sense to use secp256k1 as it was
well tested with Bitcoin.

We also store which block cipher we used. The choices are:

* AES256-GCM (lisodium-based library for zerodb is the fastest?)
* Other AES modes (maybe not vulnerable to reusing the IV)
* Salsa20 from libsodium

Consumers of the data identify it by owner's public key and the path. It is
important that someone else doesn't submit reencryption keys for the same
path. So, at first, we should add digital signatures for hash(path + policy)
(using pycrypto library?). Then this signature and associated data will be
recorded on the blockchain so that it is publicly verifyable. The miners
have to accept only paths with valid signatures.
Public key should be used as a part of rekey address.
The scheme wouldn't work with anonimity on, so it will have to be redesigned
to be anonymous in later versions of the protocol.

Mapping in the rekey store:

    * hash(path) -> (rekey, policy, algorithm, signature, pubkey)

The pubkey here is *not* the encryption key, it's a separate signing key.

Algorithms/libraries to use:

    * ECDSA (pycryptodome / pycrypto), secp256k1 curve
    * sha3 module for hash functions (let's be future-proof!)
      (included in standard hashlib with python3.6+)


Non-anonymous protocol
============================

Owner of the data has signing keypair sk_o/pk_o and encrypting keypair ske_o/pke_o.
ske_o = hash(sk_o)

The path can be a string or a tuple (where a string is equivalent to a tuple with length one).
An example of a tuple-path::

    path = ('', 'home', 'ubuntu', 'secret.txt')

When a path contains many elements in the tuple, one can share not only one file, but also whole directories.
If the PRE algorithm is not multihop+unidirectional (there is only one like that), the encryption keys for
files/directories are::

    key[i] = hmac(ske_o, '/'.join(path[:i + 1]))

so, key[0] is the (private) key for whole ``/``, key[1] for ``/home`` etc.
When a file (or object) with ``path`` is encrypted, the owner generates a symmetric key for it,
encrypts it with every of key[i] and attaches to the file (or returns just keys if asked for).
When attached to the file, the encrypted symmetric keys are stored together with hashes of
paths and subpaths so that we can verify that this file is encrypted for the users of this path.

When a file or a directory is shared with someone with a key pair (sk_b/pk_b), the re-encryption
key is created for a path shared::

    rk = rekey(key[i], pk_b)

where key[i] is calculated in-place from the path, and rk might mean also all re-encryption shares
rather than just one rekey.

After the calculation, the rk is stored with the NuCypher network. It will be stored in the following
persistent mapping::

    hmac(pk_o + pk_b, '/'.join(path[:i])) -> (rk, policy, algorithm, sign(hash + rk + policy + algorithm, pk_o))

The policy is signed by the owner's public key in order to protect from submitting by someone else.
In order to protect from submitting after being revoked, the signature can be saved on blockchain
when the policy is submitted and when revoked so that no one can use a replay attack to submit it
again (needs to be rethoght for anonymous protocol).

All the interactions are encrypted with each node's public key + symmetric key, so that nobody
except that node can see the rekey. It's usually one-time interaction over rpcudp, so public key
encryption would work faster than TLS would work.

When a client requests to re-encrypt data, the request is initiated by a command like::

    data = client.decrypt(encrypted_data, pk_o, '/path/to/file/or/directory/where/it/is')

What happens under the hood is the following is sent to the miner node in a request encrypted
with miner's public key (on the client side)::

    # Path is transformed into a series of hashes
    path_split = path.split('/')
    path_pieces = ['/'.join(path_split[:i + 1]) for i in len(path_split)]
    path_hashes = [hmac(pk_o + pk_b, piece) for piece in path_pieces]

    # Multiple pieces are when m-of-n split-key reencryption is used
    # if not, there is only one piece
    edata_pieces = low_level_client.reencrypt(encrypted_data, pk_o, path_hashes)
    data = decrypt_m_of_n(edata_pieces, sk_b)

When the server gets a request with all the path_hashes, it looks for a reencryption key
corresponding to at least one of them, and uses the last one of what it found to reencrypt
the data::

    def request_handler(encrypted_data, path_hashes):
        for p in path_hashes[::-1]:
            if p in storage:
                rk = storage[p]
                return reencrypt(encrypted_data, rk)

        raise KeyNotFound