Specs for the 0.7 version
[1]crypto spec
[2]files spec
[3]datastore talk
[4]transfer talk
[5]padding talk
[6]transport/firewall/bootstrap talk
[7]state machine talk
[8]routing/other talk
[9]load balancing talk
[10]FCPv2 talk
[11]blocksize/splitfile talk
[12]keys/other talk
[13]datarequest/other talk
[14]queueing talk
[15]unicode talk


[1]
link level crypto:

We will use 256-bit AES, rather than the 128-bit AES we currently use.

We do the initial session negotiation the same way as we do now, only
over UDP, with silent Bob: Bob will not respond unless Alice knows
Bob's PK, so you cannot portscan for Freenet nodes. Alice and Bob prove
their identity to one another, and negotiate a session key.

Because of a certain paper I read, I suggest we change this slightly:

IV E(<sequence number> H(<data>) <data...>)

This brings the total protocol overhead to 32+32+4 = 68 bytes.

We do not want a malicious intermediary to be able to change the message
and have it still verify.


It would be nice to be able to identify which node a message belongs to
from the content alone, in case a node changes its IP address without
realizing it. It's not essential: without it, the node would simply
realize that nobody is talking to it any more and renegotiate all its
connections.

The easiest way to do this is to add H(session key) to the beginning of
each message. But this makes it considerably easier to identify freenet
traffic - a fixed string at the beginning of each message from a given
node. Not that it's hard to identify freenet traffic anyway, but one day
we may do some steganography, and I'd like traffic to be as close to
random as possible before that.

A way to vary it:

IV H(H(E) + H(session key)) E

Where E = E(<sequence number> <data>)

Total space overhead: 32 bytes IV + 32 bytes hash + 4 bytes sequence number = 68 bytes - as above.

Thus we have protection of the message, as nobody without the session
key can generate the hash, and we only have to check the hash of 64
bytes for each connection. We can make it 32 bytes:

IV H(H(E) XOR H(session key)) E

Then to identify which peer sent a packet, we do:

Extract E from the packet
Calculate H(E)
For each peer:
- Calculate H(H(E) XOR H(session key)). H(session key) will have been
precomputed.
- If this is correct, we have a match
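
For illustration, the lookup might look something like this in Java (this
is just a sketch, assuming SHA-256 for H and a precomputed H(session key)
per peer; the class and field names are made up):

import java.security.MessageDigest;
import java.util.Arrays;
import java.util.List;

class PeerLookup {
    static byte[] sha256(byte[] in) throws Exception {
        return MessageDigest.getInstance("SHA-256").digest(in);
    }

    static byte[] xor(byte[] a, byte[] b) {
        byte[] out = new byte[a.length];
        for (int i = 0; i < a.length; i++) out[i] = (byte) (a[i] ^ b[i]);
        return out;
    }

    static class Peer {
        byte[] sessionKeyHash; // precomputed H(session key)
    }

    // packetHash is the 32-byte hash field from the packet, payload is E.
    static Peer findSource(byte[] packetHash, byte[] payload, List<Peer> peers) throws Exception {
        byte[] hE = sha256(payload);                              // H(E), computed once
        for (Peer p : peers) {
            byte[] expected = sha256(xor(hE, p.sessionKeyHash));  // H(H(E) XOR H(session key))
            if (Arrays.equals(expected, packetHash)) return p;    // match => this peer sent it
        }
        return null;                                              // unknown source
    }
}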

For normal verification, we need some hashing:

lengthToHash = Length(E) + 32

So typically around 1040 + 32 = 1072 (very roughly)

For searching for the right node, we need some more hashing:

lengthToHash1 = Length(E) + 32 * numberOfHostsToCompareTo

Typically around 1040 + 32 * 200 = 1040 + 6400 = 7440

So it takes around 7 times as much CPU to find a node as to just verify
the packet. As long as we only have to do this occasionally (which we
can enforce by only doing it e.g. once per second), this is not a
problem.

My XP 2800+ can hash 6,692kB/sec.

So it could theoretically handle a 53Mbps link if it only needed to
verify packets, and a 7Mbps link if it needed also to find the source
node by the above method.


[2]
Files are 32kB. They have an IV (which is the hash of
the hash of the original data).



[3]
The Datastore:
The datastore only stores fixed-length files of 32kB. It is more of a
simple database than a filesystem. Dijjer has such a datastore, on which
the new code may be based:
http://cvs.sourceforge.net/viewcvs.py/dijjer/Dijjer/src/dijjer/io/store/Store.java?rev=1.13&view=markup
There is an index, which is kept in RAM and written to a file, and a
store file (it will be monolithic).

Dijjer does not even require a free blocks bitmap, because it implements
deletion by moving the last item back over the deleted item, and in any
case most of the time items will be replaced rather than deleted.
Since we will only commit complete and verified files, there will be no
need to delete anything, except in the case of datastore corruption. So
this can remain as-is, at least for now.
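
For illustration only, a rough Java sketch of such a store (32kB blocks,
in-RAM index, one monolithic file; the names and details are mine, not
Dijjer's actual code):

import java.io.RandomAccessFile;
import java.util.HashMap;
import java.util.Map;

class BlockStore {
    static final int BLOCK_SIZE = 32 * 1024;
    final RandomAccessFile storeFile;
    final Map<String, Integer> index = new HashMap<>(); // key -> slot number, kept in RAM
    int slots = 0;

    BlockStore(String path) throws Exception {
        storeFile = new RandomAccessFile(path, "rw");
    }

    /** Writes a complete, already verified 32kB block under the given key. */
    synchronized void put(String key, byte[] block) throws Exception {
        if (block.length != BLOCK_SIZE) throw new IllegalArgumentException("wrong block size");
        Integer slot = index.get(key);
        if (slot == null) { slot = slots++; index.put(key, slot); }
        storeFile.seek((long) slot * BLOCK_SIZE);
        storeFile.write(block);
    }

    /** Reads a whole block at once, or returns null if the key is unknown. */
    synchronized byte[] get(String key) throws Exception {
        Integer slot = index.get(key);
        if (slot == null) return null;
        byte[] block = new byte[BLOCK_SIZE];
        storeFile.seek((long) slot * BLOCK_SIZE);
        storeFile.readFully(block);
        return block;
    }
}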

Even in the case of datastore corruption, there probably won't be a big problem.

Obviously we will need a slight modification to implement pcaching, if we decide that is useful (some simulations say it's enormously important... others say it isn't).

Optionally, the datastore could fully inflate on installation, so that it
never needs to again. Some users may find fixed space usage
advantageous. This is not necessary to implement immediately but may be
useful before releasing 0.7.

We will not have sub-hashes. We will read the whole block
from the store at once, and write a whole block, verified, at once.
Therefore we do not need to have a separate commit operation or a notion
of half-written files. However this does mean keeping the entirety of
all in-flight blocks in RAM. Hopefully this won't be a problem.

It appears that Dijjer maintains integrity for the store by having a
separate store of hashes for each block. Thus, if it is shut down in the
middle of a block replacement, the hash will not be valid, and the block
will be discarded. Likewise if the index simply hasn't been updated yet.

[4]
Transfers:
Our transfer code is almost identical to Dijjer's.
http://dijjer.org/index.php?page=arch_transfer
It does not assume the file is sent in order, and forwards packets
immediately when it receives them. The overall hash is checked when the
file has been received fully.

Dijjer's code supports sending multiple files to the same peer at once,
and will automatically throttle its bandwidth usage, regardless of the
number of transfers, according to how many packets are dropped. The
algorithm it uses will play nice with TCP.

Below is the design structure Ian and I came up with; something close to
this has now been implemented in Dijjer and is being used by the initial
Freenet 0.7 code:
Data structures:

Partially Received Block
- Represents a block we are transferring. May have 0 or more packets
received successfully.

Receiver:
- Has a Peer and a Partially Received Block
- Reads from the network and writes to the Partially Received Block

Transmitter:
- Has a Peer and a Partially Received Block
- Also uses a Peer Bandwidth Limiter
- Reads from the PRB, and writes to the network the still-wanted block
with the lowest sequence number.

Peer:
- Represents a peer
- Includes a Peer Bandwidth Limiter

PeerBandwidthLimiter:
- Runs the bandwidth adjustment algorithm; increases the speed at which
packets are being sent to a host slightly whenever there is a
successful transmit, and decreases it when there is a packet lost.
- Ties into the global bandwidth limiter, which uses the specified
outputBandwidthLimit to determine how many data packets can be sent in
a second (control packets are ignored for bandwidth limiting).
- Between them they arrange for Transmitters to get a chance to send
their packets.


DataMessage:
- 1024 bytes of data (later implementations may use smaller blocks, but
start with big blocks; we will only need a network protocol reset to
change this :) )
- Bitmap of what blocks the transmitter has previously sent


RetransmitRequestMessage:
- Bitmap of what blocks we want resent, or all the blocks we are
currently waiting for.


SuccessfulTransferMessage:
- Indicates that the receiver has received the entire file


SuccessfulTransferAckMessage:
- Indicates that the transmitter has received the transmission success
message.


TransferStatusProbe:
- Sent by transmitter, requests status from receiver.
- Transmitter expects a Successful Transfer Message or a
Retransmit Request Message in response. If it does not get one in a
certain period of time, it kills the request and frees the resources
used.
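
For illustration, a rough Java sketch of the Partially Received Block (the
packet size and names are illustrative, not the actual Dijjer/Freenet
classes):

class PartiallyReceivedBlock {
    static final int PACKET_SIZE = 1024;
    final byte[][] packets;       // received packet payloads, indexed by sequence number
    int received = 0;

    PartiallyReceivedBlock(int packetCount) {
        packets = new byte[packetCount][];
    }

    /** Called by the Receiver when a packet arrives; returns true if it was new. */
    synchronized boolean addPacket(int seq, byte[] data) {
        if (packets[seq] != null) return false;   // duplicate, ignore
        packets[seq] = data;
        received++;
        notifyAll();                               // wake any Transmitter waiting to forward
        return true;
    }

    /** True once every packet of the block has been received. */
    synchronized boolean isComplete() {
        return received == packets.length;
    }
}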


So to transmit:
We know which blocks the other side still wants (the set of all blocks,
minus those that we have sent, plus those that the other side has asked
for retransmission of)
We send a Data Message containing the still-wanted block with the lowest
index in the file.
We then wait for a relevant message, an opportunity to send another
Data Message, or a timeout.
If we get an opportunity to send another Data Message, we repeat.
If we get a Retransmit Request Message, we mark the packets that the other
side wants us to retransmit as unsent again (within reason; we keep a
counter of the total number of packets we have sent, and if it exceeds a
certain value, we kill the transfer, return an error, close the virtual
connection to that node and remove it from the routing table). Any packets
which we have not yet received ourselves, but which the other side wants,
are ignored; we can't do anything about them. Any which were sent more than
the round trip time ago, and which the other side requests be resent, have
been dropped; notify the bandwidth limiter.
If we get a Successful Transfer Message, we send a
Successful Transfer Ack Message back to the receiver, and terminate the
transmission with a successful outcome.
If we get a timeout, we send the receiver a Transfer Status Probe message.
Unless we have sent 10 consecutive TSPs with no reply, in which case we
kill the transfer unsuccessfully.


To receive:
We wait for a message or a timeout (4 times the expected packet gap).
If we get a timeout or a Transfer Status Probe message, we send a
Retransmit Request Message back to the transmitter, listing all blocks we
still need, whether or not we know the transmitter has sent them.
If we get the long timeout (10 consecutive short timeouts with no
received data packets? 20 seconds?), we timeout the receive; it has
failed.
If we get a Data Message, we add the data to our Partially Received Block.
We then check the sent blocks bitmap. If there are any blocks in the
bitmap that we have not yet received, which we have not complained about
in the last 2000ms, we send a Retransmit Request Message back to the
transmitter requesting that they be resent.

If we get the last Data Message and complete the transfer, we have
succeeded, and we send the transmitter a Successful Transfer Message.
We wait for a timeout (longish), a Transfer Status Probe message, or a
Successful Transfer Ack Message.
If we get a Successful Transfer Ack Message, we exit with clean success.
If we get a Transfer Status Probe message, we send the
Successful Transfer Message again.
If we get the timeout, we exit with annoying success.


Ian suggested that the receiver should track the round trip time, the
time between sending a drop notification and receiving the first of the
dropped packets. We could do this, but it would be problematic if the
same packet is dropped twice. This is unlikely but will happen from time
to time. I suggest including a serial number with the
Retransmit Request Message, which is then sent back on any Data Message
caused by it. Otherwise, just use a straight timeout in milliseconds.

[5]
Padding strategy:
Configurable fixed UDP packet size in the freenet.conf to prevent size
based traffic analysis.
At a later point, we may add support for reading in a captured size
profile to mimic another UDP application.
The fixed packet size MUST be greater than the Data Packet length!

Initially there will be no padding. We will have two kinds of messages:
Data Messages, and control messages, i.e. everything else. Data Messages
will be of a fixed size, and be bandwidth limited. Control messages will
be of different sizes depending on their type. Note that this is highly
vulnerable; we should at least fix the size of the control messages very
soon after the initial prototype.

Later on, we may implement padding, queueing and aggregating, which IMHO
go together. There are a number of other ideas that have been discussed on freenet-tech such as semi-constant-bitrate links, for the purpose of making traffic analysis harder.

[6]
Transport:
- Use Dijjer UDP
- Very compact messages.
- Need to add message coalescing, and padding.
- Crypto can be implemented as a filter.
- Can add a TCP based transport later (with cloaking hopefully, see
TODO), if it is needed.

- We intend to use the UDP simultaneous connect trick to open
connections from NATted nodes to NATted nodes. This can work fine when
opening a new connection from a Store Data. HOWEVER, what happens when
you have to re-open that connection? We have to use a third party
again! So presumably, we contact some non-NATted nodes on startup, and
ask them to connect us to some nodes... What is their incentive? And
what happens if they don't have a connection to those nodes already?
It seems unreasonable to expect them to open a connection to those
nodes just to connect us to them... or is it?

Possibility:

 - NATted node A connects to non-NATted node B. Says "please connect me 
   to NATted node X, here's his node reference".
 - A starts sending handshake packets to X
 - B sends a packet to X, saying "please connect to A"
 - X starts sending handshake packets to A
 - They connect
What's in it for B? He can connect to X as well, if he wants another
connection... that's probably enough.

Just a couple of comments: if B is not connected to X, then it shouldn't
have any better luck connecting to X than A itself, should it?

Also it seems you would need to watch out that an attacker can't forge a
bunch of "connect me" requests for X's to B on A's behalf and use this
mechanism to create a DDoS against A. Although I suppose as long as a full
connection has to be established with B before making the "connect me"
request, it probably rules that out.


Ouch. Grrr! :( Correct, the only way for B to tell X to connect to A is
if it's already connected, or if there is a bug in the NAT that allows
the simpler, less reliable UDP trick... hmm. Need to rethink.

Okay, here's the situation:

There are two main ways of hopping firewalls with UDP.

The first works on the same principle as TCP simultaneous connect, but
it always works, unlike with TCP. NATted nodes A, B, are connected to
node C. A wants to connect to B, so it asks C to tell B to connect to A;
both send handshakes out at once, and it works.

The second relies on a bug in some NATs and firewalls: the NAT forwards
all traffic arriving on port P, from any node, to the computer which sent
packets out on UDP port P.

We decided to use the first one... it works very well for path folding.
Unfortunately it does not work very well for much else: if A is
connected to C, and C is NOT connected to B, how does A connect to B?

So, here are our options:
1. Rely on the second, unreliable, UDP behaviour/bug. Is this what the
other UDP P2Ps do? Or do they have a backchannel?
2. Accept that NATted nodes will not be able to reconnect to NATted
nodes immediately, and will lose a lot of their routing table on
startup. This applies for non-NATted nodes that want to connect to
NATted nodes too..!
3. Use an alternative back-channel to tell B to connect to A.
This would have to involve either finding a node that is connected to B,
which we can connect to, or routing a message to B to tell it to connect
to A. The former probably has security issues; the latter could be
accomplished by some form of channels over freenet i.e. message stream
rendezvous at a key. Inserting the stream hook would take some time;
inserting a message to send to the node could be relatively quick.
Sending a series of messages could be really fast - perhaps even usable
for IRC. It's something we have considered implementing after 1.0 for
some time; maybe it should be implemented before 1.0. But it will not be
anything like as fast as a straight UDP connection. Arguably it doesn't
need to be though; path folding should be fast, but restarting doesn't
absolutely need to (although we'd obviously prefer if it was). Another
way to accomplish it would be via an alternative transport.. the only
thing I can think of right now is email though and I can't see that
working reliably out of the box...

That's a pretty tall order; it takes a long time (up to a week on some
nodes... others report more like a few days though) to integrate on the
current network. Having said that, the average blocksize is huge on the
current network - 400kB+? - so the increased number of requests alone
will help bootstrapping a lot.

Here's a more serious concern with instant bootstrapping:
- A lot of people will continue to have big datastores
- If we essentially ditch most of the routing table on startup, then the
routing table ceases to be of very much use too... offline routing
table (already implemented) could help a bit here.. The datastore is
of FAR less use if we lose the routing table (on both sides!) We may
get some of our old connections back through path folding, if we set
it up so that that works when we have unconnectable nodes in the RT
that can be introduced via path folding. In fact, on a large network,
you might make the argument that if you can't keep the routing table
you might as well keep the store encrypted using a transient key, as
it's going to be of very little use to you after startup. At least
from a network standpoint. Now, the old content CAN be used... but it
may just serve to dilute the specialization of the node...

This is why I sent you those IRC messages 2 minutes after you posted to
tech@, toad ;)

But yeah, I2P will solve this issue by including, within the node's
reference, some peers the node can be introduced through (e.g. instead of
"I'm @ dev.i2p.net:4109" it's "I'm @ dev.i2p.net:4109, introduced
through foo:1234, bar:4321, or baz:2345", where foo, bar, and baz
aren't NATed). We can pull that off by using our netDb. </handwaving>

The parallels for Freenet would be using ARKs to share the info of the
introduction points, as Todd suggested - if Freenet can get ARKs
working reasonably quickly and without fear of reinjection attacks,
Freenet can do the same.

I suppose so.. but ARKs never worked when we did have them. Mostly
because the network sucked. And it does mean giving away several of the
node references we are connected to. But that's probably the most
realistic option at the moment...

I suppose this is something to look into after we have the initial alpha
working. In the meantime, we want to be able to do connection via path
folding. Combined with a big offline routing table hopefully restarting
nodes will be able to get back onto the network and recapture many of
their old references...

[7]
Possible state machine; this is very tentative however and may need to
be stripped down significantly.


- Max HTL is 11
- Requests leave the node with HTL 11
- There is a 10% chance of decrementing from 11 to 10. This is
determined once per link to make correlation attacks harder.
- There is a 100% chance of decrementing HTL from 10 to 9, from 9 to 8,
and so on down to 1. This ensures we always have at least 10 hops, so
we don't skew the estimators too much.
- There is a 20% chance of terminating the request if HTL is 1. This is
to prevent some attacks involving inserting different content on
different nodes, and to make it harder to probe a node's datastore.
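
For illustration, a rough Java sketch of the HTL policy above (the
probabilities are as described; the class itself and its names are made
up):

import java.util.Random;

class HTLPolicy {
    static final int MAX_HTL = 11;
    static final double DECREMENT_AT_MAX = 0.1;   // 10% chance of decrementing 11 -> 10
    static final double TERMINATE_AT_MIN = 0.2;   // 20% chance of terminating at HTL 1

    final Random random = new Random();
    // Decided once when the link is set up, to make correlation attacks harder.
    final boolean decrementAtMax = random.nextDouble() < DECREMENT_AT_MAX;

    /** Returns the HTL to forward with, or -1 if the request should terminate here. */
    int nextHTL(int htl) {
        if (htl == MAX_HTL) return decrementAtMax ? htl - 1 : htl;
        if (htl == 1) return random.nextDouble() < TERMINATE_AT_MIN ? -1 : 1;
        return htl - 1;   // 100% chance of decrementing between 10 and 2
    }
}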

Full request coalescing - we have a hashtable listing current requests
by key, and coalesce new requests with old where possible.

Request:
- More or less existing algorithm for choosing a node: NGRouting
(slightly simplified because of fixed size files), and queueing based on MRI.
-- Can fail to find a node; this is an RNF (some form of Query Rejected if
this is a remotely originated request)
- We send them a Data Request.
-- Contents of a DataRequest: the key. Anything else needed?
- We wait for an early timeout, an Accepted, or an Early Query Rejected.
- If we get an EarlyQueryRejected:
-- Record it on their estimator
-- Restart i.e. run randomFail() again, and send to the next choice.
- If we get a timeout, treat similarly to Query Rejected
- If we get an Accepted, wait for the long timeout, a Late Query Rejected,
a Data Not Found, or a Data Reply.
-- If we get a Data Not Found, terminate the request, update the estimators.

Data Not Found means "We died because randomFail() said let's terminate the
request here".

-- If we get a Late Query Rejected, run randomFail() again, and send the
request to the next choice.

Late Query Rejected means "We died because we couldn't find any more nodes
to send the request to".

-- If we get a Store Data, queue it for after the Data Reply, and continue
to wait.
-- If we get a Data Reply, start to receive the data (and write it to the
store, from which it will be read by anyone downstream). Open the
datastore handle here, and close it when no longer using it.
- If the data transfer completes successfully but fails to verify, then
punish the node and fail the request.
- If the data transfer times out, then punish the node and fail the
request (should we retry? IIRC estimators get a bit tricky if we do...
also we're likely to hit the overall timeout)
- If the data transfer completes successfully and verifies, the request
is a success. Send the data to the user, if necessary, and wait for
the Store Data.
- If and when we get the Store Data, decide whether we want to connect to
the node.
- If we do, send a Connect Request back along the chain. This includes
our node reference, encrypted with the datasource's public key. It
also includes the datasource's identity.
- This is passed along the chain until it gets back to the datasource
(which is included on the message).
- When the datasource gets it, it sends a Connect Response back to us,
back along the chain in the opposite direction.
- Now both sides know that they are supposed to be opening a connection.
With the UDP transport, they both start sending UDP handshaking
packets to one another. This will open tunnels both sides and allow
them to connect even if both are NATted.
- With other transports, different mechanisms might be used at this point.



Inserts:
- Inserts cannot produce connections via Store Data. One of the few
weaknesses from http://www.sims.berkeley.edu/%7Erachna/courses/cs261/paper.html that has not yet been addressed is that nodes can get themselves connections merely by doing inserts.
- Send an Insert Request, routing slightly abnormally:
-- Estimate = expected time to fetch from that node + expected time to
insert on that node + expected time to fetch from a random node having inserted on that node (the latter two are updated by the insert).
- Wait for the Accepted (or a timeout, in which case randomFail() and
restart)
- Wait for the Insert Reply (BEFORE we send the data - tradeoff is
slightly longer insert time versus less sending data to nodes that
are going to reject it anyway), or a fatal timeout.
- Send the Data Insert.
- After we have successfully sent all the data (if unsuccessful,
randomFail() and restart), wait up to the transfer timeout for the
Insert Complete.
- If get the Insert Complete, the insert has finished.
- Report times to the insert estimators.
- Now verify the insert (purely for our own accounting, but all nodes
involved in an insert do it):
-- Send a Data Request, with the randomStart flag enabled, for the data
we just inserted.
-- This will route the Data Request randomly (but within the top 50% of
the routing table) for a few hops (50% chance of ceasing to be random)
-- Don't report the request's success or failure to the node's request
estimators. Instead report it to the insert node's fetch estimator.

[8]
Routing (tentative):
Based on current code.
Hash the key before routing it to make chosen key attacks harder.
Implement per node failure table at some point to prevent various single
key censorship attacks and make it easier to find old/mis-inserted data.
Current things that need estimators or RAs:

Early Query Rejected (now an actual message, yay!) - probability and time
- running average, assume key independent
Early Timeout - probability and time - running average, assume key
independent
Accepted - time to get it, affects both success and failure. Any later
failure mode that arrives before the actual Accepted => we assume the
Accepted happened when that arrived. - running average, assume key
independent
Late Query Rejected (aka RNF) - probability and time (after Accepted) -
key dependent :|
Late Timeout - probability and time (after Accepted) - key dependent :|
Data Not Found - probability and time (after Accepted) - definitely key
dependent!
Success - time - definitely key dependent!
Verify Failure - probability and time - running average, assume key
independent (either a really rare bug or malicious, either way probably
not keyspace dependent)
Transfer Timeout - probability and time - estimators

In total then, for a request, the Standard Node Estimator (-> Request Estimator) has:
Running Averages:
- Early Query Rejected p+t
- Early Timeout p+t
- Accepted t
- Verify Failure p+t
Estimators:
- Late Query Rejected p+t
- Late Timeout p+t
- Data Not Found p+t
- Transfer Timeout p+t
- Success t

and for inserts:
- A full Request Estimator for verification requests.
- Early Timeout, Early Query Rejected p+t (Running Averages)
- Accepted t (Running Average)
- Late Timeout p+t (estimators)
- Late Query Rejected p+t (Running Average) - if needed i.e. if we support
Query Rejected at that stage of an insert, which I'm not sure we should.
- Transfer Failure p+t (Running Average - only sending to next node)
- Transfer Complete t (estimator)

One important point I missed:
We know LRU works from theory. We should use LRU for deciding which
connections to drop. If we can keep a time of last disconnection for
each node, then we can use LRU even for deciding which nodes we want to
try to connect to.

Hash the key before routing it to make chosen key attacks harder.
Definitely a good thing, although minor. Cannot be implemented without a
network reset, obviously.


[9]
Ian has come up with yet another load balancing algorithm. We propose to
use this one for Freenet 0.7. The difference with this one is that it is
closely based on those used by TCP/IP. So hopefully it should work.

Recently I have come to question Ian's algorithm... I'm not sure that it
is going to work safely, although I think it will probably work. So we
may be using MRIs after all. There should be a recent-ish post on this.

Apologies to anyone who has suggested pieces of this in the past. I'm
not sure that we ever had the whole idea..

Node N has clients A, B, and C. It has a queue of client requests, which
contains any and all requests submitted by the clients. It has the
following variables:
Window Size - the target number of in-flight client-originated requests
Round Trip Time - the average time taken for a request to complete (should
this exclude the actual transfer, if there is one? But even if it does,
it'll take longer for a DNF than for a successful search... so let's call
it the average time for a request to complete).

Every (Round Trip Time / Window Size), we can send a client request. If we
don't have any to send, we don't send any (and we wait for one to come
in, and send it immediately). If we do have some to send, we send one,
then wait until we can send another one and so on. We may want to wait
for the number of requests in flight to be below Window Size before
sending another request if it changes rapidly...

A request goes from the queue out to node D.

Node D queues its incoming requests for a maximum of X milliseconds.
2000ms seems reasonable.

If a request is queued on D's incoming requests queue for more than X
millis, it is dropped. It sends a Queued Query Timeout message back to the
source node. If the source node has simply been passing this request on,
then it will pass the Queued Query Timeout on - it will NOT try another
node.

When N gets a Queued Query Timeout, it reduces the window size. When a
query completes, it updates the Round Trip Time and increases the window
size. This is according to a formula implemented in Dijjer for
transfers which is compatible with TCP (multiplicative decrease, additive
increase).
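
For illustration, a rough Java sketch of the window adjustment (the
constants and the RTT smoothing are placeholders; the actual formula would
be the one from Dijjer):

class RequestWindow {
    double windowSize = 1.0;          // target number of in-flight client-originated requests
    double roundTripTimeMs = 5000.0;  // smoothed time for a request to complete (placeholder)

    // Queued Query Timeout: multiplicative decrease.
    synchronized void onQueuedQueryTimeout() {
        windowSize = Math.max(1.0, windowSize / 2.0);
    }

    // Query completed: update the Round Trip Time and additively increase the window.
    synchronized void onQueryCompleted(long tookMs) {
        roundTripTimeMs = 0.9 * roundTripTimeMs + 0.1 * tookMs; // simple smoothing, placeholder
        windowSize += 1.0 / windowSize;                         // roughly +1 per window's worth
    }

    // Interval between sending client requests: Round Trip Time / Window Size.
    synchronized double sendIntervalMs() {
        return roundTripTimeMs / windowSize;
    }
}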

D removes requests from the queue and routes them every
Min Interval Between Accepted Requests millis (at most; it can probably do
this on a single continuation, in which case if the interval is too low it
will simply wait to send the first request before sending the next one -
the interval is to keep things like thread usage and bandwidth usage under
control). Once it has removed the request from the queue, it sends an
Accepted message back to the immediately previous node to kill the
no-accepted timeout. It then sends it on to the top node for that request,
according to routing. NGRouting keeps a simple running average for each
node of the probability of a Queued Query Timeout, and adjusts routing
accordingly. Obviously nodes can send a Looped Query Rejected if the query
is looped; this would happen immediately on receiving the request, not
after queueing it, to save time.

D calculates Min Interval Between Accepted Requests in a similar manner to
the current global load limiter; it takes the current average interval
between accepted requests, and multiplies it by the system load. Thus it
should balance out in the long run to accept exactly as many requests as
it can.


This algorithm works well for TCP, produces a typical 2% packet loss in
practice (except on noisy links, obviously), excellent latency, and
short queues. If we adapt it for Freenet, we should get something that
works and does not normally require a long queueing delay on each hop.

Note on request reliability:

Freenet 0.7 requests are likely to be moderately unreliable, in that
some of them will be cut short because of the 5% chance of dying after 1
hop (etc), and some will be casualties of Queued Query Timeout. However
they are also likely to be rather fast. Most requests will be for
splitfiles, whether transparent or otherwise, which will be
automatically retried. Those which are not can and should be retried,
without necessarily telling the user. For example in fproxy, when
fetching single files (e.g. the DBR manifest, or the redirect which
points to the splitfile that makes up the actual site).

Note on timeouts:

With the above scheme, we get a worst case 2000ms queueing on each node.
However this should not happen often. If it did happen often, then a LOT
of packets would get Queued Query Timeout, and this is what the whole
algorithm is designed to prevent. So we may not want to assume this
worst case scenario on each node when calculating timeouts. If we assume
an average 200ms transit time, and 500ms routing and queueing time,
which is probably moderately generous, then 57 hops is 40 seconds... We
may have to tweak that a bit...

Note on bandwidth limiting:

At present, we use two different kinds of bandwidth limiting. It is
possible to precisely limit the bandwidth used by data blocks with the
new architecture, very easily. So it would be preferable not to have to
allow extra slack on this low level limit, while still taking bandwidth
into account in determining how many requests to accept. So we don't
want the number of blocks ready to send to increase indefinitely. We
could set an arbitrary limit on this.. But what we want is for our
transmit rate to be equal to our rate of getting new blocks to send.

So:
Our transmit rate is calculated from our bandwidth limit. We can send
10240 bytes in a second.

A typical request will cause us to send 34000 bytes.

Therefore we should not accept requests more often than every 3.32 seconds.
We simply incorporate this into the MinIntervalBetweenAcceptedRequests:

Min Interval Between Accepted Requests =
Max(
  (Mean bytes sent for a request) / (Out bandwidth limit),
  (Mean bytes received for a request) / (In bandwidth limit),
  (Mean time between accepted requests) / (Thread load),
  ....
)

So we will want to rewrite that code as well rather than simply using
the global quota code already written. But it can be closely based on
the existing code.

Actually, could we get rid of queueing altogether (on input)?

If the last N requests have come in in X millis, and we have a target of
M millis between requests, then:
- The last N requests should have come in in M*N millis
- If M*N <= X, then we let everything through, for now
- If M*N > X, then we drop (M*N - X)/(M*N) of all packets. I.e. on each
  packet we do if(randomFloat() < ((M*N - X)/(M*N))) drop_packet().

This would solve the remaining problems with queue times and timeouts.
All that requires calibration is the value of N, which is essentially a
sensitivity number.
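
For illustration, a rough Java sketch of the drop decision (N and M are the
tuning parameters above; everything else is made up):

import java.util.ArrayDeque;
import java.util.Random;

class InputThrottle {
    final int n;                 // number of recent requests to look at (the sensitivity number)
    final double targetMs;       // M: target interval between accepted requests
    final ArrayDeque<Long> recent = new ArrayDeque<>();
    final Random random = new Random();

    InputThrottle(int n, double targetMs) { this.n = n; this.targetMs = targetMs; }

    /** Returns true if the incoming request should be dropped. */
    synchronized boolean shouldDrop(long nowMs) {
        recent.addLast(nowMs);
        if (recent.size() > n) recent.removeFirst();
        if (recent.size() < n) return false;          // not enough history yet
        double x = nowMs - recent.peekFirst();        // X: time the last N requests took
        double mn = targetMs * n;                     // M*N: time they should have taken
        if (mn <= x) return false;                    // arriving no faster than the target
        double dropFraction = (mn - x) / mn;          // fraction of packets to drop
        return random.nextDouble() < dropFraction;
    }
}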

[10]
It will be necessary to have some sort of interface early on in the
process of writing freenet 0.7. So, here are my thoughts on FCPv2:

I'm half inclined to ditch the session identifier. I'll certainly be
getting rid of it from FNP. Arguably it is useful for the client
protocol though. But we will NOT support classic FCP.

Messages will remain in a text based format, similar to HTTP. Hence:

"Client Get
Remove Local Key=false
Identifier=Toad's Request Number One
URI=SSK@rBjVda8pC-Kq04jUurIAb8IzAGcPAgM/TFE//
End Message

" (note the double newline at the end)

We will probably still use Field Set for this.
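
For illustration, building and serializing such a message might look
something like this in Java (a sketch only; the class names are made up and
the URI is the example above):

import java.util.LinkedHashMap;
import java.util.Map;

class FCPMessage {
    final String name;
    final Map<String, String> fields = new LinkedHashMap<>();

    FCPMessage(String name) { this.name = name; }

    FCPMessage put(String key, String value) { fields.put(key, value); return this; }

    /** Renders the message: name, Key=Value lines, End Message, blank line. */
    String serialize() {
        StringBuilder sb = new StringBuilder(name).append('\n');
        for (Map.Entry<String, String> e : fields.entrySet())
            sb.append(e.getKey()).append('=').append(e.getValue()).append('\n');
        return sb.append("End Message\n\n").toString();
    }

    public static void main(String[] args) {
        System.out.print(new FCPMessage("Client Get")
            .put("Remove Local Key", "false")
            .put("Identifier", "Toad's Request Number One")
            .put("URI", "SSK@rBjVda8pC-Kq04jUurIAb8IzAGcPAgM/TFE//")
            .serialize());
    }
}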

The obvious differences to the current code:
- No HTL field. HTL is deprecated and no longer supported.
- Included an Identifier field. This contains an arbitrary string used
to identify this request, since FCPv2 supports multiplexing.
- Remove Local Key is still supported for requests. It is (even in the
current code) meaningless for inserts.

Any messages related to the request will include the Identifier field.
We will have an extra error message to complain about duplicated
identifiers, of course.

So:

When we actually send the message out (they are queued):

Request Sent
Identifier=Toad's Request Number One
End Message

If it drops off the end of the queue:

Request Dropped
Identifier=Toad's Request Number One
End Message

DNF:

Data Not Found
Identifier=Toad's Request Number One
End Message

RNF:

Most of the fields in the RNF message no longer make sense. If any do
we'll add them.

Route Not Found
Identifier=Toad's Request Number One
End Message

Found:

We only send the Data Received when all the data has been transferred. We
then send the whole of the received file, after it has been verified.

Data Received
Identifier=Toad's Request Number One
Metadata Length=32
Data Length=2578
Metadata Compressed=false
Data Compressed=false
Data
[ all the metadata ]
[ all the data ]

The client is required to do the decompression, as this is a low level
protocol (it is clearer why in the insert case).

Verify failed:

Verify Failed
Identifier=Toad's Request Number One
End Message

(if the verify failed, the request failed)

Transfer failed:

Transfer Failed
Identifier=Toad's Request Number One
End Message


The FEC API probably will remain unchanged, at least initially.

Unusual features such as split metadata will simply use metadata
changes. Something like this:

"Version
Revision=2
End Part
Include
Split File.Check Block.1=<chk>
Split File.Check Block.2=<chk>
Split File.Check Block.3=<chk>
...
Split File.Block.1=<chk>
...
Include.Compressed=gzip
End"

Client Put would similarly be adapted:

Client Put
URI=CHK@
Identifier=Insert#1
Data Length=4589
Metadata Length=32
Data Compressed=true
Metadata Compressed=false
Data
[ metadata ]
[ data ]

Data Compressed and Metadata Compressed are simply flags. The data is not
compressed by the node. The reason for this is that if it did, the
client would not know whether or not the data will fit into the 32kB
limit. So we get the client to do this.

Pending
Identifier=Insert#1
Timeout=32000
URI=freenet:CHK@blahblah...
[Public Key=...]
[Private Key=...]
End Message

This means that the insert has been routed to the end of the chain and
we have the Insert Reply.

Success
Identifier=Insert#1
URI=CHK@blahblah...
[Public Key=...]
[Private Key=...]
End Message

Meaning that we transferred the data successfully and got the Store Data.

One more important message:

Queue Info
Queue Length=43
Queued This Client=13
Window=13
Round Trip Time=6789
In Flight=12
Recommended Interval=14000
End Message

This means:
- This client has 13 requests currently queued
- There are a total of 43 requests currently queued (FCP is intended for
use locally and is not designed with security in mind)
- It takes 6789ms on average to complete a request
- We aim for 13 in flight requests
- We currently have 12 in flight requests
- We recommend this client sends requests no more often than once every
14 seconds. (this field may be removed or replaced; like the rest of
the spec, it's not finalized)

See my mail regarding the new rate limiting replacement for details on
what this means. This message is sent from time to time by the node to
the clients.

[11]
Here's a possible further simplification, perhaps for 0.7:

Use a _block_ size of 1kB!

Abolish the transfer code completely, and route 1kB packets.
Don't include either the estimators or the full node reference on the
Store Data; simply include a hash of both, and if you want to connect,
fetch it down the request chain (it should be kept around for a few
minutes; there are some obvious hacks that we can do to make this not
use much RAM).

Then we can simplify the estimators even more:
- No transfer failed
- No verify failed even - CHKs are simply a 20 byte hash and a 1024 byte
data block. SSKs are more complex of course and contain a smaller
amount of payload, say 512 bytes, for an identical overall size.
- Our estimators are now:
-- Running average for probability and time for a Queued Query Timeout
-- Running average for probability and time for a Looped Query Rejected
-- Estimators for probability and time for Data Not Found
-- Estimator for success time
-- Estimator (or running average?) for probability and time for RNFQueryRejected

Okay, is this a case of slitting one's own throat with Occam's razor?
(Einstein's phrase). The main problem would probably be that we wouldn't
be able to route fast enough, right? Should I simulate that?

This was inspired by Thelema and by Nathan Johnson.

The hosts involved would know that they were handling splitfiles; this
is the main disadvantage. But unless they already knew the splitfile, in
which case they could have done correlation attacks, they won't be able
to tell much more about it. They can of course do more sophisticated
correlation attacks based on key closeness; nothing can prevent this.
One difficulty would be exactly how to implement a single HTL for a
whole bunch of requests.

Proposal:

A requestor starts a splitfile download. The file has 2,600 blocks. It
sends some of the block requests, eventually 260 of them, to node A. But
first, it sends a Bulk Setup message:

Bulk Setup
ID=34567890a (random message ID, does not identify a full chain but is
             regenerated at each hop)
HTL=20

Then it sends the actual requests, which have an extra field, Bulk ID,
which is set to the above ID.

A sees the Bulk Setup, and sets up a bulk request structure for that ID,
including the HTL.

Those requests A cannot answer from the store, it routes. It sends a
different Bulk Setup ID to each node it routes requests to, possibly with
a separately processed HTL i.e. it may decrement it for some nodes but
not for others; that is a point that needs to be debated. And it routes
the requests. It won't send a Bulk Setup until it has a request to send,
but if we get a request with a Bulk ID we haven't seen, we send a
No Bulk Setup message, indicating the requestor needs to retransmit the
Bulk Setup, and also the request, whose chain ID will be specified.

Implications are discussed below; basically, it's a tradeoff:
PRO: Correlation attacks from nodes who know the splitfile are a lot
    more difficult.
CON: Any node that gets a lot of blocks knows that it is being asked to
    return a splitfile, even if it doesn't know which splitfile.

One possibility to mitigate the CON would be to automatically aggregate
requests into bulk-blocks, even if they are unrelated... This needs
further thought.


Simply keep the splitfile together when you request it. A node receives
a batch of requests for the splitfile, it knows they are all part of one
metarequest, and they have a single HTL. Have a 66% chance of not
decrementing the group HTL for HTL=20 on each hop, so on average it goes
3 hops before reducing from 20. Possibly introduce a significant (33%?)
chance of not decrementing for 19, 18, 17, and 16 as well, in order to
reduce the harmful interactions between perturbing HTL and the maximum.

There was a different solution to this problem a bit later on, which was
implemented. Essentially we have to make the decision of whether to
decrement the HTL once for every link, rather than on each request. This
reduces the information given away by HTL a lot.


[12]
All CHKs, and other keys, are padded. We can't use 0's, for
cryptographic safety reasons, and we can't use random data, because we
don't want different hashes for the same content.

Currently the algorithm is something like:
- Create a hash
- Feed all the actual data into the hash
- Write the current hash into a buffer
- Feed the buffer into the hash
- Write the buffer to the data stream, chopping bytes off if needed.

This is annoying because it is the ONLY piece of code in the whole of
Freenet that makes us require our own slow Java SHA-1 implementation.
Can we get rid of it? I am told that SHA-1 has zero setup, hence all we
need to do is:

X = H(data)
X = H(X)
Write X, breaking if we reach EOF, chopping off if needed
Repeat forever (until write breaks out of the loop)

Since SHA-1 has no setup, I don't see why this would be any less safe,
or much slower. Anyone disagree? Any old info on why we chose this
particular padding algorithm? Was there some good reason for it or was
it just convenient to code it that way?
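
For clarity, the proposed padding loop would look something like this in
Java (a sketch only, using SHA-1 as above; not the actual implementation):

import java.io.ByteArrayOutputStream;
import java.security.MessageDigest;

class HashChainPadding {
    /** Returns padding bytes derived from the data, enough to reach padLength. */
    static byte[] pad(byte[] data, int padLength) throws Exception {
        MessageDigest sha1 = MessageDigest.getInstance("SHA-1");
        byte[] x = sha1.digest(data);                 // X = H(data)
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        while (out.size() < padLength) {
            x = sha1.digest(x);                       // X = H(X)
            int n = Math.min(x.length, padLength - out.size());
            out.write(x, 0, n);                       // write X, chopping off if needed
        }
        return out.toByteArray();
    }
}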

New CHK spec:

1. CHK URIs

CHK@<routing key>,<decryption key>,<extra>[/<human readable filename>]

<routing key> is the same as it is now.
<decryption key> is 256 bits (currently 128).
<extra> is:
16 bits - cipher ID. 0 means AES/256 (with CBC).
1 bit - is the data compressed?
1 bit - is the document a control document.
All of the above are of course base64 encoded (so extra is exactly 3
characters after encoding).


2. CHK content:

The routing key is the hash of the document header and the encrypted
data. These are part of the same encryption cycle - we use the IV to
initialize the cipher, simply by decrypting it, and then we decode the
data length, then we decode the data.
The document header is:
IV (encrypted) - 32 bytes
Data length (encrypted) - 2 bytes

The document header would normally be exchanged with the Data Reply, on
the message, rather than being folded into the 32kB, so we can have a
nice clean PO2 block size, which is probably a good thing.

Now, we can achieve the functionality of the old storables using only
the previous system:
- Part-size is unnecessary (no parts, whole file hash)
- Initial-digest is unnecessary (no Storables)
- Symmetric-cipher is unnecessary (its in the URL)
- Document-header is unnecessary (fixed size document header)
-- The document header portion would normally include the hash of the
decryption key, the data length and the metadata length. But it is a
flexible structure (unnecessarily so IMHO... it is a bad thing to let
attackers add arbitrary numbers of bytes to keys).
- The data length we have
- The metadata length we don't need because we don't have metadata
- The IV is the hash of the encryption key. We can simply check this.

Note that the IV is encrypted, so that we still have some security
should reversing SHA-256 become feasible. Of course we lose this for
SSKs.


3. Details and padding

Original data length: N bytes. N <= 32768.

If N = 32768, we don't need to pad. If N < 32768, we need to pad it:

Let H1 be the hash of the original data.
Then we set up a Mersenne Twister, seeded from H1, and use it to
generate random bytes to the end of the 32kB.

Having padded if necessary, we take H2, the SHA-256 hash of the whole
plaintext including padding. H2 becomes our encryption key. The IV is
H3, the SHA-256 hash of H2. We then encrypt the IV, the data size and
the plaintext data including padding, in that order, in CBC mode.
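
For illustration, a rough Java sketch of the whole encode path (the Prng
and Cipher256 interfaces are placeholders for the Mersenne Twister and the
256-bit Rijndael/AES in CBC mode; this is not Freenet's actual API):

import java.security.MessageDigest;
import java.util.Arrays;

class CHKEncoder {
    static final int BLOCK = 32 * 1024;

    interface Prng { void seed(byte[] seed); int nextByte(); }
    interface Cipher256 { byte[] encryptCBC(byte[] key, byte[] plaintext); }

    static byte[] sha256(byte[] in) throws Exception {
        return MessageDigest.getInstance("SHA-256").digest(in);
    }

    static byte[] encode(byte[] data, Prng prng, Cipher256 cipher) throws Exception {
        byte[] h1 = sha256(data);                    // H1: hash of the original data
        byte[] padded = Arrays.copyOf(data, BLOCK);  // pad up to 32kB...
        prng.seed(h1);                               // ...with PRNG output seeded from H1
        for (int i = data.length; i < BLOCK; i++) padded[i] = (byte) prng.nextByte();

        byte[] h2 = sha256(padded);                  // H2: the encryption key
        byte[] h3 = sha256(h2);                      // H3: the IV, the hash of the key

        byte[] plaintext = new byte[32 + 2 + BLOCK]; // IV, data length, padded data, in order
        System.arraycopy(h3, 0, plaintext, 0, 32);
        plaintext[32] = (byte) (data.length >> 8);
        plaintext[33] = (byte) data.length;
        System.arraycopy(padded, 0, plaintext, 34, BLOCK);
        return cipher.encryptCBC(h2, plaintext);     // routing key = hash of this ciphertext
    }
}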

Nodes passing the file on or storing it simply need to verify that the
routing key matches the overall hash of all 32+2+32,768 bytes. The
decoding node can check that the IV is in fact (when decrypted) the
hash of the decryption key. If it is not, then the key was invalid when
it was first inserted. We have a similar mechanism now, and providing
such an encrypted hash is probably not a problem as an attacker who can
break AES can probably easily determine what content is interesting even
if we did not include such a checksum.


A. Note on control documents:

We no longer support metadata on data keys. A document is either a data
key or a "control document" i.e. a metadata key. This is slightly less
flexible, but produces more collisions and simpler code, especially for
client writers.

There is a corner case for < 32kB files that are accessed directly as
CHKs, via fproxy, and need a MIME type. Either:
a) You will need to insert a separate control document to specify the
MIME type, OR
b) You will need to force the mime type via the already implemented
?mime= parameter in fproxy.
Hence
CHK@<routing key>,<decrypt key>,<extra>/<filename>?mime=text/plain
This is already implemented and will remain, because it is used by the
filter when it encounters unknown MIME types.

Given the above two options, I don't think this is a serious drawback.


B. Ciphers

Here I use AES with CBC to indicate the current Rijndael algorithm with
a doubled key size for safety. It may be CBC-CTS, or it may be some
other form; I will use what we use now. I do know that it is some form
of CBC.

For padding purposes, we pretend there is a 0x00 at the beginning of the
key when calculating the hash from which to generate the random padding.
This is so that we can support the 0-length CHK.


[13]
For a Data Request that goes smoothly:

Requestor: (A)
We send the Data Request.
We wait for an Accepted, Loop Query Rejected, Data Reply, Data Not Found or
Late Query Rejected (or whatever we call the RNF message).

Middle-node: (B)
We receive the Data Request.
We send the Accepted to A.
We send the Data Request on to node C.
We wait for another Data Request from A, indicating that we need to resend
the Accepted, or from C, an Accepted, Loop Query Rejected, Data Reply,
Data Not Found, or Late Query Rejected.

Requestor(A):
We receive the Accepted.
We wait for Data Reply, Data Not Found, or Late Query Rejected.

Middle-node: (B)
We receive an Accepted from C.
We wait for a Data Request from A, or a Data Reply, Data Not Found, or
Late Query Rejected, from C.
We get a Data Reply.
We send the Data Reply to A.
We send the Data Reply Ack to C.
We start receiving the data on a separate thread.
We wait for a Data Request from A, a Data Reply Ack from A, a resent
Data Reply from C, or a completion notification from the data reception
thread (which may be success or failure).

Requestor (A):
We receive the Data Reply.
We send a Data Reply Ack to B.
We start receiving the data on a separate thread.
We wait for a resent Data Reply from B, or a completion notification from
the data reception thread.

Middle-node (B):
We receive the Data Reply Ack from A.
We start sending the data on another thread.
We now wait for a resent Data Reply from C, or a completion notification
from the data reception thread (which may be success or failure), or a
completion notification from the data sending thread (which may be
success or failure).
...
We get a completion notification from the data reception thread.
We now wait for a completion notification from the data sending thread.
...
We get a completion notification from the data sending thread.
We are finished (well more or less).


Now, is it just me, or is the above actually quite complex, given that
it has to deal with all the possibilities for packet loss and so on?
Maybe I'm just feeling thick at the moment, and it's all perfectly
obvious.

Here's the alternative:
We implement reliable out of order message delivery under the FNP level.
At the end of each message, we include:
- A list of IDs of messages we have sent but not yet received
acknowledgements for.
- A list of IDs of messages we have received and are acknowledging.
If we receive a message, we wait say 100ms, and if we don't send any
other messages to the peer in that time, we send a message just for the
acknowledgements.
If we send a message, we wait say 200ms, and if we don't receive an
acknowledgement in that time, we resend it. Then we repeat. If we still
haven't got an acknowledgement, we silently drop it. So it's not a 100%
reliable transport, but it should be near enough; the rest will be dealt
with by timeouts and NGR.


We provide a send-a-message function to the FNP level which sends the
message but does not wait for an acknowledgement.
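
For illustration, a rough Java sketch of such a layer (the 100ms/200ms
timings are from above; the rest of the wiring is made up):

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class MessageLayer {
    static final long ACK_DELAY_MS = 100;    // send a bare ack packet if nothing else goes out
    static final long RESEND_DELAY_MS = 200; // resend once if unacknowledged, then give up

    static class Pending {
        final long id; final byte[] payload; long sentAt; int sends;
        Pending(long id, byte[] payload) { this.id = id; this.payload = payload; }
    }

    final Map<Long, Pending> unacked = new HashMap<>();  // sent, awaiting acknowledgement
    final List<Long> toAcknowledge = new ArrayList<>();  // received, acknowledgement owed

    /** Called when we send a message: remember it until the peer acknowledges it. */
    synchronized void onMessageSent(long id, byte[] payload, long nowMs) {
        Pending p = new Pending(id, payload);
        p.sentAt = nowMs;
        p.sends = 1;
        unacked.put(id, p);
    }

    /** Called when a message arrives: its ID is piggybacked as an ack on our next packet. */
    synchronized void onMessageReceived(long id) { toAcknowledge.add(id); }

    /** Called when the peer acknowledges one of our messages. */
    synchronized void onAckReceived(long id) { unacked.remove(id); }

    /** Called periodically: resend anything unacknowledged for too long, at most once more. */
    synchronized void retransmit(long nowMs) {
        for (Pending p : new ArrayList<>(unacked.values())) {
            if (nowMs - p.sentAt < RESEND_DELAY_MS) continue;
            if (p.sends >= 2) { unacked.remove(p.id); continue; } // silently drop; NGR copes
            p.sends++;
            p.sentAt = nowMs;
            // ... hand p.payload, plus the current toAcknowledge and unacked ID lists,
            // back to the UDP transport here.
        }
    }
}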

Now, that makes the above:

Requestor (A):
We send the Data Request.
We wait for Accepted, Loop Query Rejected, Data Reply, Data Not Found, or
Late Query Rejected.

Middle-node (B):
We receive the Data Request.
We send an Accepted to A.
We send the Data Request to C.
We wait for Accepted, Loop Query Rejected, Data Reply, Data Not Found, or
Late Query Rejected, from C.

Requestor (A):
We receive the Accepted.
We wait for Data Reply, Data Not Found, or Late Query Rejected.

Middle-node (B):
We get an Accepted from C.
We wait for a Data Reply, Data Not Found, or Late Query Rejected, from C.
...
We get a Data Reply.
We send the Data Reply to A.
We start to receive the data from C.
We start to send the data to A.
We wait for the receive to finish or the send to finish.
...
We get the receive-finished internal message.
We wait for the send to finish.
...
We get the send-finished internal message.
We're done (more or less).


What are the main differences here?
- We don't need to worry about losing the Accepted, losing the
Data Request, losing the Data Reply, or losing the acks.
- We don't need to send or worry about Data Reply Ack, we don't need to
resend Data Reply or worry about it being resent, we don't need to
resend Data Request or worry about it being resent - any Data Request we
receive after we have started is a loop, for example.

Is this a significant benefit? IMHO without this change the code will be
complex, and might actually be simpler written as a state machine. The cost
is this layer itself, and the fact that we would need to rewrite the
transfer code.

Or is it easier just to do it at the FNP level?

[14]
Queueing

IMHO we will still need message queues.

Sending the packet blocking to the OS, as Ian points out, is cheap if
the link is not saturated. However:
a) Sometimes it is.
b) More seriously: We would have to implement bandwidth limiting by
blocking the UDP send. The result of this is that threads would have to
wait for a sleep to complete, to slow down the message sending rate to
the desired bandwidth.

Therefore, Dijjer or not, we need queueing.

So we need message queues for each peer. This is actually quite helpful,
as it makes message aggregation considerably easier.

We can however still use blocking-style code, using continuations. We
might want to try to send the packet immediately and wait for it, if
there is no bandwidth limiting reason to wait. But normally, we would
suspend the continuation and resume it when the message has been sent.

There is then the question of whether to wait for each packet to be sent
before sending the next one in various instances. For the state machine,
I think it's clear that we want strictly blocking sends. However, I
think it might be sensible to queue several packets at once - perhaps all
the packets at once - for a file transfer. Especially if the data packets
are small enough that more than one might fit into a UDP packet.
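
For illustration, a rough Java sketch of a per-peer queue with aggregation
(the packet size constant is made up, and real code would frame each
message):

import java.io.ByteArrayOutputStream;
import java.util.ArrayDeque;

class PeerMessageQueue {
    static final int MAX_PACKET_SIZE = 1400;          // illustrative fixed UDP payload size
    final ArrayDeque<byte[]> queue = new ArrayDeque<>();

    synchronized void enqueue(byte[] message) { queue.add(message); }

    /** Coalesces as many queued messages as fit into one packet, or returns null. */
    synchronized byte[] nextPacket() {
        if (queue.isEmpty()) return null;
        if (queue.peek().length >= MAX_PACKET_SIZE) return queue.poll(); // oversized: send alone
        ByteArrayOutputStream packet = new ByteArrayOutputStream();
        while (!queue.isEmpty() && packet.size() + queue.peek().length <= MAX_PACKET_SIZE) {
            byte[] m = queue.poll();
            packet.write(m, 0, m.length);             // real code would frame each message
        }
        return packet.toByteArray();
    }
}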

[15]
What happens if we don't specify the encoding when we deliver it to the
browser? Is that outside the HTTP spec? If so, we can still do it. Since
UTF-8 and UTF-16 are supported by mandate, and are the only defaults of
XML, they would make a good choice of default for Freenet. XHTML and HTML
do not need http-equiv to specify charsets.

Below this line are excerpts from the XML 1.1 spec:

All XML processors MUST accept the UTF-8 and UTF-16 encodings of Unicode [Unicode]

In the absence of external character encoding information (such as MIME
headers), parsed entities which are stored in an encoding other than UTF-8 or
UTF-16 MUST begin with a text declaration (see 4.3.1 The Text Declaration)
containing an encoding declaration:

Unless an encoding is determined by a higher-level protocol, it is also a fatal
error if an XML entity contains no encoding declaration and its content is not
legal UTF-8 or UTF-16.

Examples of text declarations containing encoding declarations:
<?xml encoding='UTF-8'?>