-*- outline -*-

Reasonably independent newpb sub-tasks that need doing. Most important come
first.

* decide on a version negotiation scheme

Should be able to telnet into a PB server and find out that it is a PB
server. Pointing a PB client at an HTTP server (or an HTTP client at a PB
server) should result in an error, not a timeout. Implement in
banana.Banana.connectionMade().

desiderata:

 negotiation should take place with regular banana sequences: don't invent a
 new protocol that is only used at the start of the connection

 Banana should be useable one-way, for storage or high-latency RPC (the mnet
 folks want to create a method call, serialize it to a string, then encrypt
 and forward it on to other nodes, sometimes storing it in relays along the
 way if a node is offline for a few days). It should be easy for the layer
 above Banana to feed it the results of what its negotiation would have been
 (if it had actually used an interactive connection to its peer). Feeding the
 same results to both sides should have them proceed as if they'd agreed to
 those results.

 negotiation should be flexible enough to be extended but still allow old
 code to talk with new code. Magically predict every conceivable extension
 and provide for it from the very first release :).

There are many levels to banana, all of which could be useful targets of
negotiation:

 which basic tokens are in use? Is there a BOOLEAN token? a NONE token? Can
 it accept a LONGINT token or is the target limited to 32-bit integers?

 are there any variations in the basic Banana protocol being used? Could the
 smaller-scope OPEN-counter decision be deferred until after the first
 release and handled later with a compatibility negotiation flag?

 What "base" OPEN sequences are known? 'unicode'? 'boolean'? 'dict'? This is
 an overlap between expressing the capabilities of the host language, the
 Banana implementation, and the needs of the application. How about
 'instance', probably only used for StorageBanana?

 What "top-level" OPEN sequences are known? PB stuff (like 'call', and
 'your-reference')? Are there any variations or versions that need to be
 known? We may add new functionality in the future, it might be useful for
 one end to know whether this functionality is available or not. (the PB
 'call' sequence could some day take numeric argument names to convey
 positional parameters, a 'reference' sequence could take a string to
 indicate globally-visible PB URLs, it could become possible to pass
 target.remote_foo directly to a peer and have a callable RemoteMethod object
 pop out the other side).

 What "application-level" sequences are available? (Which RemoteInterface
 classes are known and valid in 'call' sequences? Which RemoteCopy names are
 valid for targets of the 'copy' sequence?). This is not necessarily within
 the realm of Banana negotiation, but applications may need to negotiate this
 sort of thing, and any disagreements will be manifested when Banana starts
 raising Violations, so it may be useful to include it in the Banana-level
 negotiation.

On the other hand, negotiation is only useful if one side is prepared to
accomodate a peer which cannot do some of the things it would prefer to use,
or if it wants to know about the incapabilities so it can report a useful
failure rather than have an obscure protocol-level error message pop up an
hour later. So negotiation isn't the only goal: simple capability awareness
is a useful lesser goal.

It kind of makes sense for the first object of a stream to be a negotiation
blob. We could make a new 'version' opentype, and declare that the contents
will be something simple and forever-after-parseable (like a dict, with heavy
constraints on the keys and values, all strings emitted in full).

DONE, at least the framework is in place. Uses HTTP-style header-block
exchange instead of banana sequences, with client-sends-first and
server-decides. This correctly handles PB-vs-HTTP, but requires a timeout to
detect oldpb clients vs newpb servers. No actual feature negotiation is
performed yet, because we still only have the one version of the code.

* connection initiation

** define PB URLs

[newcred is the most important part of this, the URL stuff can wait]

A URL defines an endpoint: a pb.Referenceable, with methods. Somewhere along
the way it defines a transport (tcp+host+port, or unix+path) and an object
reference (pathname). It might also define a RemoteInterface, or that might
be put off until we actually invoke a method.

 URL = f("pb:", host, port, pathname)
 d = pb.callRemoteURL(URL, ifacename, methodname, args)

probably give an actual RemoteInterface instead of just its name

a pb.RemoteReference claims to provide access to zero-or-more
RemoteInterfaces. You may choose which one you want to use when invoking
callRemote.

TODO: decide upon a syntax for URLs that refer to non-TCP transports
 pb+foo://stuff, pby://stuff (for yURL-style self-authenticating names)

TODO: write the URL parser, implementing pb.getRemoteURL and pb.callRemoteURL
 DONE: use a Tub/PBService instead

TODO: decide upon a calling convention for callRemote when specifying which
RemoteInterface is being used.


DONE, PB-URL is the way to go.
** more URLs

relative URLs (those without a host part) refer to objects on the same
Broker. Absolute URLs (those with a host part) refer to objects on other
Brokers.

SKIP, interesting but not really useful

** build/port pb.login: newcred for newpb

Leave cred work for Glyph.

<thomasvs> has some enhanced PB cred stuff (challenge/response, pb.Copyable
credentials, etc).

URL = pb.parseURL("pb://lothar.com:8789/users/warner/services/petmail",
                  IAuthorization)
URL = doFullLogin(URL, "warner", "x8yzzy")
URL.callRemote(methodname, args)

NOTDONE

* constrain ReferenceUnslicer properly

The schema can use a ReferenceConstraint to indicate that the object must be
a RemoteReference, and can also require that the remote object be capable of
handling a particular Interface.

This needs to be implemented. slicer.ReferenceUnslicer must somehow actually
ask the constraint about the incoming tokens.

An outstanding question is "what counts". The general idea is that
RemoteReferences come over the wire as a connection-scoped ID number and an
optional list of Interface names (strings and version numbers). In this case
it is the far end which asserts that its object can implement any given
Interface, and the receiving end just checks to see if the schema-imposed
required Interface is in the list.

This becomes more interesting when applied to local objects, or if a
constraint is created which asserts that its object is *something* (maybe a
RemoteReference, maybe a RemoteCopy) which implements a given Interface. In
this case, the incoming object could be an actual instance, but the class
name must be looked up in the unjellyableRegistry (and the class located, and
the __implements__ list consulted) before any of the object's tokens are
accepted.

* decide upon what the "Shared" constraint should mean

The idea of this one was to avoid some vulnerabilities by rejecting arbitrary
object graphs. Fundamentally Banana can represent most anything (just like
pickle), including objects that refer to each other in exciting loops and
whorls. There are two problems with this: it is hard to enforce a schema that
allows cycles in the object graph (indeed it is tricky to even describe one),
and the shared references could be used to temporarily violate a schema.

I think these might be fixable (the sample case is where one tuple is
referenced in two different places, each with a different constraint, but the
tuple is incomplete until some higher-level node in the graph has become
referenceable, so [maybe] the schema can't be enforced until somewhat after
the object has actually finished arriving).

However, Banana is aimed at two different use-cases. One is kind of a
replacement for pickle, where the goal is to allow arbitrary object graphs to
be serialized but have more control over the process (in particular we still
have an unjellyableRegistry to prevent arbitrary constructors from being
executed during deserialization). In this mode, a larger set of Unslicers are
available (for modules, bound methods, etc), and schemas may still be useful
but are not enforced by default.

PB will use the other mode, where the set of conveyable objects is much
smaller, and security is the primary goal (including putting limits on
resource consumption). Schemas are enforced by default, and all constraints
default to sensible size limits (strings to 1k, lists to [currently] 30
items). Because complex object graphs are not commonly transported across
process boundaries, the default is to not allow any Copyable object to be
referenced multiple times in the same serialization stream. The default is to
reject both cycles and shared references in the object graph, allowing only
strict trees, making life easier (and safer) for the remote methods which are
being given this object tree.

The "Shared" constraint is intended as a way to turn off this default
strictness and allow the object to be referenced multiple times. The
outstanding question is what this should really mean: must it be marked as
such on all places where it could be referenced, what is the scope of the
multiple-reference region (per- method-call, per-connection?), and finally
what should be done when the limit is violated. Currently Unslicers see an
Error object which they can respond to any way they please: the default
containers abandon the rest of their contents and hand an Error to their
parent, the MethodCallUnslicer returns an exception to the caller, etc. With
shared references, the first recipient sees a valid object, while the second
and later recipient sees an error.


* figure out Deferred errors for immutable containers

Somewhat related to the previous one. The now-classic example of an immutable
container which cannot be created right away is the object created by this
sequence:

        t = ([],)
        t[0].append((t,))

This serializes into (with implicit reference numbers on the left):

[0] OPEN(tuple)
[1]  OPEN(list)
[2]   OPEN(tuple)
[3]    OPEN(reference #0)
      CLOSE
     CLOSE
    CLOSE

In newbanana, the second TupleUnslicer cannot return a fully-formed tuple to
its parent (the ListUnslicer), because that tuple cannot be created until the
contents are all referenceable, and that cannot happen until the first
TupleUnslicer has completed. So the second TupleUnslicer returns a Deferred
instead of a tuple, and the ListUnslicer adds a callback which updates the
list's item when the tuple is complete.

The problem here is that of error handling. In general, if an exception is
raised (perhaps a protocol error, perhaps a schema violation) while an
Unslicer is active, that Unslicer is abandoned (all its remaining tokens are
discarded) and the parent gets an Error object. (the parent may give up too..
the basic Unslicers all behave this way, so any exception will cause
everything up to the RootUnslicer to go boom, and the RootUnslicer has the
option of dropping the connection altogether). When the error is noticed, the
Unslicer stack is queried to figure out what path was taken from the root of
the object graph to the site that had an error. This is really useful when
trying to figure out which exact object cause a SchemaViolation: rather than
being told a call trace or a description of the *object* which had a problem,
you get a description of the path to that object (the same series of
dereferences you'd use to print the object: obj.children[12].peer.foo.bar).

When references are allowed, these exceptions could occur after the original
object has been received, when that Deferred fires. There are two problems:
one is that the error path is now misleading, the other is that it might not
have been possible to enforce a schema because the object was incomplete.

The most important thing is to make sure that an exception that occurs while
the Deferred is being fired is caught properly and flunks the object just as
if the problem were caught synchronously. This may involve discarding an
otherwise complete object graph and blaming the problem on a node much closer
to the root than the one which really caused the failure.

* adaptive VOCAB compression

We want to let banana figure out a good set of strings to compress on its
own. In Banana.sendToken, keep a list of the last N strings that had to be
sent in full (i.e. they weren't in the table). If the string being sent
appears more than M times in that table, before we send the token, emit an
ADDVOCAB sequence, add a vocab entry for it, then send a numeric VOCAB token
instead of the string.

Make sure the vocab mapping is not used until the ADDVOCAB sequence has been
queued. Sending it inline should take care of this, but if for some reason we
need to push it on the top-level object queue, we need to make sure the vocab
table is not updated until it gets serialized. Queuing a VocabUpdate object,
which updates the table when it gets serialized, would take care of this. The
advantage of doing it inline is that later strings in the same object graph
would benefit from the mapping. The disadvantage is that the receiving
Unslicers must be prepared to deal with ADDVOCAB sequences at any time (so
really they have to be stripped out). This disadvantage goes away if ADDVOCAB
is a token instead of a sequence.

Reasonable starting values for N and M might be 30 and 3.

* write oldbanana compatibility code?

An oldbanana peer can be detected because the server side sends its dialect
list from connectionMade, and oldbanana lists are sent with OLDLIST tokens
(the explicit-length kind).


* add .describe methods to all Slicers

This involves setting an attribute between each yield call, to indicate what
part is about to be serialized.


* serialize remotely-callable methods?

It might be useful be able to do something like:

 class Watcher(pb.Referenceable):
     def remote_foo(self, args): blah

 w = Watcher()
 ref.callRemote("subscribe", w.remote_foo)

That would involve looking up the method and its parent object, reversing
the remote_*->* transformation, then sending a sequence which contained both
the object's RemoteReference and the appropriate method name.

It might also be useful to generalize this: passing a lambda expression to
the remote end could stash the callable in a local table and send a Callable
Reference to the other side. I can smell a good general-purpose object
classification framework here, but I haven't quite been able to nail it down
exactly.

* testing

** finish testing of LONGINT/LONGNEG

test_banana.InboundByteStream.testConstrainedInt needs implementation

** thoroughly test failure-handling at all points of in/out serialization

places where BananaError or Violation might be raised

sending side:
 Slicer creation (schema pre-validation? no): no no
  pre-validation is done before sending the object, Broker.callFinished,
  RemoteReference.doCall
  slicer creation is done in newSlicerFor

 .slice (called in pushSlicer) ?
 .slice.next raising Violation
 .slice.next returning Deferrable when streaming isn't allowed
 .sendToken (non-primitive token, can't happen)
 .newSlicerFor (no ISlicer adapter)
 top.childAborted

receiving side:
 long header (>64 bytes)
 checkToken (top.openerCheckToken)
 checkToken (top.checkToken)
 typebyte == LIST (oldbanana)
 bad VOCAB key
 too-long vocab key
 bad FLOAT encoding
 top.receiveClose
 top.finish
 top.reportViolation
 oldtop.finish (in from handleViolation)
 top.doOpen
 top.start
plus all of these when discardCount != 0
OPENOPEN

send-side uses:
 f = top.reportViolation(f)
receive-side should use it too (instead of f.raiseException)

** test failure-handing during callRemote argument serialization

** implement/test some streaming Slicers

** test producer Banana

* profiling/optimization

Several areas where I suspect performance issues but am unwilling to fix
them before having proof that there is a problem:

** Banana.produce

This is the main loop which creates outbound tokens. It is called once at
connectionMade() (after version negotiation) and thereafter is fired as the
result of a Deferred whose callback is triggered by a new item being pushed
on the output queue. It runs until the output queue is empty, or the
production process is paused (by a consumer who is full), or streaming is
enabled and one of the Slicers wants to pause.

Each pass through the loop either pushes a single token into the transport,
resulting in a number of short writes. We can do better than this by telling
the transport to buffer the individual writes and calling a flush() method
when we leave the loop. I think Itamar's new cprotocol work provides this
sort of hook, but it would be nice if there were a generalized Transport
interface so that Protocols could promise their transports that they will
use flush() when they've stopped writing for a little while.

Also, I want to be able to move produce() into C code. This means defining a
CSlicer in addition to the cprotocol stuff before. The goal is to be able to
slice a large tree of basic objects (lists, tuples, dicts, strings) without
surfacing into Python code at all, only coming "up for air" when we hit an
object type that we don't recognize as having a CSlicer available.

** Banana.handleData

The receive-tokenization process wants to be moved into C code. It's
definitely on the critical path, but it's ugly because it has to keep
calling into python code to handle each extracted token. Maybe there is a
way to have fast C code peek through the incoming buffers for token
boundaries, then give a list of offsets and lengths to the python code. The
b128 conversion should also happen in C. The data shouldn't be pulled out of
the input buffer until we've decided to accept it (i.e. the
memory-consumption guarantees that the schemas provide do not take any
transport-level buffering into account, and doing cprotocol tokenization
would represent memory that an attacker can make us spend without triggering
a schema violation). Itamar's CLineReceiver is a good example: you tokenize
a big buffer as much as you can, pass the tokens upstairs to Python code,
then hand the leftover tail to the next read() call. The tokenizer always
works on the concatenation of two buffers: the tail of the previous read()
and the complete contents of the current one.

** Unslicer.doOpen delegation

Unslicers form a stack, and each Unslicer gets to exert control over the way
that its descendents are deserialized. Most don't bother, they just delegate
the control methods up to the RootUnslicer. For example, doOpen() takes an
opentype and may return a new Unslicer to handle the new OPEN sequence. Most
of the time, each Unslicer delegates doOpen() to their parent, all the way
up the stack to the RootUnslicer who actually performs the UnslicerRegistry
lookup.

This provides an optimization point. In general, the Unslicer knows ahead of
time whether it cares to be involved in these methods or not (i.e. whether
it wants to pay attention to its children/descendants or not). So instead of
delegating all the time, we could just have a separate Opener stack.
Unslicers that care would be pushed on the Opener stack at the same time
they are pushed on the regular unslicer stack, likewise removed. The
doOpen() method would only be invoked on the top-most Opener, removing a lot
of method calls. (I think the math is something like turning
avg(treedepth)*avg(nodes) into avg(nodes)).

There are some other methods that are delegated in this way. open() is
related to doOpen(). setObject()/getObject() keep track of references to
shared objects and are typically only intercepted by a second-level object
which defines a "serialization scope" (like a single remote method call), as
well as connection-wide references (like pb.Referenceables) tracked by the
PBRootUnslicer. These would also be targets for optimization.

The fundamental reason for this optimization is that most Unslicers don't
care about these methods. There are far more uses of doOpen() (one per
object node) then there are changes to the desired behavior of doOpen().

** CUnslicer

Like CSlicer, the unslicing process wants to be able to be implemented (for
built-in objects) entirely in C. This means a CUnslicer "object" (a struct
full of function pointers), a table accessible from C that maps opentypes to
both CUnslicers and regular python-based Unslicers, and a CProtocol
tokenization code fed by a CTransport. It should be possible for the
python->C transition to occur in the reactor when it calls ctransport.doRead
python->and then not come back up to Python until Banana.receivedObject(),
at least for built-in types like dicts and strings.
