
- the top-level multiplication routine is fairly naive: its handling of
  unbalanced multiplications (operands of very different sizes) is
  sub-optimal. One should consider implementing several unbalanced Toom
  variants and using them as building blocks for the unbalanced
  multiplications.

- there is no addmul_2 routine. When SSE2 registers are available,
  using one (or an addmul_4 on 32-bit machines) could be a clear win,
  and it should even be quite easy to implement.
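
To make the addmul_2 item concrete, here is a minimal portable sketch of what
such a routine could compute. It rests on assumptions not stated above: that
the arithmetic is carry-less (polynomials over GF(2) packed into 64-bit
words), and that addmul_1 means "c ^= a * b" for a single word b. The names
clmul64, addmul_1, addmul_2 and addmul_2_matches_addmul_1, and their
signatures, are illustrative only and are not taken from the actual code. A
real SSE2 version would hold b0 and b1 in one 128-bit register; this sketch
only shows the data flow, namely one pass over a instead of two.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Carry-less product of two 64-bit words: (hi:lo) = a * b over GF(2)[x].
   Plain shift-and-xor reference, not an optimized kernel. */
static void clmul64(uint64_t a, uint64_t b, uint64_t *hi, uint64_t *lo)
{
    uint64_t h = 0, l = 0;
    for (int i = 0; i < 64; i++) {
        if ((b >> i) & 1) {
            l ^= a << i;
            if (i)                      /* avoid an undefined shift by 64 */
                h ^= a >> (64 - i);
        }
    }
    *hi = h;
    *lo = l;
}

/* c[0..n] ^= a[0..n-1] * b. The caller provides n+1 words in c. */
static void addmul_1(uint64_t *c, const uint64_t *a, size_t n, uint64_t b)
{
    uint64_t carry = 0;
    for (size_t i = 0; i < n; i++) {
        uint64_t hi, lo;
        clmul64(a[i], b, &hi, &lo);
        c[i] ^= lo ^ carry;
        carry = hi;
    }
    c[n] ^= carry;
}

/* c[0..n+1] ^= a[0..n-1] * (b0 + b1 * x^64), reading a only once.
   The caller provides n+2 words in c. */
static void addmul_2(uint64_t *c, const uint64_t *a, size_t n,
                     uint64_t b0, uint64_t b1)
{
    uint64_t carry0 = 0, carry1 = 0;
    for (size_t i = 0; i < n; i++) {
        uint64_t h0, l0, h1, l1;
        clmul64(a[i], b0, &h0, &l0);
        clmul64(a[i], b1, &h1, &l1);
        c[i] ^= l0 ^ carry0;            /* everything landing on word i */
        carry0 = h0 ^ l1 ^ carry1;      /* destined for word i+1 */
        carry1 = h1;                    /* destined for word i+2 */
    }
    c[n] ^= carry0;
    c[n + 1] ^= carry1;
}

/* Consistency check: addmul_2 must agree with two shifted addmul_1 calls. */
static int addmul_2_matches_addmul_1(void)
{
    const uint64_t a[3] = { 0x8000000000000001ULL, 0x3ULL,
                            0xdeadbeefcafef00dULL };
    const uint64_t b0 = 0xfedcba9876543210ULL, b1 = 0x0123456789abcdefULL;
    uint64_t c1[5] = { 0 }, c2[5] = { 0 };

    addmul_1(c1, a, 3, b0);
    addmul_1(c1 + 1, a, 3, b1);
    addmul_2(c2, a, 3, b0, b1);

    for (size_t i = 0; i < 5; i++)
        if (c1[i] != c2[i])
            return 0;
    return 1;
}
```

Calling addmul_2_matches_addmul_1() and asserting that it returns nonzero
gives a quick sanity check. Whether the single-pass form is actually faster
depends on the word-level kernel that replaces clmul64, so this is a
correctness reference rather than a performance claim.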
