Version 0.0.3

This is the first version that is fully usable for generating complete
input for Sherpa and Pythia. It should therefore be close to a public
release, barring any optimisations and bugfixes that become necessary
in the run-up to the release.

NEW FEATURES

- Add support for using host-only Chili when otherwise running on device
- Add beam energy setting
- Make HDF5 output compatible with the Sherpa reader
- Full PDF and beam information in HepMC3 output
- Cross section information in HDF5 output (in the "/generatedResult"
  field)
- Add support for LHEF event output
- Add leading colour configuration information to event output
- Add VEGAS-optimised t-channel (plus one s-channel) integrator
  "Chili(mild)" with CUDA support
- Add support for CUDA LHAPDF

MINOR IMPROVEMENTS

- Do MPI sum before updating helicity selection weights
- Enable Chili max eta cuts for jets
- Only stop optimisation/integration when all processes have been
  sampled at least once
- Add environment variable PEPPER_DEVICE_ID for setting the CUDA device
  used
- Scale min. number of nonzero events per optimisation/integration step
  with the number of helicity configurations and divide by the number of
  MPI ranks for easier usage
- Make CMake configuration output a bit more verbose
- Support nested internal timing diagnostics
- Add H_Tp^2/2 and H_T^2/2 scale definitions
- Improve output when zero events are requested
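
The scaling of the minimum number of nonzero events listed above can be
illustrated with a short sketch. It is not taken from the Pepper source:
the function name, the base threshold and the rounding-up are
assumptions made for illustration only.

```python
import math


def min_nonzero_events_per_rank(base_min, n_helicity_configs, n_mpi_ranks):
    """Hypothetical helper: per-rank minimum of nonzero events per step."""
    if n_helicity_configs < 1 or n_mpi_ranks < 1:
        raise ValueError("counts must be positive")
    # Scale the base threshold with the number of helicity configurations,
    # then share the requirement across the MPI ranks.
    # Rounding up is an assumption of this sketch.
    return math.ceil(base_min * n_helicity_configs / n_mpi_ranks)


# Example: base threshold 100, 16 helicity configurations, 4 MPI ranks.
print(min_nonzero_events_per_rank(100, 16, 4))
```

With such a rule, the same base setting can be kept when the process or
the number of ranks changes.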

ADDED DOCUMENTATION

- Manual guide on reusing cached results
- Manual tutorial "Getting started"
- Manual reference on Runcard options

PERFORMANCE IMPROVEMENTS

- Add CPU vector instructions for some vertices
- Improve performance of resetting particle information in the recursion
- Remove unnecessary D->H copies of particle information
- Remove redundant helicity selection weight updates
- Make H->D copy of random numbers for FORCE_HOST_RNG=1 faster
- Evaluate Z/photon currents simultaneously
- Improve performance of the momentum storage handling in the Chili
  interface
- Use minimal storage for matrix elements
- Cache the nonzero helicity configurations, which speeds up
  initialisation of subsequent runs, in particular on the device

BUGFIXES

- Fix read-in of selection weights
- Fix too low output precision of cached results
- Fix bug when returning a zero standard deviation/variance if the
  number of trials is one
- Fix non-positive-definite standard deviation (and hence selection
  weights)
- Use correct scale information in HDF5 output
- Improve heuristics to set beams in HDF5 output
- Increase FORM max. term number to fix colour factor generation bug
- Fix moving FORM-generated files across filesystems
- Fix crash in HDF5 output for weighted events
- Fill dummy cross sections for auxiliary weights to suppress new HepMC3
  warnings
- Fix integer overflow bug when MPI summing MC event counters for many
  ranks
- Remove dummy-event zero counting in HDF5 output, which broke the
  Sherpa read-in
- Fix crash in HDF5 output when all events are zero
- Correct reported LHEH5 version string from 2.0.0 to 2.0.1
- Fix GPU device selection for MPI use
- Fix accidental correlation in the random flavour channel selection
- Fill correct scales and couplings in various output formats
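
The integer overflow fixed above is of a kind that is easy to reproduce
in isolation. The following sketch is not Pepper code and does not use
MPI: it only emulates summing per-rank event counters in a 32-bit and a
64-bit signed integer. The counter widths, the number of ranks and the
per-rank counts are assumptions chosen for illustration.

```python
import ctypes


def sum_counters_int32(per_rank_counts):
    """Sum as a 32-bit signed integer would, wrapping around on overflow."""
    total = 0
    for count in per_rank_counts:
        total = ctypes.c_int32(total + count).value
    return total


def sum_counters_int64(per_rank_counts):
    """Sum as a 64-bit signed integer would."""
    total = 0
    for count in per_rank_counts:
        total = ctypes.c_int64(total + count).value
    return total


# Hypothetical run: 64 ranks with 50 million events each (3.2e9 in total),
# which exceeds the 32-bit signed maximum of 2147483647.
counts = [50_000_000] * 64
print(sum_counters_int32(counts))  # wraps around to a negative number
print(sum_counters_int64(counts))  # 3200000000
```

Each per-rank counter fits comfortably into 32 bits here; only the sum
over many ranks does not, which is why such a bug tends to show up in
large MPI runs first.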