Version 1.23: 6/3/07
	- Linear algebra changes:
		- Made singleton removal alternate with deleting cliques 
		  from the input matrix. This seems to be required when
		  the initial matrix is large and very sparse, otherwise 
		  only trivial dependencies are found
		- Made the iteration skip verifying that all columns are
		  used in the last iteration. I think that this causes
		  unnecessary restarts of the linear algebra
		- Allowed for combine_cols to find zero dependencies
	- NFS filtering changes:
		- Allowed the singleton removal to grow the size of 
		  hashtables used during disk-based passes
		- Modified the filtering driver to more intelligently tune 
		  the bound on large ideals when there is a very large 
		  amount of excess, and/or when sieving uses 29+ bit 
		  large primes 
		- Changed the singleton removal cutoffs to avoid doing the
		  last (pretty useless) disk-based pass most of the time
		- Changed the singleton removal to delete the file from the
		  duplicate removal phase when finished with it
	- Fixed a bug reading NFS relations (thanks Greg Childers)

Version 1.22: 5/27/07
	- Linear algebra changes:
		- Added simple multithreading
		- Modified the dense matrix multiply to use blocks of 64
		  rows instead of 32. This is slightly faster, and allows
		  reusing the general vector-vector code from elsewhere
		- Modified the matrix packing code to pack the matrix in
		  two passes instead of one; this saves an enormous number
		  of reallocations and significantly reduces the working
		  set size of the linear algebra
		- Modified the driver to correctly choose when to pack
		  the matrix, and to correctly pack dense rows when no 
		  post-Lanczos matrix is desired
		- Fixed a memory leak freeing the packed matrix
		- Cleaned up combine_cols
		- Added more fixes for OS X Intel (thanks Romain Muguet)
	- Forced the buffer size for ideals in a relation_lp_t to match 
		that of the buffer in a relation_t. This fixes a fairly
		common NFS square root failure (thanks Hallstein Hansen)
	- Fixed a bug in the factor base generator change from v1.21
		(thanks Tom Womack and Jes Hansen)
	- Made the Pentium 2 and 3 QS sieve cores have a 64kB block size
	- Modified the expression evaluator to accept integers in octal or
		hex format. Dear reverse engineers: it isn't that hard
	- Added more sanity checking to NFS matrix construction
	- Added more QS tuning from Bill Hart
	- Fixed a typo in the tiny QS code (thanks Volturno)

Version 1.21: 5/11/07
	- NFS filtering changes:
		- Changed the merge phase to count the average weight
		  of cycles that would appear in the matrix, not the average
		  weight of all cycles found. This allows for a somewhat
		  smaller matrix when there is a large amount of excess
		- Changed the filtering driver to iteratively rerun the
		  singleton removal with gradually smaller bounds. This is
		  much better at dynamically figuring out a bound for large 
		  ideals that is small and can produce a sensible matrix
		- Added back the code that buries ideals that are too
		  heavy. The current merging architecture makes it possible
		  to apply burying of ideals more intelligently than my
		  initial experiments, and burying ideals saves a lot of
		  memory as merging progresses
		- Modified the singleton removal to use a larger hashtable
		  for the initial disk-based pruning passes, and also to
		  rebuild the hashtable after a pruning pass if it was 
		  saturated going into the pruning pass
		- Increased the maximum clique batch size, to save time
		  when filtering datasets that have a lot of excess
		- Increased the number of excess columns in the final matrix;
		  maybe this will help prevent occaisional bad behavior
	- Merged Brian Gladman's MSVC versions of the inline assembly code,
		as well as his rearrangement of the unrolled Lanczos loops
	- Merged Brian Gladman's latest MSVC build files
	- Modified the NFS square root to handle inputs where there are no
		q for which (algebraic polynomial) mod q is irreducible
	- Changed the NFS linear algebra to assemble the matrix on disk
		and read in the columns after relations are freed. This 
		reduces the memory use of the linear algebra by about 30%
	- Fixed yet another bug in the NFS factor base generator, which
		would sometimes miss roots of polynomials that would be
		degree 1 after reduction by p (thanks to Hallstein Hansen)
	- Documented the QS improvements
	- Sharpened the detection of older and newer AMD processors
	- Allowed for use of AMD's MMX extensions; this allows a slight QS
		speedup when the full SSE instruction set is not available
	- Added some QS tuning from Bill Hart

Version 1.20: 5/1/07
	- Added support throughout the NFS module for relations with
		64-bit 'a' values. This makes the postprocessing a little
		slower, but finally allows msieve to complete any previously
		started GGNFS run
	- QS changes:
		- Modified the makefile to create a 'fat binary', the same
		  core sieving routine compiled over and over again with 
		  different optimizations and preprocessor directives. The 
		  QS driver now selects the optimal routine at runtime, 
		  and this makes several CPUs noticeably faster for small
		  and moderate size factorizations 
		- Used two different reciprocal values instead of one, 
		  so that remainder operations involving small factor 
		  base primes are faster now
		- Modified the sizing up of the factor base into distinct
		  regions; this gives ranges of factor base primes that
		  use reciprocals, and also allows a large range where
		  remainder operations are essentially free (for primes
		  smaller than the cutoff for using hashtables)
		- Increased the unrolling of the sieve scanning loop to 64
		- Used multimedia vector instructions to scan the sieve
		  interval, if they are available. This makes small and
		  moderate size factorizations up to 5% faster
		- When sizing MPQS polynomials, use the rounded up value
		  of the sieve interval instead of the original. Not doing
		  this means the sieve interval would be slightly lopsided,
		  making sieve values slightly larger on average
		- Modified generation of reciprocals so that the quotient
		  from integer division is exact, i.e. doesn't need correcting
		- Reduced the number of small primes that are not sieved
		  as the input size increases. Skipping small primes is less
		  important in these cases, and the extra relations found
		  more than make up for the extra time spent sieving
		- Removed obsolete code from MPQS polynomial generation
		- Reduced the limit on QS factor base primes to 2^26 (need
		  an extra bit in the factor base structure)
		- Cleaned up check_sieve_val
	- Completed documenting the NFS filtering
	- Modified the cache detection code to handle level 1 cache sizes too
	- Added code to detect the (x86) CPU type
	- Modified the NFS square root to handle negative leading rational
		polynomial coefficients; also added code to check the sign
		of the leading algebraic polynomial coefficient
	- Fixed a bug sorting relations during the NFS merge postprocessing
	- Fixed find_large_ideals to reduce b values modulo p before
		calling mp_modinv_1; the latter cannot handle b that
		exceed p (thanks to Greg Childers)
	- Fixed a typo parsing NFS options in the demo program
	- Increased the allowed relation set size during NFS merging; the
		limit cannot be too restrictive, or else many cycles will not
		be found (a big cycle can turn into a small cycle because
		of a fortuitous merge, but not if the cycle is pruned first)
	- Made the prefetching of long buffers conditionally compiled; most
		modern CPUs can prefetch these automatically so explicit 
		prefetch instructions are unnecessary
	- Modified the NFS square root to not stop unless the largest
		remaining composite is very small. This will protect all
		the NFS relations from being overwritten by the MPQS code
	- Removed the use of assembly code using cmov instructions when
		compiling for 64-bit x86
	- Changed inline assembly code to not explicitly clobber the EBX
		register; apparently OS X on Intel reserves this register
		for the PIC base address (thanks Romain Muguet)
	- Increased size of the buffer used in the demo to hold input numbers,
		to accomodate ridiculously huge SNFS inputs (thanks Tom Womack)


Version 1.19: 4/17/07
	- Linear algebra changes:
		- Made the block Lanczos code much more modular
		- Added cache blocking to the matrix multiply, with 
			automatic sizing of blocks based on the size of
			CPU caches
		- Added x86-specific MMX code to the computationally intensive
			routines. This item and the previous reduce the
			solve time for large problems by a factor of 4-6!
		- Pushed the restart of failed linear algebra runs into
			the block Lanczos code itself, instead of relying
			on the QS and NFS code to do it
		- Added progress reporting for larger jobs
		- Increased the number of rows that are converted to packed
			format, and made the final number of such rows a
			multiple of 32
	- Modified the file format of the NFS factor base, to allow manual
		adjustment of NFS parameters
	- Added an extra NFS matrix row to guarantee that the number of
		free relations in each dependency is even
	- Fixed a rare buffer overflow in the FFT multiplication code 
		(thanks to Greg Childers / valgrind)
	- Added automatic external cache size detection code
	- Began documenting the NFS filtering
	- In the NFS linear algebra, increased the bound for primes that 
		are packed into dense rows; this saves about 5% of the 
		working set during the linear algebra
	- Made all header files safe for C++ compilation

Version 1.18: 4/5/07
	- NFS filtering changes:
		- Added a postprocessing step to the merge phase that removes
		  relations by rearranging the basis defined by the current
		  collection of cycles. This makes the final linear system
		  about 2-4% lighter
		- Made the merge phase a lot more modular, since merging and
		  postprocessing of the merged relations need many of the
		  same utility functions
		- Added the capability to force building of a matrix, even
		  when there aren't as many excess relations as the filtering
		  code wants by default
		- Modified the filtering driver to not use the factor base 
		  at all; instead, a histogram of all the primes in the 
		  set of relations is built up during duplicate removal,
		  and the large prime cutoff is chosen based on that
		- Modified the clique removal to choose the clique batch size
		  based on the amount of excess to prune, instead of using
		  a hardwired size
		- Modified the merge phase to choose the maximum relation
		  set size to keep based on the amount of excess relations.
		  This can save a lot of memory when there is a large number
		  of excess relations
		- Modified the merge phase to keep producing cycles as long
		  as possible, instead of stopping when there are more cycles
		  than skipped ideals. This gives many more cycles to choose
		  from, and makes it much easier to converge to a sensible
		  matrix, especially when there aren't many excess relations
	- Modified nfs_read_relation to ignore factors of zero that 
		are read in (thanks to Greg Childers)
	- Fixed a buffer overflow in mp_divrem_core that occurs when the
		quotient requires an entire mp_t (thanks sp65536)
	- Fixed a longstanding bug in the polynomial rootfinder that 
		generated incorrect roots for nonmonic linear NFS 
		polynomials (thanks Macz)
	- Fixed the NFS square root to work correctly in the case of a
		nonmonic linear polynomial and free relations
	- Changed the extended precision floating point to not use
		ppc_intrinsics.h on PowerPC. This header apparently does not 
		exist on the Playstation3 (Cell) dev. environment (thanks Macz)
	- Updated the Visual Studio build files, courtesy of Brian Gladman
	- Added a little more documentation to the NFS square root

Version 1.17: 3/12/07
	- NFS filtering changes:
		- Peformed a major overhaul of the merging code. The
		  new version uses two heaps of ideals, and this allows
		  undoing decisions made about which ideals to try to
		  merge. The big advantage of this approach is that merging
		  starts by computing the lightest possible (but too large) 
		  matrix, then continues reducing the matrix dimension 
		  until the result has a specified density
		- Added minimum-spanning-tree combining of relation sets,
		  and performed a major overhaul of the core merging code
		- When merging, keep the relations making up a relation set
		  in sorted order. This allows the relation lists to be merged
		  just like the ideal lists, and allows repeated relations to
		  be removed from cycles as merging progresses
		- When estimating the weight gain from combining relation
		  sets, give a bonus to merging relation sets that lead to
		  cancellations in the combined list of relations. All of
		  these improvements combine to reduce the weight of the 
		  final matrix by up to 50%
		- Removed more compile-time limits in the merge phase
		- Fixed a serious and subtle bug in the merge phase, which
		  caused the weight of merged relation sets to be wrong
		- Changed the second singleton removal pass to use the
		  same large ideal cutoff for rational and algebraic factor
		  bases, even if they have different sizes
		- Added code to give up on relation sets that are too dense,
		  i.e. that contain too many relations
		Many thanks to Greg Childers for providing an NFS factorization
		that made me scramble to improve the NFS filtering
	- Added the beginning of a skewed polynomial selector (currently
		turned off)
	- Performed a complete overhaul of the driver and the manager
		for factors found. This pushes the recursive use of
		factorization into the library and unifies how all
		the different algorithms handle factors found
	- Added the Pollard-Brent algorithm; this makes inputs containing
		6-8 digit factors complete much faster (thanks to Leonardo
		Volpi for pointing out that latter-day msieve versions
		performed quite badly in this regard)
	- Modified the NFS linear algebra to sort lists of ideals by
		prime and then by rational/algebraic type. This assures
		that the Lanczos code begins with the sparsest possible matrix
	- Added code that brute-force divides read-in NFS relations by 
		small primes, in case some factors in the savefile 
		are missing. This is required to perform the postprocessing 
		for a factorization started with GGNFS. Also simplified 
		nfs_read_relation() and corrected a minor bug there
	- Fixed a horrible use-after-free bug in purge_singletons_final_core()
		(thanks valgrind)
	- Fixed a bug in the sieving during NFS polynomial selection; 
		factors of 11 were not being used
	- Modified the lanczos code to remember if some of the input matrix
		was converted into dense rows; this allows the linear algebra
		to restart properly after a bad initial solution
	- Added a failsafe path for long division by zero
	- Made mp_clear a static inline function instead of a macro. The extra
		type checking this allows would have prevented a silly bug
		in the NFS square root (in ap_poly_mul)
	- Fixed the routine that checks the relation product in the NFS
		square root to not assume degree 5 algebraic polynomials
	- Fixed a stupid bug in mp_is_prime

Version 1.16: 1/17/07
	- Made the code to analyze polynomials independent of the 
		polynomial degree and skewness; also made the interface
		to the analysis code much more modular
	- Rearranged the sampling code for NFS polynomial root properties
		so that samples are only generated one at a time
	- Modified the NFS driver to analyze any polynomial when it is 
		read in at the beginning of an NFS run, even if the 
		polynomial was not generated by the library
	- Fixed a stupid bug in mp_log that caused NFS polynomial search
		to be cut short early
	- Fixed a bug in the factor base generator that only shows up
		for SNFS factorizations (thanks to Greg Childers)
	- Allowed 6th degree polynomials, for people with SNFS factorizations
	- Reduced the required amount of NFS matrix excess from 200 to 80

Version 1.15: 1/11/07
	- Made as much of the NFS linear algebra code as possible into
		common code, so that the QS module can take advantage
		of the much more sophisticated matrix handling. Also
		generalized the code to form dense rows automatically
	- Fixed the linear algebra code that performed post-Lanczos 
		elimination to get true dependencies; it was completely wrong
	- Fixed an obscure bug in the FFT multiply code, that was causing
		the NFS square root to fail. Also added a verification
		step after the product of the NFS relations is computed
	- Modified the initial stage of the NFS linear algebra to 
		dynamically allocate the structures for forming dense rows
	- Made common code to accurately count the number of nonzero
		entries in QS or NFS matrices, replacing lots of ad hoc
		code that didn't do a good job
	- More reductions in the memory use of the NFS square root
	- Documented the NFS square root
	- Changed the method for detecting the floating point precision
		at runtime; the old method doesn't work if the compiler
		can emit SSE2 instructions
	- Use two separate hash functions when indexing 2-word structures.
		This is much more robust against data that is prone to hash
		collisions; for example, the previous 1-hash-function
		implementation performs miserably on big-endian systems during
		NFS filtering
	- Reduced the size of the NFS quadratic character base; the problems
		in the NFS square root were not due to an insufficient number
		of quadratic characters

Version 1.14: 1/5/07
	- Modified the NFS module to allow free relations
	- NFS square root changes:
		- Try only one start point to the Newton iteration per
		  dependency. This allows a major reduction in memory use
		  for the final iteration
		- tune the choice of starting point for the Newton iteration
		  so that the final iteration has precision just larger
		  than what the true square root needs
		- fixed memory leaks that occurred if a dependency had
		  a fatal error
		- Modified the checking to count up the powers of all 
		  the algebraic ideals, not just the primes to which 
		  those ideals correspond
	- Made the parity row mandatory for NFS linear algebra
	- Modified find_large_ideals to avoid finding the exact root
		to which a rational ideal corresponds. This makes the
		routine twice as fast, and makes filtering much easier
		when the rational poly is (ever) nonlinear
	- Disabled the NFS rootfinder assembly code for gcc < 3.0, and
		added some tweaks to allow compilation with less than
		full optimization (thanks to Daniel Roethlisberger)
	- Fixed a long-standing bug in the buffering of savefile data,
		that under fairly rare circumstances caused junk to be
		written to the savefile. If a run was then restarted, the
		first few relations would be reported as corrupted (thanks
		to Maximilian Hasler for the first report of this)

Version 1.13: 12/31/06
	- Added NFS square root code. This computes the algebraic square
		root by brute force, and while the current version is
		a little slow and a little memory-intensive, the runtime
		is much better than the literature claims
	- Added floating point FFT-multiply code. The core FFT arithmetic
		can be made *much* faster, but it's good enough for now
	- Added a basic arbitrary-precision math library
	- Split off the rootfinding code from the NFS factor base
		generation, optimized the rootfinder (~30% faster now),
		added functions needed by the NFS square root
	- Performed a major overhaul of the code that manages factors
		found by the sieve methods. The new code is simpler,
		usable by QS and NFS, and allows stopping before all
		dependencies are processed. This was long overdue,
		and cleans up the QS square root significantly
	- Modified the NFS driver to be fire-and-forget, using filtering
		information to feed back into the sieving
	- NFS filtering changes:
		- allocate critical structures dynamically, so that 
		  filtering can at least complete in the face of 
		  tough datasets
		- double the number of clique removals in the first pass
		- streamline the inputs to the second pass
	- Made the extended-precision floating point code independent of
		NFS poly generation
	- Increased the limit on QS factor base primes to 2^27
	- Increased the limit on input size to 164 digits, and added
		completely untested parameter tuning for inputs up to a little
		over 155 digits. I'm flattered that you think msieve can
		handle 120+ digit problems, and could I sell you a bridge
		near New York City?
	- Changed poly_get_roots to also return the leading poly coefficient
		mod p. Even though it doesn't save much time, this value
		is always needed in calling code and it's available for free
	- Increased the size of the NFS quadratic character base again.
		The previous size wasn't enough to allow an algebraic
		square for even a 100-digit factorization
	- Changed the lanczos code to pack nontrivial dependencies into the
		low-order bits of the dependency vector
	- Modified the demo to allow individual NFS postprocessing phases
		to be run
	- Added a check to the QS multiplier selection that a given
		multiplier doesn't cause overflow if used
	- Fixed a silly bug in QS polynomial initialization
	- Changed mp_modsqrt_1 to avoid the need for random numbers
	- Added mp_log; this removes some duplicated code
	- Moved mp_t conversion into the multiple precision library
	- Corrected some of the documentation in the NFS linear algebra
	- Overhauled the NFS readme

Version 1.12: 9/8/06
	- Fixed a bug declaring the hash multiplier on 64-bit systems
		(thanks to Igor Schein)

Version 1.11: 9/7/06
	- Added the NFS rational square root, and also the very beginning
		of the algebraic square root
	- Modified nfs_read_cycles() to optionally load only the cycles
		corresponding to a particular dependency from the linear
		algebra phase. This lets the routine be reused for the
		NFS square root
	- Removed hashtable_just_added(); the NFS hashtable implementation
		cannot find out after the fact whether a given entry was
		just added to the hashtable
	- Fixed the fix for the stupid bug in mp_str2mp (thanks to
		Bernardo Boncompagni)

Version 1.10: 8/25/06
	- Changed the NFS linear algebra to add a row to the matrix if the
		rational polynomial is not monic and linear
	- Fixed a stupid bug in mp_str2mp (thanks to Philippe Strohl)

Version 1.09: 8/23/06
	- Fixed silly mistake in mp_modmul

Version 1.08: 8/23/06
	- Major reorganization of the NFS code: 
		- the poly selection code is much more modular, so that
		  other poly selection methods can reuse as much existing
		  code as possible 
		- moved the filtering code to its own directory with 
		  its own header file, and reorganized it
		- general rearrangement of header files, and grouping
		  together of related but previously scattered functions
	- Added documentation for all NFS code except the filtering merge 
		phase (which is still under development)
	- Added a neat expression evaluator, a Visual Studio 2005 build 
		system, use of the Visual Studio prefetch instrinsic, a
		fix for mp_addmul_1 and some portability fixes, all courtesy 
		of Brian Gladman
	- Modified QS and NFS postprocessing to use more compact structures
		to represent lists of relations and lists of cycles. The 
		result is more memory efficient (especially on 64-bit systems)
		and reduces the number of small memory allocations. It also 
		removes a lot of cheesy pointer games that the previous 
		version played when building an NFS matrix.
	- Modified QS postprocessing to change the sorting criteria for
		generated cycles, and performed a general cleanup of the
		QS filtering code. The primary benefit is that full and
		partial relations are treated identically, and this simplifies
		the linear algebra and square root
	- Pushed the generation / reading of NFS factor bases into the sub-
		phases of the NFS code, instead of doing it once at the top
		level driver. This allows each stage to fine-tune the factor
		base so as to save memory
	- NFS linear algebra changes:
		- increased the size of the quadratic character base
		- generalized the matrix building code to work correctly
		  with any number of dense rows and any number of rows
		  in the post-Lanczos elimination phase, including zero
		  for both of these quantities
		- modified the Lanczos matrix multiply to allow for dense
		  rows packed into 32-bit bitfields. Packing the first few
		  dense rows into bitfields removes about 20% of the Lanczos
		  working set
		- after reducing the initial matrix, permute the rows so
		  that the smallest row indices are the heaviest
	- NFS filtering changes:
		- ignore the smallest ideals when estimating the weight 
		  of a relation. These often cancel each other in the 
		  matrix phase, and potentially confuse the optimization 
		  process in the merge phase. The result is a slightly 
		  sparser matrix given to the Lanczos code
		- fixed a minor bug calculating the weight of cliques 
		  found during clique removal. Matrices are about 1% sparser
		- changed singleton removal to always dump a singleton file
		  to disk when finished
	- Modified QS multiplier selection to allow for even numbers (again!).
		Also reduced the number of multiplier test primes, to keep
		the runtime of multiplier selection down. Thanks to Bill 
		Hart for finding an input where an even multiplier is optimal
	- Extended to smaller factorizations the trick used to pick QS 
		polynomial A values that are close to the optimium
	- Expanded the API for using hashtables in the NFS code; this greatly
		simplifies all the places that need general hashtable support
	- Changed the multiplier used in hashtables to be prime. This
		appears to make NFS filtering run about 3% faster

Version 1.07: 7/30/06
	- Added an NFS linear algebra phase. For NFS size matrices the
		block lanczos code is very slow, but otherwise all the
		pieces are there
	- NFS filtering improvements:
		- during the full merge, separated the burying of heavy
		  ideals and the merging of light ideals. Compared to 
		  interleaving the two processes, this makes the final 
		  matrix a few percent lighter
		- split singleton removal into two parts, one for ideals
		  above those in the factor base (with light clique removal)
		  and one for factor base ideals (with aggressive clique
		  removal). This makes the final matrix a tiny bit more
		  sparse, and greatly reduces the worst-case memory con-
		  sumption of singleton removal
		- made memory allocation during the final singleton pass
		  less aggressive
		- increased the limit on number of ideals allowed for a single
		  relation. Previously some relations exceeded that limit and
		  were silently skipped
		- removed some debug code
	- Modified nfs_read_relation to, as an option, store only (one copy of)
		ideals whose multiplicity in a relation has odd parity.
		This saves memory in the matrix build phase, and makes 
		find_large_ideals simpler and very slightly faster
	- Modified find_large_ideals to correctly handle projective roots,
		small ideals like 2 or 3, and a rational factor of -1. This
		makes the routine suitable for use in NFS matrix-building
	- Changed the file format for NFS cycles, to make parsing easier
	- More QS improvements:
		- Reduced the QS sieve block size when compiling for PowerPC
		- Separated the list of modular square roots from the other
		  factor base information. They're only needed during poly
		  initialization, and the smaller factor base structures 
		  make better use of cache (~5% speedup for small jobs)
		- Added profiling code to the sieve phase. This has been 
		  present in my internal sources for a while, but there's
		  no harm in making it public since it's off by default
	- Changed qsort callbacks to correctly sort uint32 arrays when the
		array elements need all 32 bits
	- Modified the makefile to have separate targets for compiling
		with and without the NFS code (default without). This makes
		it easy to compile without dependencies on external libraries
		or experimental code. Also changed Apple compile flags to
		assume FSF gcc as the compiler
	- Make the 'flags' field of msieve_obj volatile. Bits in this are
		checked to see if they changed asynchronously in other threads,
		and the compiler needs to know this field cannot be cached
	- Modified the tiny factoring code to give up when the factors of N
		found are 1 and N. Michael Fuhr and Dennis Langdeau found
		several inputs that would recurse infinitely when this happened.
		Unfortunately it's still an open issue to make these
		factorizations actually *work*
	- Always try to open /dev/urandom unless compiling for win32.
		Apparently several OSes have this facility and there's
		a fallback path if they do not (thanks to Michael Fuhr)

Version 1.06: 4/21/06
	- Added an NFS filtering phase. All the steps are there (except for
		minimum spanning tree combining of relation-sets) but it's
		not documented yet because the current code is pretty messy
		and needs to be integrated better
	- Many QS improvements (thanks to Colin Percival for ideas and
		discussions that led to some of these):
		- Fixed a bug in the multiplier selection
		- Exchanged the loop order in choose_multiplier(), so that 
		  precomputations can be performed without having to buffer 
		  them all 
		- Increased the number of primes tested when choosing 
		  the multiplier
		- Added a safety check when choosing the number of primes 
		  to use in the multiplier selection
		- Increased the amount of unrolling when scanning the sieve
		  array from 8 to 32 
		- Unrolled the loop that adds the sieve updates from 
		  large primes 
		- Switched back to a compile-time sieve block size
		- Removed the exception code in the inner sieve loops for
		  factor base entries that are not sieved
		- Modified the sieve initialization to cover the whole interval
		  in one pass, instead of handling positive and negative
		  offsets in separate passes
		- Converted an inner loop in the poly selection to use the 
		  faster modular inverse
		- Moved the corrections for root updates into the sieve trial
		  factoring code, where they are much cheaper
		- Pulled common subexpressions out of the loops used for 
		  switching sieve polynomials
		- Use conditional move instructions (where supported) to 
		  reduce the overhead of switching sieve polynomials
		The reduction in overhead improves performance by 10-20%,
		even for large factorizations. The amount of improvement
		varies a good deal, and I still have to figure out why
	- When NFS sieving, turn off factor base entries whose prime 
		is divisible by the current b value, and turn off
		projective roots in the same way
	- Removed all the special case sieve code for NFS factor base
		primes of 2. This also fixes a bug that made relations 
		always say they had a factor of 2, even if they did not
	- Made separate QS and NFS versions of the functions that write to
		the savefile 
	- Made the NFS siever skip initialization entirely if no relations
		are needed
	- Turned off prefetching for gcc older than 3.0 (thanks to Tony Goddard)

Version 1.05: 2/4/06
	- Made the NFS factor base an array of ordinary structures and made
		the version used by the line siever an array of line-siever-
		specific structures. This cleans up the siever code, saves
		memory outside the siever code and paves the way for a 
		lattice siever to do the same thing later
	- Changed d2mp() in the NFS poly generation code to work correctly
		on PowerPC processors. QS and NFS should both work on
		that platform now
	- Added some fixes to the portions of the MP library that are
		only compiled for non-x86 platforms
	- Fixed a silly mistake parsing NFS arguments in the demo application
		(thanks to Bill Hart)
	- Added a note to the makefile about linking with GSL (thanks sp65536)

Version 1.04: 1/31/06
	- Added the beginnings of a package that implements the general
		number field sieve. It's not even close to done, but
		includes a polynomial selector, factor base generator
		and line siever, the latter two being quite sophisticated.
		GNFS must be turned on explicitly, and only applies to 
		factorizations above ~97 digits 
	- Major reorganization of the library code; anything that is
		not specific to the QS implementation has been moved
		elsewhere, for reuse by other modules. This is a fundamental
		shift: QS is not the purpose of this library anymore,
		factoring is
	- Major overhaul of the multiple precision library. This includes
		new functions, reduced overhead, better algorithms and 
		more assembly language. Some routines are an order of 
		magnitude faster now, although only small factorizations
		really benefit from the faster library (my test C70 is
		about 10% faster).
	- Finally added a prime sieve to build factor bases, and increased
		the trial factoring bound
	- Combined the trial factoring with all of the checking afterwards
		(i.e. for primes, perfect powers, and small inputs). This
		lets different modules do their own trial factoring and
		automatically handle cases where MPQS or GNFS is not needed
	- Added some MSVC portability fixes from Brian Gladman

Version 1.03: 11/27/05
	- Modified the demo to recurse fully, in the case of multiple
		composite factors found. Also print out the number to be
		factored when in quiet mode
	- Fixed a dumb mistake assigning the number of words in mp_add_1
		and mp_sub_1 (thanks terrasse247)
	- Replaced include of stdint.h with the more portable inttypes.h

Version 1.02: 11/20/05
	- Greatly increased the size of the factor base and the sieve interval
		for large factorizations (90 digits and up). Thanks to
		Jay Berg for pointing out how suboptimal the original choices
		turned out to be. Expect to see speedups of 20% for 
		100 digit and 50% for 105 digit factorizations! I've chosen 
		what appear to be optimal parameters up to 110 digits, but 
		beyond that is still only guesswork
	- Modified the sieve code to dynamically choose the number of
		polynomials that are simultaneously sieved. Since sieve 
		intervals can now be large, statically picking the number
		of polynomials runs the risk of chewing up massive amounts
		of memory
	- Modified the polynomial selection code to choose larger factors
		of polynomials for large factorizations. This shouldn't have
		a performance penalty and can potentially save lots of memory
	- For big factorizations, modified the poly generation code to choose
		some factors deliberately too small. This makes the last factor
		chosen deliberately too large, and reduces the difference 
		between the current poly and the optimal poly. The result
		is a ~2% increase in relation discovery rate. Also added an
		explicit limit on the number of factor base primes to be
		searched as candidates for this last factor (the limit was
		always there, but now the code runs the risk of hitting it)
	- More multiplier changes. All possible multipliers are tried
		for any input, not just one of the four subsets previously
		used. The time needed to test all multipliers is really
		trivial compared to the time needed for a big factorization
	- Made the counting of cycles during sieving optional. For 
		distributed clients that will only do sieving, and for which
		the tracking of cycles is unnecessary, this saves a lot
		of memory
	- When sieving is interrupted, exhaust all of the polynomials for the
		current 'A' value before stopping, if there are not too many 
		such polynomials. If 'A' has very many polynomials, stop
		once 2000 have been sieved. This is a compromise that allows
		getting the sieving to which users are entitled, without
		having to wait 15 minutes for a shutdown
	- Clamped the maximum factor base prime at 16 million (2^24). With
		the new parameters, there's a danger that the very largest
		factorizations will hit this limit
	- If the demo program discovers that a previously found factor is
		composite, it will automatically recurse and factor it. The
		changes to implement this were so trivial it's embarrassing
		I didn't make them long ago
	- Added a tiny MPQS routine to factor inputs that are too small 
		(~25 digits or less) for the main QS code to handle 
		comfortably. This fixes a longstanding 'blind spot' for 
		19-21 digit factorizations
	- Allow sieving to stop after a specified number of relations.
		This is also a cheap way to force the combining phase to
		run before the requisite number of relations have been found
	- Added safety checks that allow the linear algebra to run even
		if the resulting matrix is known to be underdetermined.
		Factorizations for which this is the case are pretty much
		guaranteed to fail, but several people want to try finishing
		their factorizations early
	- Removed the bit about 'the best possible quadratic sieve code' from
		the readme. At least one person has interpreted this to mean
		I believe QS code can't get any better than msieve, which
		in fact I do not believe

Version 1.01: 7/22/05
	- Complete overhaul of the multiplier selection and factor base
		construction code. Now it's much cleaner and better 
		documented, much faster, and chooses better multipliers 
		on average. After four separate attempts to figure out
		how multipliers are supposed to work, I think this version
		finally gets it right
	- Force all polynomial 'a' values to have at least three primes.
		This is only an issue for very small factorizations
	- Fixed another bug in the polynomial selection, triggered only
		when the factor base is very small (thanks Washuu)
	- Corrected some typos in the readme

Version 1.0: 6/20/05
	- Increased the precision of the multiple precision library
		to 125 digits. Also added sieving parameters for 105-125
		digit inputs. These are not tested at all, but they have 
		to be better than using the 105-digit parameters everywhere
	- For 90+ digit inputs, use a trial division cutoff that is larger
		than the double large prime cutoff. This makes sieving
		a little slower, but the increase in partial relations found
		outweighs the slowdown. Assuming that a linear increase in
		partial relations makes the factorization linearly faster
		(this may not be true), tuning this step makes factorizations
		above 90 digits around 5% faster
	- Added an option to shut down gracefully after a specified 
		number of minutes
	- Only squarefree multipliers are allowed
	- Print the number of bits in all the sieving cutoffs
	- Added notification to screen if sieving completed
	- Modified demo.c to not depend on numbers ending in
		a carriage return when being read in
	- Fixed a stupid bug generating random seeds in linux;
		also use /dev/urandom instead of /dev/random
	- Catch SIGTERM as well as SIGINT
	- Cleaned up the wording on progress messages; hopefully the
		new wording won't confuse so many people about
		the nature of the relations being collected
	- Added mp.h back into msieve.h, so that the main structure
	  	automatically chooses the right size for an mp_t
	- Added build flags for more platforms to the makefile
	- Used the types from stdint.h to handle 32-bit vs 64-bit
	- Munged enough typedefs so things compile on AIX with xlc
	- Allow a multiplier of 1, even if the input is not 1 mod 4
	- Increased the trial factoring bound in mp_is_prime to 256
	- Added code to print the elapsed time even if the run was interrupted
	- Added readme

Version 0.88: 12/24/04
	- Moved all the core factorization code into a library and
		forked off a demo application to call it. Also
		built a lightweight API that hides the library internals
	- Encapsulated all the static data needed to perform a factorization
		into an msieve_obj struct; this removes all dependencies
		on global variables and makes the factorization library
		thread-safe
	- Made logging much more flexible: to file, to screen, both or neither
	- Removed the need for the roots of the large factor base primes to
		be in sorted order in one of the inner sieve loops;
		amazingly, the one branch that implemented the sort
		took over half the total runtime of poly initialization!
	- Store the precomputed values for initializing polynomials in
		two different formats: structure-of-arrays for the small FB
		primes and array-of-structures for the large FB primes.
		This makes moderate-size factorizations somewhat faster
	- Much more paranoia in the choice of random seeds
	- Reduced the size of a block of work when the L1 cache is smaller;
		hopefully this will speed up factorizations on Intel CPUs.
	- Added an extra check that the number of cycles found by the
		cycle finder not exceed the number of cycles expected.
		The absence of this check could have caused some 
		factorizations to crash. This appears to only be an issue
		with early versions of gcc 4.0.0 (see gcc bugzilla #19050)
	- Added powers of odd numbers to the list of small multipliers
		available; no sieving for these numbers is performed, so
		powers are okay now
	- Fixed several memory leaks that happen when performing small
		factorizations in batch mode
	- Count the number of digits in the input and in any factors found
	- If sieving is not actually happening when a Ctrl-C occurs,
		exit immediately
	- When reading relations from disk, make sure that a relation con-
		tains at most one -1 factor
	- Compute the elapsed time if QS is actually needed

Version 0.87: 12/11/04
	- Modified the sieve code to perform sieving for several
		polynomials simultaneously, and interleaved polynomial
		(re)initialization with the sieving. This allows cache 
		blocking of both the factor base and the sieve interval, 
		and makes poly initialization take almost zero time.
		This in turn allows extremely small sieve intervals, allowing
		smooth relations to be found faster. The end result is
		a big speedup in the sieve stage, 15-20% at least
	- Modified the hashtable-based portion of the sieving to generate
		more efficient compiled code; 1-2% speedup
	- Changed the savefile format again, to avoid printing out 'b' values
	- Fixed an extremely subtle bug involving a sieve offset of -1;
		this is a special case for SIQS but not MPQS
	- Removed the up-front verification pass; the worst that can
		happen is that the postprocessing will not find enough
		valid relations, so you'll just have to sieve some more
	- Performed an overhaul of out-of-date documentation
	- Add batch mode (again)
	- Added a makefile
	- Consume whitespace before parsing the input number
	- Correctly factor an input of zero

Version 0.86: 11/29/04
	- Modified the polynomial generation code to generate 'a'
		values as close as possible to the theoretical
		'best' value. Sieving is ~10% faster now, *and*
		finds more partial relations than before
	- The calculation of cutoff scores for trial factoring
		sieve values was completely wrong; it assumed (like
		QS did) that small sieve offsets yielded small polynomial
		values. Correcting this removes ~30% of calls to
		the trial factoring code
	- Compute the second trial factoring cutoff inside check_sieve_val();
		this makes it more likely that sieve values near a real
		root of the MPQS polynomial will not be thrown away
	- Changed all savefile writes to be manually buffered, so that
		disk access only happens in large blocks and is explicitly
		flushed. This *may* solve the mysterious problems people
		are seeing where relations in a large run get corrupted.
		It also makes small factorizations run much faster
	- Added code to verify every relation before sieving starts, and
		change the filtering stage to recompute numbers of cycles
		and relations from scratch
	- Make sure that lists of numbers read from the savefile are
		sorted in ascending order
	- Clamped the bound for single large primes at 32 bits. Someday
		this may become important

Version 0.85: 11/26/04
	- If any partial relations are duplicates, rebuild the graph
		of large primes and recompute the expected number of cycles. 
		This is the only way to survive a large number of duplicates
	- Slightly modify the cycle-finder to allow quitting even if
		some partial relations do not participate in cycles
	- Make sure that relations read from disk do not have too
		many factors (i.e. are corrupt)
	- Use the default cache size for the Pentium 4; using the correct
		cache size for this processor causes performance loss
	- Print timestamps to stdout and not stderr; apparently I'm the
		only one who likes seeing timestamps when output is
		redirected to a file
	- Print relation counts to stderr, and only print the last progress
		notification to stdout. This should give a nice capsule
		description if output is redirected, and keeps thousands
		of lines of progress notifications out of logs.
	
Version 0.84: 11/25/04
	- Cache-based optimizations: decoupled the small FB size
		from the sieve block size, made the sieve block size
		variable, added cache size detection for x86 processors
		(and hopefully accurate compile directives for PowerPCs)
	- Change the savefile scheme to only have one output file, containing
		relations and polynomials mixed together. This greatly
		simplifies parsing saved data; it also means that savefiles
		from different runs (or machines!) can be concatenated
		together and read in all at once. 
	- Added much more paranoia to the parsing of savefiles
	- Restart the linear algebra on any error in find_nonsigular_sub();
		these errors are a consequence of a bad random start
		point for the matrix code and just need a different
		start point to work
	- Added versions of malloc() and free() that align allocated memory
	- Fix a potential bug in the counting of partial relations just
		as the filtering stage is starting
	- Increased the minimum trial factoring bound slightly

Version 0.83: 11/23/04
	- Fixed a bug in the polynomial generation code that was
		triggered by the multiplier improvements of 0.82

Version 0.82: 11/22/04
	- Changed the multiplier computation to use the modified
		Knuth-Schroeppel algorithm (now that I finally found
		a paper that describes it)
	- For diagnostic purposes, print the version and the 
		seeds for the random number generator 
	- Remove some debug printouts, fix some typos

Version 0.81: 11/20/04
	- Added a preprocessing stage to the matrix code to remove
		singleton rows. This reduces the number of linear
		algebra failures for small inputs and eliminates
		(I hope) the inability to find dependencies at all
		for some factorizations
	- Restart the Lanczos iteration if not all columns are used
		between two iteration steps. This saves the program
		having to be rerun manually
	- Slightly modified some of the output text

Version 0.8: 11/19/04
	- Renamed the program to 'msieve'. The 'm' is for 'Michey'
	- Added double large prime support
	- Wholesale changes to support checkpoint and restart. Not
		only does this make things more crashworthy, it 
		drastically reduces memory consumption
	- Added block Lanczos for the linear algebra step. The Gauss
		elimination code was really huffing and puffing for
		big factorizations, especially since double large primes
		are making much more dense matrices 
	- Changed the way polynomials are stored, to reduce memory use
		during the final stage of factorization
	- Changed the trial division to use multiplication by
		reciprocals. This makes the trial division part
		of checking a sieve value ~20% faster, and is 
		especially important with double large primes
		because trial factoring happens much more often
	- Use built-in tables to compute both the initial trial factoring
		bound and the factor base bound. This avoids having to
		guess how many primes the trial factoring will need
		(previously the number used was a huge overestimate that
		wasted lots of time)
	- Moved all code that deals with relations (allocation, freeing,
		purging, cycle generation) to its own file
	- Used SQUFOF for inputs 19 digits or less after trial factoring;
		also fixed a bug in the polynomial selection code that
		caused 20-22 digit factorizations to needlessly fail
	- Fixed another bug in the polynomial selection code that only
		showed up for small factor bases
	- Collapsed all of the sieve initialization into a single
		routine again, and performed some long-needed cleanup
	- Changed the seeding of the random number generator to be different
		for every invocation of the program
	- Increased the size limit of an mp_t to 115 digits. It's presuming
		a bit much, but someone may need the extra room
	- In the square root phase, if a factor is found and it divides
		the multiplier then don't print it (it's part of the multiplier)
	- Fixed a silly bug in mp_bits, for zero inputs

Version 0.7: 10/23/04
	- Coded up the self-initializing variant of MPQS. 70-digit
		factorizations are 15% faster, and the speedup climbs
		to 50% for 90-digit factorizations. This was a lot
		trickier than I thought. Note that the minimum input
		size is now around 20 digits
	- Changed the small prime multiplier computation to penalize
		larger multipliers. The previous version only looked at
		the factor base and didn't account for the fact that a
		larger multiplier meant a larger number to factor
	- Finally added (basic) parameter tuning based on size of the
		input number. For smaller factorizations the result is
		up to twice as fast compared to a single set of parameters;
		for larger factorizations the difference is smaller
	- For all factors found, report the factor as prime, composite
		or probable prime
	- Removed the floating-point modular inverse. SIQS doesn't
		need it, and it was pretty kludgy to begin with
	- Big reorganization of the driver code; added capability to
		factor multiple integers in batch mode. Also changed
		the initial factor base calculations to save memory
	- The linear algebra phase now collects 64 dependencies 
		instead of 32. This increases the odds of a complete
		factorization, and the extra overhead of 64-bit arithmetic 
		on 32-bit platforms should be negligible
	- Replaced mp_isqrt with mp_iroot, and used the latter to 
		determine if the input is a perfect power
	- Added several casts to make the code 64-bit clean
	- Added a progress notification for every 500 full relations
		collected (this was long overdue)

Version 0.6: 10/9/04
	- Another overhaul of the low-level sieving code. This version
		uses a weird hashtable technique for most of the 
		factor base primes. The result is a dramatic speedup 
		in sieving and trial factoring of sieve values, and an 
		overall speedup of ~30% for 80 digit factorizations
	- I was wrong in thinking that it wouldn't make any difference
		to skip sieving with small primes. For 60-80 digit
		factorizations quite a bit of time is spent in the
		small prime phase of the sieve, and implementing the
		small prime variation gives a 5-10% overall speedup
	- replaced the use of mp_expo_1 in the polynomial initial-
		ization stage with a custom version that uses floating
		point. For big factorizations (>60 digits) this more
		than doubles the speed of the initialization, for an
		overall speedup approaching 20%
	- removed the sieve tuning subsystem; it just wasn't working
		any better than picking a single set of parameters
		for all factorizations
	- modified the factor base size calculation and the MPQS
		poly generation code to handle the case when the
		number to be factored is very small. The minimum
		size input for which the MPQS code works is ~12 digits
	- modified mp_divrem to use mp_divrem_1 whenever possible;
		this prevents a host of problems with small denominators
	- modified mp_gcd to use mp_mod_1 whenever possible;
		this is faster and prevents infinite loops
	- streamlined handling of the multiplier during the
		MPQS square root phase
	- fixed some compiler warnings about improper casts
	- switched to gcc's built-in prefetch intrinsic

Version 0.5: 7/10/04
	- massive overhaul of the entire sieving phase, including a
		top-level code reorganization, removing dependence
		on hard-coded constants, and multiple efficiency 
		optimizations. The result is cleaner, more robust,
		better documented and 20-30% faster
	- added an experimental tuning phase that estimates the sieving
		runtime and can in principle choose all sieving parameters
		simultaneously to optimize the sieving time. In practice
		it still needs some work.
	- added multipliers of 2, 6 and 10 to the list of multipliers 
		available. This required small changes to the MPQS init
		phase and to the square root phase
	- fixed several edge conditions in the handling of 0 or 1
		partial relations, that would otherwise crash
	- fixed the calculation of new sieve bailout values to avoid
		an infinite loop when only one more relation was needed
	- added more sanity-checking when generating random witnesses
		in mp_is_prime
	- inlined mp_expo_1 and removed the initial remainder
		operation, which is unnecessary if the base is
		about the same size as the modulus. This requires
		an extra normalization in the third case of 
		mp_legendre_1, but makes the MPQS initialization 
		stage slightly faster.
	- renamed mp_modmul to mp_modmul_1
	- split out the MPQS square root code into its own file

Version 0.4: 6/10/04
	- multiple polynomials at last! Even with no tuning, the
		speedup is incredible (5x for 50 digit numbers,
		10x for 60-70 digit numbers). The sieve code is simpler,
		and the trial division code is both simpler and faster.
	- added multiple-precision versions of mp_expo_1, mp_legendre_1,
		and mp_modsqrt_1. 
	- added mp_rand, mp_is_prime, mp_random_prime, mp_next_prime
	- forced the bound for trial division to a minimum of 10000
	- bail out if the input number has been completely factored
		after the trial division phase
	- fixed a bug in the second case of mp_legendre_1
	- fixed typos in comments

Version 0.3: 4/13/04
	- changed the basic sieving routines to use division-free
		arithmetic. This removes compiler-dependent code,
		removes artificial limits to the size of the sieving
		interval, removes the latency of 64-bit division for
		machines which don't support it directly, and is much
		faster (~15%) even on x86 machines with hardware
		64-bit divide.
	- added L1 and L2 cache blocking to the sieving routines. 
		Additional 15% speedup
	- major changes to how trial division on sieve values is performed.
		This removes the entire multiple-precision library from
		all critical paths in the program, and makes everything
		~10% faster. The speedup increases as the numbers to be
		factored, and the factor base, gets larger
	- added computation of a small prime multiplier in the initial
		stages of the program, to optimize the choice
		of factor bases. This can make a huge difference
		in performance (I've seen 25% speedups)
	- forced precompution of all of the primes needed in the 
		initial stages of the program. This makes trial 
		division and multiplier selection much faster
	- forced use of the dense Gaussian solver on any matrix smaller 
		than a cutoff size, even if the sparse solver can keep going
	- fixed an initialization bug in mp_mul_1
	- changed the inline asm for mp_modmul. The original version broke
		mp_modsqrt_1, which in turn broke (in a very obscure way)
		the trial division of sieve values
	- modified a loop bound in build_matrix to avoid a buffer overrun
		when the number of relations is much larger than required
	- moved freeing of the sieve array to point after its last use,
		instead of at the end of the program
	- added code to sort the factors found into ascending order; 
		otherwise compiler and OS specific details of qsort 
		cause the factors to be printed out in a different 
		order on different machines
	- packed several code groups in the square root phase
		into loops (makes for cleaner code)

Version 0.2: 3/29/04
	- fixed a silly bug that threw away half the depen-
		dencies in the square root phase
	- modified division routine to compute quotient
		correctly when num is larger than square
		of denom
	- used fixed division routine to make GCDs hugely
		faster when one operand is much smaller than
		the other
	- used faster GCDs to print out only prime factors
		that the program finds (rather than products of
		two or more prime factors)
	- included test that number to be factored is not 
		a perfect square
	- modified gcc inline assembly to put condition 
		codes in the clobber list
	- fixed many compiler warnings
	- used __inline for VC++
	- included some more header files to avoid
		missing function prototypes
	- cleaned up some typos in source code comments

Version 0.1: 3/24/04
	Initial release