Thursday, July 26, 2012

Is GJSieve "efficient?"

While it is true that this application is being developed primarily as a means of teaching C/C++ and demonstrating the usefulness of the MPIR fork of the GMP library, it also serves as a valuable insight into various practices of number theory.

In a previous post, it was suggested that perhaps trial division with an array of prime numbers alone was not the best way to find factors of an integer. To expand on that, allow me to posit that trial division is itself an inefficient means of factorizing numbers, at least in the way in which it is currently implemented in programs like GJSieve.

Currently, a number N is trial-divided using:

std::vector ta<99999>;

In this vector, each value is set to the previous value + 1, starting at 2. Thus, the vector progresses:

2, 3, 4, 5, 6, 7, 8, 9, 10, <...> 100001, 100002

The program divides the number N by ta[u]= trial where u=1; u++ in a "for" conditional. This increases while the number from the vector, styled trial, is less than 100002. 

Therefore, rather than using an array of primes only, we simply use an array of whole numbers.

But doesn't this seem like overkill? What if a number such as 3 or 5 turns out to be a factor? Why test every other number after that? 

In order to prevent this unnecessary arithmetic, we implement the trial division function such that it stores the remainder in a variable and then immediately follow with checks to see if the remainder is exactly zero or not.

If the remainder is zero, then trial is a factor. There is also a check to see if the trial number is equal to N - for instance, when testing 13, the first factor it finds is 13.

This was an issue in version 1.5.0, wherein practically every number was "composite" because it was a factor of itself and the program did not differentiate as it should have. But if we implement something like 

if (trial == N && remainder == 0)


gmp_printf ("%Zd %s", N, "could be prime!);


then every number less than 100002 will be "prime!"

The solution is this:
  1. after a trial division, check to see if the remainder is zero
  2. if it is not, then move on to the next trial number
  3. if the remainder IS zero, check to see if it is equivalent to N
  4. if it is equivalent to N, go to step 8
  5. if it is not equivalent to N, display that it is a factor
  6. divide N by that number to determine the other factor
  7. display both factors provided they are both whole numbers that divide N. Break.
  8. if the trial number is equivalent to N, this means no other trial numbers were factors and N is likely to be prime. Break.
The break statement is key. This prevents unnecessary trial divisions. Namely, it prevents the program from trying to divide smaller numbers by larger numbers - for instance, you don't want to be dividing 791 by 1551. Even though 791 is prime, the program will calculate a remainder of 791 every time unless it stops at either the first factor found or the point where the trial number equals N.

There IS, however, a problem here. What if N is greater than 100002? 

In most cases, it is - sometimes, exponentially. This is where trial division becomes viewed as inefficient. It would be impractical (and, indeed, impossible) to have a dynamic vector that would test every number up to N, especially if N has several hundred thousand digits! That, and no number is ever factored by a number one or a few less than it - with a select couple exceptions (ie 2 factors 4).

As was also mentioned in the previous post, it is more efficient to divide up to the point where the trial number = sqrt(N). 

A feature like this is slated for implementation into GJSieve,  but will likely not appear for quite some time. There are numerous things to take into account, such as the fact that square roots are rarely whole numbers (and when they are, the number is very likely to be composite). Also, how does one dynamically grow an array or vector? Making a new one for each new N seems to make sense, but would probably end up using a lot of memory. 

This is all part of the deterministic process, though. We'll keep you updated!


No comments:

Post a Comment