[openssl-commits] [openssl] OpenSSL_1_0_2-stable update

Tue Mar 1 13:57:01 UTC 2016

The branch OpenSSL_1_0_2-stable has been updated
       via  a5006916587ef6b3969ec4d42542cd64c5230f3a (commit)
       via  902f3f50d051dfd6ebf009d352aaf581195caabf (commit)
       via  45e53cf88104fd8c3ae031691e5ba49e5820bef6 (commit)
       via  08d0ff54d0f49bbf836f081e7a8012fef090dc95 (commit)
       via  248808c8406c113d00ab45368ab03bfa66411d00 (commit)
       via  515f3be47a0b58eec808cf365bc5e8ef6917266b (commit)
       via  25d14c6c29b53907bf614b9964d43cd98401a7fc (commit)
       via  08ea966c01a39e38ef89e8920d53085e4807a43a (commit)
       via  ef98503eeef5c108018081ace902d28e609f7772 (commit)
       via  708dc2f1291e104fe4eef810bb8ffc1fae5b19c1 (commit)
       via  bc38a7d2d3c6082163c50ddf99464736110f2000 (commit)
       via  1b1d8ae49a41c89a33d9902fc7304cf8accc3f67 (commit)
       via  021fb42dd0cf2bf985b0e26ca50418eb42c00d09 (commit)
       via  9dfd2be8a1761fffd152a92d8f1b356ad667eea7 (commit)
      from  c175308407858afff3fc8c2e5e085d94d12edc7d (commit)


- Log -----------------------------------------------------------------
commit a5006916587ef6b3969ec4d42542cd64c5230f3a
Author: Matt Caswell <matt at openssl.org>
Date:   Tue Mar 1 13:37:56 2016 +0000

    Prepare for 1.0.2h-dev
    
    Reviewed-by: Richard Levitte <levitte at openssl.org>

commit 902f3f50d051dfd6ebf009d352aaf581195caabf
Author: Matt Caswell <matt at openssl.org>
Date:   Tue Mar 1 13:36:54 2016 +0000

    Prepare for 1.0.2g release
    
    Reviewed-by: Richard Levitte <levitte at openssl.org>

commit 45e53cf88104fd8c3ae031691e5ba49e5820bef6
Author: Matt Caswell <matt at openssl.org>
Date:   Tue Mar 1 13:36:54 2016 +0000

    make update
    
    Reviewed-by: Richard Levitte <levitte at openssl.org>

commit 08d0ff54d0f49bbf836f081e7a8012fef090dc95
Author: Matt Caswell <matt at openssl.org>
Date:   Tue Mar 1 12:08:33 2016 +0000

    Ensure mk1mf.pl is aware of no-weak-ssl-ciphers option
    
    Update mk1mf.pl to properly handle no-weak-ssl-ciphers
    
    Reviewed-by: Richard Levitte <levitte at openssl.org>

commit 248808c8406c113d00ab45368ab03bfa66411d00
Author: Matt Caswell <matt at openssl.org>
Date:   Tue Mar 1 11:00:48 2016 +0000

    Update CHANGES and NEWS for new release
    
    Reviewed-by: Richard Levitte <levitte at openssl.org>

commit 515f3be47a0b58eec808cf365bc5e8ef6917266b
Author: Andy Polyakov <appro at openssl.org>
Date:   Tue Jan 26 16:50:10 2016 +0100

    bn/asm/x86_64-mont5.pl: unify gather procedure in hardly used path
    and reorganize/harmonize post-conditions.
    
    Additional hardening following on from CVE-2016-0702
    
    Reviewed-by: Richard Levitte <levitte at openssl.org>
    Reviewed-by: Rich Salz <rsalz at openssl.org>
    (cherry picked from master)

commit 25d14c6c29b53907bf614b9964d43cd98401a7fc
Author: Andy Polyakov <appro at openssl.org>
Date:   Mon Jan 25 23:41:01 2016 +0100

    crypto/bn/x86_64-mont5.pl: constant-time gather procedure.
    
    At the same time remove miniscule bias in final subtraction.
    Performance penalty varies from platform to platform, and even with
    key length. For rsa2048 sign it was observed to be 4% for Sandy
    Bridge and 7% on Broadwell.
    
    CVE-2016-0702
    
    Reviewed-by: Richard Levitte <levitte at openssl.org>
    Reviewed-by: Rich Salz <rsalz at openssl.org>
    (cherry picked from master)

commit 08ea966c01a39e38ef89e8920d53085e4807a43a
Author: Andy Polyakov <appro at openssl.org>
Date:   Mon Jan 25 23:25:40 2016 +0100

    bn/asm/rsaz-avx2.pl: constant-time gather procedure.
    
    Performance penalty is 2%.
    
    CVE-2016-0702
    
    Reviewed-by: Richard Levitte <levitte at openssl.org>
    Reviewed-by: Rich Salz <rsalz at openssl.org>
    (cherry picked from master)

commit ef98503eeef5c108018081ace902d28e609f7772
Author: Andy Polyakov <appro at openssl.org>
Date:   Mon Jan 25 23:06:45 2016 +0100

    bn/asm/rsax-x86_64.pl: constant-time gather procedure.
    
    Performance penalty is 2% on Linux and 5% on Windows.
    
    CVE-2016-0702
    
    Reviewed-by: Richard Levitte <levitte at openssl.org>
    Reviewed-by: Rich Salz <rsalz at openssl.org>
    (cherry picked from master)

commit 708dc2f1291e104fe4eef810bb8ffc1fae5b19c1
Author: Andy Polyakov <appro at openssl.org>
Date:   Mon Jan 25 20:38:38 2016 +0100

    bn/bn_exp.c: constant-time MOD_EXP_CTIME_COPY_FROM_PREBUF.
    
    Performance penalty varies from platform to platform, and even
    key length. For rsa2048 sign it was observed to reach almost 10%.
    
    CVE-2016-0702
    
    Reviewed-by: Richard Levitte <levitte at openssl.org>
    Reviewed-by: Rich Salz <rsalz at openssl.org>
    (cherry picked from master)
    
    Resolved conflicts:
    	crypto/bn/bn_exp.c

commit bc38a7d2d3c6082163c50ddf99464736110f2000
Author: Viktor Dukhovni <openssl-users at dukhovni.org>
Date:   Fri Feb 19 13:05:11 2016 -0500

    Disable EXPORT and LOW SSLv3+ ciphers by default
    
    Reviewed-by: Emilia Käsper <emilia at openssl.org>

commit 1b1d8ae49a41c89a33d9902fc7304cf8accc3f67
Author: Matt Caswell <matt at openssl.org>
Date:   Fri Feb 19 11:38:25 2016 +0000

    Add a test for SSLv2 configuration
    
    SSLv2 should be off by default. You can only turn it on if you have called
    SSL_CTX_clear_options(SSL_OP_NO_SSLv2) or
    SSL_clear_options(SSL_OP_NO_SSLv2). You should not be able to inadvertantly
    turn it on again via SSL_CONF without having done that first.
    
    Reviewed-by: Emilia Käsper <emilia at openssl.org>

commit 021fb42dd0cf2bf985b0e26ca50418eb42c00d09
Author: Viktor Dukhovni <openssl-users at dukhovni.org>
Date:   Wed Feb 17 23:38:55 2016 -0500

    Bring SSL method documentation up to date
    
    Reviewed-by: Emilia Käsper <emilia at openssl.org>

commit 9dfd2be8a1761fffd152a92d8f1b356ad667eea7
Author: Viktor Dukhovni <openssl-users at dukhovni.org>
Date:   Wed Feb 17 21:07:48 2016 -0500

    Disable SSLv2 default build, default negotiation and weak ciphers.
    
    SSLv2 is by default disabled at build-time.  Builds that are not
    configured with "enable-ssl2" will not support SSLv2.  Even if
    "enable-ssl2" is used, users who want to negotiate SSLv2 via the
    version-flexible SSLv23_method() will need to explicitly call either
    of:
    
        SSL_CTX_clear_options(ctx, SSL_OP_NO_SSLv2);
    or
        SSL_clear_options(ssl, SSL_OP_NO_SSLv2);
    
    as appropriate.  Even if either of those is used, or the application
    explicitly uses the version-specific SSLv2_method() or its client
    or server variants, SSLv2 ciphers vulnerable to exhaustive search
    key recovery have been removed.  Specifically, the SSLv2 40-bit
    EXPORT ciphers, and SSLv2 56-bit DES are no longer available.
    
    Mitigation for CVE-2016-0800
    
    Reviewed-by: Emilia Käsper <emilia at openssl.org>

-----------------------------------------------------------------------

Summary of changes:
 CHANGES                         |  111 +++-
 Configure                       |    8 +-
 NEWS                            |   15 +-
 README                          |    2 +-
 crypto/bn/Makefile              |    4 +-
 crypto/bn/asm/rsaz-avx2.pl      |  219 ++++---
 crypto/bn/asm/rsaz-x86_64.pl    |  375 +++++++++---
 crypto/bn/asm/x86_64-mont.pl    |  227 ++++---
 crypto/bn/asm/x86_64-mont5.pl   | 1276 ++++++++++++++++++++++-----------------
 crypto/bn/bn_exp.c              |  103 +++-
 crypto/opensslv.h               |    6 +-
 doc/apps/ciphers.pod            |   59 +-
 doc/apps/s_client.pod           |   12 +-
 doc/apps/s_server.pod           |    8 +-
 doc/ssl/SSL_CONF_cmd.pod        |   33 +-
 doc/ssl/SSL_CTX_new.pod         |  168 ++++--
 doc/ssl/SSL_CTX_set_options.pod |   10 +
 doc/ssl/ssl.pod                 |   77 ++-
 openssl.spec                    |    2 +-
 ssl/Makefile                    |   69 ++-
 ssl/s2_lib.c                    |    6 +
 ssl/s3_lib.c                    |   54 ++
 ssl/ssl_conf.c                  |   10 +-
 ssl/ssl_lib.c                   |    7 +
 ssl/sslv2conftest.c             |  231 +++++++
 test/Makefile                   |   35 +-
 util/mk1mf.pl                   |    2 +
 27 files changed, 2113 insertions(+), 1016 deletions(-)
 create mode 100644 ssl/sslv2conftest.c

diff --git a/CHANGES b/CHANGES
index 26a0291..0e3d70e 100644
--- a/CHANGES
+++ b/CHANGES
@@ -2,7 +2,46 @@
  OpenSSL CHANGES
  _______________
 
- Changes between 1.0.2f and 1.0.2g [xx XXX xxxx]
+ Changes between 1.0.2g and 1.0.2h [xx XXX xxxx]
+
+  *)
+
+ Changes between 1.0.2f and 1.0.2g [1 Mar 2016]
+
+  * Disable weak ciphers in SSLv3 and up in default builds of OpenSSL.
+    Builds that are not configured with "enable-weak-ssl-ciphers" will not
+    provide any "EXPORT" or "LOW" strength ciphers.
+    [Viktor Dukhovni]
+
+  * Disable SSLv2 default build, default negotiation and weak ciphers.  SSLv2
+    is by default disabled at build-time.  Builds that are not configured with
+    "enable-ssl2" will not support SSLv2.  Even if "enable-ssl2" is used,
+    users who want to negotiate SSLv2 via the version-flexible SSLv23_method()
+    will need to explicitly call either of:
+
+        SSL_CTX_clear_options(ctx, SSL_OP_NO_SSLv2);
+    or
+        SSL_clear_options(ssl, SSL_OP_NO_SSLv2);
+
+    as appropriate.  Even if either of those is used, or the application
+    explicitly uses the version-specific SSLv2_method() or its client and
+    server variants, SSLv2 ciphers vulnerable to exhaustive search key
+    recovery have been removed.  Specifically, the SSLv2 40-bit EXPORT
+    ciphers, and SSLv2 56-bit DES are no longer available.
+    (CVE-2016-0800)
+    [Viktor Dukhovni]
+
+  *) Fix a double-free in DSA code
+
+     A double free bug was discovered when OpenSSL parses malformed DSA private
+     keys and could lead to a DoS attack or memory corruption for applications
+     that receive DSA private keys from untrusted sources.  This scenario is
+     considered rare.
+
+     This issue was reported to OpenSSL by Adam Langley(Google/BoringSSL) using
+     libFuzzer.
+     (CVE-2016-0705)
+     [Stephen Henson]
 
   *) Disable SRP fake user seed to address a server memory leak.
 
@@ -23,6 +62,76 @@
      (CVE-2016-0798)
      [Emilia Käsper]
 
+  *) Fix BN_hex2bn/BN_dec2bn NULL pointer deref/heap corruption
+
+     In the BN_hex2bn function the number of hex digits is calculated using an
+     int value |i|. Later |bn_expand| is called with a value of |i * 4|. For
+     large values of |i| this can result in |bn_expand| not allocating any
+     memory because |i * 4| is negative. This can leave the internal BIGNUM data
+     field as NULL leading to a subsequent NULL ptr deref. For very large values
+     of |i|, the calculation |i * 4| could be a positive value smaller than |i|.
+     In this case memory is allocated to the internal BIGNUM data field, but it
+     is insufficiently sized leading to heap corruption. A similar issue exists
+     in BN_dec2bn. This could have security consequences if BN_hex2bn/BN_dec2bn
+     is ever called by user applications with very large untrusted hex/dec data.
+     This is anticipated to be a rare occurrence.
+
+     All OpenSSL internal usage of these functions use data that is not expected
+     to be untrusted, e.g. config file data or application command line
+     arguments. If user developed applications generate config file data based
+     on untrusted data then it is possible that this could also lead to security
+     consequences. This is also anticipated to be rare.
+
+     This issue was reported to OpenSSL by Guido Vranken.
+     (CVE-2016-0797)
+     [Matt Caswell]
+
+  *) Fix memory issues in BIO_*printf functions
+
+     The internal |fmtstr| function used in processing a "%s" format string in
+     the BIO_*printf functions could overflow while calculating the length of a
+     string and cause an OOB read when printing very long strings.
+
+     Additionally the internal |doapr_outch| function can attempt to write to an
+     OOB memory location (at an offset from the NULL pointer) in the event of a
+     memory allocation failure. In 1.0.2 and below this could be caused where
+     the size of a buffer to be allocated is greater than INT_MAX. E.g. this
+     could be in processing a very long "%s" format string. Memory leaks can
+     also occur.
+
+     The first issue may mask the second issue dependent on compiler behaviour.
+     These problems could enable attacks where large amounts of untrusted data
+     is passed to the BIO_*printf functions. If applications use these functions
+     in this way then they could be vulnerable. OpenSSL itself uses these
+     functions when printing out human-readable dumps of ASN.1 data. Therefore
+     applications that print this data could be vulnerable if the data is from
+     untrusted sources. OpenSSL command line applications could also be
+     vulnerable where they print out ASN.1 data, or if untrusted data is passed
+     as command line arguments.
+
+     Libssl is not considered directly vulnerable. Additionally certificates etc
+     received via remote connections via libssl are also unlikely to be able to
+     trigger these issues because of message size limits enforced within libssl.
+
+     This issue was reported to OpenSSL Guido Vranken.
+     (CVE-2016-0799)
+     [Matt Caswell]
+
+  *) Side channel attack on modular exponentiation
+
+     A side-channel attack was found which makes use of cache-bank conflicts on
+     the Intel Sandy-Bridge microarchitecture which could lead to the recovery
+     of RSA keys.  The ability to exploit this issue is limited as it relies on
+     an attacker who has control of code in a thread running on the same
+     hyper-threaded core as the victim thread which is performing decryptions.
+
+     This issue was reported to OpenSSL by Yuval Yarom, The University of
+     Adelaide and NICTA, Daniel Genkin, Technion and Tel Aviv University, and
+     Nadia Heninger, University of Pennsylvania with more information at
+     http://cachebleed.info.
+     (CVE-2016-0702)
+     [Andy Polyakov]
+
   *) Change the req app to generate a 2048-bit RSA/DSA key by default,
      if no keysize is specified with default_bits. This fixes an
      omission in an earlier change that changed all RSA/DSA key generation
diff --git a/Configure b/Configure
index 4a715dc..c98107a 100755
--- a/Configure
+++ b/Configure
@@ -58,6 +58,10 @@ my $usage="Usage: Configure [no-<cipher> ...] [enable-<cipher> ...] [experimenta
 #		library and will be loaded in run-time by the OpenSSL library.
 # sctp          include SCTP support
 # 386           generate 80386 code
+# enable-weak-ssl-ciphers
+#		Enable EXPORT and LOW SSLv3 ciphers that are disabled by
+#		default.  Note, weak SSLv2 ciphers are unconditionally
+#		disabled.
 # no-sse2	disables IA-32 SSE2 code, above option implies no-sse2
 # no-<cipher>   build without specified algorithm (rsa, idea, rc5, ...)
 # -<xxx> +<xxx> compiler options are passed through 
@@ -781,11 +785,13 @@ my %disabled = ( # "what"         => "comment" [or special keyword "experimental
 		 "md2"            => "default",
 		 "rc5"            => "default",
 		 "rfc3779"	  => "default",
-		 "sctp"       => "default",
+		 "sctp"           => "default",
 		 "shared"         => "default",
 		 "ssl-trace"	  => "default",
+		 "ssl2"           => "default",
 		 "store"	  => "experimental",
 		 "unit-test"	  => "default",
+		 "weak-ssl-ciphers" => "default",
 		 "zlib"           => "default",
 		 "zlib-dynamic"   => "default"
 	       );
diff --git a/NEWS b/NEWS
index c596993..4737636 100644
--- a/NEWS
+++ b/NEWS
@@ -5,10 +5,23 @@
   This file gives a brief overview of the major changes between each OpenSSL
   release. For more details please read the CHANGES file.
 
-  Major changes between OpenSSL 1.0.2f and OpenSSL 1.0.2g [under development]
+  Major changes between OpenSSL 1.0.2g and OpenSSL 1.0.2h [under development]
 
       o
 
+  Major changes between OpenSSL 1.0.2f and OpenSSL 1.0.2g [1 Mar 2016]
+
+      o Disable weak ciphers in SSLv3 and up in default builds of OpenSSL.
+      o Disable SSLv2 default build, default negotiation and weak ciphers
+        (CVE-2016-0800)
+      o Fix a double-free in DSA code (CVE-2016-0705)
+      o Disable SRP fake user seed to address a server memory leak
+        (CVE-2016-0798)
+      o Fix BN_hex2bn/BN_dec2bn NULL pointer deref/heap corruption
+        (CVE-2016-0797)
+      o Fix memory issues in BIO_*printf functions (CVE-2016-0799)
+      o Fix side channel attack on modular exponentiation (CVE-2016-0702)
+
   Major changes between OpenSSL 1.0.2e and OpenSSL 1.0.2f [28 Jan 2016]
 
       o DH small subgroups (CVE-2016-0701)
diff --git a/README b/README
index 200678b..bb2e4c6 100644
--- a/README
+++ b/README
@@ -1,5 +1,5 @@
 
- OpenSSL 1.0.2g-dev
+ OpenSSL 1.0.2h-dev
 
  Copyright (c) 1998-2015 The OpenSSL Project
  Copyright (c) 1995-1998 Eric A. Young, Tim J. Hudson
diff --git a/crypto/bn/Makefile b/crypto/bn/Makefile
index 215855e..c4c6409 100644
--- a/crypto/bn/Makefile
+++ b/crypto/bn/Makefile
@@ -252,8 +252,8 @@ bn_exp.o: ../../include/openssl/e_os2.h ../../include/openssl/err.h
 bn_exp.o: ../../include/openssl/lhash.h ../../include/openssl/opensslconf.h
 bn_exp.o: ../../include/openssl/opensslv.h ../../include/openssl/ossl_typ.h
 bn_exp.o: ../../include/openssl/safestack.h ../../include/openssl/stack.h
-bn_exp.o: ../../include/openssl/symhacks.h ../cryptlib.h bn_exp.c bn_lcl.h
-bn_exp.o: rsaz_exp.h
+bn_exp.o: ../../include/openssl/symhacks.h ../constant_time_locl.h
+bn_exp.o: ../cryptlib.h bn_exp.c bn_lcl.h rsaz_exp.h
 bn_exp2.o: ../../e_os.h ../../include/openssl/bio.h ../../include/openssl/bn.h
 bn_exp2.o: ../../include/openssl/buffer.h ../../include/openssl/crypto.h
 bn_exp2.o: ../../include/openssl/e_os2.h ../../include/openssl/err.h
diff --git a/crypto/bn/asm/rsaz-avx2.pl b/crypto/bn/asm/rsaz-avx2.pl
index 3b6ccf8..712a77f 100755
--- a/crypto/bn/asm/rsaz-avx2.pl
+++ b/crypto/bn/asm/rsaz-avx2.pl
@@ -443,7 +443,7 @@ $TEMP2 = $B2;
 $TEMP3 = $Y1;
 $TEMP4 = $Y2;
 $code.=<<___;
-	#we need to fix indexes 32-39 to avoid overflow
+	# we need to fix indices 32-39 to avoid overflow
 	vmovdqu		32*8(%rsp), $ACC8		# 32*8-192($tp0),
 	vmovdqu		32*9(%rsp), $ACC1		# 32*9-192($tp0)
 	vmovdqu		32*10(%rsp), $ACC2		# 32*10-192($tp0)
@@ -1592,68 +1592,128 @@ rsaz_1024_scatter5_avx2:
 .type	rsaz_1024_gather5_avx2,\@abi-omnipotent
 .align	32
 rsaz_1024_gather5_avx2:
+	vzeroupper
+	mov	%rsp,%r11
 ___
 $code.=<<___ if ($win64);
 	lea	-0x88(%rsp),%rax
-	vzeroupper
 .LSEH_begin_rsaz_1024_gather5:
 	# I can't trust assembler to use specific encoding:-(
-	.byte	0x48,0x8d,0x60,0xe0		#lea	-0x20(%rax),%rsp
-	.byte	0xc5,0xf8,0x29,0x70,0xe0	#vmovaps %xmm6,-0x20(%rax)
-	.byte	0xc5,0xf8,0x29,0x78,0xf0	#vmovaps %xmm7,-0x10(%rax)
-	.byte	0xc5,0x78,0x29,0x40,0x00	#vmovaps %xmm8,0(%rax)
-	.byte	0xc5,0x78,0x29,0x48,0x10	#vmovaps %xmm9,0x10(%rax)
-	.byte	0xc5,0x78,0x29,0x50,0x20	#vmovaps %xmm10,0x20(%rax)
-	.byte	0xc5,0x78,0x29,0x58,0x30	#vmovaps %xmm11,0x30(%rax)
-	.byte	0xc5,0x78,0x29,0x60,0x40	#vmovaps %xmm12,0x40(%rax)
-	.byte	0xc5,0x78,0x29,0x68,0x50	#vmovaps %xmm13,0x50(%rax)
-	.byte	0xc5,0x78,0x29,0x70,0x60	#vmovaps %xmm14,0x60(%rax)
-	.byte	0xc5,0x78,0x29,0x78,0x70	#vmovaps %xmm15,0x70(%rax)
+	.byte	0x48,0x8d,0x60,0xe0		# lea	-0x20(%rax),%rsp
+	.byte	0xc5,0xf8,0x29,0x70,0xe0	# vmovaps %xmm6,-0x20(%rax)
+	.byte	0xc5,0xf8,0x29,0x78,0xf0	# vmovaps %xmm7,-0x10(%rax)
+	.byte	0xc5,0x78,0x29,0x40,0x00	# vmovaps %xmm8,0(%rax)
+	.byte	0xc5,0x78,0x29,0x48,0x10	# vmovaps %xmm9,0x10(%rax)
+	.byte	0xc5,0x78,0x29,0x50,0x20	# vmovaps %xmm10,0x20(%rax)
+	.byte	0xc5,0x78,0x29,0x58,0x30	# vmovaps %xmm11,0x30(%rax)
+	.byte	0xc5,0x78,0x29,0x60,0x40	# vmovaps %xmm12,0x40(%rax)
+	.byte	0xc5,0x78,0x29,0x68,0x50	# vmovaps %xmm13,0x50(%rax)
+	.byte	0xc5,0x78,0x29,0x70,0x60	# vmovaps %xmm14,0x60(%rax)
+	.byte	0xc5,0x78,0x29,0x78,0x70	# vmovaps %xmm15,0x70(%rax)
 ___
 $code.=<<___;
-	lea	.Lgather_table(%rip),%r11
-	mov	$power,%eax
-	and	\$3,$power
-	shr	\$2,%eax			# cache line number
-	shl	\$4,$power			# offset within cache line
-
-	vmovdqu		-32(%r11),%ymm7		# .Lgather_permd
-	vpbroadcastb	8(%r11,%rax), %xmm8
-	vpbroadcastb	7(%r11,%rax), %xmm9
-	vpbroadcastb	6(%r11,%rax), %xmm10
-	vpbroadcastb	5(%r11,%rax), %xmm11
-	vpbroadcastb	4(%r11,%rax), %xmm12
-	vpbroadcastb	3(%r11,%rax), %xmm13
-	vpbroadcastb	2(%r11,%rax), %xmm14
-	vpbroadcastb	1(%r11,%rax), %xmm15
-
-	lea	64($inp,$power),$inp
-	mov	\$64,%r11			# size optimization
-	mov	\$9,%eax
-	jmp	.Loop_gather_1024
+	lea	-0x100(%rsp),%rsp
+	and	\$-32, %rsp
+	lea	.Linc(%rip), %r10
+	lea	-128(%rsp),%rax			# control u-op density
+
+	vmovd		$power, %xmm4
+	vmovdqa		(%r10),%ymm0
+	vmovdqa		32(%r10),%ymm1
+	vmovdqa		64(%r10),%ymm5
+	vpbroadcastd	%xmm4,%ymm4
+
+	vpaddd		%ymm5, %ymm0, %ymm2
+	vpcmpeqd	%ymm4, %ymm0, %ymm0
+	vpaddd		%ymm5, %ymm1, %ymm3
+	vpcmpeqd	%ymm4, %ymm1, %ymm1
+	vmovdqa		%ymm0, 32*0+128(%rax)
+	vpaddd		%ymm5, %ymm2, %ymm0
+	vpcmpeqd	%ymm4, %ymm2, %ymm2
+	vmovdqa		%ymm1, 32*1+128(%rax)
+	vpaddd		%ymm5, %ymm3, %ymm1
+	vpcmpeqd	%ymm4, %ymm3, %ymm3
+	vmovdqa		%ymm2, 32*2+128(%rax)
+	vpaddd		%ymm5, %ymm0, %ymm2
+	vpcmpeqd	%ymm4, %ymm0, %ymm0
+	vmovdqa		%ymm3, 32*3+128(%rax)
+	vpaddd		%ymm5, %ymm1, %ymm3
+	vpcmpeqd	%ymm4, %ymm1, %ymm1
+	vmovdqa		%ymm0, 32*4+128(%rax)
+	vpaddd		%ymm5, %ymm2, %ymm8
+	vpcmpeqd	%ymm4, %ymm2, %ymm2
+	vmovdqa		%ymm1, 32*5+128(%rax)
+	vpaddd		%ymm5, %ymm3, %ymm9
+	vpcmpeqd	%ymm4, %ymm3, %ymm3
+	vmovdqa		%ymm2, 32*6+128(%rax)
+	vpaddd		%ymm5, %ymm8, %ymm10
+	vpcmpeqd	%ymm4, %ymm8, %ymm8
+	vmovdqa		%ymm3, 32*7+128(%rax)
+	vpaddd		%ymm5, %ymm9, %ymm11
+	vpcmpeqd	%ymm4, %ymm9, %ymm9
+	vpaddd		%ymm5, %ymm10, %ymm12
+	vpcmpeqd	%ymm4, %ymm10, %ymm10
+	vpaddd		%ymm5, %ymm11, %ymm13
+	vpcmpeqd	%ymm4, %ymm11, %ymm11
+	vpaddd		%ymm5, %ymm12, %ymm14
+	vpcmpeqd	%ymm4, %ymm12, %ymm12
+	vpaddd		%ymm5, %ymm13, %ymm15
+	vpcmpeqd	%ymm4, %ymm13, %ymm13
+	vpcmpeqd	%ymm4, %ymm14, %ymm14
+	vpcmpeqd	%ymm4, %ymm15, %ymm15
+
+	vmovdqa	-32(%r10),%ymm7			# .Lgather_permd
+	lea	128($inp), $inp
+	mov	\$9,$power
 
-.align	32
 .Loop_gather_1024:
-	vpand		-64($inp),		%xmm8,%xmm0
-	vpand		($inp),			%xmm9,%xmm1
-	vpand		64($inp),		%xmm10,%xmm2
-	vpand		($inp,%r11,2),		%xmm11,%xmm3
-	 vpor					%xmm0,%xmm1,%xmm1
-	vpand		64($inp,%r11,2),	%xmm12,%xmm4
-	 vpor					%xmm2,%xmm3,%xmm3
-	vpand		($inp,%r11,4),		%xmm13,%xmm5
-	 vpor					%xmm1,%xmm3,%xmm3
-	vpand		64($inp,%r11,4),	%xmm14,%xmm6
-	 vpor					%xmm4,%xmm5,%xmm5
-	vpand		-128($inp,%r11,8),	%xmm15,%xmm2
-	lea		($inp,%r11,8),$inp
-	 vpor					%xmm3,%xmm5,%xmm5
-	 vpor					%xmm2,%xmm6,%xmm6
-	 vpor					%xmm5,%xmm6,%xmm6
-	vpermd		%ymm6,%ymm7,%ymm6
-	vmovdqu		%ymm6,($out)
+	vmovdqa		32*0-128($inp),	%ymm0
+	vmovdqa		32*1-128($inp),	%ymm1
+	vmovdqa		32*2-128($inp),	%ymm2
+	vmovdqa		32*3-128($inp),	%ymm3
+	vpand		32*0+128(%rax),	%ymm0,	%ymm0
+	vpand		32*1+128(%rax),	%ymm1,	%ymm1
+	vpand		32*2+128(%rax),	%ymm2,	%ymm2
+	vpor		%ymm0, %ymm1, %ymm4
+	vpand		32*3+128(%rax),	%ymm3,	%ymm3
+	vmovdqa		32*4-128($inp),	%ymm0
+	vmovdqa		32*5-128($inp),	%ymm1
+	vpor		%ymm2, %ymm3, %ymm5
+	vmovdqa		32*6-128($inp),	%ymm2
+	vmovdqa		32*7-128($inp),	%ymm3
+	vpand		32*4+128(%rax),	%ymm0,	%ymm0
+	vpand		32*5+128(%rax),	%ymm1,	%ymm1
+	vpand		32*6+128(%rax),	%ymm2,	%ymm2
+	vpor		%ymm0, %ymm4, %ymm4
+	vpand		32*7+128(%rax),	%ymm3,	%ymm3
+	vpand		32*8-128($inp),	%ymm8,	%ymm0
+	vpor		%ymm1, %ymm5, %ymm5
+	vpand		32*9-128($inp),	%ymm9,	%ymm1
+	vpor		%ymm2, %ymm4, %ymm4
+	vpand		32*10-128($inp),%ymm10,	%ymm2
+	vpor		%ymm3, %ymm5, %ymm5
+	vpand		32*11-128($inp),%ymm11,	%ymm3
+	vpor		%ymm0, %ymm4, %ymm4
+	vpand		32*12-128($inp),%ymm12,	%ymm0
+	vpor		%ymm1, %ymm5, %ymm5
+	vpand		32*13-128($inp),%ymm13,	%ymm1
+	vpor		%ymm2, %ymm4, %ymm4
+	vpand		32*14-128($inp),%ymm14,	%ymm2
+	vpor		%ymm3, %ymm5, %ymm5
+	vpand		32*15-128($inp),%ymm15,	%ymm3
+	lea		32*16($inp), $inp
+	vpor		%ymm0, %ymm4, %ymm4
+	vpor		%ymm1, %ymm5, %ymm5
+	vpor		%ymm2, %ymm4, %ymm4
+	vpor		%ymm3, %ymm5, %ymm5
+
+	vpor		%ymm5, %ymm4, %ymm4
+	vextracti128	\$1, %ymm4, %xmm5	# upper half is cleared
+	vpor		%xmm4, %xmm5, %xmm5
+	vpermd		%ymm5,%ymm7,%ymm5
+	vmovdqu		%ymm5,($out)
 	lea		32($out),$out
-	dec	%eax
+	dec	$power
 	jnz	.Loop_gather_1024
 
 	vpxor	%ymm0,%ymm0,%ymm0
@@ -1661,20 +1721,20 @@ $code.=<<___;
 	vzeroupper
 ___
 $code.=<<___ if ($win64);
-	movaps	(%rsp),%xmm6
-	movaps	0x10(%rsp),%xmm7
-	movaps	0x20(%rsp),%xmm8
-	movaps	0x30(%rsp),%xmm9
-	movaps	0x40(%rsp),%xmm10
-	movaps	0x50(%rsp),%xmm11
-	movaps	0x60(%rsp),%xmm12
-	movaps	0x70(%rsp),%xmm13
-	movaps	0x80(%rsp),%xmm14
-	movaps	0x90(%rsp),%xmm15
-	lea	0xa8(%rsp),%rsp
+	movaps	-0xa8(%r11),%xmm6
+	movaps	-0x98(%r11),%xmm7
+	movaps	-0x88(%r11),%xmm8
+	movaps	-0x78(%r11),%xmm9
+	movaps	-0x68(%r11),%xmm10
+	movaps	-0x58(%r11),%xmm11
+	movaps	-0x48(%r11),%xmm12
+	movaps	-0x38(%r11),%xmm13
+	movaps	-0x28(%r11),%xmm14
+	movaps	-0x18(%r11),%xmm15
 .LSEH_end_rsaz_1024_gather5:
 ___
 $code.=<<___;
+	lea	(%r11),%rsp
 	ret
 .size	rsaz_1024_gather5_avx2,.-rsaz_1024_gather5_avx2
 ___
@@ -1708,8 +1768,10 @@ $code.=<<___;
 	.long	0,2,4,6,7,7,7,7
 .Lgather_permd:
 	.long	0,7,1,7,2,7,3,7
-.Lgather_table:
-	.byte	0,0,0,0,0,0,0,0, 0xff,0,0,0,0,0,0,0
+.Linc:
+	.long	0,0,0,0, 1,1,1,1
+	.long	2,2,2,2, 3,3,3,3
+	.long	4,4,4,4, 4,4,4,4
 .align	64
 ___
 
@@ -1837,18 +1899,19 @@ rsaz_se_handler:
 	.rva	rsaz_se_handler
 	.rva	.Lmul_1024_body,.Lmul_1024_epilogue
 .LSEH_info_rsaz_1024_gather5:
-	.byte	0x01,0x33,0x16,0x00
-	.byte	0x36,0xf8,0x09,0x00	#vmovaps 0x90(rsp),xmm15
-	.byte	0x31,0xe8,0x08,0x00	#vmovaps 0x80(rsp),xmm14
-	.byte	0x2c,0xd8,0x07,0x00	#vmovaps 0x70(rsp),xmm13
-	.byte	0x27,0xc8,0x06,0x00	#vmovaps 0x60(rsp),xmm12
-	.byte	0x22,0xb8,0x05,0x00	#vmovaps 0x50(rsp),xmm11
-	.byte	0x1d,0xa8,0x04,0x00	#vmovaps 0x40(rsp),xmm10
-	.byte	0x18,0x98,0x03,0x00	#vmovaps 0x30(rsp),xmm9
-	.byte	0x13,0x88,0x02,0x00	#vmovaps 0x20(rsp),xmm8
-	.byte	0x0e,0x78,0x01,0x00	#vmovaps 0x10(rsp),xmm7
-	.byte	0x09,0x68,0x00,0x00	#vmovaps 0x00(rsp),xmm6
-	.byte	0x04,0x01,0x15,0x00	#sub	rsp,0xa8
+	.byte	0x01,0x36,0x17,0x0b
+	.byte	0x36,0xf8,0x09,0x00	# vmovaps 0x90(rsp),xmm15
+	.byte	0x31,0xe8,0x08,0x00	# vmovaps 0x80(rsp),xmm14
+	.byte	0x2c,0xd8,0x07,0x00	# vmovaps 0x70(rsp),xmm13
+	.byte	0x27,0xc8,0x06,0x00	# vmovaps 0x60(rsp),xmm12
+	.byte	0x22,0xb8,0x05,0x00	# vmovaps 0x50(rsp),xmm11
+	.byte	0x1d,0xa8,0x04,0x00	# vmovaps 0x40(rsp),xmm10
+	.byte	0x18,0x98,0x03,0x00	# vmovaps 0x30(rsp),xmm9
+	.byte	0x13,0x88,0x02,0x00	# vmovaps 0x20(rsp),xmm8
+	.byte	0x0e,0x78,0x01,0x00	# vmovaps 0x10(rsp),xmm7
+	.byte	0x09,0x68,0x00,0x00	# vmovaps 0x00(rsp),xmm6
+	.byte	0x04,0x01,0x15,0x00	# sub	  rsp,0xa8
+	.byte	0x00,0xb3,0x00,0x00	# set_frame r11
 ___
 }
 
diff --git a/crypto/bn/asm/rsaz-x86_64.pl b/crypto/bn/asm/rsaz-x86_64.pl
index 091cdc2..87ce2c3 100755
--- a/crypto/bn/asm/rsaz-x86_64.pl
+++ b/crypto/bn/asm/rsaz-x86_64.pl
@@ -915,9 +915,76 @@ rsaz_512_mul_gather4:
 	push	%r14
 	push	%r15
 
-	mov	$pwr, $pwr
-	subq	\$128+24, %rsp
+	subq	\$`128+24+($win64?0xb0:0)`, %rsp
+___
+$code.=<<___	if ($win64);
+	movaps	%xmm6,0xa0(%rsp)
+	movaps	%xmm7,0xb0(%rsp)
+	movaps	%xmm8,0xc0(%rsp)
+	movaps	%xmm9,0xd0(%rsp)
+	movaps	%xmm10,0xe0(%rsp)
+	movaps	%xmm11,0xf0(%rsp)
+	movaps	%xmm12,0x100(%rsp)
+	movaps	%xmm13,0x110(%rsp)
+	movaps	%xmm14,0x120(%rsp)
+	movaps	%xmm15,0x130(%rsp)
+___
+$code.=<<___;
 .Lmul_gather4_body:
+	movd	$pwr,%xmm8
+	movdqa	.Linc+16(%rip),%xmm1	# 00000002000000020000000200000002
+	movdqa	.Linc(%rip),%xmm0	# 00000001000000010000000000000000
+
+	pshufd	\$0,%xmm8,%xmm8		# broadcast $power
+	movdqa	%xmm1,%xmm7
+	movdqa	%xmm1,%xmm2
+___
+########################################################################
+# calculate mask by comparing 0..15 to $power
+#
+for($i=0;$i<4;$i++) {
+$code.=<<___;
+	paddd	%xmm`$i`,%xmm`$i+1`
+	pcmpeqd	%xmm8,%xmm`$i`
+	movdqa	%xmm7,%xmm`$i+3`
+___
+}
+for(;$i<7;$i++) {
+$code.=<<___;
+	paddd	%xmm`$i`,%xmm`$i+1`
+	pcmpeqd	%xmm8,%xmm`$i`
+___
+}
+$code.=<<___;
+	pcmpeqd	%xmm8,%xmm7
+
+	movdqa	16*0($bp),%xmm8
+	movdqa	16*1($bp),%xmm9
+	movdqa	16*2($bp),%xmm10
+	movdqa	16*3($bp),%xmm11
+	pand	%xmm0,%xmm8
+	movdqa	16*4($bp),%xmm12
+	pand	%xmm1,%xmm9
+	movdqa	16*5($bp),%xmm13
+	pand	%xmm2,%xmm10
+	movdqa	16*6($bp),%xmm14
+	pand	%xmm3,%xmm11
+	movdqa	16*7($bp),%xmm15
+	leaq	128($bp), %rbp
+	pand	%xmm4,%xmm12
+	pand	%xmm5,%xmm13
+	pand	%xmm6,%xmm14
+	pand	%xmm7,%xmm15
+	por	%xmm10,%xmm8
+	por	%xmm11,%xmm9
+	por	%xmm12,%xmm8
+	por	%xmm13,%xmm9
+	por	%xmm14,%xmm8
+	por	%xmm15,%xmm9
+
+	por	%xmm9,%xmm8
+	pshufd	\$0x4e,%xmm8,%xmm9
+	por	%xmm9,%xmm8
 ___
 $code.=<<___ if ($addx);
 	movl	\$0x80100,%r11d
@@ -926,45 +993,38 @@ $code.=<<___ if ($addx);
 	je	.Lmulx_gather
 ___
 $code.=<<___;
-	movl	64($bp,$pwr,4), %eax
-	movq	$out, %xmm0		# off-load arguments
-	movl	($bp,$pwr,4), %ebx
-	movq	$mod, %xmm1
-	movq	$n0, 128(%rsp)
+	movq	%xmm8,%rbx
+
+	movq	$n0, 128(%rsp)		# off-load arguments
+	movq	$out, 128+8(%rsp)
+	movq	$mod, 128+16(%rsp)
 
-	shlq	\$32, %rax
-	or	%rax, %rbx
 	movq	($ap), %rax
 	 movq	8($ap), %rcx
-	 leaq	128($bp,$pwr,4), %rbp
 	mulq	%rbx			# 0 iteration
 	movq	%rax, (%rsp)
 	movq	%rcx, %rax
 	movq	%rdx, %r8
 
 	mulq	%rbx
-	 movd	(%rbp), %xmm4
 	addq	%rax, %r8
 	movq	16($ap), %rax
 	movq	%rdx, %r9
 	adcq	\$0, %r9
 
 	mulq	%rbx
-	 movd	64(%rbp), %xmm5
 	addq	%rax, %r9
 	movq	24($ap), %rax
 	movq	%rdx, %r10
 	adcq	\$0, %r10
 
 	mulq	%rbx
-	 pslldq	\$4, %xmm5
 	addq	%rax, %r10
 	movq	32($ap), %rax
 	movq	%rdx, %r11
 	adcq	\$0, %r11
 
 	mulq	%rbx
-	 por	%xmm5, %xmm4
 	addq	%rax, %r11
 	movq	40($ap), %rax
 	movq	%rdx, %r12
@@ -977,14 +1037,12 @@ $code.=<<___;
 	adcq	\$0, %r13
 
 	mulq	%rbx
-	 leaq	128(%rbp), %rbp
 	addq	%rax, %r13
 	movq	56($ap), %rax
 	movq	%rdx, %r14
 	adcq	\$0, %r14
 	
 	mulq	%rbx
-	 movq	%xmm4, %rbx
 	addq	%rax, %r14
 	 movq	($ap), %rax
 	movq	%rdx, %r15
@@ -996,6 +1054,35 @@ $code.=<<___;
 
 .align	32
 .Loop_mul_gather:
+	movdqa	16*0(%rbp),%xmm8
+	movdqa	16*1(%rbp),%xmm9
+	movdqa	16*2(%rbp),%xmm10
+	movdqa	16*3(%rbp),%xmm11
+	pand	%xmm0,%xmm8
+	movdqa	16*4(%rbp),%xmm12
+	pand	%xmm1,%xmm9
+	movdqa	16*5(%rbp),%xmm13
+	pand	%xmm2,%xmm10
+	movdqa	16*6(%rbp),%xmm14
+	pand	%xmm3,%xmm11
+	movdqa	16*7(%rbp),%xmm15
+	leaq	128(%rbp), %rbp
+	pand	%xmm4,%xmm12
+	pand	%xmm5,%xmm13
+	pand	%xmm6,%xmm14
+	pand	%xmm7,%xmm15
+	por	%xmm10,%xmm8
+	por	%xmm11,%xmm9
+	por	%xmm12,%xmm8
+	por	%xmm13,%xmm9
+	por	%xmm14,%xmm8
+	por	%xmm15,%xmm9
+
+	por	%xmm9,%xmm8
+	pshufd	\$0x4e,%xmm8,%xmm9
+	por	%xmm9,%xmm8
+	movq	%xmm8,%rbx
+
 	mulq	%rbx
 	addq	%rax, %r8
 	movq	8($ap), %rax
@@ -1004,7 +1091,6 @@ $code.=<<___;
 	adcq	\$0, %r8
 
 	mulq	%rbx
-	 movd	(%rbp), %xmm4
 	addq	%rax, %r9
 	movq	16($ap), %rax
 	adcq	\$0, %rdx
@@ -1013,7 +1099,6 @@ $code.=<<___;
 	adcq	\$0, %r9
 
 	mulq	%rbx
-	 movd	64(%rbp), %xmm5
 	addq	%rax, %r10
 	movq	24($ap), %rax
 	adcq	\$0, %rdx
@@ -1022,7 +1107,6 @@ $code.=<<___;
 	adcq	\$0, %r10
 
 	mulq	%rbx
-	 pslldq	\$4, %xmm5
 	addq	%rax, %r11
 	movq	32($ap), %rax
 	adcq	\$0, %rdx
@@ -1031,7 +1115,6 @@ $code.=<<___;
 	adcq	\$0, %r11
 
 	mulq	%rbx
-	 por	%xmm5, %xmm4
 	addq	%rax, %r12
 	movq	40($ap), %rax
 	adcq	\$0, %rdx
@@ -1056,7 +1139,6 @@ $code.=<<___;
 	adcq	\$0, %r14
 
 	mulq	%rbx
-	 movq	%xmm4, %rbx
 	addq	%rax, %r15
 	 movq	($ap), %rax
 	adcq	\$0, %rdx
@@ -1064,7 +1146,6 @@ $code.=<<___;
 	movq	%rdx, %r15	
 	adcq	\$0, %r15
 
-	leaq	128(%rbp), %rbp
 	leaq	8(%rdi), %rdi
 
 	decl	%ecx
@@ -1079,8 +1160,8 @@ $code.=<<___;
 	movq	%r14, 48(%rdi)
 	movq	%r15, 56(%rdi)
 
-	movq	%xmm0, $out
-	movq	%xmm1, %rbp
+	movq	128+8(%rsp), $out
+	movq	128+16(%rsp), %rbp
 
 	movq	(%rsp), %r8
 	movq	8(%rsp), %r9
@@ -1098,45 +1179,37 @@ $code.=<<___ if ($addx);
 
 .align	32
 .Lmulx_gather:
-	mov	64($bp,$pwr,4), %eax
-	movq	$out, %xmm0		# off-load arguments
-	lea	128($bp,$pwr,4), %rbp
-	mov	($bp,$pwr,4), %edx
-	movq	$mod, %xmm1
-	mov	$n0, 128(%rsp)
+	movq	%xmm8,%rdx
+
+	mov	$n0, 128(%rsp)		# off-load arguments
+	mov	$out, 128+8(%rsp)
+	mov	$mod, 128+16(%rsp)
 
-	shl	\$32, %rax
-	or	%rax, %rdx
 	mulx	($ap), %rbx, %r8	# 0 iteration
 	mov	%rbx, (%rsp)
 	xor	%edi, %edi		# cf=0, of=0
 
 	mulx	8($ap), %rax, %r9
-	 movd	(%rbp), %xmm4
 
 	mulx	16($ap), %rbx, %r10
-	 movd	64(%rbp), %xmm5
 	adcx	%rax, %r8
 
 	mulx	24($ap), %rax, %r11
-	 pslldq	\$4, %xmm5
 	adcx	%rbx, %r9
 
 	mulx	32($ap), %rbx, %r12
-	 por	%xmm5, %xmm4
 	adcx	%rax, %r10
 
 	mulx	40($ap), %rax, %r13
 	adcx	%rbx, %r11
 
 	mulx	48($ap), %rbx, %r14
-	 lea	128(%rbp), %rbp
 	adcx	%rax, %r12
 	
 	mulx	56($ap), %rax, %r15
-	 movq	%xmm4, %rdx
 	adcx	%rbx, %r13
 	adcx	%rax, %r14
+	.byte	0x67
 	mov	%r8, %rbx
 	adcx	%rdi, %r15		# %rdi is 0
 
@@ -1145,24 +1218,48 @@ $code.=<<___ if ($addx);
 
 .align	32
 .Loop_mulx_gather:
-	mulx	($ap), %rax, %r8
+	movdqa	16*0(%rbp),%xmm8
+	movdqa	16*1(%rbp),%xmm9
+	movdqa	16*2(%rbp),%xmm10
+	movdqa	16*3(%rbp),%xmm11
+	pand	%xmm0,%xmm8
+	movdqa	16*4(%rbp),%xmm12
+	pand	%xmm1,%xmm9
+	movdqa	16*5(%rbp),%xmm13
+	pand	%xmm2,%xmm10
+	movdqa	16*6(%rbp),%xmm14
+	pand	%xmm3,%xmm11
+	movdqa	16*7(%rbp),%xmm15
+	leaq	128(%rbp), %rbp
+	pand	%xmm4,%xmm12
+	pand	%xmm5,%xmm13
+	pand	%xmm6,%xmm14
+	pand	%xmm7,%xmm15
+	por	%xmm10,%xmm8
+	por	%xmm11,%xmm9
+	por	%xmm12,%xmm8
+	por	%xmm13,%xmm9
+	por	%xmm14,%xmm8
+	por	%xmm15,%xmm9
+
+	por	%xmm9,%xmm8
+	pshufd	\$0x4e,%xmm8,%xmm9
+	por	%xmm9,%xmm8
+	movq	%xmm8,%rdx
+
+	.byte	0xc4,0x62,0xfb,0xf6,0x86,0x00,0x00,0x00,0x00	# mulx	($ap), %rax, %r8
 	adcx	%rax, %rbx
 	adox	%r9, %r8
 
 	mulx	8($ap), %rax, %r9
-	.byte	0x66,0x0f,0x6e,0xa5,0x00,0x00,0x00,0x00		# movd	(%rbp), %xmm4
 	adcx	%rax, %r8
 	adox	%r10, %r9
 
 	mulx	16($ap), %rax, %r10
-	 movd	64(%rbp), %xmm5
-	 lea	128(%rbp), %rbp
 	adcx	%rax, %r9
 	adox	%r11, %r10
 
 	.byte	0xc4,0x62,0xfb,0xf6,0x9e,0x18,0x00,0x00,0x00	# mulx	24($ap), %rax, %r11
-	 pslldq	\$4, %xmm5
-	 por	%xmm5, %xmm4
 	adcx	%rax, %r10
 	adox	%r12, %r11
 
@@ -1176,10 +1273,10 @@ $code.=<<___ if ($addx);
 
 	.byte	0xc4,0x62,0xfb,0xf6,0xb6,0x30,0x00,0x00,0x00	# mulx	48($ap), %rax, %r14
 	adcx	%rax, %r13
+	.byte	0x67
 	adox	%r15, %r14
 
 	mulx	56($ap), %rax, %r15
-	 movq	%xmm4, %rdx
 	 mov	%rbx, 64(%rsp,%rcx,8)
 	adcx	%rax, %r14
 	adox	%rdi, %r15
@@ -1198,10 +1295,10 @@ $code.=<<___ if ($addx);
 	mov	%r14, 64+48(%rsp)
 	mov	%r15, 64+56(%rsp)
 
-	movq	%xmm0, $out
-	movq	%xmm1, %rbp
+	mov	128(%rsp), %rdx		# pull arguments
+	mov	128+8(%rsp), $out
+	mov	128+16(%rsp), %rbp
 
-	mov	128(%rsp), %rdx		# pull $n0
 	mov	(%rsp), %r8
 	mov	8(%rsp), %r9
 	mov	16(%rsp), %r10
@@ -1229,6 +1326,21 @@ $code.=<<___;
 	call	__rsaz_512_subtract
 
 	leaq	128+24+48(%rsp), %rax
+___
+$code.=<<___	if ($win64);
+	movaps	0xa0-0xc8(%rax),%xmm6
+	movaps	0xb0-0xc8(%rax),%xmm7
+	movaps	0xc0-0xc8(%rax),%xmm8
+	movaps	0xd0-0xc8(%rax),%xmm9
+	movaps	0xe0-0xc8(%rax),%xmm10
+	movaps	0xf0-0xc8(%rax),%xmm11
+	movaps	0x100-0xc8(%rax),%xmm12
+	movaps	0x110-0xc8(%rax),%xmm13
+	movaps	0x120-0xc8(%rax),%xmm14
+	movaps	0x130-0xc8(%rax),%xmm15
+	lea	0xb0(%rax),%rax
+___
+$code.=<<___;
 	movq	-48(%rax), %r15
 	movq	-40(%rax), %r14
 	movq	-32(%rax), %r13
@@ -1258,7 +1370,7 @@ rsaz_512_mul_scatter4:
 	mov	$pwr, $pwr
 	subq	\$128+24, %rsp
 .Lmul_scatter4_body:
-	leaq	($tbl,$pwr,4), $tbl
+	leaq	($tbl,$pwr,8), $tbl
 	movq	$out, %xmm0		# off-load arguments
 	movq	$mod, %xmm1
 	movq	$tbl, %xmm2
@@ -1329,30 +1441,14 @@ $code.=<<___;
 
 	call	__rsaz_512_subtract
 
-	movl	%r8d, 64*0($inp)	# scatter
-	shrq	\$32, %r8
-	movl	%r9d, 64*2($inp)
-	shrq	\$32, %r9
-	movl	%r10d, 64*4($inp)
-	shrq	\$32, %r10
-	movl	%r11d, 64*6($inp)
-	shrq	\$32, %r11
-	movl	%r12d, 64*8($inp)
-	shrq	\$32, %r12
-	movl	%r13d, 64*10($inp)
-	shrq	\$32, %r13
-	movl	%r14d, 64*12($inp)
-	shrq	\$32, %r14
-	movl	%r15d, 64*14($inp)
-	shrq	\$32, %r15
-	movl	%r8d, 64*1($inp)
-	movl	%r9d, 64*3($inp)
-	movl	%r10d, 64*5($inp)
-	movl	%r11d, 64*7($inp)
-	movl	%r12d, 64*9($inp)
-	movl	%r13d, 64*11($inp)
-	movl	%r14d, 64*13($inp)
-	movl	%r15d, 64*15($inp)
+	movq	%r8, 128*0($inp)	# scatter
+	movq	%r9, 128*1($inp)
+	movq	%r10, 128*2($inp)
+	movq	%r11, 128*3($inp)
+	movq	%r12, 128*4($inp)
+	movq	%r13, 128*5($inp)
+	movq	%r14, 128*6($inp)
+	movq	%r15, 128*7($inp)
 
 	leaq	128+24+48(%rsp), %rax
 	movq	-48(%rax), %r15
@@ -1956,16 +2052,14 @@ $code.=<<___;
 .type	rsaz_512_scatter4,\@abi-omnipotent
 .align	16
 rsaz_512_scatter4:
-	leaq	($out,$power,4), $out
+	leaq	($out,$power,8), $out
 	movl	\$8, %r9d
 	jmp	.Loop_scatter
 .align	16
 .Loop_scatter:
 	movq	($inp), %rax
 	leaq	8($inp), $inp
-	movl	%eax, ($out)
-	shrq	\$32, %rax
-	movl	%eax, 64($out)
+	movq	%rax, ($out)
 	leaq	128($out), $out
 	decl	%r9d
 	jnz	.Loop_scatter
@@ -1976,22 +2070,106 @@ rsaz_512_scatter4:
 .type	rsaz_512_gather4,\@abi-omnipotent
 .align	16
 rsaz_512_gather4:
-	leaq	($inp,$power,4), $inp
+___
+$code.=<<___	if ($win64);
+.LSEH_begin_rsaz_512_gather4:
+	.byte	0x48,0x81,0xec,0xa8,0x00,0x00,0x00	# sub    $0xa8,%rsp
+	.byte	0x0f,0x29,0x34,0x24			# movaps %xmm6,(%rsp)
+	.byte	0x0f,0x29,0x7c,0x24,0x10		# movaps %xmm7,0x10(%rsp)
+	.byte	0x44,0x0f,0x29,0x44,0x24,0x20		# movaps %xmm8,0x20(%rsp)
+	.byte	0x44,0x0f,0x29,0x4c,0x24,0x30		# movaps %xmm9,0x30(%rsp)
+	.byte	0x44,0x0f,0x29,0x54,0x24,0x40		# movaps %xmm10,0x40(%rsp)
+	.byte	0x44,0x0f,0x29,0x5c,0x24,0x50		# movaps %xmm11,0x50(%rsp)
+	.byte	0x44,0x0f,0x29,0x64,0x24,0x60		# movaps %xmm12,0x60(%rsp)
+	.byte	0x44,0x0f,0x29,0x6c,0x24,0x70		# movaps %xmm13,0x70(%rsp)
+	.byte	0x44,0x0f,0x29,0xb4,0x24,0x80,0,0,0	# movaps %xmm14,0x80(%rsp)
+	.byte	0x44,0x0f,0x29,0xbc,0x24,0x90,0,0,0	# movaps %xmm15,0x90(%rsp)
+___
+$code.=<<___;
+	movd	$power,%xmm8
+	movdqa	.Linc+16(%rip),%xmm1	# 00000002000000020000000200000002
+	movdqa	.Linc(%rip),%xmm0	# 00000001000000010000000000000000
+
+	pshufd	\$0,%xmm8,%xmm8		# broadcast $power
+	movdqa	%xmm1,%xmm7
+	movdqa	%xmm1,%xmm2
+___
+########################################################################
+# calculate mask by comparing 0..15 to $power
+#
+for($i=0;$i<4;$i++) {
+$code.=<<___;
+	paddd	%xmm`$i`,%xmm`$i+1`
+	pcmpeqd	%xmm8,%xmm`$i`
+	movdqa	%xmm7,%xmm`$i+3`
+___
+}
+for(;$i<7;$i++) {
+$code.=<<___;
+	paddd	%xmm`$i`,%xmm`$i+1`
+	pcmpeqd	%xmm8,%xmm`$i`
+___
+}
+$code.=<<___;
+	pcmpeqd	%xmm8,%xmm7
 	movl	\$8, %r9d
 	jmp	.Loop_gather
 .align	16
 .Loop_gather:
-	movl	($inp), %eax
-	movl	64($inp), %r8d
+	movdqa	16*0($inp),%xmm8
+	movdqa	16*1($inp),%xmm9
+	movdqa	16*2($inp),%xmm10
+	movdqa	16*3($inp),%xmm11
+	pand	%xmm0,%xmm8
+	movdqa	16*4($inp),%xmm12
+	pand	%xmm1,%xmm9
+	movdqa	16*5($inp),%xmm13
+	pand	%xmm2,%xmm10
+	movdqa	16*6($inp),%xmm14
+	pand	%xmm3,%xmm11
+	movdqa	16*7($inp),%xmm15
 	leaq	128($inp), $inp
-	shlq	\$32, %r8
-	or	%r8, %rax
-	movq	%rax, ($out)
+	pand	%xmm4,%xmm12
+	pand	%xmm5,%xmm13
+	pand	%xmm6,%xmm14
+	pand	%xmm7,%xmm15
+	por	%xmm10,%xmm8
+	por	%xmm11,%xmm9
+	por	%xmm12,%xmm8
+	por	%xmm13,%xmm9
+	por	%xmm14,%xmm8
+	por	%xmm15,%xmm9
+
+	por	%xmm9,%xmm8
+	pshufd	\$0x4e,%xmm8,%xmm9
+	por	%xmm9,%xmm8
+	movq	%xmm8,($out)
 	leaq	8($out), $out
 	decl	%r9d
 	jnz	.Loop_gather
+___
+$code.=<<___	if ($win64);
+	movaps	0x00(%rsp),%xmm6
+	movaps	0x10(%rsp),%xmm7
+	movaps	0x20(%rsp),%xmm8
+	movaps	0x30(%rsp),%xmm9
+	movaps	0x40(%rsp),%xmm10
+	movaps	0x50(%rsp),%xmm11
+	movaps	0x60(%rsp),%xmm12
+	movaps	0x70(%rsp),%xmm13
+	movaps	0x80(%rsp),%xmm14
+	movaps	0x90(%rsp),%xmm15
+	add	\$0xa8,%rsp
+___
+$code.=<<___;
 	ret
+.LSEH_end_rsaz_512_gather4:
 .size	rsaz_512_gather4,.-rsaz_512_gather4
+
+.align	64
+.Linc:
+	.long	0,0, 1,1
+	.long	2,2, 2,2
 ___
 }
 
@@ -2039,6 +2217,18 @@ se_handler:
 
 	lea	128+24+48(%rax),%rax
 
+	lea	.Lmul_gather4_epilogue(%rip),%rbx
+	cmp	%r10,%rbx
+	jne	.Lse_not_in_mul_gather4
+
+	lea	0xb0(%rax),%rax
+
+	lea	-48-0xa8(%rax),%rsi
+	lea	512($context),%rdi
+	mov	\$20,%ecx
+	.long	0xa548f3fc		# cld; rep movsq
+
+.Lse_not_in_mul_gather4:
 	mov	-8(%rax),%rbx
 	mov	-16(%rax),%rbp
 	mov	-24(%rax),%r12
@@ -2090,7 +2280,7 @@ se_handler:
 	pop	%rdi
 	pop	%rsi
 	ret
-.size	sqr_handler,.-sqr_handler
+.size	se_handler,.-se_handler
 
 .section	.pdata
 .align	4
@@ -2114,6 +2304,10 @@ se_handler:
 	.rva	.LSEH_end_rsaz_512_mul_by_one
 	.rva	.LSEH_info_rsaz_512_mul_by_one
 
+	.rva	.LSEH_begin_rsaz_512_gather4
+	.rva	.LSEH_end_rsaz_512_gather4
+	.rva	.LSEH_info_rsaz_512_gather4
+
 .section	.xdata
 .align	8
 .LSEH_info_rsaz_512_sqr:
@@ -2136,6 +2330,19 @@ se_handler:
 	.byte	9,0,0,0
 	.rva	se_handler
 	.rva	.Lmul_by_one_body,.Lmul_by_one_epilogue		# HandlerData[]
+.LSEH_info_rsaz_512_gather4:
+	.byte	0x01,0x46,0x16,0x00
+	.byte	0x46,0xf8,0x09,0x00	# vmovaps 0x90(rsp),xmm15
+	.byte	0x3d,0xe8,0x08,0x00	# vmovaps 0x80(rsp),xmm14
+	.byte	0x34,0xd8,0x07,0x00	# vmovaps 0x70(rsp),xmm13
+	.byte	0x2e,0xc8,0x06,0x00	# vmovaps 0x60(rsp),xmm12
+	.byte	0x28,0xb8,0x05,0x00	# vmovaps 0x50(rsp),xmm11
+	.byte	0x22,0xa8,0x04,0x00	# vmovaps 0x40(rsp),xmm10
+	.byte	0x1c,0x98,0x03,0x00	# vmovaps 0x30(rsp),xmm9
+	.byte	0x16,0x88,0x02,0x00	# vmovaps 0x20(rsp),xmm8
+	.byte	0x10,0x78,0x01,0x00	# vmovaps 0x10(rsp),xmm7
+	.byte	0x0b,0x68,0x00,0x00	# vmovaps 0x00(rsp),xmm6
+	.byte	0x07,0x01,0x15,0x00	# sub     rsp,0xa8
 ___
 }
 
diff --git a/crypto/bn/asm/x86_64-mont.pl b/crypto/bn/asm/x86_64-mont.pl
index e82e451..29ba122 100755
--- a/crypto/bn/asm/x86_64-mont.pl
+++ b/crypto/bn/asm/x86_64-mont.pl
@@ -775,100 +775,126 @@ bn_sqr8x_mont:
 	# 4096. this is done to allow memory disambiguation logic
 	# do its job.
 	#
-	lea	-64(%rsp,$num,4),%r11
+	lea	-64(%rsp,$num,2),%r11
 	mov	($n0),$n0		# *n0
 	sub	$aptr,%r11
 	and	\$4095,%r11
 	cmp	%r11,%r10
 	jb	.Lsqr8x_sp_alt
 	sub	%r11,%rsp		# align with $aptr
-	lea	-64(%rsp,$num,4),%rsp	# alloca(frame+4*$num)
+	lea	-64(%rsp,$num,2),%rsp	# alloca(frame+2*$num)
 	jmp	.Lsqr8x_sp_done
 
 .align	32
 .Lsqr8x_sp_alt:
-	lea	4096-64(,$num,4),%r10	# 4096-frame-4*$num
-	lea	-64(%rsp,$num,4),%rsp	# alloca(frame+4*$num)
+	lea	4096-64(,$num,2),%r10	# 4096-frame-2*$num
+	lea	-64(%rsp,$num,2),%rsp	# alloca(frame+2*$num)
 	sub	%r10,%r11
 	mov	\$0,%r10
 	cmovc	%r10,%r11
 	sub	%r11,%rsp
 .Lsqr8x_sp_done:
 	and	\$-64,%rsp
-	mov	$num,%r10	
+	mov	$num,%r10
 	neg	$num
 
-	lea	64(%rsp,$num,2),%r11	# copy of modulus
 	mov	$n0,  32(%rsp)
 	mov	%rax, 40(%rsp)		# save original %rsp
 .Lsqr8x_body:
 
-	mov	$num,$i
-	movq	%r11, %xmm2		# save pointer to modulus copy
-	shr	\$3+2,$i
-	mov	OPENSSL_ia32cap_P+8(%rip),%eax
-	jmp	.Lsqr8x_copy_n
-
-.align	32
-.Lsqr8x_copy_n:
-	movq	8*0($nptr),%xmm0
-	movq	8*1($nptr),%xmm1
-	movq	8*2($nptr),%xmm3
-	movq	8*3($nptr),%xmm4
-	lea	8*4($nptr),$nptr
-	movdqa	%xmm0,16*0(%r11)
-	movdqa	%xmm1,16*1(%r11)
-	movdqa	%xmm3,16*2(%r11)
-	movdqa	%xmm4,16*3(%r11)
-	lea	16*4(%r11),%r11
-	dec	$i
-	jnz	.Lsqr8x_copy_n
-
+	movq	$nptr, %xmm2		# save pointer to modulus
 	pxor	%xmm0,%xmm0
 	movq	$rptr,%xmm1		# save $rptr
 	movq	%r10, %xmm3		# -$num
 ___
 $code.=<<___ if ($addx);
+	mov	OPENSSL_ia32cap_P+8(%rip),%eax
 	and	\$0x80100,%eax
 	cmp	\$0x80100,%eax
 	jne	.Lsqr8x_nox
 
 	call	bn_sqrx8x_internal	# see x86_64-mont5 module
-
-	pxor	%xmm0,%xmm0
-	lea	48(%rsp),%rax
-	lea	64(%rsp,$num,2),%rdx
-	shr	\$3+2,$num
-	mov	40(%rsp),%rsi		# restore %rsp
-	jmp	.Lsqr8x_zero
+					# %rax	top-most carry
+					# %rbp	nptr
+					# %rcx	-8*num
+					# %r8	end of tp[2*num]
+	lea	(%r8,%rcx),%rbx
+	mov	%rcx,$num
+	mov	%rcx,%rdx
+	movq	%xmm1,$rptr
+	sar	\$3+2,%rcx		# %cf=0
+	jmp	.Lsqr8x_sub
 
 .align	32
 .Lsqr8x_nox:
 ___
 $code.=<<___;
 	call	bn_sqr8x_internal	# see x86_64-mont5 module
+					# %rax	top-most carry
+					# %rbp	nptr
+					# %r8	-8*num
+					# %rdi	end of tp[2*num]
+	lea	(%rdi,$num),%rbx
+	mov	$num,%rcx
+	mov	$num,%rdx
+	movq	%xmm1,$rptr
+	sar	\$3+2,%rcx		# %cf=0
+	jmp	.Lsqr8x_sub
 
+.align	32
+.Lsqr8x_sub:
+	mov	8*0(%rbx),%r12
+	mov	8*1(%rbx),%r13
+	mov	8*2(%rbx),%r14
+	mov	8*3(%rbx),%r15
+	lea	8*4(%rbx),%rbx
+	sbb	8*0(%rbp),%r12
+	sbb	8*1(%rbp),%r13
+	sbb	8*2(%rbp),%r14
+	sbb	8*3(%rbp),%r15
+	lea	8*4(%rbp),%rbp
+	mov	%r12,8*0($rptr)
+	mov	%r13,8*1($rptr)
+	mov	%r14,8*2($rptr)
+	mov	%r15,8*3($rptr)
+	lea	8*4($rptr),$rptr
+	inc	%rcx			# preserves %cf
+	jnz	.Lsqr8x_sub
+
+	sbb	\$0,%rax		# top-most carry
+	lea	(%rbx,$num),%rbx	# rewind
+	lea	($rptr,$num),$rptr	# rewind
+
+	movq	%rax,%xmm1
 	pxor	%xmm0,%xmm0
-	lea	48(%rsp),%rax
-	lea	64(%rsp,$num,2),%rdx
-	shr	\$3+2,$num
+	pshufd	\$0,%xmm1,%xmm1
 	mov	40(%rsp),%rsi		# restore %rsp
-	jmp	.Lsqr8x_zero
+	jmp	.Lsqr8x_cond_copy
 
 .align	32
-.Lsqr8x_zero:
-	movdqa	%xmm0,16*0(%rax)	# wipe t
-	movdqa	%xmm0,16*1(%rax)
-	movdqa	%xmm0,16*2(%rax)
-	movdqa	%xmm0,16*3(%rax)
-	lea	16*4(%rax),%rax
-	movdqa	%xmm0,16*0(%rdx)	# wipe n
-	movdqa	%xmm0,16*1(%rdx)
-	movdqa	%xmm0,16*2(%rdx)
-	movdqa	%xmm0,16*3(%rdx)
-	lea	16*4(%rdx),%rdx
-	dec	$num
-	jnz	.Lsqr8x_zero
+.Lsqr8x_cond_copy:
+	movdqa	16*0(%rbx),%xmm2
+	movdqa	16*1(%rbx),%xmm3
+	lea	16*2(%rbx),%rbx
+	movdqu	16*0($rptr),%xmm4
+	movdqu	16*1($rptr),%xmm5
+	lea	16*2($rptr),$rptr
+	movdqa	%xmm0,-16*2(%rbx)	# zero tp
+	movdqa	%xmm0,-16*1(%rbx)
+	movdqa	%xmm0,-16*2(%rbx,%rdx)
+	movdqa	%xmm0,-16*1(%rbx,%rdx)
+	pcmpeqd	%xmm1,%xmm0
+	pand	%xmm1,%xmm2
+	pand	%xmm1,%xmm3
+	pand	%xmm0,%xmm4
+	pand	%xmm0,%xmm5
+	pxor	%xmm0,%xmm0
+	por	%xmm2,%xmm4
+	por	%xmm3,%xmm5
+	movdqu	%xmm4,-16*2($rptr)
+	movdqu	%xmm5,-16*1($rptr)
+	add	\$32,$num
+	jnz	.Lsqr8x_cond_copy
 
 	mov	\$1,%rax
 	mov	-48(%rsi),%r15
@@ -1135,64 +1161,75 @@ $code.=<<___;
 	adc	$zero,%r15		# modulo-scheduled
 	sub	0*8($tptr),$zero	# pull top-most carry
 	adc	%r15,%r14
-	mov	-8($nptr),$mi
 	sbb	%r15,%r15		# top-most carry
 	mov	%r14,-1*8($tptr)
 
 	cmp	16(%rsp),$bptr
 	jne	.Lmulx4x_outer
 
-	sub	%r14,$mi		# compare top-most words
-	sbb	$mi,$mi
-	or	$mi,%r15
-
-	neg	$num
-	xor	%rdx,%rdx
+	lea	64(%rsp),$tptr
+	sub	$num,$nptr		# rewind $nptr
+	neg	%r15
+	mov	$num,%rdx
+	shr	\$3+2,$num		# %cf=0
 	mov	32(%rsp),$rptr		# restore rp
+	jmp	.Lmulx4x_sub
+
+.align	32
+.Lmulx4x_sub:
+	mov	8*0($tptr),%r11
+	mov	8*1($tptr),%r12
+	mov	8*2($tptr),%r13
+	mov	8*3($tptr),%r14
+	lea	8*4($tptr),$tptr
+	sbb	8*0($nptr),%r11
+	sbb	8*1($nptr),%r12
+	sbb	8*2($nptr),%r13
+	sbb	8*3($nptr),%r14
+	lea	8*4($nptr),$nptr
+	mov	%r11,8*0($rptr)
+	mov	%r12,8*1($rptr)
+	mov	%r13,8*2($rptr)
+	mov	%r14,8*3($rptr)
+	lea	8*4($rptr),$rptr
+	dec	$num			# preserves %cf
+	jnz	.Lmulx4x_sub
+
+	sbb	\$0,%r15		# top-most carry
 	lea	64(%rsp),$tptr
+	sub	%rdx,$rptr		# rewind
 
+	movq	%r15,%xmm1
 	pxor	%xmm0,%xmm0
-	mov	0*8($nptr,$num),%r8
-	mov	1*8($nptr,$num),%r9
-	neg	%r8
-	jmp	.Lmulx4x_sub_entry
+	pshufd	\$0,%xmm1,%xmm1
+	mov	40(%rsp),%rsi		# restore %rsp
+	jmp	.Lmulx4x_cond_copy
 
 .align	32
-.Lmulx4x_sub:
-	mov	0*8($nptr,$num),%r8
-	mov	1*8($nptr,$num),%r9
-	not	%r8
-.Lmulx4x_sub_entry:
-	mov	2*8($nptr,$num),%r10
-	not	%r9
-	and	%r15,%r8
-	mov	3*8($nptr,$num),%r11
-	not	%r10
-	and	%r15,%r9
-	not	%r11
-	and	%r15,%r10
-	and	%r15,%r11
-
-	neg	%rdx			# mov %rdx,%cf
-	adc	0*8($tptr),%r8
-	adc	1*8($tptr),%r9
-	movdqa	%xmm0,($tptr)
-	adc	2*8($tptr),%r10
-	adc	3*8($tptr),%r11
-	movdqa	%xmm0,16($tptr)
-	lea	4*8($tptr),$tptr
-	sbb	%rdx,%rdx		# mov %cf,%rdx
+.Lmulx4x_cond_copy:
+	movdqa	16*0($tptr),%xmm2
+	movdqa	16*1($tptr),%xmm3
+	lea	16*2($tptr),$tptr
+	movdqu	16*0($rptr),%xmm4
+	movdqu	16*1($rptr),%xmm5
+	lea	16*2($rptr),$rptr
+	movdqa	%xmm0,-16*2($tptr)	# zero tp
+	movdqa	%xmm0,-16*1($tptr)
+	pcmpeqd	%xmm1,%xmm0
+	pand	%xmm1,%xmm2
+	pand	%xmm1,%xmm3
+	pand	%xmm0,%xmm4
+	pand	%xmm0,%xmm5
+	pxor	%xmm0,%xmm0
+	por	%xmm2,%xmm4
+	por	%xmm3,%xmm5
+	movdqu	%xmm4,-16*2($rptr)
+	movdqu	%xmm5,-16*1($rptr)
+	sub	\$32,%rdx
+	jnz	.Lmulx4x_cond_copy
 
-	mov	%r8,0*8($rptr)
-	mov	%r9,1*8($rptr)
-	mov	%r10,2*8($rptr)
-	mov	%r11,3*8($rptr)
-	lea	4*8($rptr),$rptr
+	mov	%rdx,($tptr)
 
-	add	\$32,$num
-	jnz	.Lmulx4x_sub
-
-	mov	40(%rsp),%rsi		# restore %rsp
 	mov	\$1,%rax
 	mov	-48(%rsi),%r15
 	mov	-40(%rsi),%r14
diff --git a/crypto/bn/asm/x86_64-mont5.pl b/crypto/bn/asm/x86_64-mont5.pl
index 292409c..2e8c9db 100755
--- a/crypto/bn/asm/x86_64-mont5.pl
+++ b/crypto/bn/asm/x86_64-mont5.pl
@@ -99,58 +99,111 @@ $code.=<<___;
 .Lmul_enter:
 	mov	${num}d,${num}d
 	mov	%rsp,%rax
-	mov	`($win64?56:8)`(%rsp),%r10d	# load 7th argument
+	movd	`($win64?56:8)`(%rsp),%xmm5	# load 7th argument
+	lea	.Linc(%rip),%r10
 	push	%rbx
 	push	%rbp
 	push	%r12
 	push	%r13
 	push	%r14
 	push	%r15
-___
-$code.=<<___ if ($win64);
-	lea	-0x28(%rsp),%rsp
-	movaps	%xmm6,(%rsp)
-	movaps	%xmm7,0x10(%rsp)
-___
-$code.=<<___;
+
 	lea	2($num),%r11
 	neg	%r11
-	lea	(%rsp,%r11,8),%rsp	# tp=alloca(8*(num+2))
+	lea	-264(%rsp,%r11,8),%rsp	# tp=alloca(8*(num+2)+256+8)
 	and	\$-1024,%rsp		# minimize TLB usage
 
 	mov	%rax,8(%rsp,$num,8)	# tp[num+1]=%rsp
 .Lmul_body:
-	mov	$bp,%r12		# reassign $bp
+	lea	128($bp),%r12		# reassign $bp (+size optimization)
 ___
 		$bp="%r12";
 		$STRIDE=2**5*8;		# 5 is "window size"
 		$N=$STRIDE/4;		# should match cache line size
 $code.=<<___;
-	mov	%r10,%r11
-	shr	\$`log($N/8)/log(2)`,%r10
-	and	\$`$N/8-1`,%r11
-	not	%r10
-	lea	.Lmagic_masks(%rip),%rax
-	and	\$`2**5/($N/8)-1`,%r10	# 5 is "window size"
-	lea	96($bp,%r11,8),$bp	# pointer within 1st cache line
-	movq	0(%rax,%r10,8),%xmm4	# set of masks denoting which
-	movq	8(%rax,%r10,8),%xmm5	# cache line contains element
-	movq	16(%rax,%r10,8),%xmm6	# denoted by 7th argument
-	movq	24(%rax,%r10,8),%xmm7
-
-	movq	`0*$STRIDE/4-96`($bp),%xmm0
-	movq	`1*$STRIDE/4-96`($bp),%xmm1
-	pand	%xmm4,%xmm0
-	movq	`2*$STRIDE/4-96`($bp),%xmm2
-	pand	%xmm5,%xmm1
-	movq	`3*$STRIDE/4-96`($bp),%xmm3
-	pand	%xmm6,%xmm2
-	por	%xmm1,%xmm0
-	pand	%xmm7,%xmm3
+	movdqa	0(%r10),%xmm0		# 00000001000000010000000000000000
+	movdqa	16(%r10),%xmm1		# 00000002000000020000000200000002
+	lea	24-112(%rsp,$num,8),%r10# place the mask after tp[num+3] (+ICache optimization)
+	and	\$-16,%r10
+
+	pshufd	\$0,%xmm5,%xmm5		# broadcast index
+	movdqa	%xmm1,%xmm4
+	movdqa	%xmm1,%xmm2
+___
+########################################################################
+# calculate mask by comparing 0..31 to index and save result to stack
+#
+$code.=<<___;
+	paddd	%xmm0,%xmm1
+	pcmpeqd	%xmm5,%xmm0		# compare to 1,0
+	.byte	0x67
+	movdqa	%xmm4,%xmm3
+___
+for($k=0;$k<$STRIDE/16-4;$k+=4) {
+$code.=<<___;
+	paddd	%xmm1,%xmm2
+	pcmpeqd	%xmm5,%xmm1		# compare to 3,2
+	movdqa	%xmm0,`16*($k+0)+112`(%r10)
+	movdqa	%xmm4,%xmm0
+
+	paddd	%xmm2,%xmm3
+	pcmpeqd	%xmm5,%xmm2		# compare to 5,4
+	movdqa	%xmm1,`16*($k+1)+112`(%r10)
+	movdqa	%xmm4,%xmm1
+
+	paddd	%xmm3,%xmm0
+	pcmpeqd	%xmm5,%xmm3		# compare to 7,6
+	movdqa	%xmm2,`16*($k+2)+112`(%r10)
+	movdqa	%xmm4,%xmm2
+
+	paddd	%xmm0,%xmm1
+	pcmpeqd	%xmm5,%xmm0
+	movdqa	%xmm3,`16*($k+3)+112`(%r10)
+	movdqa	%xmm4,%xmm3
+___
+}
+$code.=<<___;				# last iteration can be optimized
+	paddd	%xmm1,%xmm2
+	pcmpeqd	%xmm5,%xmm1
+	movdqa	%xmm0,`16*($k+0)+112`(%r10)
+
+	paddd	%xmm2,%xmm3
+	.byte	0x67
+	pcmpeqd	%xmm5,%xmm2
+	movdqa	%xmm1,`16*($k+1)+112`(%r10)
+
+	pcmpeqd	%xmm5,%xmm3
+	movdqa	%xmm2,`16*($k+2)+112`(%r10)
+	pand	`16*($k+0)-128`($bp),%xmm0	# while it's still in register
+
+	pand	`16*($k+1)-128`($bp),%xmm1
+	pand	`16*($k+2)-128`($bp),%xmm2
+	movdqa	%xmm3,`16*($k+3)+112`(%r10)
+	pand	`16*($k+3)-128`($bp),%xmm3
 	por	%xmm2,%xmm0
+	por	%xmm3,%xmm1
+___
+for($k=0;$k<$STRIDE/16-4;$k+=4) {
+$code.=<<___;
+	movdqa	`16*($k+0)-128`($bp),%xmm4
+	movdqa	`16*($k+1)-128`($bp),%xmm5
+	movdqa	`16*($k+2)-128`($bp),%xmm2
+	pand	`16*($k+0)+112`(%r10),%xmm4
+	movdqa	`16*($k+3)-128`($bp),%xmm3
+	pand	`16*($k+1)+112`(%r10),%xmm5
+	por	%xmm4,%xmm0
+	pand	`16*($k+2)+112`(%r10),%xmm2
+	por	%xmm5,%xmm1
+	pand	`16*($k+3)+112`(%r10),%xmm3
+	por	%xmm2,%xmm0
+	por	%xmm3,%xmm1
+___
+}
+$code.=<<___;
+	por	%xmm1,%xmm0
+	pshufd	\$0x4e,%xmm0,%xmm1
+	por	%xmm1,%xmm0
 	lea	$STRIDE($bp),$bp
-	por	%xmm3,%xmm0
-
 	movq	%xmm0,$m0		# m0=bp[0]
 
 	mov	($n0),$n0		# pull n0[0] value
@@ -159,29 +212,14 @@ $code.=<<___;
 	xor	$i,$i			# i=0
 	xor	$j,$j			# j=0
 
-	movq	`0*$STRIDE/4-96`($bp),%xmm0
-	movq	`1*$STRIDE/4-96`($bp),%xmm1
-	pand	%xmm4,%xmm0
-	movq	`2*$STRIDE/4-96`($bp),%xmm2
-	pand	%xmm5,%xmm1
-
 	mov	$n0,$m1
 	mulq	$m0			# ap[0]*bp[0]
 	mov	%rax,$lo0
 	mov	($np),%rax
 
-	movq	`3*$STRIDE/4-96`($bp),%xmm3
-	pand	%xmm6,%xmm2
-	por	%xmm1,%xmm0
-	pand	%xmm7,%xmm3
-
 	imulq	$lo0,$m1		# "tp[0]"*n0
 	mov	%rdx,$hi0
 
-	por	%xmm2,%xmm0
-	lea	$STRIDE($bp),$bp
-	por	%xmm3,%xmm0
-
 	mulq	$m1			# np[0]*m1
 	add	%rax,$lo0		# discarded
 	mov	8($ap),%rax
@@ -212,16 +250,14 @@ $code.=<<___;
 
 	mulq	$m1			# np[j]*m1
 	cmp	$num,$j
-	jne	.L1st
-
-	movq	%xmm0,$m0		# bp[1]
+	jne	.L1st			# note that upon exit $j==$num, so
+					# they can be used interchangeably
 
 	add	%rax,$hi1
-	mov	($ap),%rax		# ap[0]
 	adc	\$0,%rdx
 	add	$hi0,$hi1		# np[j]*m1+ap[j]*bp[0]
 	adc	\$0,%rdx
-	mov	$hi1,-16(%rsp,$j,8)	# tp[j-1]
+	mov	$hi1,-16(%rsp,$num,8)	# tp[num-1]
 	mov	%rdx,$hi1
 	mov	$lo0,$hi0
 
@@ -235,33 +271,48 @@ $code.=<<___;
 	jmp	.Louter
 .align	16
 .Louter:
+	lea	24+128(%rsp,$num,8),%rdx	# where 256-byte mask is (+size optimization)
+	and	\$-16,%rdx
+	pxor	%xmm4,%xmm4
+	pxor	%xmm5,%xmm5
+___
+for($k=0;$k<$STRIDE/16;$k+=4) {
+$code.=<<___;
+	movdqa	`16*($k+0)-128`($bp),%xmm0
+	movdqa	`16*($k+1)-128`($bp),%xmm1
+	movdqa	`16*($k+2)-128`($bp),%xmm2
+	movdqa	`16*($k+3)-128`($bp),%xmm3
+	pand	`16*($k+0)-128`(%rdx),%xmm0
+	pand	`16*($k+1)-128`(%rdx),%xmm1
+	por	%xmm0,%xmm4
+	pand	`16*($k+2)-128`(%rdx),%xmm2
+	por	%xmm1,%xmm5
+	pand	`16*($k+3)-128`(%rdx),%xmm3
+	por	%xmm2,%xmm4
+	por	%xmm3,%xmm5
+___
+}
+$code.=<<___;
+	por	%xmm5,%xmm4
+	pshufd	\$0x4e,%xmm4,%xmm0
+	por	%xmm4,%xmm0
+	lea	$STRIDE($bp),$bp
+
+	mov	($ap),%rax		# ap[0]
+	movq	%xmm0,$m0		# m0=bp[i]
+
 	xor	$j,$j			# j=0
 	mov	$n0,$m1
 	mov	(%rsp),$lo0
 
-	movq	`0*$STRIDE/4-96`($bp),%xmm0
-	movq	`1*$STRIDE/4-96`($bp),%xmm1
-	pand	%xmm4,%xmm0
-	movq	`2*$STRIDE/4-96`($bp),%xmm2
-	pand	%xmm5,%xmm1
-
 	mulq	$m0			# ap[0]*bp[i]
 	add	%rax,$lo0		# ap[0]*bp[i]+tp[0]
 	mov	($np),%rax
 	adc	\$0,%rdx
 
-	movq	`3*$STRIDE/4-96`($bp),%xmm3
-	pand	%xmm6,%xmm2
-	por	%xmm1,%xmm0
-	pand	%xmm7,%xmm3
-
 	imulq	$lo0,$m1		# tp[0]*n0
 	mov	%rdx,$hi0
 
-	por	%xmm2,%xmm0
-	lea	$STRIDE($bp),$bp
-	por	%xmm3,%xmm0
-
 	mulq	$m1			# np[0]*m1
 	add	%rax,$lo0		# discarded
 	mov	8($ap),%rax
@@ -295,17 +346,14 @@ $code.=<<___;
 
 	mulq	$m1			# np[j]*m1
 	cmp	$num,$j
-	jne	.Linner
-
-	movq	%xmm0,$m0		# bp[i+1]
-
+	jne	.Linner			# note that upon exit $j==$num, so
+					# they can be used interchangeably
 	add	%rax,$hi1
-	mov	($ap),%rax		# ap[0]
 	adc	\$0,%rdx
 	add	$lo0,$hi1		# np[j]*m1+ap[j]*bp[i]+tp[j]
-	mov	(%rsp,$j,8),$lo0
+	mov	(%rsp,$num,8),$lo0
 	adc	\$0,%rdx
-	mov	$hi1,-16(%rsp,$j,8)	# tp[j-1]
+	mov	$hi1,-16(%rsp,$num,8)	# tp[num-1]
 	mov	%rdx,$hi1
 
 	xor	%rdx,%rdx
@@ -352,12 +400,7 @@ $code.=<<___;
 
 	mov	8(%rsp,$num,8),%rsi	# restore %rsp
 	mov	\$1,%rax
-___
-$code.=<<___ if ($win64);
-	movaps	-88(%rsi),%xmm6
-	movaps	-72(%rsi),%xmm7
-___
-$code.=<<___;
+
 	mov	-48(%rsi),%r15
 	mov	-40(%rsi),%r14
 	mov	-32(%rsi),%r13
@@ -379,8 +422,8 @@ bn_mul4x_mont_gather5:
 .Lmul4x_enter:
 ___
 $code.=<<___ if ($addx);
-	and	\$0x80100,%r11d
-	cmp	\$0x80100,%r11d
+	and	\$0x80108,%r11d
+	cmp	\$0x80108,%r11d		# check for AD*X+BMI2+BMI1
 	je	.Lmulx4x_enter
 ___
 $code.=<<___;
@@ -392,39 +435,34 @@ $code.=<<___;
 	push	%r13
 	push	%r14
 	push	%r15
-___
-$code.=<<___ if ($win64);
-	lea	-0x28(%rsp),%rsp
-	movaps	%xmm6,(%rsp)
-	movaps	%xmm7,0x10(%rsp)
-___
-$code.=<<___;
+
 	.byte	0x67
-	mov	${num}d,%r10d
-	shl	\$3,${num}d
-	shl	\$3+2,%r10d		# 4*$num
+	shl	\$3,${num}d		# convert $num to bytes
+	lea	($num,$num,2),%r10	# 3*$num in bytes
 	neg	$num			# -$num
 
 	##############################################################
-	# ensure that stack frame doesn't alias with $aptr+4*$num
-	# modulo 4096, which covers ret[num], am[num] and n[2*num]
-	# (see bn_exp.c). this is done to allow memory disambiguation
-	# logic do its magic. [excessive frame is allocated in order
-	# to allow bn_from_mont8x to clear it.]
+	# Ensure that stack frame doesn't alias with $rptr+3*$num
+	# modulo 4096, which covers ret[num], am[num] and n[num]
+	# (see bn_exp.c). This is done to allow memory disambiguation
+	# logic do its magic. [Extra [num] is allocated in order
+	# to align with bn_power5's frame, which is cleansed after
+	# completing exponentiation. Extra 256 bytes is for power mask
+	# calculated from 7th argument, the index.]
 	#
-	lea	-64(%rsp,$num,2),%r11
-	sub	$ap,%r11
+	lea	-320(%rsp,$num,2),%r11
+	sub	$rp,%r11
 	and	\$4095,%r11
 	cmp	%r11,%r10
 	jb	.Lmul4xsp_alt
-	sub	%r11,%rsp		# align with $ap
-	lea	-64(%rsp,$num,2),%rsp	# alloca(128+num*8)
+	sub	%r11,%rsp		# align with $rp
+	lea	-320(%rsp,$num,2),%rsp	# alloca(frame+2*num*8+256)
 	jmp	.Lmul4xsp_done
 
 .align	32
 .Lmul4xsp_alt:
-	lea	4096-64(,$num,2),%r10
-	lea	-64(%rsp,$num,2),%rsp	# alloca(128+num*8)
+	lea	4096-320(,$num,2),%r10
+	lea	-320(%rsp,$num,2),%rsp	# alloca(frame+2*num*8+256)
 	sub	%r10,%r11
 	mov	\$0,%r10
 	cmovc	%r10,%r11
@@ -440,12 +478,7 @@ $code.=<<___;
 
 	mov	40(%rsp),%rsi		# restore %rsp
 	mov	\$1,%rax
-___
-$code.=<<___ if ($win64);
-	movaps	-88(%rsi),%xmm6
-	movaps	-72(%rsi),%xmm7
-___
-$code.=<<___;
+
 	mov	-48(%rsi),%r15
 	mov	-40(%rsi),%r14
 	mov	-32(%rsi),%r13
@@ -460,9 +493,10 @@ $code.=<<___;
 .type	mul4x_internal,\@abi-omnipotent
 .align	32
 mul4x_internal:
-	shl	\$5,$num
-	mov	`($win64?56:8)`(%rax),%r10d	# load 7th argument
-	lea	256(%rdx,$num),%r13
+	shl	\$5,$num		# $num was in bytes
+	movd	`($win64?56:8)`(%rax),%xmm5	# load 7th argument, index
+	lea	.Linc(%rip),%rax
+	lea	128(%rdx,$num),%r13	# end of powers table (+size optimization)
 	shr	\$5,$num		# restore $num
 ___
 		$bp="%r12";
@@ -470,44 +504,92 @@ ___
 		$N=$STRIDE/4;		# should match cache line size
 		$tp=$i;
 $code.=<<___;
-	mov	%r10,%r11
-	shr	\$`log($N/8)/log(2)`,%r10
-	and	\$`$N/8-1`,%r11
-	not	%r10
-	lea	.Lmagic_masks(%rip),%rax
-	and	\$`2**5/($N/8)-1`,%r10	# 5 is "window size"
-	lea	96(%rdx,%r11,8),$bp	# pointer within 1st cache line
-	movq	0(%rax,%r10,8),%xmm4	# set of masks denoting which
-	movq	8(%rax,%r10,8),%xmm5	# cache line contains element
-	add	\$7,%r11
-	movq	16(%rax,%r10,8),%xmm6	# denoted by 7th argument
-	movq	24(%rax,%r10,8),%xmm7
-	and	\$7,%r11
-
-	movq	`0*$STRIDE/4-96`($bp),%xmm0
-	lea	$STRIDE($bp),$tp	# borrow $tp
-	movq	`1*$STRIDE/4-96`($bp),%xmm1
-	pand	%xmm4,%xmm0
-	movq	`2*$STRIDE/4-96`($bp),%xmm2
-	pand	%xmm5,%xmm1
-	movq	`3*$STRIDE/4-96`($bp),%xmm3
-	pand	%xmm6,%xmm2
-	.byte	0x67
-	por	%xmm1,%xmm0
-	movq	`0*$STRIDE/4-96`($tp),%xmm1
-	.byte	0x67
-	pand	%xmm7,%xmm3
-	.byte	0x67
-	por	%xmm2,%xmm0
-	movq	`1*$STRIDE/4-96`($tp),%xmm2
+	movdqa	0(%rax),%xmm0		# 00000001000000010000000000000000
+	movdqa	16(%rax),%xmm1		# 00000002000000020000000200000002
+	lea	88-112(%rsp,$num),%r10	# place the mask after tp[num+1] (+ICache optimization)
+	lea	128(%rdx),$bp		# size optimization
+
+	pshufd	\$0,%xmm5,%xmm5		# broadcast index
+	movdqa	%xmm1,%xmm4
+	.byte	0x67,0x67
+	movdqa	%xmm1,%xmm2
+___
+########################################################################
+# calculate mask by comparing 0..31 to index and save result to stack
+#
+$code.=<<___;
+	paddd	%xmm0,%xmm1
+	pcmpeqd	%xmm5,%xmm0		# compare to 1,0
 	.byte	0x67
-	pand	%xmm4,%xmm1
+	movdqa	%xmm4,%xmm3
+___
+for($i=0;$i<$STRIDE/16-4;$i+=4) {
+$code.=<<___;
+	paddd	%xmm1,%xmm2
+	pcmpeqd	%xmm5,%xmm1		# compare to 3,2
+	movdqa	%xmm0,`16*($i+0)+112`(%r10)
+	movdqa	%xmm4,%xmm0
+
+	paddd	%xmm2,%xmm3
+	pcmpeqd	%xmm5,%xmm2		# compare to 5,4
+	movdqa	%xmm1,`16*($i+1)+112`(%r10)
+	movdqa	%xmm4,%xmm1
+
+	paddd	%xmm3,%xmm0
+	pcmpeqd	%xmm5,%xmm3		# compare to 7,6
+	movdqa	%xmm2,`16*($i+2)+112`(%r10)
+	movdqa	%xmm4,%xmm2
+
+	paddd	%xmm0,%xmm1
+	pcmpeqd	%xmm5,%xmm0
+	movdqa	%xmm3,`16*($i+3)+112`(%r10)
+	movdqa	%xmm4,%xmm3
+___
+}
+$code.=<<___;				# last iteration can be optimized
+	paddd	%xmm1,%xmm2
+	pcmpeqd	%xmm5,%xmm1
+	movdqa	%xmm0,`16*($i+0)+112`(%r10)
+
+	paddd	%xmm2,%xmm3
 	.byte	0x67
-	por	%xmm3,%xmm0
-	movq	`2*$STRIDE/4-96`($tp),%xmm3
+	pcmpeqd	%xmm5,%xmm2
+	movdqa	%xmm1,`16*($i+1)+112`(%r10)
 
+	pcmpeqd	%xmm5,%xmm3
+	movdqa	%xmm2,`16*($i+2)+112`(%r10)
+	pand	`16*($i+0)-128`($bp),%xmm0	# while it's still in register
+
+	pand	`16*($i+1)-128`($bp),%xmm1
+	pand	`16*($i+2)-128`($bp),%xmm2
+	movdqa	%xmm3,`16*($i+3)+112`(%r10)
+	pand	`16*($i+3)-128`($bp),%xmm3
+	por	%xmm2,%xmm0
+	por	%xmm3,%xmm1
+___
+for($i=0;$i<$STRIDE/16-4;$i+=4) {
+$code.=<<___;
+	movdqa	`16*($i+0)-128`($bp),%xmm4
+	movdqa	`16*($i+1)-128`($bp),%xmm5
+	movdqa	`16*($i+2)-128`($bp),%xmm2
+	pand	`16*($i+0)+112`(%r10),%xmm4
+	movdqa	`16*($i+3)-128`($bp),%xmm3
+	pand	`16*($i+1)+112`(%r10),%xmm5
+	por	%xmm4,%xmm0
+	pand	`16*($i+2)+112`(%r10),%xmm2
+	por	%xmm5,%xmm1
+	pand	`16*($i+3)+112`(%r10),%xmm3
+	por	%xmm2,%xmm0
+	por	%xmm3,%xmm1
+___
+}
+$code.=<<___;
+	por	%xmm1,%xmm0
+	pshufd	\$0x4e,%xmm0,%xmm1
+	por	%xmm1,%xmm0
+	lea	$STRIDE($bp),$bp
 	movq	%xmm0,$m0		# m0=bp[0]
-	movq	`3*$STRIDE/4-96`($tp),%xmm0
+
 	mov	%r13,16+8(%rsp)		# save end of b[num]
 	mov	$rp, 56+8(%rsp)		# save $rp
 
@@ -521,26 +603,10 @@ $code.=<<___;
 	mov	%rax,$A[0]
 	mov	($np),%rax
 
-	pand	%xmm5,%xmm2
-	pand	%xmm6,%xmm3
-	por	%xmm2,%xmm1
-
 	imulq	$A[0],$m1		# "tp[0]"*n0
-	##############################################################
-	# $tp is chosen so that writing to top-most element of the
-	# vector occurs just "above" references to powers table,
-	# "above" modulo cache-line size, which effectively precludes
-	# possibility of memory disambiguation logic failure when
-	# accessing the table.
-	# 
-	lea	64+8(%rsp,%r11,8),$tp
+	lea	64+8(%rsp),$tp
 	mov	%rdx,$A[1]
 
-	pand	%xmm7,%xmm0
-	por	%xmm3,%xmm1
-	lea	2*$STRIDE($bp),$bp
-	por	%xmm1,%xmm0
-
 	mulq	$m1			# np[0]*m1
 	add	%rax,$A[0]		# discarded
 	mov	8($ap,$num),%rax
@@ -549,7 +615,7 @@ $code.=<<___;
 
 	mulq	$m0
 	add	%rax,$A[1]
-	mov	16*1($np),%rax		# interleaved with 0, therefore 16*n
+	mov	8*1($np),%rax
 	adc	\$0,%rdx
 	mov	%rdx,$A[0]
 
@@ -559,7 +625,7 @@ $code.=<<___;
 	adc	\$0,%rdx
 	add	$A[1],$N[1]
 	lea	4*8($num),$j		# j=4
-	lea	16*4($np),$np
+	lea	8*4($np),$np
 	adc	\$0,%rdx
 	mov	$N[1],($tp)
 	mov	%rdx,$N[0]
@@ -569,7 +635,7 @@ $code.=<<___;
 .L1st4x:
 	mulq	$m0			# ap[j]*bp[0]
 	add	%rax,$A[0]
-	mov	-16*2($np),%rax
+	mov	-8*2($np),%rax
 	lea	32($tp),$tp
 	adc	\$0,%rdx
 	mov	%rdx,$A[1]
@@ -585,7 +651,7 @@ $code.=<<___;
 
 	mulq	$m0			# ap[j]*bp[0]
 	add	%rax,$A[1]
-	mov	-16*1($np),%rax
+	mov	-8*1($np),%rax
 	adc	\$0,%rdx
 	mov	%rdx,$A[0]
 
@@ -600,7 +666,7 @@ $code.=<<___;
 
 	mulq	$m0			# ap[j]*bp[0]
 	add	%rax,$A[0]
-	mov	16*0($np),%rax
+	mov	8*0($np),%rax
 	adc	\$0,%rdx
 	mov	%rdx,$A[1]
 
@@ -615,7 +681,7 @@ $code.=<<___;
 
 	mulq	$m0			# ap[j]*bp[0]
 	add	%rax,$A[1]
-	mov	16*1($np),%rax
+	mov	8*1($np),%rax
 	adc	\$0,%rdx
 	mov	%rdx,$A[0]
 
@@ -624,7 +690,7 @@ $code.=<<___;
 	mov	16($ap,$j),%rax
 	adc	\$0,%rdx
 	add	$A[1],$N[1]		# np[j]*m1+ap[j]*bp[0]
-	lea	16*4($np),$np
+	lea	8*4($np),$np
 	adc	\$0,%rdx
 	mov	$N[1],($tp)		# tp[j-1]
 	mov	%rdx,$N[0]
@@ -634,7 +700,7 @@ $code.=<<___;
 
 	mulq	$m0			# ap[j]*bp[0]
 	add	%rax,$A[0]
-	mov	-16*2($np),%rax
+	mov	-8*2($np),%rax
 	lea	32($tp),$tp
 	adc	\$0,%rdx
 	mov	%rdx,$A[1]
@@ -650,7 +716,7 @@ $code.=<<___;
 
 	mulq	$m0			# ap[j]*bp[0]
 	add	%rax,$A[1]
-	mov	-16*1($np),%rax
+	mov	-8*1($np),%rax
 	adc	\$0,%rdx
 	mov	%rdx,$A[0]
 
@@ -663,8 +729,7 @@ $code.=<<___;
 	mov	$N[1],-16($tp)		# tp[j-1]
 	mov	%rdx,$N[0]
 
-	movq	%xmm0,$m0		# bp[1]
-	lea	($np,$num,2),$np	# rewind $np
+	lea	($np,$num),$np		# rewind $np
 
 	xor	$N[1],$N[1]
 	add	$A[0],$N[0]
@@ -675,6 +740,33 @@ $code.=<<___;
 
 .align	32
 .Louter4x:
+	lea	16+128($tp),%rdx	# where 256-byte mask is (+size optimization)
+	pxor	%xmm4,%xmm4
+	pxor	%xmm5,%xmm5
+___
+for($i=0;$i<$STRIDE/16;$i+=4) {
+$code.=<<___;
+	movdqa	`16*($i+0)-128`($bp),%xmm0
+	movdqa	`16*($i+1)-128`($bp),%xmm1
+	movdqa	`16*($i+2)-128`($bp),%xmm2
+	movdqa	`16*($i+3)-128`($bp),%xmm3
+	pand	`16*($i+0)-128`(%rdx),%xmm0
+	pand	`16*($i+1)-128`(%rdx),%xmm1
+	por	%xmm0,%xmm4
+	pand	`16*($i+2)-128`(%rdx),%xmm2
+	por	%xmm1,%xmm5
+	pand	`16*($i+3)-128`(%rdx),%xmm3
+	por	%xmm2,%xmm4
+	por	%xmm3,%xmm5
+___
+}
+$code.=<<___;
+	por	%xmm5,%xmm4
+	pshufd	\$0x4e,%xmm4,%xmm0
+	por	%xmm4,%xmm0
+	lea	$STRIDE($bp),$bp
+	movq	%xmm0,$m0		# m0=bp[i]
+
 	mov	($tp,$num),$A[0]
 	mov	$n0,$m1
 	mulq	$m0			# ap[0]*bp[i]
@@ -682,25 +774,11 @@ $code.=<<___;
 	mov	($np),%rax
 	adc	\$0,%rdx
 
-	movq	`0*$STRIDE/4-96`($bp),%xmm0
-	movq	`1*$STRIDE/4-96`($bp),%xmm1
-	pand	%xmm4,%xmm0
-	movq	`2*$STRIDE/4-96`($bp),%xmm2
-	pand	%xmm5,%xmm1
-	movq	`3*$STRIDE/4-96`($bp),%xmm3
-
 	imulq	$A[0],$m1		# tp[0]*n0
-	.byte	0x67
 	mov	%rdx,$A[1]
 	mov	$N[1],($tp)		# store upmost overflow bit
 
-	pand	%xmm6,%xmm2
-	por	%xmm1,%xmm0
-	pand	%xmm7,%xmm3
-	por	%xmm2,%xmm0
 	lea	($tp,$num),$tp		# rewind $tp
-	lea	$STRIDE($bp),$bp
-	por	%xmm3,%xmm0
 
 	mulq	$m1			# np[0]*m1
 	add	%rax,$A[0]		# "$N[0]", discarded
@@ -710,7 +788,7 @@ $code.=<<___;
 
 	mulq	$m0			# ap[j]*bp[i]
 	add	%rax,$A[1]
-	mov	16*1($np),%rax		# interleaved with 0, therefore 16*n
+	mov	8*1($np),%rax
 	adc	\$0,%rdx
 	add	8($tp),$A[1]		# +tp[1]
 	adc	\$0,%rdx
@@ -722,7 +800,7 @@ $code.=<<___;
 	adc	\$0,%rdx
 	add	$A[1],$N[1]		# np[j]*m1+ap[j]*bp[i]+tp[j]
 	lea	4*8($num),$j		# j=4
-	lea	16*4($np),$np
+	lea	8*4($np),$np
 	adc	\$0,%rdx
 	mov	%rdx,$N[0]
 	jmp	.Linner4x
@@ -731,7 +809,7 @@ $code.=<<___;
 .Linner4x:
 	mulq	$m0			# ap[j]*bp[i]
 	add	%rax,$A[0]
-	mov	-16*2($np),%rax
+	mov	-8*2($np),%rax
 	adc	\$0,%rdx
 	add	16($tp),$A[0]		# ap[j]*bp[i]+tp[j]
 	lea	32($tp),$tp
@@ -749,7 +827,7 @@ $code.=<<___;
 
 	mulq	$m0			# ap[j]*bp[i]
 	add	%rax,$A[1]
-	mov	-16*1($np),%rax
+	mov	-8*1($np),%rax
 	adc	\$0,%rdx
 	add	-8($tp),$A[1]
 	adc	\$0,%rdx
@@ -766,7 +844,7 @@ $code.=<<___;
 
 	mulq	$m0			# ap[j]*bp[i]
 	add	%rax,$A[0]
-	mov	16*0($np),%rax
+	mov	8*0($np),%rax
 	adc	\$0,%rdx
 	add	($tp),$A[0]		# ap[j]*bp[i]+tp[j]
 	adc	\$0,%rdx
@@ -783,7 +861,7 @@ $code.=<<___;
 
 	mulq	$m0			# ap[j]*bp[i]
 	add	%rax,$A[1]
-	mov	16*1($np),%rax
+	mov	8*1($np),%rax
 	adc	\$0,%rdx
 	add	8($tp),$A[1]
 	adc	\$0,%rdx
@@ -794,7 +872,7 @@ $code.=<<___;
 	mov	16($ap,$j),%rax
 	adc	\$0,%rdx
 	add	$A[1],$N[1]
-	lea	16*4($np),$np
+	lea	8*4($np),$np
 	adc	\$0,%rdx
 	mov	$N[0],-8($tp)		# tp[j-1]
 	mov	%rdx,$N[0]
@@ -804,7 +882,7 @@ $code.=<<___;
 
 	mulq	$m0			# ap[j]*bp[i]
 	add	%rax,$A[0]
-	mov	-16*2($np),%rax
+	mov	-8*2($np),%rax
 	adc	\$0,%rdx
 	add	16($tp),$A[0]		# ap[j]*bp[i]+tp[j]
 	lea	32($tp),$tp
@@ -823,7 +901,7 @@ $code.=<<___;
 	mulq	$m0			# ap[j]*bp[i]
 	add	%rax,$A[1]
 	mov	$m1,%rax
-	mov	-16*1($np),$m1
+	mov	-8*1($np),$m1
 	adc	\$0,%rdx
 	add	-8($tp),$A[1]
 	adc	\$0,%rdx
@@ -838,9 +916,8 @@ $code.=<<___;
 	mov	$N[0],-24($tp)		# tp[j-1]
 	mov	%rdx,$N[0]
 
-	movq	%xmm0,$m0		# bp[i+1]
 	mov	$N[1],-16($tp)		# tp[j-1]
-	lea	($np,$num,2),$np	# rewind $np
+	lea	($np,$num),$np		# rewind $np
 
 	xor	$N[1],$N[1]
 	add	$A[0],$N[0]
@@ -854,16 +931,23 @@ $code.=<<___;
 ___
 if (1) {
 $code.=<<___;
+	xor	%rax,%rax
 	sub	$N[0],$m1		# compare top-most words
 	adc	$j,$j			# $j is zero
 	or	$j,$N[1]
-	xor	\$1,$N[1]
+	sub	$N[1],%rax		# %rax=-$N[1]
 	lea	($tp,$num),%rbx		# tptr in .sqr4x_sub
-	lea	($np,$N[1],8),%rbp	# nptr in .sqr4x_sub
+	mov	($np),%r12
+	lea	($np),%rbp		# nptr in .sqr4x_sub
 	mov	%r9,%rcx
-	sar	\$3+2,%rcx		# cf=0
+	sar	\$3+2,%rcx
 	mov	56+8(%rsp),%rdi		# rptr in .sqr4x_sub
-	jmp	.Lsqr4x_sub
+	dec	%r12			# so that after 'not' we get -n[0]
+	xor	%r10,%r10
+	mov	8*1(%rbp),%r13
+	mov	8*2(%rbp),%r14
+	mov	8*3(%rbp),%r15
+	jmp	.Lsqr4x_sub_entry
 ___
 } else {
 my @ri=("%rax",$bp,$m0,$m1);
@@ -930,8 +1014,8 @@ bn_power5:
 ___
 $code.=<<___ if ($addx);
 	mov	OPENSSL_ia32cap_P+8(%rip),%r11d
-	and	\$0x80100,%r11d
-	cmp	\$0x80100,%r11d
+	and	\$0x80108,%r11d
+	cmp	\$0x80108,%r11d		# check for AD*X+BMI2+BMI1
 	je	.Lpowerx5_enter
 ___
 $code.=<<___;
@@ -942,38 +1026,32 @@ $code.=<<___;
 	push	%r13
 	push	%r14
 	push	%r15
-___
-$code.=<<___ if ($win64);
-	lea	-0x28(%rsp),%rsp
-	movaps	%xmm6,(%rsp)
-	movaps	%xmm7,0x10(%rsp)
-___
-$code.=<<___;
-	mov	${num}d,%r10d
+
 	shl	\$3,${num}d		# convert $num to bytes
-	shl	\$3+2,%r10d		# 4*$num
+	lea	($num,$num,2),%r10d	# 3*$num
 	neg	$num
 	mov	($n0),$n0		# *n0
 
 	##############################################################
-	# ensure that stack frame doesn't alias with $aptr+4*$num
-	# modulo 4096, which covers ret[num], am[num] and n[2*num]
-	# (see bn_exp.c). this is done to allow memory disambiguation
-	# logic do its magic.
+	# Ensure that stack frame doesn't alias with $rptr+3*$num
+	# modulo 4096, which covers ret[num], am[num] and n[num]
+	# (see bn_exp.c). This is done to allow memory disambiguation
+	# logic do its magic. [Extra 256 bytes is for power mask
+	# calculated from 7th argument, the index.]
 	#
-	lea	-64(%rsp,$num,2),%r11
-	sub	$aptr,%r11
+	lea	-320(%rsp,$num,2),%r11
+	sub	$rptr,%r11
 	and	\$4095,%r11
 	cmp	%r11,%r10
 	jb	.Lpwr_sp_alt
 	sub	%r11,%rsp		# align with $aptr
-	lea	-64(%rsp,$num,2),%rsp	# alloca(frame+2*$num)
+	lea	-320(%rsp,$num,2),%rsp	# alloca(frame+2*num*8+256)
 	jmp	.Lpwr_sp_done
 
 .align	32
 .Lpwr_sp_alt:
-	lea	4096-64(,$num,2),%r10	# 4096-frame-2*$num
-	lea	-64(%rsp,$num,2),%rsp	# alloca(frame+2*$num)
+	lea	4096-320(,$num,2),%r10
+	lea	-320(%rsp,$num,2),%rsp	# alloca(frame+2*num*8+256)
 	sub	%r10,%r11
 	mov	\$0,%r10
 	cmovc	%r10,%r11
@@ -995,16 +1073,21 @@ $code.=<<___;
 	mov	$n0,  32(%rsp)
 	mov	%rax, 40(%rsp)		# save original %rsp
 .Lpower5_body:
-	movq	$rptr,%xmm1		# save $rptr
+	movq	$rptr,%xmm1		# save $rptr, used in sqr8x
 	movq	$nptr,%xmm2		# save $nptr
-	movq	%r10, %xmm3		# -$num
+	movq	%r10, %xmm3		# -$num, used in sqr8x
 	movq	$bptr,%xmm4
 
 	call	__bn_sqr8x_internal
+	call	__bn_post4x_internal
 	call	__bn_sqr8x_internal
+	call	__bn_post4x_internal
 	call	__bn_sqr8x_internal
+	call	__bn_post4x_internal
 	call	__bn_sqr8x_internal
+	call	__bn_post4x_internal
 	call	__bn_sqr8x_internal
+	call	__bn_post4x_internal
 
 	movq	%xmm2,$nptr
 	movq	%xmm4,$bptr
@@ -1565,9 +1648,9 @@ my ($nptr,$tptr,$carry,$m0)=("%rbp","%rdi","%rsi","%rbx");
 
 $code.=<<___;
 	movq	%xmm2,$nptr
-sqr8x_reduction:
+__bn_sqr8x_reduction:
 	xor	%rax,%rax
-	lea	($nptr,$num,2),%rcx	# end of n[]
+	lea	($nptr,$num),%rcx	# end of n[]
 	lea	48+8(%rsp,$num,2),%rdx	# end of t[] buffer
 	mov	%rcx,0+8(%rsp)
 	lea	48+8(%rsp,$num),$tptr	# end of initial t[] window
@@ -1593,21 +1676,21 @@ sqr8x_reduction:
 	.byte	0x67
 	mov	$m0,%r8
 	imulq	32+8(%rsp),$m0		# n0*a[0]
-	mov	16*0($nptr),%rax	# n[0]
+	mov	8*0($nptr),%rax		# n[0]
 	mov	\$8,%ecx
 	jmp	.L8x_reduce
 
 .align	32
 .L8x_reduce:
 	mulq	$m0
-	 mov	16*1($nptr),%rax	# n[1]
+	 mov	8*1($nptr),%rax		# n[1]
 	neg	%r8
 	mov	%rdx,%r8
 	adc	\$0,%r8
 
 	mulq	$m0
 	add	%rax,%r9
-	 mov	16*2($nptr),%rax
+	 mov	8*2($nptr),%rax
 	adc	\$0,%rdx
 	add	%r9,%r8
 	 mov	$m0,48-8+8(%rsp,%rcx,8)	# put aside n0*a[i]
@@ -1616,7 +1699,7 @@ sqr8x_reduction:
 
 	mulq	$m0
 	add	%rax,%r10
-	 mov	16*3($nptr),%rax
+	 mov	8*3($nptr),%rax
 	adc	\$0,%rdx
 	add	%r10,%r9
 	 mov	32+8(%rsp),$carry	# pull n0, borrow $carry
@@ -1625,7 +1708,7 @@ sqr8x_reduction:
 
 	mulq	$m0
 	add	%rax,%r11
-	 mov	16*4($nptr),%rax
+	 mov	8*4($nptr),%rax
 	adc	\$0,%rdx
 	 imulq	%r8,$carry		# modulo-scheduled
 	add	%r11,%r10
@@ -1634,7 +1717,7 @@ sqr8x_reduction:
 
 	mulq	$m0
 	add	%rax,%r12
-	 mov	16*5($nptr),%rax
+	 mov	8*5($nptr),%rax
 	adc	\$0,%rdx
 	add	%r12,%r11
 	mov	%rdx,%r12
@@ -1642,7 +1725,7 @@ sqr8x_reduction:
 
 	mulq	$m0
 	add	%rax,%r13
-	 mov	16*6($nptr),%rax
+	 mov	8*6($nptr),%rax
 	adc	\$0,%rdx
 	add	%r13,%r12
 	mov	%rdx,%r13
@@ -1650,7 +1733,7 @@ sqr8x_reduction:
 
 	mulq	$m0
 	add	%rax,%r14
-	 mov	16*7($nptr),%rax
+	 mov	8*7($nptr),%rax
 	adc	\$0,%rdx
 	add	%r14,%r13
 	mov	%rdx,%r14
@@ -1659,7 +1742,7 @@ sqr8x_reduction:
 	mulq	$m0
 	 mov	$carry,$m0		# n0*a[i]
 	add	%rax,%r15
-	 mov	16*0($nptr),%rax	# n[0]
+	 mov	8*0($nptr),%rax		# n[0]
 	adc	\$0,%rdx
 	add	%r15,%r14
 	mov	%rdx,%r15
@@ -1668,7 +1751,7 @@ sqr8x_reduction:
 	dec	%ecx
 	jnz	.L8x_reduce
 
-	lea	16*8($nptr),$nptr
+	lea	8*8($nptr),$nptr
 	xor	%rax,%rax
 	mov	8+8(%rsp),%rdx		# pull end of t[]
 	cmp	0+8(%rsp),$nptr		# end of n[]?
@@ -1687,21 +1770,21 @@ sqr8x_reduction:
 
 	mov	48+56+8(%rsp),$m0	# pull n0*a[0]
 	mov	\$8,%ecx
-	mov	16*0($nptr),%rax
+	mov	8*0($nptr),%rax
 	jmp	.L8x_tail
 
 .align	32
 .L8x_tail:
 	mulq	$m0
 	add	%rax,%r8
-	 mov	16*1($nptr),%rax
+	 mov	8*1($nptr),%rax
 	 mov	%r8,($tptr)		# save result
 	mov	%rdx,%r8
 	adc	\$0,%r8
 
 	mulq	$m0
 	add	%rax,%r9
-	 mov	16*2($nptr),%rax
+	 mov	8*2($nptr),%rax
 	adc	\$0,%rdx
 	add	%r9,%r8
 	 lea	8($tptr),$tptr		# $tptr++
@@ -1710,7 +1793,7 @@ sqr8x_reduction:
 
 	mulq	$m0
 	add	%rax,%r10
-	 mov	16*3($nptr),%rax
+	 mov	8*3($nptr),%rax
 	adc	\$0,%rdx
 	add	%r10,%r9
 	mov	%rdx,%r10
@@ -1718,7 +1801,7 @@ sqr8x_reduction:
 
 	mulq	$m0
 	add	%rax,%r11
-	 mov	16*4($nptr),%rax
+	 mov	8*4($nptr),%rax
 	adc	\$0,%rdx
 	add	%r11,%r10
 	mov	%rdx,%r11
@@ -1726,7 +1809,7 @@ sqr8x_reduction:
 
 	mulq	$m0
 	add	%rax,%r12
-	 mov	16*5($nptr),%rax
+	 mov	8*5($nptr),%rax
 	adc	\$0,%rdx
 	add	%r12,%r11
 	mov	%rdx,%r12
@@ -1734,7 +1817,7 @@ sqr8x_reduction:
 
 	mulq	$m0
 	add	%rax,%r13
-	 mov	16*6($nptr),%rax
+	 mov	8*6($nptr),%rax
 	adc	\$0,%rdx
 	add	%r13,%r12
 	mov	%rdx,%r13
@@ -1742,7 +1825,7 @@ sqr8x_reduction:
 
 	mulq	$m0
 	add	%rax,%r14
-	 mov	16*7($nptr),%rax
+	 mov	8*7($nptr),%rax
 	adc	\$0,%rdx
 	add	%r14,%r13
 	mov	%rdx,%r14
@@ -1753,14 +1836,14 @@ sqr8x_reduction:
 	add	%rax,%r15
 	adc	\$0,%rdx
 	add	%r15,%r14
-	 mov	16*0($nptr),%rax	# pull n[0]
+	 mov	8*0($nptr),%rax		# pull n[0]
 	mov	%rdx,%r15
 	adc	\$0,%r15
 
 	dec	%ecx
 	jnz	.L8x_tail
 
-	lea	16*8($nptr),$nptr
+	lea	8*8($nptr),$nptr
 	mov	8+8(%rsp),%rdx		# pull end of t[]
 	cmp	0+8(%rsp),$nptr		# end of n[]?
 	jae	.L8x_tail_done		# break out of loop
@@ -1806,7 +1889,7 @@ sqr8x_reduction:
 	adc	8*6($tptr),%r14
 	adc	8*7($tptr),%r15
 	adc	\$0,%rax		# top-most carry
-	 mov	-16($nptr),%rcx		# np[num-1]
+	 mov	-8($nptr),%rcx		# np[num-1]
 	 xor	$carry,$carry
 
 	movq	%xmm2,$nptr		# restore $nptr
@@ -1824,6 +1907,8 @@ sqr8x_reduction:
 
 	cmp	%rdx,$tptr		# end of t[]?
 	jb	.L8x_reduction_loop
+	ret
+.size	bn_sqr8x_internal,.-bn_sqr8x_internal
 ___
 }

 ##############################################################
@@ -1832,48 +1917,62 @@ ___
 {
 my ($tptr,$nptr)=("%rbx","%rbp");
 $code.=<<___;
-	#xor	%rsi,%rsi		# %rsi was $carry above
-	sub	%r15,%rcx		# compare top-most words
+.type	__bn_post4x_internal,\@abi-omnipotent
+.align	32
+__bn_post4x_internal:
+	mov	8*0($nptr),%r12
 	lea	(%rdi,$num),$tptr	# %rdi was $tptr above
-	adc	%rsi,%rsi
 	mov	$num,%rcx
-	or	%rsi,%rax
 	movq	%xmm1,$rptr		# restore $rptr
-	xor	\$1,%rax
+	neg	%rax
 	movq	%xmm1,$aptr		# prepare for back-to-back call
-	lea	($nptr,%rax,8),$nptr
-	sar	\$3+2,%rcx		# cf=0
-	jmp	.Lsqr4x_sub
+	sar	\$3+2,%rcx
+	dec	%r12			# so that after 'not' we get -n[0]
+	xor	%r10,%r10
+	mov	8*1($nptr),%r13
+	mov	8*2($nptr),%r14
+	mov	8*3($nptr),%r15
+	jmp	.Lsqr4x_sub_entry
 
-.align	32
+.align	16
 .Lsqr4x_sub:
-	.byte	0x66
-	mov	8*0($tptr),%r12
-	mov	8*1($tptr),%r13
-	sbb	16*0($nptr),%r12
-	mov	8*2($tptr),%r14
-	sbb	16*1($nptr),%r13
-	mov	8*3($tptr),%r15
-	lea	8*4($tptr),$tptr
-	sbb	16*2($nptr),%r14
+	mov	8*0($nptr),%r12
+	mov	8*1($nptr),%r13
+	mov	8*2($nptr),%r14
+	mov	8*3($nptr),%r15
+.Lsqr4x_sub_entry:
+	lea	8*4($nptr),$nptr
+	not	%r12
+	not	%r13
+	not	%r14
+	not	%r15
+	and	%rax,%r12
+	and	%rax,%r13
+	and	%rax,%r14
+	and	%rax,%r15
+
+	neg	%r10			# mov %r10,%cf
+	adc	8*0($tptr),%r12
+	adc	8*1($tptr),%r13
+	adc	8*2($tptr),%r14
+	adc	8*3($tptr),%r15
 	mov	%r12,8*0($rptr)
-	sbb	16*3($nptr),%r15
-	lea	16*4($nptr),$nptr
+	lea	8*4($tptr),$tptr
 	mov	%r13,8*1($rptr)
+	sbb	%r10,%r10		# mov %cf,%r10
 	mov	%r14,8*2($rptr)
 	mov	%r15,8*3($rptr)
 	lea	8*4($rptr),$rptr
 
 	inc	%rcx			# pass %cf
 	jnz	.Lsqr4x_sub
-___
-}
-$code.=<<___;
+
 	mov	$num,%r10		# prepare for back-to-back call
 	neg	$num			# restore $num	
 	ret
-.size	bn_sqr8x_internal,.-bn_sqr8x_internal
+.size	__bn_post4x_internal,.-__bn_post4x_internal
 ___
+}
 {
 $code.=<<___;
 .globl	bn_from_montgomery
@@ -1897,39 +1996,32 @@ bn_from_mont8x:
 	push	%r13
 	push	%r14
 	push	%r15
-___
-$code.=<<___ if ($win64);
-	lea	-0x28(%rsp),%rsp
-	movaps	%xmm6,(%rsp)
-	movaps	%xmm7,0x10(%rsp)
-___
-$code.=<<___;
-	.byte	0x67
-	mov	${num}d,%r10d
+
 	shl	\$3,${num}d		# convert $num to bytes
-	shl	\$3+2,%r10d		# 4*$num
+	lea	($num,$num,2),%r10	# 3*$num in bytes
 	neg	$num
 	mov	($n0),$n0		# *n0
 
 	##############################################################
-	# ensure that stack frame doesn't alias with $aptr+4*$num
-	# modulo 4096, which covers ret[num], am[num] and n[2*num]
-	# (see bn_exp.c). this is done to allow memory disambiguation
-	# logic do its magic.
+	# Ensure that stack frame doesn't alias with $rptr+3*$num
+	# modulo 4096, which covers ret[num], am[num] and n[num]
+	# (see bn_exp.c). The stack is allocated to aligned with
+	# bn_power5's frame, and as bn_from_montgomery happens to be
+	# last operation, we use the opportunity to cleanse it.
 	#
-	lea	-64(%rsp,$num,2),%r11
-	sub	$aptr,%r11
+	lea	-320(%rsp,$num,2),%r11
+	sub	$rptr,%r11
 	and	\$4095,%r11
 	cmp	%r11,%r10
 	jb	.Lfrom_sp_alt
 	sub	%r11,%rsp		# align with $aptr
-	lea	-64(%rsp,$num,2),%rsp	# alloca(frame+2*$num)
+	lea	-320(%rsp,$num,2),%rsp	# alloca(frame+2*$num*8+256)
 	jmp	.Lfrom_sp_done
 
 .align	32
 .Lfrom_sp_alt:
-	lea	4096-64(,$num,2),%r10	# 4096-frame-2*$num
-	lea	-64(%rsp,$num,2),%rsp	# alloca(frame+2*$num)
+	lea	4096-320(,$num,2),%r10
+	lea	-320(%rsp,$num,2),%rsp	# alloca(frame+2*$num*8+256)
 	sub	%r10,%r11
 	mov	\$0,%r10
 	cmovc	%r10,%r11
@@ -1983,12 +2075,13 @@ $code.=<<___;
 ___
 $code.=<<___ if ($addx);
 	mov	OPENSSL_ia32cap_P+8(%rip),%r11d
-	and	\$0x80100,%r11d
-	cmp	\$0x80100,%r11d
+	and	\$0x80108,%r11d
+	cmp	\$0x80108,%r11d		# check for AD*X+BMI2+BMI1
 	jne	.Lfrom_mont_nox
 
 	lea	(%rax,$num),$rptr
-	call	sqrx8x_reduction
+	call	__bn_sqrx8x_reduction
+	call	__bn_postx4x_internal
 
 	pxor	%xmm0,%xmm0
 	lea	48(%rsp),%rax
@@ -1999,7 +2092,8 @@ $code.=<<___ if ($addx);
 .Lfrom_mont_nox:
 ___
 $code.=<<___;
-	call	sqr8x_reduction
+	call	__bn_sqr8x_reduction
+	call	__bn_post4x_internal
 
 	pxor	%xmm0,%xmm0
 	lea	48(%rsp),%rax
@@ -2039,7 +2133,6 @@ $code.=<<___;
 .align	32
 bn_mulx4x_mont_gather5:
 .Lmulx4x_enter:
-	.byte	0x67
 	mov	%rsp,%rax
 	push	%rbx
 	push	%rbp
@@ -2047,40 +2140,33 @@ bn_mulx4x_mont_gather5:
 	push	%r13
 	push	%r14
 	push	%r15
-___
-$code.=<<___ if ($win64);
-	lea	-0x28(%rsp),%rsp
-	movaps	%xmm6,(%rsp)
-	movaps	%xmm7,0x10(%rsp)
-___
-$code.=<<___;
-	.byte	0x67
-	mov	${num}d,%r10d
+
 	shl	\$3,${num}d		# convert $num to bytes
-	shl	\$3+2,%r10d		# 4*$num
+	lea	($num,$num,2),%r10	# 3*$num in bytes
 	neg	$num			# -$num
 	mov	($n0),$n0		# *n0
 
 	##############################################################
-	# ensure that stack frame doesn't alias with $aptr+4*$num
-	# modulo 4096, which covers a[num], ret[num] and n[2*num]
-	# (see bn_exp.c). this is done to allow memory disambiguation
-	# logic do its magic. [excessive frame is allocated in order
-	# to allow bn_from_mont8x to clear it.]
+	# Ensure that stack frame doesn't alias with $rptr+3*$num
+	# modulo 4096, which covers ret[num], am[num] and n[num]
+	# (see bn_exp.c). This is done to allow memory disambiguation
+	# logic do its magic. [Extra [num] is allocated in order
+	# to align with bn_power5's frame, which is cleansed after
+	# completing exponentiation. Extra 256 bytes is for power mask
+	# calculated from 7th argument, the index.]
 	#
-	lea	-64(%rsp,$num,2),%r11
-	sub	$ap,%r11
+	lea	-320(%rsp,$num,2),%r11
+	sub	$rp,%r11
 	and	\$4095,%r11
 	cmp	%r11,%r10
 	jb	.Lmulx4xsp_alt
 	sub	%r11,%rsp		# align with $aptr
-	lea	-64(%rsp,$num,2),%rsp	# alloca(frame+$num)
+	lea	-320(%rsp,$num,2),%rsp	# alloca(frame+2*$num*8+256)
 	jmp	.Lmulx4xsp_done
 
-.align	32
 .Lmulx4xsp_alt:
-	lea	4096-64(,$num,2),%r10	# 4096-frame-$num
-	lea	-64(%rsp,$num,2),%rsp	# alloca(frame+$num)
+	lea	4096-320(,$num,2),%r10
+	lea	-320(%rsp,$num,2),%rsp	# alloca(frame+2*$num*8+256)
 	sub	%r10,%r11
 	mov	\$0,%r10
 	cmovc	%r10,%r11
@@ -2106,12 +2192,7 @@ $code.=<<___;
 
 	mov	40(%rsp),%rsi		# restore %rsp
 	mov	\$1,%rax
-___
-$code.=<<___ if ($win64);
-	movaps	-88(%rsi),%xmm6
-	movaps	-72(%rsi),%xmm7
-___
-$code.=<<___;
+
 	mov	-48(%rsi),%r15
 	mov	-40(%rsi),%r14
 	mov	-32(%rsi),%r13
@@ -2126,14 +2207,16 @@ $code.=<<___;
 .type	mulx4x_internal,\@abi-omnipotent
 .align	32
 mulx4x_internal:
-	.byte	0x4c,0x89,0x8c,0x24,0x08,0x00,0x00,0x00	# mov	$num,8(%rsp)		# save -$num
-	.byte	0x67
+	mov	$num,8(%rsp)		# save -$num (it was in bytes)
+	mov	$num,%r10
 	neg	$num			# restore $num
 	shl	\$5,$num
-	lea	256($bp,$num),%r13
+	neg	%r10			# restore $num
+	lea	128($bp,$num),%r13	# end of powers table (+size optimization)
 	shr	\$5+5,$num
-	mov	`($win64?56:8)`(%rax),%r10d	# load 7th argument
+	movd	`($win64?56:8)`(%rax),%xmm5	# load 7th argument
 	sub	\$1,$num
+	lea	.Linc(%rip),%rax
 	mov	%r13,16+8(%rsp)		# end of b[num]
 	mov	$num,24+8(%rsp)		# inner counter
 	mov	$rp, 56+8(%rsp)		# save $rp
@@ -2144,52 +2227,92 @@ my $rptr=$bptr;
 my $STRIDE=2**5*8;		# 5 is "window size"
 my $N=$STRIDE/4;		# should match cache line size
 $code.=<<___;
-	mov	%r10,%r11
-	shr	\$`log($N/8)/log(2)`,%r10
-	and	\$`$N/8-1`,%r11
-	not	%r10
-	lea	.Lmagic_masks(%rip),%rax
-	and	\$`2**5/($N/8)-1`,%r10	# 5 is "window size"
-	lea	96($bp,%r11,8),$bptr	# pointer within 1st cache line
-	movq	0(%rax,%r10,8),%xmm4	# set of masks denoting which
-	movq	8(%rax,%r10,8),%xmm5	# cache line contains element
-	add	\$7,%r11
-	movq	16(%rax,%r10,8),%xmm6	# denoted by 7th argument
-	movq	24(%rax,%r10,8),%xmm7
-	and	\$7,%r11
-
-	movq	`0*$STRIDE/4-96`($bptr),%xmm0
-	lea	$STRIDE($bptr),$tptr	# borrow $tptr
-	movq	`1*$STRIDE/4-96`($bptr),%xmm1
-	pand	%xmm4,%xmm0
-	movq	`2*$STRIDE/4-96`($bptr),%xmm2
-	pand	%xmm5,%xmm1
-	movq	`3*$STRIDE/4-96`($bptr),%xmm3
-	pand	%xmm6,%xmm2
-	por	%xmm1,%xmm0
-	movq	`0*$STRIDE/4-96`($tptr),%xmm1
-	pand	%xmm7,%xmm3
-	por	%xmm2,%xmm0
-	movq	`1*$STRIDE/4-96`($tptr),%xmm2
-	por	%xmm3,%xmm0
-	.byte	0x67,0x67
-	pand	%xmm4,%xmm1
-	movq	`2*$STRIDE/4-96`($tptr),%xmm3
+	movdqa	0(%rax),%xmm0		# 00000001000000010000000000000000
+	movdqa	16(%rax),%xmm1		# 00000002000000020000000200000002
+	lea	88-112(%rsp,%r10),%r10	# place the mask after tp[num+1] (+ICache optimizaton)
+	lea	128($bp),$bptr		# size optimization
 
+	pshufd	\$0,%xmm5,%xmm5		# broadcast index
+	movdqa	%xmm1,%xmm4
+	.byte	0x67
+	movdqa	%xmm1,%xmm2
+___
+########################################################################
+# calculate mask by comparing 0..31 to index and save result to stack
+#
+$code.=<<___;
+	.byte	0x67
+	paddd	%xmm0,%xmm1
+	pcmpeqd	%xmm5,%xmm0		# compare to 1,0
+	movdqa	%xmm4,%xmm3
+___
+for($i=0;$i<$STRIDE/16-4;$i+=4) {
+$code.=<<___;
+	paddd	%xmm1,%xmm2
+	pcmpeqd	%xmm5,%xmm1		# compare to 3,2
+	movdqa	%xmm0,`16*($i+0)+112`(%r10)
+	movdqa	%xmm4,%xmm0
+
+	paddd	%xmm2,%xmm3
+	pcmpeqd	%xmm5,%xmm2		# compare to 5,4
+	movdqa	%xmm1,`16*($i+1)+112`(%r10)
+	movdqa	%xmm4,%xmm1
+
+	paddd	%xmm3,%xmm0
+	pcmpeqd	%xmm5,%xmm3		# compare to 7,6
+	movdqa	%xmm2,`16*($i+2)+112`(%r10)
+	movdqa	%xmm4,%xmm2
+
+	paddd	%xmm0,%xmm1
+	pcmpeqd	%xmm5,%xmm0
+	movdqa	%xmm3,`16*($i+3)+112`(%r10)
+	movdqa	%xmm4,%xmm3
+___
+}
+$code.=<<___;				# last iteration can be optimized
+	.byte	0x67
+	paddd	%xmm1,%xmm2
+	pcmpeqd	%xmm5,%xmm1
+	movdqa	%xmm0,`16*($i+0)+112`(%r10)
+
+	paddd	%xmm2,%xmm3
+	pcmpeqd	%xmm5,%xmm2
+	movdqa	%xmm1,`16*($i+1)+112`(%r10)
+
+	pcmpeqd	%xmm5,%xmm3
+	movdqa	%xmm2,`16*($i+2)+112`(%r10)
+
+	pand	`16*($i+0)-128`($bptr),%xmm0	# while it's still in register
+	pand	`16*($i+1)-128`($bptr),%xmm1
+	pand	`16*($i+2)-128`($bptr),%xmm2
+	movdqa	%xmm3,`16*($i+3)+112`(%r10)
+	pand	`16*($i+3)-128`($bptr),%xmm3
+	por	%xmm2,%xmm0
+	por	%xmm3,%xmm1
+___
+for($i=0;$i<$STRIDE/16-4;$i+=4) {
+$code.=<<___;
+	movdqa	`16*($i+0)-128`($bptr),%xmm4
+	movdqa	`16*($i+1)-128`($bptr),%xmm5
+	movdqa	`16*($i+2)-128`($bptr),%xmm2
+	pand	`16*($i+0)+112`(%r10),%xmm4
+	movdqa	`16*($i+3)-128`($bptr),%xmm3
+	pand	`16*($i+1)+112`(%r10),%xmm5
+	por	%xmm4,%xmm0
+	pand	`16*($i+2)+112`(%r10),%xmm2
+	por	%xmm5,%xmm1
+	pand	`16*($i+3)+112`(%r10),%xmm3
+	por	%xmm2,%xmm0
+	por	%xmm3,%xmm1
+___
+}
+$code.=<<___;
+	pxor	%xmm1,%xmm0
+	pshufd	\$0x4e,%xmm0,%xmm1
+	por	%xmm1,%xmm0
+	lea	$STRIDE($bptr),$bptr
 	movq	%xmm0,%rdx		# bp[0]
-	movq	`3*$STRIDE/4-96`($tptr),%xmm0
-	lea	2*$STRIDE($bptr),$bptr	# next &b[i]
-	pand	%xmm5,%xmm2
-	.byte	0x67,0x67
-	pand	%xmm6,%xmm3
-	##############################################################
-	# $tptr is chosen so that writing to top-most element of the
-	# vector occurs just "above" references to powers table,
-	# "above" modulo cache-line size, which effectively precludes
-	# possibility of memory disambiguation logic failure when
-	# accessing the table.
-	# 
-	lea	64+8*4+8(%rsp,%r11,8),$tptr
+	lea	64+8*4+8(%rsp),$tptr
 
 	mov	%rdx,$bi
 	mulx	0*8($aptr),$mi,%rax	# a[0]*b[0]
@@ -2205,37 +2328,31 @@ $code.=<<___;
 	xor	$zero,$zero		# cf=0, of=0
 	mov	$mi,%rdx
 
-	por	%xmm2,%xmm1
-	pand	%xmm7,%xmm0
-	por	%xmm3,%xmm1
 	mov	$bptr,8+8(%rsp)		# off-load &b[i]
-	por	%xmm1,%xmm0
 
-	.byte	0x48,0x8d,0xb6,0x20,0x00,0x00,0x00	# lea	4*8($aptr),$aptr
+	lea	4*8($aptr),$aptr
 	adcx	%rax,%r13
 	adcx	$zero,%r14		# cf=0
 
-	mulx	0*16($nptr),%rax,%r10
+	mulx	0*8($nptr),%rax,%r10
 	adcx	%rax,%r15		# discarded
 	adox	%r11,%r10
-	mulx	1*16($nptr),%rax,%r11
+	mulx	1*8($nptr),%rax,%r11
 	adcx	%rax,%r10
 	adox	%r12,%r11
-	mulx	2*16($nptr),%rax,%r12
+	mulx	2*8($nptr),%rax,%r12
 	mov	24+8(%rsp),$bptr	# counter value
-	.byte	0x66
 	mov	%r10,-8*4($tptr)
 	adcx	%rax,%r11
 	adox	%r13,%r12
-	mulx	3*16($nptr),%rax,%r15
-	 .byte	0x67,0x67
+	mulx	3*8($nptr),%rax,%r15
 	 mov	$bi,%rdx
 	mov	%r11,-8*3($tptr)
 	adcx	%rax,%r12
 	adox	$zero,%r15		# of=0
-	.byte	0x48,0x8d,0x89,0x40,0x00,0x00,0x00	# lea	4*16($nptr),$nptr
+	lea	4*8($nptr),$nptr
 	mov	%r12,-8*2($tptr)
-	#jmp	.Lmulx4x_1st
+	jmp	.Lmulx4x_1st
 
 .align	32
 .Lmulx4x_1st:
@@ -2255,30 +2372,29 @@ $code.=<<___;
 	lea	4*8($tptr),$tptr
 
 	adox	%r15,%r10
-	mulx	0*16($nptr),%rax,%r15
+	mulx	0*8($nptr),%rax,%r15
 	adcx	%rax,%r10
 	adox	%r15,%r11
-	mulx	1*16($nptr),%rax,%r15
+	mulx	1*8($nptr),%rax,%r15
 	adcx	%rax,%r11
 	adox	%r15,%r12
-	mulx	2*16($nptr),%rax,%r15
+	mulx	2*8($nptr),%rax,%r15
 	mov	%r10,-5*8($tptr)
 	adcx	%rax,%r12
 	mov	%r11,-4*8($tptr)
 	adox	%r15,%r13
-	mulx	3*16($nptr),%rax,%r15
+	mulx	3*8($nptr),%rax,%r15
 	 mov	$bi,%rdx
 	mov	%r12,-3*8($tptr)
 	adcx	%rax,%r13
 	adox	$zero,%r15
-	lea	4*16($nptr),$nptr
+	lea	4*8($nptr),$nptr
 	mov	%r13,-2*8($tptr)
 
 	dec	$bptr			# of=0, pass cf
 	jnz	.Lmulx4x_1st
 
 	mov	8(%rsp),$num		# load -num
-	movq	%xmm0,%rdx		# bp[1]
 	adc	$zero,%r15		# modulo-scheduled
 	lea	($aptr,$num),$aptr	# rewind $aptr
 	add	%r15,%r14
@@ -2289,6 +2405,34 @@ $code.=<<___;
 
 .align	32
 .Lmulx4x_outer:
+	lea	16-256($tptr),%r10	# where 256-byte mask is (+density control)
+	pxor	%xmm4,%xmm4
+	.byte	0x67,0x67
+	pxor	%xmm5,%xmm5
+___
+for($i=0;$i<$STRIDE/16;$i+=4) {
+$code.=<<___;
+	movdqa	`16*($i+0)-128`($bptr),%xmm0
+	movdqa	`16*($i+1)-128`($bptr),%xmm1
+	movdqa	`16*($i+2)-128`($bptr),%xmm2
+	pand	`16*($i+0)+256`(%r10),%xmm0
+	movdqa	`16*($i+3)-128`($bptr),%xmm3
+	pand	`16*($i+1)+256`(%r10),%xmm1
+	por	%xmm0,%xmm4
+	pand	`16*($i+2)+256`(%r10),%xmm2
+	por	%xmm1,%xmm5
+	pand	`16*($i+3)+256`(%r10),%xmm3
+	por	%xmm2,%xmm4
+	por	%xmm3,%xmm5
+___
+}
+$code.=<<___;
+	por	%xmm5,%xmm4
+	pshufd	\$0x4e,%xmm4,%xmm0
+	por	%xmm4,%xmm0
+	lea	$STRIDE($bptr),$bptr
+	movq	%xmm0,%rdx		# m0=bp[i]
+
 	mov	$zero,($tptr)		# save top-most carry
 	lea	4*8($tptr,$num),$tptr	# rewind $tptr
 	mulx	0*8($aptr),$mi,%r11	# a[0]*b[i]
@@ -2303,54 +2447,37 @@ $code.=<<___;
 	mulx	3*8($aptr),%rdx,%r14
 	adox	-2*8($tptr),%r12
 	adcx	%rdx,%r13
-	lea	($nptr,$num,2),$nptr	# rewind $nptr
+	lea	($nptr,$num),$nptr	# rewind $nptr
 	lea	4*8($aptr),$aptr
 	adox	-1*8($tptr),%r13
 	adcx	$zero,%r14
 	adox	$zero,%r14
 
-	.byte	0x67
 	mov	$mi,%r15
 	imulq	32+8(%rsp),$mi		# "t[0]"*n0
 
-	movq	`0*$STRIDE/4-96`($bptr),%xmm0
-	.byte	0x67,0x67
 	mov	$mi,%rdx
-	movq	`1*$STRIDE/4-96`($bptr),%xmm1
-	.byte	0x67
-	pand	%xmm4,%xmm0
-	movq	`2*$STRIDE/4-96`($bptr),%xmm2
-	.byte	0x67
-	pand	%xmm5,%xmm1
-	movq	`3*$STRIDE/4-96`($bptr),%xmm3
-	add	\$$STRIDE,$bptr		# next &b[i]
-	.byte	0x67
-	pand	%xmm6,%xmm2
-	por	%xmm1,%xmm0
-	pand	%xmm7,%xmm3
 	xor	$zero,$zero		# cf=0, of=0
 	mov	$bptr,8+8(%rsp)		# off-load &b[i]
 
-	mulx	0*16($nptr),%rax,%r10
+	mulx	0*8($nptr),%rax,%r10
 	adcx	%rax,%r15		# discarded
 	adox	%r11,%r10
-	mulx	1*16($nptr),%rax,%r11
+	mulx	1*8($nptr),%rax,%r11
 	adcx	%rax,%r10
 	adox	%r12,%r11
-	mulx	2*16($nptr),%rax,%r12
+	mulx	2*8($nptr),%rax,%r12
 	adcx	%rax,%r11
 	adox	%r13,%r12
-	mulx	3*16($nptr),%rax,%r15
+	mulx	3*8($nptr),%rax,%r15
 	 mov	$bi,%rdx
-	 por	%xmm2,%xmm0
 	mov	24+8(%rsp),$bptr	# counter value
 	mov	%r10,-8*4($tptr)
-	 por	%xmm3,%xmm0
 	adcx	%rax,%r12
 	mov	%r11,-8*3($tptr)
 	adox	$zero,%r15		# of=0
 	mov	%r12,-8*2($tptr)
-	lea	4*16($nptr),$nptr
+	lea	4*8($nptr),$nptr
 	jmp	.Lmulx4x_inner
 
 .align	32
@@ -2375,20 +2502,20 @@ $code.=<<___;
 	adcx	$zero,%r14		# cf=0
 
 	adox	%r15,%r10
-	mulx	0*16($nptr),%rax,%r15
+	mulx	0*8($nptr),%rax,%r15
 	adcx	%rax,%r10
 	adox	%r15,%r11
-	mulx	1*16($nptr),%rax,%r15
+	mulx	1*8($nptr),%rax,%r15
 	adcx	%rax,%r11
 	adox	%r15,%r12
-	mulx	2*16($nptr),%rax,%r15
+	mulx	2*8($nptr),%rax,%r15
 	mov	%r10,-5*8($tptr)
 	adcx	%rax,%r12
 	adox	%r15,%r13
 	mov	%r11,-4*8($tptr)
-	mulx	3*16($nptr),%rax,%r15
+	mulx	3*8($nptr),%rax,%r15
 	 mov	$bi,%rdx
-	lea	4*16($nptr),$nptr
+	lea	4*8($nptr),$nptr
 	mov	%r12,-3*8($tptr)
 	adcx	%rax,%r13
 	adox	$zero,%r15
@@ -2398,7 +2525,6 @@ $code.=<<___;
 	jnz	.Lmulx4x_inner
 
 	mov	0+8(%rsp),$num		# load -num
-	movq	%xmm0,%rdx		# bp[i+1]
 	adc	$zero,%r15		# modulo-scheduled
 	sub	0*8($tptr),$bptr	# pull top-most carry to %cf
 	mov	8+8(%rsp),$bptr		# re-load &b[i]
@@ -2411,20 +2537,26 @@ $code.=<<___;
 	cmp	%r10,$bptr
 	jb	.Lmulx4x_outer
 
-	mov	-16($nptr),%r10
+	mov	-8($nptr),%r10
+	mov	$zero,%r8
+	mov	($nptr,$num),%r12
+	lea	($nptr,$num),%rbp	# rewind $nptr
+	mov	$num,%rcx
+	lea	($tptr,$num),%rdi	# rewind $tptr
+	xor	%eax,%eax
 	xor	%r15,%r15
 	sub	%r14,%r10		# compare top-most words
 	adc	%r15,%r15
-	or	%r15,$zero
-	xor	\$1,$zero
-	lea	($tptr,$num),%rdi	# rewind $tptr
-	lea	($nptr,$num,2),$nptr	# rewind $nptr
-	.byte	0x67,0x67
-	sar	\$3+2,$num		# cf=0
-	lea	($nptr,$zero,8),%rbp
+	or	%r15,%r8
+	sar	\$3+2,%rcx
+	sub	%r8,%rax		# %rax=-%r8
 	mov	56+8(%rsp),%rdx		# restore rp
-	mov	$num,%rcx
-	jmp	.Lsqrx4x_sub		# common post-condition
+	dec	%r12			# so that after 'not' we get -n[0]
+	mov	8*1(%rbp),%r13
+	xor	%r8,%r8
+	mov	8*2(%rbp),%r14
+	mov	8*3(%rbp),%r15
+	jmp	.Lsqrx4x_sub_entry	# common post-condition
 .size	mulx4x_internal,.-mulx4x_internal
 ___
 }
{
@@ -2448,7 +2580,6 @@ $code.=<<___;
 .align	32
 bn_powerx5:
 .Lpowerx5_enter:
-	.byte	0x67
 	mov	%rsp,%rax
 	push	%rbx
 	push	%rbp
@@ -2456,39 +2587,32 @@ bn_powerx5:
 	push	%r13
 	push	%r14
 	push	%r15
-___
-$code.=<<___ if ($win64);
-	lea	-0x28(%rsp),%rsp
-	movaps	%xmm6,(%rsp)
-	movaps	%xmm7,0x10(%rsp)
-___
-$code.=<<___;
-	.byte	0x67
-	mov	${num}d,%r10d
+
 	shl	\$3,${num}d		# convert $num to bytes
-	shl	\$3+2,%r10d		# 4*$num
+	lea	($num,$num,2),%r10	# 3*$num in bytes
 	neg	$num
 	mov	($n0),$n0		# *n0
 
 	##############################################################
-	# ensure that stack frame doesn't alias with $aptr+4*$num
-	# modulo 4096, which covers ret[num], am[num] and n[2*num]
-	# (see bn_exp.c). this is done to allow memory disambiguation
-	# logic do its magic.
+	# Ensure that stack frame doesn't alias with $rptr+3*$num
+	# modulo 4096, which covers ret[num], am[num] and n[num]
+	# (see bn_exp.c). This is done to allow memory disambiguation
+	# logic do its magic. [Extra 256 bytes is for power mask
+	# calculated from 7th argument, the index.]
 	#
-	lea	-64(%rsp,$num,2),%r11
-	sub	$aptr,%r11
+	lea	-320(%rsp,$num,2),%r11
+	sub	$rptr,%r11
 	and	\$4095,%r11
 	cmp	%r11,%r10
 	jb	.Lpwrx_sp_alt
 	sub	%r11,%rsp		# align with $aptr
-	lea	-64(%rsp,$num,2),%rsp	# alloca(frame+2*$num)
+	lea	-320(%rsp,$num,2),%rsp	# alloca(frame+2*$num*8+256)
 	jmp	.Lpwrx_sp_done
 
 .align	32
 .Lpwrx_sp_alt:
-	lea	4096-64(,$num,2),%r10	# 4096-frame-2*$num
-	lea	-64(%rsp,$num,2),%rsp	# alloca(frame+2*$num)
+	lea	4096-320(,$num,2),%r10
+	lea	-320(%rsp,$num,2),%rsp	# alloca(frame+2*$num*8+256)
 	sub	%r10,%r11
 	mov	\$0,%r10
 	cmovc	%r10,%r11
@@ -2519,10 +2643,15 @@ $code.=<<___;
 .Lpowerx5_body:
 
 	call	__bn_sqrx8x_internal
+	call	__bn_postx4x_internal
 	call	__bn_sqrx8x_internal
+	call	__bn_postx4x_internal
 	call	__bn_sqrx8x_internal
+	call	__bn_postx4x_internal
 	call	__bn_sqrx8x_internal
+	call	__bn_postx4x_internal
 	call	__bn_sqrx8x_internal
+	call	__bn_postx4x_internal
 
 	mov	%r10,$num		# -num
 	mov	$aptr,$rptr
@@ -2534,12 +2663,7 @@ $code.=<<___;
 
 	mov	40(%rsp),%rsi		# restore %rsp
 	mov	\$1,%rax
-___
-$code.=<<___ if ($win64);
-	movaps	-88(%rsi),%xmm6
-	movaps	-72(%rsi),%xmm7
-___
-$code.=<<___;
+
 	mov	-48(%rsi),%r15
 	mov	-40(%rsi),%r14
 	mov	-32(%rsi),%r13
@@ -2973,11 +3097,11 @@ my ($nptr,$carry,$m0)=("%rbp","%rsi","%rdx");
 
 $code.=<<___;
 	movq	%xmm2,$nptr
-sqrx8x_reduction:
+__bn_sqrx8x_reduction:
 	xor	%eax,%eax		# initial top-most carry bit
 	mov	32+8(%rsp),%rbx		# n0
 	mov	48+8(%rsp),%rdx		# "%r8", 8*0($tptr)
-	lea	-128($nptr,$num,2),%rcx	# end of n[]
+	lea	-8*8($nptr,$num),%rcx	# end of n[]
 	#lea	48+8(%rsp,$num,2),$tptr	# end of t[] buffer
 	mov	%rcx, 0+8(%rsp)		# save end of n[]
 	mov	$tptr,8+8(%rsp)		# save end of t[]
@@ -3006,23 +3130,23 @@ sqrx8x_reduction:
 .align	32
 .Lsqrx8x_reduce:
 	mov	%r8, %rbx
-	mulx	16*0($nptr),%rax,%r8	# n[0]
+	mulx	8*0($nptr),%rax,%r8	# n[0]
 	adcx	%rbx,%rax		# discarded
 	adox	%r9,%r8
 
-	mulx	16*1($nptr),%rbx,%r9	# n[1]
+	mulx	8*1($nptr),%rbx,%r9	# n[1]
 	adcx	%rbx,%r8
 	adox	%r10,%r9
 
-	mulx	16*2($nptr),%rbx,%r10
+	mulx	8*2($nptr),%rbx,%r10
 	adcx	%rbx,%r9
 	adox	%r11,%r10
 
-	mulx	16*3($nptr),%rbx,%r11
+	mulx	8*3($nptr),%rbx,%r11
 	adcx	%rbx,%r10
 	adox	%r12,%r11
 
-	.byte	0xc4,0x62,0xe3,0xf6,0xa5,0x40,0x00,0x00,0x00	# mulx	16*4($nptr),%rbx,%r12
+	.byte	0xc4,0x62,0xe3,0xf6,0xa5,0x20,0x00,0x00,0x00	# mulx	8*4($nptr),%rbx,%r12
 	 mov	%rdx,%rax
 	 mov	%r8,%rdx
 	adcx	%rbx,%r11
@@ -3032,15 +3156,15 @@ sqrx8x_reduction:
 	 mov	%rax,%rdx
 	 mov	%rax,64+48+8(%rsp,%rcx,8)	# put aside n0*a[i]
 
-	mulx	16*5($nptr),%rax,%r13
+	mulx	8*5($nptr),%rax,%r13
 	adcx	%rax,%r12
 	adox	%r14,%r13
 
-	mulx	16*6($nptr),%rax,%r14
+	mulx	8*6($nptr),%rax,%r14
 	adcx	%rax,%r13
 	adox	%r15,%r14
 
-	mulx	16*7($nptr),%rax,%r15
+	mulx	8*7($nptr),%rax,%r15
 	 mov	%rbx,%rdx
 	adcx	%rax,%r14
 	adox	$carry,%r15		# $carry is 0
@@ -3056,7 +3180,7 @@ sqrx8x_reduction:
 
 	mov	48+8(%rsp),%rdx		# pull n0*a[0]
 	add	8*0($tptr),%r8
-	lea	16*8($nptr),$nptr
+	lea	8*8($nptr),$nptr
 	mov	\$-8,%rcx
 	adcx	8*1($tptr),%r9
 	adcx	8*2($tptr),%r10
@@ -3075,35 +3199,35 @@ sqrx8x_reduction:
 .align	32
 .Lsqrx8x_tail:
 	mov	%r8,%rbx
-	mulx	16*0($nptr),%rax,%r8
+	mulx	8*0($nptr),%rax,%r8
 	adcx	%rax,%rbx
 	adox	%r9,%r8
 
-	mulx	16*1($nptr),%rax,%r9
+	mulx	8*1($nptr),%rax,%r9
 	adcx	%rax,%r8
 	adox	%r10,%r9
 
-	mulx	16*2($nptr),%rax,%r10
+	mulx	8*2($nptr),%rax,%r10
 	adcx	%rax,%r9
 	adox	%r11,%r10
 
-	mulx	16*3($nptr),%rax,%r11
+	mulx	8*3($nptr),%rax,%r11
 	adcx	%rax,%r10
 	adox	%r12,%r11
 
-	.byte	0xc4,0x62,0xfb,0xf6,0xa5,0x40,0x00,0x00,0x00	# mulx	16*4($nptr),%rax,%r12
+	.byte	0xc4,0x62,0xfb,0xf6,0xa5,0x20,0x00,0x00,0x00	# mulx	8*4($nptr),%rax,%r12
 	adcx	%rax,%r11
 	adox	%r13,%r12
 
-	mulx	16*5($nptr),%rax,%r13
+	mulx	8*5($nptr),%rax,%r13
 	adcx	%rax,%r12
 	adox	%r14,%r13
 
-	mulx	16*6($nptr),%rax,%r14
+	mulx	8*6($nptr),%rax,%r14
 	adcx	%rax,%r13
 	adox	%r15,%r14
 
-	mulx	16*7($nptr),%rax,%r15
+	mulx	8*7($nptr),%rax,%r15
 	 mov	72+48+8(%rsp,%rcx,8),%rdx	# pull n0*a[i]
 	adcx	%rax,%r14
 	adox	$carry,%r15
@@ -3119,7 +3243,7 @@ sqrx8x_reduction:
 
 	sub	16+8(%rsp),$carry	# mov 16(%rsp),%cf
 	 mov	48+8(%rsp),%rdx		# pull n0*a[0]
-	 lea	16*8($nptr),$nptr
+	 lea	8*8($nptr),$nptr
 	adc	8*0($tptr),%r8
 	adc	8*1($tptr),%r9
 	adc	8*2($tptr),%r10
@@ -3155,7 +3279,7 @@ sqrx8x_reduction:
 	adc	8*0($tptr),%r8
 	 movq	%xmm3,%rcx
 	adc	8*1($tptr),%r9
-	 mov	16*7($nptr),$carry
+	 mov	8*7($nptr),$carry
 	 movq	%xmm2,$nptr		# restore $nptr
 	adc	8*2($tptr),%r10
 	adc	8*3($tptr),%r11
@@ -3181,6 +3305,8 @@ sqrx8x_reduction:
 	lea	8*8($tptr,%rcx),$tptr	# start of current t[] window
 	cmp	8+8(%rsp),%r8		# end of t[]?
 	jb	.Lsqrx8x_reduction_loop
+	ret
+.size	bn_sqrx8x_internal,.-bn_sqrx8x_internal
 ___
 }

 ##############################################################
@@ -3188,52 +3314,59 @@ ___
 #
 {
 my ($rptr,$nptr)=("%rdx","%rbp");
-my @ri=map("%r$_",(10..13));
-my @ni=map("%r$_",(14..15));
 $code.=<<___;
-	xor	%ebx,%ebx
-	sub	%r15,%rsi		# compare top-most words
-	adc	%rbx,%rbx
+.align	32
+__bn_postx4x_internal:
+	mov	8*0($nptr),%r12
 	mov	%rcx,%r10		# -$num
-	or	%rbx,%rax
 	mov	%rcx,%r9		# -$num
-	xor	\$1,%rax
-	sar	\$3+2,%rcx		# cf=0
+	neg	%rax
+	sar	\$3+2,%rcx
 	#lea	48+8(%rsp,%r9),$tptr
-	lea	($nptr,%rax,8),$nptr
 	movq	%xmm1,$rptr		# restore $rptr
 	movq	%xmm1,$aptr		# prepare for back-to-back call
-	jmp	.Lsqrx4x_sub
+	dec	%r12			# so that after 'not' we get -n[0]
+	mov	8*1($nptr),%r13
+	xor	%r8,%r8
+	mov	8*2($nptr),%r14
+	mov	8*3($nptr),%r15
+	jmp	.Lsqrx4x_sub_entry
 
-.align	32
+.align	16
 .Lsqrx4x_sub:
-	.byte	0x66
-	mov	8*0($tptr),%r12
-	mov	8*1($tptr),%r13
-	sbb	16*0($nptr),%r12
-	mov	8*2($tptr),%r14
-	sbb	16*1($nptr),%r13
-	mov	8*3($tptr),%r15
-	lea	8*4($tptr),$tptr
-	sbb	16*2($nptr),%r14
+	mov	8*0($nptr),%r12
+	mov	8*1($nptr),%r13
+	mov	8*2($nptr),%r14
+	mov	8*3($nptr),%r15
+.Lsqrx4x_sub_entry:
+	andn	%rax,%r12,%r12
+	lea	8*4($nptr),$nptr
+	andn	%rax,%r13,%r13
+	andn	%rax,%r14,%r14
+	andn	%rax,%r15,%r15
+
+	neg	%r8			# mov %r8,%cf
+	adc	8*0($tptr),%r12
+	adc	8*1($tptr),%r13
+	adc	8*2($tptr),%r14
+	adc	8*3($tptr),%r15
 	mov	%r12,8*0($rptr)
-	sbb	16*3($nptr),%r15
-	lea	16*4($nptr),$nptr
+	lea	8*4($tptr),$tptr
 	mov	%r13,8*1($rptr)
+	sbb	%r8,%r8			# mov %cf,%r8
 	mov	%r14,8*2($rptr)
 	mov	%r15,8*3($rptr)
 	lea	8*4($rptr),$rptr
 
 	inc	%rcx
 	jnz	.Lsqrx4x_sub
-___
-}
-$code.=<<___;
+
 	neg	%r9			# restore $num
 
 	ret
-.size	bn_sqrx8x_internal,.-bn_sqrx8x_internal
+.size	__bn_postx4x_internal,.-__bn_postx4x_internal
 ___
+}
 }}}
 {
 my ($inp,$num,$tbl,$idx)=$win64?("%rcx","%edx","%r8", "%r9d") : # Win64 order
@@ -3282,56 +3415,91 @@ bn_scatter5:
 
 .globl	bn_gather5
 .type	bn_gather5,\@abi-omnipotent
-.align	16
+.align	32
 bn_gather5:
-___
-$code.=<<___ if ($win64);
-.LSEH_begin_bn_gather5:
+.LSEH_begin_bn_gather5:			# Win64 thing, but harmless in other cases
 	# I can't trust assembler to use specific encoding:-(
-	.byte	0x48,0x83,0xec,0x28		#sub	\$0x28,%rsp
-	.byte	0x0f,0x29,0x34,0x24		#movaps	%xmm6,(%rsp)
-	.byte	0x0f,0x29,0x7c,0x24,0x10	#movdqa	%xmm7,0x10(%rsp)
+	.byte	0x4c,0x8d,0x14,0x24			#lea    (%rsp),%r10
+	.byte	0x48,0x81,0xec,0x08,0x01,0x00,0x00	#sub	$0x108,%rsp
+	lea	.Linc(%rip),%rax
+	and	\$-16,%rsp		# shouldn't be formally required
+
+	movd	$idx,%xmm5
+	movdqa	0(%rax),%xmm0		# 00000001000000010000000000000000
+	movdqa	16(%rax),%xmm1		# 00000002000000020000000200000002
+	lea	128($tbl),%r11		# size optimization
+	lea	128(%rsp),%rax		# size optimization
+
+	pshufd	\$0,%xmm5,%xmm5		# broadcast $idx
+	movdqa	%xmm1,%xmm4
+	movdqa	%xmm1,%xmm2
 ___
+########################################################################
+# calculate mask by comparing 0..31 to $idx and save result to stack
+#
+for($i=0;$i<$STRIDE/16;$i+=4) {
 $code.=<<___;
-	mov	$idx,%r11d
-	shr	\$`log($N/8)/log(2)`,$idx
-	and	\$`$N/8-1`,%r11
-	not	$idx
-	lea	.Lmagic_masks(%rip),%rax
-	and	\$`2**5/($N/8)-1`,$idx	# 5 is "window size"
-	lea	128($tbl,%r11,8),$tbl	# pointer within 1st cache line
-	movq	0(%rax,$idx,8),%xmm4	# set of masks denoting which
-	movq	8(%rax,$idx,8),%xmm5	# cache line contains element
-	movq	16(%rax,$idx,8),%xmm6	# denoted by 7th argument
-	movq	24(%rax,$idx,8),%xmm7
+	paddd	%xmm0,%xmm1
+	pcmpeqd	%xmm5,%xmm0		# compare to 1,0
+___
+$code.=<<___	if ($i);
+	movdqa	%xmm3,`16*($i-1)-128`(%rax)
+___
+$code.=<<___;
+	movdqa	%xmm4,%xmm3
+
+	paddd	%xmm1,%xmm2
+	pcmpeqd	%xmm5,%xmm1		# compare to 3,2
+	movdqa	%xmm0,`16*($i+0)-128`(%rax)
+	movdqa	%xmm4,%xmm0
+
+	paddd	%xmm2,%xmm3
+	pcmpeqd	%xmm5,%xmm2		# compare to 5,4
+	movdqa	%xmm1,`16*($i+1)-128`(%rax)
+	movdqa	%xmm4,%xmm1
+
+	paddd	%xmm3,%xmm0
+	pcmpeqd	%xmm5,%xmm3		# compare to 7,6
+	movdqa	%xmm2,`16*($i+2)-128`(%rax)
+	movdqa	%xmm4,%xmm2
+___
+}
+$code.=<<___;
+	movdqa	%xmm3,`16*($i-1)-128`(%rax)
 	jmp	.Lgather
-.align	16
-.Lgather:
-	movq	`0*$STRIDE/4-128`($tbl),%xmm0
-	movq	`1*$STRIDE/4-128`($tbl),%xmm1
-	pand	%xmm4,%xmm0
-	movq	`2*$STRIDE/4-128`($tbl),%xmm2
-	pand	%xmm5,%xmm1
-	movq	`3*$STRIDE/4-128`($tbl),%xmm3
-	pand	%xmm6,%xmm2
-	por	%xmm1,%xmm0
-	pand	%xmm7,%xmm3
-	.byte	0x67,0x67
-	por	%xmm2,%xmm0
-	lea	$STRIDE($tbl),$tbl
-	por	%xmm3,%xmm0
 
+.align	32
+.Lgather:
+	pxor	%xmm4,%xmm4
+	pxor	%xmm5,%xmm5
+___
+for($i=0;$i<$STRIDE/16;$i+=4) {
+$code.=<<___;
+	movdqa	`16*($i+0)-128`(%r11),%xmm0
+	movdqa	`16*($i+1)-128`(%r11),%xmm1
+	movdqa	`16*($i+2)-128`(%r11),%xmm2
+	pand	`16*($i+0)-128`(%rax),%xmm0
+	movdqa	`16*($i+3)-128`(%r11),%xmm3
+	pand	`16*($i+1)-128`(%rax),%xmm1
+	por	%xmm0,%xmm4
+	pand	`16*($i+2)-128`(%rax),%xmm2
+	por	%xmm1,%xmm5
+	pand	`16*($i+3)-128`(%rax),%xmm3
+	por	%xmm2,%xmm4
+	por	%xmm3,%xmm5
+___
+}
+$code.=<<___;
+	por	%xmm5,%xmm4
+	lea	$STRIDE(%r11),%r11
+	pshufd	\$0x4e,%xmm4,%xmm0
+	por	%xmm4,%xmm0
 	movq	%xmm0,($out)		# m0=bp[0]
 	lea	8($out),$out
 	sub	\$1,$num
 	jnz	.Lgather
-___
-$code.=<<___ if ($win64);
-	movaps	(%rsp),%xmm6
-	movaps	0x10(%rsp),%xmm7
-	lea	0x28(%rsp),%rsp
-___
-$code.=<<___;
+
+	lea	(%r10),%rsp
 	ret
 .LSEH_end_bn_gather5:
 .size	bn_gather5,.-bn_gather5
@@ -3339,9 +3507,9 @@ ___
 }
 $code.=<<___;
 .align	64
-.Lmagic_masks:
-	.long	0,0, 0,0, 0,0, -1,-1
-	.long	0,0, 0,0, 0,0,  0,0
+.Linc:
+	.long	0,0, 1,1
+	.long	2,2, 2,2
 .asciz	"Montgomery Multiplication with scatter/gather for x86_64, CRYPTOGAMS by <appro\@openssl.org>"
 ___
 
@@ -3389,19 +3557,16 @@ mul_handler:
 
 	lea	.Lmul_epilogue(%rip),%r10
 	cmp	%r10,%rbx
-	jb	.Lbody_40
+	ja	.Lbody_40
 
 	mov	192($context),%r10	# pull $num
 	mov	8(%rax,%r10,8),%rax	# pull saved stack pointer
+
 	jmp	.Lbody_proceed
 
 .Lbody_40:
 	mov	40(%rax),%rax		# pull saved stack pointer
 .Lbody_proceed:
-
-	movaps	-88(%rax),%xmm0
-	movaps	-72(%rax),%xmm1
-
 	mov	-8(%rax),%rbx
 	mov	-16(%rax),%rbp
 	mov	-24(%rax),%r12
@@ -3414,8 +3579,6 @@ mul_handler:
 	mov	%r13,224($context)	# restore context->R13
 	mov	%r14,232($context)	# restore context->R14
 	mov	%r15,240($context)	# restore context->R15
-	movups	%xmm0,512($context)	# restore context->Xmm6
-	movups	%xmm1,528($context)	# restore context->Xmm7
 
 .Lcommon_seh_tail:
 	mov	8(%rax),%rdi
@@ -3526,10 +3689,9 @@ ___
 $code.=<<___;
 .align	8
 .LSEH_info_bn_gather5:
-        .byte   0x01,0x0d,0x05,0x00
-        .byte   0x0d,0x78,0x01,0x00	#movaps	0x10(rsp),xmm7
-        .byte   0x08,0x68,0x00,0x00	#movaps	(rsp),xmm6
-        .byte   0x04,0x42,0x00,0x00	#sub	rsp,0x28
+	.byte	0x01,0x0b,0x03,0x0a
+	.byte	0x0b,0x01,0x21,0x00	# sub	rsp,0x108
+	.byte	0x04,0xa3,0x00,0x00	# lea	r10,(rsp)
 .align	8
 ___
 }
diff --git a/crypto/bn/bn_exp.c b/crypto/bn/bn_exp.c
index 6d30d1e..1670f01 100644
--- a/crypto/bn/bn_exp.c
+++ b/crypto/bn/bn_exp.c
@@ -110,6 +110,7 @@
  */
 
 #include "cryptlib.h"
+#include "constant_time_locl.h"
 #include "bn_lcl.h"
 
 #include <stdlib.h>
@@ -606,15 +607,17 @@ static BN_ULONG bn_get_bits(const BIGNUM *a, int bitpos)
 
 static int MOD_EXP_CTIME_COPY_TO_PREBUF(const BIGNUM *b, int top,
                                         unsigned char *buf, int idx,
-                                        int width)
+                                        int window)
 {
-    size_t i, j;
+    int i, j;
+    int width = 1 << window;
+    BN_ULONG *table = (BN_ULONG *)buf;
 
     if (top > b->top)
         top = b->top;           /* this works because 'buf' is explicitly
                                  * zeroed */
-    for (i = 0, j = idx; i < top * sizeof b->d[0]; i++, j += width) {
-        buf[j] = ((unsigned char *)b->d)[i];
+    for (i = 0, j = idx; i < top; i++, j += width) {
+        table[j] = b->d[i];
     }
 
     return 1;
@@ -622,15 +625,51 @@ static int MOD_EXP_CTIME_COPY_TO_PREBUF(const BIGNUM *b, int top,
 
 static int MOD_EXP_CTIME_COPY_FROM_PREBUF(BIGNUM *b, int top,
                                           unsigned char *buf, int idx,
-                                          int width)
+                                          int window)
 {
-    size_t i, j;
+    int i, j;
+    int width = 1 << window;
+    volatile BN_ULONG *table = (volatile BN_ULONG *)buf;
 
     if (bn_wexpand(b, top) == NULL)
         return 0;
 
-    for (i = 0, j = idx; i < top * sizeof b->d[0]; i++, j += width) {
-        ((unsigned char *)b->d)[i] = buf[j];
+    if (window <= 3) {
+        for (i = 0; i < top; i++, table += width) {
+            BN_ULONG acc = 0;
+
+            for (j = 0; j < width; j++) {
+                acc |= table[j] &
+                       ((BN_ULONG)0 - (constant_time_eq_int(j,idx)&1));
+            }
+
+            b->d[i] = acc;
+        }
+    } else {
+        int xstride = 1 << (window - 2);
+        BN_ULONG y0, y1, y2, y3;
+
+        i = idx >> (window - 2);        /* equivalent of idx / xstride */
+        idx &= xstride - 1;             /* equivalent of idx % xstride */
+
+        y0 = (BN_ULONG)0 - (constant_time_eq_int(i,0)&1);
+        y1 = (BN_ULONG)0 - (constant_time_eq_int(i,1)&1);
+        y2 = (BN_ULONG)0 - (constant_time_eq_int(i,2)&1);
+        y3 = (BN_ULONG)0 - (constant_time_eq_int(i,3)&1);
+
+        for (i = 0; i < top; i++, table += width) {
+            BN_ULONG acc = 0;
+
+            for (j = 0; j < xstride; j++) {
+                acc |= ( (table[j + 0 * xstride] & y0) |
+                         (table[j + 1 * xstride] & y1) |
+                         (table[j + 2 * xstride] & y2) |
+                         (table[j + 3 * xstride] & y3) )
+                       & ((BN_ULONG)0 - (constant_time_eq_int(j,idx)&1));
+            }
+
+            b->d[i] = acc;
+        }
     }
 
     b->top = top;
@@ -749,8 +788,8 @@ int BN_mod_exp_mont_consttime(BIGNUM *rr, const BIGNUM *a, const BIGNUM *p,
     if (window >= 5) {
         window = 5;             /* ~5% improvement for RSA2048 sign, and even
                                  * for RSA4096 */
-        if ((top & 7) == 0)
-            powerbufLen += 2 * top * sizeof(m->d[0]);
+        /* reserve space for mont->N.d[] copy */
+        powerbufLen += top * sizeof(mont->N.d[0]);
     }
 #endif
     (void)0;
@@ -971,7 +1010,7 @@ int BN_mod_exp_mont_consttime(BIGNUM *rr, const BIGNUM *a, const BIGNUM *p,
                                const BN_ULONG *not_used, const BN_ULONG *np,
                                const BN_ULONG *n0, int num);
 
-        BN_ULONG *np = mont->N.d, *n0 = mont->n0, *np2;
+        BN_ULONG *n0 = mont->n0, *np;
 
         /*
          * BN_to_montgomery can contaminate words above .top [in
@@ -982,11 +1021,11 @@ int BN_mod_exp_mont_consttime(BIGNUM *rr, const BIGNUM *a, const BIGNUM *p,
         for (i = tmp.top; i < top; i++)
             tmp.d[i] = 0;
 
-        if (top & 7)
-            np2 = np;
-        else
-            for (np2 = am.d + top, i = 0; i < top; i++)
-                np2[2 * i] = np[i];
+        /*
+         * copy mont->N.d[] to improve cache locality
+         */
+        for (np = am.d + top, i = 0; i < top; i++)
+            np[i] = mont->N.d[i];
 
         bn_scatter5(tmp.d, top, powerbuf, 0);
         bn_scatter5(am.d, am.top, powerbuf, 1);
@@ -996,7 +1035,7 @@ int BN_mod_exp_mont_consttime(BIGNUM *rr, const BIGNUM *a, const BIGNUM *p,
 # if 0
         for (i = 3; i < 32; i++) {
             /* Calculate a^i = a^(i-1) * a */
-            bn_mul_mont_gather5(tmp.d, am.d, powerbuf, np2, n0, top, i - 1);
+            bn_mul_mont_gather5(tmp.d, am.d, powerbuf, np, n0, top, i - 1);
             bn_scatter5(tmp.d, top, powerbuf, i);
         }
 # else
@@ -1007,7 +1046,7 @@ int BN_mod_exp_mont_consttime(BIGNUM *rr, const BIGNUM *a, const BIGNUM *p,
         }
         for (i = 3; i < 8; i += 2) {
             int j;
-            bn_mul_mont_gather5(tmp.d, am.d, powerbuf, np2, n0, top, i - 1);
+            bn_mul_mont_gather5(tmp.d, am.d, powerbuf, np, n0, top, i - 1);
             bn_scatter5(tmp.d, top, powerbuf, i);
             for (j = 2 * i; j < 32; j *= 2) {
                 bn_mul_mont(tmp.d, tmp.d, tmp.d, np, n0, top);
@@ -1015,13 +1054,13 @@ int BN_mod_exp_mont_consttime(BIGNUM *rr, const BIGNUM *a, const BIGNUM *p,
             }
         }
         for (; i < 16; i += 2) {
-            bn_mul_mont_gather5(tmp.d, am.d, powerbuf, np2, n0, top, i - 1);
+            bn_mul_mont_gather5(tmp.d, am.d, powerbuf, np, n0, top, i - 1);
             bn_scatter5(tmp.d, top, powerbuf, i);
             bn_mul_mont(tmp.d, tmp.d, tmp.d, np, n0, top);
             bn_scatter5(tmp.d, top, powerbuf, 2 * i);
         }
         for (; i < 32; i += 2) {
-            bn_mul_mont_gather5(tmp.d, am.d, powerbuf, np2, n0, top, i - 1);
+            bn_mul_mont_gather5(tmp.d, am.d, powerbuf, np, n0, top, i - 1);
             bn_scatter5(tmp.d, top, powerbuf, i);
         }
 # endif
@@ -1050,11 +1089,11 @@ int BN_mod_exp_mont_consttime(BIGNUM *rr, const BIGNUM *a, const BIGNUM *p,
             while (bits >= 0) {
                 wvalue = bn_get_bits5(p->d, bits - 4);
                 bits -= 5;
-                bn_power5(tmp.d, tmp.d, powerbuf, np2, n0, top, wvalue);
+                bn_power5(tmp.d, tmp.d, powerbuf, np, n0, top, wvalue);
             }
         }
 
-        ret = bn_from_montgomery(tmp.d, tmp.d, NULL, np2, n0, top);
+        ret = bn_from_montgomery(tmp.d, tmp.d, NULL, np, n0, top);
         tmp.top = top;
         bn_correct_top(&tmp);
         if (ret) {
@@ -1065,9 +1104,9 @@ int BN_mod_exp_mont_consttime(BIGNUM *rr, const BIGNUM *a, const BIGNUM *p,
     } else
 #endif
     {
-        if (!MOD_EXP_CTIME_COPY_TO_PREBUF(&tmp, top, powerbuf, 0, numPowers))
+        if (!MOD_EXP_CTIME_COPY_TO_PREBUF(&tmp, top, powerbuf, 0, window))
             goto err;
-        if (!MOD_EXP_CTIME_COPY_TO_PREBUF(&am, top, powerbuf, 1, numPowers))
+        if (!MOD_EXP_CTIME_COPY_TO_PREBUF(&am, top, powerbuf, 1, window))
             goto err;
 
         /*
@@ -1079,15 +1118,15 @@ int BN_mod_exp_mont_consttime(BIGNUM *rr, const BIGNUM *a, const BIGNUM *p,
         if (window > 1) {
             if (!BN_mod_mul_montgomery(&tmp, &am, &am, mont, ctx))
                 goto err;
-            if (!MOD_EXP_CTIME_COPY_TO_PREBUF
-                (&tmp, top, powerbuf, 2, numPowers))
+            if (!MOD_EXP_CTIME_COPY_TO_PREBUF(&tmp, top, powerbuf, 2,
+                                              window))
                 goto err;
             for (i = 3; i < numPowers; i++) {
                 /* Calculate a^i = a^(i-1) * a */
                 if (!BN_mod_mul_montgomery(&tmp, &am, &tmp, mont, ctx))
                     goto err;
-                if (!MOD_EXP_CTIME_COPY_TO_PREBUF
-                    (&tmp, top, powerbuf, i, numPowers))
+                if (!MOD_EXP_CTIME_COPY_TO_PREBUF(&tmp, top, powerbuf, i,
+                                                  window))
                     goto err;
             }
         }
@@ -1095,8 +1134,8 @@ int BN_mod_exp_mont_consttime(BIGNUM *rr, const BIGNUM *a, const BIGNUM *p,
         bits--;
         for (wvalue = 0, i = bits % window; i >= 0; i--, bits--)
             wvalue = (wvalue << 1) + BN_is_bit_set(p, bits);
-        if (!MOD_EXP_CTIME_COPY_FROM_PREBUF
-            (&tmp, top, powerbuf, wvalue, numPowers))
+        if (!MOD_EXP_CTIME_COPY_FROM_PREBUF(&tmp, top, powerbuf, wvalue,
+                                            window))
             goto err;
 
         /*
@@ -1116,8 +1155,8 @@ int BN_mod_exp_mont_consttime(BIGNUM *rr, const BIGNUM *a, const BIGNUM *p,
             /*
              * Fetch the appropriate pre-computed value from the pre-buf
              */
-            if (!MOD_EXP_CTIME_COPY_FROM_PREBUF
-                (&am, top, powerbuf, wvalue, numPowers))
+            if (!MOD_EXP_CTIME_COPY_FROM_PREBUF(&am, top, powerbuf, wvalue,
+                                                window))
                 goto err;
 
             /* Multiply the result into the intermediate result */
diff --git a/crypto/opensslv.h b/crypto/opensslv.h
index ae6387d..d6d671a 100644
--- a/crypto/opensslv.h
+++ b/crypto/opensslv.h
@@ -30,11 +30,11 @@ extern "C" {
  * (Prior to 0.9.5a beta1, a different scheme was used: MMNNFFRBB for
  *  major minor fix final patch/beta)
  */
-# define OPENSSL_VERSION_NUMBER  0x10002070L
+# define OPENSSL_VERSION_NUMBER  0x10002080L
 # ifdef OPENSSL_FIPS
-#  define OPENSSL_VERSION_TEXT    "OpenSSL 1.0.2g-fips-dev  xx XXX xxxx"
+#  define OPENSSL_VERSION_TEXT    "OpenSSL 1.0.2h-fips-dev  xx XXX xxxx"
 # else
-#  define OPENSSL_VERSION_TEXT    "OpenSSL 1.0.2g-dev  xx XXX xxxx"
+#  define OPENSSL_VERSION_TEXT    "OpenSSL 1.0.2h-dev  xx XXX xxxx"
 # endif
 # define OPENSSL_VERSION_PTEXT   " part of " OPENSSL_VERSION_TEXT
 
diff --git a/doc/apps/ciphers.pod b/doc/apps/ciphers.pod
index 1c26e3b..9643b4d 100644
--- a/doc/apps/ciphers.pod
+++ b/doc/apps/ciphers.pod
@@ -38,25 +38,21 @@ SSL v2 and for SSL v3/TLS v1.
 
 Like B<-v>, but include cipher suite codes in output (hex format).
 
-=item B<-ssl3>
+=item B<-ssl3>, B<-tls1>
 
-only include SSL v3 ciphers.
+This lists ciphers compatible with any of SSLv3, TLSv1, TLSv1.1 or TLSv1.2.
 
 =item B<-ssl2>
 
-only include SSL v2 ciphers.
-
-=item B<-tls1>
-
-only include TLS v1 ciphers.
+Only include SSLv2 ciphers.
 
 =item B<-h>, B<-?>
 
-print a brief usage message.
+Print a brief usage message.
 
 =item B<cipherlist>
 
-a cipher list to convert to a cipher preference list. If it is not included
+A cipher list to convert to a cipher preference list. If it is not included
 then the default cipher list will be used. The format is described below.
 
 =back
@@ -109,9 +105,10 @@ The following is a list of all permitted cipher strings and their meanings.
 
 =item B<DEFAULT>
 
-the default cipher list. This is determined at compile time and
-is normally B<ALL:!EXPORT:!aNULL:!eNULL:!SSLv2>. This must be the firstcipher string
-specified.
+The default cipher list.
+This is determined at compile time and is normally
+B<ALL:!EXPORT:!aNULL:!eNULL:!SSLv2>.
+When used, this must be the first cipherstring specified.
 
 =item B<COMPLEMENTOFDEFAULT>
 
@@ -139,34 +136,46 @@ than 128 bits, and some cipher suites with 128-bit keys.
 
 =item B<LOW>
 
-"low" encryption cipher suites, currently those using 64 or 56 bit encryption algorithms
-but excluding export cipher suites.
+Low strength encryption cipher suites, currently those using 64 or 56 bit
+encryption algorithms but excluding export cipher suites.
+As of OpenSSL 1.0.2g, these are disabled in default builds.
 
 =item B<EXP>, B<EXPORT>
 
-export encryption algorithms. Including 40 and 56 bits algorithms.
+Export strength encryption algorithms. Including 40 and 56 bits algorithms.
+As of OpenSSL 1.0.2g, these are disabled in default builds.
 
 =item B<EXPORT40>
 
-40 bit export encryption algorithms
+40-bit export encryption algorithms
+As of OpenSSL 1.0.2g, these are disabled in default builds.
 
 =item B<EXPORT56>
 
-56 bit export encryption algorithms. In OpenSSL 0.9.8c and later the set of
+56-bit export encryption algorithms. In OpenSSL 0.9.8c and later the set of
 56 bit export ciphers is empty unless OpenSSL has been explicitly configured
 with support for experimental ciphers.
+As of OpenSSL 1.0.2g, these are disabled in default builds.
 
 =item B<eNULL>, B<NULL>
 
-the "NULL" ciphers that is those offering no encryption. Because these offer no
-encryption at all and are a security risk they are disabled unless explicitly
-included.
+The "NULL" ciphers that is those offering no encryption. Because these offer no
+encryption at all and are a security risk they are not enabled via either the
+B<DEFAULT> or B<ALL> cipher strings.
+Be careful when building cipherlists out of lower-level primitives such as
+B<kRSA> or B<aECDSA> as these do overlap with the B<eNULL> ciphers.
+When in doubt, include B<!eNULL> in your cipherlist.
 
 =item B<aNULL>
 
-the cipher suites offering no authentication. This is currently the anonymous
+The cipher suites offering no authentication. This is currently the anonymous
 DH algorithms and anonymous ECDH algorithms. These cipher suites are vulnerable
 to a "man in the middle" attack and so their use is normally discouraged.
+These are excluded from the B<DEFAULT> ciphers, but included in the B<ALL>
+ciphers.
+Be careful when building cipherlists out of lower-level primitives such as
+B<kDHE> or B<AES> as these do overlap with the B<aNULL> ciphers.
+When in doubt, include B<!aNULL> in your cipherlist.
 
 =item B<kRSA>, B<RSA>
 
@@ -582,11 +591,11 @@ Note: these ciphers can also be used in SSL v3.
 =head2 Deprecated SSL v2.0 cipher suites.
 
  SSL_CK_RC4_128_WITH_MD5                 RC4-MD5
- SSL_CK_RC4_128_EXPORT40_WITH_MD5        EXP-RC4-MD5
- SSL_CK_RC2_128_CBC_WITH_MD5             RC2-MD5
- SSL_CK_RC2_128_CBC_EXPORT40_WITH_MD5    EXP-RC2-MD5
+ SSL_CK_RC4_128_EXPORT40_WITH_MD5        Not implemented.
+ SSL_CK_RC2_128_CBC_WITH_MD5             RC2-CBC-MD5
+ SSL_CK_RC2_128_CBC_EXPORT40_WITH_MD5    Not implemented.
  SSL_CK_IDEA_128_CBC_WITH_MD5            IDEA-CBC-MD5
- SSL_CK_DES_64_CBC_WITH_MD5              DES-CBC-MD5
+ SSL_CK_DES_64_CBC_WITH_MD5              Not implemented.
  SSL_CK_DES_192_EDE3_CBC_WITH_MD5        DES-CBC3-MD5
 
 =head1 NOTES
diff --git a/doc/apps/s_client.pod b/doc/apps/s_client.pod
index 84d0527..618df96 100644
--- a/doc/apps/s_client.pod
+++ b/doc/apps/s_client.pod
@@ -201,15 +201,11 @@ Use the PSK key B<key> when using a PSK cipher suite. The key is
 given as a hexadecimal number without leading 0x, for example -psk
 1a2b3c4d.
 
-=item B<-ssl2>, B<-ssl3>, B<-tls1>, B<-no_ssl2>, B<-no_ssl3>, B<-no_tls1>, B<-no_tls1_1>, B<-no_tls1_2>
+=item B<-ssl2>, B<-ssl3>, B<-tls1>, B<-tls1_1>, B<-tls1_2>, B<-no_ssl2>, B<-no_ssl3>, B<-no_tls1>, B<-no_tls1_1>, B<-no_tls1_2>
 
-these options disable the use of certain SSL or TLS protocols. By default
-the initial handshake uses a method which should be compatible with all
-servers and permit them to use SSL v3, SSL v2 or TLS as appropriate.
-
-Unfortunately there are still ancient and broken servers in use which
-cannot handle this technique and will fail to connect. Some servers only
-work if TLS is turned off.
+These options require or disable the use of the specified SSL or TLS protocols.
+By default the initial handshake uses a I<version-flexible> method which will
+negotiate the highest mutually supported protocol version.
 
 =item B<-fallback_scsv>
 
diff --git a/doc/apps/s_server.pod b/doc/apps/s_server.pod
index baca779..6f4acb7 100644
--- a/doc/apps/s_server.pod
+++ b/doc/apps/s_server.pod
@@ -217,11 +217,11 @@ Use the PSK key B<key> when using a PSK cipher suite. The key is
 given as a hexadecimal number without leading 0x, for example -psk
 1a2b3c4d.
 
-=item B<-ssl2>, B<-ssl3>, B<-tls1>, B<-no_ssl2>, B<-no_ssl3>, B<-no_tls1>
+=item B<-ssl2>, B<-ssl3>, B<-tls1>, B<-tls1_1>, B<-tls1_2>, B<-no_ssl2>, B<-no_ssl3>, B<-no_tls1>, B<-no_tls1_1>, B<-no_tls1_2>
 
-these options disable the use of certain SSL or TLS protocols. By default
-the initial handshake uses a method which should be compatible with all
-servers and permit them to use SSL v3, SSL v2 or TLS as appropriate.
+These options require or disable the use of the specified SSL or TLS protocols.
+By default the initial handshake uses a I<version-flexible> method which will
+negotiate the highest mutually supported protocol version.
 
 =item B<-bugs>
 
diff --git a/doc/ssl/SSL_CONF_cmd.pod b/doc/ssl/SSL_CONF_cmd.pod
index 2bf1a60..e81d76a 100644
--- a/doc/ssl/SSL_CONF_cmd.pod
+++ b/doc/ssl/SSL_CONF_cmd.pod
@@ -74,7 +74,7 @@ B<prime256v1>). Curve names are case sensitive.
 
 =item B<-named_curve>
 
-This sets the temporary curve used for ephemeral ECDH modes. Only used by 
+This sets the temporary curve used for ephemeral ECDH modes. Only used by
 servers
 
 The B<value> argument is a curve name or the special value B<auto> which
@@ -85,7 +85,7 @@ can be either the B<NIST> name (e.g. B<P-256>) or an OpenSSL OID name
 =item B<-cipher>
 
 Sets the cipher suite list to B<value>. Note: syntax checking of B<value> is
-currently not performed unless a B<SSL> or B<SSL_CTX> structure is 
+currently not performed unless a B<SSL> or B<SSL_CTX> structure is
 associated with B<cctx>.
 
 =item B<-cert>
@@ -111,9 +111,9 @@ operations are permitted.
 
 =item B<-no_ssl2>, B<-no_ssl3>, B<-no_tls1>, B<-no_tls1_1>, B<-no_tls1_2>
 
-Disables protocol support for SSLv2, SSLv3, TLS 1.0, TLS 1.1 or TLS 1.2 
-by setting the corresponding options B<SSL_OP_NO_SSL2>, B<SSL_OP_NO_SSL3>,
-B<SSL_OP_NO_TLS1>, B<SSL_OP_NO_TLS1_1> and B<SSL_OP_NO_TLS1_2> respectively.
+Disables protocol support for SSLv2, SSLv3, TLSv1.0, TLSv1.1 or TLSv1.2
+by setting the corresponding options B<SSL_OP_NO_SSLv2>, B<SSL_OP_NO_SSLv3>,
+B<SSL_OP_NO_TLSv1>, B<SSL_OP_NO_TLSv1_1> and B<SSL_OP_NO_TLSv1_2> respectively.
 
 =item B<-bugs>
 
@@ -177,7 +177,7 @@ Note: the command prefix (if set) alters the recognised B<cmd> values.
 =item B<CipherString>
 
 Sets the cipher suite list to B<value>. Note: syntax checking of B<value> is
-currently not performed unless an B<SSL> or B<SSL_CTX> structure is 
+currently not performed unless an B<SSL> or B<SSL_CTX> structure is
 associated with B<cctx>.
 
 =item B<Certificate>
@@ -244,7 +244,7 @@ B<prime256v1>). Curve names are case sensitive.
 
 =item B<ECDHParameters>
 
-This sets the temporary curve used for ephemeral ECDH modes. Only used by 
+This sets the temporary curve used for ephemeral ECDH modes. Only used by
 servers
 
 The B<value> argument is a curve name or the special value B<Automatic> which
@@ -258,10 +258,11 @@ The supported versions of the SSL or TLS protocol.
 
 The B<value> argument is a comma separated list of supported protocols to
 enable or disable. If an protocol is preceded by B<-> that version is disabled.
-All versions are enabled by default, though applications may choose to
-explicitly disable some. Currently supported protocol values are B<SSLv2>,
-B<SSLv3>, B<TLSv1>, B<TLSv1.1> and B<TLSv1.2>. The special value B<ALL> refers
-to all supported versions.
+Currently supported protocol values are B<SSLv2>, B<SSLv3>, B<TLSv1>,
+B<TLSv1.1> and B<TLSv1.2>.
+All protocol versions other than B<SSLv2> are enabled by default.
+To avoid inadvertent enabling of B<SSLv2>, when SSLv2 is disabled, it is not
+possible to enable it via the B<Protocol> command.
 
 =item B<Options>
 
@@ -339,16 +340,16 @@ The value is a directory name.
 The order of operations is significant. This can be used to set either defaults
 or values which cannot be overridden. For example if an application calls:
 
- SSL_CONF_cmd(ctx, "Protocol", "-SSLv2");
+ SSL_CONF_cmd(ctx, "Protocol", "-SSLv3");
  SSL_CONF_cmd(ctx, userparam, uservalue);
 
-it will disable SSLv2 support by default but the user can override it. If 
+it will disable SSLv3 support by default but the user can override it. If
 however the call sequence is:
 
  SSL_CONF_cmd(ctx, userparam, uservalue);
- SSL_CONF_cmd(ctx, "Protocol", "-SSLv2");
+ SSL_CONF_cmd(ctx, "Protocol", "-SSLv3");
 
-SSLv2 is B<always> disabled and attempt to override this by the user are
+then SSLv3 is B<always> disabled and attempt to override this by the user are
 ignored.
 
 By checking the return code of SSL_CTX_cmd() it is possible to query if a
@@ -372,7 +373,7 @@ can be checked instead. If -3 is returned a required argument is missing
 and an error is indicated. If 0 is returned some other error occurred and
 this can be reported back to the user.
 
-The function SSL_CONF_cmd_value_type() can be used by applications to 
+The function SSL_CONF_cmd_value_type() can be used by applications to
 check for the existence of a command or to perform additional syntax
 checking or translation of the command value. For example if the return
 value is B<SSL_CONF_TYPE_FILE> an application could translate a relative
diff --git a/doc/ssl/SSL_CTX_new.pod b/doc/ssl/SSL_CTX_new.pod
index 491ac8c..b8cc879 100644
--- a/doc/ssl/SSL_CTX_new.pod
+++ b/doc/ssl/SSL_CTX_new.pod
@@ -2,13 +2,55 @@
 
 =head1 NAME
 
-SSL_CTX_new - create a new SSL_CTX object as framework for TLS/SSL enabled functions
+SSL_CTX_new,
+SSLv23_method, SSLv23_server_method, SSLv23_client_method,
+TLSv1_2_method, TLSv1_2_server_method, TLSv1_2_client_method,
+TLSv1_1_method, TLSv1_1_server_method, TLSv1_1_client_method,
+TLSv1_method, TLSv1_server_method, TLSv1_client_method,
+SSLv3_method, SSLv3_server_method, SSLv3_client_method,
+SSLv2_method, SSLv2_server_method, SSLv2_client_method,
+DTLS_method, DTLS_server_method, DTLS_client_method,
+DTLSv1_2_method, DTLSv1_2_server_method, DTLSv1_2_client_method,
+DTLSv1_method, DTLSv1_server_method, DTLSv1_client_method -
+create a new SSL_CTX object as framework for TLS/SSL enabled functions
 
 =head1 SYNOPSIS
 
  #include <openssl/ssl.h>
 
  SSL_CTX *SSL_CTX_new(const SSL_METHOD *method);
+ const SSL_METHOD *SSLv23_method(void);
+ const SSL_METHOD *SSLv23_server_method(void);
+ const SSL_METHOD *SSLv23_client_method(void);
+ const SSL_METHOD *TLSv1_2_method(void);
+ const SSL_METHOD *TLSv1_2_server_method(void);
+ const SSL_METHOD *TLSv1_2_client_method(void);
+ const SSL_METHOD *TLSv1_1_method(void);
+ const SSL_METHOD *TLSv1_1_server_method(void);
+ const SSL_METHOD *TLSv1_1_client_method(void);
+ const SSL_METHOD *TLSv1_method(void);
+ const SSL_METHOD *TLSv1_server_method(void);
+ const SSL_METHOD *TLSv1_client_method(void);
+ #ifndef OPENSSL_NO_SSL3_METHOD
+ const SSL_METHOD *SSLv3_method(void);
+ const SSL_METHOD *SSLv3_server_method(void);
+ const SSL_METHOD *SSLv3_client_method(void);
+ #endif
+ #ifndef OPENSSL_NO_SSL2
+ const SSL_METHOD *SSLv2_method(void);
+ const SSL_METHOD *SSLv2_server_method(void);
+ const SSL_METHOD *SSLv2_client_method(void);
+ #endif
+
+ const SSL_METHOD *DTLS_method(void);
+ const SSL_METHOD *DTLS_server_method(void);
+ const SSL_METHOD *DTLS_client_method(void);
+ const SSL_METHOD *DTLSv1_2_method(void);
+ const SSL_METHOD *DTLSv1_2_server_method(void);
+ const SSL_METHOD *DTLSv1_2_client_method(void);
+ const SSL_METHOD *DTLSv1_method(void);
+ const SSL_METHOD *DTLSv1_server_method(void);
+ const SSL_METHOD *DTLSv1_client_method(void);
 
 =head1 DESCRIPTION
 
@@ -23,65 +65,88 @@ client only type. B<method> can be of the following types:
 
 =over 4
 
-=item SSLv2_method(void), SSLv2_server_method(void), SSLv2_client_method(void)
+=item SSLv23_method(), SSLv23_server_method(), SSLv23_client_method()
+
+These are the general-purpose I<version-flexible> SSL/TLS methods.
+The actual protocol version used will be negotiated to the highest version
+mutually supported by the client and the server.
+The supported protocols are SSLv2, SSLv3, TLSv1, TLSv1.1 and TLSv1.2.
+Most applications should use these method, and avoid the version specific
+methods described below.
+
+The list of protocols available can be further limited using the
+B<SSL_OP_NO_SSLv2>, B<SSL_OP_NO_SSLv3>, B<SSL_OP_NO_TLSv1>,
+B<SSL_OP_NO_TLSv1_1> and B<SSL_OP_NO_TLSv1_2> options of the
+L<SSL_CTX_set_options(3)> or L<SSL_set_options(3)> functions.
+Clients should avoid creating "holes" in the set of protocols they support,
+when disabling a protocol, make sure that you also disable either all previous
+or all subsequent protocol versions.
+In clients, when a protocol version is disabled without disabling I<all>
+previous protocol versions, the effect is to also disable all subsequent
+protocol versions.
+
+The SSLv2 and SSLv3 protocols are deprecated and should generally not be used.
+Applications should typically use L<SSL_CTX_set_options(3)> in combination with
+the B<SSL_OP_NO_SSLv3> flag to disable negotiation of SSLv3 via the above
+I<version-flexible> SSL/TLS methods.
+The B<SSL_OP_NO_SSLv2> option is set by default, and would need to be cleared
+via L<SSL_CTX_clear_options(3)> in order to enable negotiation of SSLv2.
+
+=item TLSv1_2_method(), TLSv1_2_server_method(), TLSv1_2_client_method()
 
-A TLS/SSL connection established with these methods will only understand
-the SSLv2 protocol. A client will send out SSLv2 client hello messages
-and will also indicate that it only understand SSLv2. A server will only
-understand SSLv2 client hello messages.
+A TLS/SSL connection established with these methods will only understand the
+TLSv1.2 protocol.  A client will send out TLSv1.2 client hello messages and
+will also indicate that it only understand TLSv1.2.  A server will only
+understand TLSv1.2 client hello messages.
 
-=item SSLv3_method(void), SSLv3_server_method(void), SSLv3_client_method(void)
+=item TLSv1_1_method(), TLSv1_1_server_method(), TLSv1_1_client_method()
 
 A TLS/SSL connection established with these methods will only understand the
-SSLv3 protocol. A client will send out SSLv3 client hello messages
-and will indicate that it only understands SSLv3. A server will only understand
-SSLv3 client hello messages. This especially means, that it will
-not understand SSLv2 client hello messages which are widely used for
-compatibility reasons, see SSLv23_*_method().
+TLSv1.1 protocol.  A client will send out TLSv1.1 client hello messages and
+will also indicate that it only understand TLSv1.1.  A server will only
+understand TLSv1.1 client hello messages.
 
-=item TLSv1_method(void), TLSv1_server_method(void), TLSv1_client_method(void)
+=item TLSv1_method(), TLSv1_server_method(), TLSv1_client_method()
 
 A TLS/SSL connection established with these methods will only understand the
-TLSv1 protocol. A client will send out TLSv1 client hello messages
-and will indicate that it only understands TLSv1. A server will only understand
-TLSv1 client hello messages. This especially means, that it will
-not understand SSLv2 client hello messages which are widely used for
-compatibility reasons, see SSLv23_*_method(). It will also not understand
-SSLv3 client hello messages.
-
-=item SSLv23_method(void), SSLv23_server_method(void), SSLv23_client_method(void)
-
-A TLS/SSL connection established with these methods may understand the SSLv2,
-SSLv3, TLSv1, TLSv1.1 and TLSv1.2 protocols.
-
-If the cipher list does not contain any SSLv2 ciphersuites (the default
-cipher list does not) or extensions are required (for example server name)
-a client will send out TLSv1 client hello messages including extensions and
-will indicate that it also understands TLSv1.1, TLSv1.2 and permits a
-fallback to SSLv3. A server will support SSLv3, TLSv1, TLSv1.1 and TLSv1.2
-protocols. This is the best choice when compatibility is a concern.
-
-If any SSLv2 ciphersuites are included in the cipher list and no extensions
-are required then SSLv2 compatible client hellos will be used by clients and
-SSLv2 will be accepted by servers. This is B<not> recommended due to the
-insecurity of SSLv2 and the limited nature of the SSLv2 client hello
-prohibiting the use of extensions.
+TLSv1 protocol.  A client will send out TLSv1 client hello messages and will
+indicate that it only understands TLSv1.  A server will only understand TLSv1
+client hello messages.
 
-=back
+=item SSLv3_method(), SSLv3_server_method(), SSLv3_client_method()
+
+A TLS/SSL connection established with these methods will only understand the
+SSLv3 protocol.  A client will send out SSLv3 client hello messages and will
+indicate that it only understands SSLv3.  A server will only understand SSLv3
+client hello messages.  The SSLv3 protocol is deprecated and should not be
+used.
+
+=item SSLv2_method(), SSLv2_server_method(), SSLv2_client_method()
+
+A TLS/SSL connection established with these methods will only understand the
+SSLv2 protocol.  A client will send out SSLv2 client hello messages and will
+also indicate that it only understand SSLv2.  A server will only understand
+SSLv2 client hello messages.  The SSLv2 protocol offers little to no security
+and should not be used.
+As of OpenSSL 1.0.2g, EXPORT ciphers and 56-bit DES are no longer available
+with SSLv2.
 
-The list of protocols available can later be limited using the SSL_OP_NO_SSLv2,
-SSL_OP_NO_SSLv3, SSL_OP_NO_TLSv1, SSL_OP_NO_TLSv1_1 and SSL_OP_NO_TLSv1_2
-options of the SSL_CTX_set_options() or SSL_set_options() functions.
-Using these options it is possible to choose e.g. SSLv23_server_method() and
-be able to negotiate with all possible clients, but to only allow newer
-protocols like TLSv1, TLSv1.1 or TLS v1.2.
+=item DTLS_method(), DTLS_server_method(), DTLS_client_method()
 
-Applications which never want to support SSLv2 (even is the cipher string
-is configured to use SSLv2 ciphersuites) can set SSL_OP_NO_SSLv2.
+These are the version-flexible DTLS methods.
+
+=item DTLSv1_2_method(), DTLSv1_2_server_method(), DTLSv1_2_client_method()
+
+These are the version-specific methods for DTLSv1.2.
+
+=item DTLSv1_method(), DTLSv1_server_method(), DTLSv1_client_method()
+
+These are the version-specific methods for DTLSv1.
+
+=back
 
-SSL_CTX_new() initializes the list of ciphers, the session cache setting,
-the callbacks, the keys and certificates and the options to its default
-values.
+SSL_CTX_new() initializes the list of ciphers, the session cache setting, the
+callbacks, the keys and certificates and the options to its default values.
 
 =head1 RETURN VALUES
 
@@ -91,8 +156,8 @@ The following return values can occur:
 
 =item NULL
 
-The creation of a new SSL_CTX object failed. Check the error stack to
-find out the reason.
+The creation of a new SSL_CTX object failed. Check the error stack to find out
+the reason.
 
 =item Pointer to an SSL_CTX object
 
@@ -102,6 +167,7 @@ The return value points to an allocated SSL_CTX object.
 
 =head1 SEE ALSO
 
+L<SSL_CTX_set_options(3)>, L<SSL_CTX_clear_options(3)>, L<SSL_set_options(3)>,
 L<SSL_CTX_free(3)|SSL_CTX_free(3)>, L<SSL_accept(3)|SSL_accept(3)>,
 L<ssl(3)|ssl(3)>,  L<SSL_set_connect_state(3)|SSL_set_connect_state(3)>
 
diff --git a/doc/ssl/SSL_CTX_set_options.pod b/doc/ssl/SSL_CTX_set_options.pod
index e80a72c..9a7e98c 100644
--- a/doc/ssl/SSL_CTX_set_options.pod
+++ b/doc/ssl/SSL_CTX_set_options.pod
@@ -189,15 +189,25 @@ browser has a cert, it will crash/hang.  Works for 3.x and 4.xbeta
 =item SSL_OP_NO_SSLv2
 
 Do not use the SSLv2 protocol.
+As of OpenSSL 1.0.2g the B<SSL_OP_NO_SSLv2> option is set by default.
 
 =item SSL_OP_NO_SSLv3
 
 Do not use the SSLv3 protocol.
+It is recommended that applications should set this option.
 
 =item SSL_OP_NO_TLSv1
 
 Do not use the TLSv1 protocol.
 
+=item SSL_OP_NO_TLSv1_1
+
+Do not use the TLSv1.1 protocol.
+
+=item SSL_OP_NO_TLSv1_2
+
+Do not use the TLSv1.2 protocol.
+
 =item SSL_OP_NO_SESSION_RESUMPTION_ON_RENEGOTIATION
 
 When performing renegotiation as a server, always start a new session
diff --git a/doc/ssl/ssl.pod b/doc/ssl/ssl.pod
index 242087e..70cca17 100644
--- a/doc/ssl/ssl.pod
+++ b/doc/ssl/ssl.pod
@@ -130,41 +130,86 @@ protocol methods defined in B<SSL_METHOD> structures.
 
 =over 4
 
-=item const SSL_METHOD *B<SSLv2_client_method>(void);
+=item const SSL_METHOD *B<SSLv23_method>(void);
 
-Constructor for the SSLv2 SSL_METHOD structure for a dedicated client.
+Constructor for the I<version-flexible> SSL_METHOD structure for
+clients, servers or both.
+See L<SSL_CTX_new(3)> for details.
 
-=item const SSL_METHOD *B<SSLv2_server_method>(void);
+=item const SSL_METHOD *B<SSLv23_client_method>(void);
 
-Constructor for the SSLv2 SSL_METHOD structure for a dedicated server.
+Constructor for the I<version-flexible> SSL_METHOD structure for
+clients.
 
-=item const SSL_METHOD *B<SSLv2_method>(void);
+=item const SSL_METHOD *B<SSLv23_client_method>(void);
 
-Constructor for the SSLv2 SSL_METHOD structure for combined client and server.
+Constructor for the I<version-flexible> SSL_METHOD structure for
+servers.
 
-=item const SSL_METHOD *B<SSLv3_client_method>(void);
+=item const SSL_METHOD *B<TLSv1_2_method>(void);
 
-Constructor for the SSLv3 SSL_METHOD structure for a dedicated client.
+Constructor for the TLSv1.2 SSL_METHOD structure for clients, servers
+or both.
 
-=item const SSL_METHOD *B<SSLv3_server_method>(void);
+=item const SSL_METHOD *B<TLSv1_2_client_method>(void);
 
-Constructor for the SSLv3 SSL_METHOD structure for a dedicated server.
+Constructor for the TLSv1.2 SSL_METHOD structure for clients.
 
-=item const SSL_METHOD *B<SSLv3_method>(void);
+=item const SSL_METHOD *B<TLSv1_2_server_method>(void);
+
+Constructor for the TLSv1.2 SSL_METHOD structure for servers.
+
+=item const SSL_METHOD *B<TLSv1_1_method>(void);
 
-Constructor for the SSLv3 SSL_METHOD structure for combined client and server.
+Constructor for the TLSv1.1 SSL_METHOD structure for clients, servers
+or both.
+
+=item const SSL_METHOD *B<TLSv1_1_client_method>(void);
+
+Constructor for the TLSv1.1 SSL_METHOD structure for clients.
+
+=item const SSL_METHOD *B<TLSv1_1_server_method>(void);
+
+Constructor for the TLSv1.1 SSL_METHOD structure for servers.
+
+=item const SSL_METHOD *B<TLSv1_method>(void);
+
+Constructor for the TLSv1 SSL_METHOD structure for clients, servers
+or both.
 
 =item const SSL_METHOD *B<TLSv1_client_method>(void);
 
-Constructor for the TLSv1 SSL_METHOD structure for a dedicated client.
+Constructor for the TLSv1 SSL_METHOD structure for clients.
 
 =item const SSL_METHOD *B<TLSv1_server_method>(void);
 
-Constructor for the TLSv1 SSL_METHOD structure for a dedicated server.
+Constructor for the TLSv1 SSL_METHOD structure for servers.
 
-=item const SSL_METHOD *B<TLSv1_method>(void);
+=item const SSL_METHOD *B<SSLv3_method>(void);
+
+Constructor for the SSLv3 SSL_METHOD structure for clients, servers
+or both.
+
+=item const SSL_METHOD *B<SSLv3_client_method>(void);
+
+Constructor for the SSLv3 SSL_METHOD structure for clients.
+
+=item const SSL_METHOD *B<SSLv3_server_method>(void);
+
+Constructor for the SSLv3 SSL_METHOD structure for servers.
+
+=item const SSL_METHOD *B<SSLv2_method>(void);
+
+Constructor for the SSLv2 SSL_METHOD structure for clients, servers
+or both.
+
+=item const SSL_METHOD *B<SSLv2_client_method>(void);
+
+Constructor for the SSLv2 SSL_METHOD structure for clients.
+
+=item const SSL_METHOD *B<SSLv2_server_method>(void);
 
-Constructor for the TLSv1 SSL_METHOD structure for combined client and server.
+Constructor for the SSLv2 SSL_METHOD structure for servers.
 
 =back
 
diff --git a/openssl.spec b/openssl.spec
index 67fb073..55c05c4 100644
--- a/openssl.spec
+++ b/openssl.spec
@@ -6,7 +6,7 @@ Release: 1
 
 Summary: Secure Sockets Layer and cryptography libraries and tools
 Name: openssl
-Version: 1.0.2g
+Version: 1.0.2h
 Source0: ftp://ftp.openssl.org/source/%{name}-%{version}.tar.gz
 License: OpenSSL
 Group: System Environment/Libraries
diff --git a/ssl/Makefile b/ssl/Makefile
index 7b90fb0..b6dee5b 100644
--- a/ssl/Makefile
+++ b/ssl/Makefile
@@ -15,7 +15,7 @@ KRB5_INCLUDES=
 CFLAGS= $(INCLUDES) $(CFLAG)
 
 GENERAL=Makefile README ssl-lib.com install.com
-TEST=ssltest.c heartbeat_test.c clienthellotest.c
+TEST=ssltest.c heartbeat_test.c clienthellotest.c sslv2conftest.c
 APPS=
 
 LIB=$(TOP)/libssl.a
@@ -399,14 +399,14 @@ s2_clnt.o: ../include/openssl/obj_mac.h ../include/openssl/objects.h
 s2_clnt.o: ../include/openssl/opensslconf.h ../include/openssl/opensslv.h
 s2_clnt.o: ../include/openssl/ossl_typ.h ../include/openssl/pem.h
 s2_clnt.o: ../include/openssl/pem2.h ../include/openssl/pkcs7.h
-s2_clnt.o: ../include/openssl/pqueue.h ../include/openssl/rand.h
-s2_clnt.o: ../include/openssl/rsa.h ../include/openssl/safestack.h
-s2_clnt.o: ../include/openssl/sha.h ../include/openssl/srtp.h
-s2_clnt.o: ../include/openssl/ssl.h ../include/openssl/ssl2.h
-s2_clnt.o: ../include/openssl/ssl23.h ../include/openssl/ssl3.h
-s2_clnt.o: ../include/openssl/stack.h ../include/openssl/symhacks.h
-s2_clnt.o: ../include/openssl/tls1.h ../include/openssl/x509.h
-s2_clnt.o: ../include/openssl/x509_vfy.h s2_clnt.c ssl_locl.h
+s2_clnt.o: ../include/openssl/pqueue.h ../include/openssl/rsa.h
+s2_clnt.o: ../include/openssl/safestack.h ../include/openssl/sha.h
+s2_clnt.o: ../include/openssl/srtp.h ../include/openssl/ssl.h
+s2_clnt.o: ../include/openssl/ssl2.h ../include/openssl/ssl23.h
+s2_clnt.o: ../include/openssl/ssl3.h ../include/openssl/stack.h
+s2_clnt.o: ../include/openssl/symhacks.h ../include/openssl/tls1.h
+s2_clnt.o: ../include/openssl/x509.h ../include/openssl/x509_vfy.h s2_clnt.c
+s2_clnt.o: ssl_locl.h
 s2_enc.o: ../e_os.h ../include/openssl/asn1.h ../include/openssl/bio.h
 s2_enc.o: ../include/openssl/buffer.h ../include/openssl/comp.h
 s2_enc.o: ../include/openssl/crypto.h ../include/openssl/dsa.h
@@ -435,18 +435,18 @@ s2_lib.o: ../include/openssl/ec.h ../include/openssl/ecdh.h
 s2_lib.o: ../include/openssl/ecdsa.h ../include/openssl/err.h
 s2_lib.o: ../include/openssl/evp.h ../include/openssl/hmac.h
 s2_lib.o: ../include/openssl/kssl.h ../include/openssl/lhash.h
-s2_lib.o: ../include/openssl/md5.h ../include/openssl/obj_mac.h
-s2_lib.o: ../include/openssl/objects.h ../include/openssl/opensslconf.h
-s2_lib.o: ../include/openssl/opensslv.h ../include/openssl/ossl_typ.h
-s2_lib.o: ../include/openssl/pem.h ../include/openssl/pem2.h
-s2_lib.o: ../include/openssl/pkcs7.h ../include/openssl/pqueue.h
-s2_lib.o: ../include/openssl/rsa.h ../include/openssl/safestack.h
-s2_lib.o: ../include/openssl/sha.h ../include/openssl/srtp.h
-s2_lib.o: ../include/openssl/ssl.h ../include/openssl/ssl2.h
-s2_lib.o: ../include/openssl/ssl23.h ../include/openssl/ssl3.h
-s2_lib.o: ../include/openssl/stack.h ../include/openssl/symhacks.h
-s2_lib.o: ../include/openssl/tls1.h ../include/openssl/x509.h
-s2_lib.o: ../include/openssl/x509_vfy.h s2_lib.c ssl_locl.h
+s2_lib.o: ../include/openssl/obj_mac.h ../include/openssl/objects.h
+s2_lib.o: ../include/openssl/opensslconf.h ../include/openssl/opensslv.h
+s2_lib.o: ../include/openssl/ossl_typ.h ../include/openssl/pem.h
+s2_lib.o: ../include/openssl/pem2.h ../include/openssl/pkcs7.h
+s2_lib.o: ../include/openssl/pqueue.h ../include/openssl/rsa.h
+s2_lib.o: ../include/openssl/safestack.h ../include/openssl/sha.h
+s2_lib.o: ../include/openssl/srtp.h ../include/openssl/ssl.h
+s2_lib.o: ../include/openssl/ssl2.h ../include/openssl/ssl23.h
+s2_lib.o: ../include/openssl/ssl3.h ../include/openssl/stack.h
+s2_lib.o: ../include/openssl/symhacks.h ../include/openssl/tls1.h
+s2_lib.o: ../include/openssl/x509.h ../include/openssl/x509_vfy.h s2_lib.c
+s2_lib.o: ssl_locl.h
 s2_meth.o: ../e_os.h ../include/openssl/asn1.h ../include/openssl/bio.h
 s2_meth.o: ../include/openssl/buffer.h ../include/openssl/comp.h
 s2_meth.o: ../include/openssl/crypto.h ../include/openssl/dsa.h
@@ -487,20 +487,19 @@ s2_pkt.o: ../include/openssl/ssl3.h ../include/openssl/stack.h
 s2_pkt.o: ../include/openssl/symhacks.h ../include/openssl/tls1.h
 s2_pkt.o: ../include/openssl/x509.h ../include/openssl/x509_vfy.h s2_pkt.c
 s2_pkt.o: ssl_locl.h
-s2_srvr.o: ../crypto/constant_time_locl.h ../e_os.h ../include/openssl/asn1.h
-s2_srvr.o: ../include/openssl/bio.h ../include/openssl/buffer.h
-s2_srvr.o: ../include/openssl/comp.h ../include/openssl/crypto.h
-s2_srvr.o: ../include/openssl/dsa.h ../include/openssl/dtls1.h
-s2_srvr.o: ../include/openssl/e_os2.h ../include/openssl/ec.h
-s2_srvr.o: ../include/openssl/ecdh.h ../include/openssl/ecdsa.h
-s2_srvr.o: ../include/openssl/err.h ../include/openssl/evp.h
-s2_srvr.o: ../include/openssl/hmac.h ../include/openssl/kssl.h
-s2_srvr.o: ../include/openssl/lhash.h ../include/openssl/obj_mac.h
-s2_srvr.o: ../include/openssl/objects.h ../include/openssl/opensslconf.h
-s2_srvr.o: ../include/openssl/opensslv.h ../include/openssl/ossl_typ.h
-s2_srvr.o: ../include/openssl/pem.h ../include/openssl/pem2.h
-s2_srvr.o: ../include/openssl/pkcs7.h ../include/openssl/pqueue.h
-s2_srvr.o: ../include/openssl/rand.h ../include/openssl/rsa.h
+s2_srvr.o: ../e_os.h ../include/openssl/asn1.h ../include/openssl/bio.h
+s2_srvr.o: ../include/openssl/buffer.h ../include/openssl/comp.h
+s2_srvr.o: ../include/openssl/crypto.h ../include/openssl/dsa.h
+s2_srvr.o: ../include/openssl/dtls1.h ../include/openssl/e_os2.h
+s2_srvr.o: ../include/openssl/ec.h ../include/openssl/ecdh.h
+s2_srvr.o: ../include/openssl/ecdsa.h ../include/openssl/err.h
+s2_srvr.o: ../include/openssl/evp.h ../include/openssl/hmac.h
+s2_srvr.o: ../include/openssl/kssl.h ../include/openssl/lhash.h
+s2_srvr.o: ../include/openssl/obj_mac.h ../include/openssl/objects.h
+s2_srvr.o: ../include/openssl/opensslconf.h ../include/openssl/opensslv.h
+s2_srvr.o: ../include/openssl/ossl_typ.h ../include/openssl/pem.h
+s2_srvr.o: ../include/openssl/pem2.h ../include/openssl/pkcs7.h
+s2_srvr.o: ../include/openssl/pqueue.h ../include/openssl/rsa.h
 s2_srvr.o: ../include/openssl/safestack.h ../include/openssl/sha.h
 s2_srvr.o: ../include/openssl/srtp.h ../include/openssl/ssl.h
 s2_srvr.o: ../include/openssl/ssl2.h ../include/openssl/ssl23.h
diff --git a/ssl/s2_lib.c b/ssl/s2_lib.c
index d55b93f..a8036b3 100644
--- a/ssl/s2_lib.c
+++ b/ssl/s2_lib.c
@@ -156,6 +156,7 @@ OPENSSL_GLOBAL const SSL_CIPHER ssl2_ciphers[] = {
      128,
      },
 
+# if 0
 /* RC4_128_EXPORT40_WITH_MD5 */
     {
      1,
@@ -171,6 +172,7 @@ OPENSSL_GLOBAL const SSL_CIPHER ssl2_ciphers[] = {
      40,
      128,
      },
+# endif
 
 /* RC2_128_CBC_WITH_MD5 */
     {
@@ -188,6 +190,7 @@ OPENSSL_GLOBAL const SSL_CIPHER ssl2_ciphers[] = {
      128,
      },
 
+# if 0
 /* RC2_128_CBC_EXPORT40_WITH_MD5 */
     {
      1,
@@ -203,6 +206,7 @@ OPENSSL_GLOBAL const SSL_CIPHER ssl2_ciphers[] = {
      40,
      128,
      },
+# endif
 
 # ifndef OPENSSL_NO_IDEA
 /* IDEA_128_CBC_WITH_MD5 */
@@ -222,6 +226,7 @@ OPENSSL_GLOBAL const SSL_CIPHER ssl2_ciphers[] = {
      },
 # endif
 
+# if 0
 /* DES_64_CBC_WITH_MD5 */
     {
      1,
@@ -237,6 +242,7 @@ OPENSSL_GLOBAL const SSL_CIPHER ssl2_ciphers[] = {
      56,
      56,
      },
+# endif
 
 /* DES_192_EDE3_CBC_WITH_MD5 */
     {
diff --git a/ssl/s3_lib.c b/ssl/s3_lib.c
index 6a06625..4aac3b2 100644
--- a/ssl/s3_lib.c
+++ b/ssl/s3_lib.c
@@ -198,6 +198,7 @@ OPENSSL_GLOBAL SSL_CIPHER ssl3_ciphers[] = {
      },
 
 /* Cipher 03 */
+#ifndef OPENSSL_NO_WEAK_SSL_CIPHERS
     {
      1,
      SSL3_TXT_RSA_RC4_40_MD5,
@@ -212,6 +213,7 @@ OPENSSL_GLOBAL SSL_CIPHER ssl3_ciphers[] = {
      40,
      128,
      },
+#endif
 
 /* Cipher 04 */
     {
@@ -246,6 +248,7 @@ OPENSSL_GLOBAL SSL_CIPHER ssl3_ciphers[] = {
      },
 
 /* Cipher 06 */
+#ifndef OPENSSL_NO_WEAK_SSL_CIPHERS
     {
      1,
      SSL3_TXT_RSA_RC2_40_MD5,
@@ -260,6 +263,7 @@ OPENSSL_GLOBAL SSL_CIPHER ssl3_ciphers[] = {
      40,
      128,
      },
+#endif
 
 /* Cipher 07 */
 #ifndef OPENSSL_NO_IDEA
@@ -280,6 +284,7 @@ OPENSSL_GLOBAL SSL_CIPHER ssl3_ciphers[] = {
 #endif
 
 /* Cipher 08 */
+#ifndef OPENSSL_NO_WEAK_SSL_CIPHERS
     {
      1,
      SSL3_TXT_RSA_DES_40_CBC_SHA,
@@ -294,8 +299,10 @@ OPENSSL_GLOBAL SSL_CIPHER ssl3_ciphers[] = {
      40,
      56,
      },
+#endif
 
 /* Cipher 09 */
+#ifndef OPENSSL_NO_WEAK_SSL_CIPHERS
     {
      1,
      SSL3_TXT_RSA_DES_64_CBC_SHA,
@@ -310,6 +317,7 @@ OPENSSL_GLOBAL SSL_CIPHER ssl3_ciphers[] = {
      56,
      56,
      },
+#endif
 
 /* Cipher 0A */
     {
@@ -329,6 +337,7 @@ OPENSSL_GLOBAL SSL_CIPHER ssl3_ciphers[] = {
 
 /* The DH ciphers */
 /* Cipher 0B */
+#ifndef OPENSSL_NO_WEAK_SSL_CIPHERS
     {
      0,
      SSL3_TXT_DH_DSS_DES_40_CBC_SHA,
@@ -343,8 +352,10 @@ OPENSSL_GLOBAL SSL_CIPHER ssl3_ciphers[] = {
      40,
      56,
      },
+#endif
 
 /* Cipher 0C */
+#ifndef OPENSSL_NO_WEAK_SSL_CIPHERS
     {
      1,
      SSL3_TXT_DH_DSS_DES_64_CBC_SHA,
@@ -359,6 +370,7 @@ OPENSSL_GLOBAL SSL_CIPHER ssl3_ciphers[] = {
      56,
      56,
      },
+#endif
 
 /* Cipher 0D */
     {
@@ -377,6 +389,7 @@ OPENSSL_GLOBAL SSL_CIPHER ssl3_ciphers[] = {
      },
 
 /* Cipher 0E */
+#ifndef OPENSSL_NO_WEAK_SSL_CIPHERS
     {
      0,
      SSL3_TXT_DH_RSA_DES_40_CBC_SHA,
@@ -391,8 +404,10 @@ OPENSSL_GLOBAL SSL_CIPHER ssl3_ciphers[] = {
      40,
      56,
      },
+#endif
 
 /* Cipher 0F */
+#ifndef OPENSSL_NO_WEAK_SSL_CIPHERS
     {
      1,
      SSL3_TXT_DH_RSA_DES_64_CBC_SHA,
@@ -407,6 +422,7 @@ OPENSSL_GLOBAL SSL_CIPHER ssl3_ciphers[] = {
      56,
      56,
      },
+#endif
 
 /* Cipher 10 */
     {
@@ -426,6 +442,7 @@ OPENSSL_GLOBAL SSL_CIPHER ssl3_ciphers[] = {
 
 /* The Ephemeral DH ciphers */
 /* Cipher 11 */
+#ifndef OPENSSL_NO_WEAK_SSL_CIPHERS
     {
      1,
      SSL3_TXT_EDH_DSS_DES_40_CBC_SHA,
@@ -440,8 +457,10 @@ OPENSSL_GLOBAL SSL_CIPHER ssl3_ciphers[] = {
      40,
      56,
      },
+#endif
 
 /* Cipher 12 */
+#ifndef OPENSSL_NO_WEAK_SSL_CIPHERS
     {
      1,
      SSL3_TXT_EDH_DSS_DES_64_CBC_SHA,
@@ -456,6 +475,7 @@ OPENSSL_GLOBAL SSL_CIPHER ssl3_ciphers[] = {
      56,
      56,
      },
+#endif
 
 /* Cipher 13 */
     {
@@ -474,6 +494,7 @@ OPENSSL_GLOBAL SSL_CIPHER ssl3_ciphers[] = {
      },
 
 /* Cipher 14 */
+#ifndef OPENSSL_NO_WEAK_SSL_CIPHERS
     {
      1,
      SSL3_TXT_EDH_RSA_DES_40_CBC_SHA,
@@ -488,8 +509,10 @@ OPENSSL_GLOBAL SSL_CIPHER ssl3_ciphers[] = {
      40,
      56,
      },
+#endif
 
 /* Cipher 15 */
+#ifndef OPENSSL_NO_WEAK_SSL_CIPHERS
     {
      1,
      SSL3_TXT_EDH_RSA_DES_64_CBC_SHA,
@@ -504,6 +527,7 @@ OPENSSL_GLOBAL SSL_CIPHER ssl3_ciphers[] = {
      56,
      56,
      },
+#endif
 
 /* Cipher 16 */
     {
@@ -522,6 +546,7 @@ OPENSSL_GLOBAL SSL_CIPHER ssl3_ciphers[] = {
      },
 
 /* Cipher 17 */
+#ifndef OPENSSL_NO_WEAK_SSL_CIPHERS
     {
      1,
      SSL3_TXT_ADH_RC4_40_MD5,
@@ -536,6 +561,7 @@ OPENSSL_GLOBAL SSL_CIPHER ssl3_ciphers[] = {
      40,
      128,
      },
+#endif
 
 /* Cipher 18 */
     {
@@ -554,6 +580,7 @@ OPENSSL_GLOBAL SSL_CIPHER ssl3_ciphers[] = {
      },
 
 /* Cipher 19 */
+#ifndef OPENSSL_NO_WEAK_SSL_CIPHERS
     {
      1,
      SSL3_TXT_ADH_DES_40_CBC_SHA,
@@ -568,8 +595,10 @@ OPENSSL_GLOBAL SSL_CIPHER ssl3_ciphers[] = {
      40,
      128,
      },
+#endif
 
 /* Cipher 1A */
+#ifndef OPENSSL_NO_WEAK_SSL_CIPHERS
     {
      1,
      SSL3_TXT_ADH_DES_64_CBC_SHA,
@@ -584,6 +613,7 @@ OPENSSL_GLOBAL SSL_CIPHER ssl3_ciphers[] = {
      56,
      56,
      },
+#endif
 
 /* Cipher 1B */
     {
@@ -655,6 +685,7 @@ OPENSSL_GLOBAL SSL_CIPHER ssl3_ciphers[] = {
 #ifndef OPENSSL_NO_KRB5
 /* The Kerberos ciphers*/
 /* Cipher 1E */
+# ifndef OPENSSL_NO_WEAK_SSL_CIPHERS
     {
      1,
      SSL3_TXT_KRB5_DES_64_CBC_SHA,
@@ -669,6 +700,7 @@ OPENSSL_GLOBAL SSL_CIPHER ssl3_ciphers[] = {
      56,
      56,
      },
+# endif
 
 /* Cipher 1F */
     {
@@ -719,6 +751,7 @@ OPENSSL_GLOBAL SSL_CIPHER ssl3_ciphers[] = {
      },
 
 /* Cipher 22 */
+# ifndef OPENSSL_NO_WEAK_SSL_CIPHERS
     {
      1,
      SSL3_TXT_KRB5_DES_64_CBC_MD5,
@@ -733,6 +766,7 @@ OPENSSL_GLOBAL SSL_CIPHER ssl3_ciphers[] = {
      56,
      56,
      },
+# endif
 
 /* Cipher 23 */
     {
@@ -783,6 +817,7 @@ OPENSSL_GLOBAL SSL_CIPHER ssl3_ciphers[] = {
      },
 
 /* Cipher 26 */
+# ifndef OPENSSL_NO_WEAK_SSL_CIPHERS
     {
      1,
      SSL3_TXT_KRB5_DES_40_CBC_SHA,
@@ -797,8 +832,10 @@ OPENSSL_GLOBAL SSL_CIPHER ssl3_ciphers[] = {
      40,
      56,
      },
+# endif
 
 /* Cipher 27 */
+# ifndef OPENSSL_NO_WEAK_SSL_CIPHERS
     {
      1,
      SSL3_TXT_KRB5_RC2_40_CBC_SHA,
@@ -813,8 +850,10 @@ OPENSSL_GLOBAL SSL_CIPHER ssl3_ciphers[] = {
      40,
      128,
      },
+# endif
 
 /* Cipher 28 */
+# ifndef OPENSSL_NO_WEAK_SSL_CIPHERS
     {
      1,
      SSL3_TXT_KRB5_RC4_40_SHA,
@@ -829,8 +868,10 @@ OPENSSL_GLOBAL SSL_CIPHER ssl3_ciphers[] = {
      40,
      128,
      },
+# endif
 
 /* Cipher 29 */
+# ifndef OPENSSL_NO_WEAK_SSL_CIPHERS
     {
      1,
      SSL3_TXT_KRB5_DES_40_CBC_MD5,
@@ -845,8 +886,10 @@ OPENSSL_GLOBAL SSL_CIPHER ssl3_ciphers[] = {
      40,
      56,
      },
+# endif
 
 /* Cipher 2A */
+# ifndef OPENSSL_NO_WEAK_SSL_CIPHERS
     {
      1,
      SSL3_TXT_KRB5_RC2_40_CBC_MD5,
@@ -861,8 +904,10 @@ OPENSSL_GLOBAL SSL_CIPHER ssl3_ciphers[] = {
      40,
      128,
      },
+# endif
 
 /* Cipher 2B */
+# ifndef OPENSSL_NO_WEAK_SSL_CIPHERS
     {
      1,
      SSL3_TXT_KRB5_RC4_40_MD5,
@@ -877,6 +922,7 @@ OPENSSL_GLOBAL SSL_CIPHER ssl3_ciphers[] = {
      40,
      128,
      },
+# endif
 #endif                          /* OPENSSL_NO_KRB5 */
 
 /* New AES ciphersuites */
@@ -1300,6 +1346,7 @@ OPENSSL_GLOBAL SSL_CIPHER ssl3_ciphers[] = {
 # endif
 
     /* Cipher 62 */
+# ifndef OPENSSL_NO_WEAK_SSL_CIPHERS
     {
      1,
      TLS1_TXT_RSA_EXPORT1024_WITH_DES_CBC_SHA,
@@ -1314,8 +1361,10 @@ OPENSSL_GLOBAL SSL_CIPHER ssl3_ciphers[] = {
      56,
      56,
      },
+# endif
 
     /* Cipher 63 */
+# ifndef OPENSSL_NO_WEAK_SSL_CIPHERS
     {
      1,
      TLS1_TXT_DHE_DSS_EXPORT1024_WITH_DES_CBC_SHA,
@@ -1330,8 +1379,10 @@ OPENSSL_GLOBAL SSL_CIPHER ssl3_ciphers[] = {
      56,
      56,
      },
+# endif
 
     /* Cipher 64 */
+# ifndef OPENSSL_NO_WEAK_SSL_CIPHERS
     {
      1,
      TLS1_TXT_RSA_EXPORT1024_WITH_RC4_56_SHA,
@@ -1346,8 +1397,10 @@ OPENSSL_GLOBAL SSL_CIPHER ssl3_ciphers[] = {
      56,
      128,
      },
+# endif
 
     /* Cipher 65 */
+# ifndef OPENSSL_NO_WEAK_SSL_CIPHERS
     {
      1,
      TLS1_TXT_DHE_DSS_EXPORT1024_WITH_RC4_56_SHA,
@@ -1362,6 +1415,7 @@ OPENSSL_GLOBAL SSL_CIPHER ssl3_ciphers[] = {
      56,
      128,
      },
+# endif
 
     /* Cipher 66 */
     {
diff --git a/ssl/ssl_conf.c b/ssl/ssl_conf.c
index 5478840..8d3709d 100644
--- a/ssl/ssl_conf.c
+++ b/ssl/ssl_conf.c
@@ -330,11 +330,19 @@ static int cmd_Protocol(SSL_CONF_CTX *cctx, const char *value)
         SSL_FLAG_TBL_INV("TLSv1.1", SSL_OP_NO_TLSv1_1),
         SSL_FLAG_TBL_INV("TLSv1.2", SSL_OP_NO_TLSv1_2)
     };
+    int ret;
+    int sslv2off;
+
     if (!(cctx->flags & SSL_CONF_FLAG_FILE))
         return -2;
     cctx->tbl = ssl_protocol_list;
     cctx->ntbl = sizeof(ssl_protocol_list) / sizeof(ssl_flag_tbl);
-    return CONF_parse_list(value, ',', 1, ssl_set_option_list, cctx);
+
+    sslv2off = *cctx->poptions & SSL_OP_NO_SSLv2;
+    ret = CONF_parse_list(value, ',', 1, ssl_set_option_list, cctx);
+    /* Never turn on SSLv2 through configuration */
+    *cctx->poptions |= sslv2off;
+    return ret;
 }
 
 static int cmd_Options(SSL_CONF_CTX *cctx, const char *value)
diff --git a/ssl/ssl_lib.c b/ssl/ssl_lib.c
index 7c23f9e..f1279bb 100644
--- a/ssl/ssl_lib.c
+++ b/ssl/ssl_lib.c
@@ -2054,6 +2054,13 @@ SSL_CTX *SSL_CTX_new(const SSL_METHOD *meth)
      */
     ret->options |= SSL_OP_LEGACY_SERVER_CONNECT;
 
+    /*
+     * Disable SSLv2 by default, callers that want to enable SSLv2 will have to
+     * explicitly clear this option via either of SSL_CTX_clear_options() or
+     * SSL_clear_options().
+     */
+    ret->options |= SSL_OP_NO_SSLv2;
+
     return (ret);
  err:
     SSLerr(SSL_F_SSL_CTX_NEW, ERR_R_MALLOC_FAILURE);
diff --git a/ssl/sslv2conftest.c b/ssl/sslv2conftest.c
new file mode 100644
index 0000000..1fd748b
--- /dev/null
+++ b/ssl/sslv2conftest.c
@@ -0,0 +1,231 @@
+/* Written by Matt Caswell for the OpenSSL Project */
+/* ====================================================================
+ * Copyright (c) 2016 The OpenSSL Project.  All rights reserved.
+ *
+ * Redistribution and use in source and binary forms, with or without
+ * modification, are permitted provided that the following conditions
+ * are met:
+ *
+ * 1. Redistributions of source code must retain the above copyright
+ *    notice, this list of conditions and the following disclaimer.
+ *
+ * 2. Redistributions in binary form must reproduce the above copyright
+ *    notice, this list of conditions and the following disclaimer in
+ *    the documentation and/or other materials provided with the
+ *    distribution.
+ *
+ * 3. All advertising materials mentioning features or use of this
+ *    software must display the following acknowledgment:
+ *    "This product includes software developed by the OpenSSL Project
+ *    for use in the OpenSSL Toolkit. (http://www.openssl.org/)"
+ *
+ * 4. The names "OpenSSL Toolkit" and "OpenSSL Project" must not be used to
+ *    endorse or promote products derived from this software without
+ *    prior written permission. For written permission, please contact
+ *    openssl-core at openssl.org.
+ *
+ * 5. Products derived from this software may not be called "OpenSSL"
+ *    nor may "OpenSSL" appear in their names without prior written
+ *    permission of the OpenSSL Project.
+ *
+ * 6. Redistributions of any form whatsoever must retain the following
+ *    acknowledgment:
+ *    "This product includes software developed by the OpenSSL Project
+ *    for use in the OpenSSL Toolkit (http://www.openssl.org/)"
+ *
+ * THIS SOFTWARE IS PROVIDED BY THE OpenSSL PROJECT ``AS IS'' AND ANY
+ * EXPRESSED OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
+ * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR
+ * PURPOSE ARE DISCLAIMED.  IN NO EVENT SHALL THE OpenSSL PROJECT OR
+ * ITS CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
+ * SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT
+ * NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES;
+ * LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
+ * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT,
+ * STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE)
+ * ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED
+ * OF THE POSSIBILITY OF SUCH DAMAGE.
+ * ====================================================================
+ *
+ * This product includes cryptographic software written by Eric Young
+ * (eay at cryptsoft.com).  This product includes software written by Tim
+ * Hudson (tjh at cryptsoft.com).
+ *
+ */
+
+#include <stdlib.h>
+#include <openssl/bio.h>
+#include <openssl/ssl.h>
+#include <openssl/err.h>
+
+
+#define TOTAL_NUM_TESTS                         2
+#define TEST_SSL_CTX                            0
+
+#define SSLV2ON                                 1
+#define SSLV2OFF                                0
+
+SSL_CONF_CTX *confctx;
+SSL_CTX *ctx;
+SSL *ssl;
+
+static int checksslv2(int test, int sslv2)
+{
+    int options;
+    if (test == TEST_SSL_CTX) {
+        options = SSL_CTX_get_options(ctx);
+    } else {
+        options = SSL_get_options(ssl);
+    }
+    return ((options & SSL_OP_NO_SSLv2) == 0) ^ (sslv2 == SSLV2OFF);
+}
+
+int main(int argc, char *argv[])
+{
+    BIO *err;
+    int testresult = 0;
+    int currtest;
+
+    SSL_library_init();
+    SSL_load_error_strings();
+
+    err = BIO_new_fp(stderr, BIO_NOCLOSE | BIO_FP_TEXT);
+
+    CRYPTO_malloc_debug_init();
+    CRYPTO_set_mem_debug_options(V_CRYPTO_MDEBUG_ALL);
+    CRYPTO_mem_ctrl(CRYPTO_MEM_CHECK_ON);
+
+
+    confctx = SSL_CONF_CTX_new();
+    ctx = SSL_CTX_new(SSLv23_method());
+    ssl = SSL_new(ctx);
+    if (confctx == NULL || ctx == NULL)
+        goto end;
+
+    SSL_CONF_CTX_set_flags(confctx, SSL_CONF_FLAG_FILE
+                                    | SSL_CONF_FLAG_CLIENT
+                                    | SSL_CONF_FLAG_SERVER);
+
+    /*
+     * For each test set up an SSL_CTX and SSL and see whether SSLv2 is enabled
+     * as expected after various SSL_CONF_cmd("Protocol", ...) calls.
+     */
+    for (currtest = 0; currtest < TOTAL_NUM_TESTS; currtest++) {
+        BIO_printf(err, "SSLv2 CONF Test number %d\n", currtest);
+        if (currtest == TEST_SSL_CTX)
+            SSL_CONF_CTX_set_ssl_ctx(confctx, ctx);
+        else
+            SSL_CONF_CTX_set_ssl(confctx, ssl);
+
+        /* SSLv2 should be off by default */
+        if (!checksslv2(currtest, SSLV2OFF)) {
+            BIO_printf(err, "SSLv2 CONF Test: Off by default test FAIL\n");
+            goto end;
+        }
+
+        if (SSL_CONF_cmd(confctx, "Protocol", "ALL") != 2
+                || !SSL_CONF_CTX_finish(confctx)) {
+            BIO_printf(err, "SSLv2 CONF Test: SSL_CONF command FAIL\n");
+            goto end;
+        }
+
+        /* Should still be off even after ALL Protocols on */
+        if (!checksslv2(currtest, SSLV2OFF)) {
+            BIO_printf(err, "SSLv2 CONF Test: Off after config #1 FAIL\n");
+            goto end;
+        }
+
+        if (SSL_CONF_cmd(confctx, "Protocol", "SSLv2") != 2
+                || !SSL_CONF_CTX_finish(confctx)) {
+            BIO_printf(err, "SSLv2 CONF Test: SSL_CONF command FAIL\n");
+            goto end;
+        }
+
+        /* Should still be off even if explicitly asked for */
+        if (!checksslv2(currtest, SSLV2OFF)) {
+            BIO_printf(err, "SSLv2 CONF Test: Off after config #2 FAIL\n");
+            goto end;
+        }
+
+        if (SSL_CONF_cmd(confctx, "Protocol", "-SSLv2") != 2
+                || !SSL_CONF_CTX_finish(confctx)) {
+            BIO_printf(err, "SSLv2 CONF Test: SSL_CONF command FAIL\n");;
+            goto end;
+        }
+
+        if (!checksslv2(currtest, SSLV2OFF)) {
+            BIO_printf(err, "SSLv2 CONF Test: Off after config #3 FAIL\n");
+            goto end;
+        }
+
+        if (currtest == TEST_SSL_CTX)
+            SSL_CTX_clear_options(ctx, SSL_OP_NO_SSLv2);
+        else
+            SSL_clear_options(ssl, SSL_OP_NO_SSLv2);
+
+        if (!checksslv2(currtest, SSLV2ON)) {
+            BIO_printf(err, "SSLv2 CONF Test: On after clear FAIL\n");
+            goto end;
+        }
+
+        if (SSL_CONF_cmd(confctx, "Protocol", "ALL") != 2
+                || !SSL_CONF_CTX_finish(confctx)) {
+            BIO_printf(err, "SSLv2 CONF Test: SSL_CONF command FAIL\n");
+            goto end;
+        }
+
+        /* Option has been cleared and config says have SSLv2 so should be on */
+        if (!checksslv2(currtest, SSLV2ON)) {
+            BIO_printf(err, "SSLv2 CONF Test: On after config #1 FAIL\n");
+            goto end;
+        }
+
+        if (SSL_CONF_cmd(confctx, "Protocol", "SSLv2") != 2
+                || !SSL_CONF_CTX_finish(confctx)) {
+            BIO_printf(err, "SSLv2 CONF Test: SSL_CONF command FAIL\n");
+            goto end;
+        }
+
+        /* Option has been cleared and config says have SSLv2 so should be on */
+        if (!checksslv2(currtest, SSLV2ON)) {
+            BIO_printf(err, "SSLv2 CONF Test: On after config #2 FAIL\n");
+            goto end;
+        }
+
+        if (SSL_CONF_cmd(confctx, "Protocol", "-SSLv2") != 2
+                || !SSL_CONF_CTX_finish(confctx)) {
+            BIO_printf(err, "SSLv2 CONF Test: SSL_CONF command FAIL\n");
+            goto end;
+        }
+
+        /* Option has been cleared but config says no SSLv2 so should be off */
+        if (!checksslv2(currtest, SSLV2OFF)) {
+            BIO_printf(err, "SSLv2 CONF Test: Off after config #4 FAIL\n");
+            goto end;
+        }
+
+    }
+
+    testresult = 1;
+
+ end:
+    SSL_free(ssl);
+    SSL_CTX_free(ctx);
+    SSL_CONF_CTX_free(confctx);
+
+    if (!testresult) {
+        printf("SSLv2 CONF test: FAILED (Test %d)\n", currtest);
+        ERR_print_errors(err);
+    } else {
+        printf("SSLv2 CONF test: PASSED\n");
+    }
+
+    ERR_free_strings();
+    ERR_remove_thread_state(NULL);
+    EVP_cleanup();
+    CRYPTO_cleanup_all_ex_data();
+    CRYPTO_mem_leaks(err);
+    BIO_free(err);
+
+    return testresult ? EXIT_SUCCESS : EXIT_FAILURE;
+}
diff --git a/test/Makefile b/test/Makefile
index b180971..e566bab 100644
--- a/test/Makefile
+++ b/test/Makefile
@@ -70,6 +70,7 @@ HEARTBEATTEST=  heartbeat_test
 CONSTTIMETEST=  constant_time_test
 VERIFYEXTRATEST=	verify_extra_test
 CLIENTHELLOTEST=	clienthellotest
+SSLV2CONFTEST = 	sslv2conftest
 
 TESTS=		alltests
 
@@ -83,7 +84,7 @@ EXE=	$(BNTEST)$(EXE_EXT) $(ECTEST)$(EXE_EXT)  $(ECDSATEST)$(EXE_EXT) $(ECDHTEST)
 	$(EVPTEST)$(EXE_EXT) $(EVPEXTRATEST)$(EXE_EXT) $(IGETEST)$(EXE_EXT) $(JPAKETEST)$(EXE_EXT) $(SRPTEST)$(EXE_EXT) \
 	$(ASN1TEST)$(EXE_EXT) $(V3NAMETEST)$(EXE_EXT) $(HEARTBEATTEST)$(EXE_EXT) \
 	$(CONSTTIMETEST)$(EXE_EXT) $(VERIFYEXTRATEST)$(EXE_EXT) \
-	$(CLIENTHELLOTEST)$(EXE_EXT)
+	$(CLIENTHELLOTEST)$(EXE_EXT) $(SSLV2CONFTEST)$(EXE_EXT)
 
 # $(METHTEST)$(EXE_EXT)
 
@@ -97,7 +98,7 @@ OBJ=	$(BNTEST).o $(ECTEST).o  $(ECDSATEST).o $(ECDHTEST).o $(IDEATEST).o \
 	$(BFTEST).o  $(SSLTEST).o  $(DSATEST).o  $(EXPTEST).o $(RSATEST).o \
 	$(EVPTEST).o $(EVPEXTRATEST).o $(IGETEST).o $(JPAKETEST).o $(ASN1TEST).o $(V3NAMETEST).o \
 	$(HEARTBEATTEST).o $(CONSTTIMETEST).o $(VERIFYEXTRATEST).o \
-	$(CLIENTHELLOTEST).o
+	$(CLIENTHELLOTEST).o  $(SSLV2CONFTEST).o
 
 SRC=	$(BNTEST).c $(ECTEST).c  $(ECDSATEST).c $(ECDHTEST).c $(IDEATEST).c \
 	$(MD2TEST).c  $(MD4TEST).c $(MD5TEST).c \
@@ -108,7 +109,7 @@ SRC=	$(BNTEST).c $(ECTEST).c  $(ECDSATEST).c $(ECDHTEST).c $(IDEATEST).c \
 	$(BFTEST).c  $(SSLTEST).c $(DSATEST).c   $(EXPTEST).c $(RSATEST).c \
 	$(EVPTEST).c $(EVPEXTRATEST).c $(IGETEST).c $(JPAKETEST).c $(SRPTEST).c $(ASN1TEST).c \
 	$(V3NAMETEST).c $(HEARTBEATTEST).c $(CONSTTIMETEST).c $(VERIFYEXTRATEST).c \
-	$(CLIENTHELLOTEST).c
+	$(CLIENTHELLOTEST).c  $(SSLV2CONFTEST).c
 
 EXHEADER= 
 HEADER=	testutil.h $(EXHEADER)
@@ -152,7 +153,7 @@ alltests: \
 	test_gen test_req test_pkcs7 test_verify test_dh test_dsa \
 	test_ss test_ca test_engine test_evp test_evp_extra test_ssl test_tsa test_ige \
 	test_jpake test_srp test_cms test_ocsp test_v3name test_heartbeat \
-	test_constant_time test_verify_extra test_clienthello
+	test_constant_time test_verify_extra test_clienthello test_sslv2conftest
 
 test_evp: $(EVPTEST)$(EXE_EXT) evptests.txt
 	../util/shlib_wrap.sh ./$(EVPTEST) evptests.txt
@@ -361,6 +362,10 @@ test_clienthello: $(CLIENTHELLOTEST)$(EXE_EXT)
 	@echo $(START) $@
 	../util/shlib_wrap.sh ./$(CLIENTHELLOTEST)
 
+test_sslv2conftest: $(SSLV2CONFTEST)$(EXE_EXT)
+	@echo $(START) $@
+	../util/shlib_wrap.sh ./$(SSLV2CONFTEST)
+
 lint:
 	lint -DLINT $(INCLUDES) $(SRC)>fluff
 
@@ -538,6 +543,9 @@ $(VERIFYEXTRATEST)$(EXE_EXT): $(VERIFYEXTRATEST).o
 $(CLIENTHELLOTEST)$(EXE_EXT): $(CLIENTHELLOTEST).o
 	@target=$(CLIENTHELLOTEST) $(BUILD_CMD)
 
+$(SSLV2CONFTEST)$(EXE_EXT): $(SSLV2CONFTEST).o
+	@target=$(SSLV2CONFTEST) $(BUILD_CMD)
+
 #$(AESTEST).o: $(AESTEST).c
 #	$(CC) -c $(CFLAGS) -DINTERMEDIATE_VALUE_KAT -DTRACE_KAT_MCT $(AESTEST).c
 
@@ -848,6 +856,25 @@ ssltest.o: ../include/openssl/ssl3.h ../include/openssl/stack.h
 ssltest.o: ../include/openssl/symhacks.h ../include/openssl/tls1.h
 ssltest.o: ../include/openssl/x509.h ../include/openssl/x509_vfy.h
 ssltest.o: ../include/openssl/x509v3.h ssltest.c
+sslv2conftest.o: ../include/openssl/asn1.h ../include/openssl/bio.h
+sslv2conftest.o: ../include/openssl/buffer.h ../include/openssl/comp.h
+sslv2conftest.o: ../include/openssl/crypto.h ../include/openssl/dtls1.h
+sslv2conftest.o: ../include/openssl/e_os2.h ../include/openssl/ec.h
+sslv2conftest.o: ../include/openssl/ecdh.h ../include/openssl/ecdsa.h
+sslv2conftest.o: ../include/openssl/err.h ../include/openssl/evp.h
+sslv2conftest.o: ../include/openssl/hmac.h ../include/openssl/kssl.h
+sslv2conftest.o: ../include/openssl/lhash.h ../include/openssl/obj_mac.h
+sslv2conftest.o: ../include/openssl/objects.h ../include/openssl/opensslconf.h
+sslv2conftest.o: ../include/openssl/opensslv.h ../include/openssl/ossl_typ.h
+sslv2conftest.o: ../include/openssl/pem.h ../include/openssl/pem2.h
+sslv2conftest.o: ../include/openssl/pkcs7.h ../include/openssl/pqueue.h
+sslv2conftest.o: ../include/openssl/safestack.h ../include/openssl/sha.h
+sslv2conftest.o: ../include/openssl/srtp.h ../include/openssl/ssl.h
+sslv2conftest.o: ../include/openssl/ssl2.h ../include/openssl/ssl23.h
+sslv2conftest.o: ../include/openssl/ssl3.h ../include/openssl/stack.h
+sslv2conftest.o: ../include/openssl/symhacks.h ../include/openssl/tls1.h
+sslv2conftest.o: ../include/openssl/x509.h ../include/openssl/x509_vfy.h
+sslv2conftest.o: sslv2conftest.c
 v3nametest.o: ../e_os.h ../include/openssl/asn1.h ../include/openssl/bio.h
 v3nametest.o: ../include/openssl/buffer.h ../include/openssl/conf.h
 v3nametest.o: ../include/openssl/crypto.h ../include/openssl/e_os2.h
diff --git a/util/mk1mf.pl b/util/mk1mf.pl
index 973ad75..2629a1c 100755
--- a/util/mk1mf.pl
+++ b/util/mk1mf.pl
@@ -290,6 +290,7 @@ $cflags.=" -DOPENSSL_NO_HW"   if $no_hw;
 $cflags.=" -DOPENSSL_FIPS"    if $fips;
 $cflags.=" -DOPENSSL_NO_JPAKE"    if $no_jpake;
 $cflags.=" -DOPENSSL_NO_EC2M"    if $no_ec2m;
+$cflags.=" -DOPENSSL_NO_WEAK_SSL_CIPHERS"   if $no_weak_ssl;
 $cflags.= " -DZLIB" if $zlib_opt;
 $cflags.= " -DZLIB_SHARED" if $zlib_opt == 2;
 
@@ -1205,6 +1206,7 @@ sub read_options
 		"no-jpake" => \$no_jpake,
 		"no-ec2m" => \$no_ec2m,
 		"no-ec_nistp_64_gcc_128" => 0,
+		"no-weak-ssl-ciphers" => \$no_weak_ssl,
 		"no-err" => \$no_err,
 		"no-sock" => \$no_sock,
 		"no-krb5" => \$no_krb5,