[openssl-dev] [openssl-team] Discussion: design issue: async and -lpthread

Paul Dale paul.dale at oracle.com
Tue Nov 24 00:49:26 UTC 2015


On Mon, 23 Nov 2015 11:11:37 PM Alessandro Ghedini wrote:
> Is this TLS connections?

Yes, this is just measuring the TLS handshake.  Renegotiations predominately.
We deliberately didn't test the bulk symmetric crypto phase of the connection.


> I'd like to know more...

The data are a bit rough and ready but I've included what I can.  I wasn't directly involved in taking these measurements, so Chinese whispers are entirely possible.  I've been tasked with trying to find some performance enhancements.

    The TLS stack results are:

    stack         CPU %  connections/s
    OpenSSL         85      11,935
    atomic patch    22      16,465    proof of concept only, the stack is broken elsewhere
    NSS             47      46,507    !!!!!




    ----------------------------------------------------------------------
    The top ten bottlenecks identified by Sun Studio profiler.
    There are before my atomic change.

    Lock % 16   CPU % 41
     +-   EVP_MD_CTX_cleanup
       +-    EVP_PKEY_CTX_free
        +-    EVP_PKEY_free
         +-  CRYPTO_add_lock
          +-   locking_function
           +-  pthread_mutex_lock


    Lock % 21   CPU % 50
    +-   EVP_MD_CTX_copy_ex
      +-    EVP_PKEY_CTX_dup
       +-    CRYPTO_add_lock
        +-  locking_function
         +-     pthread_mutex_lock


    Lock % 5   CPU % -
    +- EVP_PKEY_CTX_dup
      +-   CRYPTO_add_lock
       +-     locking_function
        +-     pthread_mutex_lock


    Lock % 3   CPU % 3
      +-     EVP_PKEY_CTX_new
         +-    CRYPTO_add_lock
          +-   locking_function
           +-    pthread_mutex_lock


    Lock % -  CPU % 3
    EVP_PKEY_free


    Lock % 2   CPU % 2
    pkey_hmac_copy


    Lock % -   CPU % 1
     +-   Connection::destroy()
        +-  Connection::close()
          +-   NZIO_Close
             +-   nzos_Destroy_Ctx
                +-  SSL_free
                  +-   ssl_cert_free
                    +-   ssl_cert_clear_certs


    Lock % -   CPU % 2
    +-     ERR_clear_error
      +-  ERR_get_state
        +-    int_thread_get_item
          +-    lh_retrieve


    Locl % -   CPU % 1
    X509_check_private_key


    Lock % -   CPU % 1
    sha1_block_data_order_ssse3

    ----------------------------------------------------------------------

    After making all CRYPTO_add calls atomic, which breaks things elsewhere in
    the TLS stack, we obtained the above performance improvement and these new
    bottlenecks -- take these with more of a grain of salt:

    Lock % 6   CPU % 13
    SSL_new	
     +-  46.000   (7%)    nzos_Create_Ctx
      +-  38.530   (6%)    SSL_new
      +-  35.370   (6%)    CRYPTO_new_ex_data
     +-  35.310   (6%)    int_new_ex_data
      +-  34.360   (5%)    def_get_class
     +-  34.100   (5%)    CRYPTO_lock
    +-  34.020   (5%)    locking_function
    +-  32.620   (5%)    pthread_mutex_lock


    Lock % 6   CPU % 12
    SSL_SESSION_new
     +-  40.090   (6%)    d2i_SSL_SESSION
     +-  34.420   (6%)    SSL_SESSION_new
     +-  33.930   (5%)    CRYPTO_new_ex_data
     +-  33.880   (5%)    int_new_ex_data
     +-  32.630   (5%)    def_get_class
     +-  32.390   (5%)    CRYPTO_lock
     +-  32.370   (5%)    locking_function
     +-  30.860   (5%)    pthread_mutex_lock


    Lock % 9   CPU %22
    BIO_new	
    +  BIO_new                  1%
    +  BIO_set                  4%
    +  CRYPTO_new_ex_data       5%
    +  int_new_ex_data 
    +  def_get_class 
    +  CRYPTO_lock 
    +  locking_function 
    +  pthread_mutex_lock


    Lock % 11   CPU % 26
    BIO_free	
    +-  BIO_free
    +-  CRYPTO_free_ex_data
    +-  int_free_ex_data
    +-  def_get_class
    +-  CRYPTO_lock
    +-  locking_function
    +-  pthread_mutex_lock


    Lock % 17
    tls1_PRF	 
    +- tls1_PRF
    +-  tls1_P_hash

    called from

    +-  62.400   (10%)    ssl3_get_finished
      +-  62.120   (10%)    ssl3_get_message
     +-  44.940   (7%)    ssl3_read_bytes
    +-  28.530   (5%)    ssl3_do_change_cipher_spec
    +-  23.410   (4%)    tls1_final_finish_mac

    or

      +-  16.300   (3%)    ssl3_take_mac
       +-  16.260   (3%)    tls1_final_finish_mac

    or

     +-  17.430   (3%)    ssl3_send_finished
     +-  15.240   (2%)    tls1_final_finish_mac

    or

    +-  65.480   (10%)    tls1_setup_key_block
      +-  62.890   (10%)    tls1_generate_key_block


    Lock % 6
    ERR_clear_error
    +-   ERR_clear_error
    +-   ERR_get_state
    +-  int_thread_get_item
    +-    lh_retrieve
     +-  getrn


    CPU % 18.93
    SSL_free


    CPU % 5.5
    SSL_SESSION_free



> As Matt mentioned, I'm curently working exactly on this. Would it be possible
> for you to share your testing method and code? I'd like to verify that my
> patches actually go in the right direction before having them merged (or maybe
> you could do your tests on top of my patches, they should mostly work fine even
> if the work is not complete yet, I think).

I'm not sure I can share code and associated infrastructure at this point, we'd like to but we need approvals through the various internal channels.
It might be possible for us to run these tests against your patches and mail you the results (less than ideal but probably workable), I'd have to ask the engineer who did this to see if they can justify the time involved.


The bottom line is OpenSSL wants for finer grain locking, which the atomic operations provided.  Having locks in the various contexts would achieve the same result and be less platform/compiler specific.


For reference, my proof of concept atomic patch was:
diff --git a/include/openssl/crypto.h b/include/openssl/crypto.h
index 56afc51..803d7b7 100644
--- a/include/openssl/crypto.h
+++ b/include/openssl/crypto.h
@@ -220,8 +220,13 @@ extern "C" {
         CRYPTO_lock(CRYPTO_LOCK|CRYPTO_READ,type,__FILE__,__LINE__)
 #   define CRYPTO_r_unlock(type)   \
         CRYPTO_lock(CRYPTO_UNLOCK|CRYPTO_READ,type,__FILE__,__LINE__)
-#   define CRYPTO_add(addr,amount,type)    \
-        CRYPTO_add_lock(addr,amount,type,__FILE__,__LINE__)
+#   if defined(__GNUC__) || defined(__INTEL_COMPILER)
+#    define CRYPTO_add(addr,amount,type)    \
+         __sync_add_and_fetch(addr, amount)
+#   else
+#    define CRYPTO_add(addr,amount,type)    \
+         CRYPTO_add_lock(addr,amount,type,__FILE__,__LINE__)
+#   endif
 #  endif
 # else
 #  define CRYPTO_w_lock(a)
This should never be applied, it breaks things and is quick and ugly.


Regards,

Pauli

-- 
Oracle
Dr Paul Dale | Cryptographer | Network Security & Encryption 
Phone +61 7 3031 7217
Oracle Australia

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://mta.openssl.org/pipermail/openssl-dev/attachments/20151124/130f6126/attachment-0001.html>


More information about the openssl-dev mailing list