[openssl-dev] [openssl-team] Discussion: design issue: async and -lpthread
Paul Dale
paul.dale at oracle.com
Tue Nov 24 00:49:26 UTC 2015
On Mon, 23 Nov 2015 11:11:37 PM Alessandro Ghedini wrote:
> Is this TLS connections?
Yes, this is just measuring the TLS handshake. Renegotiations predominately.
We deliberately didn't test the bulk symmetric crypto phase of the connection.
> I'd like to know more...
The data are a bit rough and ready but I've included what I can. I wasn't directly involved in taking these measurements, so Chinese whispers are entirely possible. I've been tasked with trying to find some performance enhancements.
The TLS stack results are:
stack CPU % connections/s
OpenSSL 85 11,935
atomic patch 22 16,465 proof of concept only, the stack is broken elsewhere
NSS 47 46,507 !!!!!
----------------------------------------------------------------------
The top ten bottlenecks identified by Sun Studio profiler.
There are before my atomic change.
Lock % 16 CPU % 41
+- EVP_MD_CTX_cleanup
+- EVP_PKEY_CTX_free
+- EVP_PKEY_free
+- CRYPTO_add_lock
+- locking_function
+- pthread_mutex_lock
Lock % 21 CPU % 50
+- EVP_MD_CTX_copy_ex
+- EVP_PKEY_CTX_dup
+- CRYPTO_add_lock
+- locking_function
+- pthread_mutex_lock
Lock % 5 CPU % -
+- EVP_PKEY_CTX_dup
+- CRYPTO_add_lock
+- locking_function
+- pthread_mutex_lock
Lock % 3 CPU % 3
+- EVP_PKEY_CTX_new
+- CRYPTO_add_lock
+- locking_function
+- pthread_mutex_lock
Lock % - CPU % 3
EVP_PKEY_free
Lock % 2 CPU % 2
pkey_hmac_copy
Lock % - CPU % 1
+- Connection::destroy()
+- Connection::close()
+- NZIO_Close
+- nzos_Destroy_Ctx
+- SSL_free
+- ssl_cert_free
+- ssl_cert_clear_certs
Lock % - CPU % 2
+- ERR_clear_error
+- ERR_get_state
+- int_thread_get_item
+- lh_retrieve
Locl % - CPU % 1
X509_check_private_key
Lock % - CPU % 1
sha1_block_data_order_ssse3
----------------------------------------------------------------------
After making all CRYPTO_add calls atomic, which breaks things elsewhere in
the TLS stack, we obtained the above performance improvement and these new
bottlenecks -- take these with more of a grain of salt:
Lock % 6 CPU % 13
SSL_new
+- 46.000 (7%) nzos_Create_Ctx
+- 38.530 (6%) SSL_new
+- 35.370 (6%) CRYPTO_new_ex_data
+- 35.310 (6%) int_new_ex_data
+- 34.360 (5%) def_get_class
+- 34.100 (5%) CRYPTO_lock
+- 34.020 (5%) locking_function
+- 32.620 (5%) pthread_mutex_lock
Lock % 6 CPU % 12
SSL_SESSION_new
+- 40.090 (6%) d2i_SSL_SESSION
+- 34.420 (6%) SSL_SESSION_new
+- 33.930 (5%) CRYPTO_new_ex_data
+- 33.880 (5%) int_new_ex_data
+- 32.630 (5%) def_get_class
+- 32.390 (5%) CRYPTO_lock
+- 32.370 (5%) locking_function
+- 30.860 (5%) pthread_mutex_lock
Lock % 9 CPU %22
BIO_new
+ BIO_new 1%
+ BIO_set 4%
+ CRYPTO_new_ex_data 5%
+ int_new_ex_data
+ def_get_class
+ CRYPTO_lock
+ locking_function
+ pthread_mutex_lock
Lock % 11 CPU % 26
BIO_free
+- BIO_free
+- CRYPTO_free_ex_data
+- int_free_ex_data
+- def_get_class
+- CRYPTO_lock
+- locking_function
+- pthread_mutex_lock
Lock % 17
tls1_PRF
+- tls1_PRF
+- tls1_P_hash
called from
+- 62.400 (10%) ssl3_get_finished
+- 62.120 (10%) ssl3_get_message
+- 44.940 (7%) ssl3_read_bytes
+- 28.530 (5%) ssl3_do_change_cipher_spec
+- 23.410 (4%) tls1_final_finish_mac
or
+- 16.300 (3%) ssl3_take_mac
+- 16.260 (3%) tls1_final_finish_mac
or
+- 17.430 (3%) ssl3_send_finished
+- 15.240 (2%) tls1_final_finish_mac
or
+- 65.480 (10%) tls1_setup_key_block
+- 62.890 (10%) tls1_generate_key_block
Lock % 6
ERR_clear_error
+- ERR_clear_error
+- ERR_get_state
+- int_thread_get_item
+- lh_retrieve
+- getrn
CPU % 18.93
SSL_free
CPU % 5.5
SSL_SESSION_free
> As Matt mentioned, I'm curently working exactly on this. Would it be possible
> for you to share your testing method and code? I'd like to verify that my
> patches actually go in the right direction before having them merged (or maybe
> you could do your tests on top of my patches, they should mostly work fine even
> if the work is not complete yet, I think).
I'm not sure I can share code and associated infrastructure at this point, we'd like to but we need approvals through the various internal channels.
It might be possible for us to run these tests against your patches and mail you the results (less than ideal but probably workable), I'd have to ask the engineer who did this to see if they can justify the time involved.
The bottom line is OpenSSL wants for finer grain locking, which the atomic operations provided. Having locks in the various contexts would achieve the same result and be less platform/compiler specific.
For reference, my proof of concept atomic patch was:
diff --git a/include/openssl/crypto.h b/include/openssl/crypto.h
index 56afc51..803d7b7 100644
--- a/include/openssl/crypto.h
+++ b/include/openssl/crypto.h
@@ -220,8 +220,13 @@ extern "C" {
CRYPTO_lock(CRYPTO_LOCK|CRYPTO_READ,type,__FILE__,__LINE__)
# define CRYPTO_r_unlock(type) \
CRYPTO_lock(CRYPTO_UNLOCK|CRYPTO_READ,type,__FILE__,__LINE__)
-# define CRYPTO_add(addr,amount,type) \
- CRYPTO_add_lock(addr,amount,type,__FILE__,__LINE__)
+# if defined(__GNUC__) || defined(__INTEL_COMPILER)
+# define CRYPTO_add(addr,amount,type) \
+ __sync_add_and_fetch(addr, amount)
+# else
+# define CRYPTO_add(addr,amount,type) \
+ CRYPTO_add_lock(addr,amount,type,__FILE__,__LINE__)
+# endif
# endif
# else
# define CRYPTO_w_lock(a)
This should never be applied, it breaks things and is quick and ugly.
Regards,
Pauli
--
Oracle
Dr Paul Dale | Cryptographer | Network Security & Encryption
Phone +61 7 3031 7217
Oracle Australia
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://mta.openssl.org/pipermail/openssl-dev/attachments/20151124/130f6126/attachment-0001.html>
More information about the openssl-dev
mailing list