Saturday, February 25, 2017

MySQL Support Engineer's Chronicles, Issue #5

A lot of time has passed since my previous post in this series. I was busy with work, participating in FOSDEM, blogging about profilers and sharing various lists of MySQL bugs. But I do not plan to stop writing about my usual weeks of doing a support engineer's job. So, it's time for the next post in this series, based on my random notes taken here and there during the week.

This week started for me with checking recent MySQL bug reports (actually, I do that every day). I noted a recent report by Roel, Bug #85065. Unfortunately it was quickly marked as "Won't fix", and I tend to agree with Laurynas Biveinis that this was probably a wrong decision. See the discussion here. Out of memory conditions happen in production, and it would be great for MySQL to process them properly, not just crash randomly. Roel does a favor to all of us by checking debug builds with additional unusual failure conditions introduced, and this work should not be ignored based on some formal approach.

It's common knowledge already that I try to avoid not only all kinds of clusters, but all kinds of connection pools as well, by all means. Sometimes I still fail, and when I find myself in the unlucky situation of dealing with a connection pool in Apache Tomcat, I consider this reference useful.

This week I had a useful discussion on the need for xtrabackup (version 2.4.6 was released this week) on Windows (customers ask about it once in a while, and some even try to use older binaries of version 1.6.x or so from Percona) and any alternatives. I was pointed to this blog post by Anders Karlsson. I remember reading something about using the Volume Snapshot Service on Windows to back up MySQL back at Oracle in 2012, and really have to just try how it works based on the blog above and this reference. But I still think that, at least without a command line wrapper like mylvmbackup, this approach is hardly easy to follow for an average Windows user (like me) and is not on par with xtrabackup for ease of use etc.

I spent some time building new releases of Percona Server, MariaDB and MySQL 5.6 from Facebook on my Fedora 25 box (this is also one of the first things I do every morning, for software that got updates on GitHub), so just to remind myself, here is my usual cmake command line for 5.7-based builds:

cmake . -DCMAKE_INSTALL_PREFIX=/home/openxs/dbs/5.7 -DCMAKE_BUILD_TYPE=RelWithDebInfo -DBUILD_CONFIG=mysql_release -DFEATURE_SET=community -DWITH_EMBEDDED_SERVER=OFF -DDOWNLOAD_BOOST=1 -DWITH_BOOST=/home/openxs/boost
I do not have it automated, so sometimes the exact command line gets lost in the history of bash... That never happens with MySQL from Facebook or MariaDB (as we see new commits almost every day), but easily happens with MySQL 5.7 (because they push commits only after the official releases) or Percona Server (as I have not cared that much about it for a year already...). This time I noted that one actually needs the numactl-devel package to build Percona Server 5.7. That was not the case last time I tried.
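
A trivial wrapper script would prevent losing those options; a minimal sketch is below. The paths are the ones from my command line above and would need adjusting, and the final cmake call is left as an echo so nothing is built by accident:

```shell
# A tiny wrapper so the usual cmake options for 5.7-based builds are
# not lost in bash history. Paths below match my box; adjust as needed.
PREFIX=${PREFIX:-/home/openxs/dbs/5.7}
BOOST=${BOOST:-/home/openxs/boost}

CMAKE_OPTS="-DCMAKE_INSTALL_PREFIX=$PREFIX \
-DCMAKE_BUILD_TYPE=RelWithDebInfo \
-DBUILD_CONFIG=mysql_release \
-DFEATURE_SET=community \
-DWITH_EMBEDDED_SERVER=OFF \
-DDOWNLOAD_BOOST=1 -DWITH_BOOST=$BOOST"

# Print the full command; drop the echo to actually run the build
echo cmake . $CMAKE_OPTS
```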

I also paid attention to blog posts on two hot topics this week. The first one is comparing ProxySQL (here and there) to the recently released MaxScale 2.1. I probably have to build and test both myself, but ProxySQL still seems to rock, and I have no reason NOT to trust my former colleague René Cannaò, who provided a lot of details about his tests.

Another topic that is hot in February 2017 is group replication vs Galera. Przemek's recent post on this topic caused quite a hot discussion here. I still consider the post a good, useful and sincere attempt to compare these technologies and highlight problems a typical experienced Galera user may expect, suspect or note in group replication in MySQL 5.7.17. Make sure you read the comments to the blog post, as they help to clarify the implementation details, differences and strong features of group replication. IMHO, this post, along with the great series by LeFred and comments from the community, really helps to set proper expectations and a proper understanding of both technologies.

Last but not least, this week Mark Callaghan confirmed that InnoDB is still faster than MyRocks on all kinds of read-only all-in-memory workloads (he used sysbench 1.0 tests) on "small server"/older hardware. I had noted this for a specific use case of Bug #68079 almost a month ago...
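
For reference, a read-only sysbench 1.0 run of the kind used in such comparisons looks roughly like the sketch below. The table counts, sizes and thread counts are placeholders of mine, not the settings Mark actually used, and the commands are only printed here, not executed:

```shell
# Sketch of sysbench 1.0 read-only runs at several concurrency levels.
# All numbers below are placeholder values, not Mark's actual settings.
for t in 1 2 4 8; do
    # Print the command instead of running it (dry run)
    echo sysbench oltp_read_only \
        --mysql-user=root --mysql-db=test \
        --tables=8 --table-size=1000000 \
        --threads=$t --time=60 run
done
```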

This week I worked on several interesting customer issues (involving Galera, the CONNECT engine, security concerns, MySQL 5.1 and the limit of 64 secondary indexes per InnoDB table, etc.) that are not yet completely resolved, so I expect a lot of fun and useful links to note next week. Stay tuned!

Thursday, February 16, 2017

Fun with Bugs #48 - Group Replication Bugs and Missing Features as of MySQL 5.7.17

It seems a recent post on Group Replication by Vadim caused an interesting discussion on Facebook. I am NOT going to continue it here, but decided to present some facts, specifically, a list of public bug reports and feature requests for Group Replication (mostly "Verified", that is, accepted by Oracle engineers as valid) as of MySQL 5.7.17 (where the feature is promoted as GA), with just a few comments on some of the bugs.

The goal is to double check this post when the next Oracle MySQL 5.7.x release appears, to find out how much Oracle cares to fix the problems already identified by the MySQL Community.

So, here are the bugs and feature requests in the MySQL Server: Group Replication category (I've included only those reported for 5.7.17 in this post), as of today:
  • Bug #84315 - "add mysql group replication bootstrap option to init script". Fair enough, we have a way to bootstrap a cluster from a node in all implementations of Galera cluster. This feature request is still "Open".
  • Bug #84329 - "Some Class B private address is not permitted automatically". The manual seems to say it should be.
  • Bug #84337 - "Configure mgr with out root user encounter an error". Seems to be a duplicate of Bug #82687, that is, known for quite some time before GA.
  • Bug #84367 - "START GROUP_REPLICATION should be possible without error". You get an error if it was already started. Not a big deal, but some optional clause like IF NOT STARTED may help (in a similar way to DROP TABLE ... IF EXISTS) when coding scripts. Still "Open".
  • Bug #84710 - "group replication does not respect localhost IP". May be a problem for some MySQL Sandbox setups that hope to test group replication.
  • Bug #84727 - "GR: Partitioned Node Should Get Updated Status and not accept writes". Writes on partitioned node are accepted and hang (forever?). As a side note, I think Kenny Gryp deserves a special recognition as an early adopter of Group Replication feature who cared to report many problems noted in the process.
  • Bug #84728 - "Not Possible To Avoid MySQL From Starting When GR Cannot Start". We need this fixed to avoid split brain situations by default.
  • Bug #84729 - "Group Replication: Block Reads On Partitioned Nodes". In Galera, reads are blocked by default when a node is not considered a member of the cluster.
  • Bug #84730 - "Feature: Troubleshoot Group Replication Transaction Rollbacks". You can get a lot of information about conflicts in the case of Galera. Sorry that I have to compare, but when discussing the ways of dealing with common problems in cluster environments related to MySQL one cannot ignore existing solutions (NDB clusters and Galera clusters), so I just picked a (somewhat similar) technology that I know and have used (a bit). I think readers will do the same and try to base their conclusions on known examples.
  • Bug #84731 - "Group Replication: mysql client connections hang during group replication start". There is no reason to hang more than needed. We should just make sure reads and writes are NOT accepted until the node is a member of the cluster and in sync.
  • Bug #84733 - "Cannot Start GR with super_read_only=1". But this setting may be needed to make sure that there is only one master node in the cluster, no matter what happens to them...
  • Bug #84773 - "Flow control slows down throughput while a node join the cluster". Let me quote: "This is done this way by design, that is, request group to slow down while the member is fetching and applying data."
  • Bug #84774 - "Performance drop every 60 seconds". Now this sounds like a performance problem to work on, maybe by adding some tuning options.
  • Bug #84784 - "Group Replication nodes do not rejoin cluster after network connectivity issues". It would be really nice for nodes to try to re-join the cluster in case of short term connectivity issues. Galera nodes do not give up that fast. The bug is still not verified.
  • Bug #84785 - "Prevent Large Transactions in Group Replication". Galera somehow allows limiting transaction size. Not that there were no related bugs, but still the options exist.
  • Bug #84792 - "Idle GR Cluster: Member CPU hog". Not yet verified, but it seems that in some cases a node can use a notable share of CPU time for no clear/good reason.
  • Bug #84794 - "Cannot kill query inside GR". Well, you can do STOP GROUP_REPLICATION, but then it can be dangerous, because...
  • Bug #84795 - "STOP GROUP_REPLICATION sets super_read_only=off" - the node with stopped replication may allow to change the data...
  • Bug #84798 - "Group Replication can use some verbosity in the error log". Galera cluster nodes are too verbose, one gets kilometers of log records about everything, anything and nothing. Still, better to skip some usual outputs in the logs than get no evidence at all on what was going on...
  • Bug #84900 - "Getting inconsistent result on different nodes". Now, this is really unfortunate (for a "cluster"), and a somewhat similar problem was reported by Vadim before, see Bug #82481, and was supposed to be fixed. Anyway, the inconsistency is repeatable and looks somewhat scary.
  • Bug #84901 - "Can't execute transaction on second node". Best practice is to write on one and only one node (see Bug #83218). An attempt to write on the other node may fail...
Now, make your own conclusions about the maturity of Group Replication in MySQL 5.7.17. I have managed to avoid it so far, as I try to keep myself as far from any kinds of "clusters" as possible... That had not worked well with Galera, unfortunately - I have to read its verbose logs on a daily basis.
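
To illustrate the kind of check a proxy or script could do today to work around some of the problems above (notably Bug #84728 and Bug #84729), here is a minimal sketch. The function name is my own invention; the queries in the comments assume a 5.7.17 node with group replication configured:

```shell
# Sketch of a health check before routing traffic to a Group Replication
# node: it should be ONLINE in performance_schema.replication_group_members
# and not super_read_only. The two input values would come from queries like:
#   mysql -NBe "SELECT MEMBER_STATE FROM performance_schema.replication_group_members
#               WHERE MEMBER_ID = @@server_uuid"
#   mysql -NBe "SELECT @@super_read_only"
gr_node_writable() {
    member_state="$1"
    super_read_only="$2"
    [ "$member_state" = "ONLINE" ] && [ "$super_read_only" = "0" ]
}

if gr_node_writable "ONLINE" "0"; then
    echo "route writes to this node"
else
    echo "skip this node"
fi
# prints: route writes to this node
```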

Wednesday, February 8, 2017

More on MyRocks Performance for Bug #68079 Case

My previous post on MyRocks was intended to provide some background details for a couple of slides for my FOSDEM talk on profiling MySQL. The results and performance demonstrated by MyRocks vs InnoDB from MySQL 5.7.17 were not really important for showing how perf helps to understand where the time was spent while executing one specific query vs the other (with the same results, but a different plan), but they still caused a lot of comments from people who care, so I decided to double check and clarify a few more details.

First of all, it was noted that one may get the same execution plan for both queries (even in the INNER JOIN case we may end up with filesort and access only via PRIMARY keys), like this:
# bin/mysql test -e 'explain select count(*), category from task inner join incident on task.sys_id=incident.sys_id group by incident.category\G'
*************************** 1. row ***************************
id: 1
select_type: SIMPLE
table: task
type: index
possible_keys: PRIMARY
key: PRIMARY
key_len: 96
ref: NULL
rows: 8092
Extra: Using index; Using temporary; Using filesort
*************************** 2. row ***************************
id: 1
select_type: SIMPLE
table: incident
type: eq_ref
possible_keys: PRIMARY,incident_category
key: PRIMARY
key_len: 96
ref: test.task.sys_id
rows: 1
Extra: NULL
Indeed, I've ended up with this plan easily, sometimes, after executing ANALYZE for the tables involved. But this was not always the case (here we see it for InnoDB), and the next ANALYZE or OPTIMIZE may change the plan:

mysql> analyze table task;
+-----------+---------+----------+----------+
| Table     | Op      | Msg_type | Msg_text |
+-----------+---------+----------+----------+
| test.task | analyze | status   | OK       |
+-----------+---------+----------+----------+
1 row in set (0.06 sec)

mysql> analyze table incident;
+---------------+---------+----------+----------+
| Table         | Op      | Msg_type | Msg_text |
+---------------+---------+----------+----------+
| test.incident | analyze | status   | OK       |
+---------------+---------+----------+----------+
1 row in set (0.15 sec)

mysql> show table status like 'task'\G
*************************** 1. row ***************************
           Name: task
         Engine: InnoDB
        Version: 10
     Row_format: Dynamic
           Rows: 8452
 Avg_row_length: 60
    Data_length: 507904
Max_data_length: 0
   Index_length: 0
      Data_free: 0
 Auto_increment: NULL
    Create_time: 2017-01-31 12:00:38
    Update_time: NULL
     Check_time: NULL
      Collation: utf8_general_ci
       Checksum: NULL
 Create_options:
        Comment:
1 row in set (0.00 sec)

mysql> show table status like 'incident'\G
*************************** 1. row ***************************
           Name: incident
         Engine: InnoDB
        Version: 10
     Row_format: Dynamic
           Rows: 7786
 Avg_row_length: 204
    Data_length: 1589248
Max_data_length: 0
   Index_length: 507904
      Data_free: 4194304
 Auto_increment: NULL
    Create_time: 2017-01-31 12:00:35
    Update_time: NULL
     Check_time: NULL
      Collation: utf8_general_ci
       Checksum: NULL
 Create_options:
        Comment:
1 row in set (0.00 sec)

mysql> explain select count(*), category from task inner join incident on task.sys_id=incident.sys_id group by incident.category;
+----+-------------+----------+------------+--------+---------------------------+-------------------+---------+----------------------+------+----------+-------------+
| id | select_type | table    | partitions | type   | possible_keys             | key               | key_len | ref                  | rows | filtered | Extra       |
+----+-------------+----------+------------+--------+---------------------------+-------------------+---------+----------------------+------+----------+-------------+
|  1 | SIMPLE      | incident | NULL       | index  | PRIMARY,incident_category | incident_category | 123     | NULL                 | 7786 |   100.00 | Using index |
|  1 | SIMPLE      | task     | NULL       | eq_ref | PRIMARY                   | PRIMARY           | 96      | test.incident.sys_id |    1 |   100.00 | Using index |
+----+-------------+----------+------------+--------+---------------------------+-------------------+---------+----------------------+------+----------+-------------+
2 rows in set, 1 warning (0.02 sec)
Surely I can always force the plan needed, this way or that:

mysql> explain select count(*), category from task inner join incident force index(primary) on task.sys_id=incident.sys_id group by incident.category;
+----+-------------+----------+------------+--------+---------------------------+---------+---------+------------------+------+----------+----------------------------------------------+
| id | select_type | table    | partitions | type   | possible_keys             | key     | key_len | ref              | rows | filtered | Extra                                        |
+----+-------------+----------+------------+--------+---------------------------+---------+---------+------------------+------+----------+----------------------------------------------+
|  1 | SIMPLE      | task     | NULL       | index  | PRIMARY                   | PRIMARY | 96      | NULL             | 8452 |   100.00 | Using index; Using temporary; Using filesort |
|  1 | SIMPLE      | incident | NULL       | eq_ref | PRIMARY,incident_category | PRIMARY | 96      | test.task.sys_id |    1 |   100.00 | NULL                                         |
+----+-------------+----------+------------+--------+---------------------------+---------+---------+------------------+------+----------+----------------------------------------------+
2 rows in set, 1 warning (0.00 sec)
mysql> explain select count(*), category from task straight_join incident on task.sys_id=incident.sys_id group by incident.category;
+----+-------------+----------+------------+--------+---------------------------+---------+---------+------------------+------+----------+----------------------------------------------+
| id | select_type | table    | partitions | type   | possible_keys             | key     | key_len | ref              | rows | filtered | Extra                                        |
+----+-------------+----------+------------+--------+---------------------------+---------+---------+------------------+------+----------+----------------------------------------------+
|  1 | SIMPLE      | task     | NULL       | index  | PRIMARY                   | PRIMARY | 96      | NULL             | 8452 |   100.00 | Using index; Using temporary; Using filesort |
|  1 | SIMPLE      | incident | NULL       | eq_ref | PRIMARY,incident_category | PRIMARY | 96      | test.task.sys_id |    1 |   100.00 | NULL                                         |
+----+-------------+----------+------------+--------+---------------------------+---------+---------+------------------+------+----------+----------------------------------------------+
2 rows in set, 1 warning (0.00 sec)
I was more interested in comparing the scalability and performance of these different ways to execute the same query, just because usually it's considered good to have a covering secondary index used instead of creating a temporary table and filesort for this kind of query... The reality shows this is not always the case.

I've got suggestions to maybe tune something for the MyRocks case or to use the jemalloc library. I am open to any tuning suggestions from real experts, based on the fact that these are the default settings in my build:

[openxs@fc23 fb56]$ bin/mysql -uroot -e"show variables like 'rocks%'"
+-------------------------------------------------+----------------------+
| Variable_name                                   | Value                |
+-------------------------------------------------+----------------------+
| rocksdb_access_hint_on_compaction_start         | 1                    |
| rocksdb_advise_random_on_open                   | ON                   |
| rocksdb_allow_concurrent_memtable_write         | OFF                  |
| rocksdb_allow_mmap_reads                        | OFF                  |
| rocksdb_allow_mmap_writes                       | OFF                  |
| rocksdb_background_sync                         | OFF                  |
| rocksdb_base_background_compactions             | 1                    |
| rocksdb_block_cache_size                        | 536870912            |
| rocksdb_block_restart_interval                  | 16                   |
| rocksdb_block_size                              | 4096                 |
| rocksdb_block_size_deviation                    | 10                   |
| rocksdb_bulk_load                               | OFF                  |
| rocksdb_bulk_load_size                          | 1000                 |
| rocksdb_bytes_per_sync                          | 0                    |
| rocksdb_cache_index_and_filter_blocks           | ON                   |
| rocksdb_checksums_pct                           | 100                  |
| rocksdb_collect_sst_properties                  | ON                   |
| rocksdb_commit_in_the_middle                    | OFF                  |
| rocksdb_compact_cf                              |                      |
| rocksdb_compaction_readahead_size               | 0                    |
| rocksdb_compaction_sequential_deletes           | 0                    |
| rocksdb_compaction_sequential_deletes_count_sd  | OFF                  |
| rocksdb_compaction_sequential_deletes_file_size | 0                    |
| rocksdb_compaction_sequential_deletes_window    | 0                    |
| rocksdb_create_checkpoint                       |                      |
| rocksdb_create_if_missing                       | ON                   |
| rocksdb_create_missing_column_families          | OFF                  |
| rocksdb_datadir                                 | ./.rocksdb           |
| rocksdb_db_write_buffer_size                    | 0                    |
| rocksdb_deadlock_detect                         | OFF                  |
| rocksdb_debug_optimizer_no_zero_cardinality     | ON                   |
| rocksdb_default_cf_options                      |                      |
| rocksdb_delete_obsolete_files_period_micros     | 21600000000          |
| rocksdb_disabledatasync                         | OFF                  |
| rocksdb_enable_2pc                              | ON                   |
| rocksdb_enable_bulk_load_api                    | ON                   |
| rocksdb_enable_thread_tracking                  | OFF                  |
| rocksdb_enable_write_thread_adaptive_yield      | OFF                  |
| rocksdb_error_if_exists                         | OFF                  |
| rocksdb_flush_memtable_on_analyze               | ON                   |
| rocksdb_force_flush_memtable_now                | OFF                  |
| rocksdb_force_index_records_in_range            | 0                    |
| rocksdb_hash_index_allow_collision              | ON                   |
| rocksdb_index_type                              | kBinarySearch        |
| rocksdb_info_log_level                          | error_level          |
| rocksdb_is_fd_close_on_exec                     | ON                   |
| rocksdb_keep_log_file_num                       | 1000                 |
| rocksdb_lock_scanned_rows                       | OFF                  |
| rocksdb_lock_wait_timeout                       | 1                    |
| rocksdb_log_file_time_to_roll                   | 0                    |
| rocksdb_manifest_preallocation_size             | 4194304              |
| rocksdb_max_background_compactions              | 1                    |
| rocksdb_max_background_flushes                  | 1                    |
| rocksdb_max_log_file_size                       | 0                    |
| rocksdb_max_manifest_file_size                  | 18446744073709551615 |
| rocksdb_max_open_files                          | -1                   |
| rocksdb_max_row_locks                           | 1073741824           |
| rocksdb_max_subcompactions                      | 1                    |
| rocksdb_max_total_wal_size                      | 0                    |
| rocksdb_merge_buf_size                          | 67108864             |
| rocksdb_merge_combine_read_size                 | 1073741824           |
| rocksdb_new_table_reader_for_compaction_inputs  | OFF                  |
| rocksdb_no_block_cache                          | OFF                  |
| rocksdb_override_cf_options                     |                      |
| rocksdb_paranoid_checks                         | ON                   |
| rocksdb_pause_background_work                   | OFF                  |
| rocksdb_perf_context_level                      | 0                    |
| rocksdb_persistent_cache_path                   |                      |
| rocksdb_persistent_cache_size                   | 0                    |
| rocksdb_pin_l0_filter_and_index_blocks_in_cache | ON                   |
| rocksdb_print_snapshot_conflict_queries         | OFF                  |
| rocksdb_rate_limiter_bytes_per_sec              | 0                    |
| rocksdb_read_free_rpl_tables                    |                      |
| rocksdb_records_in_range                        | 0                    |
| rocksdb_seconds_between_stat_computes           | 3600                 |
| rocksdb_signal_drop_index_thread                | OFF                  |
| rocksdb_skip_bloom_filter_on_read               | OFF                  |
| rocksdb_skip_fill_cache                         | OFF                  |
| rocksdb_skip_unique_check_tables                | .*                   |
| rocksdb_stats_dump_period_sec                   | 600                  |
| rocksdb_store_row_debug_checksums               | OFF                  |
| rocksdb_strict_collation_check                  | ON                   |
| rocksdb_strict_collation_exceptions             |                      |
| rocksdb_table_cache_numshardbits                | 6                    |
| rocksdb_table_stats_sampling_pct                | 10                   |
| rocksdb_tmpdir                                  |                      |
| rocksdb_trace_sst_api                           | OFF                  |
| rocksdb_unsafe_for_binlog                       | OFF                  |
| rocksdb_use_adaptive_mutex                      | OFF                  |
| rocksdb_use_direct_reads                        | OFF                  |
| rocksdb_use_direct_writes                       | OFF                  |
| rocksdb_use_fsync                               | OFF                  |
| rocksdb_validate_tables                         | 1                    |
| rocksdb_verify_row_debug_checksums              | OFF                  |
| rocksdb_wal_bytes_per_sync                      | 0                    |
| rocksdb_wal_dir                                 |                      |
| rocksdb_wal_recovery_mode                       | 2                    |
| rocksdb_wal_size_limit_mb                       | 0                    |
| rocksdb_wal_ttl_seconds                         | 0                    |
| rocksdb_whole_key_filtering                     | ON                   |
| rocksdb_write_disable_wal                       | OFF                  |
| rocksdb_write_ignore_missing_column_families    | OFF                  |
| rocksdb_write_sync                              | OFF                  |
+-------------------------------------------------+----------------------+
and hardware details of my QuadCore box with HDD were shared previously. I tend to think no tuning is needed for such small tables and this corner read-only use case (except maybe the READ COMMITTED isolation level, which I plan to study separately, as initial attempts had not shown much difference).

Just for completeness, this is the minimal configuration file I've used and the commit I've built from (after creating a new clone yesterday, this time the default branch was fb-mysql-5.6.35):
[openxs@fc23 fb56]$ cat ~/fb56.cnf
[mysqld]
rocksdb
default-storage-engine=rocksdb
skip-innodb
default-tmp-storage-engine=MyISAM

log-bin
binlog-format=ROW

#transaction-isolation=READ-COMMITTED
#rocksdb_max_open_files=-1
#rocksdb_block_cache_size=128M

[openxs@fc23 mysql-5.6]$ git log -1
commit 95dd650dcfb07b91971b20cbac82d61f244a05a9
Author: Gunnar Kudrjavets <gunnarku@fb.com>
Date:   Fri Feb 3 10:07:25 2017 -0800
...
You can guess that I tried to add some more options along the lines of the fine manual, but they had not changed things much for the better, so I've commented them out.

So, with this fresh build based on MySQL 5.6.35 I've decided to study the influence of jemalloc on both queries with 4, 8 and 16 threads (the most interesting settings for the QuadCore). I've used the --malloc-lib option of mysqld_safe to force the use of jemalloc, and checked pmap output for the mysqld process for jemalloc entries to confirm that it was really preloaded:

[openxs@fc23 fb56]$ ps aux | grep mysqld
openxs   20444  0.0  0.0 119164  3264 pts/0    S    11:49   0:00 /bin/sh bin/mysqld_safe --defaults-file=/home/openxs/fb56.cnf --malloc-lib=/usr/lib64/libjemalloc.so
openxs   20630  0.8  0.2 170704 24284 pts/0    Sl   11:49   0:00 /home/openxs/dbs/fb56/bin/mysqld --defaults-file=/home/openxs/fb56.cnf --basedir=/home/openxs/dbs/fb56 --datadir=/home/openxs/dbs/fb56/data --plugin-dir=/home/openxs/dbs/fb56/lib/plugin --log-error=/home/openxs/dbs/fb56/data/fc23.err --pid-file=/home/openxs/dbs/fb56/data/fc23.pid
openxs   20750  0.0  0.0 118520   904 pts/0    S+   11:49   0:00 grep --color=auto mysqld

[openxs@fc23 fb56]$ sudo pmap 20630 | grep jemalloc
00007fb8637db000    268K r-x-- libjemalloc.so.2
00007fb86381e000   2044K ----- libjemalloc.so.2
00007fb863a1d000     12K r---- libjemalloc.so.2
00007fb863a20000      4K rw--- libjemalloc.so.2
So, here are the raw results of mysqlslap runs with jemalloc preloaded, after I made sure the queries use different plans (STRAIGHT_JOIN uses PRIMARY keys and filesort, while INNER JOIN uses the covering secondary index and avoids filesort):

[openxs@fc23 fb56]$ bin/mysqlslap -uroot --iterations=10 --concurrency=4 --create-schema=test --no-drop --number-of-queries=1000 --query='select count(*), category from task inner join incident on task.sys_id=incident.sys_id group by incident.category'
Benchmark
        Average number of seconds to run all queries: 19.695 seconds
        Minimum number of seconds to run all queries: 19.627 seconds
        Maximum number of seconds to run all queries: 19.769 seconds
        Number of clients running queries: 4
        Average number of queries per client: 250

[openxs@fc23 fb56]$ bin/mysqlslap -uroot --iterations=10 --concurrency=8 --create-schema=test --no-drop --number-of-queries=1000 --query='select count(*), category from task inner join incident on task.sys_id=incident.sys_id group by incident.category'
Benchmark
        Average number of seconds to run all queries: 19.332 seconds
        Minimum number of seconds to run all queries: 19.306 seconds
        Maximum number of seconds to run all queries: 19.349 seconds
        Number of clients running queries: 8
        Average number of queries per client: 125

[openxs@fc23 fb56]$ bin/mysqlslap -uroot --iterations=10 --concurrency=16 --create-schema=test --no-drop --number-of-queries=1000 --query='select count(*), category from task inner join incident on task.sys_id=incident.sys_id group by incident.category'
Benchmark
        Average number of seconds to run all queries: 19.276 seconds
        Minimum number of seconds to run all queries: 19.225 seconds
        Maximum number of seconds to run all queries: 19.309 seconds
        Number of clients running queries: 16
        Average number of queries per client: 62

[openxs@fc23 fb56]$ bin/mysqlslap -uroot --iterations=10 --concurrency=4 --create-schema=test --no-drop --number-of-queries=1000 --query='select count(*), category from task straight_join incident on task.sys_id=incident.sys_id group by incident.category'
Benchmark
        Average number of seconds to run all queries: 18.553 seconds
        Minimum number of seconds to run all queries: 18.336 seconds
        Maximum number of seconds to run all queries: 18.748 seconds
        Number of clients running queries: 4
        Average number of queries per client: 250

[openxs@fc23 fb56]$ bin/mysqlslap -uroot --iterations=10 --concurrency=8 --create-schema=test --no-drop --number-of-queries=1000 --query='select count(*), category from task straight_join incident on task.sys_id=incident.sys_id group by incident.category'
Benchmark
        Average number of seconds to run all queries: 18.227 seconds
        Minimum number of seconds to run all queries: 18.198 seconds
        Maximum number of seconds to run all queries: 18.269 seconds
        Number of clients running queries: 8
        Average number of queries per client: 125

[openxs@fc23 fb56]$ bin/mysqlslap -uroot --iterations=10 --concurrency=16 --create-schema=test --no-drop --number-of-queries=1000 --query='select count(*), category from task straight_join incident on task.sys_id=incident.sys_id group by incident.category'
Benchmark
        Average number of seconds to run all queries: 17.941 seconds
        Minimum number of seconds to run all queries: 17.886 seconds
        Maximum number of seconds to run all queries: 18.000 seconds
        Number of clients running queries: 16
        Average number of queries per client: 62


Now the same runs, but with the default memory allocator picked up (no jemalloc):

[openxs@fc23 fb56]$ ps aux | grep mysqld
openxs   27696  0.0  0.0 119164  3372 pts/0    S    13:50   0:00 /bin/sh bin/mysqld_safe --defaults-file=/home/openxs/fb56.cnf
openxs   27834  0.3  0.2 607012 23096 pts/0    Sl   13:50   0:00 ./bin/mysqld --defaults-file=/home/openxs/fb56.cnf --basedir=. --datadir=./data --plugin-dir=./lib/plugin --log-error=./data/fc23.err --pid-file=./data/fc23.pid
openxs   28029  0.0  0.0 118520   880 pts/0    S+   13:50   0:00 grep --color=auto mysqld
[openxs@fc23 fb56]$ pmap 27834 | grep jemalloc
[openxs@fc23 fb56]$
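For reference, this is roughly how jemalloc gets picked up in the first place: mysqld_safe supports a --malloc-lib option that preloads the library for mysqld. A minimal sketch, assuming jemalloc is installed at the path shown (the library path is an assumption; check where your distribution puts it):

```shell
# Start mysqld with jemalloc preloaded via mysqld_safe; the library
# path below is an assumption, adjust for your distribution.
bin/mysqld_safe --defaults-file=/home/openxs/fb56.cnf \
  --malloc-lib=/usr/lib64/libjemalloc.so.2 &

# Verify that the running mysqld actually mapped jemalloc:
pmap $(pidof mysqld) | grep jemalloc
```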

[openxs@fc23 fb56]$ bin/mysqlslap -uroot --iterations=10 --concurrency=4 --create-schema=test --no-drop --number-of-queries=1000 --query='select count(*), category from task inner join incident on task.sys_id=incident.sys_id group by incident.category'
Benchmark
        Average number of seconds to run all queries: 20.207 seconds
        Minimum number of seconds to run all queries: 19.833 seconds
        Maximum number of seconds to run all queries: 20.543 seconds
        Number of clients running queries: 4
        Average number of queries per client: 250

[openxs@fc23 fb56]$ bin/mysqlslap -uroot --iterations=10 --concurrency=8 --create-schema=test --no-drop --number-of-queries=1000 --query='select count(*), category from task inner join incident on task.sys_id=incident.sys_id group by incident.category'
Benchmark
        Average number of seconds to run all queries: 19.746 seconds
        Minimum number of seconds to run all queries: 19.500 seconds
        Maximum number of seconds to run all queries: 20.402 seconds
        Number of clients running queries: 8
        Average number of queries per client: 125

[openxs@fc23 fb56]$ bin/mysqlslap -uroot --iterations=10 --concurrency=16 --create-schema=test --no-drop --number-of-queries=1000 --query='select count(*), category from task inner join incident on task.sys_id=incident.sys_id group by incident.category'
Benchmark
        Average number of seconds to run all queries: 19.150 seconds
        Minimum number of seconds to run all queries: 19.084 seconds
        Maximum number of seconds to run all queries: 19.292 seconds
        Number of clients running queries: 16
        Average number of queries per client: 62

[openxs@fc23 fb56]$ bin/mysqlslap -uroot --iterations=10 --concurrency=4 --create-schema=test --no-drop --number-of-queries=1000 --query='select count(*), category from task straight_join incident on task.sys_id=incident.sys_id group by incident.category'
Benchmark
        Average number of seconds to run all queries: 18.692 seconds
        Minimum number of seconds to run all queries: 18.516 seconds
        Maximum number of seconds to run all queries: 18.889 seconds
        Number of clients running queries: 4
        Average number of queries per client: 250

[openxs@fc23 fb56]$ bin/mysqlslap -uroot --iterations=10 --concurrency=8 --create-schema=test --no-drop --number-of-queries=1000 --query='select count(*), category from task straight_join incident on task.sys_id=incident.sys_id group by incident.category'
Benchmark
        Average number of seconds to run all queries: 18.290 seconds
        Minimum number of seconds to run all queries: 18.250 seconds
        Maximum number of seconds to run all queries: 18.342 seconds
        Number of clients running queries: 8
        Average number of queries per client: 125

[openxs@fc23 fb56]$ bin/mysqlslap -uroot --iterations=10 --concurrency=16 --create-schema=test --no-drop --number-of-queries=1000 --query='select count(*), category from task straight_join incident on task.sys_id=incident.sys_id group by incident.category'
Benchmark
        Average number of seconds to run all queries: 18.155 seconds
        Minimum number of seconds to run all queries: 18.066 seconds
        Maximum number of seconds to run all queries: 18.397 seconds
        Number of clients running queries: 16
        Average number of queries per client: 62


Finally, this is how it looks on a chart:


I can make the following conclusions:
  1. Using jemalloc helps to get somewhat better throughput for both queries (I tend to consider the INNER JOIN case with 16 threads a fluctuation for now), but not by much.
  2. The new build based on the fb-mysql-5.6.35 branch as of yesterday morning demonstrates worse performance in the STRAIGHT_JOIN case (access via PRIMARY keys + filesort) compared to the previous build from January 30, 2017 based on the webscalesql-5.6.27.75 branch. I am yet to find out why it is so, as the INNER JOIN case performs a bit better.
  3. I am open to any MyRocks tuning suggestions for my old QuadCore box, as for now MySQL 5.7.17 performs notably better on the same hardware and data in the tables, with --no-defaults. No wonder: a lot of effort was spent by Oracle on Bug #68079, while all I've got so far for MyRocks is a public "Can't repeat" of a kind, plus simple tuning advice with limited impact in my case.
I am sure I am doing something wrong with MyRocks, just tell me what exactly...

Tuesday, January 31, 2017

Profiling MyRocks with perf: Good Old Bug #68079 Use Case

Almost a year ago I got really interested in MyRocks and built the MySQL server from Facebook that provides it from source. Since that time I build it from fresh sources a few times per week (as I've described in that post) and once in a while try to work with it and study some details or use cases. Today I'd like to discuss one of them that I've recently studied with the perf profiler.

This is not only because I am going to talk about applying profilers to all kinds and forks of MySQL at the FOSDEM 2017 MySQL & Friends Devroom this week. It seems profilers are the only way to study or troubleshoot MyRocks performance problems (if any) at the moment (when the provided status variables, SHOW ENGINE ROCKSDB STATUS etc. are not enough), as Facebook's MySQL does NOT have Performance Schema compiled in by default:

mysql> show engines;
+------------+---------+----------------------------------------------------------------+--------------+------+------------+
| Engine     | Support | Comment                                                        | Transactions | XA   | Savepoints |
+------------+---------+----------------------------------------------------------------+--------------+------+------------+
| ROCKSDB    | DEFAULT | RocksDB storage engine                                         | YES          | YES  | YES        |
| MRG_MYISAM | YES     | Collection of identical MyISAM tables                          | NO           | NO   | NO         |
| MyISAM     | YES     | MyISAM storage engine                                          | NO           | NO   | NO         |
| BLACKHOLE  | YES     | /dev/null storage engine (anything you write to it disappears) | NO           | NO   | NO         |
| CSV        | YES     | CSV storage engine                                             | NO           | NO   | NO         |
| MEMORY     | YES     | Hash based, stored in memory, useful for temporary tables      | NO           | NO   | NO         |
| ARCHIVE    | YES     | Archive storage engine                                         | NO           | NO   | NO         |
| FEDERATED  | NO      | Federated MySQL storage engine                                 | NULL         | NULL | NULL       |
| InnoDB     | NO      | Supports transactions, row-level locking, and foreign keys     | NULL         | NULL | NULL       |
+------------+---------+----------------------------------------------------------------+--------------+------+------------+
9 rows in set (0.00 sec)

This may change when the MyRocks builds from Percona or MariaDB become ready for common use, but for now perf is the way to go, and I like it!

I do not remember how exactly I came up with the idea to compare MyRocks scalability to InnoDB scalability on my old QuadCore box (now running Fedora 25):
[openxs@fc23 mysql-server]$ pt-summary
# Percona Toolkit System Summary Report ######################
        Date | 2017-01-30 11:32:53 UTC (local TZ: EET +0200)
    Hostname | fc23
      Uptime |  4:02,  3 users,  load average: 3.47, 2.12, 1.27
    Platform | Linux
     Release | Fedora release 25 (Twenty Five)
      Kernel | 4.9.5-200.fc25.x86_64

Architecture | CPU = 64-bit, OS = 64-bit
   Threading | NPTL 2.24
     SELinux | Enforcing
 Virtualized | No virtualization detected
# Processor ##################################################
  Processors | physical = 1, cores = 4, virtual = 4, hyperthreading = no
      Speeds | 4x2499.000
      Models | 4xIntel(R) Core(TM)2 Quad CPU Q8300 @ 2.50GHz
      Caches | 4x2048 KB

# Memory #####################################################
       Total | 7.8G
        Free | 2.9G
        Used | physical = 1.0G, swap allocated = 0.0, swap used = 0.0, virtual = 1.0G
      Shared | 176.0M
     Buffers | 3.9G
      Caches | 6.2G
       Dirty | 208 kB
     UsedRSS | 2.5G
  Swappiness | 60
 DirtyPolicy | 20, 10
 DirtyStatus | 0, 0
...
for the famous use case presented in Bug #68079. The scalability problems identified there caused a lot of work by Oracle engineers, and a lot of changes in MySQL 5.6.x and 5.7.x (so notable that MariaDB decided to take InnoDB from 5.7 for the upcoming MariaDB 10.2, see MDEV-10476). I've mentioned this bug in probably every public talk I've given about MySQL since 2013, so one day I decided to check how the fix helped recent InnoDB, and then compare it to MyRocks performance with all default settings on the same hardware for different numbers of concurrent threads.

So, one day in April 2016 I took a dump of the tables involved from the bug, replaced InnoDB with ROCKSDB there, added explicit CHARSET=utf8 COLLATE=utf8_bin clauses, and loaded the dump into Facebook's MySQL 5.6 built from source. I ended up with the following:
mysql> show create table task\G
*************************** 1. row ***************************
       Table: task
Create Table: CREATE TABLE `task` (
  `sys_id` char(32) COLLATE utf8_bin NOT NULL DEFAULT '',
  `u_root_cause` char(32) COLLATE utf8_bin DEFAULT NULL,
  `u_business_impact_description` mediumtext COLLATE utf8_bin,
  `u_business_impact_category` mediumtext COLLATE utf8_bin,
  PRIMARY KEY (`sys_id`)
) ENGINE=ROCKSDB DEFAULT CHARSET=utf8 COLLATE=utf8_bin
1 row in set (0.00 sec)

mysql> show create table incident\G
*************************** 1. row ***************************
       Table: incident
Create Table: CREATE TABLE `incident` (
  `sys_id` char(32) COLLATE utf8_bin NOT NULL DEFAULT '',
  `category` varchar(40) COLLATE utf8_bin DEFAULT NULL,
  PRIMARY KEY (`sys_id`),
  KEY `incident_category` (`category`)
) ENGINE=ROCKSDB DEFAULT CHARSET=utf8 COLLATE=utf8_bin
1 row in set (0.00 sec)

mysql> show table status like 'task'\G
*************************** 1. row ***************************
           Name: task
         Engine: ROCKSDB
        Version: 10
     Row_format: Dynamic
           Rows: 8292
 Avg_row_length: 69
    Data_length: 573991
...
1 row in set (0.00 sec)

mysql> show table status like 'incident'\G
*************************** 1. row ***************************
           Name: incident
         Engine: ROCKSDB
        Version: 10
     Row_format: Dynamic
           Rows: 8192
 Avg_row_length: 87
    Data_length: 719091
Max_data_length: 0
   Index_length: 956624
...
1 row in set (0.00 sec)
The tables are simple and small. Then I tried to study how well this query scales as the number of threads running it increases:
mysql> explain select count(*), category from task inner join incident on task.sys_id=incident.sys_id group by incident.category;
+----+-------------+----------+--------+---------------------------+-------------------+---------+----------------------+------+-------------+
| id | select_type | table    | type   | possible_keys             | key               | key_len | ref                  | rows | Extra       |
+----+-------------+----------+--------+---------------------------+-------------------+---------+----------------------+------+-------------+
|  1 | SIMPLE      | incident | index  | PRIMARY,incident_category | incident_category | 123     | NULL                 | 8192 | Using index |
|  1 | SIMPLE      | task     | eq_ref | PRIMARY                   | PRIMARY           | 96      | test.incident.sys_id |    1 | Using index |
+----+-------------+----------+--------+---------------------------+-------------------+---------+----------------------+------+-------------+
2 rows in set (0.00 sec)
For this I've used mysqlslap and commands like the following:
[openxs@fc23 fb56]$ bin/mysqlslap -uroot --iterations=10 --concurrency=1 --create-schema=test --no-drop --number-of-queries=1000 --query='select count(*), category from task inner join incident on task.sys_id=incident.sys_id group by incident.category'
with different --concurrency values: 1, 2, 4, 8, 10, 16, 32 (on a box with 4 cores). Before the fix, InnoDB had scalability problems even at concurrency N, where N is the number of cores. After the fix it scaled way better in this test.
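The full series of runs can be scripted instead of typed one by one; a minimal sketch, assuming the server is up and the test schema is loaded as above:

```shell
# Repeat the mysqlslap benchmark for each concurrency level of interest.
for c in 1 2 4 8 10 16 32; do
  echo "== concurrency=$c =="
  bin/mysqlslap -uroot --iterations=10 --concurrency=$c \
    --create-schema=test --no-drop --number-of-queries=1000 \
    --query='select count(*), category from task inner join incident on task.sys_id=incident.sys_id group by incident.category'
done
```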

You can find the summary of results in this Google spreadsheet, but here is a chart from it:


The X axis has a log scale and shows the number of concurrent threads running. The Y axis is the average number of problematic queries executed per second. "inner" and "straight" refer to the INNER JOIN vs STRAIGHT_JOIN queries presented above.
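The throughput numbers behind the Y axis can be derived directly from the mysqlslap output: with --number-of-queries=1000, queries per second is just 1000 divided by the average run time. A small awk sketch (the sample line is taken from one of the runs above):

```shell
# Turn a mysqlslap "Average number of seconds" line into queries/sec,
# assuming --number-of-queries=1000 as in all the runs above.
echo "        Average number of seconds to run all queries: 19.276 seconds" |
  awk '/Average number of seconds/ { printf "%.1f qps\n", 1000 / $9 }'
# prints: 51.9 qps
```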

"MyRocks" on this chart means ROCKSDB in MySQL from Facebook built from source corresponding to the following commit:
[openxs@fc23 mysql-5.6]$ git log -1
commit 6b4ba98698182868800f31e0d303f1e16026efdf
...
Date:   Sat Jan 28 13:51:47 2017 -0800
while "InnoDB 5.7" means InnoDB from Oracle MySQL 5.7.17, the following commit specifically:
[openxs@fc23 mysql-server]$ git log -1
commit 23032807537d8dd8ee4ec1c4d40f0633cd4e12f9
...
Date:   Mon Nov 28 16:48:20 2016 +0530
For this post I wondered: why is the difference in favor of the "straight" query so big starting from 4 concurrent threads in the case of MyRocks? I've profiled both queries running via mysqlslap as follows:
[openxs@fc23 ~]$ perf --version
perf version 4.9.6.200.fc25.x86_64.g51a0
[openxs@fc23 ~]$ sudo perf record -a
[sudo] password for openxs:
^C[ perf record: Woken up 335 times to write data ]
[ perf record: Captured and wrote 84.913 MB perf.data (1825548 samples) ]
The profiler was started, then the mysqlslap command was executed, and Ctrl-C was pressed when it produced the results.
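Instead of stopping the capture by hand with Ctrl-C, perf can be pointed at the benchmark command itself, so the recording ends exactly when mysqlslap does; a sketch with the same paths and options as above:

```shell
# System-wide capture bounded by the benchmark run; perf record exits
# when mysqlslap finishes, no Ctrl-C needed.
sudo perf record -a -- bin/mysqlslap -uroot --iterations=10 --concurrency=16 \
  --create-schema=test --no-drop --number-of-queries=1000 \
  --query='select count(*), category from task straight_join incident on task.sys_id=incident.sys_id group by incident.category'
sudo perf report --stdio | head -40
```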

The result for STRAIGHT_JOIN queries was the following (down to 1% of "overhead"):
[openxs@fc23 ~]$ sudo perf report --stdio | more
...
# Total Lost Samples: 0
#
# Samples: 1M of event 'cycles:p'
# Event count (approx.): 62748127726390
#
# Overhead  Command          Shared Object                     Symbol...
     5.32%  my-oneconnectio  libc-2.24.so                      [.] __memcmp_sse4_1
     4.68%  my-oneconnectio  libc-2.24.so                      [.] __memcpy_ssse3
     4.04%  my-oneconnectio  mysqld                            [.] rocksdb::BlockIter::Seek
     3.33%  my-oneconnectio  mysqld                            [.] my_strnxfrm_unicode
     3.11%  my-oneconnectio  mysqld                            [.] rocksdb::BlockBasedTable::Get
     2.96%  my-oneconnectio  mysqld                            [.] myrocks::Rdb_pk_comparator::Compare
     2.54%  my-oneconnectio  mysqld                            [.] rocksdb::BlockIter::BinarySeek
     2.14%  my-oneconnectio  mysqld                            [.] rocksdb::InternalKeyComparator::Compare
     2.00%  my-oneconnectio  mysqld                            [.] rocksdb::StatisticsImpl::recordTick
     1.99%  my-oneconnectio  mysqld                            [.] rocksdb::MergingIterator::Next
     1.95%  my-oneconnectio  mysqld                            [.] rocksdb::HistogramStat::Add
     1.51%  my-oneconnectio  mysqld                            [.] join_read_key
     1.51%  my-oneconnectio  mysqld                            [.] myrocks::rdb_unpack_utf8_str
     1.41%  my-oneconnectio  mysqld                            [.] myrocks::Rdb_key_def::unpack_record
     1.36%  my-oneconnectio  mysqld                            [.] sub_select
     1.32%  my-oneconnectio  mysqld                            [.] my_uni_utf8
     1.29%  my-oneconnectio  mysqld                            [.] rocksdb::Version::Get
     1.22%  my-oneconnectio  libc-2.24.so                      [.] _int_malloc
     1.22%  my-oneconnectio  mysqld                            [.] rocksdb::TableCache::Get
     1.22%  my-oneconnectio  mysqld                            [.] rocksdb::BlockIter::Next
     1.21%  my-oneconnectio  mysqld                            [.] rocksdb::ThreadLocalPtr::Get
     1.18%  my-oneconnectio  mysqld                            [.] rocksdb::DBIter::FindNextUserEntryInternal
     1.18%  my-oneconnectio  mysqld                            [.] rocksdb::(anonymous namespace)::FilePicker::GetNextFile
     1.12%  my-oneconnectio  libc-2.24.so                      [.] malloc
     1.07%  my-oneconnectio  mysqld                            [.] evaluate_join_record
     1.07%  my-oneconnectio  mysqld                            [.] myrocks::Rdb_key_def::pack_index_tuple
     1.01%  my-oneconnectio  mysqld                            [.] key_restore
The result for INNER JOIN queries was the following (down to 1% of "overhead"):
[openxs@fc23 ~]$ sudo perf report --stdio | more
...
# Total Lost Samples: 0
#
# Samples: 1M of event 'cycles:p'
# Event count (approx.): 162704428520300
#
# Overhead  Command          Shared Object                  Symbol...
#
     5.90%  my-oneconnectio  libc-2.24.so                   [.] __memcpy_ssse3
     4.38%  my-oneconnectio  libpthread-2.24.so             [.] pthread_mutex_lock
     3.69%  my-oneconnectio  libc-2.24.so                   [.] __memcmp_sse4_1
     3.58%  my-oneconnectio  mysqld                         [.] rocksdb::BlockIter::Seek
     2.53%  my-oneconnectio  libpthread-2.24.so             [.] pthread_mutex_unlock
     2.47%  my-oneconnectio  mysqld                         [.] rocksdb::StatisticsImpl::recordTick
     2.35%  my-oneconnectio  mysqld                         [.] rocksdb::BlockBasedTable::Get
     2.28%  my-oneconnectio  mysqld                         [.] my_strnxfrm_unicode
     2.08%  my-oneconnectio  mysqld                         [.] rocksdb::BlockIter::BinarySeek
     1.84%  my-oneconnectio  mysqld                         [.] rocksdb::HistogramStat::Add
     1.84%  my-oneconnectio  mysqld                         [.] rocksdb::Version::Get
     1.71%  my-oneconnectio  mysqld                         [.] myrocks::rdb_unpack_binary_or_utf8_varchar_space_pad
     1.69%  my-oneconnectio  mysqld                         [.] rocksdb::LRUCacheShard::Lookup
     1.67%  my-oneconnectio  mysqld                         [.] my_uni_utf8
     1.61%  my-oneconnectio  mysqld                         [.] rocksdb::InternalKeyComparator::Compare
     1.52%  my-oneconnectio  mysqld                         [.] myrocks::Rdb_pk_comparator::Compare
     1.47%  my-oneconnectio  mysqld                         [.] rocksdb::(anonymous namespace)::FilePicker::GetNextFile
     1.23%  my-oneconnectio  mysqld                         [.] rocksdb::BlockIter::Next
     1.15%  my-oneconnectio  mysqld                         [.] rocksdb::MergingIterator::Next
     1.10%  my-oneconnectio  mysqld                         [.] join_read_key
     1.09%  my-oneconnectio  mysqld                         [.] rocksdb::Block::NewIterator
     1.04%  my-oneconnectio  mysqld                         [.] rocksdb::TableCache::Get
I am not an expert in MyRocks at all, so I will not even try to interpret these results (maybe in some later post, one day). I've just highlighted what I consider a real difference in the second (INNER JOIN) case. Note the pthread_mutex_lock/pthread_mutex_unlock calls, for example, which are NOT present in the "more than 1%" output for the STRAIGHT_JOIN case.

I'd be happy to get any comments on the above. I also plan to share more raw data from perf some day later, when I decide where to put it for public access (the text outputs are megabytes in size).

I've captured profiles with callgraphs as well:
[openxs@fc23 ~]$ sudo perf record -ag
^C[ perf record: Woken up 565 times to write data ]
[ perf record: Captured and wrote 142.410 MB perf.data (644176 samples) ]
Again, the full trees are huge and require a real expert to analyze properly.

The new perf report option (-g flat, available since kernel 4.4) gave nice backtraces like this:
           40.20%
                handle_one_connection
                do_handle_one_connection
                dispatch_command
                mysql_parse
                mysql_execute_command
                execute_sqlcom_select
                handle_select
                mysql_select
                JOIN::exec
                sub_select
                evaluate_join_record
                sub_select
                join_read_key
                myrocks::ha_rocksdb::index_read_map_impl
                myrocks::ha_rocksdb::get_row_by_rowid
                myrocks::Rdb_transaction_impl::get
                rocksdb::TransactionBaseImpl::Get
                rocksdb::WriteBatchWithIndex::GetFromBatchAndDB
                rocksdb::DBImpl::Get
                rocksdb::DBImpl::GetImpl
                rocksdb::Version::Get
that show how data is read in MyRocks. You can get similar backtraces with pt-pmp as well, but you'll see only how many times a specific backtrace was noted, not what the related overhead was. Also, the performance impact of pt-pmp would be very noticeable in this case.
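For reference, the flat call chains above come from a report command along these lines, run against the perf.data captured with -g:

```shell
# Flattened call chains from the callgraph capture; the "-g flat"
# print mode needs perf from kernel 4.4 or newer.
sudo perf report -g flat --stdio | more
```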

It seems it's time to stop, as the post is already long. One day I'll continue, as there are many phenomena highlighted by this very simple, but famous, use case. Stay tuned!