It has been some time since my last post giving tips on the hardware and software requirements for installing and running Elasticsearch on Azure. This post is a further series of gotchas and tips I’ve picked up during my journey using Elasticsearch.
It is split into three broad categories: configuration items, designing your cluster and queries, and interesting features to consider using.
Configuration

– You can and should change the Elasticsearch cluster name (in the elasticsearch.yml configuration file), as it is used for node discovery. For example, if a few developers start Elasticsearch on machines on the same network with the default settings, it’s likely all those nodes will form one cluster, which might not be what you intended.
– Similarly, you can set specific node names. For some time I was using the default auto-generated Marvel character names, which can be very amusing, but restarting nodes meant a lot of checking which node was running on which virtual machine in Azure. Node names could be set to something more static that makes sense for your infrastructure, or you could just pin the Marvel names per node!
– By default, data is stored on disk within the Elasticsearch installation folder tree. It can be moved to a separate folder, or an entirely different disk. This is useful if you have multiple disks attached to a VM in Azure, or when upgrading to a new Elasticsearch version.
– Change setting ‘action.auto_create_index’ to false in production, to avoid any accidents!
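As a sketch, the settings above all live in elasticsearch.yml; the names and paths here are purely illustrative:

```yaml
# elasticsearch.yml (illustrative values, adjust names and paths to your environment)
cluster.name: my-team-prod-cluster       # avoid accidentally joining a colleague's cluster
node.name: "azure-d3-data-01"            # static name instead of an auto-generated Marvel one
path.data: /datadisk/elasticsearch/data  # keep data off the installation folder tree
action.auto_create_index: false          # no accidental index creation in production
```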
– Similarly, disable dynamic mapping in production (unless you actually want these dynamic behaviours, of course).
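For example (index and type names here are placeholders), a mapping created with “dynamic” set to “strict” will reject documents containing unmapped fields rather than silently mapping them:

```json
{
  "mappings": {
    "mytype": {
      "dynamic": "strict",
      "properties": {
        "title": { "type": "string" }
      }
    }
  }
}
```

PUT this body when creating the index; indexing a document with an unexpected field will then fail rather than grow the mapping.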
– If on Linux, Elasticsearch recommends using the mlockall setting (bootstrap.mlockall) to disable swapping. Elasticsearch relies on the right data being in RAM and won’t work as well if the OS is swapping data out to disk. I don’t believe this setting works on Windows. It is possible to do something similar on Windows for a specific user account using the local policy option for “Lock Pages in Memory”, but so far I haven’t done this and have not seen a need to. (Happy to hear from anyone with experience of this on Windows!)
– The OS default file descriptors limit is often too low and could be increased to 32k or 64k. Hitting the limit can manifest as write errors when too many files are open. The max and current values are visible in the _nodes/stats API.
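As a sketch (default localhost:9200 binding and Linux paths assumed):

```
# check file descriptor stats for each node
curl -s 'http://localhost:9200/_nodes/stats/process?pretty'

# raise the limit on Linux, e.g. in /etc/security/limits.conf:
#   elasticsearch  -  nofile  65536
```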
– Chef and Puppet can be used for centralised configuration management. We’ve used PowerShell and DSC, for pushing out configuration file changes.
– Some items in the elasticsearch.yml file are also configurable using the elasticsearch API.
Designing your cluster and queries

– Your data, indexes and processes should be designed so that you are able to re-index at any time, because certain actions can only be done at index creation time (changing shard count, changing analyser settings, etc.).
– If multiple nodes are lost, causing the loss of all replicas of a shard, writes to that shard will fail and reads will return partial results. Your calling client should account for this: should partial results be allowed, or is it really an error in your situation? The response contains the total number of shards, the number successfully queried and the number that failed, so this is straightforward to check.
– Use aliases. You can point at an alias in exactly the same way as an index, however an alias can be changed to point at a different index, or a set of indexes. This is useful for seamlessly migrating your clients between indexes when required. For example, this could be necessary because you’ve changed the way data is indexed, or it could be because you have time based data and roll your current index over every day.
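For example, rolling an alias from one day’s index to the next is a single atomic call to the _aliases endpoint (index and alias names here are made up):

```json
{
  "actions": [
    { "remove": { "index": "logs-2015-06-01", "alias": "logs-current" } },
    { "add":    { "index": "logs-2015-06-02", "alias": "logs-current" } }
  ]
}
```

POST this body to /_aliases; clients keep pointing at “logs-current” throughout the switch.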
– It is possible to update one node at a time to a new version, but don’t run different versions across nodes for an extended time. Elasticsearch reserves the right to break compatibility between major releases (like going from 1.x to 2.x).
– In AWS, the recommended minimum VM instance is m3.xlarge (4 vCPU with 15GB mem), as it suffers less from noisy neighbours (other VM instances that are intensively using some shared resources e.g. underlying hardware like shared disks). The equivalent sizing in Azure would be a D3 instance (4 cores, 14GB mem). We have been using D3 and A3 (half the memory of a D3) on Azure, for a 3 or 4 node cluster in production without problems, but do your own performance tests and continuous monitoring of performance.
– It is recommended to have a single type per index. Elasticsearch have actually discussed whether it should even be possible to have multiple types in an index, so this could change in the future.
– Consider having dedicated master nodes (3 or 5 is likely appropriate for almost all cluster sizes), with a flexible number of data nodes. This means you can set the “minimum_master_nodes” setting to a stable number whilst scaling the data nodes as requirements change. This is necessary to prevent the “split brain” problem.
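A sketch of the relevant elasticsearch.yml settings for a three-master cluster:

```yaml
# on the 3 dedicated master-eligible nodes
node.master: true
node.data: false

# on the data nodes (scale these freely)
node.master: false
node.data: true

# on every node: quorum of master-eligible nodes, i.e. (3 / 2) + 1 = 2
discovery.zen.minimum_master_nodes: 2
```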
– Design document relations in this order: denormalization, nested docs, parent/child, manage relations yourself.
– Prefer filters to queries where possible, as filters can be cached, which improves performance. Filters in a filtered query apply to the whole data set and narrow the set that the query needs to run on. Bear in mind that scoring doesn’t apply to filters, only to queries.
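A minimal filtered query sketch (field names are made up): the term filter is cacheable and narrows the data set, while the match query scores within it:

```json
{
  "query": {
    "filtered": {
      "query":  { "match": { "title": "elasticsearch tips" } },
      "filter": { "term": { "status": "published" } }
    }
  }
}
```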
– “Bool” filter should be used for BitSet-based filters (i.e. those based on the inverted index). And/Or/Not filter should be used for dynamically calculated filters like geo_* and scripted. This is important as it will dramatically alter performance.
– As well as using And/Or/Not for geo filters, prefer to place them last, as they are expensive, so you want them to operate on the smallest possible filtered data set.
– Avoid faceting on fields that are analysed or have many possible values.
– Use multi fields to index a single field in multiple ways (e.g. a single field “myField” could be tokenized as normal, and used for full text searching, but have a sub property “myField.raw” or “myField.orig” that isn’t analysed for exact matches).
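A mapping sketch for this (field names as in the example above):

```json
{
  "properties": {
    "myField": {
      "type": "string",
      "fields": {
        "raw": { "type": "string", "index": "not_analyzed" }
      }
    }
  }
}
```

Full-text searches go against “myField”; exact matches, sorting and aggregations go against “myField.raw”.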
– Use post filters to show different results to those used for aggregations. Data can be thought of as multiple streams. For example you could run a query, perform some aggregations on that data set to get overall numbers, then further filter that data set to whatever full documents you want to see (perhaps for paging, but with stats across all results).
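A sketch, with made-up field names: the aggregation sees everything the query matched, while the post filter only narrows the returned hits:

```json
{
  "query": { "match": { "category": "shoes" } },
  "aggs": {
    "by_brand": { "terms": { "field": "brand" } }
  },
  "post_filter": { "term": { "brand": "acme" } }
}
```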
– The cardinality metric aggregation gives a fast but approximate distinct count. Note that the value_count aggregation counts the total number of values, not distinct values, so getting an exact distinct count is expensive. Do you need exact counts, or is a reasonable approximation enough?
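A cardinality sketch (field name made up); precision_threshold trades memory for exactness up to roughly that many distinct values:

```json
{
  "size": 0,
  "aggs": {
    "distinct_users": {
      "cardinality": { "field": "user_id", "precision_threshold": 1000 }
    }
  }
}
```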
– You can return a subset of fields in a document. This is useful if you have large documents but only actually need to send a small amount of data back over-the-wire for a particular request.
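For example, using source filtering (field names made up):

```json
{
  "_source": ["title", "summary"],
  "query": { "match": { "title": "elasticsearch" } }
}
```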
Features to consider
– Consider whether document versioning is appropriate. Versioning is off by default. When on, a write will conflict if the supplied version isn’t the expected one. When off, later writes simply overwrite earlier ones (last write wins).
– It’s possible to use a Shingle Token Filter to do phrase matching but with the cost at index time instead of search time.
– Consider using the warmers API to populate caches after restarts and segment merges (so the first user doesn’t take the performance hit for the first request).
– Snapshots can be used for backups.
– Doc values vs field data: doc values are essentially field data persisted to disk rather than held in heap memory. Their performance has improved as of 1.4, and they will possibly become the default in 2.0.
– Consider using a custom cache key if a filter is very large, as by default the key is computed based on the filter’s contents.
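A sketch with a made-up key (this fragment would sit in the filter part of a filtered query): the _cache_key avoids hashing a large terms list on every request:

```json
{
  "terms": {
    "user_id": [1, 2, 3, 4, 5],
    "_cache": true,
    "_cache_key": "allowed_users_v1"
  }
}
```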
– Percolate can be used for saved queries. As of 1.0 it can scale, as it is no longer confined to a single node. An explicit percolate call is required in recent versions (indexing and percolating were decoupled for a reason). You can add metadata fields to saved queries to enable filtering/querying when percolating! Searching and percolating could even be in different indices with different scaling requirements.
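A sketch of both halves (index, type and field names all made up). Registering a saved query means indexing it into the special .percolator type, with any metadata alongside:

```json
{
  "query": { "match": { "body": "outage" } },
  "topic": "ops-alerts"
}
```

PUT that to /myindex/.percolator/alert-1, then percolate a candidate document (here filtered by the metadata field) via GET /myindex/mytype/_percolate:

```json
{
  "doc": { "body": "major outage in west europe" },
  "filter": { "term": { "topic": "ops-alerts" } }
}
```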
If you’ve seen anything particularly interesting in the tips above that you’d like a deeper dive into, let me know!