Elasticating Examine

Heads Up!

This article is several years old now, and much has happened since then, so please keep that in mind while reading it.

For those of you who are unfamiliar with Examine, it's the Umbraco wrapper for Lucene that automatically indexes your content and provides you with an API to search against it out of the box. You can configure how it indexes things and write advanced queries to search the data. It does this by creating and maintaining file-based indexes in your App_Data folder. This can sometimes cause problems with corrupted or locked files, which can be more frequent in cloud-based environments (though Umbraco have done a lot of work to improve this).

I was involved in a project last year where the client wanted to be able to do some pretty sophisticated searching over a large amount of content. The tech-lead went with Elasticsearch to try and avoid the aforementioned Examine issues; Elasticsearch has a centralised, RESTful interface (with C# NuGet packages available) which usually sits on top of a cluster of Elasticsearch nodes. It's meant to be very reliable and fast, and it has the added bonus of avoiding the file-based issues that Examine can sometimes have. It's (a little bit) like the search equivalent of using Azure Blob storage to centralise your media library when you have more than one web server in the mix. For the project, the developer hooked into the Umbraco ContentService events to add and remove data from Elasticsearch. I wasn't heavily involved in the Elasticsearch side on that project, but recently I've been working on an MVC project where I've had the opportunity to get my hands dirty setting up an Elasticsearch index. This made me wonder...

Is it possible to implement an Examine provider that uses Elasticsearch?

Examine isn't just used for site search, it's used for indexing internal data too. If I could implement a provider, then it might be possible to negate some of the issues Examine faces in some scenarios. So, I decided to do what any sane developer would do: I googled it to see if someone else had already done it. And, to my surprise, I found pretty much nothing...I couldn't even find anything about implementing the Examine providers!

So I started with what little I knew: the ExamineSettings.config was where I'd come across the Examine providers previously:

<?xml version="1.0"?>
<Examine>
    <ExamineIndexProviders>
        <providers>
            <add name="InternalIndexer" type="UmbracoExamine.UmbracoContentIndexer, UmbracoExamine"
                 supportUnpublished="true"
                 supportProtected="true"
                 analyzer="Lucene.Net.Analysis.WhitespaceAnalyzer, Lucene.Net"/>

            <add name="InternalMemberIndexer" type="UmbracoExamine.UmbracoMemberIndexer, UmbracoExamine"
                 supportUnpublished="true"
                 supportProtected="true"
                 analyzer="Lucene.Net.Analysis.Standard.StandardAnalyzer, Lucene.Net"/>

            <!-- default external indexer, which excludes protected and unpublished pages-->
            <add name="ExternalIndexer" type="UmbracoExamine.UmbracoContentIndexer, UmbracoExamine"/>

        </providers>
    </ExamineIndexProviders>

    <ExamineSearchProviders defaultProvider="ExternalSearcher">
        <providers>
            <add name="InternalSearcher" type="UmbracoExamine.UmbracoExamineSearcher, UmbracoExamine"
                 analyzer="Lucene.Net.Analysis.WhitespaceAnalyzer, Lucene.Net"/>

            <add name="InternalMemberSearcher" type="UmbracoExamine.UmbracoExamineSearcher, UmbracoExamine"
                 analyzer="Lucene.Net.Analysis.Standard.StandardAnalyzer, Lucene.Net" enableLeadingWildcard="true"/>

            <add name="ExternalSearcher" type="UmbracoExamine.UmbracoExamineSearcher, UmbracoExamine" />

        </providers>
    </ExamineSearchProviders>

</Examine>

The default ExamineSettings.config file

This is split into 2 sections:

  • <ExamineIndexProviders /> - for configuring how data is stored in the index using either the UmbracoContentIndexer or the UmbracoMemberIndexer
  • <ExamineSearchProviders /> - for configuring how you'd like to search the data in the index using the UmbracoExamineSearcher

There is also the ExamineIndex.config file:

<?xml version="1.0"?>
<ExamineLuceneIndexSets>
    <!-- The internal index set used by Umbraco back-office - DO NOT REMOVE -->
    <IndexSet SetName="InternalIndexSet" IndexPath="~/App_Data/TEMP/ExamineIndexes/Internal/"/>

    <!-- The internal index set used by Umbraco back-office for indexing members - DO NOT REMOVE -->
    <IndexSet SetName="InternalMemberIndexSet" IndexPath="~/App_Data/TEMP/ExamineIndexes/InternalMember/">
        <IndexAttributeFields>
            <add Name="id" />
            <add Name="nodeName"/>
            <add Name="updateDate" />
            <add Name="writerName" />
            <add Name="loginName" />
            <add Name="email" />
            <add Name="nodeTypeAlias" />
        </IndexAttributeFields>
    </IndexSet>

    <!-- Default Indexset for external searches, this indexes all fields on all types of nodes-->
    <IndexSet SetName="ExternalIndexSet" IndexPath="~/App_Data/TEMP/ExamineIndexes/External/" />
</ExamineLuceneIndexSets>

The default ExamineIndex.config file

This contains a single <ExamineLuceneIndexSets /> section which defines "index sets" for indexes. These configure which fields should be indexed and where on the file system the index files should be stored.

A single index's configuration is spread across the three configuration sections using naming conventions: "IndexNameIndexer", "IndexNameSearcher" and "IndexNameIndexSet". Out of the box, three indexes are defined: the "Internal" and "InternalMember" indexes are for back-office use, and the "External" index automatically indexes all fields on all site content nodes so that developers have a starting point for any site search they might want to implement.
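As a rough illustration of that naming convention, here is a minimal sketch of how an index set name can be derived from a provider name. The ExamineNaming class and its IndexSetNameFor method are hypothetical names made up for this example; this is not Examine's own code, just the convention described above written out.

```csharp
using System;

// Hypothetical helper (not part of Examine): applies the naming convention
// "<IndexName>Indexer" / "<IndexName>Searcher" -> "<IndexName>IndexSet".
public static class ExamineNaming
{
    public static string IndexSetNameFor(string providerName)
    {
        if (providerName == null)
        {
            throw new ArgumentNullException(nameof(providerName));
        }

        foreach (var suffix in new[] { "Indexer", "Searcher" })
        {
            // Only strip the suffix when something is left over to act as the index name.
            if (providerName.EndsWith(suffix, StringComparison.Ordinal)
                && providerName.Length > suffix.Length)
            {
                var indexName = providerName.Substring(0, providerName.Length - suffix.Length);
                return indexName + "IndexSet";
            }
        }

        throw new ArgumentException(
            "Provider name must end in 'Indexer' or 'Searcher'.", nameof(providerName));
    }
}
```

For example, ExamineNaming.IndexSetNameFor("InternalMemberIndexer") gives "InternalMemberIndexSet", which matches the SetName used in the default ExamineIndex.config.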

The Plan

From what I could see above, it looked like I would need to:

  1. Implement the indexer - I would need to write my own Umbraco content indexer - DefaultPublishedContentIndexer
  2. Implement the searcher - I would need to write my own Umbraco content searcher - DefaultPublishedContentSearcher
  3. Configure Umbraco to use my new providers in ExamineSettings.config and ExamineIndex.config

Implementing the Indexer

After Examine-ing (pun intended) the source (and hitting a number of dead-ends), I settled on implementing the UmbracoExamine.BaseUmbracoIndexer abstract class and overriding its public methods as this was the one Umbraco needed in order to interact with it properly. These are the methods I needed to implement, which were very straight-forward using the Elasticsearch NEST client:

  • Initialize() - Initialize my NEST client - this should be a singleton.
  • IndexExists() - Check if an index exists in Elasticsearch.
  • RebuildIndex() & IndexAll() - get all published content in Umbraco and send it to be indexed in Elasticsearch
  • DeleteFromIndex() - remove a content item from the index
  • ReIndexNode() - adds/updates a content item in the index

Here are the bare-bones of the class (full implementation available here):

using System.Collections.Generic;
using System.Collections.Specialized;
using System.Xml.Linq;

namespace Test.ElasticExamineProvider.Indexers
{
    public class DefaultPublishedContentIndexer : UmbracoExamine.BaseUmbracoIndexer
    {
        // let Umbraco know that we only support content (as opposed to media, etc.)
        private static List<string> _supportedTypes = new List<string>() { UmbracoExamine.IndexTypes.Content };

        protected override IEnumerable<string> SupportedTypes => _supportedTypes;
        

        public override void Initialize(string name, NameValueCollection config)
        {
            base.Initialize(name, config);

            // Initialize Elastic client, read in the name of the index, etc...
        }

        public override bool IndexExists()
        {
            // check Elasticsearch to see if the current index exists
            throw new System.NotImplementedException(); // placeholder so the skeleton compiles
        }

        public override void IndexAll(string type)
        {
            // index all documents of the "type" specified (can be "content", "media", etc..)
        }

        public override void RebuildIndex()
        { 
            // rebuild the entire index
        }

        public override void DeleteFromIndex(string nodeId)
        {
            // delete the specified node from the index
        }
        
        public override void ReIndexNode(XElement node, string type)
        {
            // add/update the specified node in the index
        }
    }
}

This took a few attempts to get working, but once I did I could see (by writing lines to the log file) that Umbraco was checking the index on startup and calling RebuildIndex() which was, in turn, pushing everything into Elasticsearch!

There are a couple of points to note about this:

  • Umbraco was checking to see whether the Examine index files were available when it started up. As the files weren't getting created it was triggering RebuildIndex() on every startup which I definitely didn't want! I added RebuildOnAppStart="false" to the root node in ExamineSettings.config which stopped this, but this stops the application doing any kind of full rebuild when it's starting up. That's not such a big issue as Elasticsearch should rarely need to be fully re-indexed and it can easily be manually triggered by going into the "Examine Management" tab in the Developers section (NB - when you do this the spinner goes forever, but I could see from the logs that the re-indexing was being triggered and completed).
  • The base indexer has a field on it called IndexerData which has the values from the configuration specified in ExamineIndex.config. Ideally I should be respecting this configuration and only including the document types/fields specified; however, for the purposes of this experiment I ignored it as I just wanted to prove the idea.

To index the content, I created a class called PublishedContent to represent the Umbraco content, which I then passed into the Elasticsearch NEST client to be indexed. It's a simple class which has some basic document metadata and a Dictionary<string, string> of key/value pairs to represent all of the Umbraco node's properties:

using System;
using System.Collections.Generic;
using Umbraco.Core.Models;
using Nest;

namespace Test.ElasticExamineProvider.DocumentTypes
{
    [ElasticsearchType(Name = DocumentTypeName)]
    public class PublishedContent
    {
        public const string DocumentTypeName = "content";  
        public int Id { get; set; }
        public string DocumentTypeAlias { get; set; }
        public int DocumentTypeId { get; set; }
        public DateTime CreateDate { get; set; }
        public DateTime UpdateDate { get; set; }
        public string Url { get; set; }
        public Dictionary<string, string> Properties { get; set; } = new Dictionary<string, string>();

        public PublishedContent()
        {

        }

        public PublishedContent(IPublishedContent item)
        {
            Id = item.Id;
            DocumentTypeAlias = item.DocumentTypeAlias;
            DocumentTypeId = item.DocumentTypeId;
            CreateDate = item.CreateDate;
            UpdateDate = item.UpdateDate;
            Url = item.Url;

            foreach(var prop in item.Properties)
            {
                if (prop == null || string.IsNullOrWhiteSpace(prop.PropertyTypeAlias) || !prop.HasValue)
                    continue;

                Properties.Add(prop.PropertyTypeAlias, prop.DataValue?.ToString());
            }
        }
    }
}

The NEST client will serialize this to JSON and store it in Elasticsearch for indexing (Elasticsearch can use schema-less JSON in its index), so all fields on all documents get indexed this way.
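To make the indexing calls a little more concrete, here is a minimal sketch of how the indexer methods could hand work off to NEST. It assumes a NEST 5.x/6.x-style API (later NEST versions expose some of these calls differently). The ContentDocument and ContentIndexWriter types and the index name passed in are hypothetical stand-ins for this example, not the article's actual implementation.

```csharp
using System;
using System.Collections.Generic;
using Nest;

// Hypothetical, trimmed-down stand-in for the PublishedContent class above.
public class ContentDocument
{
    public int Id { get; set; }
    public string DocumentTypeAlias { get; set; }
    public Dictionary<string, string> Properties { get; set; } = new Dictionary<string, string>();
}

// Hypothetical wrapper around the NEST calls an indexer would need.
public class ContentIndexWriter
{
    private readonly IElasticClient _client;
    private readonly string _indexName;

    public ContentIndexWriter(IElasticClient client, string indexName)
    {
        _client = client ?? throw new ArgumentNullException(nameof(client));
        _indexName = indexName ?? throw new ArgumentNullException(nameof(indexName));
    }

    // Roughly what IndexExists() could delegate to (NEST 5.x/6.x call).
    public bool IndexExists()
    {
        return _client.IndexExists(_indexName).Exists;
    }

    // Roughly what ReIndexNode() could delegate to. Indexing with an explicit id
    // overwrites any existing document with that id, so add and update are one call.
    public bool Upsert(ContentDocument doc)
    {
        var response = _client.Index(doc, i => i.Index(_indexName).Id(doc.Id));
        return response.IsValid;
    }

    // Roughly what DeleteFromIndex(string nodeId) could delegate to.
    public bool Delete(string nodeId)
    {
        var response = _client.Delete<ContentDocument>(nodeId, d => d.Index(_indexName));
        return response.IsValid;
    }
}
```

Keeping the NEST calls behind a small class like this is only one possible design; it keeps the provider overrides short and makes it easier to swap the client for a fake when testing.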

Implementing the Searcher

Now that I had some content indexed in Elasticsearch, it was time to implement my searcher. Initially I was going to implement Examine.Providers.BaseSearchProvider, which would have needed me to fully implement the CreateSearchCriteria() methods (which handle the generation of fluent API queries) as well as the Search() methods...which I wasn't going to have time to do. Since Examine is based on Lucene, I made the assumption that the default UmbracoExamineSearcher would, eventually, need to convert its fluent search query into a string-based Lucene query. As Elasticsearch is also based on Lucene, I took a guess that if I could extract the query string being generated by Examine from the fluent query I could pass it straight into Elasticsearch, and save a whole load of effort for the purposes of my experiment. With this in mind I created my searcher class by inheriting from the default UmbracoExamineSearcher implementation (which already implements the CreateSearchCriteria() methods) so that I only had to override the Search() methods:

  • Initialize() - initializing the NEST client, although this time I grabbed the singleton instance that was initialized in the indexer.
  • GetSearchFields() - this returns a list of available fields to search. Rather than attempt to query Elasticsearch for this, I just hard-coded it to return _all, which is a special field name for Elasticsearch which lets it search across all indexed fields.
  • Search(ISearchCriteria searchParams) - this is an implementation of search which takes a fluent API query. I'm certainly not advanced enough to convert that to a NEST fluent query so I cheated: I extracted the Lucene query string from the ISearchCriteria and passed it straight into the NEST client. I don't know how well this would work for a complex query, but it worked well enough for the purposes of this experiment.
  • Search(string searchText, ...) - an implementation which allows search text to be passed in. For this, I generated a simple Elasticsearch query to search across all fields for whatever was passed in.
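The "pass the Lucene string straight through" idea can be sketched with Elasticsearch's query_string query, which accepts Lucene query syntax. This is a minimal sketch assuming a NEST 5.x/6.x-style API. The ContentHit and LuceneQueryPassthrough names are hypothetical, and the Lucene string is simply taken as a parameter here because how it gets extracted from ISearchCriteria is not shown in this article.

```csharp
using System.Collections.Generic;
using Nest;

// Hypothetical document shape for search hits (a stand-in for PublishedContent).
public class ContentHit
{
    public int Id { get; set; }
    public string Url { get; set; }
    public Dictionary<string, string> Properties { get; set; } = new Dictionary<string, string>();
}

// Hypothetical helper: wraps a raw Lucene query string in a query_string query.
public static class LuceneQueryPassthrough
{
    // Builds the request without contacting Elasticsearch.
    public static ISearchRequest BuildRequest(string indexName, string luceneQuery, int maxResults)
    {
        return new SearchDescriptor<ContentHit>()
            .Index(indexName)
            .Size(maxResults)
            // No default field is set here, so Elasticsearch's own default applies.
            .Query(q => q.QueryString(qs => qs.Query(luceneQuery)));
    }

    // Sends the request; needs a reachable Elasticsearch node.
    public static ISearchResponse<ContentHit> Search(IElasticClient client, ISearchRequest request)
    {
        return client.Search<ContentHit>(request);
    }
}
```

A call might look like LuceneQueryPassthrough.Search(client, LuceneQueryPassthrough.BuildRequest("umbraco-content", "+nodeName:home", 10)), where the index name and query are made up. Whether every query Examine generates is understood the same way by Elasticsearch is exactly the open question raised above, so treat this as a starting point rather than a proven approach.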

After looking (yet again!) through the Examine source-code to try and figure out how to extract the Lucene search string, here are the bare-bones of a searcher implementation (full implementation here):

using System.Collections.Specialized;
using Examine;
using Examine.SearchCriteria;

namespace Test.ElasticExamineProvider.Searchers
{
    public class DefaultPublishedContentSearcher : UmbracoExamine.UmbracoExamineSearcher
    {
        public override void Initialize(string name, NameValueCollection config)
        {
            base.Initialize(name, config);

            // Initialize Elastic client, read in the name of the index, etc...
        }

        protected override string[] GetSearchFields()
        {
            // provide a list of all searchable fields - NB - "_all" will search all fields within the Elastic document
            return new[] { "_all" };
        }

        public override ISearchResults Search(ISearchCriteria searchParams, int maxResults)
        {
            // implement the search using fluent syntax, starting with ISearchCriteria
            throw new System.NotImplementedException(); // placeholder so the skeleton compiles
        }

        public override ISearchResults Search(string searchText, bool useWildcards, string indexType)
        {
            // implement the search using the raw search text
            throw new System.NotImplementedException(); // placeholder so the skeleton compiles
        }
    }
}

I tested this by going to "Developer Section" > "Examine Management" and scrolling down to the "Searchers" area. This has a sort of test harness that lets you search a given index through its search provider, using either a "Lucene" search (which would use my hacky "fluent" methods) or a "Text" search. After a bit of fiddling and debugging, I managed to get it to work; both options were searching and returning results across all fields on all content!

Configuration

To pull all of this together I configured a new index called "Elastic" and updated ExamineSettings.config and ExamineIndex.config as follows:

<!-- ExamineSettings.config -->
<Examine RebuildOnAppStart="false">
    <ExamineIndexProviders>
        <providers>
            ...
            <!-- Elastic Indexer-->
            <add name="ElasticIndexer" type="Test.ElasticExamineProvider.Indexers.DefaultPublishedContentIndexer, Test.ElasticExamineProvider"/>
            ...
        </providers>
    </ExamineIndexProviders>

    <ExamineSearchProviders defaultProvider="ElasticSearcher">
        <providers>
            ...
            <!-- Elastic Searcher-->
            <add name="ElasticSearcher" type="Test.ElasticExamineProvider.Searchers.DefaultPublishedContentSearcher, Test.ElasticExamineProvider"/>
            ...
        </providers>
    </ExamineSearchProviders>
</Examine>

<!-- ExamineIndex.config -->
<ExamineLuceneIndexSets>
    ...
    <IndexSet SetName="ElasticIndexSet"  IndexPath="~/App_Data/TEMP/ExamineIndexes/Elastic/"/>
    ...
</ExamineLuceneIndexSets>

There's not much to explain here other than saying that I added the RebuildOnAppStart attribute at the top of the ExamineSettings.config file (to stop it detecting the non-existent Examine files). I also had to specify an IndexPath against the ElasticIndexSet - this is because, if I didn't, Umbraco would complain with an instant yellow screen of death!

Once I'd finished configuring the new providers I tried creating, deleting, publishing and unpublishing to make sure the Elasticsearch index updated properly...which it did! Searching in the test harness in the developer area worked, but I didn't manage to get around to using the new search provider with a complex fluent query, so I don't know how resilient/compatible my convert-to-Lucene-text hack really is.

Where Do We Go From Here?

I think it could be in the realms of possibility to write some kind of package which could fully replace Examine with Elasticsearch, but there are certainly some considerations for this to be package-able and production-ready:

  • The index provider I've written currently only has one document structure which indexes all document types & properties with the default analyzer - for more sophisticated searching, work would need to be done to read the configuration out of the configuration files, as well as creating overridable methods to allow people to specify their own document types/indexes/Elasticsearch configuration.
  • The search provider is really, really basic and work/testing would need to be done to make sure it works with complex fluent API queries.
  • Indexers/searchers would need to be implemented to replace the default internal searchers that Umbraco uses to index media, members & unpublished content.
  • The singleton NEST client would need to be a bit more centralised so it could be shared amongst all types of indexers/searchers.
  • Ensuring that it works well in the cloud and on multiple web servers - I've only given this brief consideration, by only allowing the "rebuild all" methods to be called by a master web server, which can be configured in the web.config.
  • All the usual dependency issues that come with making a package: the versions of .NET, Umbraco and other NuGet packages that it's built against. To add more complications into the mix, there are multiple versions of Elasticsearch that pair up with different versions of the NEST NuGet package!
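On the point about centralising the singleton NEST client, one common option is a lazily created static instance that every indexer and searcher can share. The sketch below is only one way to do it. The ElasticClientProvider name, the node URL and the default index name are all hypothetical, and in a real site the settings would come from configuration rather than constants.

```csharp
using System;
using Nest;

// Hypothetical shared holder for a single NEST client instance.
public static class ElasticClientProvider
{
    // Hypothetical values; assumed to be replaced by real configuration.
    private const string NodeUrl = "http://localhost:9200";
    private const string DefaultIndexName = "umbraco-content";

    // Lazy<T> with a factory delegate is thread-safe by default, so the client
    // is created once even if several providers initialize at the same time.
    private static readonly Lazy<IElasticClient> LazyClient = new Lazy<IElasticClient>(() =>
    {
        var settings = new ConnectionSettings(new Uri(NodeUrl)).DefaultIndex(DefaultIndexName);
        return new ElasticClient(settings);
    });

    public static IElasticClient Client => LazyClient.Value;
}
```

Both an indexer's and a searcher's Initialize() could then read ElasticClientProvider.Client instead of one provider depending on the other having been initialized first.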

Final Thoughts

Overall, I'm quite pleased with how the experiment went - I managed to replace the default Examine index and search providers with Elasticsearch implementations, generate an index of all of my site content and query it in a basic way. I've also learned a lot about how the Examine providers work in terms of their interaction with Umbraco. That's about as much as I could have hoped for in a couple of days' worth of hacking! It's certainly shown me that it's straight-forward enough to write your own Examine provider to index site search content in a different search provider.

If I were working on another project which combined Umbraco and Elasticsearch, I would definitely prefer to implement the index provider myself so that Examine can manage keeping the index up to date. Then I would probably query the data directly using the NEST client rather than fully implement the search provider myself; however, that wouldn't allow replacement of the two internal indexes upon which Umbraco relies. That means there's definitely a gap whereby this could be packaged up to replace Examine and, for any queries the package couldn't handle, the NEST client could be called directly. If a proper implementation of the provider were created, there's also the possibility of hot-swapping the provider out in existing Umbraco installs that want to switch to Elasticsearch.

Tristan Thompson