lucene Archives - Tales of a Code Monkey

LINQ with Lucene.Net.ObjectMapping

Last time I mentioned that I started to work on supporting LINQ with Lucene.Net.ObjectMapping. That includes LINQ queries like the following:

using (Searcher searcher = new IndexSearcher(directory))
{
    IQueryable<BlogPost> posts =
        from post in searcher.AsQueryable<BlogPost>()
        where post.Tag == "lucene"
        orderby post.Timestamp descending
        select post;
}

Now granted, the above example is a very basic one. So here's a short list of other methods on IQueryable<T> that are already supported at this point: Any*, Count*, First*, FirstOrDefault*, OrderBy, OrderByDescending, Single*, SingleOrDefault*, Skip, Take, ThenBy, ThenByDescending, and finally Where.

* Method is supported both with and without a filter predicate.
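
Skip and Take from the list above combine naturally into paging. Here's a minimal sketch of that idea, written against plain IQueryable<T> so it doesn't depend on any particular provider. The PagingSketch class and GetPage method are names I made up for illustration; they are not part of Lucene.Net.ObjectMapping, and whether a given provider translates this exact combination is something to verify.

```csharp
using System;
using System.Linq;
using System.Linq.Expressions;

// Hypothetical helper, not part of Lucene.Net or Lucene.Net.ObjectMapping.
public static class PagingSketch
{
    // Returns one zero-based page of 'source', sorted descending by 'sortKey'.
    public static IQueryable<T> GetPage<T, TKey>(
        IQueryable<T> source,
        Expression<Func<T, TKey>> sortKey,
        int page,
        int pageSize)
    {
        // Treat negative page numbers as the first page.
        if (page < 0)
        {
            page = 0;
        }

        return source
            .OrderByDescending(sortKey)
            .Skip(page * pageSize)
            .Take(pageSize);
    }
}
```

With the searcher from the first example, a call could look like PagingSketch.GetPage(searcher.AsQueryable<BlogPost>(), post => post.Timestamp, 2, 10) to get the third page of ten posts.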

With this, it becomes easy to build paging based on objects you get back as a result of a query on Lucene.Net. I'm still working on improving the supported filter expressions (most of all for Where, but all the other filterable methods naturally profit too). For instance, with the default JSON-based object mapping it is already possible to search for entries in a dictionary that maps a string to another property or object. Say you have a set of classes, defined as follows.

public class MyClass
{
    public int Id { get; set; }
    public Dictionary<string, MyOtherClass> Map { get; set; }
}

public class MyOtherClass
{
    public string Text { get; set; }
    public int Sequence { get; set; }
    public DateTime Timestamp { get; set; }
}

Now you can actually search for instances of MyClass that satisfy certain conditions in the Map dictionary, like this:

var query = from c in searcher.AsQueryable<MyClass>()
            where c.Map["MyKey"].Sequence == 123
            select c;

Since the items in the dictionary are mapped to indexed fields in the Lucene.Net document, we can search on them!

Delete and Update By Query

Now since I have this query expression binder to create Lucene.Net queries based on LINQ filter expressions, I've added an extension method to update and one to delete documents that match a query. So it is now possible to do this:

indexWriter.Delete<MyClass>(x => x.Id == 1234);
indexWriter.Update(myObject, x => x.Id == myObject.Id);

Call to Action

Now with all this said, I'm looking for volunteers to help me get more coverage on the LINQ queries, because that's definitely where the weak spot is right now. If you're interested, leave a comment here or on GitHub.

Improvements to Lucene.Net.ObjectMapping

I'd like to discuss some improvements to Lucene.Net.ObjectMapping which I published yesterday as a new version (1.0.3) to NuGet. In addition, I want to take this opportunity to give a quick outlook on what's to come next.

CRUD Operations

The library now comes with support for all of the CRUD operations. Let's look at them one by one, starting with Create.

Create / Add

In Lucene.Net terms, that would be AddDocument. Since the library does object-to-document mapping, this is simplified to an Add operation.

IndexWriter myIndexWriter = ...;
MyClass myObject = new MyClass(...);

myIndexWriter.Add(myObject);

Or, if you need a specific analyzer for the document the object gets mapped to, you can use the overload which accepts a second parameter of type Analyzer.

IndexWriter myIndexWriter = ...;
MyClass myObject = new MyClass(...);

myIndexWriter.Add(myObject, new MyOwnAnalyzer());

Retrieve / Query

The retrieve operation, i.e. the mapping of a document to an object, hasn't changed since v1.0.0. There are examples of how to query and retrieve in my previous post. Of course, if you happen to know the ID of the document without a query, then you can just map that document to your class without going through a query. But since document IDs can change over time, it's usually more practical to pivot off a query.

Update

Update is maybe the most interesting operation here. Since document IDs can change over time, there's really no reliable way to update a specific document without a query. That's why the UpdateDocument method on the IndexWriter asks you for a term to match the document to update. And that's why it's generally a good idea to bring your own unique identifier to the game. Suppose your class has a property of type Guid and name “Id”, which is used as your unique identifier for the objects of that type.

IndexWriter myIndexWriter = ...;
MyClass myObject = ...;

myObject.MyPropertyToUpdate = "new value";

myIndexWriter.Update(
    myObject,
    new TermQuery(new Term("Id", myObject.Id.ToString())));

Under the covers, this will find all the documents matching both the query and the type (MyClass), delete them, and then add a new document for the mapped myObject. If you need an analyzer for the newly mapped document, you can use the second overload.

IndexWriter myIndexWriter = ...;
MyClass myObject = ...;

myObject.MyPropertyToUpdate = "new value";

myIndexWriter.Update(
    myObject,
    new TermQuery(new Term("Id", myObject.Id.ToString())),
    new MyOwnAnalyzer());
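
To make the delete-then-add behavior described above a bit more concrete, here's a rough sketch that uses only plain Lucene.Net 3.0.3 calls. It is not the library's actual implementation. UpdateSketch and ReplaceMatching are hypothetical names, and since this post doesn't show how the library restricts documents by type, the sketch simply takes that restriction as a typeFilter query parameter.

```csharp
using Lucene.Net.Documents;
using Lucene.Net.Index;
using Lucene.Net.Search;

// Hypothetical helper; an approximation, not the code in Lucene.Net.ObjectMapping.
public static class UpdateSketch
{
    public static void ReplaceMatching(
        IndexWriter writer,
        Query match,           // e.g. the TermQuery on "Id" from the example above
        Query typeFilter,      // assumption: some query that selects documents of one type
        Document replacement)  // e.g. the result of myObject.ToDocument()
    {
        // Only documents that satisfy BOTH the caller's query and the
        // type restriction are deleted.
        BooleanQuery toDelete = new BooleanQuery();
        toDelete.Add(match, Occur.MUST);
        toDelete.Add(typeFilter, Occur.MUST);

        writer.DeleteDocuments(toDelete);

        // Then the freshly mapped document takes their place.
        writer.AddDocument(replacement);
    }
}
```

Note that if the query matches more than one document of that type, all of them are replaced by the single new document, which is another reason to match on a unique identifier.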

Delete

Just like the retrieve operation, the Delete operation has been supported since v1.0.0. I realize though that I haven't given any examples yet. But really, it's quite simple again. You specify the type of the objects whose mapped documents you want to delete, and you provide a query to identify the objects to delete. No magic at all.

IndexWriter myIndexWriter = ...;
myIndexWriter.DeleteDocuments<MyClass>(
    new TermQuery(new Term("Tag", "deleted")));

Naturally, you can use any Query you want for the delete operation (as well as for updates). You can make them arbitrarily complex as long as they're still supported by Lucene.Net.

Summary and Outlook

That's it, CRUD with no magic, no tricks. Let me know if there's functionality you'd like to see added, either by commenting here or by opening a bug/enhancement/whatever on GitHub. I've started working on LINQ support for the ObjectMapping library too, with the goal that you can write LINQ queries like the following.

var query = from myObject in mySearcher.AsQueryable<MyClass>()
            where myObject.Tag == "history"
            select myObject;

It will likely take a little longer to get that stable, but I'll try to make a pre-release on NuGet in the next few weeks.

Search Mapped Objects in Lucene.Net

In my previous post (Lucene.Net Object Mapping) I introduced the Lucene.Net.ObjectMapping NuGet package. The post describes how the package can be used to map virtually any .Net object to a Lucene.Net Document and how to reconstruct the object from that same Document later. Now it's time to look at the search aspect of it, so how can you search mapped objects in Lucene.Net?

You already know Searcher

The Searcher class in Lucene.Net can be used to run queries on an index and retrieve documents matching that query. The Lucene.Net.ObjectMapping library comes with additional extensions to the Searcher class which help you search for Documents. There's a variety of different extensions: some just return a TopDocs object with the number of results you've specified, and some allow sorting, but the more powerful ones require you to specify a Collector to gather the results.

Using a Collector makes it very easy to support paging over all the results for a specific query, and after all that's usually what you'd do today if you want to show search results. So let's look at an example of searching for Documents that contain mapped .Net objects using a Collector. Let's assume we're building a blog engine, for which we want to index the posts.

public class BlogPost
{
    public Guid Id { get; set; }
    public DateTime Created { get; set; }
    public string Title { get; set; }
    public string Body { get; set; }
    public string[] Tags { get; set; }
}

// ... as before, you'd store your BlogPost objects like this:
luceneIndexWriter.AddDocument(thePost.ToDocument());

Use a Collector for Paging

Creating a paged index of all your blog posts is very easy, really. You'll need a Searcher, a Collector (the TopFieldCollector will do for now) and that's about it. Let's look at some code.

private const int PageSize = 10;

public BlogPost[] GetPostsForPage(int page)
{
    // Sanitize the 'page' before doing anything with it.
    if (page < 0)
    {
        page = 0;
    }

    int start = page * PageSize;
    int end = start + PageSize;

    using (Searcher searcher = new IndexSearcher(myIndexReader))
    {
        TopFieldCollector collector = TopFieldCollector.Create(
            // Let's sort descending by create date.
            new Sort(new SortField("Created", SortField.LONG, true)),
            end, // Need to get the hits until 'end'.
            false,
            false,
            false,
            false);

        // Let's use the object mapping extensions for Search! This will
        // filter results to only those Documents which hold a BlogPost.
        searcher.Search<BlogPost>(new MatchAllDocsQuery(), collector);

        // At this point we know how many hits there are in total. So
        // let's check that the requested page is within range.
        if (start >= collector.TotalHits)
        {
            page = (collector.TotalHits - 1) / PageSize;
            start = page * PageSize;
            end = start + PageSize;
        }

        TopDocs docs = collector.TopDocs(start, PageSize);
        List<BlogPost> posts = new List<BlogPost>();

        foreach (ScoreDoc scoreDoc in docs.ScoreDocs)
        {
            Document doc = searcher.Doc(scoreDoc.Doc);

            posts.Add(doc.ToObject<BlogPost>());
        }

        return posts.ToArray();
    }
}

That's it, no magic, no tricks. Instead of returning a plain array with the results, you could return an object that holds some more meta information, for instance the total number of hits or the actual page you're returning results for. The core logic remains the same. You can play around with different ways to sort the results. Keep in mind though that tokenized/analyzed fields in Lucene.Net are sorted based on the tokens, not on the actual string value. To help address this, I'm thinking about extending the object mappers so that you can specify not only that a field is analyzed (because you want to search it), but also that a non-analyzed copy of the field is added for sorting purposes. That way, you can search and sort on the same logical field. The trade-off is that the index will grow, since the data is indexed twice: once tokenized/analyzed, once as-is.
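
The result object with meta information mentioned above could look something like the following sketch. PagedResult<T> and Paging.ClampPage are hypothetical names of my own, not part of Lucene.Net or the ObjectMapping library, and pages are zero-based as in GetPostsForPage.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical container for one page of results plus some meta information.
public sealed class PagedResult<T>
{
    public PagedResult(IList<T> items, int page, int pageSize, int totalHits)
    {
        Items = items;
        Page = page;
        PageSize = pageSize;
        TotalHits = totalHits;
    }

    public IList<T> Items { get; private set; }
    public int Page { get; private set; }       // the page actually returned
    public int PageSize { get; private set; }
    public int TotalHits { get; private set; }  // hits across all pages

    // Number of pages needed to show all hits (rounded up).
    public int PageCount
    {
        get { return (TotalHits + PageSize - 1) / PageSize; }
    }
}

public static class Paging
{
    // Clamps a requested zero-based page to the range that exists for 'totalHits'.
    public static int ClampPage(int page, int pageSize, int totalHits)
    {
        if (pageSize <= 0)
        {
            throw new ArgumentOutOfRangeException("pageSize");
        }

        if (page < 0 || totalHits <= 0)
        {
            return 0;
        }

        int lastPage = (totalHits - 1) / pageSize;
        return Math.Min(page, lastPage);
    }
}
```

In GetPostsForPage, the method could then end with something like return new PagedResult<BlogPost>(posts, page, PageSize, collector.TotalHits), so the caller knows which page it actually got back.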

Lucene.Net Object Mapping

Today I finally took some time to turn a little library I've used for a while now into a NuGet package, called Lucene.Net.ObjectMapping. At the same time, I also uploaded the code to GitHub. But let's look at Lucene.Net Object Mapping in more detail.

How To Install

Since this is a NuGet package, installation is as simple as running the following command in the Package Manager Console:

Install-Package Lucene.Net.ObjectMapping

Alternatively, you can just search for Lucene.Net.ObjectMapping in the package manager and you should find it.

How To Use It?

Using object mapping is as simple as calling two methods: ToDocument to convert an object into a document and ToObject to convert a Document (that was created with the ToDocument method) into the original object.

MyObject obj = ...;
Document doc = obj.ToDocument();
// Save the document to your Lucene.Net Index

// Later, load the document from the index again
Document docFromIndex = ...;
MyObject objFromDoc = docFromIndex.ToObject<MyObject>();

How does it work?

Under the covers, the library JSON-serializes the object and stores the JSON in the actual Lucene.Net document. In addition, it stores some metadata like the actual and the static types of the object you stored, as well as the timestamp (ticks) of when the document was created. The type information is used when you search for documents that were created for a specific type. The static type is the type you pass in as the type parameter to ToDocument, whereas the actual type is the actual (dynamic) type of the object you're passing in. Since all this information is stored in the document too, there are no issues re-creating objects from a class hierarchy.

In addition to storing the object information itself, the library also indexes the individual properties of the object you're storing, including nested properties. By default, it uses a mapper which works as follows.

Public properties/fields of objects are mapped to Lucene.Net fields with the same name; e.g. a property called “Id” is mapped to a field called “Id”.
Properties/fields that are arrays are mapped to multiple Lucene.Net fields, all with the same name (the name of the property that holds the array).
Nested properties/fields, i.e. objects from properties/fields, use the name of the property as a prefix for the properties/fields of the object.

Each field is created with the following mapping of field types:

Boolean properties are mapped to a numeric field (Int) with a value of 1 for true and 0 for false.
DateTime properties are mapped to a numeric field (Long) with the value being the Ticks property of the DateTime.
Float properties are mapped to a numeric field (Float) with the value being the float value.
Double and Decimal properties are mapped to a numeric field (Double) with the value being the double value.
Guid properties are mapped to string fields which are NOT_ANALYZED, i.e. you can search for the GUID as is.
Integer (also Long, Short, and Byte as well as their unsigned/signed counterparts) properties are mapped to a numeric field (Long) with the value being the integer value.
Null values are not mapped at all; thus, the absence of a field implies the corresponding property is null.
String properties are mapped to string fields which are ANALYZED.
TimeSpan properties are mapped to a numeric field (Long) with the value being the Ticks property of the TimeSpan.
Uri properties are mapped to string fields which are ANALYZED.

Example Mapping

Let's look at a simple example of an object and its mapping to a Lucene.Net Document. Consider the following object model.

public class MyObject
{
    public int Id { get; set; }
    public string Name { get; set; }
    public ObjectMeta Meta { get; set; }
}

public class ObjectMeta
{
    public DateTime LastModified { get; set; }
    public string ModifiedBy { get; set; }
    public string[] Modifications { get; set; }
}

// Create an instance of MyObject
MyObject obj = new MyObject()
{
    Id = 1234,
    Name = "My Lucene.Net mapped Object",
    Meta = new ObjectMeta()
    {
        LastModified = DateTime.UtcNow,
        ModifiedBy = "the dude",
        Modifications = new string[] { "changed a", "removed b", "added c" },
    },
};

Document doc = obj.ToDocument();

The mapping rules called out above will add the following fields for searching to the document. Please note that I'm not calling out the fields needed for the internal workings of the Lucene.Net.ObjectMapping library.

Field Name	Type	Value
Id	Numeric / Long	1234
Name	String / ANALYZED	My Lucene.Net mapped Object
Meta.LastModified	Numeric / Long	< the number of ticks at the current time >
Meta.ModifiedBy	String / ANALYZED	the dude
Meta.Modifications	String / ANALYZED	changed a
Meta.Modifications	String / ANALYZED	removed b
Meta.Modifications	String / ANALYZED	added c
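
To illustrate how the naming rules produce the field names in the table, here's a small reflection-based sketch. It is my own re-implementation of the rules for illustration, not the code the library uses. FieldNameSketch and Flatten are hypothetical names, and the sketch only looks at public properties (no fields, no dictionaries) and only converts bool, DateTime and TimeSpan values.

```csharp
using System;
using System.Collections;
using System.Collections.Generic;
using System.Reflection;

// Hypothetical illustration of the naming rules; not the library's mapper.
public static class FieldNameSketch
{
    // Returns (field name, value) pairs for the public properties of 'source'.
    public static List<KeyValuePair<string, object>> Flatten(object source, string prefix = "")
    {
        var fields = new List<KeyValuePair<string, object>>();
        if (source == null)
        {
            return fields;
        }

        foreach (PropertyInfo property in
            source.GetType().GetProperties(BindingFlags.Public | BindingFlags.Instance))
        {
            if (property.GetIndexParameters().Length > 0)
            {
                continue; // skip indexers
            }

            Add(fields, prefix + property.Name, property.GetValue(source, null));
        }

        return fields;
    }

    private static void Add(List<KeyValuePair<string, object>> fields, string name, object value)
    {
        if (value == null)
        {
            return; // null values are not mapped at all
        }

        if (value is bool) value = (bool)value ? 1 : 0;
        else if (value is DateTime) value = ((DateTime)value).Ticks;
        else if (value is TimeSpan) value = ((TimeSpan)value).Ticks;

        if (IsLeaf(value))
        {
            fields.Add(new KeyValuePair<string, object>(name, value));
        }
        else if (value is IEnumerable)
        {
            // Arrays: one field per item, all with the same name.
            foreach (object item in (IEnumerable)value)
            {
                Add(fields, name, item);
            }
        }
        else
        {
            // Nested object: the property name becomes the prefix.
            fields.AddRange(Flatten(value, name + "."));
        }
    }

    private static bool IsLeaf(object value)
    {
        return value is string || value is decimal || value is Guid || value is Uri
            || value.GetType().IsPrimitive;
    }
}
```

Calling FieldNameSketch.Flatten(obj) on the MyObject instance from above should produce the same field names as in the table, including three separate entries named Meta.Modifications.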

The mapper is by no means complete. There are ideas for extending it in the future, including functionality to

specify attributes on string properties (or properties mapped to string fields) to specify how to index the string (NO vs ANALYZED vs NOT_ANALYZED vs NOT_ANALYZED_NO_NORMS vs ANALYZED_NO_NORMS).
specify attributes on any properties to define how to map the field, e.g. by specifying a class which can map the field

In a follow-up post, I'll talk a little more about how to use all this when searching for documents in your Lucene.Net index. But as a sneak preview: the library also provides extension methods to the Searcher class from Lucene.Net that you can use to specify an object type to filter your documents on.
