Improving Lucene Search Performance

By Jayesh Gangadharan – Sr. Software Engineer – ADP Cobalt Social Media Team

Lucene Search

The Lucene search API builds an index of your data and lets you search that index instead of querying a traditional relational database. Lucene is primarily used for indexing document text, but it can also be used to produce facets for tabular data. That was our primary use, because producing facets and counts from a relational model is much more expensive.
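
As a rough intuition for why an index makes facet counts cheap, here is a minimal plain-Java sketch. It is a toy, not Lucene's data structures or API: it keeps a map from each field value to the IDs of the rows that carry it, so a facet count is the size of a list rather than a scan over all rows. The class name FacetSketch and the sample source values are hypothetical.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

// Illustrative only: a toy inverted index for a single field, not Lucene's implementation.
class FacetSketch {
    // field value -> IDs of the rows that carry that value
    private final Map<String, Set<Integer>> postings = new HashMap<>();

    void add(int rowId, String value) {
        postings.computeIfAbsent(value, v -> new HashSet<>()).add(rowId);
    }

    // A facet count is the size of each posting set; the rows themselves are not scanned.
    Map<String, Integer> facetCounts() {
        Map<String, Integer> counts = new TreeMap<>();
        for (Map.Entry<String, Set<Integer>> e : postings.entrySet()) {
            counts.put(e.getKey(), e.getValue().size());
        }
        return counts;
    }

    public static void main(String[] args) {
        FacetSketch index = new FacetSketch();
        // Hypothetical review sources, one per row.
        List<String> sources = List.of("google", "yelp", "google", "facebook", "google");
        for (int id = 0; id < sources.size(); id++) {
            index.add(id, sources.get(id));
        }
        System.out.println(index.facetCounts()); // {facebook=1, google=3, yelp=1}
    }
}
```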

Initial implementation 

We followed a master/slave configuration with Hibernate Search (which is built on Lucene), so each node maintains its own copy of the index and refreshes it every 300 seconds. Hibernate Search provides the MassIndexer API to build the index from the data source. We wipe the index and rebuild it from scratch every time the app starts up. The Reindexer is used for updating the index whenever the data changes in the database.
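
The practical effect of a node-local copy that refreshes on a timer can be sketched in plain Java. This is not how Hibernate Search implements it; the class LocalIndexCopy, the list-of-strings "index", and the supplier standing in for the master are all assumptions made for illustration. The point it shows is that a change on the master is invisible to a node until that node's next refresh.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicReference;
import java.util.function.Supplier;
import java.util.stream.Collectors;

// Illustrative only: a node-local, read-only copy that is swapped on a timer.
class LocalIndexCopy {
    private final Supplier<List<String>> master;          // stands in for the master index
    private final AtomicReference<List<String>> snapshot; // what this node searches

    LocalIndexCopy(Supplier<List<String>> master) {
        this.master = master;
        this.snapshot = new AtomicReference<>(List.copyOf(master.get()));
    }

    // Swap in a fresh copy in one atomic step; readers never see a half-built copy.
    void refresh() {
        snapshot.set(List.copyOf(master.get()));
    }

    List<String> search(String term) {
        return snapshot.get().stream()
                .filter(doc -> doc.contains(term))
                .collect(Collectors.toList());
    }

    // Refresh on a fixed period, for example 300 seconds as in the setup described above.
    ScheduledExecutorService startRefreshing(long periodSeconds) {
        ScheduledExecutorService timer = Executors.newSingleThreadScheduledExecutor();
        timer.scheduleAtFixedRate(this::refresh, periodSeconds, periodSeconds, TimeUnit.SECONDS);
        return timer;
    }

    public static void main(String[] args) {
        List<String> masterDocs = new ArrayList<>(List.of("great service"));
        LocalIndexCopy node = new LocalIndexCopy(() -> masterDocs);
        masterDocs.add("great price");                    // the change lands on the master only
        System.out.println(node.search("great").size());  // 1: the local copy is stale
        node.refresh();                                   // what the timer would do
        System.out.println(node.search("great").size());  // 2: visible after the refresh
    }
}
```

In a real node, startRefreshing(300) would run the swap in the background, and the returned executor should be shut down when the app stops.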

Our Scalability Challenge

Under low load the index queries performed adequately. However, when we started onboarding thousands of dealers with hundreds of thousands of records, we saw significant degradation due to the number of filters and range queries we were performing. Our date range queries were taking ~250 milliseconds and caused CPU usage to spike, which backed up threads, produced locks, and caused downstream contention.

Range Queries

Here is one example of how we create range queries. (We use JPA (Hibernate) as a wrapper on top of Lucene to make our life easier with persistence.)

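// Each should(...) adds an optional clause to the boolean query.
BooleanJunction junction = builder.bool();

// Date range clause: created date below the computed delay cutoff.
junction.should(builder.range()
  .onField(PUBLISH_RULE_CREATED_DATE_ATTRIBUTE)
  .below(calculateDelayUntilDate(new Date(), spec, STANDARD_DELAY_DAYS))
  .createQuery());

// Numeric range clause: rating above the threshold (lower bound exclusive, no upper bound).
Integer rating = calculateRatingThreshold(spec);
if (rating != null) {
  junction.should(NumericRangeQuery.newIntRange("rating", rating, null, false, true));
}

// One keyword clause per excluded source.
for (ExcludedSource source : ExcludedSource.values()) {
  junction.should(builder.keyword()
    .onField(SOURCE_ATTRIBUTE)
    .matching(source.toString())
    .createQuery());
}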

How we diagnosed and fixed the issue

  • Remove unnecessary fields (@Fields) from the index that we don’t use during search.
  • Remove data from the index that we don’t use during search. For example, a customer’s private review is not displayed publicly, so we don’t need it in the index.
  • Add data to the index only when the search requires it. For example, we don’t show reviews publicly for X business days from the published date. Do not add everything from your database and then filter the index.
  • Update the MassIndexer / Reindexer to add or remove data from the index only when the data becomes eligible for searching. For example, it adds a customer review after X business days, so no range queries on the dates are needed.
  • If the data isn’t changing much, then cache it (we all know that).
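
The idea behind the third and fourth points can be sketched in plain Java: decide eligibility when the index is built or updated, so that a search never needs a date range clause. This is a sketch under stated assumptions, not our production code and not Hibernate Search API. The names EligibilitySketch, Review, and selectForIndex are hypothetical, and business days are taken to mean Monday through Friday with holidays ignored.

```java
import java.time.DayOfWeek;
import java.time.LocalDate;
import java.util.ArrayList;
import java.util.List;

// Illustrative only: filter at indexing time instead of at query time.
class EligibilitySketch {
    static final class Review {
        final String text;
        final LocalDate published;
        final boolean isPrivate;

        Review(String text, LocalDate published, boolean isPrivate) {
            this.text = text;
            this.published = published;
            this.isPrivate = isPrivate;
        }
    }

    // Advance by whole business days, skipping Saturdays and Sundays (holidays are ignored).
    static LocalDate addBusinessDays(LocalDate start, int days) {
        LocalDate d = start;
        int added = 0;
        while (added < days) {
            d = d.plusDays(1);
            DayOfWeek day = d.getDayOfWeek();
            if (day != DayOfWeek.SATURDAY && day != DayOfWeek.SUNDAY) {
                added++;
            }
        }
        return d;
    }

    // A review is searchable only if it is public and the delay has elapsed.
    static boolean isEligible(Review r, LocalDate today, int delayBusinessDays) {
        return !r.isPrivate && !today.isBefore(addBusinessDays(r.published, delayBusinessDays));
    }

    // What a reindex pass would write to the index: only rows that are searchable today.
    static List<Review> selectForIndex(List<Review> all, LocalDate today, int delayBusinessDays) {
        List<Review> eligible = new ArrayList<>();
        for (Review r : all) {
            if (isEligible(r, today, delayBusinessDays)) {
                eligible.add(r);
            }
        }
        return eligible;
    }

    public static void main(String[] args) {
        LocalDate friday = LocalDate.of(2024, 3, 1);
        List<Review> all = List.of(
                new Review("public, old enough", friday, false),
                new Review("private", friday, true),
                new Review("public, too recent", friday.plusDays(3), false));
        // With a 2-business-day delay, only the first review is indexed on Tuesday 2024-03-05.
        System.out.println(selectForIndex(all, LocalDate.of(2024, 3, 5), 2).size()); // 1
    }
}
```

Because the index then holds only eligible reviews, the search query can drop the date clause entirely, at the cost of running the reindex pass often enough for newly eligible reviews to appear on time.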

Results after our changes

Our search requests went from 250 milliseconds to 10 milliseconds. Yes, that’s a 96% improvement ((250 - 10) / 250 = 0.96). We were able to scale the service from 10 requests per second to 40 requests per second.

Conclusion

Lucene search is good and flexible, but knowing its limitations is very important. In our experience, range queries performed very poorly. Other APIs like Elasticsearch look promising for solving some of these issues and have quite a bit more functionality out of the box.

