Improving Lucene Search Performance
October 17, 2013
By Jayesh Gangadharan – Sr. Software Engineer – ADP Cobalt Social Media Team
Lucene Search
The Lucene search API builds an index of your data and lets you query that index instead of a traditional relational database. Lucene is primarily used for indexing document text, but it can also be used to produce facets for tabular data. That was our primary use, because producing facets and counts from a relational model is much more expensive.
Initial implementation
We followed a master/slave configuration with Hibernate Search (which uses Lucene), so we maintain a copy of the index on each node and refresh it every 300 seconds. Hibernate Search provides the MassIndexer API to build the index from the data source. We basically wipe the index and rebuild it from scratch every time the app starts up. A reindexer is used for updating the index whenever data changes in the database.
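For readers unfamiliar with this layout, a slave node in a Hibernate Search master/slave setup is typically configured along the following lines. This is only a sketch, assuming the filesystem-based directory providers: the paths are placeholders, and the only value taken from the description above is the 300-second refresh interval.

```properties
# Slave node: searches run against a local copy of the index
hibernate.search.default.directory_provider = filesystem-slave
# Shared location where the master publishes its index copy (placeholder path)
hibernate.search.default.sourceBase = /mnt/shared/lucene/master-copy
# Local directory holding this node's working copy (placeholder path)
hibernate.search.default.indexBase = /var/lucene/indexes
# Pull a fresh copy from sourceBase every 300 seconds
hibernate.search.default.refresh = 300
```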
Our Scalability Challenge
Under low load the index queries performed adequately. However, when we started onboarding thousands of dealers with hundreds of thousands of records, we saw significant degradation due to the number of filters and range queries we were performing. Our date range queries were taking ~250 milliseconds and caused CPU usage to spike, which backed up threads, produced locks, and caused downstream contention.
Range Queries
Here is one example of how we create range queries (we use JPA (Hibernate) as a wrapper on top of Lucene to make our life easier with persistence):
    BooleanJunction junction = builder.bool();

    // Date range: created date up to the computed delay-until date
    junction.should(builder.range()
            .onField(PUBLISH_RULE_CREATED_DATE_ATTRIBUTE)
            .below(calculateDelayUntilDate(new Date(), spec, STANDARD_DELAY_DAYS))
            .createQuery());

    // Optional numeric range: rating above the threshold
    // (lower bound exclusive, no upper bound)
    Integer rating = calculateRatingThreshold(spec);
    if (rating != null) {
        junction.should(NumericRangeQuery.newIntRange("rating", rating, null, false, true));
    }

    // One keyword clause per excluded source
    for (ExcludedSource source : ExcludedSource.values()) {
        junction.should(builder.keyword()
                .onField(SOURCE_ATTRIBUTE)
                .matching(source.toString())
                .createQuery());
    }
How we diagnosed and fixed the issue
- Remove fields (@Field) from the index that we don’t use during search.
- Remove data from the index that we don’t use during search. For example, a customer’s private review is not displayed publicly, so we don’t need it in the index.
- Add data to the index only when search requires it. For example, we don’t show reviews publicly until X business days after the published date, so those reviews don’t belong in the index yet. Do not add everything from your database and then filter the index at query time.
- Update the MassIndexer / reindexer to add or remove data only when it becomes eligible for searching. For example, it adds a customer review after X business days have passed, so no range queries on dates are needed.
- If the data isn’t changing much, then cache it (we all know that).
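The core idea in the list above, deciding eligibility when indexing rather than filtering with a date range query when searching, can be sketched in plain Java. This is a minimal illustration and not the actual implementation: the `Review` class, `ReviewIndexEligibility`, and all method names are hypothetical, business days are assumed to be Monday through Friday, and public holidays are ignored.

```java
import java.time.DayOfWeek;
import java.time.LocalDate;
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: decide at indexing time whether a review belongs in
// the index, so that searches need no date range query.
class ReviewIndexEligibility {

    // Minimal stand-in for an indexed entity (not the real domain class).
    static final class Review {
        final String id;
        final LocalDate publishedDate;
        final boolean isPrivate;

        Review(String id, LocalDate publishedDate, boolean isPrivate) {
            this.id = id;
            this.publishedDate = publishedDate;
            this.isPrivate = isPrivate;
        }
    }

    // Returns the date that falls `businessDays` working days after `start`,
    // skipping Saturdays and Sundays (assumption: no holiday calendar).
    static LocalDate addBusinessDays(LocalDate start, int businessDays) {
        LocalDate date = start;
        int added = 0;
        while (added < businessDays) {
            date = date.plusDays(1);
            DayOfWeek day = date.getDayOfWeek();
            if (day != DayOfWeek.SATURDAY && day != DayOfWeek.SUNDAY) {
                added++;
            }
        }
        return date;
    }

    // A review is eligible once it is public and the delay has elapsed.
    static boolean isEligible(Review review, LocalDate today, int delayBusinessDays) {
        if (review.isPrivate) {
            return false; // never shown publicly, so never indexed
        }
        LocalDate visibleFrom = addBusinessDays(review.publishedDate, delayBusinessDays);
        return !today.isBefore(visibleFrom);
    }

    // Filters the candidates down to what the indexer should add.
    static List<Review> selectForIndexing(List<Review> candidates, LocalDate today,
                                          int delayBusinessDays) {
        List<Review> eligible = new ArrayList<>();
        for (Review review : candidates) {
            if (isEligible(review, today, delayBusinessDays)) {
                eligible.add(review);
            }
        }
        return eligible;
    }
}
```

A periodic reindexing job could call `selectForIndexing` with the current date and hand only the returned reviews to the indexer. The trade-off is that eligibility is only as fresh as the last reindexing run.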
Results after our changes
Our search requests went from 250 milliseconds to 10 milliseconds. Yes, that’s a 96% reduction in response time. We were also able to scale the service from 10 requests per second to 40 requests per second.
Conclusion
Lucene search is good and flexible, but knowing its limitations is very important. In our experience, range queries performed very poorly. Other APIs like Elasticsearch look promising for solving some of these issues and have quite a bit more functionality out of the box.