Improving Lucene Search Performance

By Jayesh Gangadharan – Sr. Software Engineer – ADP Cobalt Social Media Team

Lucene Search

The Lucene search API builds an index of your data and lets you search the indexed data instead of using a traditional relational database.  Lucene is primarily used for indexing document text, but it can also be used to produce facets for tabular data.  That was our primary use, since producing facets and counts from a relational model is much more expensive.

Initial implementation 

We followed a master/slave configuration with Hibernate Search (which uses Lucene under the hood), so we maintain a copy of the index on each node and refresh the indexes every 300 seconds.   Hibernate Search provides the MassIndexer API to create the index data from the datasource; we essentially wipe the index and rebuild it from scratch every time the app starts up. A reindexer is used for updating the index data whenever changes happen in the database.

Our Scalability Challenge

Under low load the index queries performed adequately.  However, when we started onboarding thousands of dealers with hundreds of thousands of records, we saw significant degradation due to the number of filters and range queries we were performing.  Our date range queries were taking ~250 milliseconds, which caused CPU usage to spike, backed up thread usage, produced locks and caused downstream contention.

Range Queries

One example of how we create range queries (we use JPA (Hibernate) as a wrapper on top of Lucene to make our life easier with persistence):

// Match records created before the computed delay cutoff date
BooleanJunction junction = builder.bool();
junction.should(builder.range()
  .onField(PUBLISH_RULE_CREATED_DATE_ATTRIBUTE)
  .below(calculateDelayUntilDate(new Date(), spec, STANDARD_DELAY_DAYS))
  .createQuery());

// Match records strictly above the rating threshold, when one applies
Integer rating = calculateRatingThreshold(spec);
if (rating != null) {
  junction.should(NumericRangeQuery.newIntRange("rating", rating, null, false, true));
}

// Match records from each excluded source by keyword
for (ExcludedSource source : ExcludedSource.values()) {
  junction.should(builder.keyword()
    .onField(SOURCE_ATTRIBUTE)
    .matching(source.toString())
    .createQuery());
}

How we diagnosed and fixed the issue

  • Remove unnecessary fields (@Fields) from the index that we don’t use during search.
  • Remove data from the index that we don’t use during search operations. For example, a customer’s private review is not displayed publicly, so we don’t need it in the index.
  • Add data into the index only when the search requires it. For example, we don’t show reviews publicly for X business days from the published date. Don’t add everything from your database and then filter the index.
  • Update the MassIndexer/reindexer to add or remove data from the index only when the data becomes eligible for searching. For example, it adds a customer review after X business days, so no range queries on dates are needed.
  • If the data isn’t changing much, cache it (we all know that).
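The common thread in these fixes is moving the eligibility decision from query time to index time. A minimal sketch of that idea in Python (the function names and the business-day rule are illustrative, not our actual Hibernate Search code):

```python
from datetime import date, timedelta

def add_business_days(start: date, days: int) -> date:
    """Advance a date by the given number of business days (Mon-Fri)."""
    current = start
    remaining = days
    while remaining > 0:
        current += timedelta(days=1)
        if current.weekday() < 5:  # Monday..Friday
            remaining -= 1
    return current

def is_eligible_for_index(published: date, today: date, delay_days: int) -> bool:
    """A review enters the index only once its embargo has elapsed,
    so searches never need a date-range filter."""
    return today >= add_business_days(published, delay_days)
```

With eligibility decided at indexing time, the searcher never needs the expensive date-range clause at all.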

Results after our changes

Our search requests went from 250 milliseconds to 10 milliseconds. Yes, that’s a 96% improvement. We were also able to scale the service from 10 requests per second to 40 requests per second.

Conclusion

Lucene search is good and flexible, but knowing its limitations is very important.   Range queries performed very poorly in our experience.    Other offerings like Elasticsearch look promising in solving some of these issues and have quite a bit more functionality out of the box.

Troubleshooting Stuck Processes on Linux

By Douglas Leonard – Sr. Software Engineer – ADP Cobalt Inventory Services

If you have a batch job or process on Linux that doesn’t seem to be doing anything, it can be difficult to figure out where exactly it is stuck. You can run strace on the process to figure out if it is hung on a system call but that may not tell you as much as you would like. Lots of processes have filehandles, network sockets, pipes, or other connections open so you don’t always know what the bottleneck or blockage is. Here is an example of a process that was stuck doing nothing for a long period of time and how we found out what it was stuck on. Originally the problem was found because the logfile hadn’t been written to in 16+ hours. But what was wrong? Had it died? Was it stuck? Was it just failing to write out the log?

Get the process ID by querying the process table

ps -eo pid,args | grep garbage

This gives you the PID and the command that was run (filtered by grep to only return what we care about). Note: this will sometimes include the grep command itself; it is excluded here as irrelevant.

21129 perl /web/imageserver/bin/photo_garbage_collector.pl --mode=actor --noclean --throttle=233 --logfile=/web/imageserver/log/garbagecollector.type2.part2.log

The process is running. The way this process works it reads from a file, writes to a log file, does filesystem deletes against an NFS mounted filer, has a database connection, and is a child process of another job. Any one of those things could potentially be causing this job to run inordinately slowly.

Use strace on the process to see if it is doing any work.

strace -p 21129
Process 21129 attached - interrupt to quit
read(6,
Process 21129 detached

Nothing ever happened, so I used Ctrl+C to break out of it. The process is definitely doing a whole lot of nothing from a system perspective.

Use netstat to see if the process has an open network socket.

netstat -p | grep 21129
(Not all processes could be identified, non-owned process info
will not be shown, you would have to be root to see it all.)
tcp 0 0 batch_machine.env.company:45950 db.env.company:ncube-lm ESTABLISHED 21129/perl

This machine does have a network connection established from port 45950 to db.env.company… I could have gotten more information, but I didn’t include enough command-line switches to netstat. Let’s look up some more info with different switches.

netstat -an | grep 45950
tcp 0 0 11.63.7.63:45950 11.63.21.32:1521 ESTABLISHED

That’s a connection to the Oracle database (1521 is Oracle’s default listener port). But is it the stuck read(6 we saw via strace? At this point we don’t know. In fact, we don’t even know what else it has open that it could be reading.

Get all of the file descriptors for the process

ls -l /proc/21129/fd
total 0
lr-x------ 1 inv users 64 Sep 13 09:16 0 -> pipe:[4230561338]
l-wx------ 1 inv users 64 Sep 13 09:16 1 -> /web/imageserver/log/garbagecollector.stdout
l-wx------ 1 inv users 64 Sep 13 09:16 2 -> /web/imageserver/log/garbagecollector.stdout
lr-x------ 1 inv users 64 Sep 13 09:16 3 -> /web/imageserver/data/garbagecollector.1.txt
lrwx------ 1 inv users 64 Sep 13 09:16 5 -> socket:[4231677600]
lrwx------ 1 inv users 64 Sep 13 09:16 6 -> socket:[4231677602]

This tells us that the process has six file descriptors open, and that descriptor 6, which it is reading from in the strace output, is a socket. But is it the database connection socket or something else?

Convert the port number to hex so it can be looked up in /proc/net/tcp and correlated with the socket inode.

perl -e 'printf("%X\n", 45950)'
B37E

Look up the corresponding tcp entry by that port in hex

awk '$2~/:B37E/ {print $0}' < /proc/net/tcp
548: 3D08400A:B37E 1E16400A:05F1 01 00000000:00000000 00:00000000 00000000 432 0 4231677602 1 ffff8103162e1980 231 40 13 3 -1

Notice the 4231677602 in that line? That’s the socket inode, matching file descriptor 6 from /proc/21129/fd. That tells us that the process waiting on read(6 is waiting on a read from the database. That gives us enough information to go talk to the database administrators and figure out why the database call is taking forever.
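The hex conversion and the /proc/net/tcp lookup can be combined into one small script. A sketch in Python, using the line above as canned sample data (field positions follow the /proc/net/tcp layout):

```python
# A /proc/net/tcp entry: sl, local_address, rem_address, st, tx:rx,
# tr:tm->when, retrnsmt, uid, timeout, inode, ...
SAMPLE = ("548: 3D08400A:B37E 1E16400A:05F1 01 00000000:00000000 "
          "00:00000000 00000000 432 0 4231677602 1 ffff8103162e1980 "
          "231 40 13 3 -1")

def find_socket_inode(tcp_line, port):
    """Return the socket inode for an entry whose local port matches,
    or None if the port differs."""
    fields = tcp_line.split()
    local_port = int(fields[1].split(":")[1], 16)  # port is in hex
    return fields[9] if local_port == port else None

print(find_socket_inode(SAMPLE, 45950))  # prints 4231677602
```

On a real system you would loop over every line of /proc/net/tcp (skipping the header) and compare the inode against the socket:[…] targets in /proc/PID/fd.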

Effects of Linux Context Switching on High Performance Web Applications

By Robert Dominy – Engineering Director – ADP Cobalt Display Advertising Platform

If you are writing a high performance web application and attempting to evaluate performance, one thing you should be aware of is the effect of Linux context switching.  Under Linux, the system scheduler allocates time slices to running processes, and as processes exceed their allocated slice, they can be interrupted to give time to other processes.  These context switches can be expensive when you are measuring performance in the tens of milliseconds or less.

Example
On a production server that is moderately busy (handling about 100 HTTP requests/sec), a series of timing tests were conducted.  The server is a virtualized system running CentOS 6.2 with 4 CPUs and 32GB of RAM allocated to the virtual machine.  It runs at loads ranging from about 0.5 to 1.5.

A simple loop was implemented in PHP where the timed task was basically this:

$limit = 7000;
$iterations = 0;
while ($iterations < $limit)
{
    $iterations++;
    $last=microtime(true);
}

Here is a histogram of the timed results (in milliseconds):
0-10 ms : 0
11-18 ms : 1294
19-24 ms : 373
25-50 ms : 152
51-75 ms : 6

Nominally, the test completes in about 14 ms, so why are there cases as high as 65 ms?

Measuring Context Switches
The Linux function getrusage returns various metrics about process resource usage.  The items we are interested in are user CPU time (ru_utime), system CPU time (ru_stime) and involuntary context switches (ru_nivcsw).

Here are the stats for a sample that took 65ms:
[elapsed] => 65ms
[userCPUTime] => 8.999ms
[systemCPUTime] => 9.999ms
[switches] => 7

Combining the user and system CPU time, the process consumed about 19ms.  During this time there were 7 context switches where the process was interrupted and other processes were allowed to run.  Those interruptions added another 46ms to the completion of the test.

Another important stat to monitor is voluntary context switches (ru_nvcsw).  Voluntary context switches can occur when your code calls various operating system functions, such as the sleep function or I/O functions.
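PHP’s getrusage() exposes the same counters as the underlying system call. As a sketch, here is an equivalent probe in Python using its Unix-only resource module (the function and dictionary key names are illustrative, not taken from the original PHP class):

```python
import resource
import time

def rusage_snapshot():
    """Capture the CPU-time and context-switch counters described above."""
    ru = resource.getrusage(resource.RUSAGE_SELF)
    return {
        "userCPUTime": ru.ru_utime,    # user CPU seconds
        "systemCPUTime": ru.ru_stime,  # system CPU seconds
        "involuntary": ru.ru_nivcsw,   # preempted by the scheduler
        "voluntary": ru.ru_nvcsw,      # blocked in sleep, I/O, etc.
    }

before = rusage_snapshot()
time.sleep(0.01)  # a blocking call typically triggers a voluntary switch
after = rusage_snapshot()
print(after["voluntary"] - before["voluntary"])
```

Comparing snapshots taken around a timed section lets you attribute elapsed time to either CPU work or scheduler interruptions, as in the 65 ms sample above.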

A basic PHP class that runs as a command line script can be found here: https://gist.github.com/rdominy/6557280

Server Activity
Running on a machine that has less activity will result in significantly fewer context switches.  This often happens when code is tested on machines in a development environment and then later tested in production under real load.  As an example, compare the results of a production server vs. a development laptop:

Time Range    Development    Production
11 – 18 ms    0.0%           90.6%
19 – 24 ms    0.4%           7.6%
25 – 50 ms    99.6%          1.8%

Even though the production server is significantly faster, it shows much more variation in times due to context switching.  Although the CPU load between the systems is similar, the production server is handling hundreds of requests per second, causing NIC interrupts and other CPU activity.

Memory Footprint
The amount of memory used can also impact the cost of context switching.  Here are three tests running with footprints of 1 MB, 500 MB and 900 MB:

Time Range     1 MB     500 MB   900 MB
11 – 18 ms     84.6%    0.0%     0.0%
19 – 24 ms     12.1%    68.5%    70.2%
25 – 50 ms     3.2%     30.5%    29.3%
51 – 75 ms     0.1%     0.9%     0.4%
76 – 100 ms    0.0%     0.2%     0.1%

Simply having a larger process size nearly doubled the time of the same test.

Caveats
The elapsed time for these tests uses the PHP microtime function.  It is subject to the limitations of floating-point accuracy, clock drift caused by NTP corrections and probably several other things I am not considering.  I did monitor the NTP logs and saw about one NTP correction per hour on average, so NTP corrections were not likely influential on the test results.

Conclusion
Additional tests of varying complexity, typically running a real application algorithm, yielded similar results. Ultimately, the slower your application is and the more memory it uses, the more context switching will cost you.  Super-fast sub-millisecond operations in processes with a small footprint will suffer far fewer interruptions than functions that take tens of milliseconds to complete.

Future posts will look at ways of mitigating these costs.
