Drools

Java Rules Engine

Large Datasets and Runtime performance

Not everyone needs to load 1 million rows of data, but when you do, it's important to handle it efficiently. A few basic tips help when handling large datasets. Large datasets typically fall into three categories.

  • Reasoning over atomic rows
  • Calculation rules
  • A combination of atomic rows and calculation rules

Let's use a regulation rule to illustrate these three cases:

The exposure to any issuer must not exceed 5% for the firm

The intent of the rule is to measure a company's exposure to all the issuers its customers invest in. On the surface, it would appear the application needs to load all the positions (aka investments) into the rule engine. In practice, problems like these can be broken into smaller tasks. In the compliance case, a company may have 10 million positions in the database, but the list of distinct issuers is only a small fraction of that. Rather than load all 10 million rows into the rule engine, it is better to get the distinct list of issuers and run the compliance validation on each issuer. The pseudo SQL query might look like this:

select distinct issuer from positions

Once we have the distinct list of issuers, it is easy to rewrite the rule. One possible approach might look like this:

Issuer exposure:

if

  • issuer.id != null
  • aggregate.entityId = issuer.id
  • companyInvestments.total > 0
  • (aggregate.total / companyInvestments.total) > 5%

then

  • create a compliance violation report

Query rule:

if

  • issuer.id != null

then

  • query the database (select sum() total from positions where issuer.id=<id>)
  • assert aggregate to working memory
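
The Query rule's action can be sketched in plain Java, outside any rule engine. Everything named here is hypothetical: `Aggregate` mirrors the aggregate fact used in the rules, `sumForIssuer` stands in for the per-issuer database query, and a plain list stands in for working memory.

```java
import java.math.BigDecimal;
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;

class QueryRuleSketch {

    /** Hypothetical fact type mirroring "aggregate" in the rules above. */
    static final class Aggregate {
        final String entityId;
        final BigDecimal total;

        Aggregate(String entityId, BigDecimal total) {
            this.entityId = entityId;
            this.total = total;
        }
    }

    /**
     * For every non-null issuer id, asks sumForIssuer (an assumed stand-in for
     * the per-issuer SUM query) for the total and records one Aggregate.
     * The returned list is a stand-in for the engine's working memory.
     */
    public static List<Aggregate> buildAggregates(
            List<String> issuerIds,
            Function<String, BigDecimal> sumForIssuer) {
        List<Aggregate> workingMemory = new ArrayList<>();
        for (String id : issuerIds) {
            if (id == null) {
                continue; // mirrors the "issuer.id != null" condition
            }
            BigDecimal total = sumForIssuer.apply(id);
            // SQL SUM over zero rows yields NULL, so treat a missing total as zero.
            workingMemory.add(new Aggregate(id, total == null ? BigDecimal.ZERO : total));
        }
        return workingMemory;
    }
}
```

Only one small `Aggregate` object per issuer reaches the engine in this sketch; the individual position rows never leave the database.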

By using this approach, the application need not load all the positions to perform mathematical calculations. The primary strength of RETE is pattern matching. Although rule engines like JESS, CLIPS and Blaze Advisor provide the ability to perform calculations, it is generally a good idea to externalize those operations as functions.
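
As one illustration of externalizing a calculation, the exposure test from the Issuer exposure rule could live in a small Java helper that a rule calls. This is only a sketch: the class name `ExposureCheck`, the method `exceedsLimit`, and the choice of `BigDecimal` are assumptions, not part of any rule engine's API.

```java
import java.math.BigDecimal;

class ExposureCheck {

    /** The 5% limit from the regulation rule. */
    static final BigDecimal LIMIT = new BigDecimal("0.05");

    /**
     * Returns true when issuerTotal / firmTotal is greater than 5%.
     * The comparison is written as issuerTotal > firmTotal * 5%, which is
     * equivalent for a positive firmTotal and avoids rounding in a division.
     */
    public static boolean exceedsLimit(BigDecimal issuerTotal, BigDecimal firmTotal) {
        if (firmTotal.signum() <= 0) {
            return false; // mirrors the "companyInvestments.total > 0" condition
        }
        return issuerTotal.compareTo(firmTotal.multiply(LIMIT)) > 0;
    }
}
```

With a firm total of 100, an issuer total of 6 exceeds the limit, while a total of exactly 5 does not, because the rule uses a strict comparison.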

In this specific example, the rule engine needs to reason over an atomic position to determine if additional reasoning is needed. Once the rule engine detects that the exposure of a single issuer needs to be validated, it should reason over the aggregate result. Although one could write a rule to calculate the aggregate, doing so is not efficient and limits scalability. Very rarely does a rule engine need to load all 10 million rows. One can improve the runtime performance by using materialized views to create summary tables. This means the rule can simply query for the calculated value instead of calculating it in real time. In the case of real-time systems, summary tables may not be appropriate, which means running the query on demand.
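
The choice between a summary table and an on-demand query can be hidden behind one lookup. The sketch below assumes a hypothetical `IssuerTotals` class, an in-memory map standing in for a summary table, and a function standing in for the live query; none of these names come from a real library.

```java
import java.math.BigDecimal;
import java.util.Map;
import java.util.function.Function;

class IssuerTotals {

    private final Map<String, BigDecimal> summary;           // stand-in for a summary table
    private final Function<String, BigDecimal> onDemandQuery; // stand-in for the live SUM query

    IssuerTotals(Map<String, BigDecimal> summary,
                 Function<String, BigDecimal> onDemandQuery) {
        this.summary = summary;
        this.onDemandQuery = onDemandQuery;
    }

    /**
     * Returns the precomputed total when one exists and freshness is not
     * required; otherwise falls back to the on-demand query.
     */
    public BigDecimal totalFor(String issuerId, boolean requireFresh) {
        if (!requireFresh) {
            BigDecimal precomputed = summary.get(issuerId);
            if (precomputed != null) {
                return precomputed;
            }
        }
        return onDemandQuery.apply(issuerId);
    }
}
```

A batch compliance run might call `totalFor(id, false)` to take the cheap precomputed path, while a real-time check would pass `true` and accept the cost of the query.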

In cases where an application really needs to load all the data, the only option is to get bigger hardware and max out the RAM. Vertical scaling is expensive, so try to exhaust all other options before loading millions of rows of data into the rule engine.