This example demonstrates a relatively complete implementation of a natural language interface to a SQL database. It is a non-trivial example, spanning over a thousand lines of code, and it can act as a basis for a production-ready implementation.
Note that a significant part of the implementation is not directly related to NLPCraft but rather deals with SQL statement construction. We specifically decided not to use any third-party libraries for it in order to show what can be done "from scratch".
Source code: GitHub
Many modern natural language-to-SQL implementations use variations of the deep learning approach, where you first train a neural network on a pre-created training set and then use the trained network to infer probabilistic answers for new input sentences. Although the latest implementations of this approach are reaching the 90th percentile of accuracy, they remain largely unusable for the vast swath of enterprise applications where the non-deterministic nature of such systems renders them practically useless. The apparent problem with such non-deterministic systems is that the user never knows whether a given answer is correct or incorrect. In these business and enterprise use cases, users can't tolerate the fact that in 5-10% of the cases the answer will be invalid. Try that for your HR or revenue reporting system, prescriptive analytics systems, and so on...
It's important to note, however, that for many non-mission-critical systems such non-determinism does not pose a significant problem. We happily accept such imprecision when asking for directions on our mobile devices, unlocking our phones using face or fingerprint recognition, performing sentiment analysis, or trying to detect the faces of our friends across thousands of photographs. The cost of retries, as well as the cost of initial errors, is insignificant in these cases. The same cost, however, in many business and enterprise applications can be too significant to tolerate.
As you may have learned by now, NLPCraft uses a fully deterministic approach to match the user input against a defined set of intents. If it finds a matching intent, that match is guaranteed for the given intent declaration. If no matching intent can be found, it returns a negative result without any ambiguity. In other words, if an answer is given, it is guaranteed to be correct. Another positive side effect of this approach is that the matching logic is easily traceable, i.e. the user can easily see why a given user input was matched against a certain intent (and not any other). Such traceability of the comprehension logic (or explainability, as opposed to the "black box" nature of deep learning techniques) is often critical for many real-life business applications.
On the flip side, one of the downsides of this approach is the need to have a detailed, domain-specific semantic model for each SQL database. Building such a model can be a non-trivial and time-consuming exercise. That is where NLPCraft brings a lot of built-in tooling and machinery to simplify this task dramatically.
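To make the "match or unambiguous no-match" idea concrete, here is a toy, self-contained Java sketch. It does not use the NLPCraft API; the intent names and required term sets are made up purely for illustration of the deterministic principle: the input either satisfies a declared intent exactly, or the result is an explicit negative, with no probabilistic in-between.

```java
import java.util.*;

public class IntentMatchDemo {
    // Hypothetical intents, each defined by the set of terms it requires.
    static final Map<String, Set<String>> INTENTS = Map.of(
        "order_report",  Set.of("orders", "report"),
        "customer_list", Set.of("customers", "list")
    );

    // Returns the matched intent id, or an empty Optional as an
    // unambiguous negative result - never a "probably" answer.
    static Optional<String> match(String input) {
        Set<String> words = new HashSet<>(Arrays.asList(input.toLowerCase().split("\\s+")));
        return INTENTS.entrySet().stream()
            .filter(e -> words.containsAll(e.getValue()))
            .map(Map.Entry::getKey)
            .findFirst();
    }

    public static void main(String[] args) {
        System.out.println(match("show me the orders report").orElse("NO MATCH"));
        System.out.println(match("what is the weather").orElse("NO MATCH"));
    }
}
```

Because the matching predicate is explicit, it is also trivially traceable: one can log exactly which required terms were present or missing for each intent.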
You can create a new Java project in many different ways - we'll use Maven archetype generation. In your home folder, run the following command:
mvn archetype:generate -DgroupId=examples -DartifactId=my-app -DarchetypeArtifactId=maven-archetype-quickstart -DarchetypeVersion=1.4 -DinteractiveMode=false
This will create the my-app folder with the following default Maven project structure:
├── pom.xml
└── src
    ├── main
    │   └── java
    │       └── examples
    │           └── App.java
    └── test
        └── java
            └── examples
                └── AppTest.java
Note that this setup is the same for all examples. Note also that you can use any other tools for creating and managing a Java project, with or without Maven.
For our example, we'll use JetBrains' IntelliJ IDEA. Create a new IDEA project from this source folder (make sure to pick JDK 11 or a later JDK and language support). Let's also delete the auto-generated AppTest.java file from our project as we won't be using it.
Next, we need to add the NLPCraft dependency to our new project. Open the pom.xml file and replace the dependencies section with the following code:
<dependencies>
    <dependency>
        <groupId>org.apache.nlpcraft</groupId>
        <artifactId>nlpcraft</artifactId>
        <version>0.7.0</version>
    </dependency>
</dependencies>
Also make sure that you have the correct JDK version (11 or above) set for the Maven compiler plugin:
<properties>
    <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
    <maven.compiler.source>11</maven.compiler.source>
    <maven.compiler.target>11</maven.compiler.target>
</properties>
IDEA should automatically reload the project with the newly updated pom.xml file, and we should now be ready to develop our data model.
We are going to build a natural language interface to a slightly modified "Northwind" sample database from Microsoft SQL Server:
You can find the SQL script that creates this database and populates it with sample data in the db/northwind.sql file in the root of the example.
In our example, we'll be using an H2 database instance. For convenience, the example provides a command line application, db/SqlServer.java, that automatically starts a local H2 database instance with the default configuration (localhost on port 9092) and initializes it using this script.
Creating a data model manually for an existing SQL schema can be a daunting task. NLPCraft provides a tool that can scan the SQL schema and create a working stub of such a model automatically. All you have to do then is add any necessary modifications to it.
SQL Model Generator is a Java-based utility that takes a JDBC configuration, reads the database schema using it, and creates an initial JSON or YAML stub for the data model. This stub can then be used as is or extended further. Read more about this tool here.
Although the example already comes with pre-generated files, you can run this utility yourself by starting the H2 database (using the SqlServerRunner application in the db/SqlServer.java file) and running the SQL Model Generator as follows:
java -cp apache-nlpcraft-incubating-0.7.0-all-deps.jar org.apache.nlpcraft.model.tools.sqlgen.NCSqlModelGenerator -r jdbc:h2:tcp://localhost:9092/nlp2sql -d org.h2.Driver -s PUBLIC -o src/main/scala/org/apache/nlpcraft/examples/sql/sql_model_init.yaml
The example comes with two pre-generated files:
- sql_model_init.yaml - the initial file that was generated straight out of the SQL Model Generator (see above for the instructions).
- sql_model.yaml - the modified version of the same file. It is recommended to run a diff utility of some sort to see the difference between the two files, which will indicate the necessary changes that were made to the auto-generated file.
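To give a rough sense of what such a generated stub contains, here is a hypothetical YAML fragment in the spirit of sql_model_init.yaml. The element ids, groups, and synonyms below are illustrative only - consult the actual generated file in the example for the real structure produced by the SQL Model Generator:

```yaml
# Hypothetical fragment - the actual generated stub will differ in detail.
elements:
  - id: "tbl:orders"
    groups:
      - "table"
    synonyms:
      - "orders"
  - id: "col:orders_order_date"
    groups:
      - "column"
    synonyms:
      - "order date"
      - "{order|purchase} date"
```

The manual modifications typically amount to enriching such auto-generated synonym lists with domain vocabulary that cannot be inferred from the schema alone.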
The implementation consists of two main files:
- SqlModel.java - the code behind the data model that loads the YAML-defined model and defines all intents.
- db/Builder.java - the main utility that takes the object model provided by the SQL Model Generator and builds a SQL query to execute. Note that we elected to build this functionality from scratch to illustrate how it can be done. You are free, of course, to use any of the many existing libraries to help achieve this goal.
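As a minimal sketch of the kind of from-scratch SQL construction that db/Builder.java performs, consider the toy builder below. It is not the example's actual code (the real builder also handles joins, conditions from free-form dates, sorting, and paging); the class and method names here are invented for illustration of the general string-assembly approach:

```java
import java.util.ArrayList;
import java.util.List;

public class SelectBuilder {
    private final String table;
    private final List<String> columns;
    private final List<String> conditions = new ArrayList<>();

    SelectBuilder(String table, List<String> columns) {
        this.table = table;
        this.columns = columns;
    }

    // Accumulates WHERE clauses; joined with AND at render time.
    SelectBuilder where(String cond) {
        conditions.add(cond);
        return this;
    }

    // Renders the final SQL statement from the collected parts.
    String sql() {
        String cols = String.join(", ", columns);
        String where = conditions.isEmpty()
            ? ""
            : " WHERE " + String.join(" AND ", conditions);
        return "SELECT " + cols + " FROM " + table + where;
    }

    public static void main(String[] args) {
        String q = new SelectBuilder("orders", List.of("order_id", "order_date"))
            .where("order_date >= '1997-01-01'")
            .sql();
        System.out.println(q);
        // SELECT order_id, order_date FROM orders WHERE order_date >= '1997-01-01'
    }
}
```

The real implementation maps the tables and columns detected in the user's sentence onto such a builder, which is exactly the part that a third-party query-building library could replace.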
If you are using the unit test that comes with this example, you do not need to start the data probe standalone, as the unit test uses embedded probe mode. In this mode, the unit test automatically starts and stops the data probe from within the test itself.
If you are using the unit test below, skip this step; you only need to start the REST server.
NLPCraft data models get deployed into a data probe. Let's start the data probe with our newly created data model. To start the data probe, we need to configure a Run Configuration in IDEA with the following parameters:
NOTE: instead of supplying a full configuration file, we just use the default configuration and override one configuration property via environment variables.
Start this run configuration and make sure you have positive console output indicating that our model has been successfully loaded and the probe has started.
If you are using the unit test that comes with this example, you do not need to start the H2 database standalone, as the unit test will start it automatically.
If you are using the unit test below, skip this step; you only need to start the REST server.
To start the H2 database server, we need to configure a Run Configuration in IDEA with the following parameters:
Start this run configuration and make sure you have positive console output indicating that the H2 database server is running:
Jul-05|08:40:47|INFO | H2 server start parameters: -baseDir /Users/minkovski/nlpcraft-examples/h2 -tcpPort 9092 -tcpAllowOthers
Jul-05|08:40:47|INFO | H2 server status: TCP server running at tcp://localhost:9092 (others can connect)
Jul-05|08:40:47|INFO | Database 'jdbc:h2:tcp://localhost:9092/nlp2sql' is NOT initialized because data already exists. To re-initialize - delete files in '/Users/minkovski/nlpcraft-examples/h2' folder and start again.
The REST server listens for requests from client applications and routes them to the requested data models via connected data probes. The REST server starts the same way as the data probe. Configure a new Run Configuration in IDEA with the following parameters:
Once started, ensure that your REST server console output shows that the data probe is connected and the REST server is listening on the default endpoint.
At this point we've developed our data model, deployed it into the data probe, and started the REST server. To test it, we'll use the built-in test framework that allows you to write convenient unit tests against your data model.
NLPCraft comes with an easy-to-use test framework for data models that can be used with any unit testing framework like JUnit or ScalaTest. It is essentially a simplified version of the Java REST client that is custom designed for data model testing.
The unit test for this example is available in the SqlTest.java file. Note that this test provides the additional utility of testing each input sentence against the resulting SQL statement.
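The core idea of that sentence-to-SQL check can be sketched with plain Java, without the NLPCraft test framework. The stub below is purely conceptual - askForSql and its hard-coded sentence map are invented stand-ins for the model's actual REST round-trip - but it shows the deterministic assertion style the test uses: each input sentence must map to an exact expected SQL statement:

```java
import java.util.Map;

public class SqlCheckDemo {
    // Hypothetical model stub: maps recognized sentences to generated SQL.
    // In the real test this answer comes from the deployed data model.
    static String askForSql(String sentence) {
        return Map.of(
            "show all orders", "SELECT * FROM orders",
            "list customers",  "SELECT * FROM customers"
        ).getOrDefault(sentence, null);
    }

    public static void main(String[] args) {
        // Deterministic check: the answer is either exactly right or a failure.
        String sql = askForSql("show all orders");
        if (!"SELECT * FROM orders".equals(sql))
            throw new AssertionError("Unexpected SQL: " + sql);
        System.out.println("OK: " + sql);
    }
}
```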
Embedded Probe & H2 Autostart
This test uses an embedded probe which automatically starts and stops the data probe from within the test itself. The test also starts the H2 database automatically.
NOTE: when using this test, you don't need to start the data probe and H2 database standalone as described in the previous steps.
You've created a data model for a fairly complete natural language interface to a SQL database, deployed it into the data probe, started the REST server, and tested the model using JUnit 5 and the built-in test framework.