(String) Prefix Trees (5 Part Series)
1 Implementing Tries in Java
2 Ternary Search Trees
3 Ternary Search Tree: Core Methods (Java Implementation)
4 Prefix Search with Ternary Search Trees (Java Implementation)
5 Java Profiling: comparing BSTs, tries and TSTs
In the last post in the series, we went over the core methods (insert, delete, search) for Ternary Search Trees; before that, we had already discussed how tries work. Now, it’s time to get to the money, and discuss what makes both these data structures extremely valuable in a lot of areas: prefix search.
There are (at least) two complementary ways to decline this operation:
- Finding the longest prefix of a given string
s
that is stored in our container; - Finding all the keys stored in a container that starts with a given prefix.
Both operations can be implemented in any container, of course: trivially, we can just iterate through all the entries, and select the result, or results.
The difference is that ternary search trees, like tries and radix trees, allow to perform this operation much more efficiently.
longestPrefix
Let’s start with the first operation: we have a string s
(containing m
characters), and a container T
storing a certain number of keys – let’s say n
keys.
We would like to know what’s the longest prefix p
of s
that is stored in T
.
This operation is pretty useful in many practical situations; even without bringing up bioinformatics (where efficiently finding the longest common prefix of DNA sequences is a must) it can be used in routers, where given an IP address, we need to find the longest prefix of that address for which there are routing information available, but also in network data clustering and telephone network management.
But it can also be used with suffix trees to implement efficient substring search, and as such is very useful in various fields, from full-text search in text-oriented search engines, to bioinformatics.
A suffix tree is a key-value trie where keys are all the suffixes of a given text, and the values stored by each key are the positions of each suffix in the text.
In abstract, the way the algorithm works is simple: we scan the input string key
character by character, and in parallel traverse the tree, starting from the root.
At any given point, we will have reached a node n
in the tree, after successfully matching a substring s
of the input string.
Assuming he next character in the input string is c
, we check if in the tree (or, rather, in the subtree rooted at current node) there is a path of left/right links to that character – if we can’t find it, then there is no prefix of the input string that includes c
, and s
is the longest prefix of the input string key
for which there is a path in the tree.
Now, there are two ways we might consider this result, which leads to two different versions of the algorithm.
- You don’t care if the prefix is stored or not, you need to know what’s the longest shared part between two strings: this is the case, for instance, in routers, if for an address like
192.168.1.1
you are OK with a result like192.16
; - You might want only the longest prefix that is actually stored in the tree: this might be the case if we are looking for valid words that are prefixes to a longer word or, to go back to the example of routers, you can imagine forcing only full-block matches, and so
192
or192.168
would be acceptable results, but not192.16
.
Below I implemented the second version, where the prefix must be stored in the tree, but it’s easy to change it to return just the longest path traversed, or take a parameter to decide which version is preferred (hint: look for the two places where this.storesKey()
is checked).
Another nit: if you remember, in the last post we had discussed how it would be more efficient to pass the index of the next character to check in the input string, rather than resorting to calling substring
after every match. For this method, this solution also makes it easier to return the final result, and so it’s the one I’m presenting.
public String longestPrefixOf(String key) {
return this.longestPrefixOf(key, 0);
}
private String longestPrefixOf(String key, int charIndex) {
if (charIndex >= key.length()) {
return null;
}
String result = null;
Character c = key.charAt(charIndex);
if (c.equals(this.character)) {
if (charIndex == key.length() - 1) { // case A.1
return this.storesKey() ? key : null;
} else { // case A.2
result = middle == null
? null / case A.2.a
: middle.longestPrefixOf(key, charIndex + 1); // case A.2.b
if (result == null && this.storesKey()) {
result = key.substring(0, charIndex + 1); // case A.2.c
}
}
} else if (c.compareTo(this.character) < 0) {
result = left == null
? null // case B.1
: left.longestPrefixOf(key, charIndex); // case B.2
} else {
result = right == null
? null // case C.1
: right.longestPrefixOf(key, charIndex); // case C.2
}
return result;
}
Enter fullscreen mode Exit fullscreen mode
Let’s see an example to clarify how the method works. Suppose we are looking for the longest prefix of string shop
.
We start from the root, looking for the first character, 's'
:
- (A) the root node doesn’t match, and we need to take a right, because it holds character
'o'
, which is less than's'
; - (B) at the second attempt we are luckier, we match the searched char, so we can traverse the middle link and advance the pointer in the string: now we’ll be looking for character
'h'
.
- (C) Another match, so again we traverse the middle link and at the same time move to the next character in the input string,
'o
`. - (D) This time we are not in luck,
'e' < 'o'
, so we traverse the right link, still looking for the next'o'
in the subtree.
- (E) Another match, so again we traverse the middle link and at the same time move to the next character in the input string,
'p
`. - (F) Again, a mismatch, but this case is interesting: the character held by this node,
'r'
is larger than'p
`, but there is no left link to traverse, so the search stops here and the method starts backtracking.
With current tree, and if we look for the longest prefix actually stored in the tree, the method would return null (no prefix of "shop"
is stored in the tree, and in fact the only key node traversed in the one corresponding to "she"
, that’s not a prefix of the input).
For the sake of illustrating a more interesting scenario, let’s now slightly pivot and consider a tree where the key "sh"
is also stored, illustrated in the following graphic.
So, let’s start from the end: in (F) we had a mismatch on the last node visited, and that node had no children; we thus are in case B.1 in the code above, and we’ll return null
. The recursion back-traces to the previous call (shown in (E)), where we had a match on 'o'
, but since we are requesting the longest prefix to be stored in the tree, this call as well will return null
(case A.2).
In (D) we were in case C.2, so we’ll follow through the result of the recursive call, which is still null
; finally, when we get back to the call shown in (C), we hit case A.2.b, and return the string 'sh'
as the longest common prefix.
keysWithPrefix
The second method that we are going to discuss takes a string and returns an Iterable
(a collection, in more generic terms) of strings; this collection can of course be empty, if no element stored in the tree starts with the input string.
This method is very useful, for instance, to create autocomplete systems: while a user is typing (well, ideally not after every keystroke, but that’s a story for a different time) we need to perform quick searches on the container to suggest some text that starts with what was already typed – so this search ideally should be quite fast, and trie-like containers are paramount.
The method is, at a high level, very easily described: we need to search for the prefix (in particular, for the node at which the path corresponding to the prefix ends, regardless of if it’s storing a key or not), and then return all the keys stored in the subtree rooted at that node.
public Iterable<String> keysWithPrefix(String prefix) {
// Invariant: prefix is not empty
TstNode node = this.search(prefix);
return node == null
? new HashSet<>()
: node.keys().stream()
// All keys in node.keys already include the last character in prefix
.map(s -> prefix.substring(0, prefix.length() - 1) + s)
.collect(Collectors.toSet());
}
Enter fullscreen mode Exit fullscreen mode
The code above assumes we have two methods already implemented: a search method (see the previous post in the series) and a method returning all keys stored in a subtree, which we’ll describe next.
The only step that needs carefulness in the code above is mapping the strings returned by method keys
: first of all, why we need it? Well, if we don’t start collecting keys from the root of the tree, but from a random node, then we’ll return only the suffixes of the keys actually stored (unless we tweak the tree structure or the method to remember the full string that a node corresponds to in the full tree, but this would complicate and disrupt the inductive definition of TSTs).
There is another thing that we need to be careful about: if we start gathering keys from node n
, holding character c
, all the the results will start with c
, and so we need to prepend them with characters in the input prefix
up to, and not including, c
.
Both aspects are more clearly illustrated in the following graphic.
Now, we have one more task left: implementationing the method that collects all keys stored in a (sub-)tree.
In this case as well, the high-level design of the method is not particularly convoluted: we just need to traverse all branches, all nodes of the tree, and add the keys we find to a list.
To make our life simpler, we can write a public method that takes care of initializing this list and starting the traversal, and a private method that takes this list (that Java automatically passes by reference) and add keys to it as soon as they are found; the alternative (merging lists from the recursive calls to a node’s children) would be much more expensive.
public List<String> keys() {
List<String> keys = new ArrayList<>();
this.keys("", keys);
return keys;
}
private void keys(String currentPath, List<String> keys) {
if (this.storesKey()) {
keys.add(currentPath + this.character); // case 0
}
// For left and right branches, we must not add this node's character to the path
if (left != null) {
left.keys(currentPath, keys); // case A
}
if (right != null) {
right.keys(currentPath, keys); // case B
}
// For the middle child, instead, this node's character is part of the path forward
if (middle != null) {
middle.keys(currentPath + character, keys); //case C
}
}
Enter fullscreen mode Exit fullscreen mode
Another chance for optimization would be replacing the string concatenation in recursive calls,
currentPath + character
: we could pass currentPath as a list of characters, that would get a new element only when we follow a middle link (case (C) in the code above – just know that it should then be popped after the call returns!), and a String would be constructed from the list only when we find a key stored in the tree (case (0) above).
This, however, would make the code less clean and potentially troublesome, if this code is changed to spread recursive calls over different threads.
If we are going to implement autocomplete, chances are that we only need a handful of results to be shown; we might have a preferred criteria for the keys returned (for instance, the longest results first, or choosing them using lexicographic order), but in most cases, we might avoid traversing the whole tree, and stop as soon as we have enough results to show.
This is saving important resources especially on the first key strokes, where the prefix is shorter and the number of matching results can be huge: there is no point in returning a thousand possible values, if only 5 or 6 can be shown to the user – that would be just a waste of computational resources and, potentially, of bandwidth (but in that case there would also be other errors in the design of the backend and server-client communication).
Changing the code shown above to take into account a threshold for the number of results returned is not too complicated: we just need to check the size of the list after adding a new element – whenever it’s filled with enough results, we can just return, without traversing the subtree rooted at current node.
private void keys(String currentPath, List<String> keys, int maxResults) {
if (this.storesKey()) {
keys.add(currentPath + this.character);
if (keys.size >= maxResults) {
return;
}
}
if (left != null) {
left.keys(currentPath, keys, maxResults);
}
if (right != null) {
right.keys(currentPath, keys, maxResults);
}
if (middle != null) {
middle.keys(currentPath + character, keys, maxResults);
}
}
Enter fullscreen mode Exit fullscreen mode
Conclusions
Prefix search is the real “competitive advantage” of tries and TSTs over other data structures.
But you don’t have to take my word for granted: in the next post, we’ll use a profiler to show the difference in running time between these two data structures and binary search trees , as well as their impact on RAM.
(String) Prefix Trees (5 Part Series)
1 Implementing Tries in Java
2 Ternary Search Trees
3 Ternary Search Tree: Core Methods (Java Implementation)
4 Prefix Search with Ternary Search Trees (Java Implementation)
5 Java Profiling: comparing BSTs, tries and TSTs
原文链接:Prefix Search with Ternary Search Trees (Java Implementation)
暂无评论内容