Pere Torán
August 21, 2018
Technical solutions
In one of our current projects, our client told us about their need to improve the search capabilities of the platform by providing proximity searches (with wildcards and without taking word order into account) on some text fields. The platform consists of a web application and some services published as a RESTful API, with Hibernate as the provider for the persistence layer, backed by a MySQL database.
The application is constantly evolving as the client's needs grow, and it has been running in production for over a year and a half.
Even though MySQL offers full-text search, it isn't supported by Hibernate Criteria (nor by HQL), and we always try to avoid writing plain SQL statements as much as possible. Why use an ORM if you end up writing plain SQL anyway, right?
After doing some research we decided to use the Hibernate Search module, which is built on top of Lucene and offers full-text search support for objects stored by Hibernate ORM. Among other features, it handles the creation of a Lucene index kept in sync with the relational database, sorting results by fields (not just by score, which was important for us since we already offered sorting to the user and didn't want to lose it), pagination, a Lucene query builder similar to Criteria, and more.
The first approach was focused on taking advantage of all these features. So we started simple: we configured Hibernate Search (we use Hibernate 4.1.8 and Hibernate Search 4.1.1), mapped the few text fields we needed using the defaults, and launched the MassIndexer to create the Lucene index from the data stored in MySQL. Luke showed that the index was correct and contained all the records we needed; things couldn't have started better!
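For context, our setup boiled down to something like the following sketch: the fields are mapped through Hibernate Search's programmatic mapping API (which becomes relevant later on) and the index is built once from the existing data. Entity and field names here are illustrative, not our real model, and the session handling is simplified:

SearchMapping mapping = new SearchMapping();
mapping.entity(Book.class).indexed()
       .property("title", ElementType.METHOD).field()        // defaults: analyzed with the StandardAnalyzer
       .property("description", ElementType.METHOD).field();

Configuration configuration = new Configuration().configure();
// This is the same property that the isAnalyzed() utility further below reads back
configuration.getProperties().put("hibernate.search.model_mapping", mapping);

// One-off creation of the Lucene index from the data already stored in MySQL
FullTextSession fullTextSession = Search.getFullTextSession(session);
try {
    fullTextSession.createIndexer().startAndWait();
} catch (InterruptedException e) {
    Thread.currentThread().interrupt();
}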
Time to start building Lucene queries. Let's use the Hibernate Search query DSL, it's easy! The first thing we tried was a boolean search made up of multiple keyword terms with wildcards enabled, and it worked like a charm. Something like this:
Query luceneQuery = b.bool()
        .must(b.keyword().wildcard().onField("title").matching("*term1*").createQuery())
        .must(b.keyword().wildcard().onField("title").matching("*term2*").createQuery())
        .createQuery();
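For completeness, b in the snippet above is the Hibernate Search QueryBuilder for the indexed entity. Obtaining it and running the resulting Lucene query looks roughly like this (the entity name is again illustrative):

QueryBuilder b = fullTextSession.getSearchFactory()
        .buildQueryBuilder().forEntity(Book.class).get();

// ... build luceneQuery as shown above ...

FullTextQuery fullTextQuery = fullTextSession.createFullTextQuery(luceneQuery, Book.class);
fullTextQuery.setFirstResult(0).setMaxResults(20);   // pagination comes for free
List<Book> results = fullTextQuery.list();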
While this worked for short, specific fields like "title", it wasn't useful for longer fields like "description", where term1 and term2 can both appear yet mean nothing if they are far from each other; they need to be close together to be meaningful. This is where phrase queries come in. Unfortunately, Hibernate Search doesn't appear to provide a way to build wildcard phrase queries, so we had to build the Lucene queries ourselves. Something like this:
public static Query createPhraseQuery(String[] phraseWords, String field) {
    SpanQuery[] queryParts = new SpanQuery[phraseWords.length];
    for (int i = 0; i < phraseWords.length; i++) {
        WildcardQuery wildQuery = new WildcardQuery(new Term(field, "*" + phraseWords[i] + "*"));
        queryParts[i] = new SpanMultiTermQueryWrapper<WildcardQuery>(wildQuery);
    }
    return new SpanNearQuery(queryParts,
            10,     // max distance between terms
            false   // terms don't have to appear in the exact order
    );
}
And since our titles are never longer than 10 words, this would work for both the title and the description fields. Good! Well, not really… Our first attempt was to simply split the search terms by spaces and build the query from them. This gave us inconsistent results, because we had applied the StandardAnalyzer at indexing time, which among other things lowercases the terms (wildcard terms are not analyzed at search time, so we lost case-insensitive queries because of this) and removes stopwords. OK, lesson learned: you need to apply the same analyzer both at indexing time and at search time.
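A quick illustration of the mismatch (the field value and terms are made up):

// At indexing time the StandardAnalyzer turns "The Great Adventure" into the
// tokens "great" and "adventure" (lowercased, stopwords removed). Wildcard
// terms are NOT analyzed at search time, so a query built from the raw input misses:
new WildcardQuery(new Term("title", "*Adventure*"));   // no hits: the index only contains "adventure"
new WildcardQuery(new Term("title", "*adventure*"));   // hits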
So, since we wanted to build a small generic Lucene search engine, the problem now was knowing whether a field is analyzed or not before performing the search. After some research we found nothing, so we started digging into the Hibernate Search code to find out how to get hold of the SearchMapping object, which is an internal Hibernate Search class that models the field mappings (the field names, whether they are analyzed, indexed, stored or not, etc.), and we ended up with this small utility:

private static boolean isAnalyzed(String field, Class rootClass) {
    boolean analyzed = true;
    Class entityClass = rootClass;
    String myField = field;
    if (myField.contains(".")) {
        // Allows us to look up properties of nested classes, thanks to anitilia
        String myClass = myField.substring(0, myField.lastIndexOf("."));
        entityClass = ReflectionUtils.getPropertyClass(rootClass, myClass);
        myField = myField.substring(myField.lastIndexOf(".") + 1);
    }
    SearchMapping searchMapping = (SearchMapping) HibernateUtil.getConfig()
            .getProperties().get("hibernate.search.model_mapping");
    PropertyDescriptor propertyDescriptor = searchMapping.getEntityDescriptor(entityClass)
            .getPropertyDescriptor(myField, ElementType.METHOD);
    if (propertyDescriptor == null) {
        return analyzed;
    }
    for (Map<String, Object> fieldsMap : propertyDescriptor.getFields()) {
        if (fieldsMap.containsKey("name") && !fieldsMap.get("name").equals(field)) {
            continue;
        }
        for (Entry<String, Object> entry : fieldsMap.entrySet()) {
            if (entry.getKey().equals("analyze")) {
                analyzed = entry.getValue().toString().equalsIgnoreCase(Analyze.YES.name());
            }
        }
    }
    return analyzed;
}
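For instance (class names are illustrative), the check works both on direct and on nested properties:

isAnalyzed("title", Book.class);         // reads the "title" mapping declared on Book
isAnalyzed("author.name", Book.class);   // resolves Book -> Author first, then checks "name"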
Basically, this tells us whether a field has been analyzed or not. The only thing left was to manually apply the analyzer (or not, depending on the field) to the search terms before building the query. That's what the following code does:
private static List<String> getTermValues(String field, Class rootClass, String value) {
    List<String> termValues = new ArrayList<String>();
    if (isAnalyzed(field, rootClass)) {
        // Important: StandardAnalyzer behaves differently depending on the Lucene version
        Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_31);
        TokenStream stream = analyzer.tokenStream(field, new StringReader(value));
        CharTermAttribute charTermAttribute = stream.addAttribute(CharTermAttribute.class);
        try {
            stream.reset();
            while (stream.incrementToken()) {
                termValues.add(charTermAttribute.toString());
            }
            stream.end();
            stream.close();
        } catch (IOException e) {
            e.printStackTrace();
        } finally {
            analyzer.close();
        }
    } else {
        // Not analyzed at indexing time, so we keep the raw value
        termValues.add(value);
    }
    return termValues;
}
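Wiring everything together, a proximity search with wildcards on the description field ends up looking roughly like this (entity and field names are illustrative, and fullTextSession is the one opened earlier):

String userInput = "term1 term2";

// Analyze the input with the same analyzer used at indexing time (or keep it raw)
List<String> terms = getTermValues("description", Book.class, userInput);

// Proximity + wildcards: terms must appear within 10 positions of each other, in any order
Query phraseQuery = createPhraseQuery(terms.toArray(new String[terms.size()]), "description");

FullTextQuery fullTextQuery = fullTextSession.createFullTextQuery(phraseQuery, Book.class);
List<Book> results = fullTextQuery.list();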
And that's all, folks! :)