Search – Rahul's Blog

Synopsis

When a user performs a search on an e-commerce site, the format of the searchTerm largely determines the relevancy of the search results. To improve relevancy, a series of tokenization steps can be applied to the fields across which the search is performed.

Use Case:

If the user searches for a product number, the result displays the product with the matching part number.

Ex: searchTerm = “USTXBK350”. The application performs a search against the partNumber field and returns the product as the search result.

The searchTerm “USTXBK350” exactly matches the partNumber of one record (REC2 in the sample records below), and that record is returned as the search result.

If the user doesn’t remember the complete partNumber and searches with “BK35”, this might not work as expected, because “BK35” is only a fragment of the stored partNumber value.

Solution:

To achieve a partial match, the following options are available:

Perform a wildcard search: This might fix the problem, with a performance trade-off. The wildcard match may also run against other searchable fields, so records that match only in those fields qualify as search results too.

REC1:
 Name: TestBK350
 partNumber: TX1234

REC2:
 Name: TestProduct
 partNumber: USTXBK350

REC3:
 Name: TestProd3
 partNumber: TEST300

If the fields “name” and “partNumber” are made searchable and enabled for wildcard search, the searchTerm “BK35” matches REC1 (through its name, TestBK350) as well as REC2 (through its partNumber). Though the expected product is returned as part of the search results, the relevancy of the search results is compromised, because REC1 is not the product the user was looking for.
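The relevancy problem can be reproduced with a small in-memory sketch. This is not Endeca code: the class WildcardSearchSketch and the method wildcardSearch are hypothetical names, and a plain case-sensitive substring check stands in for a *term* wildcard query. With the three sample records as listed, the term is found in REC1's name and in REC2's partNumber.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for a search index; not an Endeca API.
class WildcardSearchSketch {

    // Sample records from the text: {id, name, partNumber}.
    static final String[][] RECORDS = {
        {"REC1", "TestBK350", "TX1234"},
        {"REC2", "TestProduct", "USTXBK350"},
        {"REC3", "TestProd3", "TEST300"},
    };

    // Emulates a *term* wildcard search over both searchable fields
    // using a case-sensitive substring check.
    static List<String> wildcardSearch(String term) {
        List<String> hits = new ArrayList<>();
        for (String[] rec : RECORDS) {
            boolean inName = rec[1].contains(term);
            boolean inPartNumber = rec[2].contains(term);
            if (inName || inPartNumber) {
                hits.add(rec[0]);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        System.out.println(wildcardSearch("BK35")); // prints [REC1, REC2]
    }
}
```

REC1 appears in the output only because its name happens to contain the term, which is the kind of noise the tokenization approach is meant to avoid.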

To fix this, we can use tokenization on the partNumber field.

The following tokenization approaches are used:

AlphaNumeric: With this approach we split the partNumber into its alphabetic and numeric parts. For example, BK350 is divided into BK and 350.

String[] parts = partNum.split("(?<=\\D)(?=\\d)");
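To see what this regular expression produces, here is a minimal runnable sketch. The class AlphaNumericSplitSketch and the method splitAlphaNumeric are hypothetical names used only for illustration; the split pattern itself is the one shown above.

```java
import java.util.Arrays;
import java.util.List;

class AlphaNumericSplitSketch {

    // Splits at every position where a non-digit is followed by a digit.
    // The lookbehind/lookahead are zero-width, so no characters are consumed.
    static List<String> splitAlphaNumeric(String partNum) {
        return Arrays.asList(partNum.split("(?<=\\D)(?=\\d)"));
    }

    public static void main(String[] args) {
        System.out.println(splitAlphaNumeric("BK350"));     // prints [BK, 350]
        System.out.println(splitAlphaNumeric("USTXBK350")); // prints [USTXBK, 350]
        System.out.println(splitAlphaNumeric("TX1234"));    // prints [TX, 1234]
    }
}
```

Note that the pattern only splits where letters are followed by digits, not the other way round, so a value such as AB12CD34 would yield AB, 12CD and 34 rather than four parts.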

NGram: The following N-gram sizes are generated:

generateNGram(partNumbers[i], 3, partNumberTokens);
generateNGram(partNumbers[i], 4, partNumberTokens);
generateNGram(partNumbers[i], 5, partNumberTokens);
generateNGram(partNumbers[i], 6, partNumberTokens);
generateNGram(partNumbers[i], 7, partNumberTokens);

We split the partNumber into overlapping sequences of N characters.

Ex: BK350 with an N-gram size of 3 produces the tokens BK3, K35, 350.
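The body of generateNGram is not shown in this post. The sketch below is one possible implementation, assuming the third argument is a List<String> that collects the tokens; the class name NGramSketch is hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

class NGramSketch {

    // Assumed behaviour: slide a window of n characters over the part number
    // and add each window to the token list. If n is longer than the part
    // number, nothing is added.
    static void generateNGram(String partNumber, int n, List<String> tokens) {
        for (int i = 0; i + n <= partNumber.length(); i++) {
            tokens.add(partNumber.substring(i, i + n));
        }
    }

    public static void main(String[] args) {
        List<String> partNumberTokens = new ArrayList<>();
        generateNGram("BK350", 3, partNumberTokens);
        System.out.println(partNumberTokens); // prints [BK3, K35, 350]
    }
}
```

A part number of length L yields L - n + 1 tokens for each size n, so generating sizes 3 through 7 grows the index noticeably for long part numbers.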

With the above two tokenization approaches applied, each record has the following fields, and the new field partNumberToken is made searchable. For brevity, only the alphanumeric parts and the 4-gram tokens are listed.

REC1:
 Name: TestBK350
 partNumber: TX1234
 partNumberToken: TX
 partNumberToken: 1234
 partNumberToken: TX12
 partNumberToken: X123
 partNumberToken: 1234

REC2:
 Name: TestProduct
 partNumber: USTXBK350
 partNumberToken: USTXBK
 partNumberToken: 350
 partNumberToken: USTX
 partNumberToken: STXB
 partNumberToken: TXBK
 partNumberToken: XBK3
 partNumberToken: BK35
 partNumberToken: K350

REC3:
 Name: TestProd3
 partNumber: TEST300
 partNumberToken: TEST
 partNumberToken: 300
 partNumberToken: TEST
 partNumberToken: EST3
 partNumberToken: ST30
 partNumberToken: T300

So now, when a search is performed with the searchTerm “BK35” (part of the partNumber), REC2 matches on the partNumberToken field, and exactly that product is returned as the result.
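The end-to-end behaviour can be sketched as follows. This is an illustration under stated assumptions, not Endeca code: the class TokenFieldSearchSketch and its methods are hypothetical, the token field is modelled as an in-memory set, and the search is an exact match against that set only (no wildcard, and the name field is not consulted).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

class TokenFieldSearchSketch {

    // Builds the partNumberToken values for one part number:
    // the alphanumeric parts plus all 3- to 7-character N-grams.
    static Set<String> tokenize(String partNumber) {
        Set<String> tokens = new LinkedHashSet<>(
                Arrays.asList(partNumber.split("(?<=\\D)(?=\\d)")));
        for (int n = 3; n <= 7; n++) {
            for (int i = 0; i + n <= partNumber.length(); i++) {
                tokens.add(partNumber.substring(i, i + n));
            }
        }
        return tokens;
    }

    // Returns the ids of records whose token set contains the term exactly.
    static List<String> search(Map<String, String> partNumbersById, String term) {
        List<String> hits = new ArrayList<>();
        for (Map.Entry<String, String> e : partNumbersById.entrySet()) {
            if (tokenize(e.getValue()).contains(term)) {
                hits.add(e.getKey());
            }
        }
        return hits;
    }

    // Sample records from the text: id -> partNumber.
    static Map<String, String> sampleRecords() {
        Map<String, String> records = new LinkedHashMap<>();
        records.put("REC1", "TX1234");
        records.put("REC2", "USTXBK350");
        records.put("REC3", "TEST300");
        return records;
    }

    public static void main(String[] args) {
        System.out.println(search(sampleRecords(), "BK35")); // prints [REC2]
    }
}
```

Because the match is exact against precomputed tokens, REC1 is no longer returned even though its name contains BK35. In a real index the tokens would be generated at indexing time rather than per query as done here.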

Category: Search

Using tokenizer for improving search relevancy in Endeca.