Build a search engine: Searching the Yelp dataset using Solr, Angular and Node

Leonidas Boutsikaris · Published in ITNEXT · Jul 26, 2019


In this post we’ll walk through the process of creating a search engine app with Angular. We will check out how Solr works and the simplest way to configure our index based on our dataset.

You can browse the code of the project while reading through this blog HERE.

Let’s start by defining all the jargon we need to know.

What is an index?

An index is a list of data. It is typically stored in plain text so that it can be quickly accessed by a search algorithm, which significantly speeds up searching. Indexes often include information about each item in the list, such as metadata or keywords, which allows the data to be searched via the index instead of reading through each file individually.

And what about SOLR?

Solr is an open-source search platform, written in Java. Solr runs as a standalone full-text search server. It uses the Lucene Java search library at its core for full-text indexing and search, and has REST-like HTTP/XML and JSON APIs that make it usable from most popular programming languages.

…yeah that’s nice but where are the data?

As mentioned before, we’re going to use the Yelp dataset, which can be found here.

The dataset contains many files that we could use, but we are only going to need three of them:

“business.json, review.json and tip.json”

Below we can see all the fields they contain.

fields of the json files

Preparing the data

We will use only a portion of the dataset. The script below chooses 100 random categories and selects businesses from those categories until we reach 10,000 of them.
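The actual script in the repo is written in Python; here is the general idea sketched in TypeScript/Node (file names and field access are assumptions, not the repo’s exact code):

import * as fs from 'fs';
import * as readline from 'readline';

// business.json has one JSON object per line
async function filterBusinesses(): Promise<void> {
  const businesses: any[] = [];
  const rl = readline.createInterface({ input: fs.createReadStream('business.json') });
  for await (const line of rl) {
    businesses.push(JSON.parse(line));
  }

  // Collect every distinct category and pick 100 of them at random
  const allCategories = new Set<string>();
  for (const b of businesses) {
    (b.categories || '').split(', ').forEach((c: string) => c && allCategories.add(c));
  }
  // Quick-and-dirty shuffle, good enough for a sketch
  const picked = new Set([...allCategories].sort(() => Math.random() - 0.5).slice(0, 100));

  // Keep businesses belonging to a picked category until we have 10,000 of them
  const selected = businesses
    .filter(b => (b.categories || '').split(', ').some((c: string) => picked.has(c)))
    .slice(0, 10000);

  fs.writeFileSync('filtered_business.json', JSON.stringify(selected));
}

filterBusinesses();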

Now let’s continue by collecting all the reviews users left for the 10,000 businesses we chose.

Finally, we will merge businesses, reviews and tips into a final.json file so we can index it without much trouble.

The fields of the final.json file can be seen below.

final.json fields

The reviews field is a JSON list, with every entry containing this info: stars, text, date. Also, the attributes field references the tip.json file that we mentioned in the beginning.
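Put together, each record in final.json looks roughly like this, sketched as a TypeScript interface (the exact types are assumptions; the field names follow the list further down):

interface Review {
  stars: number;
  text: string;
  date: string;
}

interface BusinessRecord {
  business_id: string;   // unique key
  name: string;
  categories: string;
  stars: number;
  review_count: number;
  address: string;
  city: string;
  state: string;
  reviews: Review[];     // merged from review.json
  attributes: string[];  // tips merged from tip.json
}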

In the dataset I found an inaccuracy in the review_count field: for some businesses, the review_count value in business.json is not the exact number of reviews found in review.json. There is a function in business_filter.py that demonstrates this.

Setting up SOLR

You can easily get Apache Solr from here.

Fun fact: SOLR stands for “Searching On Lucene with Replication”.

Solr organizes data into so-called search cores, which are separate, self-contained indexes. For example, if we have a core named core1, a typical search would be:

http://localhost:8983/solr/core1/select?q=query

To run queries or to create a new core, the Solr server has to be fired up:

./bin/solr start

For our app we will use a core named demo. cd into the Solr directory and create it like this:

./bin/solr create -c demo

Configuring SOLR

Solr uses two configuration files: managed-schema and solrconfig.xml. The solrconfig.xml file includes settings for the data directory location, cache parameters, request handlers and so on. The managed-schema determines which fields will be included in the search, which of these fields will be used as the unique/primary key, which fields are mandatory, and also how each field is indexed and searched.

A primary key is a unique identifier for a record.

For the solrconfig.xml file we will use a minimal version of the config.

solrconfig.xml

For the managed-schema file, we will start from the default configuration sets that ship with Solr. We will also erase several lines used for languages other than English and add only the indexing fields we actually want to search.

Reminder: all these files exist in the GitHub repo provided at the beginning.

We can easily see that the fields in the managed-schema are the same as the fields in the final.json file that we are going to use. The business_id field is used as the unique key. The rest of them define how the indexing will be done:

“names, categories, reviews, stars, review_count, address, city, state, and attributes”
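As an illustration, the relevant part of the managed-schema looks something like this (the field types here are assumptions; the exact definitions are in the repo):

<uniqueKey>business_id</uniqueKey>
<field name="business_id" type="string" indexed="true" stored="true" required="true"/>
<field name="name" type="text_general" indexed="true" stored="true"/>
<field name="categories" type="text_general" indexed="true" stored="true"/>
<field name="reviews" type="text_general" indexed="true" stored="true" multiValued="true"/>
<field name="stars" type="pfloat" indexed="true" stored="true"/>
<field name="review_count" type="pint" indexed="true" stored="true"/>

…and similar definitions for address, city, state and attributes.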

Loading data to SOLR

Once we have finished with the Solr settings, we can load data into our core.
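One way to do this (assuming the merged file is named final.json, as above) is Solr’s bundled post tool:

./bin/post -c demo final.json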

SOLR queries

We are now ready to run queries against the Solr API. Queries are built based on the port Solr runs on, the name of the core, and the indexed fields. Below is a search example based on the name and reviews fields with the keyword ‘chinese’.

http://localhost:8983/solr/demo/select?q=name%3Achinese%20AND%20reviews%3Achinese

where %3A is the URL encoding for “:” and %20 is the encoding for the space character.

…yeah that’s also nice, but how does that Solr thing work?

When a user searches on Solr, the query is processed by a request handler. The request handler is a plugin that defines the logic to be used when Solr processes a request. Solr supports a variety of request handlers. Some are designed to process search queries while others handle tasks such as index replication.

Search applications typically use Solr’s default request handler. To process a query, the request handler calls a query parser, which interprets the terms and parameters of the query. Solr’s default query parser is known as the Standard Query Parser, or simply the “lucene” parser. Solr also includes the DisMax and the Extended DisMax (eDisMax) query parsers.

The Standard Query Parser’s syntax allows highly precise searches, but the DisMax query parser is much more tolerant of errors. DisMax is designed to provide an experience similar to that of popular search engines like Google, which rarely show syntax errors to users.
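For example, switching to the eDisMax parser is just a matter of adding the defType parameter, plus a qf parameter that lists the fields to search (a hypothetical query against our demo core):

http://localhost:8983/solr/demo/select?defType=edismax&q=chinese&qf=name%20reviews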

Solr needs to talk to the app: the Node API

In order to establish communication between Solr and the application, we need an intermediate server that talks to both. It is especially necessary for handling Cross-Origin Resource Sharing (CORS) in browsers. We will use the express and cors libraries. The API consists of three endpoints that generate queries based on what the user has chosen in the search engine.

You have to install Node and also:

npm install express

npm install cors

The first endpoint serves basic queries; the second and third serve boolean queries. The first endpoint makes an HTTP call to the baseUrl parameter, which is the local Solr address, passing along the parameters given to it by the application, i.e. the keyword and the type of search. The other two endpoints are implemented accordingly.
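A minimal sketch of what such an endpoint could look like (route name, port and parameter names are illustrative, not the repo’s exact code):

import express from 'express';
import cors from 'cors';

const app = express();
app.use(cors());

// Local Solr address for our demo core
const baseUrl = 'http://localhost:8983/solr/demo/select';

// Basic search, e.g. GET /search?field=name&keyword=chinese
app.get('/search', async (req, res) => {
  const field = req.query.field as string;
  const keyword = req.query.keyword as string;
  try {
    // Build the Solr query from the parameters the app sent us
    const url = `${baseUrl}?q=${encodeURIComponent(`${field}:${keyword}`)}`;
    const solrRes = await fetch(url); // global fetch, Node 18+
    const data = await solrRes.json();
    res.json(data.response.docs);
  } catch (err) {
    res.status(500).json({ error: 'Solr query failed' });
  }
});

app.listen(3000, () => console.log('API listening on port 3000'));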

Let’s get to the app

The application will be built using Angular and Angular Material, available here.
(Don’t skip any step in the Angular Material installation guide.)

The first screen looks something like this.

All the HTML code lives inside the app.component.html file. Two divs contain the whole view: one is the search view and the other is the search results view. With *ngIf=”searchPageViewActive” we can toggle between the two divs.

When a search starts, the searchPageViewActive boolean becomes false, a progress bar indicates that a query is running, and the results page appears once the data comes back.

The result page will look like this.

The results are ordered exactly as Solr returned them. We can change that order according to the stars each business has; if there is a tie, the review count is used as a secondary ordering criterion.
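A sketch of that ordering (names are illustrative):

// Sort by stars descending; break ties with review_count
interface ResultRow {
  name: string;
  stars: number;
  review_count: number;
}

function orderResults(rows: ResultRow[]): ResultRow[] {
  return [...rows].sort((a, b) =>
    b.stars !== a.stars ? b.stars - a.stars : b.review_count - a.review_count
  );
}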

There is also a dialog that opens when we click the number of reviews and displays the full info of every review.

The code of the dialog can be seen here.

Before checking our functions, let’s get to the Data Service

The DataService.ts file serves as an intermediary between our app and our Node API. Remember the three different query types from the API? We will use three different calls from our application to the API, whichever is needed.
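A sketch of the idea (endpoint path and parameter names are assumptions, not the repo’s exact ones):

import { Injectable } from '@angular/core';
import { HttpClient } from '@angular/common/http';
import { Observable } from 'rxjs';

@Injectable({ providedIn: 'root' })
export class DataService {
  // Address of the Node API from the previous section
  private apiUrl = 'http://localhost:3000';

  constructor(private http: HttpClient) {}

  // Basic single-field search; the boolean searches get similar methods
  search(field: string, keyword: string): Observable<any[]> {
    return this.http.get<any[]>(`${this.apiUrl}/search`, {
      params: { field, keyword },
    });
  }
}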

Now the search functions

Clicking the search button invokes the search function, which validates the search keywords and syntax. If the syntax is invalid, snackbars pop up with a warning. If the query is good to go, the search function determines which of the three search types the user chose, formats the query to be Solr-ready with the cleanQuestion function, and finally calls the API through the DataService. The cleanData function then prepares the data for display in the Angular Material table.

Below we can see how the dialog handles and cleans the review data.

Some Extra stuff..

Let’s do some word highlighting. We will create a pipe that takes a list of words as input and highlights them everywhere they appear in the HTML view.
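A sketch of such a pipe (names are illustrative; the repo’s pipe may differ):

import { Pipe, PipeTransform } from '@angular/core';

@Pipe({ name: 'highlight' })
export class HighlightPipe implements PipeTransform {
  // Wraps every occurrence of each word in a <mark> tag
  transform(text: string, words: string[]): string {
    if (!text || !words || words.length === 0) {
      return text;
    }
    let result = text;
    for (const word of words) {
      if (!word) { continue; }
      const re = new RegExp(`(${word})`, 'gi');
      result = result.replace(re, '<mark>$1</mark>');
    }
    return result;
  }
}

Since the pipe returns markup, the view has to bind its output with [innerHTML] so the <mark> tags actually render.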

Why not add a second way of ordering the results?

Let’s keep track of every click users make, so we know which businesses caught their eye. This will be done every time a review dialog opens. Google Firestore will be used for storing the data.
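A sketch of that logging with AngularFire (the collection name is an assumption, and the import path depends on your @angular/fire version):

import { Injectable } from '@angular/core';
import { AngularFirestore } from '@angular/fire/firestore';

@Injectable({ providedIn: 'root' })
export class ClickTrackerService {
  constructor(private firestore: AngularFirestore) {}

  // Called every time a review dialog opens for a business
  logClick(businessId: string): void {
    // Store one document per click; counts can be aggregated later
    this.firestore.collection('clicks').add({
      businessId,
      timestamp: new Date(),
    });
  }
}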

Firestore has detailed documentation, and a lot of tutorials are available online; you should check them out.

Summary

At first we prepared the data with some Python scripts. We then set up and configured Solr. Finally, we created an API so our front-end app can talk to the server, prepare the data and display it in the search results view.
Building a complete application from scratch can be quite tedious, but it forces you to learn a lot in the process. Even with a simple app you have to work with many tools before ending up with the desired outcome.


Software Engineer, writing code and blogging about it. Found something interesting? Buy me a coffee! https://ko-fi.com/leobouts