Lab script 1

It is now time to get some practical experience with information retrieval tools. Elasticsearch is one of the most powerful and most widely used open-source search engines. It is simple, scalable, and highly efficient, and can manage structured as well as unstructured data. The examples look a bit like database examples, but that is because the data comes with some structure. You will see that the pre-processing steps that you came across in this week's lecture are actually performed on each field individually (check the guide to find out more).

Please do not stop working with Elasticsearch when you finish this lab session. Keep playing around with it: install it on your own computer, use it in your own project, submit code, and join the community. Here is a starting point to explore the framework before you approach the steps below.

Installation
This lab assumes that you are using the Linux command line. Therefore, feel free to leave the Windows environment and reboot to Ubuntu on your machine. You can also install Elasticsearch on other operating systems, but Linux nicely fits the settings of Lab 1 and Lab 2. The instructions below assume that the command line shell you are using is either bash or sh. You can use Terminal on Ubuntu.
NOTE: Throughout the labs we are using version 6.5.1 of Elasticsearch (and Kibana) as a reference point because we know that this works with the CSEE Lab settings. Later versions have been released, however, and offer a lot of new and improved features. Feel free to install the latest release (at least when you install the software on your own computer). With every new release there might be some variations in the commands, so please explore which commands work on your release by searching for the supported commands online.

Let's set up a folder to work in. We're creating a temporary install to play with. In a full server environment this would likely be different. There are also services which offer Elasticsearch set up and ready to run.
To start, create a new directory in one of your folders and change to that directory.
mkdir search_exercise
cd search_exercise

We are going to need to download Elasticsearch (the search engine we will be using):
curl -L -O https://artifacts.elastic.co/downloads/elasticsearch/elasticsearch-6.5.1.tar.gz
We'll be using a tool called curl a lot in this lab. curl is a very handy tool for communicating with servers over various protocols, including HTTP. The -L above follows HTTP redirects, and the -O tells curl to write the file locally using the same filename it has on the server (elasticsearch-6.5.1.tar.gz). Without -O, the default behaviour is to print the contents of the file directly to the console.
We're also going to need the Java 8 runtime environment (not the SDK, we won't be compiling anything). Here the lowercase -o saves the download under the name we choose (jre8.tgz):
curl -L -o jre8.tgz "http://javadl.oracle.com/webapps/download/AutoDL?BundleId=216422"

Now we can extract both of these archives:
tar xf elasticsearch-6.5.1.tar.gz
tar xf jre8.tgz

To run Elasticsearch:
JAVA_HOME=/tmp/search_exercise/jre1.8.0_111 PATH=$JAVA_HOME/bin:$PATH elasticsearch-6.5.1/bin/elasticsearch -d

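JAVA_HOME lets Elasticsearch know where to find our Java 8 install, and PATH lets the shell know where to find executables, so it can find java. The path above assumes search_exercise was created under /tmp; adjust it to match where you extracted the JRE. The -d launches Elasticsearch as a daemon, so it will run in the background.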

Indexing Documents
Grab a collection of documents from the Elasticsearch examples:
curl -L -o accounts.json "https://github.com/elastic/elasticsearch/blob/master/docs/src/test/resources/accounts.json?raw=true"

Take a peek inside the accounts.json file. It's in JSON format.
Each record takes two lines: an index line that specifies the id of the document, followed by the document itself.
Let's post this collection of accounts into Elasticsearch:
curl -H "Content-Type: application/json" -XPOST "localhost:9200/bank/account/_bulk?pretty&refresh" --data-binary "@accounts.json"
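To make the two-line layout of the bulk file concrete, here is a minimal Python sketch that builds a body in the same shape from a dictionary of records. The function name build_bulk_body, its input shape, and the sample record are assumptions made for illustration; they are not part of Elasticsearch or of accounts.json.

```python
import json


def build_bulk_body(docs):
    """Return a bulk body: one index line, then one document line, per record.

    `docs` maps a document id to its source dict (a hypothetical input shape).
    """
    lines = []
    for doc_id, source in docs.items():
        # Action line: tells Elasticsearch to index the next line under this id.
        lines.append(json.dumps({"index": {"_id": str(doc_id)}}))
        # Source line: the document itself, on a single line.
        lines.append(json.dumps(source))
    # The bulk API expects the body to end with a newline.
    return "\n".join(lines) + "\n"


# Made-up sample record, only to show the layout.
sample = {1: {"firstname": "Ada", "balance": 100}}
print(build_bulk_body(sample), end="")
```

Writing the returned string to a file would give something curl could post with --data-binary in the same way as accounts.json.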

Now check to see what indices Elasticsearch has:
curl "localhost:9200/_cat/indices?v"

You should see a table including a bank index containing 1000 documents. If a lot of content is printed, feel free to pipe the output (|) into more. More details here.

Searching
Let's start with a query that matches all the documents.
curl -XGET "localhost:9200/bank/_search?pretty" -H "Content-Type: application/json" -d '
{
  "query": { "match_all": {} }
}'

You can see from the hits.total field that we matched 1000 documents and that, by default, the first 10 are shown. You can page through the output by piping it (|) into more.
If you find manipulating large queries in the terminal using curl a little unwieldy and are getting lots of errors, feel free to try Kibana. The install is very quick and there are instructions at the end of the lab sheet. The rest of the lab discusses only the query section; you are free to choose how to connect.

Pagination
The query we performed only showed 10 documents. We can show the next 10 as follows:
curl -XGET "localhost:9200/bank/_search?pretty" -H "Content-Type: application/json" -d '
{
  "query": { "match_all": {} },
  "from": 10,
  "size": 10
}'

from sets the document from which we start (counting from 0), and size sets how many documents are shown.
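The arithmetic behind from and size can be sketched in Python as follows. The helper page_query and its 1-based page numbering are assumptions chosen for illustration; only the from, size and match_all keys come from the query above.

```python
import json


def page_query(page, page_size=10):
    """Build a match_all search body for a 1-based page number."""
    if page < 1 or page_size < 1:
        raise ValueError("page and page_size must be positive")
    return {
        "query": {"match_all": {}},
        "from": (page - 1) * page_size,  # number of hits to skip
        "size": page_size,               # number of hits to return
    }


# Page 2 with the default page size reproduces the request shown above.
print(json.dumps(page_query(2), indent=2))
```

The printed JSON can be pasted after -d in the curl command, or into the Kibana dev console.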

Querying Full Text
A full text query takes multiple words and searches for all of them, giving each document a score based on how closely it matches. Let's try an example:
curl -XGET "localhost:9200/bank/_search?pretty" -H "Content-Type: application/json" -d '
{
  "query": {
    "match": {
      "address": "national street"
    }
  }
}'

The address bit of the query tells us which field we will be matching; this can be substituted by _all to match any field. (Note that from Elasticsearch 6.0 onwards the _all field is disabled by default for newly created indices, so this substitution may return no hits on version 6.5.1.)
Looking at the results from this query, it seems like we searched wrong. There isn't a National Street; there's a National Drive, though. Notice how results that contained either "national" or "street" were returned. match defaults to being an or query, so it will match documents containing either of the two terms. If we change the operator to and, our "national street" search will return 0 results, because the terms "national" and "street" are not present together in any address field.
curl -XGET "localhost:9200/bank/_search?pretty" -H "Content-Type: application/json" -d '
{
  "query": {
    "match": {
      "address": {
        "query": "national street",
        "operator": "and"
      }
    }
  }
}'

However, let's try "National Drive" with and:
curl -XGET "localhost:9200/bank/_search?pretty" -H "Content-Type: application/json" -d '
{
  "query": {
    "match": {
      "address": {
        "query": "drive national",
        "operator": "and"
      }
    }
  }
}'

I've deliberately reversed the terms. Note how the search still works. It considers the terms independently, in any order, but they must both be in the address field for the document to be a hit.
But what if we really wanted to match "national drive" exactly?
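The difference between the or and and operators can be imitated locally with a toy matcher. This is only a sketch under strong assumptions: the tokens function is a crude stand-in for Elasticsearch's analysis (lowercase, split on whitespace), there is no scoring, and the addresses are made up.

```python
def tokens(text):
    """Rough stand-in for an analyzer: lowercase and split on whitespace."""
    return text.lower().split()


def matches(field_value, query, operator="or"):
    """Return True if the field satisfies the query under the given operator."""
    field_terms = set(tokens(field_value))
    query_terms = tokens(query)
    if operator == "and":
        # Every query term must occur somewhere in the field, in any order.
        return all(term in field_terms for term in query_terms)
    # Default "or": a single shared term is enough.
    return any(term in field_terms for term in query_terms)


# Made-up addresses for illustration only.
addresses = ["100 National Drive", "12 Example Street", "7 Other Avenue"]
for address in addresses:
    print(address,
          matches(address, "national street"),
          matches(address, "national street", "and"),
          matches(address, "drive national", "and"))
```

Because the field terms are held in a set, word order plays no part here, which mirrors why the reversed query still finds National Drive.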

Matching Exact Phrases
match_phrase matches "National Drive" exactly. This gives only 1 result.
curl -XGET "localhost:9200/bank/_search?pretty" -H "Content-Type: application/json" -d '
{
  "query": {
    "match_phrase": {
      "address": "national drive"
    }
  }
}'

Reversing the terms as we did in the previous example does not work here, because this matches the exact phrase. Sometimes, however, you only have part of the phrase.
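Why order matters for a phrase can be shown with a small local sketch: the query terms must appear next to each other, in the same order, inside the field. As before, tokens is a simplified stand-in for real analysis and the sample address is made up.

```python
def tokens(text):
    """Rough stand-in for an analyzer: lowercase and split on whitespace."""
    return text.lower().split()


def phrase_matches(field_value, phrase):
    """Return True if the phrase terms occur consecutively and in order."""
    field_terms = tokens(field_value)
    phrase_terms = tokens(phrase)
    n = len(phrase_terms)
    if n == 0:
        return False
    # Slide a window of n terms over the field and compare it to the phrase.
    return any(field_terms[i:i + n] == phrase_terms
               for i in range(len(field_terms) - n + 1))


address = "100 National Drive"  # made-up address for illustration
print(phrase_matches(address, "national drive"))
print(phrase_matches(address, "drive national"))
```

The first call finds the phrase and the second does not, since the reversed terms never appear in that order.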

Matching Part of Phrases
This type of search matches a phrase whose last term is treated as a prefix, like a trailing wildcard. As an example, let's use this to build an autocomplete (search suggest) feature: when the user starts typing, we could suggest what they may want to type.
For example, try searching for a firstname:
curl -XGET "localhost:9200/bank/_search?pretty" -H "Content-Type: application/json" -d '
{
  "query": {
    "match_phrase_prefix": {
      "firstname": "Jo"
    }
  }
}'

You will notice you are shown lots of records with firstnames that start with Jo, including Josephine, Josephina, Josie, and many others.
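The prefix behaviour can be approximated locally as follows: all typed terms except the last must match exactly and in order, while the last one only needs to be the beginning of a term. This is a toy sketch, not how Elasticsearch implements it; the tokens helper is a simplification and "Amber" is a made-up extra name.

```python
def tokens(text):
    """Rough stand-in for an analyzer: lowercase and split on whitespace."""
    return text.lower().split()


def phrase_prefix_matches(field_value, typed):
    """Return True if `typed` matches as a phrase whose last term is a prefix."""
    field_terms = tokens(field_value)
    typed_terms = tokens(typed)
    if not typed_terms:
        return False
    *head, last = typed_terms
    n = len(typed_terms)
    for i in range(len(field_terms) - n + 1):
        window = field_terms[i:i + n]
        # Leading terms must be equal; the final term may be incomplete.
        if window[:-1] == head and window[-1].startswith(last):
            return True
    return False


names = ["Josephine", "Josephina", "Josie", "Amber"]
print([name for name in names if phrase_prefix_matches(name, "Jo")])
```

In an autocomplete box, a filter like this would be re-run on every keystroke with whatever the user has typed so far.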

Matching Multiple Fields
It's common in a search engine to want to match multiple fields with a single query. Let's say, for example, that we typically search by lastname when looking up customer accounts, but sometimes we are given a name and we don't know whether it is a firstname or a lastname. To improve our recall we want to search both fields.
To achieve this we can use a multi_match query:
curl -XGET "localhost:9200/bank/_search?pretty" -H "Content-Type: application/json" -d '
{
  "query": {
    "multi_match": {
      "query": "Francis",
      "fields": ["firstname", "lastname"]
    }
  }
}'

This hasn't quite worked, though. Francis Beck came before Kelli Francis. We can boost the lastname field in this search to make it more important:
curl -XGET "localhost:9200/bank/_search?pretty" -H "Content-Type: application/json" -d '
{
  "query": {
    "multi_match": {
      "query": "Francis",
      "fields": ["firstname", "lastname^2"]
    }
  }
}'

Now Kelli Francis comes first.
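The field^boost notation is easy to get wrong when typed by hand, so a small helper can generate it. The sketch below is illustrative only: multi_match_body and its boosts dictionary are hypothetical names, and the helper simply builds the same request body as the query above.

```python
import json


def multi_match_body(query, boosts):
    """Build a multi_match body from a {field: boost} mapping.

    A boost of 1 leaves the field name unchanged; any other value is
    appended with the field^boost notation.
    """
    fields = [name if boost == 1 else f"{name}^{boost}"
              for name, boost in boosts.items()]
    return {"query": {"multi_match": {"query": query, "fields": fields}}}


# Same search as above: lastname counts twice as much as firstname.
body = multi_match_body("Francis", {"firstname": 1, "lastname": 2})
print(json.dumps(body, indent=2))
```

Keeping the boosts in one dictionary makes it simple to experiment with different weights and compare the ordering of the results.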

Sorting
The query below sorts the results in descending order (desc) by balance.
curl -XGET "localhost:9200/bank/_search?pretty" -H "Content-Type: application/json" -d '
{
  "query": { "match_all": {} },
  "sort": { "balance": { "order": "desc" } }
}'

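Try repeating the search suggest exercise from earlier, sorted alphabetically by firstname. Hint: with the default mapping, text fields cannot be sorted on directly, so sort on firstname.keyword.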

Filtering
Filtering uses bool queries. A filter clause only answers yes or no for each document (effectively a score of either 0 or 1), so it narrows the results without changing their ranking. We can extend our earlier autocomplete example.
Let's pretend we have a bank office in the state of Florida, so we are only interested in our search showing those records:
curl -XGET "localhost:9200/bank/_search?pretty" -H "Content-Type: application/json" -d '
{
  "query": {
    "bool": {
      "must": {
        "match_phrase_prefix": {
          "firstname": "Jo"
        }
      },
      "filter": {
        "term": {
          "state.keyword": "FL"
        }
      }
    }
  }
}'

Now there are only two results.
It isn't just terms we can filter by; we can also filter by numeric ranges. Let's pretend we are searching for someone with a name that starts with Jo, but we are in the mortgages department of the bank and only process customers that hold a balance of at least 11,000, as the bank says we aren't allowed to offer a mortgage to customers who have less.
curl -XGET "localhost:9200/bank/_search?pretty" -H "Content-Type: application/json" -d '
{
  "query": {
    "bool": {
      "must": {
        "match_phrase_prefix": {
          "firstname": "Jo"
        }
      },
      "filter": {
        "range": {
          "balance": {
            "gte": 11000
          }
        }
      }
    }
  }
}'

Rather than 10 results, we now have 7, excluding the results that had a balance less than 11,000.
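The two filtered searches in this section differ only in their filter clause, so they can be generated by one helper. The following Python sketch is an illustration under assumptions: filtered_prefix_query and its parameters are hypothetical names, it passes the filters as a list (which the bool query accepts), and it uses gte, the documented inclusive lower bound of the range query.

```python
import json


def filtered_prefix_query(prefix, state=None, min_balance=None):
    """Build a bool query: prefix match on firstname plus optional filters."""
    filters = []
    if state is not None:
        # Exact match on the un-analysed keyword sub-field.
        filters.append({"term": {"state.keyword": state}})
    if min_balance is not None:
        # Keep only documents whose balance is at least min_balance.
        filters.append({"range": {"balance": {"gte": min_balance}}})
    bool_query = {"must": {"match_phrase_prefix": {"firstname": prefix}}}
    if filters:
        bool_query["filter"] = filters
    return {"query": {"bool": bool_query}}


print(json.dumps(filtered_prefix_query("Jo", state="FL"), indent=2))
print(json.dumps(filtered_prefix_query("Jo", min_balance=11000), indent=2))
```

Passing both state and min_balance combines the filters, so a document must satisfy each of them to be returned.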

Exercises
Exercise 1
The LA office has brought you in as a consultant. They have lots of company accounts, and customers most often call up quoting their company name (employer). They want to be able to search by that, firstname, and lastname. They would then like all the results returned in alphabetical order by company name. They don't want to see results from any other offices, though.
Exercise 2
The bank HQ marketing department wants to run a promotion. They're really interested in marketing to their under-30, high-income customers. They'd like a report that shows only customers under 30, in descending order of balance.
Exercise 3
The customer records department is having a problem with its existing system, which keeps track of all the customers' addresses. When they search for Clay, the system runs the following search:
curl -XGET "localhost:9200/bank/_search?pretty" -H "Content-Type: application/json" -d '
{
  "query": {
    "match": {
      "_all": "Clay"
    }
  }
}'

But that brings up someone with the name Clay first. They would like to change it so that anything with the city Clay, or with Clay in the address, is shown before anyone with the name Clay.

Kibana Install (in your own time)
There is a visualisation tool called Kibana that comes with, amongst other things, a dev console that can be used to connect to your Elasticsearch instance.
Installing and running it is very similar to Elasticsearch. Use the Kibana release that matches your Elasticsearch version (6.5.1 in this lab):
curl -L -O https://artifacts.elastic.co/downloads/kibana/kibana-6.5.1-linux-x86_64.tar.gz
tar xf kibana-6.5.1-linux-x86_64.tar.gz
JAVA_HOME=/tmp/search_exercise/jre1.8.0_111 PATH=$JAVA_HOME/bin:$PATH kibana-6.5.1-linux-x86_64/bin/kibana &

Kibana doesn't come with a built-in switch to run as a daemon, so we have just added the & to the end of the command to run it in the background while we carry on working.
You can then open Kibana in your browser at http://localhost:5601/.

Further Ideas
This has been a quick introduction to installing and using Elasticsearch. The next step is making Elasticsearch part of your wider application for your specific use. There is a wide range of libraries for different languages and frameworks that can assist you in passing queries to Elasticsearch and retrieving data that you can then display to your users. You may want to pick a library or framework and experiment with displaying some data as part of a web application.
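As a first step before choosing a library, the curl requests from this lab can be reproduced with nothing but the Python standard library. The sketch below is a minimal illustration, not a recommended client: build_search_request and search are hypothetical helper names, the host defaults to the local instance used in this lab, and the request is sent as POST, which the _search endpoint accepts as well as GET.

```python
import json
import urllib.request


def build_search_request(index, body, host="http://localhost:9200"):
    """Prepare (but do not send) a search request for the given index."""
    url = f"{host}/{index}/_search"
    data = json.dumps(body).encode("utf-8")
    return urllib.request.Request(
        url,
        data=data,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def search(index, body, host="http://localhost:9200"):
    """Send the search and return the decoded JSON response.

    Needs a running Elasticsearch instance, so it is not called here.
    """
    request = build_search_request(index, body, host)
    with urllib.request.urlopen(request) as response:
        return json.load(response)


# Example (with Elasticsearch running and the bank index loaded):
#   result = search("bank", {"query": {"match_all": {}}})
#   print(result["hits"]["total"])
```

Separating request construction from sending keeps the first part easy to inspect and test without a server; a web application would call search and render the hits it returns.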
