Home » Java » Enterprise Java » Mini Search Engine – Just the basics, using Neo4j, Crawler4j, Graphstream and Encog

About Brian Du Preez

Mini Search Engine – Just the basics, using Neo4j, Crawler4j, Graphstream and Encog

Continuing to chapter 4 of Programming Collection Intelligence  (PCI) which is implementing a search engine.

I may have bitten off a little more than I should of in 1 exercise. Instead of using the normal relational database construct as used in the book, I figured, I always wanted to have a look at Neo4J so now was the time. Just to say, this isn’t necessarily the ideal use case for a graph db, but how hard could to be to kill 3 birds with 1 stone.

Working through the tutorials trying to reset my SQL Server, Oracle mindset took a little longer than expected, but thankfully there are some great resources around Neo4j.

Just a couple:

Since I just wanted to run this as a little exercise, I decided to go for a in memory implementation and not run it as a service on my machine. In hindsight this was probably a mistake and the tools and web interface would have helped me visualise my data graph quicker in the beginning.

As you can only have 1 writable instance of the in memory implementation, I made a little double lock singleton factory to create and clear the DB.

package net.briandupreez.pci.chapter4;

import org.neo4j.graphdb.GraphDatabaseService;
import org.neo4j.graphdb.factory.GraphDatabaseFactory;
import org.neo4j.kernel.impl.util.FileUtils;

import java.io.File;
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

public class CreateDBFactory {

    private static GraphDatabaseService graphDb = null;
    public static final String RESOURCES_CRAWL_DB = "resources/crawl/db";

    public static GraphDatabaseService createInMemoryDB() {
        if (null == graphDb) {
            synchronized (GraphDatabaseService.class) {
                if (null == graphDb) {
                    final Map<String, String> config = new HashMap<>();
                    config.put("neostore.nodestore.db.mapped_memory", "50M");
                    config.put("string_block_size", "60");
                    config.put("array_block_size", "300");
                    graphDb = new GraphDatabaseFactory()
                            .newEmbeddedDatabaseBuilder(RESOURCES_CRAWL_DB)
                            .setConfig(config)
                            .newGraphDatabase();

                    registerShutdownHook(graphDb);
                }
            }
        }
        return graphDb;
    }

    private static void registerShutdownHook(final GraphDatabaseService graphDb) {
        Runtime.getRuntime().addShutdownHook(new Thread() {
            @Override
            public void run() {
                graphDb.shutdown();
            }
        });
    }


    public static void clearDb() {
        try {
            if(graphDb != null){
                graphDb.shutdown();
                graphDb = null;
            }
            FileUtils.deleteRecursively(new File(RESOURCES_CRAWL_DB));
        } catch (final IOException e) {
            throw new RuntimeException(e);
        }
    }

}

Then using Crawler4j created a graph of all the URLs starting with my blog, their relationships to other URLs and all the words and indexes of the words that those URLs contain.

package net.briandupreez.pci.chapter4;


import edu.uci.ics.crawler4j.crawler.Page;
import edu.uci.ics.crawler4j.crawler.WebCrawler;
import edu.uci.ics.crawler4j.parser.HtmlParseData;
import edu.uci.ics.crawler4j.url.WebURL;
import org.neo4j.graphdb.GraphDatabaseService;
import org.neo4j.graphdb.Node;
import org.neo4j.graphdb.Relationship;
import org.neo4j.graphdb.Transaction;
import org.neo4j.graphdb.index.Index;

import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class Neo4JWebCrawler extends WebCrawler {


    private final GraphDatabaseService graphDb;


    /**
     * Constructor.
     */
    public Neo4JWebCrawler() {
        this.graphDb = CreateDBFactory.createInMemoryDB();
    }


    @Override
    public boolean shouldVisit(final WebURL url) {
        final String href = url.getURL().toLowerCase();
        return !NodeConstants.FILTERS.matcher(href).matches();
    }

    /**
     * This function is called when a page is fetched and ready
     * to be processed by your program.
     */
    @Override
    public void visit(final Page page) {

        final String url = page.getWebURL().getURL();
        System.out.println("URL: " + url);

        final Index<Node> nodeIndex = graphDb.index().forNodes(NodeConstants.PAGE_INDEX);

        if (page.getParseData() instanceof HtmlParseData) {
            HtmlParseData htmlParseData = (HtmlParseData) page.getParseData();
            String text = htmlParseData.getText();
            //String html = htmlParseData.getHtml();
            List<WebURL> links = htmlParseData.getOutgoingUrls();
            Transaction tx = graphDb.beginTx();
            try {

                final Node pageNode = graphDb.createNode();
                pageNode.setProperty(NodeConstants.URL, url);
                nodeIndex.add(pageNode, NodeConstants.URL, url);

                //get all the words
                final List<String> words = cleanAndSplitString(text);
                int index = 0;
                for (final String word : words) {
                    final Node wordNode = graphDb.createNode();
                    wordNode.setProperty(NodeConstants.WORD, word);
                    wordNode.setProperty(NodeConstants.INDEX, index++);
                    final Relationship relationship = pageNode.createRelationshipTo(wordNode, RelationshipTypes.CONTAINS);
                    relationship.setProperty(NodeConstants.SOURCE, url);
                }

                for (final WebURL webURL : links) {
                    System.out.println("Linking to " + webURL);
                    final Node linkNode = graphDb.createNode();
                    linkNode.setProperty(NodeConstants.URL, webURL.getURL());
                    final Relationship relationship = pageNode.createRelationshipTo(linkNode, RelationshipTypes.LINK_TO);
                    relationship.setProperty(NodeConstants.SOURCE, url);
                    relationship.setProperty(NodeConstants.DESTINATION, webURL.getURL());
                }

                tx.success();
            } finally {
                tx.finish();
            }

        }
    }


    private static List<String> cleanAndSplitString(final String input) {
        if (input != null) {
            final String[] dic = input.toLowerCase().replaceAll("\\p{Punct}", "").replaceAll("\\p{Digit}", "").split("\\s+");
            return Arrays.asList(dic);
        }
        return new ArrayList<>();
    }

}

After the data was collected, I could query it and perform the functions of a search engine. For this I decided to use java futures as it was another thing I had only read about and not yet implemented. In my day to day working environment we use Weblogic  / CommonJ work managers within the application server to perform the same task.

  final ExecutorService executorService = Executors.newFixedThreadPool(4);
  final String[] searchTerms = {"java", "spring"};

  List<Callable<TaskResponse>> tasks = new ArrayList<>();
  tasks.add(new WordFrequencyTask(searchTerms));
  tasks.add(new DocumentLocationTask(searchTerms));
  tasks.add(new PageRankTask(searchTerms));
  tasks.add(new NeuralNetworkTask(searchTerms));

  final List<Future<TaskResponse>> results = executorService.invokeAll(tasks);

I then went about creating a task for each of the following counting the word frequency, document location, Page Rank and neural network (with fake input / training data) to rank the pages returned based on the search criteria. All the code is in my public github blog repo.

Disclaimer: The Neural Network task, either didn’t have enough data to be affective, or I implemented the data normalisation incorrectly, so it is currently not very useful, I’ll return to it once I have completed the journey through the while PCI book.

The one task worth sharing was the Page Rank one, I quickly read some of the theory for it, decided I am not that clever and went searching for a library that had it implemented. I discovered Graphstream a wonderful opensource project that does a WHOLE lot more than just PageRank, check out their video.

From that it was then simple to implement my PageRank task of this exercise.

package net.briandupreez.pci.chapter4.tasks;

import net.briandupreez.pci.chapter4.NodeConstants;
import net.briandupreez.pci.chapter4.NormalizationFunctions;
import org.graphstream.algorithm.PageRank;
import org.graphstream.graph.Graph;
import org.graphstream.graph.implementations.SingleGraph;
import org.neo4j.cypher.javacompat.ExecutionEngine;
import org.neo4j.cypher.javacompat.ExecutionResult;
import org.neo4j.graphdb.Node;
import org.neo4j.graphdb.Relationship;

import java.util.HashMap;
import java.util.Iterator;
import java.util.Map;
import java.util.concurrent.Callable;

public class PageRankTask extends SearchTask implements Callable<TaskResponse> {

    public PageRankTask(final String... terms) {
        super(terms);
    }

    @Override
    protected ExecutionResult executeQuery(final String... words) {
        final ExecutionEngine engine = new ExecutionEngine(graphDb);
        final StringBuilder bob = new StringBuilder("START page=node(*) MATCH (page)-[:CONTAINS]->words ");
        bob.append(", (page)-[:LINK_TO]->related ");
        bob.append("WHERE words.word in [");
        bob.append(formatArray(words));
        bob.append("] ");
        bob.append("RETURN DISTINCT page, related");

        return engine.execute(bob.toString());
    }

    public TaskResponse call() {
        final ExecutionResult result = executeQuery(searchTerms);
        final Map<String, Double> returnMap = convertToUrlTotalWords(result);

        final TaskResponse response = new TaskResponse();
        response.taskClazz = this.getClass();
        response.resultMap = NormalizationFunctions.normalizeMap(returnMap, true);
        return response;
    }

    private Map<String, Double> convertToUrlTotalWords(final ExecutionResult result) {
        final Map<String, Double> uniqueUrls = new HashMap<>();

        final Graph g = new SingleGraph("rank", false, true);
        final Iterator<Node> pageIterator = result.columnAs("related");

        while (pageIterator.hasNext()) {
            final Node node = pageIterator.next();
            final Iterator<Relationship> relationshipIterator = node.getRelationships().iterator();
            while (relationshipIterator.hasNext()) {

                final Relationship relationship = relationshipIterator.next();
                final String source = relationship.getProperty(NodeConstants.SOURCE).toString();
                uniqueUrls.put(source, 0.0);
                final String destination = relationship.getProperty(NodeConstants.DESTINATION).toString();
                g.addEdge(String.valueOf(node.getId()), source, destination, true);

            }
        }


        computeAndSetPageRankScores(uniqueUrls, g);
        return uniqueUrls;
    }

    /**
     * Compute score
     *
     * @param uniqueUrls urls
     * @param graph      the graph of all links
     */
    private void computeAndSetPageRankScores(final Map<String, Double> uniqueUrls, final Graph graph) {
        final PageRank pr = new PageRank();
        pr.init(graph);
        pr.compute();

        for (final Map.Entry<String, Double> entry : uniqueUrls.entrySet()) {
            final double score = 100 * pr.getRank(graph.getNode(entry.getKey()));
            entry.setValue(score);
        }
    }


}

In between all of this I found a great implementation of sorting a map by values on Stackoverflow.

package net.briandupreez.pci.chapter4;

import java.util.*;


public class MapUtil {

    /**
     * Sort a map based on values.
     * The values must be Comparable.
     *
     * @param map       the map to be sorted
     * @param ascending in ascending order, or descending if false
     * @param <K>       key generic
     * @param <V>       value generic
     * @return sorted list
     */
    public static <K, V extends Comparable<? super V>> List<Map.Entry<K, V>> entriesSortedByValues(final Map<K, V> map, final boolean ascending) {

        final List<Map.Entry<K, V>> sortedEntries = new ArrayList<>(map.entrySet());
        Collections.sort(sortedEntries,
                new Comparator<Map.Entry<K, V>>() {
                    @Override
                    public int compare(final Map.Entry<K, V> e1, final Map.Entry<K, V> e2) {
                        if (ascending) {
                            return e1.getValue().compareTo(e2.getValue());
                        } else {
                            return e2.getValue().compareTo(e1.getValue());
                        }
                    }
                }
        );

        return sortedEntries;
    }

}

The Maven dependencies used to implement all of this

         <dependency>
               <groupId>com.google.guava</groupId>
               <artifactId>guava</artifactId>
               <version>14.0.1</version>
           </dependency>
           <dependency>
               <groupId>org.encog</groupId>
               <artifactId>encog-core</artifactId>
               <version>3.2.0-SNAPSHOT</version>
           </dependency>
           <dependency>
               <groupId>edu.uci.ics</groupId>
               <artifactId>crawler4j</artifactId>
               <version>3.5</version>
               <type>jar</type>
               <scope>compile</scope>
           </dependency>
           <dependency>
               <groupId>org.neo4j</groupId>
               <artifactId>neo4j</artifactId>
               <version>1.9</version>
           </dependency>
           <dependency>
               <groupId>org.graphstream</groupId>
               <artifactId>gs-algo</artifactId>
               <version>1.1.2</version>
           </dependency>

Now to chapter 5 on PCI… Optimisation.
 

Do you want to know how to develop your skillset to become a Java Rockstar?

Subscribe to our newsletter to start Rocking right now!

To get you started we give you our best selling eBooks for FREE!

1. JPA Mini Book

2. JVM Troubleshooting Guide

3. JUnit Tutorial for Unit Testing

4. Java Annotations Tutorial

5. Java Interview Questions

6. Spring Interview Questions

7. Android UI Design

and many more ....

2 comments

  1. It’s very nice

  2. Is it free to use NEO4j

Leave a Reply

Your email address will not be published. Required fields are marked *

*


− 8 = zero

You may use these HTML tags and attributes: <a href="" title=""> <abbr title=""> <acronym title=""> <b> <blockquote cite=""> <cite> <code> <del datetime=""> <em> <i> <q cite=""> <strike> <strong>

Do you want to know how to develop your skillset and become a ...

Subscribe to our newsletter to start Rocking right now!

To get you started we give you our best selling eBooks for FREE!
Get ready to Rock!
To download the books, please verify your email address by following the instructions found on the email we just sent you.

THANK YOU!

Close