Mini Search Engine – Just the basics, using Neo4j, Crawler4j, Graphstream and Encog

Continuing on to chapter 4 of Programming Collective Intelligence (PCI), which covers implementing a search engine.

I may have bitten off a little more than I should have in one exercise. Instead of the usual relational database construct used in the book, I figured that since I had always wanted to have a look at Neo4j, now was the time. Just to say, this isn't necessarily the ideal use case for a graph DB, but how hard could it be to kill three birds with one stone?

Working through the tutorials while trying to reset my SQL Server / Oracle mindset took a little longer than expected, but thankfully there are some great resources around Neo4j.

Since I just wanted to run this as a little exercise, I decided to go for an in-memory implementation rather than running it as a service on my machine. In hindsight this was probably a mistake: the tools and web interface would have helped me visualise my data graph more quickly in the beginning.

As you can only have one writable instance of the in-memory implementation, I made a little double-checked-locking singleton factory to create and clear the DB.

package net.briandupreez.pci.chapter4;

import org.neo4j.graphdb.GraphDatabaseService;
import org.neo4j.graphdb.factory.GraphDatabaseFactory;
import org.neo4j.kernel.impl.util.FileUtils;

import java.io.File;
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

public class CreateDBFactory {

    private static volatile GraphDatabaseService graphDb = null; // volatile is needed for double-checked locking to be safe
    public static final String RESOURCES_CRAWL_DB = "resources/crawl/db";

    public static GraphDatabaseService createInMemoryDB() {
        if (null == graphDb) {
            synchronized (GraphDatabaseService.class) {
                if (null == graphDb) {
                    final Map<String, String> config = new HashMap<>();
                    config.put("neostore.nodestore.db.mapped_memory", "50M");
                    config.put("string_block_size", "60");
                    config.put("array_block_size", "300");
                    graphDb = new GraphDatabaseFactory()
                            .newEmbeddedDatabaseBuilder(RESOURCES_CRAWL_DB)
                            .setConfig(config)
                            .newGraphDatabase();

                    registerShutdownHook(graphDb);
                }
            }
        }
        return graphDb;
    }

    private static void registerShutdownHook(final GraphDatabaseService graphDb) {
        Runtime.getRuntime().addShutdownHook(new Thread() {
            @Override
            public void run() {
                graphDb.shutdown();
            }
        });
    }


    public static void clearDb() {
        try {
            if(graphDb != null){
                graphDb.shutdown();
                graphDb = null;
            }
            FileUtils.deleteRecursively(new File(RESOURCES_CRAWL_DB));
        } catch (final IOException e) {
            throw new RuntimeException(e);
        }
    }

}
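
A quick usage sketch of the factory: wipe any previous crawl data, then grab the single shared instance.

// clear out the previous run's data, then lazily create the shared embedded DB
CreateDBFactory.clearDb();
final GraphDatabaseService graphDb = CreateDBFactory.createInMemoryDB();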

Then, using Crawler4j, I created a graph of all the URLs starting with my blog, their relationships to other URLs, and all the words and word indexes that those URLs contain.

package net.briandupreez.pci.chapter4;


import edu.uci.ics.crawler4j.crawler.Page;
import edu.uci.ics.crawler4j.crawler.WebCrawler;
import edu.uci.ics.crawler4j.parser.HtmlParseData;
import edu.uci.ics.crawler4j.url.WebURL;
import org.neo4j.graphdb.GraphDatabaseService;
import org.neo4j.graphdb.Node;
import org.neo4j.graphdb.Relationship;
import org.neo4j.graphdb.Transaction;
import org.neo4j.graphdb.index.Index;

import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class Neo4JWebCrawler extends WebCrawler {


    private final GraphDatabaseService graphDb;


    /**
     * Constructor.
     */
    public Neo4JWebCrawler() {
        this.graphDb = CreateDBFactory.createInMemoryDB();
    }


    @Override
    public boolean shouldVisit(final WebURL url) {
        final String href = url.getURL().toLowerCase();
        return !NodeConstants.FILTERS.matcher(href).matches();
    }

    /**
     * This function is called when a page is fetched and ready
     * to be processed by your program.
     */
    @Override
    public void visit(final Page page) {

        final String url = page.getWebURL().getURL();
        System.out.println("URL: " + url);

        final Index<Node> nodeIndex = graphDb.index().forNodes(NodeConstants.PAGE_INDEX);

        if (page.getParseData() instanceof HtmlParseData) {
            HtmlParseData htmlParseData = (HtmlParseData) page.getParseData();
            String text = htmlParseData.getText();
            //String html = htmlParseData.getHtml();
            List<WebURL> links = htmlParseData.getOutgoingUrls();
            Transaction tx = graphDb.beginTx();
            try {

                final Node pageNode = graphDb.createNode();
                pageNode.setProperty(NodeConstants.URL, url);
                nodeIndex.add(pageNode, NodeConstants.URL, url);

                //get all the words
                final List<String> words = cleanAndSplitString(text);
                int index = 0;
                for (final String word : words) {
                    final Node wordNode = graphDb.createNode();
                    wordNode.setProperty(NodeConstants.WORD, word);
                    wordNode.setProperty(NodeConstants.INDEX, index++);
                    final Relationship relationship = pageNode.createRelationshipTo(wordNode, RelationshipTypes.CONTAINS);
                    relationship.setProperty(NodeConstants.SOURCE, url);
                }

                for (final WebURL webURL : links) {
                    System.out.println("Linking to " + webURL);
                    final Node linkNode = graphDb.createNode();
                    linkNode.setProperty(NodeConstants.URL, webURL.getURL());
                    final Relationship relationship = pageNode.createRelationshipTo(linkNode, RelationshipTypes.LINK_TO);
                    relationship.setProperty(NodeConstants.SOURCE, url);
                    relationship.setProperty(NodeConstants.DESTINATION, webURL.getURL());
                }

                tx.success();
            } finally {
                tx.finish();
            }

        }
    }


    private static List<String> cleanAndSplitString(final String input) {
        if (input != null) {
            final String[] dic = input.toLowerCase().replaceAll("\\p{Punct}", "").replaceAll("\\p{Digit}", "").split("\\s+");
            return Arrays.asList(dic);
        }
        return new ArrayList<>();
    }

}
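
The crawler class itself doesn't start anything; crawler4j needs a CrawlController to kick it off. This is a minimal bootstrap sketch using the standard crawler4j 3.5 API; the class name, seed URL, storage folder and crawler count here are all illustrative:

package net.briandupreez.pci.chapter4;

import edu.uci.ics.crawler4j.crawler.CrawlConfig;
import edu.uci.ics.crawler4j.crawler.CrawlController;
import edu.uci.ics.crawler4j.fetcher.PageFetcher;
import edu.uci.ics.crawler4j.robotstxt.RobotstxtConfig;
import edu.uci.ics.crawler4j.robotstxt.RobotstxtServer;

public class CrawlerMain {

    public static void main(final String[] args) throws Exception {
        final CrawlConfig config = new CrawlConfig();
        config.setCrawlStorageFolder("resources/crawl/root"); // crawler4j's own intermediate storage
        config.setMaxDepthOfCrawling(2); // keep the exercise small

        final PageFetcher pageFetcher = new PageFetcher(config);
        final RobotstxtServer robotstxtServer = new RobotstxtServer(new RobotstxtConfig(), pageFetcher);
        final CrawlController controller = new CrawlController(config, pageFetcher, robotstxtServer);

        controller.addSeed("http://www.briandupreez.net/"); // illustrative seed URL
        controller.start(Neo4JWebCrawler.class, 2); // run 2 crawler instances
    }
}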

After the data was collected, I could query it and perform the functions of a search engine. For this I decided to use Java futures, as they were another thing I had only read about and not yet implemented. In my day-to-day working environment we use WebLogic / CommonJ work managers within the application server to perform the same task.

  final ExecutorService executorService = Executors.newFixedThreadPool(4);
  final String[] searchTerms = {"java", "spring"};

  List<Callable<TaskResponse>> tasks = new ArrayList<>();
  tasks.add(new WordFrequencyTask(searchTerms));
  tasks.add(new DocumentLocationTask(searchTerms));
  tasks.add(new PageRankTask(searchTerms));
  tasks.add(new NeuralNetworkTask(searchTerms));

  final List<Future<TaskResponse>> results = executorService.invokeAll(tasks);
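
Gathering the results is then just a matter of blocking on each Future. A quick sketch, assuming the TaskResponse fields used by the tasks below:

  for (final Future<TaskResponse> future : results) {
      final TaskResponse taskResponse = future.get(); // blocks until the task completes
      System.out.println(taskResponse.taskClazz.getSimpleName() + ": " + taskResponse.resultMap);
  }
  executorService.shutdown();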

I then went about creating a task for each of the following: word frequency, document location, PageRank, and a neural network (with fake input / training data) to rank the pages returned based on the search criteria. All the code is in my public GitHub blog repo.

Disclaimer: the neural network task either didn't have enough data to be effective, or I implemented the data normalisation incorrectly, so it is currently not very useful. I'll return to it once I have completed the journey through the whole PCI book.
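
For reference, this is roughly the shape such a task takes in Encog 3: a minimal feed-forward sketch with made-up layer sizes and training values, not the actual NeuralNetworkTask from the repo:

import org.encog.engine.network.activation.ActivationSigmoid;
import org.encog.ml.data.MLDataSet;
import org.encog.ml.data.basic.BasicMLDataSet;
import org.encog.neural.networks.BasicNetwork;
import org.encog.neural.networks.layers.BasicLayer;
import org.encog.neural.networks.training.propagation.resilient.ResilientPropagation;

// 2 inputs, 4 hidden neurons, 1 output: all sizes are illustrative
final BasicNetwork network = new BasicNetwork();
network.addLayer(new BasicLayer(null, true, 2));
network.addLayer(new BasicLayer(new ActivationSigmoid(), true, 4));
network.addLayer(new BasicLayer(new ActivationSigmoid(), false, 1));
network.getStructure().finalizeStructure();
network.reset();

// fake normalised inputs (e.g. word frequency / location scores) and ideal ranks
final double[][] input = {{0.1, 0.9}, {0.8, 0.2}};
final double[][] ideal = {{0.9}, {0.3}};
final MLDataSet trainingSet = new BasicMLDataSet(input, ideal);

final ResilientPropagation train = new ResilientPropagation(network, trainingSet);
do {
    train.iteration();
} while (train.getError() > 0.01);
train.finishTraining();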

The one task worth sharing was the PageRank one. I quickly read some of the theory behind it, decided I am not that clever, and went searching for a library that had it implemented. I discovered GraphStream, a wonderful open-source project that does a WHOLE lot more than just PageRank; check out their video.

From there it was simple to implement the PageRank task for this exercise.

package net.briandupreez.pci.chapter4.tasks;

import net.briandupreez.pci.chapter4.NodeConstants;
import net.briandupreez.pci.chapter4.NormalizationFunctions;
import org.graphstream.algorithm.PageRank;
import org.graphstream.graph.Graph;
import org.graphstream.graph.implementations.SingleGraph;
import org.neo4j.cypher.javacompat.ExecutionEngine;
import org.neo4j.cypher.javacompat.ExecutionResult;
import org.neo4j.graphdb.Node;
import org.neo4j.graphdb.Relationship;

import java.util.HashMap;
import java.util.Iterator;
import java.util.Map;
import java.util.concurrent.Callable;

public class PageRankTask extends SearchTask implements Callable<TaskResponse> {

    public PageRankTask(final String... terms) {
        super(terms);
    }

    @Override
    protected ExecutionResult executeQuery(final String... words) {
        final ExecutionEngine engine = new ExecutionEngine(graphDb);
        final StringBuilder bob = new StringBuilder("START page=node(*) MATCH (page)-[:CONTAINS]->words ");
        bob.append(", (page)-[:LINK_TO]->related ");
        bob.append("WHERE words.word in [");
        bob.append(formatArray(words));
        bob.append("] ");
        bob.append("RETURN DISTINCT page, related");

        return engine.execute(bob.toString());
    }

    public TaskResponse call() {
        final ExecutionResult result = executeQuery(searchTerms);
        final Map<String, Double> returnMap = convertToUrlTotalWords(result);

        final TaskResponse response = new TaskResponse();
        response.taskClazz = this.getClass();
        response.resultMap = NormalizationFunctions.normalizeMap(returnMap, true);
        return response;
    }

    private Map<String, Double> convertToUrlTotalWords(final ExecutionResult result) {
        final Map<String, Double> uniqueUrls = new HashMap<>();

        final Graph g = new SingleGraph("rank", false, true); // strict checking off, auto-create nodes from edge endpoints
        final Iterator<Node> pageIterator = result.columnAs("related");

        while (pageIterator.hasNext()) {
            final Node node = pageIterator.next();
            final Iterator<Relationship> relationshipIterator = node.getRelationships().iterator();
            while (relationshipIterator.hasNext()) {

                final Relationship relationship = relationshipIterator.next();
                final String source = relationship.getProperty(NodeConstants.SOURCE).toString();
                uniqueUrls.put(source, 0.0);
                final String destination = relationship.getProperty(NodeConstants.DESTINATION).toString();
                g.addEdge(String.valueOf(relationship.getId()), source, destination, true); // use the relationship id so every link gets its own directed edge

            }
        }


        computeAndSetPageRankScores(uniqueUrls, g);
        return uniqueUrls;
    }

    /**
     * Compute score
     *
     * @param uniqueUrls urls
     * @param graph      the graph of all links
     */
    private void computeAndSetPageRankScores(final Map<String, Double> uniqueUrls, final Graph graph) {
        final PageRank pr = new PageRank();
        pr.init(graph);
        pr.compute();

        for (final Map.Entry<String, Double> entry : uniqueUrls.entrySet()) {
            final double score = 100 * pr.getRank(graph.getNode(entry.getKey()));
            entry.setValue(score);
        }
    }


}
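
For reference, assuming formatArray simply quotes and comma-separates the terms, the Cypher that executeQuery builds up for the earlier search terms would come out as:

START page=node(*)
MATCH (page)-[:CONTAINS]->words , (page)-[:LINK_TO]->related
WHERE words.word in ['java', 'spring']
RETURN DISTINCT page, related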

In between all of this, I found a great implementation of sorting a map by its values on Stack Overflow.

package net.briandupreez.pci.chapter4;

import java.util.*;


public class MapUtil {

    /**
     * Sort a map based on values.
     * The values must be Comparable.
     *
     * @param map       the map to be sorted
     * @param ascending in ascending order, or descending if false
     * @param <K>       key generic
     * @param <V>       value generic
     * @return sorted list
     */
    public static <K, V extends Comparable<? super V>> List<Map.Entry<K, V>> entriesSortedByValues(final Map<K, V> map, final boolean ascending) {

        final List<Map.Entry<K, V>> sortedEntries = new ArrayList<>(map.entrySet());
        Collections.sort(sortedEntries,
                new Comparator<Map.Entry<K, V>>() {
                    @Override
                    public int compare(final Map.Entry<K, V> e1, final Map.Entry<K, V> e2) {
                        if (ascending) {
                            return e1.getValue().compareTo(e2.getValue());
                        } else {
                            return e2.getValue().compareTo(e1.getValue());
                        }
                    }
                }
        );

        return sortedEntries;
    }

}
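
A quick illustrative usage, sorting a map of (hypothetical) normalised page scores in descending order:

final Map<String, Double> scores = new HashMap<>();
scores.put("http://www.briandupreez.net/", 0.9); // hypothetical scores
scores.put("http://www.example.com/", 0.2);

// ascending = false, so the best-ranked page comes first
final List<Map.Entry<String, Double>> ranked = MapUtil.entriesSortedByValues(scores, false);
System.out.println(ranked.get(0).getKey()); // prints http://www.briandupreez.net/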

The Maven dependencies used to implement all of this:

<dependency>
    <groupId>com.google.guava</groupId>
    <artifactId>guava</artifactId>
    <version>14.0.1</version>
</dependency>
<dependency>
    <groupId>org.encog</groupId>
    <artifactId>encog-core</artifactId>
    <version>3.2.0-SNAPSHOT</version>
</dependency>
<dependency>
    <groupId>edu.uci.ics</groupId>
    <artifactId>crawler4j</artifactId>
    <version>3.5</version>
    <type>jar</type>
    <scope>compile</scope>
</dependency>
<dependency>
    <groupId>org.neo4j</groupId>
    <artifactId>neo4j</artifactId>
    <version>1.9</version>
</dependency>
<dependency>
    <groupId>org.graphstream</groupId>
    <artifactId>gs-algo</artifactId>
    <version>1.1.2</version>
</dependency>

Now on to chapter 5 of PCI… Optimisation.
 
