Home » Java » Enterprise Java » Blog Categorisation using Encog, ROME, JSoup and Google Guava

About Brian Du Preez

Blog Categorisation using Encog, ROME, JSoup and Google Guava

Continuing with Programming Collection Intelligence  (PCI) the next exercise was using the distance scores to pigeonhole a list of blogs based on the words used within the relevant blog.

I had already found Encog as the framework for the AI / Machine learning algorithms, for this exercise I needed an RSS reader and a HTML parser.

The 2 libraries I ended up using were:

  1. ROME
  2. JSoup

For general other utilities and collection manipulations I used:

I kept the list of blogs short, included some of the software bloggers I follow, just to make testing quick, had to alter the %’s a little from the implementation in (PCI), but still got the desired result.

Blogs Used:

  • http://blog.guykawasaki.com/index.rdf
  • http://blog.outer-court.com/rss.xml
  • http://flagrantdisregard.com/index.php/feed/
  • http://gizmodo.com/index.xml
  • http://googleblog.blogspot.com/rss.xml
  • http://radar.oreilly.com/index.rdf
  • http://www.wired.com/rss/index.xml
  • http://feeds.feedburner.com/codinghorror
  • http://feeds.feedburner.com/joelonsoftware
  • http://martinfowler.com/feed.atom
  • http://www.briandupreez.net/feeds/posts/default

For the implementation I just went with a main class and a reader class:

package net.briandupreez.pci.data;

import com.google.common.base.Predicates;
import com.google.common.collect.Collections2;
import com.sun.syndication.feed.synd.SyndCategoryImpl;
import com.sun.syndication.feed.synd.SyndContent;
import com.sun.syndication.feed.synd.SyndEntryImpl;
import com.sun.syndication.feed.synd.SyndFeed;
import com.sun.syndication.io.SyndFeedInput;
import com.sun.syndication.io.XmlReader;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

import java.net.URL;
import java.util.*;

public class FeedReader {


    @SuppressWarnings("unchecked")
    public static Set<String> determineAllUniqueWords(final String url, final Set<String> blogWordList) {

        boolean ok = false;
        try {
            URL feedUrl = new URL(url);

            SyndFeedInput input = new SyndFeedInput();
            SyndFeed feed = input.build(new XmlReader(feedUrl));

            final List<SyndEntryImpl> entries = feed.getEntries();
            for (final SyndEntryImpl entry : entries) {
                blogWordList.addAll(cleanAndSplitString(entry.getTitle()));
                blogWordList.addAll(doCategories(entry));
                blogWordList.addAll(doDescription(entry));
                blogWordList.addAll(doContent(entry));
            }

            ok = true;
        } catch (Exception ex) {
            ex.printStackTrace();
            System.out.println("ERROR: " + url + "\n" + ex.getMessage());
        }

        if (!ok) {
            System.out.println("FeedReader reads and prints any RSS/Atom feed type.");
            System.out.println("The first parameter must be the URL of the feed to read.");
        }

        return blogWordList;

    }

    @SuppressWarnings("unchecked")
    private static List<String> doContent(final SyndEntryImpl entry) {
        List<String> blogWordList = new ArrayList<>();
        final List<SyndContent> contents = entry.getContents();
        if (contents != null) {
            for (final SyndContent syndContent : contents) {
                if ("text/html".equals(syndContent.getMode())) {
                    blogWordList.addAll(stripHtmlAndAddText(syndContent));
                } else {
                    blogWordList.addAll(cleanAndSplitString(syndContent.getValue()));
                }
            }
        }
        return blogWordList;
    }

    private static List<String> doDescription(final SyndEntryImpl entry) {
        final List<String> blogWordList = new ArrayList<>();
        final SyndContent description = entry.getDescription();
        if (description != null) {
            if ("text/html".equals(description.getType())) {
                blogWordList.addAll(stripHtmlAndAddText(description));
            } else {
                blogWordList.addAll(cleanAndSplitString(description.getValue()));
            }
        }
        return blogWordList;
    }

    @SuppressWarnings("unchecked")
    private static List<String> doCategories(final SyndEntryImpl entry) {
        final List<String> blogWordList = new ArrayList<>();
        final List<SyndCategoryImpl> categories = entry.getCategories();
        for (final SyndCategoryImpl category : categories) {
            blogWordList.add(category.getName().toLowerCase());
        }
        return blogWordList;
    }


    private static List<String> stripHtmlAndAddText(final SyndContent description) {
        String html = description.getValue();
        Document document = Jsoup.parse(html);
        Elements elements = document.getAllElements();
        final List<String> allWords = new ArrayList<>();
        for (final Element element : elements) {
            allWords.addAll(cleanAndSplitString(element.text()));
        }
        return allWords;
    }

    private static List<String> cleanAndSplitString(final String input) {
        if (input != null) {
            final String[] dic = input.toLowerCase().replaceAll("\\p{Punct}", "").replaceAll("\\p{Digit}", "").split("\\s+");
            return Arrays.asList(dic);
        }
        return new ArrayList<>();

    }

    @SuppressWarnings("unchecked")
    public static Map<String, Double> countWords(final String url, final Set<String> blogWords) {

        final Map<String, Double> resultMap = new TreeMap<>();
        try {
            URL feedUrl = new URL(url);
            SyndFeedInput input = new SyndFeedInput();
            SyndFeed feed = input.build(new XmlReader(feedUrl));

            final List<SyndEntryImpl> entries = feed.getEntries();
            final List<String> allBlogWords = new ArrayList<>();
            for (final SyndEntryImpl entry : entries) {
                allBlogWords.addAll(cleanAndSplitString(entry.getTitle()));
                allBlogWords.addAll(doCategories(entry));
                allBlogWords.addAll(doDescription(entry));
                allBlogWords.addAll(doContent(entry));
            }
            for (final String word : blogWords) {
                resultMap.put(word, (double) Collections2.filter(allBlogWords, Predicates.equalTo(word)).size());
            }

        } catch (Exception ex) {
            ex.printStackTrace();
            System.out.println("ERROR: " + url + "\n" + ex.getMessage());
        }


        return resultMap;
    }
}

Main:

package net.briandupreez.pci.data;

import com.google.common.base.Predicates;
import com.google.common.collect.Maps;
import com.google.common.io.Resources;
import com.google.common.primitives.Doubles;
import org.encog.ml.MLCluster;
import org.encog.ml.data.MLDataPair;
import org.encog.ml.data.MLDataSet;
import org.encog.ml.data.basic.BasicMLData;
import org.encog.ml.data.basic.BasicMLDataPair;
import org.encog.ml.data.basic.BasicMLDataSet;
import org.encog.ml.kmeans.KMeansClustering;

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.*;


public class FeedReaderMain {

    public static void main(String[] args) {
        final FeedReaderMain feedReaderMain = new FeedReaderMain();
        try {
            feedReaderMain.run();
        } catch (IOException e) {
            e.printStackTrace();
        }
    }

    public void run() throws IOException {
        final String file = Resources.getResource("short-feedlist.txt").getFile();
        final Set<String> blogWords = determineWordCompleteList(file);
        final Map<String, Map<String, Double>> blogWordCount = countWordsPerBlog(file, blogWords);
        //strip out the outlying words
        stripOutlyingWords(blogWords, blogWordCount);
        performCusteringAndDisplay(blogWordCount);
    }

    private void performCusteringAndDisplay(final Map<String, Map<String, Double>> blogWordCount) {
        final BasicMLDataSet set = new BasicMLDataSet();

        final Map<String, List<Double>> inputMap = new HashMap<>();
        for (final Map.Entry<String, Map<String, Double>> entry : blogWordCount.entrySet()) {
            final Map<String, Double> mainValues = entry.getValue();
            final double[] elements = Doubles.toArray(mainValues.values());

            List<Double> listInput = Doubles.asList(elements);
            inputMap.put(entry.getKey(), listInput);
            set.add(new BasicMLData(elements));
        }

        final KMeansClustering kmeans = new KMeansClustering(3, set);
        kmeans.iteration(150);

        // Display the cluster
        int i = 1;
        for (final MLCluster cluster : kmeans.getClusters()) {
            System.out.println("*** Cluster " + (i++) + " ***");
            final MLDataSet ds = cluster.createDataSet();
            final MLDataPair pair = BasicMLDataPair.createPair(
                    ds.getInputSize(), ds.getIdealSize());
            for (int j = 0; j < ds.getRecordCount(); j++) {
                ds.getRecord(j, pair);
                List<Double> listInput = Doubles.asList(pair.getInputArray());
                System.out.println(Maps.filterValues(inputMap, Predicates.equalTo(listInput)).keySet().toString());
            }
        }
    }

    private Map<String, Map<String, Double>> countWordsPerBlog(String file, Set<String> blogWords) throws IOException {
        BufferedReader reader;
        String line;

        reader = new BufferedReader(new FileReader(file));
        final Map<String, Map<String, Double>> blogWordCount = new HashMap<>();
        while ((line = reader.readLine()) != null) {
            final Map<String, Double> wordCounts = FeedReader.countWords(line, blogWords);
            blogWordCount.put(line, wordCounts);
        }
        return blogWordCount;
    }

    private Set<String> determineWordCompleteList(final String file) throws IOException {
        FileReader fileReader = new FileReader(file);
        BufferedReader reader = new BufferedReader(fileReader);
        String line;
        Set<String> blogWords = new HashSet<>();
        while ((line = reader.readLine()) != null) {
            blogWords = FeedReader.determineAllUniqueWords(line, blogWords);
            System.out.println("Size: " + blogWords.size());
        }
        return blogWords;
    }

    private void stripOutlyingWords(final Set<String> blogWords, final Map<String, Map<String, Double>> blogWordCount) {
        final Iterator<String> wordIter = blogWords.iterator();
        final double listSize = blogWords.size();
        while (wordIter.hasNext()) {
            final String word = wordIter.next();
            double wordCount = 0;
            for (final Map<String, Double> values : blogWordCount.values()) {
                wordCount += values.get(word) != null ? values.get(word) : 0;
            }
            double percentage = (wordCount / listSize) * 100;
            if (percentage < 0.1 || percentage > 20 || word.length() < 3) {
                wordIter.remove();
                for (final Map<String, Double> values : blogWordCount.values()) {
                    values.remove(word);
                }
            } else {
                System.out.println("\t keeping: " + word + " Percentage:" + percentage);
            }
        }
    }
}

The Results:

*** Cluster 1 ***

[http://www.briandupreez.net/feeds/posts/default]

*** Cluster 2 ***

[http://blog.guykawasaki.com/index.rdf]

[http://radar.oreilly.com/index.rdf]

[http://googleblog.blogspot.com/rss.xml]

[http://blog.outer-court.com/rss.xml]

[http://gizmodo.com/index.xml]

[http://flagrantdisregard.com/index.php/feed/]

[http://www.wired.com/rss/index.xml]

*** Cluster 3 ***

[http://feeds.feedburner.com/joelonsoftware]

[http://feeds.feedburner.com/codinghorror]

[http://martinfowler.com/feed.atom]

 

Do you want to know how to develop your skillset to become a Java Rockstar?

Subscribe to our newsletter to start Rocking right now!

To get you started we give you our best selling eBooks for FREE!

1. JPA Mini Book

2. JVM Troubleshooting Guide

3. JUnit Tutorial for Unit Testing

4. Java Annotations Tutorial

5. Java Interview Questions

6. Spring Interview Questions

7. Android UI Design

and many more ....

 

Leave a Reply

Your email address will not be published. Required fields are marked *

*


9 × = eighteen

You may use these HTML tags and attributes: <a href="" title=""> <abbr title=""> <acronym title=""> <b> <blockquote cite=""> <cite> <code> <del datetime=""> <em> <i> <q cite=""> <s> <strike> <strong>

Do you want to know how to develop your skillset and become a ...

Subscribe to our newsletter to start Rocking right now!

To get you started we give you our best selling eBooks for FREE!
Get ready to Rock!
To download the books, please verify your email address by following the instructions found on the email we just sent you.

THANK YOU!

Close