## Apache OpenOffice just graduated from the Incubator

Apache OpenOffice has just made it out of the Incubator and is now an official Apache Software Foundation project. “What?”, some people might ask now, “wasn’t it official a year or so ago?” No, it wasn’t! When Oracle decided to donate OpenOffice.org to the Apache Software Foundation, it first entered the so-called Incubator. That was back in June 2011. As an incubating project it was not yet official. Actually, it was hard work to make it an official ASF project. Let me explain what happened.

### What happens when a project incubates?

When a project wants to join the Apache Software Foundation, there are many open questions. Who wrote the code? Does the project really own all of its intellectual property? Which license does the code use? Is there a working community?

Usually a couple of long-term Apache activists join the project as mentors. In the case of OpenOffice, a number of well-known and respected community members were involved: Jim Jagielski (ASF President), Sam Ruby (who has so many roles at the ASF that it is said “Sam Ruby” refers not to a person but to a whole team), Ross Gardler (also on the ASF board), Shane Curcuru (ASF trademark expert), Joe Schaefer (one of the ASF infra gurus), Danese Cooper (better read her Wikipedia entry) and Noirin Plunkett, who is also an officer of the ASF. Oh, and me. Me, the only one without a Wikipedia entry. You can imagine how excited I was to see so many experienced people joining as mentors. Of course you can learn a lot from them, and that is what I did. As a mentor you not only have the chance to look at the gory details of an incubation, you have the duty to do so.

Finally, only when the project is “running” like an Apache project (often referred to as the Apache Way, which describes core values like “being open”) will it graduate from the Incubator and become an official top-level project.
You can then be assured that the licensing problems are gone and the project has clean IP.

### OpenOffice and some of its issues

The mentors look at all these questions and advise the project on how to solve them. Mentors usually say things like: “You cannot use dependency $a, because it uses license $x. These are not compatible.” They say it because the Apache Software Foundation only releases code licensed under the Apache License. Oracle’s OpenOffice.org had a lot of dependencies, and some were GPL’ed. The GPL follows a different philosophy, and unfortunately the two licenses are not fully compatible. One of the first hurdles was to make sure that everything published by the OpenOffice project is compatible with the Apache License. If you have ever coded on a huge project, you know how painful it can be to look at every single dependency you might use.

Mentors also look at the community. In the case of OpenOffice, there was a totally different style of “project management”. It was, more or less, leadership-based. But at the ASF there are no “real” leaders, or at least there is no leader role. There are people who do stuff, and when they do stuff, they somehow lead it. Finally the project agrees or disagrees by voting. We call that do-ocracy (or so). But there is never one single person who can decide what will happen and when. The Apache style is not for everybody. But I am glad to say that many, many people in this project changed their way of working without much pain.

The community of OpenOffice is huge. It was overwhelmingly huge. Some parts of OpenOffice required special thought, like the official OpenOffice forums. These forums once ran more or less independently. But now the forums were about to become part of the project. In other words: the people who moderated and administrated the forums needed to become Apache committers, even if they would never write a single line of code.
It is often misunderstood that you need to write code to join a project as a committer. This is not true. Apache projects are usually glad about every contribution and will respect you for it. If you write docs, you can join. If you are active as a supporter on the mailing lists, you can also join.

We had to do a lot of work to integrate the forum people into the OpenOffice community, and this community into the Apache community. There were language barriers and concerns. I mean: some folks just wanted to post in the forums as always. Why did they need to sign a CLA? Well, because we care about the IP. Because we want them to join our community, fully. Besides, we had never had forums at the ASF before. How would we operate them? But there were some great volunteers who succeeded with this job.

This is how it is with Apache: we are one community. “Community over code”, it is often said. With this incubation we had to bring a fully fledged community into ours. We needed to mentor without being arrogant. I hope it worked out that way (I doubt everybody will agree). But it was difficult. The folks of OpenOffice needed to bend more than we did. We more or less changed some infrastructure things, like running the forums on our servers. But the OpenOffice community needed to change the way they operate. Therefore I can only give all the people involved my deepest respect. When two communities grow together and one community cannot move as far as the other, there are often misunderstandings and, of course, hurt feelings. But in just this little time (since June 2011!) it worked out.

Here is a great quote from the official announcement: “The OpenOffice graduation is the official recognition that the project is now able to self-manage not only in technical matters, but also in community issues,” said Andrea Pescetti, Vice President of Apache OpenOffice.
“The ‘Apache Way’ and its methods, such as taking every decision in public with total transparency, have allowed the project to attract and successfully engage new volunteers, and to elect an active and diverse Project Management Committee that will be able to guarantee a stable future to Apache OpenOffice.”

Yup, that’s it.

### The first release

It was not only impressive to see the community grow. No, one of the most impressive things I have ever seen was that the OpenOffice people, surrounded by naysayers and other destructive elements, simply did what they liked. They made a new release. With a completely new infrastructure. With brand-new requirements. With mentors at their backs. And with a growing and successful LibreOffice community on the other side. But they kept on going, and finally they made it. A project of this size and with these restrictions: I can only say, “Wow guys, that was incredible.” Check out their releases here: openoffice.apache.org. 20 million other people have done so since the first release came out in May 2012!

### And what next?

Incubation is over. My role in this project is done. OpenOffice is now self-governing, and they totally deserve it. Now they can say they are an official project, and users can use software which is guaranteed to be under the permissive Apache License 2.0. This makes it possible to use it in your own products. There are some post-graduation tasks to be done, but these are really just small steps. Graduation is important from a psychological point of view. From a technical point of view: some redirects, and then on to the next release.

However, I was glad to get such great insight, even though it took a huge amount of my energy. Somehow I am glad to unsubscribe, but somehow I will miss this exciting project. In any case, thanks guys for allowing me to learn so much. And I wish you all the best for the future.
I now think it is a bright one.

### At our conference

Did you know there are a couple of great OpenOffice talks at ApacheCon EU?

## Setting up and playing with Apache Solr on Tomcat

A while back I had a little time to play with Solr, and was instantly blown away by the performance we could achieve on some of our bigger datasets. Here are some of my initial setup and configuration learnings, to maybe help someone get it up and running a little faster, starting with setting both up on Windows.

Download and extract Apache Tomcat and Solr and copy them into your working folders.

### Tomcat Setup

If you want Tomcat as a service, install it using the following:

    bin\service.bat install

Edit the Tomcat users in conf/tomcat-users.xml:

    <role rolename="admin"/>
    <role rolename="manager-gui"/>
    <user username="tomcat" password="tomcat" roles="admin,manager-gui"/>

If you are going to query Solr using international characters (>127) via HTTP GET, you must configure Tomcat to conform to the URI standard by accepting percent-encoded UTF-8. Add URIEncoding="UTF-8" to the connector in conf/server.xml:

    <Connector port="8080" protocol="HTTP/1.1"
               connectionTimeout="20000"
               redirectPort="8443"
               URIEncoding="UTF-8" />

Copy the contents of example\solr to your Solr home directory, D:\Java\apache-solr-3.6.0\home.

Create a context fragment at $CATALINA_HOME/conf/Catalina/localhost/solr.xml pointing to your Solr home:

    <?xml version="1.0" encoding="UTF-8"?>
    <Context docBase="D:\Java\apache-tomcat-7.0.27\webapps\solr.war" debug="0" crossContext="true">
        <Environment name="solr/home" type="java.lang.String"
                     value="D:\Java\apache-solr-3.6.0\home" override="true" />
    </Context>

Start up Tomcat, log in, and deploy the solr.war.
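To make the percent-encoding requirement concrete, here is a minimal sketch of how a client could encode a query containing a non-ASCII character before sending it over HTTP GET. The class name, the `selectUrl` helper and the `/select` handler path are assumptions for illustration, not part of the setup above.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

public class UriEncodingDemo {

    /** Percent-encodes one query-parameter value as UTF-8 (form style: a space becomes '+'). */
    public static String encodeParam(String value) {
        try {
            return URLEncoder.encode(value, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // Every JVM is required to support UTF-8, so this should never happen.
            throw new IllegalStateException("UTF-8 is always supported", e);
        }
    }

    /** Builds a query URL; the base URL and the /select path are assumptions. */
    public static String selectUrl(String baseUrl, String query) {
        return baseUrl + "/select?q=" + encodeParam(query);
    }

    public static void main(String[] args) {
        // "caf\u00e9" is "café"; the escape keeps this source file ASCII-only.
        System.out.println(selectUrl("http://localhost:8080/solr", "name:caf\u00e9"));
        // prints http://localhost:8080/solr/select?q=name%3Acaf%C3%A9
    }
}
```

The `é` becomes the two bytes `%C3%A9`, which Tomcat can only decode back to `é` when the connector is told to read the URI as UTF-8.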
### Solr Setup

Solr should now be available at http://localhost:8080/solr/admin/

To create a quick test using SolrJ that creates and reads data, grab the following Maven libs:

    <dependency>
        <groupId>org.apache.solr</groupId>
        <artifactId>apache-solr-solrj</artifactId>
        <version>3.6.0</version>
        <type>jar</type>
        <scope>compile</scope>
    </dependency>
    <dependency>
        <groupId>org.apache.httpcomponents</groupId>
        <artifactId>httpclient</artifactId>
        <version>4.1</version>
        <scope>compile</scope>
    </dependency>
    <dependency>
        <groupId>org.apache.httpcomponents</groupId>
        <artifactId>httpcore</artifactId>
        <version>4.1</version>
        <scope>compile</scope>
    </dependency>
    <dependency>
        <groupId>org.apache.james</groupId>
        <artifactId>apache-mime4j</artifactId>
        <version>0.6.1</version>
        <scope>compile</scope>
    </dependency>
    <dependency>
        <groupId>org.apache.httpcomponents</groupId>
        <artifactId>httpmime</artifactId>
        <version>4.1</version>
        <scope>compile</scope>
    </dependency>
    <dependency>
        <groupId>org.slf4j</groupId>
        <artifactId>slf4j-api</artifactId>
        <version>1.6.1</version>
        <scope>compile</scope>
    </dependency>
    <dependency>
        <groupId>commons-logging</groupId>
        <artifactId>commons-logging</artifactId>
        <version>1.1.1</version>
        <scope>compile</scope>
    </dependency>
    <dependency>
        <groupId>junit</groupId>
        <artifactId>junit</artifactId>
        <version>4.9</version>
        <scope>test</scope>
    </dependency>

JUnit test:

    package za.co.discovery.ecs.solr.test;

    import java.io.IOException;
    import java.net.MalformedURLException;
    import java.util.ArrayList;
    import java.util.Collection;

    import org.apache.solr.client.solrj.SolrQuery;
    import org.apache.solr.client.solrj.SolrServer;
    import org.apache.solr.client.solrj.SolrServerException;
    import org.apache.solr.client.solrj.impl.HttpSolrServer;
    import org.apache.solr.client.solrj.response.QueryResponse;
    import org.apache.solr.common.SolrDocument;
    import org.apache.solr.common.SolrDocumentList;
    import org.apache.solr.common.SolrInputDocument;
    import org.junit.Assert;
    import org.junit.Before;
    import org.junit.Test;
    import org.junit.runner.RunWith;
    import org.junit.runners.JUnit4;

    @RunWith(JUnit4.class)
    public class TestSolr {

        private SolrServer server;

        /**
         * setup.
         */
        @Before
        public void setup() {
            server = new HttpSolrServer("http://localhost:8080/solr/");
            try {
                // Start from an empty index; the delete is committed by the test's commit().
                server.deleteByQuery("*:*");
            } catch (SolrServerException e) {
                e.printStackTrace();
            } catch (IOException e) {
                e.printStackTrace();
            }
        }

        /**
         * Test Adding.
         *
         * @throws MalformedURLException error
         */
        @Test
        public void testAdding() throws MalformedURLException {
            try {
                // Build two documents; the third argument of addField is the field boost.
                final SolrInputDocument doc1 = new SolrInputDocument();
                doc1.addField("id", "id1", 1.0f);
                doc1.addField("name", "doc1", 1.0f);
                doc1.addField("price", 10);

                final SolrInputDocument doc2 = new SolrInputDocument();
                doc2.addField("id", "id2", 1.0f);
                doc2.addField("name", "doc2", 1.0f);
                doc2.addField("price", 20);

                final Collection<SolrInputDocument> docs = new ArrayList<SolrInputDocument>();
                docs.add(doc1);
                docs.add(doc2);

                // Index the documents and make them visible to searches.
                server.add(docs);
                server.commit();

                // Read everything back, sorted by price ascending.
                final SolrQuery query = new SolrQuery();
                query.setQuery("*:*");
                query.addSortField("price", SolrQuery.ORDER.asc);
                final QueryResponse rsp = server.query(query);

                final SolrDocumentList solrDocumentList = rsp.getResults();
                for (final SolrDocument doc : solrDocumentList) {
                    final String name = (String) doc.getFieldValue("name");
                    final String id = (String) doc.getFieldValue("id"); // id is the uniqueKey field
                    System.out.println("Name:" + name + " id:" + id);
                }
            } catch (SolrServerException e) {
                e.printStackTrace();
                Assert.fail(e.getMessage());
            } catch (IOException e) {
                e.printStackTrace();
                Assert.fail(e.getMessage());
            }
        }
    }

### Adding data directly from the DB

Firstly you need to add the relevant DB libs to the classpath.
Then create data-config.xml as below. If you require custom fields, those can be specified under the fields tag in schema.xml, shown below the data-config.xml:

    <dataConfig>
        <dataSource name="jdbc" driver="oracle.jdbc.driver.OracleDriver"
                    url="jdbc:oracle:thin:@localhost:1525:DB" user="user" password="pass"/>
        <document name="products">
            <entity name="item" query="select * from demo">
                <field column="ID" name="id" />
                <field column="DEMO" name="demo" />
                <entity name="feature" query="select description from feature where item_id='${item.ID}'">
                    <field name="features" column="description" />
                </entity>
                <entity name="item_category" query="select CATEGORY_ID from item_category where item_id='${item.ID}'">
                    <entity name="category" query="select description from category where id = '${item_category.CATEGORY_ID}'">
                        <field column="description" name="cat" />
                    </entity>
                </entity>
            </entity>
        </document>
    </dataConfig>

A custom field in the schema.xml:

    <fields>
        <field name="DEMO" type="string" indexed="true" stored="true" required="true" />
    </fields>

Make sure solrconfig.xml points to the data-config.xml; the handler has to be registered in solrconfig.xml as follows:

    <requestHandler name="/dataimport" class="org.apache.solr.handler.dataimport.DataImportHandler">
        <lst name="defaults">
            <str name="config">data-config.xml</str>
        </lst>
    </requestHandler>

Once that is all set up, a full import can be done with the following:

http://localhost:8080/solr/dataimport?command=full-import

Then you should be good to go with some lightning-fast data retrieval.
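Besides `full-import`, the DataImportHandler also accepts a `status` command, and `full-import` takes `clean` and `commit` flags. A small sketch of building those URLs follows; the class and method names are hypothetical, and it assumes the handler is reachable at `/dataimport` under the `http://localhost:8080/solr` context.

```java
public class DataImportUrls {

    /** Builds the URL for a DataImportHandler command such as "full-import" or "status". */
    public static String commandUrl(String solrBase, String command) {
        return solrBase + "/dataimport?command=" + command;
    }

    /** Builds a full-import URL with explicit clean and commit flags. */
    public static String fullImportUrl(String solrBase, boolean clean, boolean commit) {
        return commandUrl(solrBase, "full-import") + "&clean=" + clean + "&commit=" + commit;
    }

    public static void main(String[] args) {
        String base = "http://localhost:8080/solr"; // assumed Solr context URL

        // Wipe the index, import everything, then commit.
        System.out.println(fullImportUrl(base, true, true));

        // Ask the handler how the import is progressing.
        System.out.println(commandUrl(base, "status"));
    }
}
```

Opening the `status` URL in a browser while an import runs is a quick way to see whether it is still busy or has failed.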

## Distributed Apache Flume Setup With an HDFS Sink

I have recently spent a few days getting up to speed with Flume, Cloudera’s distributed log offering. If you haven’t seen this and deal with lots of logs, you are definitely missing out on a fantastic project. I’m not going to spend time talking about it, because you can read more about it in the users guide or in the Quora Flume Topic, in ways that are better than I can describe it. What I will tell you about is my experience setting up Flume in a distributed environment to sync logs to an HDFS sink.

### Context

I have 3 kinds of servers, all running Ubuntu 10.04 locally:

- hadoop-agent-1: This is the agent which is producing all the logs.
- hadoop-collector-1: This is the collector which is aggregating all the logs (from hadoop-agent-1, agent-2, agent-3, etc).
- hadoop-master-1: This is the Flume master node which is sending out all the commands.

To add the CDH3 repository, create a new file /etc/apt/sources.list.d/cloudera.list with the following contents:

    deb http://archive.cloudera.com/debian <RELEASE>-cdh3 contrib
    deb-src http://archive.cloudera.com/debian <RELEASE>-cdh3 contrib

where `<RELEASE>` is the name of your distribution, which you can find by running `lsb_release -c`. For example, to install CDH3 for Ubuntu Lucid, use lucid-cdh3 in the lines above. (To install a different version of CDH on a Debian system, specify the version number you want in the -cdh3 section of the deb line. For example, to install CDH3 Update 0 for Ubuntu Maverick, use maverick-cdh3u0 in the lines above.)

(Optionally) add a repository key. Add the Cloudera Public GPG Key to your repository by executing the following command:

    $ curl -s http://archive.cloudera.com/debian/archive.key | sudo apt-key add -

This key enables you to verify that you are downloading genuine packages.

### Initial Setup

On both hadoop-agent-1 and hadoop-collector-1, you’ll have to install flume-node (flume-node contains the files necessary to run the agent or the collector):

    sudo apt-get update
    sudo apt-get install flume-node

On hadoop-master-1:

    sudo apt-get update
    sudo apt-get install flume-master

First let’s jump onto the agent and set that up. Tune the hadoop-master-1 and hadoop-collector-1 values appropriately, and change your /etc/flume/conf/flume-site.xml to look like:

    <configuration>
      <property>
        <name>flume.master.servers</name>
        <value>hadoop-master-1</value>
        <description>This is the address for the config servers status server (http)</description>
      </property>
      <property>
        <name>flume.collector.event.host</name>
        <value>hadoop-collector-1</value>
        <description>This is the host name of the default 'remote' collector.</description>
      </property>
      <property>
        <name>flume.collector.port</name>
        <value>35853</value>
        <description>This default tcp port that the collector listens to in order to receive events it is collecting.</description>
      </property>
      <property>
        <name>flume.agent.logdir</name>
        <value>/tmp/flume-${user.name}/agent</value>
        <description>This is the directory that write-ahead logging data or disk-failover data collected from applications gets written to. The agent watches this directory.</description>
      </property>
    </configuration>

Now on to the collector. Same file, different config:

    <configuration>
      <property>
        <name>flume.master.servers</name>
        <value>hadoop-master-1</value>
        <description>This is the address for the config servers status server (http)</description>
      </property>
      <property>
        <name>flume.collector.event.host</name>
        <value>hadoop-collector-1</value>
        <description>This is the host name of the default 'remote' collector.</description>
      </property>
      <property>
        <name>flume.collector.port</name>
        <value>35853</value>
        <description>This default tcp port that the collector listens to in order to receive events it is collecting.</description>
      </property>
      <property>
        <name>fs.default.name</name>
        <value>hdfs://hadoop-master-1:8020</value>
      </property>
      <property>
        <name>flume.agent.logdir</name>
        <value>/tmp/flume-${user.name}/agent</value>
        <description>This is the directory that write-ahead logging data or disk-failover data collected from applications gets written to. The agent watches this directory.</description>
      </property>
      <property>
        <name>flume.collector.dfs.dir</name>
        <value>file:///tmp/flume-${user.name}/collected</value>
        <description>This is a dfs directory that is the final resting place for logs to be stored in. This defaults to a local dir in /tmp but can be a hadoop URI path such as hdfs://namenode/path/</description>
      </property>
      <property>
        <name>flume.collector.dfs.compress.gzip</name>
        <value>true</value>
        <description>Writes compressed output in gzip format to dfs. value is boolean type, i.e. true/false</description>
      </property>
      <property>
        <name>flume.collector.roll.millis</name>
        <value>60000</value>
        <description>The time (in milliseconds) between when hdfs files are closed and a new file is opened (rolled).</description>
      </property>
    </configuration>

### Web Based Setup

I chose to do the individual machine setup via the master web interface. You can get to it by pointing your web browser at http://hadoop-master-1:35871/ (replace hadoop-master-1 with the public/private DNS name or IP of your Flume master, or set up /etc/hosts for a hostname).
Ensure that the port is accessible from the outside through your security settings. At this point, it was easiest for me to ensure all hosts running Flume could talk to all ports on all other hosts running Flume. You can certainly lock this down to the individual ports for security once everything is up and running.

At this point, you should go to hadoop-agent-1 and hadoop-collector-1 and run `/etc/init.d/flume-node start`. If everything goes well, then the master (whose IP is specified in their configs) should be notified of their existence. Now you can configure them from the web. Click on the config link and then fill in the text fields as follows:

- Agent Node: `hadoop-agent-1`
- Source: `tailDir("/var/logs/apache2/",".*.log")`
- Sink: `agentBESink("hadoop-collector-1",35853)`

Note: I chose to use tailDir since I will control rotating the logs on my own. I am also using agentBESink because I am OK with losing log lines if the case arises.

Now click Submit Query and go back to the config page to set up the collector:

- Agent Node: `hadoop-collector-1`
- Source: `collectorSource(35853)`
- Sink: `collectorSink("hdfs://hadoop-master-1:8020/flume/logs/%Y/%m/%d/%H00","server")`

This tells the collector that we are sinking to HDFS with an initial folder of ‘flume’. It will then log to sub-folders of the form “flume/logs/YYYY/MM/DD/HH00” (or 2011/02/03/1300/server-.log).

Now click Submit Query and go to the ‘master’ page, and you should see 2 commands listed as “SUCCEEDED” in the command history. If they have not succeeded, ensure a few things have been done (there are probably more, but this is a handy start):

- Always use double quotes (") since single quotes (') aren’t interpreted correctly. UPDATE: Single quotes are interpreted correctly, they are just not accepted intentionally (Thanks jmhsieh).
- In your regex, use something like `".*\\.log"` since the ‘.’ is part of the regex.
- In your regex, ensure that your backslashes are properly escaped: `"foo\\bar"` is the correct version of trying to match `"foo\bar"`.

Additionally, there are also tables of Node Status and Node Configuration. These should match up with what you think you configured.

At this point everything should work. Admittedly I had a lot of trouble getting to this point. But with the help of the Cloudera folks and the users on irc.freenode.net in #flume, I was able to get things going. The logs sadly aren't too helpful here in most cases (but look anyway, because they might provide you with more info than they provided for me). If I missed anything in this post or there is something else I am unaware of, then let me know.
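To see why the unescaped dot in the tailDir pattern matters, here is a small sketch that compares the two patterns using Java's regex engine. The class name and the sample file names are made up for illustration, and it assumes the pattern has to match the whole file name.

```java
import java.util.regex.Pattern;

public class TailDirRegexDemo {

    /** True when the whole file name matches the regex (full-match semantics are an assumption). */
    public static boolean matches(String regex, String fileName) {
        return Pattern.matches(regex, fileName);
    }

    public static void main(String[] args) {
        String loose = ".*.log";     // unescaped '.': any single character may precede "log"
        String strict = ".*\\.log";  // Java string literal for the regex .*\.log

        System.out.println(matches(loose, "access.log"));   // true
        System.out.println(matches(loose, "catalog"));      // true, probably not what you want
        System.out.println(matches(strict, "access.log"));  // true
        System.out.println(matches(strict, "catalog"));     // false
    }
}
```

The loose pattern also picks up any file whose name merely ends in "log", which is why escaping the dot is the safer choice for a directory that holds more than log files.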