Friday, 19 June 2015

More About Apache Tika


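It's been a long time since I last wrote. Let's look at Apache Tika in more detail. You might be wondering where it is actually used. One well-known application is Elasticsearch, whose attachment plugin relies on Tika to extract text and metadata from documents before they are indexed.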

Program to parse the Google Web page

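import java.io.InputStream;
import java.net.URL;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.html.HtmlParser;
import org.apache.tika.sax.BodyContentHandler;
import org.apache.tika.sax.LinkContentHandler;
import org.apache.tika.sax.TeeContentHandler;
import org.apache.tika.sax.ToHTMLContentHandler;
import org.xml.sax.ContentHandler;

public class ParseWebPage {
    public static void main(String[] args) throws Exception {
        URL url = new URL("https://www.google.co.in");
        // The tee handler forwards the same parse events to all three handlers
        LinkContentHandler linkHandler = new LinkContentHandler();
        ContentHandler textHandler = new BodyContentHandler();
        ToHTMLContentHandler toHTMLHandler = new ToHTMLContentHandler();
        TeeContentHandler teeHandler = new TeeContentHandler(linkHandler, textHandler, toHTMLHandler);
        Metadata metadata = new Metadata();
        ParseContext parseContext = new ParseContext();
        HtmlParser parser = new HtmlParser();
        // try-with-resources closes the stream once parsing is done
        try (InputStream input = url.openStream()) {
            parser.parse(input, teeHandler, metadata, parseContext);
        }
        // The HtmlParser stores the page title under the "title" key
        System.out.println("TITLE:\n" + metadata.get("title").replaceAll("\\s+", " ").trim());
        //System.out.println("LINKS:\n" + linkHandler.getLinks());
        System.out.println("TEXT:\n" + textHandler.toString().replaceAll("\\s+", " ").trim());
        //System.out.println("HTML:\n" + toHTMLHandler.toString().replaceAll("\\s+", " ").trim());
    }
}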

Result:

TITLE:
Google
TEXT:
Search Images Maps Play YouTube News Gmail Drive More » Web History | Settings | Sign in × A faster way to browse the web Install Google Chrome India   Advanced searchLanguage tools Google.co.in offered in: हिन्दी বাংলা తెలుగు मराठी தமிழ் ગુજરાતી ಕನ್ನಡ മലയാളം ਪੰਜਾਬੀ Advertising ProgramsBusiness Solutions+GoogleAbout GoogleGoogle.com © 2015 - Privacy - Terms

Program to parse an XML file

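import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import org.apache.tika.exception.TikaException;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.xml.XMLParser;
import org.apache.tika.sax.BodyContentHandler;
import org.xml.sax.SAXException;
public class ParseXmlFile {
    public static void main(String[] args) throws IOException, SAXException, TikaException {
        // The handler collects the text content; the parser fills in the metadata
        BodyContentHandler handler = new BodyContentHandler();
        Metadata metadata = new Metadata();
        ParseContext pcontext = new ParseContext();
        XMLParser xmlparser = new XMLParser();
        // try-with-resources closes the file once parsing is done
        try (InputStream inputstream = new FileInputStream("sample.xml")) {
            xmlparser.parse(inputstream, handler, metadata, pcontext);
        }
        System.out.println("Contents of the document:" + handler.toString());
    }
}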

Sample XML file:

<note>
<to>Tom</to>
<from>Jerry</from>
<heading>Reminder</heading>
<body>Weekend Trip..</body>
</note>

Result:

Contents of the document:
Tom
Jerry
Reminder
Weekend Trip..
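
To get a feel for why the result contains only the text of each element, here is a small sketch that does something similar with the plain SAX parser from the JDK. This is not how Tika's XMLParser is implemented; it is only an illustration, and the class and method names (NoteTextSketch, extractText) are made up for this post.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper for illustration only; not part of Tika.
class NoteTextSketch {

    // Returns the non-blank text of every element, in document order.
    static List<String> extractText(String xml) {
        List<String> lines = new ArrayList<>();
        DefaultHandler handler = new DefaultHandler() {
            private final StringBuilder current = new StringBuilder();

            @Override
            public void startElement(String uri, String localName, String qName, Attributes attributes) {
                flush();
            }

            @Override
            public void characters(char[] ch, int start, int length) {
                // Character data may arrive in several chunks, so buffer it
                current.append(ch, start, length);
            }

            @Override
            public void endElement(String uri, String localName, String qName) {
                flush();
            }

            // Emits the buffered text as one line, skipping pure whitespace
            private void flush() {
                String text = current.toString().trim();
                if (!text.isEmpty()) {
                    lines.add(text);
                }
                current.setLength(0);
            }
        };
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalArgumentException("Could not parse XML", e);
        }
        return lines;
    }

    public static void main(String[] args) {
        // Same content as the sample XML file above
        String xml = "<note>\n<to>Tom</to>\n<from>Jerry</from>\n"
                + "<heading>Reminder</heading>\n<body>Weekend Trip..</body>\n</note>";
        extractText(xml).forEach(System.out::println);
    }
}
```

The tags themselves never reach the output because only the character data between them is collected, which matches what the Tika result above shows.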


Apache Tika Server


Download tika-server.jar from the Tika project site and start the server with:

java -jar tika-server-x.x.jar -h 0.0.0.0

The -h 0.0.0.0 (host) option makes the server listen on all network interfaces; without it, the server only accepts requests from localhost. You can also pass the -p option to change the port, which defaults to 9998.

Once the server has started, we can simply open it in a browser (for example http://localhost:9998/). It will list all available endpoints.
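
The server can also be called from code instead of a browser. Below is a minimal sketch, assuming Java 11 or later (for java.net.http) and a server running on localhost with the default port 9998. It sends a document to the server's /tika endpoint with an HTTP PUT and asks for plain text back. The class and method names are made up for illustration, and sample.xml is simply the file from the earlier example.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical client for illustration only; not shipped with Tika.
class TikaServerClientSketch {

    // Builds a PUT request that uploads the document to the /tika endpoint
    // and asks the server to answer with plain text.
    static HttpRequest extractTextRequest(String host, int port, byte[] document) {
        URI endpoint = URI.create("http://" + host + ":" + port + "/tika");
        return HttpRequest.newBuilder(endpoint)
                .header("Accept", "text/plain")
                .PUT(HttpRequest.BodyPublishers.ofByteArray(document))
                .build();
    }

    public static void main(String[] args) throws Exception {
        // Assumes sample.xml is in the working directory and the server is running
        byte[] document = Files.readAllBytes(Path.of("sample.xml"));
        HttpClient client = HttpClient.newHttpClient();
        HttpResponse<String> response = client.send(
                extractTextRequest("localhost", 9998, document),
                HttpResponse.BodyHandlers.ofString());
        System.out.println("Status: " + response.statusCode());
        System.out.println(response.body());
    }
}
```

If the server was started with a different -p value, pass that port instead of 9998.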


Tika Server Versions

Tika can be run as a server from two different jars:

  • tika-app.jar
It has the --server --port 9998 options to start a simple server. It provides text extraction and returns the content as HTML.
  • tika-server.jar
It is a separate component built on JAX-RS. It acts as a RESTful service.

That's all about Tika. I'll come up with another interesting technology next time....



Silence is True Wisdom's Best Reply!!







