Big Big Things in my Little Little World: Apache Tika

Hi friends, today I would like to introduce Apache Tika, a Java-based API for extracting content and metadata from documents. Say we need to detect the language of every incoming mail. As discussed in the earlier post, we might think of using the Language Detection API. But it has daily access limits, so we would not be able to analyse all the mails. We need an open source API that does not have these limits, and that is why we go for Apache Tika.

Apache Tika is a content analysis toolkit from the Apache Software Foundation. It detects the type of a document and extracts its metadata and structured text content by delegating to existing parser libraries.
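To make that a bit more concrete, here is a minimal sketch that uses the Tika facade class (org.apache.tika.Tika) for both jobs: guessing the media type and pulling out the plain text. The class name TikaFacadeSketch and the tiny in-memory HTML document are my own assumptions for illustration, and the sketch assumes the tika-parsers dependency is on the classpath.

```java
import org.apache.tika.Tika;

import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

public class TikaFacadeSketch {

    // Returns the media type Tika guesses from the leading bytes of the content.
    static String detectType(byte[] content) {
        return new Tika().detect(content);
    }

    // Returns the plain text extracted by whichever parser Tika selects for the content.
    static String extractText(byte[] content) throws Exception {
        try (InputStream in = new ByteArrayInputStream(content)) {
            return new Tika().parseToString(in);
        }
    }

    public static void main(String[] args) throws Exception {
        // Hypothetical document; in practice this would be a file or a mail attachment.
        byte[] html = "<html><head><title>Demo</title></head><body><p>Hello Tika</p></body></html>"
                .getBytes(StandardCharsets.UTF_8);

        System.out.println("Type: " + detectType(html));
        System.out.println("Text: " + extractText(html).trim());
    }
}
```

The same two calls should work for the other supported formats, as long as the matching parser library is available.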

Supported Document Formats

HyperText Markup Language
XML and derived formats
Microsoft Office document formats
OpenDocument Format
Portable Document Format
Electronic Publication Format
Rich Text Format
Compression and packaging formats
Text formats
Audio formats
Image and video formats
Java class files and archives

Latest Release

At the time of writing, the latest version of this library is Apache Tika 1.8, which was released on 20 April 2015.

Maven Dependency

<dependency>
    <groupId>org.apache.tika</groupId>
    <artifactId>tika-parsers</artifactId>
    <version>1.8</version>
</dependency>

Language detection using Tika

Tika can detect the language of a piece of text. Because the detection runs locally inside our own application, it has no daily access limits, unlike the Language Detection API that we used earlier.

Languages supported

By default, Apache Tika ships with profiles for the 28 languages listed below. Detection can be extended to other languages by adding custom language profiles. The following are the languages supported by Tika:

Language code	Language
be	Belarusian
ca	Catalan
da	Danish
de	German
eo	Esperanto
et	Estonian
el	Greek
en	English
es	Spanish
fi	Finnish
fr	French
fa	Persian
gl	Galician
hu	Hungarian
is	Icelandic
it	Italian
lt	Lithuanian
nl	Dutch
no	Norwegian
pl	Polish
pt	Portuguese
ro	Romanian
ru	Russian
sk	Slovak
sl	Slovenian
sv	Swedish
th	Thai
uk	Ukrainian
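
If you are wondering what a language profile actually is: as far as I understand, Tika's detector compares character n-gram statistics of the input text with a stored profile per language and picks the closest one. The toy below is only a sketch of that idea in plain Java. It is not Tika's code, and the class NgramProfileSketch, the trigram size and the two tiny reference texts are my own assumptions.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;

public class NgramProfileSketch {

    // Counts the overlapping character trigrams of the lower-cased text.
    static Map<String, Integer> profile(String text) {
        Map<String, Integer> counts = new HashMap<>();
        String s = text.toLowerCase();
        for (int i = 0; i + 3 <= s.length(); i++) {
            counts.merge(s.substring(i, i + 3), 1, Integer::sum);
        }
        return counts;
    }

    // Sums all trigram counts of a profile.
    static int total(Map<String, Integer> counts) {
        int sum = 0;
        for (int count : counts.values()) {
            sum += count;
        }
        return sum;
    }

    // Euclidean distance between the relative trigram frequencies of two profiles.
    static double distance(Map<String, Integer> a, Map<String, Integer> b) {
        double totalA = Math.max(1, total(a));
        double totalB = Math.max(1, total(b));
        Set<String> grams = new HashSet<>(a.keySet());
        grams.addAll(b.keySet());
        double sumOfSquares = 0.0;
        for (String gram : grams) {
            double diff = a.getOrDefault(gram, 0) / totalA - b.getOrDefault(gram, 0) / totalB;
            sumOfSquares += diff * diff;
        }
        return Math.sqrt(sumOfSquares);
    }

    // Returns the code of the reference profile that is closest to the text.
    static String closest(String text, Map<String, Map<String, Integer>> references) {
        Map<String, Integer> target = profile(text);
        String best = "unknown";
        double bestDistance = Double.MAX_VALUE;
        for (Map.Entry<String, Map<String, Integer>> entry : references.entrySet()) {
            double d = distance(target, entry.getValue());
            if (d < bestDistance) {
                bestDistance = d;
                best = entry.getKey();
            }
        }
        return best;
    }

    // Hypothetical reference profiles built from two tiny made-up training texts.
    static Map<String, Map<String, Integer>> sampleReferences() {
        Map<String, Map<String, Integer>> refs = new LinkedHashMap<>();
        refs.put("en", profile("the dog and the cat are in the house and the man is in the garden"));
        refs.put("de", profile("der hund und die katze sind im haus und der mann ist im garten"));
        return refs;
    }

    public static void main(String[] args) {
        System.out.println(closest("the man and the dog", sampleReferences()));
    }
}
```

Teaching this toy a new language just means adding one more reference profile, which is roughly what adding a custom profile to Tika amounts to. Real profiles are built from far more text than one sentence.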

Sample Program for language detection using Apache Tika

import org.apache.tika.language.LanguageIdentifier;

public class TikaLanguageDetection {

    public static void main(String[] args) {
        // Text whose language we want to detect
        String message = "This is a program to detect language";

        // Compare the text with the language profiles bundled with Tika
        LanguageIdentifier identifier = new LanguageIdentifier(message);

        // Language code of the closest match, e.g. "en"
        String language = identifier.getLanguage();

        System.out.println("Language is:" + language);
    }
}

Result

Language is:en
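
Coming back to the incoming mails from the start of this post: a very short or mixed-language mail may not give Tika enough text to work with. LanguageIdentifier also offers isReasonablyCertain(), so here is a small sketch of falling back to "unknown" when Tika is not sure of its guess. The class MailLanguageSketch, the "unknown" label and the sample mail bodies are my own assumptions, not part of Tika.

```java
import org.apache.tika.language.LanguageIdentifier;

import java.util.Arrays;
import java.util.List;

public class MailLanguageSketch {

    // Returns the detected language code, or "unknown" when the text is empty
    // or Tika does not consider its own guess reasonably certain.
    static String detectOrUnknown(String text) {
        if (text == null || text.trim().isEmpty()) {
            return "unknown";
        }
        LanguageIdentifier identifier = new LanguageIdentifier(text);
        return identifier.isReasonablyCertain() ? identifier.getLanguage() : "unknown";
    }

    public static void main(String[] args) {
        // Hypothetical mail bodies standing in for real incoming messages.
        List<String> mailBodies = Arrays.asList(
                "Thank you for your order. Your parcel will be delivered within three working days.",
                "Merci pour votre commande. Votre colis sera livre dans un delai de trois jours ouvrables.",
                "ok");

        for (String body : mailBodies) {
            System.out.println(detectOrUnknown(body) + " <- " + body);
        }
    }
}
```

Mails that come back as "unknown" could then be set aside for a manual check instead of being filed under a wrong language.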

Apache Tika is not limited to language detection. I'll update you about more of its features in the next post.

Thursday, 28 May 2015
