Thursday, 28 May 2015

Apache Tika

Hi friends, today I would like to introduce Apache Tika, a Java-based API for extracting content and metadata from documents. Say we need to detect the language of every incoming mail. As discussed in the earlier post, we might think of using the Language Detection API. But it has daily access limits, so we won't be able to analyse all the mails. It's time to look at an open source API that does not have these access limits, and that is why we go for Apache Tika.
Apache Tika is a content analysis toolkit from the Apache Software Foundation. It detects and extracts metadata and structured text content from various kinds of documents using existing parser libraries.

Supported Document Formats

  • HyperText Markup Language
  • XML and derived formats
  • Microsoft Office document formats
  • OpenDocument Format
  • Portable Document Format
  • Electronic Publication Format
  • Rich Text Format
  • Compression and packaging formats
  • Text formats
  • Audio formats
  • Image and video formats
  • Java class files and archives

Latest Release

The latest version of this library is Apache Tika 1.8 which was released on 20 April 2015.

Maven Dependency

<dependency>
    <groupId>org.apache.tika</groupId>
    <artifactId>tika-parsers</artifactId>
    <version>1.4</version>
</dependency>

Language detection using Tika

Tika is able to detect the language of a piece of text. The main advantage of this feature is that it runs locally as a library, so it doesn't have any daily access limits like the Language Detection API that we used earlier.
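
To give a feel for how a detector of this kind can work, here is a minimal sketch that compares character trigram counts of a text against small reference profiles. This is my own simplified illustration and not Tika's actual code: the class name TrigramSketch, the tiny reference sentences and the cosine similarity scoring are all assumptions made for the example.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

public class TrigramSketch {

    // Count overlapping three-character sequences in the lower-cased text.
    public static Map<String, Integer> profile(String text) {
        Map<String, Integer> counts = new HashMap<>();
        String s = text.toLowerCase(Locale.ROOT);
        for (int i = 0; i + 3 <= s.length(); i++) {
            counts.merge(s.substring(i, i + 3), 1, Integer::sum);
        }
        return counts;
    }

    // Cosine similarity of two profiles: 0 means no shared trigrams, 1 means same direction.
    public static double similarity(Map<String, Integer> a, Map<String, Integer> b) {
        double dot = 0, normA = 0, normB = 0;
        for (Map.Entry<String, Integer> e : a.entrySet()) {
            dot += e.getValue() * b.getOrDefault(e.getKey(), 0);
            normA += e.getValue() * e.getValue();
        }
        for (int v : b.values()) {
            normB += v * v;
        }
        return (normA == 0 || normB == 0) ? 0 : dot / Math.sqrt(normA * normB);
    }

    // Return the code of the most similar reference profile, or "unknown" if nothing matches.
    public static String detect(String text, Map<String, Map<String, Integer>> references) {
        Map<String, Integer> p = profile(text);
        String best = "unknown";
        double bestScore = 0;
        for (Map.Entry<String, Map<String, Integer>> e : references.entrySet()) {
            double score = similarity(p, e.getValue());
            if (score > bestScore) {
                bestScore = score;
                best = e.getKey();
            }
        }
        return best;
    }

    public static void main(String[] args) {
        // Toy reference profiles; a real detector is trained on far more text per language.
        Map<String, Map<String, Integer>> refs = new HashMap<>();
        refs.put("en", profile("the quick brown fox jumps over the lazy dog and this is a simple sentence"));
        refs.put("de", profile("der hund und die katze sind im garten und das ist ein einfacher satz"));
        System.out.println(detect("This is a program to detect language", refs)); // prints "en"
    }
}
```

With such short reference sentences the sketch is only reliable on text that resembles them, which is why a real library ships with profiles built from much larger samples.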

Languages supported

By default Apache Tika supports the 28 languages listed below. This functionality can be extended to other languages by adding custom language profiles. The following are the languages supported by Tika:
Language code    Language
be               Belarusian
ca               Catalan
da               Danish
de               German
eo               Esperanto
et               Estonian
el               Greek
en               English
es               Spanish
fi               Finnish
fr               French
fa               Persian
gl               Galician
hu               Hungarian
is               Icelandic
it               Italian
lt               Lithuanian
nl               Dutch
no               Norwegian
pl               Polish
pt               Portuguese
ro               Romanian
ru               Russian
sk               Slovakian
sl               Slovenian
sv               Swedish
th               Thai
uk               Ukrainian
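
Since the detector hands back a two-letter code, you may want to show a readable name instead. A minimal sketch using only the JDK's java.util.Locale is given here; the class name LanguageNames and the helper displayName are my own, and the names the JDK returns may differ slightly from the list above (for example "Slovak" for sk).

```java
import java.util.Locale;

public class LanguageNames {

    // Turn a language code such as "en" into its English display name.
    public static String displayName(String code) {
        return Locale.forLanguageTag(code).getDisplayLanguage(Locale.ENGLISH);
    }

    public static void main(String[] args) {
        for (String code : new String[] {"en", "de", "fr", "uk"}) {
            System.out.println(code + " -> " + displayName(code));
        }
    }
}
```

This avoids maintaining a hand-written lookup table for the codes.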

Sample Program for language detection using Apache Tika

import org.apache.tika.language.LanguageIdentifier;

public class TikaLanguageDetection {
    public static void main(String[] args) {
        String message = "This is a program to detect language";
        // Build a language profile of the text and match it against Tika's built-in profiles
        LanguageIdentifier identifier = new LanguageIdentifier(message);
        // getLanguage() returns the ISO 639 code of the best match, e.g. "en"
        String language = identifier.getLanguage();
        System.out.println("Language is:" + language);
    }
}

Result

Language is:en
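
Coming back to the incoming mails scenario, the same call can be applied to a whole batch. Here is a rough sketch that groups mail bodies by detected language. The class MailLanguageGrouper and the stand-in detector are hypothetical and only there so that the example runs without Tika; with Tika on the classpath you could pass text -> new LanguageIdentifier(text).getLanguage() as the detector instead.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.function.Function;

public class MailLanguageGrouper {

    // Group mail bodies by the language code that the given detector returns for each one.
    public static Map<String, List<String>> groupByLanguage(List<String> mails,
                                                            Function<String, String> detector) {
        Map<String, List<String>> groups = new TreeMap<>();
        for (String mail : mails) {
            groups.computeIfAbsent(detector.apply(mail), k -> new ArrayList<>()).add(mail);
        }
        return groups;
    }

    public static void main(String[] args) {
        // Stand-in detector for this demo only (assumption): it is NOT a real language detector.
        Function<String, String> fakeDetector = text -> text.contains(" the ") ? "en" : "other";
        List<String> mails = Arrays.asList("Where is the report?", "Wo ist der Bericht?");
        System.out.println(groupByLanguage(mails, fakeDetector));
        // prints {en=[Where is the report?], other=[Wo ist der Bericht?]}
    }
}
```

Keeping the detector as a plain function means the grouping logic stays the same if you later swap in a different detection library.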

The functionality of Apache Tika is not limited to language detection. I'll update you about more features of Apache Tika in the next post.


"Expectation leads to Disappointment!!"
