Hi friends, today I would like to introduce Apache Tika,
a Java-based API for extracting content and metadata from documents. Say
we need to detect the language of all incoming mails. As discussed in the
earlier post, we might think of using the Language Detection API. But it has
daily access limits, so we would not be able to analyze all the
mails. It's time to look for an open source API that does not have
these access limits, and that is why we go for Apache Tika.
Apache Tika is a content analysis toolkit from the Apache
Software Foundation. It detects and extracts metadata and
structured text content from various documents using existing parser libraries.
Supported Document Formats
- HyperText Markup Language
- XML and derived formats
- Microsoft Office document formats
- OpenDocument Format
- Portable Document Format
- Electronic Publication Format
- Rich Text Format
- Compression and packaging formats
- Text formats
- Audio formats
- Image and video formats
- Java class files and archives
Latest Release
The latest version of this library is Apache Tika 1.8, which
was released on 20 April 2015.
Maven Dependency
<dependency>
  <groupId>org.apache.tika</groupId>
  <artifactId>tika-parsers</artifactId>
  <version>1.4</version>
</dependency>
Language detection using Tika
Tika can detect the language of a piece of text. The
main advantage of this feature is that it has no daily access
limits, unlike the Language Detection API that we used earlier.
Languages supported
By default, Apache Tika supports the 28 languages listed below. This
functionality can be extended to other languages by adding custom language
profiles. The following are the languages supported by Tika:
| Language code | Language |
|---|---|
| be | Belarusian |
| ca | Catalan |
| da | Danish |
| de | German |
| eo | Esperanto |
| et | Estonian |
| el | Greek |
| en | English |
| es | Spanish |
| fi | Finnish |
| fr | French |
| fa | Persian |
| gl | Galician |
| hu | Hungarian |
| is | Icelandic |
| it | Italian |
| lt | Lithuanian |
| nl | Dutch |
| no | Norwegian |
| pl | Polish |
| pt | Portuguese |
| ro | Romanian |
| ru | Russian |
| sk | Slovakian |
| sl | Slovenian |
| sv | Swedish |
| th | Thai |
| uk | Ukrainian |
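To give a feel for what a language profile is, here is a toy sketch in plain Java. It assumes a profile is simply a table of three-character sequence counts and that the closest profile wins by cosine similarity. The class name, the scoring, and the tiny training sentences are all made up for illustration; this is not Tika's implementation, and real profiles are built from much larger text samples.

```java
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

/** Toy trigram-profile language guesser; illustrative only, not Tika's code. */
final class TrigramProfileSketch {

    /** Counts overlapping three-character sequences in the lower-cased text. */
    static Map<String, Integer> profile(String text) {
        Map<String, Integer> counts = new HashMap<>();
        // Collapse anything that is not a letter into a single space.
        String normalized = text.toLowerCase(Locale.ROOT)
                .replaceAll("[^\\p{L}]+", " ")
                .trim();
        for (int i = 0; i + 3 <= normalized.length(); i++) {
            counts.merge(normalized.substring(i, i + 3), 1, Integer::sum);
        }
        return counts;
    }

    /** Cosine similarity between two count tables; 0.0 if either is empty. */
    static double similarity(Map<String, Integer> a, Map<String, Integer> b) {
        double dot = 0, normA = 0, normB = 0;
        for (Map.Entry<String, Integer> e : a.entrySet()) {
            normA += e.getValue() * e.getValue();
            dot += e.getValue() * b.getOrDefault(e.getKey(), 0);
        }
        for (int v : b.values()) {
            normB += v * v;
        }
        if (normA == 0 || normB == 0) {
            return 0.0;
        }
        return dot / Math.sqrt(normA * normB);
    }

    /** Returns the code of the most similar profile, or "unknown" if none overlaps. */
    static String detect(String text, Map<String, Map<String, Integer>> profiles) {
        Map<String, Integer> target = profile(text);
        String best = "unknown";
        double bestScore = 0.0;
        for (Map.Entry<String, Map<String, Integer>> e : profiles.entrySet()) {
            double score = similarity(target, e.getValue());
            if (score > bestScore) {
                bestScore = score;
                best = e.getKey();
            }
        }
        return best;
    }

    /** Builds two tiny hypothetical profiles from made-up sample sentences. */
    static Map<String, Map<String, Integer>> sampleProfiles() {
        Map<String, Map<String, Integer>> profiles = new LinkedHashMap<>();
        profiles.put("en", profile(
                "this is a simple program that is used to detect the language of the text"));
        profiles.put("de", profile(
                "dies ist ein einfaches programm das die sprache eines textes erkennt"));
        return profiles;
    }

    public static void main(String[] args) {
        Map<String, Map<String, Integer>> profiles = sampleProfiles();
        System.out.println(detect("This is a program to detect language", profiles));
        System.out.println(detect("Das ist ein Programm", profiles));
    }
}
```

Adding a language in this sketch just means adding one more entry to the profile map, which is the idea behind extending detection with a custom profile.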
Sample Program for language detection using Apache Tika
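import org.apache.tika.language.LanguageIdentifier;

public class TikaLanguageDetection {
    public static void main(String[] args) {
        // Text whose language we want to identify
        String message = "This is a program to detect language";
        // Build an identifier for the text using Tika's built-in language profiles
        LanguageIdentifier identifier = new LanguageIdentifier(message);
        // getLanguage() returns the language code of the best match
        String language = identifier.getLanguage();
        System.out.println("Language is:" + language);
    }
}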
Result
Language is:en
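Coming back to the incoming-mail scenario, the same call can be applied to a whole batch. The sketch below is only one possible shape: the class MailLanguageGrouper and the stub detector are hypothetical, and the detector is passed in as a function so the example runs without Tika on the classpath. With Tika available, the assumption is that you would pass `text -> new LanguageIdentifier(text).getLanguage()` instead of the stub.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.function.Function;

/** Hypothetical helper that groups mail bodies by detected language code. */
final class MailLanguageGrouper {

    /** Applies the detector to every mail and collects the mails per language code. */
    static Map<String, List<String>> groupByLanguage(List<String> mails,
                                                     Function<String, String> detector) {
        // TreeMap keeps the language codes in alphabetical order.
        Map<String, List<String>> byLanguage = new TreeMap<>();
        for (String mail : mails) {
            String code = detector.apply(mail);
            byLanguage.computeIfAbsent(code, k -> new ArrayList<>()).add(mail);
        }
        return byLanguage;
    }

    public static void main(String[] args) {
        // Stand-in detector so the sketch is self-contained; not a real detector.
        Function<String, String> stubDetector =
                text -> text.contains("Bonjour") ? "fr" : "en";

        List<String> mails = Arrays.asList(
                "Bonjour, voici le rapport.",
                "Hello, here is the report.",
                "Please review the attached file.");

        // Print how many mails were assigned to each language code.
        groupByLanguage(mails, stubDetector)
                .forEach((code, group) -> System.out.println(code + ": " + group.size()));
    }
}
```

Since detection runs locally, the loop can go over every mail without worrying about a daily quota.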
The functionality of Apache Tika is not limited to
language detection. I'll update you about more features of Apache Tika in the
next post.