Friday 29 May 2015

Tips and Tricks- 4

1. Query to find week day in PostgreSQL

select to_char(current_date,'Day');
Friday


2. Query to find current day of week as a number (1 = Sunday, 7 = Saturday)

select to_char(current_date,'D');
6

3. Query to extract year and month from date in PostgreSQL

SELECT
 EXTRACT(YEAR FROM to_timestamp(column_name, 'YYYY-MM-DD')) years,
 EXTRACT(MONTH FROM to_timestamp(column_name, 'YYYY-MM-DD')) months
 FROM table_name

4. Query to find number of days between two dates in PostgreSQL

select '2015-05-22'::date - '2015-05-12'::date
10

5. Query to display name of month in PostgreSQL

SELECT to_char(to_timestamp (4::text, 'MM'), 'TMMon')
Apr
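
The same date tricks can also be done on the Java side. The following is only a minimal sketch using java.time (Java 8 or later); the class name DateTipsSketch and its method names are made up for illustration, and the dates are the ones used in the queries above.

```java
import java.time.LocalDate;
import java.time.Month;
import java.time.format.TextStyle;
import java.time.temporal.ChronoUnit;
import java.util.Locale;

public class DateTipsSketch {

    // Full weekday name, similar to to_char(date, 'Day')
    static String dayName(LocalDate date) {
        return date.getDayOfWeek().getDisplayName(TextStyle.FULL, Locale.ENGLISH);
    }

    // Day number with Sunday = 1 ... Saturday = 7, similar to to_char(date, 'D').
    // java.time counts Monday = 1 ... Sunday = 7, so the value is shifted.
    static int dayNumberSundayFirst(LocalDate date) {
        return date.getDayOfWeek().getValue() % 7 + 1;
    }

    // Whole days from one date to another, similar to date2 - date1
    static long daysBetween(LocalDate from, LocalDate to) {
        return ChronoUnit.DAYS.between(from, to);
    }

    // Abbreviated month name for a month number (1 to 12)
    static String shortMonthName(int month) {
        return Month.of(month).getDisplayName(TextStyle.SHORT, Locale.ENGLISH);
    }

    public static void main(String[] args) {
        LocalDate friday = LocalDate.of(2015, 5, 29);
        System.out.println(dayName(friday));
        System.out.println(dayNumberSundayFirst(friday));
        System.out.println(daysBetween(LocalDate.of(2015, 5, 12), LocalDate.of(2015, 5, 22)));
        System.out.println(shortMonthName(4));
    }
}
```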

6. How to get fully qualified path name for a file in /src/Resources folder in Eclipse Java Project

className.getClass().getResource("/Resources")

7. List all filenames in a particular directory using java

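import java.io.File;
import java.util.ArrayList;
import java.util.List;

// Returns the names of the regular files (not sub-directories) directly inside 'location'
public List<String> listFiles(String location) {
    List<String> fileNames = new ArrayList<String>();
    File[] files = new File(location).listFiles();
    if (files == null) {
        // listFiles() returns null if the path is not a directory or cannot be read
        return fileNames;
    }
    for (File file : files) {
        if (file.isFile()) {
            fileNames.add(file.getName());
        }
    }
    return fileNames;
}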

8. Find Elements in list1 but not in list2 in java

ListUtils.subtract(list1,list2);

9. Find union of two lists in java

ListUtils.union(list1,list2);
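
ListUtils in tips 8 and 9 comes from the Apache Commons Collections library. If that library is not on the classpath, roughly the same behaviour can be sketched with plain java.util. The class ListOpsSketch below is a made-up helper, not part of any library: subtract removes one occurrence from list1 for every element of list2, and union simply appends list2 to list1 (duplicates are kept).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class ListOpsSketch {

    // Elements of list1 that are left after taking away the elements of list2
    static <T> List<T> subtract(List<T> list1, List<T> list2) {
        List<T> result = new ArrayList<T>(list1);
        for (T item : list2) {
            result.remove(item); // removes at most one matching occurrence
        }
        return result;
    }

    // list1 followed by list2; duplicates are not removed
    static <T> List<T> union(List<T> list1, List<T> list2) {
        List<T> result = new ArrayList<T>(list1);
        result.addAll(list2);
        return result;
    }

    public static void main(String[] args) {
        List<String> list1 = Arrays.asList("a", "b", "c");
        List<String> list2 = Arrays.asList("b", "d");
        System.out.println(subtract(list1, list2));
        System.out.println(union(list1, list2));
    }
}
```

Neither method modifies its arguments; both work on a copy of list1.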

Mistakes lead us to right choice!!

Thursday 28 May 2015

Apache Tika

Hi friends, today I would like to introduce Apache Tika, which is a Java-based API for content and metadata extraction from documents. Say we need to detect the language of all incoming mails. As discussed in the earlier post, we might think of using the Language Detection API. But it has daily access limits, so we won't be able to do the analysis for all the mails. So it's high time to think of an open source API which does not have these access limits. Hence we go for Apache Tika.
Apache Tika is a content analysis toolkit from Apache Software Foundation. It is used for detecting and extracting metadata and structured text content from various documents using existing parser libraries.

Supported Document Formats

  • HyperText Markup Language
  • XML and derived formats
  • Microsoft Office document formats
  • OpenDocument Format
  • Portable Document Format
  • Electronic Publication Format
  • Rich Text Format
  • Compression and packaging formats
  • Text formats
  • Audio formats
  • Image and video formats
  • Java class files and archives

Latest Release

The latest version of this library is Apache Tika 1.8 which was released on 20 April 2015.

Maven Dependency

<dependency>
    <groupId>org.apache.tika</groupId>
    <artifactId>tika-parsers</artifactId>
    <version>1.4</version>
</dependency>

Language detection using Tika

Tika is able to detect the language of a piece of text. The main characteristic of this feature is that it doesn't have any daily access limits like the Language detection API that we used earlier.

Languages supported

By default Apache Tika supports 27 languages. This functionality can be extended to other languages by adding custom language profilers. The following are the languages supported by Tika:
Language code   Language
be              Belarusian
ca              Catalan
da              Danish
de              German
eo              Esperanto
et              Estonian
el              Greek
en              English
es              Spanish
fi              Finnish
fr              French
fa              Persian
gl              Galician
hu              Hungarian
is              Icelandic
it              Italian
lt              Lithuanian
nl              Dutch
no              Norwegian
pl              Polish
pt              Portuguese
ro              Romanian
ru              Russian
sk              Slovakian
sl              Slovenian
sv              Swedish
th              Thai
uk              Ukrainian

Sample Program for language detection using Apache Tika

import org.apache.tika.language.LanguageIdentifier;

public class TikaLanguageDetection {
    public static void main(String args[]) {
        String message = "This is a program to detect language";
        LanguageIdentifier identifier = new LanguageIdentifier(message);
        String language = identifier.getLanguage();
        System.out.println("Language is:" + language);
    }
}

Result

Language is:en

The functionality of Apache Tika is not limited to language detection. I'll update you about more features of Apache Tika in the next post.


"Expectation leads to Disappointment!!"

Wednesday 20 May 2015

Tips and Tricks- 3



1. To find the process listening on a port in Linux

sudo netstat -lpn |grep :port_number
Eg:
sudo netstat -lpn |grep :9001
Sample Output:
tcp        0      0 :::9001                     :::*                        LISTEN      1070/java   


2. To kill a process running on a particular port in Linux

sudo kill -9 processid
Eg:
sudo kill -9 1070


3. Select varchar as timestamp in PostgreSQL

select to_timestamp(column_name, 'DD-MM-YYYY hh24:mi:ss') from table_name


4. Unzip a file in Linux

cd path_to_directory_containing_file
unzip file_name


5. Log of tomcat is found at

/tomcat/logs/catalina.out


6. Rename a table in PostgreSQL

ALTER TABLE old_table_name RENAME TO new_table_name;


7. Disable logging in Java Xerces (“[Fatal Error] :1:1: Content is not allowed in prolog.”)

Setting the ErrorHandler to null suppresses the Fatal Error print line.

parser.setErrorHandler(null);
Eg:
DocumentBuilderFactory dBF = DocumentBuilderFactory.newInstance();
DocumentBuilder builder = dBF.newDocumentBuilder();
builder.setErrorHandler(null);
Source: http://stackoverflow.com/questions/1575925/disable-logging-in-java-xerces-fatal-error-11-content-is-not-allowed-in-p


8. Display today's date and time in PostgreSQL

select now();
For the date alone, use: select current_date;

9. Extract hour from timestamp in PostgreSQL

select (extract(hour from to_timestamp(column_name, 'DD-MM-YYYY hh24:mi:ss'))) from table_name;




No Cloud is so Dark that the Sun can't Shine Through!!

Tuesday 19 May 2015

Language Detection API




Recently I came across a requirement to identify the language in the given text. First I started with the language detection API. Let's have a look into the details of it:
Language detection API is a language detection web service. It accepts text and produces result with detected language code and score. It currently detects 160 languages.

Available plans

Plan name   No. of requests/day   Data usage/day   Price
Free        5,000 requests        1 MB             Free
Basic       100,000 requests      20 MB            $5/month
Plus        1M requests           200 MB           $15/month
Premium     10M requests          2 GB             $40/month

API Key

To use Language detection API we need an API key which can be obtained from:
https://detectlanguage.com/users/sign_up

API Clients

The language detection web service provides API clients for the following programming languages:
  • Ruby
  • Java
  • Python
  • PHP
  • C# (.NET)


JSON API Usage

  1. Basic detection
Submit an HTTP request to http://ws.detectlanguage.com/0.2/detect with the following parameters:
q - your text, mandatory
key - your API key, mandatory
Response is:
{"data":{"detections":[{"language":"es","isReliable":true,"confidence":10.24}]}}
Interpretation of results:
Confidence depends on how much text we pass and how well it is identified. The more text we pass, the higher the confidence value will be. It is not limited to a fixed range; it can be higher than 100.
Reliability is not directly linked to confidence. If our text contains words in different languages, isReliable: true means that the first detected language is significantly more probable than the second one. When only one language is detected, isReliable: false means that the confidence is very low.
Language is the code of the identified language. The API returns the code 'xxx' for an unknown language.
  2. Batch requests
It is possible to detect the language of several texts using one query. This saves network bandwidth and increases performance. Batch request detections are counted as separate requests, i.e. if 3 texts were passed they will be counted as 3 separate requests.
Example response:
{"data":{"detections":[[{"language":"es","isReliable":true,"confidence":10.24}],[{"language":"en","isReliable":true,"confidence":11.94}]]}}
  3. Accessing plan details
User request and data counters can be accessed at http://ws.detectlanguage.com/0.2/user/status
Example response:
{"date":"2015-05-19","requests":0,"bytes":0,"plan":"FREE","plan_expires":null,"daily_requests_limit":5000,"daily_bytes_limit":1048576,"status":"ACTIVE"}
  4. Language support
The list of all supported languages is available at:
  5. Secure mode (SSL)
Texts submitted to the API are used by the language detection engine only. Texts are not stored or used in any other way. If you are passing sensitive information to the API, you can use the HTTPS protocol to ensure secure network transfer.
Source: https://detectlanguage.com/

Sample Code:

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;
import java.net.URLEncoder;

public class LanguageDetection {

    public void detectLanguage(String message) throws IOException {
        String head = "http://ws.detectlanguage.com/0.2/detect?q=";
        String apiKey = "&key=your API key"; // replace with your own API key
        // URL-encode the whole text, not only the whitespace
        String text = URLEncoder.encode(message, "UTF-8");
        HttpURLConnection connection =
                (HttpURLConnection) new URL(head + text + apiKey).openConnection();
        // try-with-resources closes the reader even if reading fails
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(connection.getInputStream(), "UTF-8"))) {
            StringBuilder response = new StringBuilder();
            String current;
            while ((current = in.readLine()) != null) {
                response.append(current);
            }
            System.out.println(response);
        } catch (IOException e) {
            e.printStackTrace();
        } finally {
            connection.disconnect();
        }
    }

    public static void main(String args[]) throws IOException {
        LanguageDetection langDetect = new LanguageDetection();
        langDetect.detectLanguage("suprabhatham");
    }
}

Result:

{"data":{"detections":[{"language":"sa","isReliable":true,"confidence":15.75}]}}
The language was identified as Sanskrit.
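
To use the detected language in a program instead of just printing the raw response, the language codes have to be pulled out of the JSON. The following is only a rough sketch: it uses a regular expression so that it needs no extra library, and the class name DetectionResponseSketch is made up for illustration. In a real project a proper JSON parser would be the safer choice.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class DetectionResponseSketch {

    // Matches "language":"xx" pairs and captures the code between the quotes
    private static final Pattern LANGUAGE =
            Pattern.compile("\"language\"\\s*:\\s*\"([^\"]+)\"");

    // Returns every language code found in the response, in order of appearance
    static List<String> languages(String json) {
        List<String> codes = new ArrayList<String>();
        Matcher matcher = LANGUAGE.matcher(json);
        while (matcher.find()) {
            codes.add(matcher.group(1));
        }
        return codes;
    }

    public static void main(String[] args) {
        // Response text copied from the result shown in this post
        String response = "{\"data\":{\"detections\":[{\"language\":\"sa\","
                + "\"isReliable\":true,\"confidence\":15.75}]}}";
        System.out.println(languages(response));
    }
}
```

The same method also works on a batch response, where it returns one code per detection.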


A great attitude makes a great life!!

Friday 15 May 2015

The Google Geocoding API


Nowadays the word Google has become a synonym for "search". To find any information we just Google it. The term "Googling" has now become a word in English! It has become one of my favourite words. Let's see how we can get the details about a place using Google.
Geocoding is the process of converting addresses to geographical coordinates. The reverse process is termed reverse geocoding. The Google Geocoding API provides a direct way to access these services via an HTTP request.

Usage Limits

The Google Geocoding API has the following limits in place:
Users of the free API:
2,500 requests per 24 hour period.
5 requests per second.
Google Maps API for Work customers:
100,000 requests per 24 hour period.
10 requests per second.

Geocoding API Request Format

A Geocoding API request must be of the following form:
https://maps.googleapis.com/maps/api/geocode/output?parameters
where output can be either json or xml
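
As a small illustration of the request format, the sketch below only builds the request URL as a string and does not call the service. The class name GeocodeUrlSketch is made up, and the address query parameter is taken from Google's Geocoding API documentation, not from this post, so treat it as an assumption.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

public class GeocodeUrlSketch {

    private static final String BASE = "https://maps.googleapis.com/maps/api/geocode/";

    // output is "json" or "xml"; address is free text such as "Trivandrum, India"
    static String buildUrl(String output, String address) {
        try {
            // The address must be URL-encoded because it may contain spaces and commas
            return BASE + output + "?address=" + URLEncoder.encode(address, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is available on every JVM, so this is not expected to happen
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(buildUrl("json", "Trivandrum, India"));
    }
}
```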

Maven Dependency:

<dependencies>
  <dependency>
    <groupId>com.google.code.geocoder-java</groupId>
    <artifactId>geocoder-java</artifactId>
    <version>0.16</version>
  </dependency>
</dependencies>

Sample:

import java.io.IOException;

import com.google.code.geocoder.Geocoder;
import com.google.code.geocoder.GeocoderRequestBuilder;
import com.google.code.geocoder.model.GeocodeResponse;
import com.google.code.geocoder.model.GeocoderRequest;

public class Main {

    public static void main(String args[]) throws IOException {
        final Geocoder geocoder = new Geocoder();
        GeocoderRequest geocoderRequest = new GeocoderRequestBuilder()
                .setAddress("Trivandrum, India")
                .setLanguage("en")
                .getGeocoderRequest();
        GeocodeResponse geocoderResponse = geocoder.geocode(geocoderRequest);
        System.out.println(geocoderResponse);
    }
}


Result:


GeocodeResponse{status=OK, results=[GeocoderResult{types=[locality, political], formattedAddress='Thiruvananthapuram, Kerala, India', addressComponents=[GeocoderAddressComponent{longName='Thiruvananthapuram', shortName='TVM', types=[locality, political]}, GeocoderAddressComponent{longName='Thiruvananthapuram', shortName='TVM', types=[administrative_area_level_2, political]}, GeocoderAddressComponent{longName='Kerala', shortName='KL', types=[administrative_area_level_1, political]}, GeocoderAddressComponent{longName='India', shortName='IN', types=[country, political]}], geometry=GeocoderGeometry{location=LatLng{lat=8.524139099999999, lng=76.9366376}, locationType=APPROXIMATE, viewport=LatLngBounds{southwest=LatLng{lat=8.3867048, lng=76.84168699999999}, northeast=LatLng{lat=8.6127611, lng=77.0070362}}, bounds=LatLngBounds{southwest=LatLng{lat=8.3867048, lng=76.84168699999999}, northeast=LatLng{lat=8.6127611, lng=77.0070362}}}, partialMatch=false}]}




Never Ever Give UP!!

Wednesday 13 May 2015

Tips and Tricks- 2

1. What is JConsole?

The JConsole command launches a graphical console tool that enables you to monitor and manage Java applications on a local or remote machine.
JConsole displays useful information such as thread usage, memory consumption, and details about class loading, runtime compilation, and the operating system.

Uses of JConsole:

  • Monitoring a running application
  • Dynamically changing several parameters in the running system

Using JConsole

JConsole requires Java 5 or later. JConsole comes with the JDK (but not the JRE) and can be found in the %JDK_HOME%/bin directory. To launch JConsole, open a terminal or command window, change to the directory containing it, and execute jconsole. When JConsole starts, it shows a window listing the managed Java VMs on the machine. The process id (pid) and command line arguments for each Java VM are displayed. Select one of the Java VMs, and JConsole attaches to it.

JConsole interface

The JConsole interface in JDK 7 is composed of the following six tabs:
  • Overview displays summary information on the JVM and monitored values.
  • Memory displays information about memory use.
  • Threads displays information about thread use.
  • Classes displays information about class loading.
  • VM Summary displays information about the JVM.
  • MBeans displays information about MBeans. (MBean or managed bean is a Java object that represents a manageable resource such as an application, a service, a component, or a device.)

Source: http://www.techrepublic.com/blog/software-engineer/monitor-and-manage-java-applications-with-jconsole/
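
Some of the figures shown on the Memory, Threads and Classes tabs can also be read from inside a running program through the standard java.lang.management API. The following is a minimal sketch, not how JConsole itself is implemented; the class name JvmStatsSketch and its method names are made up for illustration.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryUsage;

public class JvmStatsSketch {

    // Number of live threads in this JVM (compare with the Threads tab)
    static int liveThreads() {
        return ManagementFactory.getThreadMXBean().getThreadCount();
    }

    // Heap memory currently in use, in bytes (compare with the Memory tab)
    static long heapUsedBytes() {
        MemoryUsage heap = ManagementFactory.getMemoryMXBean().getHeapMemoryUsage();
        return heap.getUsed();
    }

    // Number of classes currently loaded (compare with the Classes tab)
    static int loadedClasses() {
        return ManagementFactory.getClassLoadingMXBean().getLoadedClassCount();
    }

    public static void main(String[] args) {
        System.out.println("Live threads  : " + liveThreads());
        System.out.println("Heap used (B) : " + heapUsedBytes());
        System.out.println("Loaded classes: " + loadedClasses());
    }
}
```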

2. How to prevent the computer from automatically sleeping in Windows 7?

1. Open Power Options by clicking the Start button, clicking Control Panel, clicking System and Security, and then clicking Power Options.
2. On the Select a power plan page, click Change plan settings next to the selected plan.
3. On the Change settings for the plan page, click Change advanced power settings.
4. On the Advanced settings tab, double-click Sleep, double-click Sleep after. If you're using a desktop computer, click Setting, click the arrow, and then click Never.
5. Click OK, and then click Save changes.
Source: http://windows.microsoft.com/en-in/windows7/sleep-and-hibernation-frequently-asked-questions


Think Left and Think Right
Think Low and Think High
The Thinks You can Think
If only You TRY!

Tips and Tricks- 1

1. Export pgadmin3 query results to a file

In pgAdmin III there is an option to export to file from the query window. In the main menu it's Query -> Execute to file or there's a button that does the same thing (it's a green triangle with a blue floppy disk as opposed to the plain green triangle which just runs the query).

Source: http://stackoverflow.com/questions/1517635/save-pl-pgsql-output-from-PostgreSQL-to-a-csv-file

2. Convert varchar to number in PostgreSQL and displaying results in sorted order

select * from table_name order by to_number(column_name,'99') desc;

Here '99' is the format pattern used to parse the text in the column (in this case a number of up to two digits).

3. Change data type of a column in PostgreSQL

alter table table_name alter column column_name type datatype

4. Add a column to existing table in PostgreSQL

alter table table_name add column column_name  datatype

5. Enable only error message logging in log4j properties file

# Root logger option
log4j.rootLogger=ERROR

6. Thread Dump

A Java thread dump is a way of finding out what every thread in the JVM is doing at a particular point in time. It lists the Java threads that are currently active in a Java Virtual Machine. This is especially useful if a Java application sometimes seems to hang when running under load, as an analysis of the dump will show where the threads are stuck. We can generate a thread dump under Unix/Linux by running kill -QUIT <pid>, and under Windows by pressing Ctrl + Break.

Source: http://stackoverflow.com/questions/12277091/what-is-the-meaning-of-thread-dump
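
A similar listing can be produced from inside the application with Thread.getAllStackTraces(). The sketch below is only an illustration of the idea: the class name ThreadDumpSketch is made up, and the output is much simpler than a real JVM thread dump (no lock or monitor information).

```java
import java.util.Map;

public class ThreadDumpSketch {

    // Builds a text listing of every live thread with its state and stack frames
    static String dump() {
        StringBuilder sb = new StringBuilder();
        for (Map.Entry<Thread, StackTraceElement[]> entry : Thread.getAllStackTraces().entrySet()) {
            Thread thread = entry.getKey();
            sb.append('"').append(thread.getName()).append("\" state=")
              .append(thread.getState()).append('\n');
            for (StackTraceElement frame : entry.getValue()) {
                sb.append("    at ").append(frame).append('\n');
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.print(dump());
    }
}
```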

7. Get thread dump in linux

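 top
    - Command used to display the processes using the most CPU

We'll get the pid of the java process from this.

To display the thread dump:
kill -3 <PID>
The thread dump is written to the standard output of the Java process (the console it was started from, or the file that output is redirected to).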

8. Eclipse: The specified JRE installation does not exist

This is how I fixed it:
    1. Open Eclipse.
    2. Go to Preferences.
    3. Click Add.
    4. A window should pop up.
    5. Select Standard VM.
    6. Select Directory.
    7. Use this path: Library/Java/JavaVirtualMachines/jdk1.7.0_45.jdk/
    8. Click Open.
    9. Then Finish.
    10. Right click your project, then click Properties.
    11. Select Java Build Path, then click Add Library.
    12. Select JRE System Library.
    13. Click Environments and select the jdk1.7.0_45.
    14. Finish.

Source: http://stackoverflow.com/questions/26477692/eclipse-the-specified-jre-installation-does-not-exist



Rainbow comes after a little rain!!

Saturday 9 May 2015

More about R

Pros and Cons of R

Advantages

  • Open Source
  • Built-in data analytic and statistical functions
  • Interfaces with databases
  • Data handling and storage
  • High quality graphics capability

Disadvantages

  • Steep learning curve
  • Working with large datasets is limited by RAM
  • Language interpreter can be very slow
  • No professional or commercial support

High Performance Computing using R

The major limitations of R are:
  • R by default uses only a single core regardless of the number of cores of the CPU.
  • R reads all data into memory.

To resolve these problems, different packages for parallel computing and large data have been introduced in R. Some of them are parallel, ff and bigmemory.
  • parallel: This package includes the functionality of the snow and multicore packages. Even though it supports parallelism, the performance degrades while dealing with large datasets.
  • ff: This package provides file-based access to datasets that do not fit in memory. The main bottleneck is that it does not support character vectors.
  • bigmemory: This package uses external pointers to refer to large objects stored in memory.

Alternatives to R

The following are the main alternatives to R:
MATLAB: MATLAB is a high-level language and interactive environment for numerical computation, visualization, and programming. MATLAB has stronger support for the physical sciences, while R is stronger for statistics.
Maxima: Maxima is a free computer algebra system written in Lisp, based on the 1982 version of Macsyma.
Gnuplot: Gnuplot is a plotting program and is much simpler than R.
Scilab: Scilab is an open source language for numerical computation. The syntax is similar to MATLAB.
Octave: Octave is a high-level language used for numerical computation.
Mahout: Mahout is a library of machine learning algorithms built on top of Apache Hadoop and MapReduce.

R Vs Mahout

Apache Hadoop is used for the processing of big data. Mahout is a machine learning system that runs on Hadoop. The major drawback of R is in terms of its memory limitations. Generally R needs three times the dataset size in RAM to be able to work comfortably. Hence Mahout is the best alternative to R when dealing with large datasets.

Model                          Implementation in R   Implementation in Mahout
Decision Tree                  Yes                   No
Random Forest                  Yes                   Yes
Stepwise Logistic Regression   Yes                   No
Neural Networks                Yes                   No
Continuous network (Y)         Yes                   No
Table: Comparison of R and Mahout

Note


R is a free statistical and graphical programming language. It contains many advanced statistical routines. It runs on a variety of platforms including UNIX, Windows and MacOS. Lack of futuristic insights and complex predictive analytics algorithms are some of the major pitfalls of presently available data analytic tools. R as a statistical and predictive language resolves all these issues. It is recommended in scenarios where the different steps of analysis should be documented for future updates.


You may be bored of reading about R by now. I'll update you on another interesting technology in the next post!
Happy Learning