Tuesday, February 14, 2017

OracleDV : Geocode Address data with Custom R-Script

Pre-requisites: Internet connection

In this blog we will discuss about an exciting possibility on OracleDV: How to Geocode Address data to extract latitude and longitude information using custom R-script. This Custom R-Script takes Address data and locates the address by identifying Latitude and Longitude. Address can either be a single line like "200 Oracle Pkwy,Redwood City, CA 94065" or it can be in a general format with Street, House Number, Area, City, State, Country details specified.  This script can be downloaded from Oracle BI Public Store and here is how your DV Desktop will look like after deploying this plugin:

Please note that you have to deploy Custom Points Map custom visualization plugin from BI Public store to get this rendering.

What is Geocoding and how does this script work: It is a computational process of transforming postal address description to a location in the earth's surface. This Script uses Oracle Geocoder web service. So internet connection is required for this script to work. As of now Oracle Geocoder service doesnt support following countries China, South Korea and Japan. If you want to geocode a non-us address, make sure that you pass country name as an input to EVALUATE_SCRIPT.  Geocoder service accepts requests in many xml formats. In this script we generate either Unformatted Address type XML which has the following structure:

<geocode_request vendor="elocation">
    <input_location id="1000">
      <input_address match_mode="DEFAULT">
        <unformatted Country="Brazil">
          <address_line value="Pawtucket 7935 Shasta Way"/>

or General form XML which has following structure:

<geocode_request vendor="elocation">
    <input_location id="27010">
      <input_address match_mode="DEFAULT">
        <us_form2 street="500 oracle pky" city="redwood city" state="ca"/>

In the above formats, elements and attributes between <input_location id = "XXXX">  and </input_location> represents a row. Depending on the input address format type, in the script we construct xml request with one of the above formats. We batch those requests and send 1000 records at a time for geocoding. This script uses httr and XML R packages.

Please note that this script does not guarantee to geocode each and every address given, neither does it guarantee to provide accurate address these are only approximations. To differentiate between the addresses that could be found and not found, we have added a flag called "address_located" which will be set to "Y" if the address could be geocoded and "N" if the address cannot be geocoded.

Inputs to the script:
This script can take two kinds of inputs: First is an address passed in as a single unformatted string for example "500 Oracle Parkway. Redwood Shores, CA 94065." or address in general format which is broken into individual components like House number, street, city, state, Postal code etc. Please note ID column is a mandatory input for both the kinds of inputs. Here is a brief on what inputs should be sent to the R-script for each of these addresss kinds.
1) Unformatted : If your address is an unformatted single string, then you should pass in id column(which is the identity column required to identify the address) and address_string which contains the unformatted address.
NOTE: If the country details are missing in the unformatted string, please specify the country name for greater accuracy.
2) GeneralForm : In your dataset if you have the address in formatted structure with broken up components like Street, City, State, Country etc then along with ID column you need to send Street,  City,State,Country,Postal_code details.

Optional Inputs:
1) address_format : Specify the address format of the inputs. Like mentioned above currently we support two address formats : "unformatted" and "gen_form"
2) service_url : Specify the URL for Oracle geocoder service. If you do not specify anything by default the current ongoing version of Oracle Geocoder service.
3) proxy_url : If your network requires you to use some proxy(because of some firewall or VPN), please specify the proxy url with the port number.

This R-Script outputs Longitude, Latitude for the address and it also returns an additional column called address_located which tells whether the address could be geocoded or not. If address_located is "Y" the address was successfully geocoded. If address_located is "N" then the address could not be located successfully.

Steps to deploy this R-Script plugin in your local OracleDV:

1) Install Advanced Analytics feature in Oracle DV by clicking on the below icon. This will install Oracle R deployment. Alternatively you can install Advanced Analytics by running install_advanced_analytics.cmd present in <DV_INSTALL_DIRECTORY>

2) If not installed httr and XML R packages already, please install it using following instructions
    Open R console(double click Rgui.exe present in <Advanced_Analytics_Install_Dir>\bin\x64) and
    install httr and XML Packages.
    Following are the R commands to install:
     Set Proxy:
        $ Sys.setenv(http_proxy="http://<your_proxy_host>:<port>")
           set proxy appropriate to your network settings.
     Install Package:
        $ install.packages("httr")
        $ install.packages("XML")
3) Download Geocoding_V1.zip from OracleBI Public Store and unzip it.
4) Copy R.ReverseGeocoding.xml to <DV_INSTALL_DIRECTORY>\OracleBI1\bifoundation\advanced_analytics\script_repository
5) Download and deploy Custom Points Map custom visualization plugin from Oracle BI Public Store. Instructions to deploy this Custom Viz plugin are described in the Public store.
6) Import the .dva project to Oracle DV. Password for the .dva file is Admin123
NOTE: Use Proxy_Url optional parameter in EVALUATE_SCRIPT only if your network requires you to use proxy.

1 comment:


Made following additions to the blog:
Update on Supported countries: Earlier we mentioned only countries from North and South American continents are supported. Corrected it. All countries except Japan, South Korea, China are supported. If country name is not part of your address string, for greater accuracy please pass in the country name.

Post a Comment