Monday, July 20, 2020

Custom Scripts in Oracle Analytics Desktop - Geocoding

Oracle Analytics is a comprehensive unified platform offering services including self-service visualizations, powerful data-prep tool, advanced analytics and augmented analytics.
In this blog, we are going to explain the use of leveraging custom-scripts using Oracle Analytics Desktop (OAD) and few tips and tricks to install additional custom Python packages that do not come by default with the Python engine.

Let us take the following scenario : a user has a dataset with an address column and needs to plot these address on a map visualization. In technical terms, he needs a geo-coding capability which converts valid addresses to corresponding Lat & Longs which can then be used for plotting on a map. .

One way for the user to achieve that would be to leverage Spatial Studio application (free with Oracle Database). Spatial Studio is a self-service web application that makes it easy for users to create interactive maps and perform spatial analysis on their business data. User just needs to upload his dataset into Spatial Studio and the geocoding service can then be invoked to find the corresponding Lat/Longs. Once done, the user can simply visualize his dataset in Oracle Analytics platform. 

But the user has another option to achieve the same result : using Custom Scripts in OAD.
Oracle Analytics Desktop comes with in an in-built Python engine. This Python engine is wrapped around an internal layer called DVML which needs to be installed for a user to run custom scripts on OAD.
Search for Install DVML and install the application

As of OAD 5.6, DVML uses Python version 3.5.2. OAD expects a particular format for a script to be accepted and executed by it's Python engine. This should be an .xml file with well-defined inputs, outputs and options tags. We can break down the process of creating a custom Python script on OAD into 2 simple steps
  1. Have the script working on a standalone Python editor
  2. After completing the first step, wrap the script in the form of an XML wrapper by properly defining the input and output tags and also by making sure that the resultant dataframe of the Python script matches with the output structure defined in the XML.
Coming back to our scenario where a user wants to perform geocoding, this functionality can easily achieved in Python through a package called geopy

A custom python script can simply call the geocode function from this package, using the Address column as argument. If user creates such a custom Python script and uploads it into OAD, he can then simply invoke it from an Oracle Analytics Data Flow by using the "Apply Custom Script" node. This will let the OAD's Python engine to execute the geocoding script. But at this point, when the dataflow is executed the user is likely to see an error...

Checking the logs it becomes clear that geopy package was not found by DVML python engine. This is an indication that the geopy package needs to be installed as a part of DVML. When a user installs DVML, it comes with default set of commonly used Python packages like pandas, numpy, scikit-learn, etc. But how to install new python packages into DVML ?

Installing Python packages into DVML

Here are the steps to be followed to install Python packages into DVML on a MS Windows machine :  search for DVML Runtime and choose the option "Open File Location".

This will open the folder which has the shortcut to DVML. To get the actual location as to where the DVML files are installed, right click on the short-cut and choose the options "Open File Location".
If the default path was chosen during the initial DVML installation (MS Windows machine), the DVML files will be under this directory : C:\Program Files\DVMLRuntime\PCbuild\amd64

Open cmd prompt and run it as an administrator. Navigate to the above path and type in the following command : python -m pip install geopy

This should install the geopy package as part of DVML. 

If you encounter any issues in the above step you can try this alternative approach : search for DVML Runtime and open it as an administrator. This open up a Python prompt where a user can type in python commands. Type the following commands to install python packages.
from pip._internal import main

main.main(["install","geopy"])

In both approaches, replace geopy with the package you want to install and it should install the package into DVML. Once the user has imported all the relevant packages, these packages can be imported and used in the OAD custom scripts. Restarting the OAD and running the geocoding dataflow should now result in a successful execution.

In this approach of geocoding, the user needs not switch between applications or do multiple upload or download of files to get the resultant output. The user can stay in OAD, run the Python script and get the resultant output.

Anything that you can do in standalone Python editor like a Jupyter notebook is possible in OAD by using Dataflows and custom scripts.
As of today, custom scripts import are only applicable to desktop or On Prem (OAS) versions of Oracle Analytics. The cloud version OAC is being worked on to support similar capability in the near future.

Geocoding Script Details


As evident from the above pic, geopy package was imported into the script along with few other packages. There are many providers who offer the geocoding capability. Some of them include Google, OpenStreetMap, Bing, etc. In order to use Google or Bing service, the user should sign up on their platform and get a valid API key to pass it along the geocode request. OpenStreetMap offers geocoding service called Nominatim which offers free geocoding service without the need for the API key. However it requires a user_agent to be passed. This can be the App id of the application or just an email address.

In this script, I have used the Nominatim geocoding service and have used my email address as the user agent. Also another aspect about the Nominatim geocoding service is that there a limit on the number of requests it can serve at a moment. Hence I have used the ratelimiter package to send only one row per second to the Nominatim server. A place name will be passed to the Nominatim geocoding service which results in the address and the Lat/Long coordinates of the place. The results are then passed into the output dataframe.   

Thanks for reading the blog and you can download the geocoding custom script here for your future references.

No comments:

Post a Comment