Tuesday, March 27, 2018

Pre-Processing and Preparing Data for ML Predictions in OAC

In this blog post we will look at how to use the inbuilt methods in Oracle Analytics Cloud (OAC) to cleanse and prepare the data used for training a Machine Learning model.

One of the important steps in training a Machine Learning model is to cleanse and prepare the data that will be used to train the model. What exactly do we mean by "cleanse and prepare the data"?

  "It is the process of detecting and correcting (or removing) corrupt or inaccurate records from a
    record set, table, or database and refers to identifying incomplete, incorrect, inaccurate or
    irrelevant parts of the data and then replacing, modifying, or deleting the dirty or coarse data."


It is important to cleanse and prepare data because training a model on missing column values or on anomalies/outliers (whether garbage data or genuine outliers) can throw the model's prediction accuracy off. To address this, the Machine Learning feature in Oracle Analytics Cloud provides inbuilt methods for data cleansing/preparation that can be invoked from Train Model scripts.

In this blog post we will learn how to cleanse/prepare training data in a custom Train Model script using the inbuilt methods of Machine Learning in Oracle Analytics Cloud. The data cleansing/preparation process can be broken down into three steps: Data Imputation (filling missing values), Encoding (converting categorical to numerical values where necessary) and Standardization (normalization). All the functions needed to perform these operations are implemented in a Python module called datasetutils within OAC. Users can develop their own functions or use the existing module. Here is a snapshot of the parameters accepted from the UI for each of these operations while training a model:

[Screenshot: data preparation parameters in the Train Model UI]

Here is a brief description of each of the data preparation functions:

 1) Data Imputation: Data imputation is the process of filling in missing values. There are multiple inbuilt imputation methods, and users can choose the method used to fill the missing values for both numerical (Mean, Median, Min, Max) and categorical (Most Frequent, Least Frequent) variables. The datasetutils Python module in OAC contains a function called fill_na() that performs data imputation. It accepts the imputation methods as parameters and returns a dataframe with the data imputed for both categorical and numerical columns. The following snippet shows a sample usage of the fill_na() function:

     # Fill missing values with the mean/max/min/median for numerical columns
     # and the most/least frequent value for categorical columns
     df = datasetutils.fill_na(df, max_null_percent=max_null_value_percent,
                               numerical_impute_method=numerical_impute_method,
                               categorical_impute_method=categorical_impute_method)
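Since datasetutils ships inside OAC, its source is not shown here. As a rough illustration, the sketch below re-creates the shape of such an imputation helper in plain pandas; the name fill_na_sketch, the percentage semantics of max_null_percent and the method labels are assumptions for illustration, not OAC's actual implementation:

```python
import pandas as pd

def fill_na_sketch(df, max_null_percent=50,
                   numerical_impute_method="Mean",
                   categorical_impute_method="Most Frequent"):
    """Illustrative imputation: drop overly sparse columns, fill the rest."""
    df = df.copy()
    # Drop columns whose percentage of missing values exceeds the threshold
    keep = [c for c in df.columns if df[c].isna().mean() * 100 <= max_null_percent]
    df = df[keep]
    for col in df.columns:
        if pd.api.types.is_numeric_dtype(df[col]):
            fillers = {"Mean": df[col].mean(), "Median": df[col].median(),
                       "Min": df[col].min(), "Max": df[col].max()}
            df[col] = df[col].fillna(fillers[numerical_impute_method])
        else:
            counts = df[col].value_counts()
            fill = (counts.idxmax() if categorical_impute_method == "Most Frequent"
                    else counts.idxmin())
            df[col] = df[col].fillna(fill)
    return df
```

With the defaults, a numeric column with one gap is filled with the column mean and a categorical gap is filled with the most frequent value.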


 2) Encoding: Encoding is the process of converting categorical variables to numerical values, which is usually required when regression needs to be performed. There are two inbuilt encoding methods: Onehot and Indexer. The features_encoding() function in the datasetutils module performs the encoding. The following snippet of code performs encoding:

     # Encode categorical features
     data, input_features, categorical_mappings = datasetutils.features_encoding(
         df, target, None, encoding_method=encoding_method, type="regression")
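To illustrate the difference between the two methods, here is a hypothetical plain-pandas stand-in (encode_features_sketch is an assumed name, not the features_encoding() implementation): Onehot expands each category into 0/1 indicator columns, while Indexer replaces each category with an integer code and keeps a mapping back to the original labels.

```python
import pandas as pd

def encode_features_sketch(df, categorical_cols, encoding_method="onehot"):
    """Illustrative encoding of categorical columns to numerical values."""
    df = df.copy()
    mappings = {}
    if encoding_method == "onehot":
        # One 0/1 indicator column per category value
        return pd.get_dummies(df, columns=categorical_cols), mappings
    # Indexer: map each category value to an integer code
    for col in categorical_cols:
        codes, uniques = pd.factorize(df[col])
        df[col] = codes
        mappings[col] = dict(enumerate(uniques))
    return df, mappings
```

The returned mappings dictionary lets an Apply script translate predictions or feature codes back into the original category labels.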

 3) Standardization: Standardization is the process of normalizing the data to reduce the effect of skews introduced by outliers. The standardize_clean_data() function in the datasetutils module is the inbuilt OAC method for standardization. The following sample snippet standardizes the data and returns a dataframe with the normalized values:

     # Standardize the data so that it is normalized, reducing the influence of outliers
     target_col = data[[target]]
     if standardization_flag:
         features_df = datasetutils.standardize_clean_data(data, input_features)
     else:
         features_df = data[features]
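A common way to standardize is the z-score: subtract each feature's mean and divide by its standard deviation, so every feature ends up centered at zero with unit spread. The sketch below shows this in plain pandas (standardize_sketch is a hypothetical stand-in; the actual behavior of standardize_clean_data() in OAC may differ):

```python
import pandas as pd

def standardize_sketch(df, feature_cols):
    """Illustrative z-score standardization of the selected feature columns."""
    out = df[feature_cols].copy()
    for col in feature_cols:
        std = out[col].std()
        # Guard against constant columns, where the standard deviation is zero
        out[col] = (out[col] - out[col].mean()) / std if std else 0.0
    return out
```

After this transformation each standardized column has mean 0, so large-magnitude features no longer dominate distance- or gradient-based models.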

All these functions can be invoked directly from custom Train Model scripts by importing the datasetutils module at the beginning of the script.
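Putting the three steps together, a minimal end-to-end sketch of the flow might look like the following. It uses plain pandas operations as stand-ins for the datasetutils calls, and the toy data and column names are invented purely for illustration:

```python
import pandas as pd

# Toy training data with a missing numeric value and a missing category
df = pd.DataFrame({"age": [25.0, None, 40.0],
                   "city": ["NY", "SF", None],
                   "target": [1.0, 2.0, 3.0]})

# 1) Imputation: mean for the numeric column, most frequent for the categorical one
df["age"] = df["age"].fillna(df["age"].mean())
df["city"] = df["city"].fillna(df["city"].mode()[0])

# 2) Encoding: one-hot encode the categorical column
df = pd.get_dummies(df, columns=["city"])

# 3) Standardization: z-score the numeric feature
df["age"] = (df["age"] - df["age"].mean()) / df["age"].std()
```

After these three steps the dataframe has no missing values and only numerical columns, which is the shape a training algorithm expects.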

Related Blogs: How to build Train/Apply custom model scripts in OAC, How to Create Related Datasets, How to Populate Quality Tab
