In this blog post we will look at how to use the inbuilt methods in Oracle Analytics Cloud (OAC) to cleanse and prepare the data used to train a Machine Learning model.
One of the important steps in training a Machine Learning model is to cleanse and prepare the data that will be used to train the model. What exactly do we mean by "cleanse and prepare the data"?
"It is the process of detecting and correcting (or removing) corrupt or inaccurate records from a
record set, table, or database and refers to identifying incomplete, incorrect, inaccurate or
irrelevant parts of the data and then replacing, modifying, or deleting the dirty or coarse data."
Cleansing and preparing data is important because if we train a model on data with missing column values or with anomalies/outliers, whether garbage records or genuine extreme values, the model's prediction accuracy can suffer. To address this, the Machine Learning feature in Oracle Analytics Cloud provides inbuilt methods for data cleansing and preparation that can be invoked from Train Model scripts.
In this blog post we will learn how to cleanse and prepare training data in a custom Train Model script using the inbuilt methods in Oracle Analytics Cloud. The data cleansing/preparation process can be broken down into three steps: Data Imputation (filling missing values), Encoding (converting categorical to numerical values where necessary) and Standardization (normalization). All the functions needed to perform these operations are implemented in a Python module called datasetutils within OAC. Users can develop their own functions or use this existing module. Here is a snapshot of the parameters accepted from the UI for each of these operations while training a model:
Here is a brief description of each of the data preparation functions:
1) Data Imputation: Data imputation is the process of filling missing values. There are multiple inbuilt imputation methods, and users can choose the method used to fill missing values for both numerical (Mean, Median, Min, Max) and categorical (Most Frequent, Least Frequent) variables. The datasetutils Python module in OAC contains a function called fill_na() that performs data imputation. The function accepts the imputation methods as parameters and returns a dataframe with data imputed for both categorical and numerical columns. The following snippet shows a sample usage of the fill_na() function:
# Fill NaN columns with mean/max/min/median values for numeric columns
# and most frequent/least frequent values for categorical columns
df = datasetutils.fill_na(df, max_null_percent=max_null_value_percent,
                          numerical_impute_method=numerical_impute_method,
                          categorical_impute_method=categorical_impute_method)
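For intuition about what such an imputation step does, here is a minimal pandas sketch of the same idea. This is an illustration only, not the internal implementation of datasetutils.fill_na(); the helper name impute_dataframe and its parameters are hypothetical.

import pandas as pd

def impute_dataframe(df, numerical_method="mean", categorical_method="most_frequent"):
    # Illustrative imputation: numeric NaNs are filled with a summary
    # statistic, categorical NaNs with the most/least frequent value.
    df = df.copy()
    for col in df.columns:
        if pd.api.types.is_numeric_dtype(df[col]):
            stats = {"mean": df[col].mean(), "median": df[col].median(),
                     "min": df[col].min(), "max": df[col].max()}
            df[col] = df[col].fillna(stats[numerical_method])
        else:
            counts = df[col].value_counts()  # sorted most frequent first
            fill = counts.index[0] if categorical_method == "most_frequent" else counts.index[-1]
            df[col] = df[col].fillna(fill)
    return df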
2) Encoding: Encoding is the process of converting categorical variables to numerical values, which is usually required when regression needs to be performed. There are two inbuilt encoding methods: Onehot and Indexer. The features_encoding() function in the datasetutils module performs encoding. The following snippet performs encoding:
# Encode categorical features
data, input_features, categorical_mappings = datasetutils.features_encoding(
    df, target, None, encoding_method=encoding_method, type="regression")
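To see the difference between the two encoding styles, the following standalone pandas sketch contrasts one-hot encoding with index (label) encoding. It is a simplified stand-in and does not reflect the internals of features_encoding().

import pandas as pd

df = pd.DataFrame({"color": ["red", "green", "red", "blue"]})

# Onehot-style: one binary column per distinct category value
onehot = pd.get_dummies(df, columns=["color"])

# Indexer-style: map each category value to an integer code
indexed = df.copy()
indexed["color"] = indexed["color"].astype("category").cat.codes

One-hot encoding avoids implying an artificial ordering between categories, at the cost of adding one column per category value; index encoding keeps a single column but encodes an ordering that regression models may pick up on.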
3) Standardization: Standardization is the process of normalizing the data to reduce the effect of skews introduced by outliers. The standardize_clean_data() function in the datasetutils module is the inbuilt OAC method for standardization. The following sample snippet performs standardization and returns a dataframe of standardized data:
# Standardize the data so that it is normalized, reducing the
# influence of outliers
target_col = data[[target]]
if standardization_flag:
    features_df = datasetutils.standardize_clean_data(data, input_features)
else:
    features_df = data[input_features]
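For intuition, standardization typically rescales each numeric feature to zero mean and unit variance (a z-score), so a single extreme value no longer dominates the scale. Here is a minimal sketch using scikit-learn; it may differ from what standardize_clean_data() does internally, and the feature names and values are made up for illustration.

import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical numeric features, including one extreme income value
data = pd.DataFrame({"age": [25, 40, 35, 90],
                     "income": [30000, 52000, 48000, 500000]})
features = ["age", "income"]

# Rescale each column to zero mean and unit variance: z = (x - mean) / std
scaler = StandardScaler()
features_df = pd.DataFrame(scaler.fit_transform(data[features]), columns=features)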
All of these functions can be invoked directly from custom Train Model scripts by importing the datasetutils module at the beginning of the script.
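Putting it all together, a custom Train Model script might chain the three steps as outlined below. This is a sketch based on the snippets above: the datasetutils calls are as shown earlier, while variables such as df, target, standardization_flag and the *_method parameters are placeholders you would wire to the values the script accepts from the UI.

import datasetutils  # inbuilt OAC module available to Train Model scripts

# 1) Impute missing values
df = datasetutils.fill_na(df, max_null_percent=max_null_value_percent,
                          numerical_impute_method=numerical_impute_method,
                          categorical_impute_method=categorical_impute_method)

# 2) Encode categorical features
data, input_features, categorical_mappings = datasetutils.features_encoding(
    df, target, None, encoding_method=encoding_method, type="regression")

# 3) Standardize (optional, controlled by a UI flag)
target_col = data[[target]]
if standardization_flag:
    features_df = datasetutils.standardize_clean_data(data, input_features)
else:
    features_df = data[input_features]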
Related Blogs: How to build Train/Apply custom model scripts in OAC, How to Create Related Datasets, How to Populate Quality Tab