The use of AI/ML for proactive self-evaluation of data submitted under HMDA, using a model based on CFPB published data

For the Consumer Financial Protection Bureau (CFPB) and other regulators, data has become a new tool for analyzing lending activity and identifying practices that are discriminatory or unfair. The CFPB has been accumulating data for years, and going forward, more datasets from depository and other financial institutions are being captured through regulations such as HMDA and CRA. This data offers new ground for running tests that detect irregularities. As the regulations expand, they add to the required data attributes, and the federal government uses this data to enforce consumer protection and fair lending laws.

In March, the CFPB hosted a TechSprint to gather new ideas for applying the latest technology. The goal of the TechSprint was to identify and scope additional enhancements to HMDA data products and services, visualizations, or the development of its resources. These enhancements could focus on, for example, a specific geographic area, lender, or data trend. Additionally, they might include new products, new ways to interact with existing products, data analysis capabilities, interfaces to other datasets, or guidance and aids for understanding and using HMDA data.

Team HEXANIKA participated. The team's idea was to use machine learning to predict attribute values in upcoming submission data. The broader objective was to use previously submitted data, together with correlated parameters, to uncover disparities, irregularities, or discrimination present in the submission reports.

The objectives of the process were set as follows:

  1. Predict the “Action Taken” value in the testing dataset, i.e., the loan application register
  2. Use CFPB historical data, obtained through the data publication API, to train the model
  3. Use the loan application register as the testing dataset

To create the model, the team used CFPB historical submission data. The team leveraged the data publishing APIs from the CFPB and developed a process to download the data from the CFPB database. The selected data was for a specific peer group, based on size, region, area of operation, etc. The process involved the following steps.

Exploratory data analysis – This step was used to find related attributes and to decide which data preparation steps were required.

Data Preparation – This involved:

  1. Missing value imputation – The techniques used were:
     a. Imputing values in the attributes using the mean, median, or mode
     b. Removing sparse data columns
  2. Outlier detection – The techniques used were:
     a. Z-score detection – A Z-score was computed for each cell, and values beyond a defined threshold were removed as outliers
     b. Outlier trimmer – Maximum and minimum thresholds were set to remove outliers
     c. Isolation Forest – A tree-based anomaly detection algorithm with linear time complexity, used to identify and replace outliers
  3. Feature engineering – This is the transformation step of data preparation
     a. Feature encoding – Categorical (varchar) values were converted into numeric data
  4. Handling imbalanced data – This is required to build a balanced model. This step involved selecting peer group data based on asset size, area of operations, etc. Oversampling or undersampling was also applied to reduce bias in the data.
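A few of the preparation steps above can be sketched in plain Python. This is only a minimal illustration, not the team's actual pipeline: the records, the column names (`income`, `loan_purpose`, `action_taken`), and the helper functions are hypothetical, and the Z-score threshold is lowered to 2.0 because the toy sample is tiny (a threshold near 3 is more usual on larger data).

```python
import random
import statistics
from collections import Counter

# Hypothetical toy records; column names and values are illustrative assumptions.
rows = [
    {"income": 50.0, "loan_purpose": "purchase", "action_taken": "originated"},
    {"income": 55.0, "loan_purpose": "refinance", "action_taken": "originated"},
    {"income": 60.0, "loan_purpose": "purchase", "action_taken": "denied"},
    {"income": None, "loan_purpose": "purchase", "action_taken": "originated"},
    {"income": 65.0, "loan_purpose": "refinance", "action_taken": "originated"},
    {"income": 70.0, "loan_purpose": "purchase", "action_taken": "denied"},
    {"income": 58.0, "loan_purpose": "refinance", "action_taken": "originated"},
    {"income": 900.0, "loan_purpose": "purchase", "action_taken": "originated"},
]


def impute_median(data, column):
    """Replace missing (None) values in `column` with the median of observed values."""
    observed = [r[column] for r in data if r[column] is not None]
    fill = statistics.median(observed)
    return [{**r, column: fill if r[column] is None else r[column]} for r in data]


def drop_zscore_outliers(data, column, threshold=3.0):
    """Drop rows whose Z-score in `column` exceeds `threshold` in absolute value."""
    values = [r[column] for r in data]
    mean = statistics.fmean(values)
    stdev = statistics.pstdev(values)
    if stdev == 0:
        return list(data)
    return [r for r in data if abs((r[column] - mean) / stdev) <= threshold]


def label_encode(data, column):
    """Map each distinct category in `column` to an integer code."""
    codes = {v: i for i, v in enumerate(sorted({r[column] for r in data}))}
    return [{**r, column: codes[r[column]]} for r in data], codes


def oversample(data, target, seed=0):
    """Randomly duplicate minority-class rows until all classes are the same size."""
    rng = random.Random(seed)
    groups = {}
    for r in data:
        groups.setdefault(r[target], []).append(r)
    largest = max(len(g) for g in groups.values())
    balanced = []
    for g in groups.values():
        balanced.extend(g)
        balanced.extend(rng.choices(g, k=largest - len(g)))
    return balanced


imputed = impute_median(rows, "income")
filtered = drop_zscore_outliers(imputed, "income", threshold=2.0)
encoded, codes = label_encode(filtered, "loan_purpose")
balanced = oversample(encoded, "action_taken")
class_counts = Counter(r["action_taken"] for r in balanced)
print(len(rows), len(filtered), codes, dict(class_counts))
```

Label encoding is used here for brevity; one-hot encoding is often a better fit for categories that have no natural order.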

Additionally, feature selection techniques such as K-Best, recursive feature elimination, and variance inflation factor (VIF) were used.
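The idea behind K-Best selection is to score every candidate feature against the target and keep the k highest-scoring ones. The sketch below uses the absolute Pearson correlation as the score, which is a simplifying assumption; libraries typically offer other scoring functions. The feature names and values are hypothetical.

```python
import statistics


def pearson(xs, ys):
    """Pearson correlation coefficient; returns 0.0 when either input is constant."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    if sx == 0 or sy == 0:
        return 0.0
    return cov / (sx * sy)


def select_k_best(features, target, k):
    """Rank features by |correlation with target| and return the top k names."""
    scores = {name: abs(pearson(col, target)) for name, col in features.items()}
    ranked = sorted(scores, key=scores.get, reverse=True)
    return ranked[:k], scores


# Hypothetical data: 1 = originated, 0 = denied.
target = [1, 1, 0, 1, 0, 0, 1, 0]
features = {
    "debt_to_income": [20, 25, 45, 22, 50, 48, 30, 55],
    "income": [90, 80, 40, 85, 35, 50, 70, 30],
    "application_month": [1, 2, 3, 4, 5, 6, 7, 8],
    "loan_term": [360] * 8,  # constant column carries no signal
}

selected, scores = select_k_best(features, target, k=2)
print(selected)
```

A constant column such as `loan_term` scores zero and drops out, which is the same effect the sparse-column removal step aims for.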

Training the Model:

All of these techniques were used to establish the variables most correlated with the target variable, i.e., Action Taken. Once the most relevant variables were identified, the model was developed using a training dataset. The XGBoost classifier and the Random Forest classifier were used to fit the model and predict the outcome.

Testing and Results:

The testing set is the current loan application register from the institution. This provides predictive and proactive testing of the data before submission, and it opens up the opportunity to run further analytics on the data. We found the correlation model to give 90% accuracy. However, we are most interested in the remaining 10%, i.e., the negative test cases and false positives among the failed scenarios.
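To make the review of disagreeing cases concrete, the sketch below compares recorded and predicted Action Taken values, computes accuracy, and lists the records where the two differ so they can be examined before submission. The labels and the ten sample records are made up for illustration and are not results from the model described here.

```python
from collections import Counter


def evaluate(actual, predicted):
    """Return accuracy, (actual, predicted) pair counts, and mismatching indices."""
    pairs = Counter(zip(actual, predicted))
    correct = sum(n for (a, p), n in pairs.items() if a == p)
    accuracy = correct / len(actual)
    mismatches = [i for i, (a, p) in enumerate(zip(actual, predicted)) if a != p]
    return accuracy, pairs, mismatches


# Hypothetical recorded outcomes from a loan application register.
actual = ["originated", "denied", "originated", "originated", "denied",
          "originated", "denied", "originated", "originated", "denied"]
# Hypothetical model output; record 4 is predicted differently from what was recorded.
predicted = ["originated", "denied", "originated", "originated", "originated",
             "originated", "denied", "originated", "originated", "denied"]

accuracy, pairs, mismatches = evaluate(actual, predicted)
print(accuracy, mismatches)

# Records to review: the model expected a different action than the one recorded.
for i in mismatches:
    print(f"record {i}: recorded={actual[i]}, predicted={predicted[i]}")
```

The pair counts act as a small confusion matrix: an entry such as `("denied", "originated")` counts applications that were denied although the peer-trained model expected an origination.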

This analysis supports several scenarios. Cases where peer groups are lending but the institution is not can be identified. Disparities between the stated reasons for denial and the actual denials can be found and made consistent. The reason for preparatory testing can be established. Outliers can be identified for auditing purposes.
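One simple way to look for areas where peers lend and the institution does not is to compare origination rates by segment. The sketch below is an assumption-laden illustration: the segment key (a census tract label), the 0.25 gap threshold, and the sample rows are all hypothetical.

```python
def origination_rates(rows):
    """Share of applications originated, per segment, from (segment, action) pairs."""
    totals, originated = {}, {}
    for segment, action in rows:
        totals[segment] = totals.get(segment, 0) + 1
        if action == "originated":
            originated[segment] = originated.get(segment, 0) + 1
    return {s: originated.get(s, 0) / n for s, n in totals.items()}


def lending_gaps(institution_rows, peer_rows, min_gap=0.25):
    """Segments where the peer origination rate exceeds the institution's by min_gap."""
    inst = origination_rates(institution_rows)
    peers = origination_rates(peer_rows)
    gaps = {}
    for segment, peer_rate in peers.items():
        # A segment with no institution applications is treated as a rate of 0.0.
        inst_rate = inst.get(segment, 0.0)
        if peer_rate - inst_rate >= min_gap:
            gaps[segment] = (inst_rate, peer_rate)
    return gaps


# Hypothetical (tract, action_taken) pairs.
institution = (
    [("A", "originated")] * 3 + [("A", "denied")]
    + [("B", "originated")] + [("B", "denied")] * 3
)
peers = (
    [("A", "originated")] * 4 + [("A", "denied")]
    + [("B", "originated")] * 3 + [("B", "denied")]
    + [("C", "originated")] * 2
)

gaps = lending_gaps(institution, peers)
for tract, (inst_rate, peer_rate) in sorted(gaps.items()):
    print(f"tract {tract}: institution={inst_rate:.2f}, peers={peer_rate:.2f}")
```

A flagged segment is only a prompt for review; differences in applicant mix can explain a gap, so the rates alone do not establish a disparity.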

In a nutshell, the CFPB has taken steps to use technological advances in data discovery to implement consumer protection and fair lending laws. HEXANIKA understands the way forward and can support customers as they adopt these technological advances.
