Real estate data mining for your next business need using Public Records

Here are a few examples of how our public record data mining is leveraged

Comprehensive nationwide real estate public record data helps tackle business needs in many industries beyond real estate. We mine terabytes of real estate data to find the needles, or patterns, in the haystack. This post covers a few real estate and non-real estate use cases we are actively supporting.

As previously announced, our public record data provides a comprehensive set of property attributes such as owner occupancy, last sale information, and more detailed tax assessment information along with full property details. It also provides property identification, seller/buyer information, tax exemption details, building information, and legal description of the property.

The valuation models and comparable similarity scoring are now based on authoritative property details and current market conditions from listing data.

Marketers in any industry

PropMix real estate data is well suited to finding a target customer base for many businesses. Here are a few examples:

  1. A skylight company recently needed information on all homes that have a skylight so that they could offer upgrades or servicing options
  2. A flooring company is able to provide an automated estimate of carpeting or hardwood flooring costs using our building area information
  3. An insurance company is able to target customers who have lived for more than 10 years in a home to consider modifying their insurance coverages

Real Estate Investors

Investors are interested in finding undervalued homes in good rental markets to buy and convert them to income generating rental properties.

  1. We identify tenant occupied properties in each neighborhood in the country and find areas where rental demand is increasing
  2. We then find owner occupied homes in these areas that can potentially be converted to investment properties.
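The two steps above can be sketched in a few lines. This is a minimal illustration only: it assumes two snapshots of property records, uses the tenant-occupied share of an area as a stand-in for rental demand, and the field names `property_id`, `area`, and `owner_occupied` as well as the `min_increase` threshold are hypothetical, not our production schema.

```python
from collections import defaultdict


def rental_share_by_area(records):
    """Return {area: fraction of properties that are tenant occupied}."""
    totals = defaultdict(int)
    rentals = defaultdict(int)
    for rec in records:
        totals[rec["area"]] += 1
        if not rec["owner_occupied"]:
            rentals[rec["area"]] += 1
    return {area: rentals[area] / totals[area] for area in totals}


def conversion_candidates(previous, current, min_increase=0.05):
    """List owner-occupied homes in areas whose rental share is rising.

    `previous` and `current` are two snapshots of the same market.
    The rental share is only a rough proxy for rental demand.
    """
    before = rental_share_by_area(previous)
    after = rental_share_by_area(current)
    growing = {
        area for area in after
        if area in before and after[area] - before[area] >= min_increase
    }
    return [
        rec["property_id"] for rec in current
        if rec["area"] in growing and rec["owner_occupied"]
    ]
```

A real model would also weigh rents, vacancy, and price levels; the sketch only shows the shape of the computation.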

Our data can also power a full investment pro forma including the total cost of ownership and return on your investment.

Mortgage Industry

Appraisers and lenders need information to accurately assess the risk of the collateral before underwriting a loan – purchase, refinance, and/or home equity.

  1. Appraisers improve the accuracy of their valuations using extensive assessor and recorder property data and comparable sales from public records – including new home sales and/or owner sales not in the Multiple Listing Services.
  2. Underwriters or lender reviewers can run their appraisal review and AVMs using:
    1. Transaction history on a property
    2. Comprehensive report of the property details

As we continue to solve additional business problems we will provide updates on this blog on new and creative ways in which our customers are mining our data.

Improve the Quality of Your Real Estate Data

Part 2 – How to improve Real Estate Data Quality?

 

In Part 1 of this series we broadly covered why data quality is important in real estate, why real estate data quality has become a hard problem to solve, and presented a few examples of how to measure the quality of your real estate data. In this second and final part we will present a few ideas on how you could begin the practice of improving real estate data quality.

Data Quality Best Practices

As you would expect, data quality is a common problem in many other industries, irrespective of how old or new the industry is. As a result, many best practices already exist for managing and improving data quality that can be easily adopted within real estate. Here are a few important areas to focus on.

Data Quality Assessment

Before we can start improving quality we need a solid understanding of the current state of the data. As we presented in the last section of Part 1, knowing how to measure the quality of your data is a first step. These data quality metrics are very specific to the industry we are in and we have provided a few good starting points.

In addition to knowing your current state, a good data quality assessment practice requires you to assess yourself periodically, both to measure improvements and to measure any data quality leaks caused by new data trickling into your platform. It is also a great way to present to senior management the strides you are making in your organization.

 

The design of the quality metrics needs to be traceable directly to your company’s business objectives, which differ depending on where in the real estate market you play – lead generation, mortgage origination, appraisals, brokerage, etc. Such traceability is important to get buy-in from management to invest in data quality.

Data Governance

To secure a strong organizational commitment to data quality, and to continuously support the people, processes, and technologies that maintain it, a data governance board must be established with participants from the business and IT. Business participants would be those who are close to the consumption and production of data, and the IT participants would be the data architects and modelers. The objectives of the governance board would be to:

  • Establish data policies and standards
  • Define and measure data quality metrics
  • Discover data related issues and provide resolution paths
  • Establish proactive measures to reduce data quality leakage

Data Stewards

One of the most important roles within a data governance board and the overall data management practice is the Data Steward. Data stewards are the ultimate owners of specific sections of the data – usually called subject areas, and they would represent business users and producers of data. The buck stops with the data steward for all data quality issues and the steward takes a leadership role to resolve data accuracy, consistency, or integrity issues.

 

Data stewards are often the liaisons between the business and the IT department that manages the data for the business. In this role, they are required to work with the business and IT to define relevant quality metrics, have them interpreted and implemented appropriately by the IT department, and ultimately showcase quality improvements that improve business outcomes.

Create a Data Quality “Firewall”

Most data within an organization is broadly traceable to 2 types of sources – applications where users enter data, or data feeds that are processed to load data into data stores. The idea of a data quality firewall is to catch and reject any data that violates data quality rules at the time of its entry into a data store. All data ingestion points will have to hit this one virtual firewall to be validated before being processed and stored.

 

The keyword above is “virtual” – because it is impractical to create a single system to act as a data quality firewall given the various subject areas of data and the departmental data ingestion points across the organization. The idea is not to create a choke point but a proactive mechanism to catch data quality issues for follow up and resolution before it goes downstream into transactional or analytical systems.
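As a rough illustration of the idea, the sketch below builds one shared validation gate that any ingestion point could call, and quarantines failing records for follow-up rather than blocking the whole feed. The function names, the two example rules, and the use of `ParcelNumber` and `ListPrice` as plain dictionary keys are assumptions made for the example, not a description of a specific product.

```python
def make_firewall(rules):
    """Build one validation gate that every ingestion point can share.

    `rules` maps a rule name to a function that returns True when the
    record passes. The returned checker lists the rules a record violates.
    """
    def check(record):
        return [name for name, rule in rules.items() if not rule(record)]
    return check


def ingest(records, check):
    """Split incoming records into accepted and quarantined ones."""
    accepted, quarantined = [], []
    for record in records:
        violations = check(record)
        if violations:
            # Keep the violations with the record so a steward can follow up.
            quarantined.append((record, violations))
        else:
            accepted.append(record)
    return accepted, quarantined


# Hypothetical example rules; real rules come from your data steward.
RULES = {
    "has_parcel_number": lambda r: bool(r.get("ParcelNumber")),
    "positive_list_price": lambda r: (r.get("ListPrice") or 0) > 0,
}
```

Each application or feed loader would call the same `check` function, which keeps the firewall virtual: one set of rules, many entry points.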

Data Standardization vs. Data Quality – What’s the difference

Does compliance with a data standard mean high data quality? In other words, if your data is Platinum level certified against the RESO 1.5 data dictionary, would you also consider it to be of high quality? It turns out the answer is not that straightforward.

 

There are typically 2 different views on data quality – conformance to a standard specification or usability of data for a specific purpose. If we take the first definition the data quality would be very high if a data set is certified by RESO. On the other hand as we discussed in Part 1, an agent could inadvertently enter erroneous listing data or purposefully tweak the listing for improved marketability. This can result in data inconsistency between a public record and a listing record for the same property leaving the user of the data to assign trustworthiness to the data sources before consumption. Since business objectives are driven by data use as opposed to conformance to a standard we prefer the second definition of data quality which is measured by its usability.

 

Consider another example of standard vs. quality: Assignment of a PropertySubType value of Condominium or Townhouse or Single Family Residence is standards compliant but an erroneous assignment of this field can cause the property to be missed from appearing in IDX searches. In addition, it can also cause valuation issues if not combined and cleansed against other data sources.

 

Having said that, certain standards specifications include elements of data use as well, in which case conformance to standards and usability begin to mean the same. But given the various uses of a particular data set, it is unfair to expect a standards organization to completely define all the usability specs for the data; doing so would result in an unwieldy standard, which may reduce its adoption.

 

Here are some typical data quality concerns to consider:

  • Completeness: Are we missing any values of critical fields?
  • Validity: Is the data in a field valid? Does the whole record match my rules?
  • Uniqueness: How much of our data is duplicated?
  • Consistency: Is information consistent within a single record, across multiple records, and across multiple data sets?
  • Accuracy: Does the data represent reality?
  • Temporal Consistency & Accuracy: Does a snapshot in time represent reality at that time and are all data sets consistent with that snapshot?

 

As you can see, a data standard such as RESO would not be able to answer the above for all the real estate ecosystem players. We could define detailed rules for each of the concerns above and such rules will look different in a mortgage company and a sales lead generation company.

Practical data quality for real estate

Now let us bring all this down to a few specific takeaways to improve the quality of data in your company. We will define these in a few steps to begin with. But certainly stay tuned to our blog for future posts on this topic, where we will continue to provide specific rules and heuristics you could implement.

 

Many of the activities below must be driven by an appointed data steward for each major data set you are dealing with – assessment, listings, deeds, mortgages, permits, etc.

Identify critical fields

The first step in your data quality journey is to identify the most critical fields for your particular application. Out of the 639 fields contained in the RESO 1.6 data dictionary, you would want to identify the fields that are required for your computations. Some fields are commonly required for any application; they were listed in Part 1 of the article and are repeated here for quick reference:

 

Parcel Number       ListingContractDate         AssociationName
Address             StandardStatus              AssociationFee
PropertyType        OriginalListPrice           Subdivision
PropertySubType     ListPrice                   School Districts
Lot Size            CloseDate
Zoning              ClosePrice                  TotalActualRent
NumberOfBuildings   DaysOnMarket
BedroomsTotal       ListAgent Information
BathroomsTotal      ListBroker Information
LivingArea          SellingAgent Information
Tax Year            SellingBroker Information
Tax Value           Public Remarks
Tax Amount
Land Value
Improvement Value
StoriesTotal
ArchitectureStyle

Define Data Quality Rules

The next step is to define a set of rules that will consider 2 dimensions to begin with:

 

Data Quality Concerns: Completeness, Validity, Uniqueness, Consistency, Accuracy, and Temporal Consistency & Accuracy.

Extent of measurement: Single record, multiple history records of the same property, multiple history records of the same listing, multiple data sets (public records and listings)

 

You would end up with rules for each field, for each type of record, for a data set, and rules that cut across multiple data sets. These rules would validate the field, a record, a set of records, or the whole data set. Execution of these rules would result in either errors or warnings about the quality of your data.
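One way to organize such rules is to tag each one with the concern it addresses, the extent it covers, and whether a failure is an error or a warning. The sketch below is only an illustration under those assumptions: the `Rule` class, the three example rules, and the 50% price tolerance are hypothetical, and the field names are used as plain dictionary keys.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Rule:
    name: str
    concern: str    # e.g. "Validity", "Completeness", "Consistency"
    extent: str     # e.g. "field", "record"
    severity: str   # "error" or "warning"
    passes: Callable[[dict], bool]


def list_price_near_original(record):
    """Pass when ListPrice is within 50% of OriginalListPrice.

    Missing values pass here; completeness is covered by other rules.
    """
    price, original = record.get("ListPrice"), record.get("OriginalListPrice")
    if price is None or original is None:
        return True
    return abs(price - original) <= 0.5 * original


RULES = [
    Rule("bedrooms_non_negative", "Validity", "field", "error",
         lambda r: isinstance(r.get("BedroomsTotal"), int)
         and r["BedroomsTotal"] >= 0),
    Rule("closed_has_close_price", "Completeness", "record", "error",
         lambda r: r.get("StandardStatus") != "Closed"
         or r.get("ClosePrice") is not None),
    Rule("list_price_near_original", "Consistency", "record", "warning",
         list_price_near_original),
]


def evaluate(record, rules=RULES):
    """Return the names of failed rules, grouped by severity."""
    findings = {"error": [], "warning": []}
    for rule in rules:
        if not rule.passes(record):
            findings[rule.severity].append(rule.name)
    return findings
```

Rules that span multiple records or multiple data sets would take a list of records instead of a single one, but can carry the same tags.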

Discovery with Data Profiling

Data profiling helps you run a statistical analysis on the data to discover hitherto unknown problems.

For example, we usually expect PropertySubType values to be always one of the known ones. But as new data gets processed, we might discover that certain PropertySubType mappings are absent in our standardization routines and as a result non-standard PropertySubTypes may be getting added to our DB.

 

To catch such issues, a data profiling capability will provide detailed stats on field populations, null counts, blank counts, and also field value distributions. For the PropertySubType values, the field value distribution will reveal to us that there is a new PropertySubType value with over 100,000 entries. This will mean that we should remap these values as required.
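A minimal profiler for a single field might look like the sketch below. It assumes records are plain dictionaries and that you maintain your own set of known values; the function names and the sample values used to exercise it are hypothetical.

```python
from collections import Counter


def profile_field(records, field):
    """Count nulls, blanks, and the distribution of populated values."""
    nulls = blanks = 0
    values = Counter()
    for rec in records:
        value = rec.get(field)
        if value is None:
            nulls += 1
        elif isinstance(value, str) and value.strip() == "":
            blanks += 1
        else:
            values[value] += 1
    return {
        "total": len(records),
        "nulls": nulls,
        "blanks": blanks,
        "distribution": values,
    }


def unexpected_values(profile, known_values):
    """Return {value: count} for values missing from the known set."""
    return {
        value: count
        for value, count in profile["distribution"].items()
        if value not in known_values
    }
```

Running `unexpected_values` on the PropertySubType profile would surface any unmapped value together with its count, which is the signal to update the standardization routine.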

 

Running a data profiler periodically will help identify issues that creep into the data. Note that a data quality firewall only prevents “unclean” data when we have modeled the relevant cleansing rules or quality rules within that firewall. For previously unknown issues that get loaded via daily incremental data ingestions, we need to discover the issues and model prevention rules into the firewalls.

 

Establish Data Quality Metrics

Having defined the rules it is time to measure your quality against the rules you have established. Common quality metrics are:

  • Number of records that failed a particular quality rule
  • Field population thresholds and where we fall short
  • Field value distributions
  • Number of records with invalid data for each field
  • Number of records that failed a record level quality rule
  • Number of multi-record quality rule failures
  • Number of data-set level quality rule failures

 

For each of the above it is important to understand the trends, so you need to run the data profiler at regular intervals – weekly or monthly – to know how your data quality is trending: improving, getting worse, or showing issues that did not exist before.
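Trend tracking can start as simply as comparing the failure counts from two profiler runs. The sketch below assumes each run is summarized as a dictionary of metric name to failure count; the function name and the labels it returns are hypothetical.

```python
def classify_trends(previous, current):
    """Label each metric in the current run relative to the previous run.

    Both arguments map a metric name to a failure count, so a lower
    count means the data quality for that metric has improved.
    """
    trends = {}
    for metric, count in current.items():
        if metric not in previous:
            trends[metric] = "new"
        elif count < previous[metric]:
            trends[metric] = "improving"
        elif count > previous[metric]:
            trends[metric] = "worsening"
        else:
            trends[metric] = "unchanged"
    return trends
```

Storing each run's summary lets you chart these counts over weeks or months rather than comparing only the last two runs.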

Enforce the rules at the data ingestion points

This is the first and proactive step in improving and maintaining high quality of data.

 

Having defined the rules for measuring data quality, it is now important to maintain higher-quality data by enforcing these rules at the time data is created in the organization. Get the data steward to become the evangelist for the rules they have defined and to work with each data origination point to implement the validation rules.

Define Heuristics for Quality Improvement

The reactive posture to data quality improvement is considered more of a data cleansing process and is a required element of a data quality practice. Most of the time, you are not in control of the data origination points, and if the rule enforcement at the data origination point is too restrictive you might not have enough data for your applications. Hence the need for a reactive measure to clean up the data you have received.

 

There are broadly 2 alternatives – either perform the cleanup and then put it through a highly restrictive data quality firewall or have a lenient firewall with a downstream cleaning process. The choice depends very much on your application and its ability to deal with imperfect data.

 

Any data quality improvement mechanism is dependent on a set of heuristics that the data steward and the data architects work together to define. For example, you could reclassify a rental listing correctly by looking at the listing price and comparing it to local median sale price and to the median rental price. A strong partnership between a data steward and the data architect is necessary to define and develop these cleansing heuristics.
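The rental reclassification example can be sketched as a comparison of the list price against the two local medians. This is one possible heuristic, not the only one: it assumes the price that is closer on a logarithmic scale indicates the listing type, and the function name and sample figures are hypothetical.

```python
import math


def likely_listing_type(list_price, median_sale_price, median_rent):
    """Guess "rental" or "sale" from how close the price is to each median.

    Distances are measured on a log scale because sale prices and monthly
    rents differ by orders of magnitude. All inputs must be positive.
    """
    if min(list_price, median_sale_price, median_rent) <= 0:
        raise ValueError("prices must be positive")
    distance_to_sale = abs(math.log(list_price / median_sale_price))
    distance_to_rent = abs(math.log(list_price / median_rent))
    return "rental" if distance_to_rent < distance_to_sale else "sale"
```

For instance, with a local median sale price of 400,000 and a median rent of 2,200, a listing priced at 2,500 would be treated as a rental and one priced at 350,000 as a sale.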

 

It is also recommended that you maintain a list of all active and retired heuristics used for cleansing. Another need alongside data cleansing is the ability to track the data lineage where you would keep track of the source of the cleansed data and the heuristics that caused the data to be modified.
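A lightweight way to combine the heuristic list with lineage tracking is to log every change a heuristic makes. The sketch below is a minimal illustration: the registry contents, the `LineageEntry` fields, and the `id` and `source` keys on each record are all hypothetical.

```python
from dataclasses import dataclass

# Hypothetical registry of cleansing heuristics and their status.
HEURISTICS = {
    "rental_price_reclass": "active",
    "legacy_zip_fix": "retired",
}


@dataclass(frozen=True)
class LineageEntry:
    record_id: str
    source: str
    heuristic: str
    field: str
    old_value: object
    new_value: object


def apply_heuristic(record, heuristic, field, new_value, lineage,
                    registry=HEURISTICS):
    """Return a cleansed copy of the record and log what changed and why."""
    if registry.get(heuristic) != "active":
        raise ValueError(f"heuristic {heuristic!r} is not active")
    old_value = record.get(field)
    if old_value == new_value:
        return record  # nothing changed, so there is nothing to log
    lineage.append(LineageEntry(record["id"], record["source"],
                                heuristic, field, old_value, new_value))
    return {**record, field: new_value}
```

Because the original record is left untouched, the lineage log is enough to explain, or undo, any cleansed value.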

Conclusion

Data quality is a cyclical process that begins with establishing rules, implementing them to measure quality, profiling the data, cleaning up the data as required, and finally going back to tweak the rules and execute the cycle once more. The target metrics would start small and continue to tighten with time.

We hope this article provided an overview and some key takeaways to implement a good data quality practice within your real estate technology platform. We will continue this conversation with more blog posts to provide you:

  • Practical data quality rules and metrics
  • Data cleansing heuristics to implement
  • Machine learning techniques in real estate data cleansing

 

We are planning to release our Data QA Tool specialized for real estate data free to the community. Please sign up here to be notified when the tool is released.

Improve the Quality of Your Real Estate Data

Part 1 – The Real Estate Data Quality Problem

Introduction

Real estate data comprises many categories – characteristics of a property, the history of the property and how it changed during its lifetime (renovations, add-ons, permits, etc.), current for-sale properties, history of sale records, history of tax assessments, current mortgage information, any outstanding liens, utility consumption, neighborhoods, schools, and the list goes on. You can see that there is data about a real property and a lot of additional data about how the property is influenced. As you read through that partial list of data categories you will also have observed that each of those categories is created and maintained by a different company or government agency. Given these disparate sources of data and how the real estate industry has evolved, assembling all of this in one place to know everything about a single property has become a challenge. Before we begin explaining why this is a challenge, let us briefly explain who uses this data and why it is so important.

Relevance of Data Quality in Real Estate

Housing alone contributes about 15-18% to the GDP of the US economy [1]. If you consider commercial real estate the numbers climb to well over 20% [2]. The real estate ecosystem comprises numerous industries, and each of them is dependent on data. Here are a few of them in the table below.

 

Producers & Consumers of Real Estate Data

Local Municipalities
County Governments
Federal Agencies
Mortgage Lenders (Banks, Credit Unions)
Mortgage Brokers
Mortgage Servicers
Investment Banks
Appraisers
Home Inspectors
Title Companies
Real Estate Brokers and Agents
Home Buyers and Sellers
Home Improvement Companies
Home Improvement/Repair Contractors
Builders and Developers
Architects
Civil Engineers
Investment Banks
ETF and Fund Managers
Retirement and/or Sovereign Funds
GSEs – Freddie Mac and Fannie Mae

Here is one reason why data quality matters across these players: Consider the loan processing steps in home buying. The homebuyer applies for a mortgage at a lender and the lender’s underwriter hires an appraiser to determine the actual value of the property before lending a percentage of that value (a maximum of 80% in most cases) to the buyer. Once the mortgage is issued it is often transferred to a mortgage servicer and the mortgage itself is sold to another financial institution to enable securitization of the loan. Securitization enables other investors across the world to participate in the US mortgage market and in turn in the US real estate market. Each party in this chain of activities and especially the investor in the security needs to understand the security’s Value at Risk (VAR) which is directly dependent on the value of the home among many other categories of risk such as borrower risk, market risk, and so on.

Home valuations are dependent on the property’s characteristics, recent sales in the market, current inventory of homes, neighborhood information, recent development and employment activity in the area, and many more such factors. As you can see, accurate and consistent real estate data is highly important for arriving at home valuations with a high degree of confidence for every player in the ecosystem.

For instance, consider a property with 4 bedrooms, 3 baths, and a 2,500 sq. ft. living area on a one-acre lot that is listed in the MLS as a 5 bedroom, 3 bath property because the agent counted an additional room in the basement as a bedroom. When it is compared to other 5 bedroom, 3 bath properties, the subject property could get overvalued, or other properties could get undervalued if the list price of such a property is used as a comparable. Similarly, comparing the subject property to another one in better condition, or missing improvements made to the kitchen or the basement, will produce an inaccurate value in an appraisal. As a result, an appraiser tries not to depend solely on the MLS listing data for their work; they supplement it with onsite inspections to collect detailed information. Appraisals are thereby delayed, which further cuts into profit margins in the appraisal business. Much worse, this has a direct bearing on the ability of the homeowner or the buyer to close the transaction. So, unreliable data sources inadvertently exert a strong influence on the whole process.
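A discrepancy like the bedroom count in this example can be surfaced automatically by comparing the listing with the public record for the same property. The sketch below is illustrative only: it assumes both sources have already been matched to one property and mapped to the same field names, and the function name is hypothetical.

```python
def compare_sources(public_record, listing,
                    fields=("BedroomsTotal", "BathroomsTotal", "LivingArea")):
    """Return {field: (public_value, listing_value)} where the sources differ.

    Fields missing from either source are skipped, since a missing value
    is a completeness issue rather than a disagreement.
    """
    differences = {}
    for field in fields:
        public_value = public_record.get(field)
        listing_value = listing.get(field)
        if public_value is None or listing_value is None:
            continue
        if public_value != listing_value:
            differences[field] = (public_value, listing_value)
    return differences
```

Which source to trust for each field remains a business decision; the check only flags where they disagree.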

Why is it difficult to maintain data quality in Real Estate?

Of all the sources of information about a particular property, the most dependable data is that made freely available in most counties in the US via the public records act in each state. That covers tax assessment, deeds, mortgages, liens, etc. These data sets are, again, completely independent, typically tied together only by an APN (Assessor’s Parcel Number). But each county or municipality creates and maintains this data in its preferred model even though conceptually they all cover the same types of information. Integrating data from over 3000 counties across the country and unifying it into a single data model is one necessary step to ensure that data consistency can be maintained across all properties.

 

Real estate listings data gathering, on the other hand, has been a wild west even with the Real Estate Transaction Standard (RETS) maintained by the National Association of Realtors (NAR), which only provides a protocol standard for data exchanges but not a payload standard for the data actually exchanged. Enter the Real Estate Standards Organization (RESO) with the RESO standard data dictionary, which has immensely improved consistency in data representation across the various players. But RESO does not address the types of home valuation related data issues discussed earlier (we will present why RESO is justified in that position in Part 2 of this article). The MLS data capture platforms most often do not enforce any data consistency rules within the system or with the local county/municipality data. Even though a Board of Realtors or MLS may have a recommended format, there could be hundreds, if not thousands, of agents, brokers, and their assistants that could submit a listing. Much as no two people are alike, their choice of words and descriptions of key features could vary. The description of features is another common area where subjectivity is prevalent. For every person that calls a home a “fixer upper”, another person will say it is “an incredible value, with lots of potential”.

 

Inaccuracies in the data can be introduced through other means as well. Property characteristics are largely affected by this. Real estate appraisers require the Gross Living Area (GLA) of a home to be the “Above Grade” square footage, which is how the assessor would report it, but when the property is listed, the Living Area is often inclusive of finished basements, which can be misleading. Even though the intent is not to create a wrong listing, misinterpretation of the data creates tricky situations during the appraisal process. Data entry errors can create a listing with the wrong number of bedrooms or bathrooms, living area, or lot size. When there are several hundred fields to update for a listing and time is limited, these errors tend to multiply.

 

Know Your Data – Measure its Quality

As we explained in the previous sections, data quality in real estate is much needed but hard to achieve given the integration complexities across the various players. Identifying the individual root causes and fixing them can take a long time, but in the meantime we can try to improve the quality of current data to achieve immediate business objectives.

Before we can “cleanse” the data to improve its quality, we need to be able to identify how bad the data at hand is using a few applicable metrics. It is important to understand that the target quality and the metrics to measure it by depend a lot on the target use for the data. For example, a selling agent is most interested in data related to property characteristics, financing terms, showing instructions, etc., but a home improvement company would be interested in property features, property improvements, etc. Here are a few suggested common metrics to measure the quality of real estate data.

 

Field Population Statistics with specific focus on the following fields from the RESO standard data dictionary.

Parcel Number       ListingContractDate         AssociationName
Address             StandardStatus              AssociationFee
PropertyType        OriginalListPrice           Subdivision
PropertySubType     ListPrice                   School Districts
Lot Size            CloseDate
Zoning              ClosePrice                  TotalActualRent
NumberOfBuildings   DaysOnMarket
BedroomsTotal       ListAgent Information
BathroomsTotal      ListBroker Information
LivingArea          SellingAgent Information
Tax Year            SellingBroker Information
Tax Value           Public Remarks
Tax Amount
Land Value
Improvement Value
StoriesTotal
ArchitectureStyle

 

Address Standardization measures the extent to which the address components for a property can be used to uniquely locate the property or to derive a high-accuracy geocode.

Geocode Accuracy is sometimes required to support accurate property searches for radius or polygon searches. Rooftop accuracy may be required for certain applications but a street side geocode might suffice for many.

Listing Duplication must be reduced as much as possible again depending on the application but at the least listings from the different MLSs will need to be linked with a common unique property id.
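As a simple illustration of linking, listings from different MLSs can be grouped under a shared key built from the county and a normalized parcel number. This is only a sketch: the `County` and `ListingKey` fields, the normalization, and the function names are assumptions, and real parcel number formats vary by county.

```python
import re
from collections import defaultdict


def property_key(listing):
    """Build a linking key from the county and a normalized parcel number."""
    parcel = re.sub(r"[^0-9A-Za-z]", "", listing["ParcelNumber"]).upper()
    return (listing["County"].strip().lower(), parcel)


def link_listings(listings):
    """Group listing keys that appear to describe the same property."""
    groups = defaultdict(list)
    for listing in listings:
        groups[property_key(listing)].append(listing["ListingKey"])
    return dict(groups)
```

Any group with more than one listing key is a candidate duplicate to review or merge.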

Raw listing data from an MLS will trickle in with multiple updates and improve in quality over the first few days or weeks of a property being listed. Listing history records may have to be merged to improve data consistency.

Often a listing will move into a Cancelled/Withdrawn status before it is recorded as Sold. In such cases the listing history data may require a consolidation to drop the superfluous status transitions.
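Such a consolidation could be sketched as follows. The example assumes the listing history is a chronologically ordered list of status strings and uses the status names from the paragraph above; your feed may use different values, and the function name is hypothetical.

```python
def consolidate_history(statuses, transient=("Cancelled", "Withdrawn"),
                        final="Sold"):
    """Drop transient statuses that are later followed by the final status.

    Consecutive duplicates left behind by the removal are collapsed.
    Histories that never reach the final status are returned unchanged.
    """
    statuses = list(statuses)
    if final not in statuses:
        return statuses
    # Index of the last occurrence of the final status.
    last_final = len(statuses) - 1 - statuses[::-1].index(final)
    kept = [
        status for index, status in enumerate(statuses)
        if not (status in transient and index < last_final)
    ]
    collapsed = []
    for status in kept:
        if not collapsed or collapsed[-1] != status:
            collapsed.append(status)
    return collapsed
```

A listing that was withdrawn and never sold keeps its full history, since the withdrawal is then the meaningful outcome.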

Very often sale and rental listings may get mixed up in different RETS resources/classes. It may be required to reclassify such listings appropriately.

Click here to continue to Part 2 of this article which explores the following ideas.

  • Data Quality Best Practices
  • Data Standardization vs. Data Quality – What’s the difference
  • Practical data quality for real estate

References

Please click here to provide your contact information to be alerted when Data QA Tool is published.

Improve the Quality of Your Real Estate Data

PropMix published Part 1 of our latest Point of View series of articles called “Improve the Quality of Your Real Estate Data” earlier this week. The first part of the article covers in detail the relevance of data quality in Real Estate, why it is difficult to maintain good quality in Real Estate data, and ways to measure the data quality. Click here to read Part 1 of the article by Daniel Mancino, Sakeer Hassan, and Dr. Umesh Harigopal.

Deciphering the lead from the noise

PropMix @ Inman Connect NYC 2017

Richard Bellamy, CTO at Terradatum
Todd Hoover, SVP at Equifax
Umesh Harigopal, CEO at PropMix.io
Scott Petronis, Chief Product Officer at OnBoard Informatics

 

With so many leads being generated through multiple inbound channels, Agents and Brokers are at a loss to understand which leads will result in a transaction and which are just trying to gauge the market without a real intent to transact. How do we enable Real Estate market participants (agents, lenders, mortgage servicers, etc.) to decipher the real intent of their customers and focus attention on the leads that are most qualified?

 

The following major themes emerged from the panel discussion:

  • Inbound marketing via content and research tools is a great way to bring customers to your site, retain them, and then watch their behavior on the site to identify buyers vs. sellers. A typical home purchase takes up to 365 days from intention to transaction, so attracting and retaining a customer and continuing to qualify them is valuable.
  • Chatbots are emerging to engage a consumer, qualify them, and then transition the dialog to a live agent seamlessly. These chatbots are not just talking sentences but offering an immersive experience for the user with video, audio, action prompts, and dialog.
  • Propensity to Sell/Buy/Move predictions are another great way to qualify leads. Machine learning algorithms combine various sources of data – property records, demographics, census, and economic data – to generate such predictions and continue to validate the predictions and learn from real transaction data.

The Propensity to Sell Score combines usage, behaviour, demographic, and economic data to predict how likely a homeowner is to list their property within a specific time period. The score is a reflection of the confidence in the prediction. With this score, Agents and Brokers can focus on the leads that really matter. This helps them close more transactions faster by dealing with those who are likely to close transactions within a specified period of time.

Millennials leading the way in the global real estate market

PropMix @ Inman Connect NYC 2017

F. Jacob Cherian, CEO at SMC Global USA
Curt Beardsley, VP Industry Development at Zillow
Matt Kumar, CEO at Software Incubator
Abhimany Anil Londhe, CEO at SMC & IM Capital

There are 2 major shifts happening in the real estate market currently – a large number of millennials are entering the market with a differentiated set of needs and dreams of a home, and the real estate market is going global with foreign investment in the US real estate market. A study by Asia Society and Rosen Consulting Group showed that between 2010 and 2015, China alone invested over $93 billion in US residential real estate with a growth rate of 20%. India is the second Asian giant to pump over $8 billion into the US real estate market in 2015.

Given these major market shifts Curt provided an excerpt from the Zillow Group’s Research Reports on the influence millennials have in the housing market:

The largest segment of homebuyers in the market today are millennials with the average age of the home buyer at 36.

On the other hand, the global nature of the real estate market is driving new technologies to create glocal data bridges to syndicate listings internationally and attract international buyers through the local brick and mortar brokers. SMC Global is one such establishment in India, with over 2000 offices across the country to source and deliver buyer leads to agents in the US and vice versa. Non-resident Indians (NRIs) in the US are also investing in the growing Indian real estate market. Regulations in India have improved recently with the implementation of a universal citizen ID card and many property title and valuation specific controls. Online platforms such as Indunia – touted as the Zillow for India – help connect consumers and agents across the globe.

PropMix’s real estate data API platform provides an international listing connector to enable any US real estate portal to display international properties. Similarly the API platform also allows foreign listing portals to support US real estate listings. In addition, every foreign agent can now be an expert at the US real estate market with the PropMix cognitive fabric providing the automated guidance to the agents connecting their customers to the US market.

Predicting Real Estate Property Prices

PropMix recently added two insights – iEstimate and iPriceTrend, for predicting Real Estate property prices, to its catalog. These insights are provided by CogNub, our cognitive computing and machine learning partner.

iEstimate is the estimated market value for an individual home and is calculated for million homes nationwide. It is computed using a proprietary formula taking into account special features, location, and market conditions of the property based on Critical Field Analysis.

iPriceTrend provides the price trend of similar types of properties in a particular zip code for a time period. This insight will help you keep track of property values over time for a zip code of interest. These price trends are visualized in the form of graphs. It also displays additional information such as the volume of transactions in the zip code. iPriceTrend is available for three months, six months, one year, three years, and five years.

Read more about the cognitive computing capabilities behind iEstimate and iPriceTrend here.