Well guys, I have an embarrassing confession. Due to a coding error, all of my market stress indicators have been understated since the beginning of August 2007. Here is what the Flipper Market Share, Troubled Inventory, and Sellers In Trouble Market Share really look like:
There were actually three distinct issues with the data. One was the overly strict way I was interpreting the data set, one was an artifact of the process I used to recalculate the statistics, and one was a simple coding mistake. In the interest of due diligence, and in the hope of retaining your trust, I would like to discuss each issue openly. I'll start with the coding error.
$year_in_seconds = 31536000
When my system determines whether a house listing is a flipper or not, it compares the listing date to the last time the property was sold. If the listing took place within two years of the last sale, my system flags it as a flipper. In addition, a listing will be carried over as a flipper if it was a flipper the week before. My mistake was a failure to set the comparison variable for the length of a year in seconds, so the system didn't look back far enough in the past sales database.
This error probably happened during a system overhaul I performed last summer, and the reductions in flipper market share happened so gradually (due to carry over) that I didn't attribute it to a coding mistake. It turns out the only flipper stat I've been reporting since August 8, 2007 is the decay rate of the flipper listings since then.
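To make the failure mode concrete, here is a minimal sketch of the flipper check in Python. It is not my production code: the function name, its parameters, and the choice to model the unset variable as zero are all assumptions made for illustration.

```python
from datetime import datetime

# 365 days expressed in seconds (the value from the section heading).
YEAR_IN_SECONDS = 31_536_000
FLIP_WINDOW_YEARS = 2


def is_flipper(listing_date, last_sale_date, was_flipper_last_week=False,
               year_in_seconds=YEAR_IN_SECONDS):
    """Hypothetical flipper check: listed within two years of the last sale."""
    # Carry-over rule: a listing flagged last week stays flagged.
    if was_flipper_last_week:
        return True
    # No known prior sale means there is nothing to compare against.
    if last_sale_date is None:
        return False
    elapsed = (listing_date - last_sale_date).total_seconds()
    return 0 <= elapsed <= FLIP_WINDOW_YEARS * year_in_seconds


listing = datetime(2007, 8, 8)
recent_sale = datetime(2006, 3, 1)

print(is_flipper(listing, recent_sale))                     # correct window
print(is_flipper(listing, recent_sale, year_in_seconds=0))  # unset variable, modeled as 0
```

With the window collapsed to zero, new listings almost never qualify, but anything flagged the week before keeps its flag. That is consistent with the slow decline described above rather than a sudden drop.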
SELECT price WHERE address LIKE '$address%'
All of these issues came to light this Saturday when I was researching a different data source. I've been puzzled by a recent disconnect in inventory levels between MetroMLS and my source, so I decided to try a third source just to make sure. The inventory levels between my two sources matched OK, but I got a huge surprise when I noticed a spike in SIT levels.
It turns out I have been using overly strict criteria when comparing listing addresses to sales addresses in my database for SIT/FIT determination. As you can imagine, there are lots of ways to format an address. Unfortunately, the two sets of data I've been comparing come from disparate sources, and the address suffixes are not always included in one set. When I first set up the system I made a conscious decision to use a strict comparison structure because I was afraid of creating false positives (e.g. "123 Fake St" and "123 Fake Ave" would both compare positively against "123 Fake").
As luck would have it, the third data source was formatted almost exactly like my sales database, so it gave much higher SIT and FIT inventory levels. This led me to revisit the comparison issue and recalculate the statistics for my entire data set using a friendlier comparison operator.
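The trade-off between the two comparison styles can be sketched in a few lines of Python. The function names and the normalization step are assumptions for illustration only; the prefix version is meant to mimic the `LIKE '$address%'` pattern from the section heading, not to reproduce my actual query.

```python
def normalize(address):
    """Uppercase and collapse whitespace so formatting noise is ignored."""
    return " ".join(address.upper().split())


def strict_match(listing_address, sale_address):
    """Strict comparison: the two addresses must be identical."""
    return normalize(listing_address) == normalize(sale_address)


def prefix_match(listing_address, sale_address):
    """Friendlier comparison: the sale address only has to start with
    the listing address, like SQL's  address LIKE '<listing>%'."""
    return normalize(sale_address).startswith(normalize(listing_address))


# A listing that lacks the street suffix never matches under the strict rule...
print(strict_match("123 Fake", "123 Fake St"))   # False
# ...but does under the prefix rule,
print(prefix_match("123 Fake", "123 Fake St"))   # True
# at the cost of the false positive I was originally worried about.
print(prefix_match("123 Fake", "123 Fake Ave"))  # True
```

The last line is the false positive case from above: once a suffix is missing, a prefix comparison cannot tell "St" from "Ave".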
Something Old, Something New
Lastly, an artifact of the recalculation process led to a slight increase in stress indicator levels. SIT/FIT levels are calculated the week the listing data is acquired. Unfortunately, my sales data source is often a month behind (and sometimes two or three). As a result, any sale that happened within that lag period won't get picked up by the system at the time. However, when I reran the calculations, the data was there and the comparisons were made. This produced a very small increase in SIT/FIT inventory.
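The lag effect is easier to see with a toy example. The sketch below is in Python with made-up records; the field names (`sale_date`, `available_on`) and the helper functions are hypothetical and only illustrate why a rerun sees sales that the original weekly run could not.

```python
from datetime import date

# Hypothetical sales records: when the sale happened, and when the
# record actually showed up in the sales data source.
sales = [
    {"address": "123 FAKE ST",
     "sale_date": date(2007, 9, 10),
     "available_on": date(2007, 10, 15)},
]


def visible_sales(sales, as_of):
    """Only the sales records that had arrived by the calculation date."""
    return [s for s in sales if s["available_on"] <= as_of]


def has_prior_sale(address, listing_date, sales, as_of):
    """True if a sale on or before the listing date is visible as of `as_of`."""
    return any(
        s["address"] == address and s["sale_date"] <= listing_date
        for s in visible_sales(sales, as_of)
    )


listing_date = date(2007, 9, 24)

# Original weekly run: the sale record has not arrived yet.
print(has_prior_sale("123 FAKE ST", listing_date, sales, as_of=listing_date))
# Later recalculation: the same listing now finds its prior sale.
print(has_prior_sale("123 FAKE ST", listing_date, sales, as_of=date(2008, 1, 1)))
```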
Here is a chart I prepared summarizing these issues, and their effect on the data set:
|Year Variable Error||Overly Strict Interpretation||Recalc Increase|
|% of Dataset Affected|
|Magnitude of change on affected data|
One good thing is that these issues had absolutely no effect on the other stats. Inventory, price, and price level inventory are unchanged. However, the market is clearly under much more stress than I previously indicated. I will have more on the issue later in the week.
I would also like to thank everyone for bearing with me as I deal with these issues. I am a scientist by trade, and I firmly believe that the most important part of the experimental process is the airing of mistakes. My goal with this blog is to try and tell the housing bubble story by interpreting data in novel ways, and that story can't be told at all if it's not told accurately.
Even at the expense of reputation.