The United States of Data
If we want to combine all of the world’s geographic data in a single place, we must not just compile and document geographic data, but also compile and document data geography. In the case of the US, I learned the hard way that the laws, resources, and culture around open geographic data vary tremendously from state to state. So I have produced a spreadsheet of my results, along with the map below. I only compiled laws that relate to the two most essential types of data: parcels (also known as “cadastral” data – both technical words for properties or lots) and property tax assessor’s data (also known by various names). In the map, blue states are open and red states are closed (the color choices are not innuendos of party politics). Any state that has any restrictions, non-commercial or otherwise, is considered closed (see the spreadsheet for details on any state). For any geographic (GIS) data other than parcel and assessor data, the ratings may or may not correspond, but the spreadsheet is still a very useful place to start.
I built this by emailing one place in each state, asking them to provide me with the exact laws that justified their fees. It was efficient, and a lesson in how to compile laws and information in general, which I have documented on another page. The list is not 100% certain, but close enough – for a few states, I had to fill in the gaps after reading their laws, and I may have missed something as far as exceptions. As always, if anyone wants to participate, they can simply ask to share the spreadsheet or email me. This is going to be useful for what we are doing, but it is also of course useful for anyone who wants to use data in a given place or state, for whatever purpose.
In case you are wondering, I did this for MapStory Local, an effort to map all of human settlement in history – if you don’t know about the effort, I suggest you read a quick overview for some context. In the case of the US, parcel and assessor data is the most useful data for MapStory Local, as it usually has the date that buildings were built on each property, the whole reason we can animate entire cities and counties without much effort.
I am not sure how things vary country to country – I do know that at least one country, the Netherlands, has combined all of its building data in a single place where it can be downloaded, a dataset that I have on a hard drive and I should perhaps break into smaller pieces and upload into MapStory. The United States is another story – most jurisdictions in the United States above a population of 30,000 have geographic data, which would need to be combined place by place, and state by state into a single dataset. People are slowly realizing that while you may think that data can simply be stored on personal hard drives, sharing data openly is what you should do, and eventually will be seen as the way things are done, and for good reason.
This is a longer article, serving the purpose of explaining the problems one often faces in collecting data, and it introduces the problems with open records laws in general, aiming particularly towards governments that have such laws and policies and should change them.
The nearly universal failure of open records laws
The problem with collecting public data has to do with how public bodies are required to provide it. If you had a PDF on your desktop, you would just email it or use a file-sharing service like Dropbox, right? These are quick, easy and free; you would not charge someone for the cost and time of printing it out and mailing it. And you certainly wouldn’t print geographic data if someone wanted a file. But believe it or not, laws allow, or are often interpreted to allow, and sometimes even require governments to do exactly that.
Almost all places have open records laws that say that copies must be furnished at the cost of duplication; that’s reasonable, but the problem is that it is often not clear in what form the data should be duplicated. The laws were written for printed documents, which could only be duplicated one way. Even though the data is digital, governments can lawfully send you a disk – and they may even bend the rules to say that they can provide it to you in any format they choose. In a recent case, Orange County, California charged the Sierra Club $375,000 for a GIS parcel dataset, or offered to provide it as PDFs or on paper. The Sierra Club took the county to court, the case dragged on, and finally the California State Supreme Court ruled unanimously that governments must provide GIS data in its native format and follow open records laws about the cost of duplication. A less extreme but similarly relevant case happened in Illinois recently.
I mentioned that I learned this all the hard way. Having had luck googling for data and hearing about the California Supreme Court verdict, I decided it would make sense to try and collect entire states. Big mistake. The culture really varies from state to state, and it was very rare for people to give data in a reasonable way, and very few places had building outlines, which we were looking for at the time. The interns I was working with and I fell flat on our faces. It turns out that geographic and assessor’s data are unique – there are very often special laws for this data, or even special rules that governments create at their whim. In the case of California, they actually have a specific law that allows assessors to charge for their data (hence California being red on the map, despite their Supreme Court verdict). So we could get parcels, but with no useful data. This is also where I learned that the law allows people to send disks, when I would think “the cost of duplication” would mean that someone can send it via Dropbox or FTP for free, since that is the cheapest and most convenient way for everyone. I got some brief advice from someone at the Electronic Frontier Foundation (EFF), which assisted in the Orange County case, and he made it clear that there was nothing that could be done.
In some cases I asked to have it sent to my Google Drive, and often, especially in other states or in smaller jurisdictions, the person on the other end just did it. But often, they insisted on their way “because it is the way things are done.” In the case of San Francisco, they said they could send a disk after I mailed them a check for $5. I insisted that they upload it for free to the Google Drive account I offered them, and they really wanted to upload it too, but they said they could not, because they had to abide by their “data policy.” In the end, I mailed them a check and they uploaded it to my Google Drive account, which apparently was seen as fitting the policy. Amazing how detailed rules prevent people from making intelligent decisions.
Furthermore, in the case of geographic data, most, if not all, of the light blue states have many jurisdictions that are violating their open records laws with GIS and assessor’s data. To give one of probably hundreds of examples, in Illinois, which had the Supreme Court decision, Boone County, a small county of a little over 50,000 people, still charges perhaps $20,000 for their dataset ($0.25-0.50 per parcel for geographic data and $0.10 per parcel for assessor’s data). It is not unheard of for some jurisdictions to charge over a million dollars for their dataset, whether or not it is legal. This is all for one or two files that are sitting on someone’s computer and can readily be sent online in a few minutes for free.
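To make the per-parcel arithmetic concrete, here is a minimal sketch that works backwards from the Boone County figures quoted above. It only uses the numbers in the text (roughly $20,000 total, $0.25-0.50 per parcel for geographic data, $0.10 per parcel for assessor’s data); the function name is my own, and the resulting parcel counts are implied by those rough figures, not taken from the county.

```python
def implied_parcel_count(total_fee, geo_rate, assessor_rate):
    """Number of parcels implied by a total fee and two per-parcel rates."""
    return total_fee / (geo_rate + assessor_rate)


TOTAL_FEE = 20_000       # "perhaps $20,000", as quoted in the text
ASSESSOR_RATE = 0.10     # dollars per parcel for assessor's data

# The geographic rate is quoted as a range, so the parcel count is a range too.
fewest_parcels = implied_parcel_count(TOTAL_FEE, 0.50, ASSESSOR_RATE)
most_parcels = implied_parcel_count(TOTAL_FEE, 0.25, ASSESSOR_RATE)

print(f"Implied parcel count: {fewest_parcels:,.0f} to {most_parcels:,.0f}")
```

In other words, a fee that looks tiny per parcel turns into a five-figure bill for even a small county.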
Does all this sound bizarre? What I am talking about is often a rule, with few exceptions. If you want to see a great depiction of what I and others experience, watch this amazing verbatim reenactment of a deposition of a case with the Ohio Supreme Court, done by the New York Times. I have watched it maybe a dozen times, and it warms my soul.
What I have found is that open data policies are almost universally a failure. All one has to do is have a simple policy, that all public digital data must be downloadable, in a single place online. Sure, it will take time to implement, but all the digital data is already there, and needs to be combined in one place – all one has to do is make a comfortable and reasonable deadline for such a policy or law to come into effect, and do it. If the cost of removing confidential information is untenable or will not bring great benefits, you can make stipulations about that, and make them available as they are requested.
Requiring this will save everyone vast amounts of time and energy in the long run, not to mention enabling people to use your data for greater benefits. In the case of the data I focused on, only three states, marked in dark blue, seem to require that data be online and downloadable for free. Massachusetts seems to have its data online in one place as well, and Texas seems to have a culture of having its data on the sites of the respective “appraisal districts,” though in both states it does not appear to be required by law. I cannot imagine that the effort in doing this would not have paid itself back many times over.
There are additional useful rules of course, but that is the biggest one. Everything below this is an ‘F’. For additional rules, the Sunlight Foundation has an excellent overview of what a complete open data policy should be, which I am guessing only a tiny handful of governments have come close to mandating, if any. I may eventually like to work with the City of Ames (where I live) to adopt these guidelines as laws.
Why governments are charging what they are and why it’s wrong
So why are governments engaging in these policies and behaviors? There are a few reasons it seems, and these are educated guesses – first, it seems to be psychological – geographic maps have always been available to people, and they could come in and photocopy them. They would be charged the cost of the paper, ink and maybe the cost of the photocopy machine. As far as I know, they have never been allowed to recover costs of producing information – but people did have to pay something when it was on paper. But then when they switched to a digital system, it took resources to do that, and they wanted to recover the costs. And since it is a new system that requires essentially duplicating data they already had, they wanted to charge.
How much do governments actually make from selling data? I haven’t studied this in detail, but I did get some information from Los Angeles County – it appears that they make perhaps $100,000-200,000 a year in sales of assessor’s data. Orange County, which has 640,000 parcels and 3 million people, claims it cost $3.5 million to create their dataset. So it may have cost much more for Los Angeles County, which has 10 million people. They may or may not be recovering their initial investment from selling data, I am not sure – I would guess that governments rarely, if ever, do.
In any case, the reason they switched to digital is of course because it is massively enabling, and even saves more money than one spends on it. In a case study of King County, Washington, one economist showed how over an 18-year period, the county spent about $200 million on GIS, while the most conservative estimate is that it brought $776 million-$1.7 billion in benefits, with $5 billion on the most liberal end (see more case studies here). They did mention that assessment data benefits may have reduced in value, but I believe they attributed this to an increase in employees in King County’s case. Keep in mind also that they are talking about the yearly costs – the initial development costs are a tiny fraction, as the numbers that Orange County gave show.
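As a quick check on what those King County figures imply, here is a small sketch that turns them into benefit-to-cost ratios. The dollar amounts are the ones quoted above, in millions; the labels and the function name are my own shorthand, not terms from the study.

```python
COST_MILLIONS = 200  # approximate GIS spending over the 18-year period

# Benefit estimates quoted in the text, in millions of dollars.
BENEFIT_ESTIMATES_MILLIONS = {
    "conservative_low": 776,
    "conservative_high": 1_700,
    "liberal": 5_000,
}


def benefit_cost_ratio(benefit, cost):
    """Dollars of benefit returned per dollar spent."""
    return benefit / cost


ratios = {
    label: round(benefit_cost_ratio(benefit, COST_MILLIONS), 2)
    for label, benefit in BENEFIT_ESTIMATES_MILLIONS.items()
}

for label, ratio in ratios.items():
    print(f"{label}: about ${ratio} returned per $1 spent")
```

Even at the most conservative end, that is nearly four dollars back for every dollar spent.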
Despite all this, the thing is, from the perspective of someone paying, even if something saves you money, you often have to spend money to do it. With new technology, there is perhaps also the fear that you are cutting off your own legs by replacing yourself with a machine and making yourself irrelevant. I can understand why one might want to recover the costs, and I can certainly understand why people would not be enthusiastic about working to replace themselves.
So why is this wrong? When it comes down to it, it seems the feelings and arguments are based on logical fallacies, or altogether baseless:
*There are plenty of cases where people lose their jobs to machines, but I doubt that is the case here – if anything, it seems to retain jobs, even create more, while enabling people and saving them money. In any case, the value to citizens would outweigh these drawbacks, and we are talking about a government, which must make harder decisions than this.
*As far as the costs, this is a fact of any capital investment – there are sunk costs, and the return comes from the benefits of the investment you make. Sometimes departments have budgeting problems, but it does not make sense to try to earn back the cost of something that was already a positive financial investment.
*Most importantly, this is not only illogical, it is unethical. Above all, the data is already paid for by taxpayers, and it is public – a government cannot treat it as its private property. Nor can one claim that the fees return the cost of producing the data to taxpayers: the data is already theirs, so the government is actually charging people twice.
*Selling data or requiring any fees is also incredibly unfortunate – people can benefit enormously from public data, for everything from demographic analyses, to understanding city policies and their effects, to making their communities safe, to understanding the history and contours of their communities.
*Lastly, on top of it all, the only buyers are the richest people and companies, who are willing to pay for it – sometimes governments give it to nonprofits and educational institutions, but even in those cases it’s usually a hassle and you cannot publish everything in its entirety.
There are some jurisdictions that have contracted their parcel data out – a third party develops it and sells it. This makes the data completely inaccessible to anyone but the government. This is the most unfortunate of all, though one can perhaps argue that it is not as unethical as selling public data. And there is perhaps an argument to be made about whether taxpayers should pay for something they actually don’t have to. But it is very possible that the benefits, financial and otherwise, to taxpayers are greater when the data is simply given away for free.
All in all, I can find no reason that a government would be able to justify selling public data from an ethical or a financial standpoint.
And a final note – some may say that charging a nominal fee is reasonable. This is certainly not the case – charging even a penny for your data creates a barrier to entry that prevents people from using it. And in our case, if we had to pay even $5 for a disk in every jurisdiction, as in San Francisco, that would quickly add up. If the marginal cost of providing data is zero, which is always the case for a small number of readily available digital files of a reasonable size, a government should provide the data for free, period.
Finally, how do we build a single dataset of the United States?
For creating a combined nationwide dataset, like most things, it’s often best to start with the easiest part of the job and work toward the hardest. Here the states are broken up into categories by color, and within each category some states are easier than others. One would want to start with the dark blue states – Florida, New Jersey, and Montana. In the case of the light blue states, it would be best to go state by state, in order of where it is easiest to aggregate data. Some states will be easier, like Texas and Massachusetts, where it seems like nearly every place has its data online. In most, however, you would have to show different jurisdictions the law, one by one. We have compiled GIS contact people in all the counties in about half the states, and can draw on that. The strategy in each state would likely be different. Some may have a state agency that enforces the law, and you could literally go to every place with the same letter from that agency saying how they must comply with the law, along with a link to a Google Drive account. One could try something like this in Iowa. One may also want to see what can be done in Illinois, which had the Supreme Court case. In the red states, such as California, it would be impossible without changing laws. This is certainly worth it, and a group should do it somewhere – among them perhaps MapStory Local groups in the future. Along the way, when local jurisdictions see their neighbors opening their data and the positive things they are doing with it, and states see that they are among the red states, it will put on a lot of pressure to change.
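The ordering described above can be sketched as a simple sort. This is only an illustration under my own assumptions: the category labels, the `mostly_online` flag, and the `work_order` function are hypothetical names, and the handful of states listed are just the ones named in this article, not the full spreadsheet.

```python
from dataclasses import dataclass

# Easiest category first: dark blue, then light blue, then red.
CATEGORY_ORDER = {"dark_blue": 0, "light_blue": 1, "red": 2}


@dataclass(frozen=True)
class State:
    name: str
    category: str                # hypothetical label for the map color
    mostly_online: bool = False  # True where most data already seems to be online


def work_order(states):
    """Sort by category, then states with data mostly online first, then by name."""
    return sorted(
        states,
        key=lambda s: (CATEGORY_ORDER[s.category], not s.mostly_online, s.name),
    )


# Only the states mentioned in the article, with the traits described there.
states = [
    State("California", "red"),
    State("Iowa", "light_blue"),
    State("Texas", "light_blue", mostly_online=True),
    State("Florida", "dark_blue"),
    State("Illinois", "light_blue"),
    State("Massachusetts", "light_blue", mostly_online=True),
    State("Montana", "dark_blue"),
    State("New Jersey", "dark_blue"),
]

ordered = [s.name for s in work_order(states)]
print(ordered)
```

A real version would need a finer measure of difficulty within each category, such as whether a state agency enforces the law, but the principle of working from easiest to hardest stays the same.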
Now, a key question one should always ask is – am I duplicating something that is already being done (in this case duplicating the duplication of data)? While starting to write this article, I actually stumbled across a very interesting report from the US Department of Housing and Urban Development (HUD), exploring creating a nationwide “multi-purpose cadastre” (worth skimming both full report & summary). Apparently, various agencies in the federal government have explored this since at least the 1980s. Early on, since many, or most local governments did not use digital data, the National Research Council (NRC) produced a report in 1983 exploring doing it from scratch, mostly by giving matching grants to local governments, which would have cost the federal government about $90 million a year ($210 million in 2015) over 20 years, for a combined local and federal cost of $8.2 billion by 2003, or $10.5 billion today. Over the years, local governments created their own data, and the recommendation of HUD in the 2013 report was to aggregate data from local governments and standardize it, as well as provide resources for states to aggregate data within their states.
While I have only skimmed the report from HUD, it is both interesting and excellent, detailing things very nicely. They took a small sample, and derived their estimates from it. They divided states into four categories, the first of which were essentially the states in dark blue above. In the case of Florida, they were sent a hard drive of everything, and were able to process the whole state in only 4 hours. The second category requires gathering the data over and over. The third and fourth categories are estimated to require 70-85% of the effort, which goes to negotiating agreements and purchasing data from third party providers. They estimated that over the 4-year period they say it would take to build the dataset, all 3,221 counties would require 45,653 person-hours in the first year, winding down to 13,694 person-hours by the fourth year, costing a total of $22 million. They used a multiplier of $200 per person-hour – I have no idea where that number comes from. Suffice it to say, it seems that either they are being very conservative, or possibly very inefficient or very costly, or a combination. Whether or not the hours and/or the cost per hour are high, $22 million is dirt cheap for the benefits it will provide – again, I can’t imagine that it would not save and enable that amount of resources for the federal government many times over, given the 25+ agencies they identified that would use the dataset.
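For a rough sense of how those numbers fit together, here is a back-of-the-envelope sketch using only the figures quoted above. It assumes that the whole $22 million is labor billed at $200 per person-hour, which is my simplification and not something I have confirmed in the report; the variable names are my own.

```python
HOURLY_RATE = 200          # dollars per person-hour, as quoted
TOTAL_COST = 22_000_000    # dollars over the 4-year period, as quoted
YEAR1_HOURS = 45_653       # person-hours in the first year
YEAR4_HOURS = 13_694       # person-hours in the fourth year

# ASSUMPTION: the total is purely labor at the quoted rate.
implied_total_hours = TOTAL_COST / HOURLY_RATE

# Whatever is left after years 1 and 4 must fall in years 2 and 3.
middle_years_hours = implied_total_hours - YEAR1_HOURS - YEAR4_HOURS

print(f"Implied total person-hours: {implied_total_hours:,.0f}")
print(f"Implied person-hours for years 2 and 3 combined: {middle_years_hours:,.0f}")
print(f"Year 1 labor cost alone: ${YEAR1_HOURS * HOURLY_RATE:,}")
```

Under that assumption the total works out to about 110,000 person-hours, so the first year alone would account for a little over $9 million of the estimate.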
While I will certainly follow up with the people who did the study, the key part here is the last two categories – regardless of whether they receive data, much of it will not be accessible by the public. For one, the report seems to mention nothing about jurisdictions violating state laws. Perhaps they are not considering that, or perhaps they want to maintain a good relationship with jurisdictions and states. Regardless, requiring local governments to follow their state’s laws is the way to do it, whether or not the federal government is willing to take this route. And for the red states, it will take civic groups to work to change the laws. Slowly but surely, the world’s data will be opened and combined, one place at a time. It will be interesting to see how that plays out (which will be a mapstory unto itself).