Genealogical Data Quality

The Convenience/Accuracy Tradeoff


This is an unfortunate but entirely natural circumstance, and one that we all need to keep in mind as we gather our historical data. Here’s how it is:

The most accurate records are the originals, but they are also the most inconvenient, if not downright inaccessible. For the sake of efficiency, millions of records have been extracted from the originals. Almost without exception, these extractions drop vital information and clues that exist in the original document. As the data is processed further to make it easier to search, it undergoes further degradation, even in the best of circumstances. For example, your author was involved in a small genealogical project using baptisms from 20 adjacent English parishes. The data was independently extracted by two different individuals, and the computer was used to match and merge these two data streams. Mismatches were directed to a third person who had more expertise in reading old documents, and a decision was made as to how the record would read from then on. These decisions are often judgement calls; there is no way around that fact. So as the data is made more accessible and more convenient, it is corrupted to a greater or lesser degree.
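
The match-and-merge step in that project can be pictured with a minimal sketch. This is not the program the project used; the record fields, the idea of keying each entry by parish and register position, and the sample names are all assumptions made for illustration. It shows only the core idea: records on which both extractors agree pass through, and everything else is set aside for the third reader.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Baptism:
    # All field names are hypothetical; a real extraction would carry more.
    parish: str
    entry_no: int   # assumed position of the entry in the parish register
    child: str
    father: str
    date: str       # kept as transcribed text, not parsed


def compare_extractions(first, second):
    """Split two independent extractions into agreed records and mismatches."""
    by_key_a = {(r.parish, r.entry_no): r for r in first}
    by_key_b = {(r.parish, r.entry_no): r for r in second}
    agreed, mismatches = [], []
    for key in sorted(by_key_a.keys() | by_key_b.keys()):
        a, b = by_key_a.get(key), by_key_b.get(key)
        if a is not None and a == b:
            agreed.append(a)            # both extractors read it the same way
        else:
            mismatches.append((a, b))   # differs, or one extractor missed it
    return agreed, mismatches


# Made-up sample data: the two extractors disagree on one father's name.
first = [
    Baptism("Ashby", 1, "John Smith", "George Smith", "12 Mar 1701"),
    Baptism("Ashby", 2, "Mary Cole", "Thomas Cole", "19 Mar 1701"),
]
second = [
    Baptism("Ashby", 1, "John Smith", "Gregory Smith", "12 Mar 1701"),
    Baptism("Ashby", 2, "Mary Cole", "Thomas Cole", "19 Mar 1701"),
]

agreed, mismatches = compare_extractions(first, second)
for a, b in mismatches:
    print("Needs expert review:", a, "vs", b)
```

Note what the sketch cannot do: when both extractors make the same misreading, the records agree and the error passes through unnoticed. Agreement between two copies is evidence of a correct reading, not proof of it.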

This underscores the absolute necessity of going back to the most original record available before deciding to adopt its data. Even then, there will be errors, which can enter at any of the stages described below:

  1. The testator doesn’t know the facts, or fabricates them for the recorder.
  2. The recorder doesn’t understand or incorrectly interprets the words of the testator, or records them later from a flawed memory, or lets his personal mores bias or corrupt his recording.
  3. The recorder inadvertently mixes data from two events.
  4. The recorder makes an error while writing: omissions, exchanges, wrong dates, wrong names (“I thought the father’s name was George, not Gregory.”), etc.
  5. The extractor/indexer misreads or can’t read the record, or, when in doubt, knowingly adds his own idea of “how it must have been.”
  6. The extractor/recorder makes a copying error. (We know by statistical analysis that this usually affects about 2% of the data.)
  7. Computer programming errors drop, add or mix data.
  8. The presentation of the data to the genealogist is counterintuitive and is misinterpreted.
  9. The genealogist makes a copying error.
  10. The reader misreads.
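
To get a feel for how errors accumulate along this chain, here is a small illustrative calculation. The only figure taken from the list above is the roughly 2% copying error rate in item 6. Treating that figure as a per-item rate, applying the same rate to the genealogist's own copying in item 9, and assuming the two steps err independently are all assumptions made for the sake of the example.

```python
def survival_probability(error_rates):
    """Chance that an item passes every stage unaltered, assuming
    the stages introduce errors independently of one another."""
    p = 1.0
    for rate in error_rates:
        p *= 1.0 - rate
    return p


# Item 6 (extractor copies) and item 9 (genealogist copies).
# 0.02 for item 6 comes from the text; 0.02 for item 9 is an assumption.
copy_steps = [0.02, 0.02]

p_intact = survival_probability(copy_steps)
print(f"Chance an item survives both copying steps: {p_intact:.4f}")
# prints "Chance an item survives both copying steps: 0.9604"
```

Under those assumptions, about 4% of items would be altered by the two copying steps alone, before any of the other eight sources of error is counted.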