RusGenProject Data Extraction

My first assignment as a RusGenProject intern was to do some data extraction. With substantial help from a long-time friend who served as much more than a de facto consultant on this project, I was able to dust off my old programming skills.

Together we wrote and debugged the following script.

The owner of the company sent me files that had been scrubbed from the Russian archives. These files contain names and birth dates of Russian people -- information that needs to be made searchable in a database.

Needless to say, we can't expect a computer to read this mess:

So, my friend and I came up with the following script:

This small computer program works by breaking the file down into little pieces in an organized way, then rebuilding them into a format that can be imported into a database.

We start by breaking the file apart by the "newline characters" -- the little "CR" tags you see at the end of the lines in the source file. Then we use special built-in functions to remove garbage from the file -- all the other black tags and empty lines. Then there are a whole bunch of nitty-gritty details that have to be attended to in order to organize all the data into columns and rows in memory. Once the data is organized and cleaned, we then tell the computer to print it all out into a text file in a tab-delimited format.

For testing purposes, I opened it in Excel.

And... voila!

Russian genealogy data is now in a searchable format for those who are looking for their ancestors. Now all we have to do is get the hundreds of other files like this one ready and upload them into the database so people can access them. Of course, per Russian law, the data has to be removed from my computer and sent back to Russia for safekeeping.