Suppose that I have three tab-separated value data files: 2011.txt, 2012.txt, and 2013.txt. Each file has the same format, where rows are like this:
UserID Data Data Data ...
Each file contains data only for the year it is named after. I would like to throw out all data in these files for UserIDs that do not appear in either the preceding or the following year. That is, I only want to keep data for UserIDs that I can track for at least two years in a row. How can I go about doing this?
My usual tools for manipulating data files like this are vim and simple perl commands with regexes from the command line. If there is a way to do this using those tools, I'd like to do it that way, but I am open to learning new tools.
As an outline, I'm thinking:
run through each UserID in 2011.txt
    if UserID doesn't appear in 2012.txt, delete this row from 2011.txt
run through each UserID in 2012.txt
    if UserID doesn't appear in either 2011.txt or 2013.txt, delete this row from 2012.txt
run through each UserID in 2013.txt
    if UserID doesn't appear in 2012.txt, delete this row from 2013.txt
But I've never modified files in a way that accesses multiple files like this.
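To make the outline concrete, here is a rough sketch of the logic in Python (not one of my usual tools, so treat it as an illustration rather than a tested solution). It assumes the UserID is the first tab-separated field on every row and that all three files fit in memory as sets of IDs. The function names `read_ids` and `filter_consecutive` are made up for this example.

```python
import os
import tempfile


def read_ids(path):
    """Return the set of UserIDs (first tab-separated field) found in a file."""
    with open(path, encoding="utf-8") as f:
        # Skip blank lines; assume the UserID is everything before the first tab.
        return {line.split("\t", 1)[0].rstrip("\n") for line in f if line.strip()}


def filter_consecutive(paths):
    """Rewrite each file, keeping only rows whose UserID also appears
    in the file for the preceding or the following year.

    `paths` must be ordered by year, e.g. 2011, 2012, 2013.
    """
    # Collect all ID sets first, so every comparison uses the original data.
    ids = [read_ids(p) for p in paths]

    for i, path in enumerate(paths):
        neighbours = set()
        if i > 0:
            neighbours |= ids[i - 1]      # preceding year
        if i + 1 < len(paths):
            neighbours |= ids[i + 1]      # following year
        keep = ids[i] & neighbours

        # Write the surviving rows to a temporary file in the same directory,
        # then swap it in place of the original.
        dir_name = os.path.dirname(os.path.abspath(path))
        with open(path, encoding="utf-8") as src, tempfile.NamedTemporaryFile(
            "w", dir=dir_name, delete=False, encoding="utf-8"
        ) as dst:
            for line in src:
                if line.split("\t", 1)[0].rstrip("\n") in keep:
                    dst.write(line)
        os.replace(dst.name, path)
```

Calling `filter_consecutive(["2011.txt", "2012.txt", "2013.txt"])` would overwrite the three files, so I would run it on copies first. Something equivalent in perl is what I'm really after, if that's possible.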