Add Interaction Variables as Needed

In today’s edition of Becoming a Functional Data Analyst, I am writing to remind myself not to create interaction variables during data munging. That is, when acquiring, cleaning, and aggregating data, I find it easier not to build the interactions I will later need. Instead, it is easier to keep variables as they are and interact them only when necessary.

I learned this after way too much tribulation. I used to interact variables while aggregating them. Doing this requires knowing which interactions will be needed later, and you have only a faint idea of that when you start processing data. Once you are ready to explore your data and build models, you then have to go back deep into your pipeline to change or add interaction variables.

That process is needlessly complicated. Instead, just interact variables as you need them. For example, if your regression needs to interact Age with Gender, then add Age*Gender to the formula in the lm() call. Do not create the interaction at some earlier step in your work. Interacting when needed will save you lots of complexity, time, and headaches. If only I had realized this years ago.
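
Here is a minimal sketch of what that looks like. The data are simulated and the Income outcome is something I made up for illustration; only Age, Gender, and lm() come from the example above.

```r
# Simulated data; the Income column and all values are hypothetical.
set.seed(1)
n <- 200
df <- data.frame(
  Age    = round(runif(n, 18, 80)),
  Gender = factor(sample(c("F", "M"), n, replace = TRUE))
)
df$Income <- 20 + 0.5 * df$Age + 5 * (df$Gender == "M") +
  0.2 * df$Age * (df$Gender == "M") + rnorm(n)

# The interaction is declared in the formula, not stored in df.
# Age * Gender expands to Age + Gender + Age:Gender.
fit <- lm(Income ~ Age * Gender, data = df)

# The data frame still holds only the raw variables plus the outcome.
coef_names <- names(coef(fit))
print(coef_names)
print(names(df))
```

The interaction term shows up among the model coefficients, while the data frame itself never gains an extra column. If a later model needs a different interaction, only the formula changes, not the pipeline.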
