There is many a data science meme disparaging Excel:
(Google Sheets seems to have escaped most of the memes here.)
While I no longer use it regularly for analysis, I will always have a soft spot in my heart for Excel1. Furthermore, using a “correct” set of data science tools often requires a bridge2. Integrating a rigorous component into a messy spreadsheet-based pipeline can be an initial step towards the pipeline (or team, or organization) starting down a path of continuous improvement in its processes3. Also, spreadsheets are foundational to many (probably most) BizOps teams and are therefore sometimes unavoidable…
In this post I will walk through a short example and some considerations for when you might decide (perhaps against your preferences) to integrate your work with extant spreadsheets or shadow “pipelines” within your organization.
Macro in the Shell
Say your finance organization has some workbook that contains fancy accounting triggered by a VBA macro, the output of which you need in your own script-based data pipeline (but which also may not be in perfect order4…). You don’t want to go through the effort of reproducing (or worse, being responsible for maintaining) the complicated (and potentially changing) logic that lives within the spreadsheets / macros5. You’ve resolved instead to simply “call” finance’s spreadsheet / macro / logic in some way.
Often the VBA from finance will live in a macro-enabled workbook (.xlsm file). To trigger it programmatically, you generally want to write a wrapper VBScript that can then be run through the shell (Stack Overflow thread)6. The VBScript will open the workbook and trigger the macro(s) of interest7.
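As a rough sketch of the pattern (untested, Windows-only): the R script below writes a minimal VBScript wrapper to disk and then shells out to cscript. The workbook path, macro name, and file names are hypothetical placeholders.

```r
# Sketch only: workbook path, macro name, and file names are hypothetical.
wrapper <- c(
  'Dim xl, wb',
  'Set xl = CreateObject("Excel.Application")',
  'xl.Visible = False',
  'xl.DisplayAlerts = False',
  "' Workbook path comes in as the first shell argument",
  'Set wb = xl.Workbooks.Open(WScript.Arguments(0))',
  "' Run the macro of interest (hypothetical name)",
  'xl.Run "CalculatePresentValue"',
  'wb.Save',
  'wb.Close False',
  'xl.Quit'
)
writeLines(wrapper, "run_macro.vbs")

# Trigger the wrapper through the shell; cscript ships with Windows
system2("cscript", args = c("//NoLogo", "run_macro.vbs",
                            "C:\\finance\\present_value.xlsm"))
```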
Note that saying “Use a shell script” is, in a way, almost always an answer for how to incorporate another technology into a pipeline. This is more a reminder that many tools designed primarily for interactive use also have a batch mode8. Here I’m writing about triggering VBA macros, but integrating GUI-based data pipelining tools like Orange or Knime into your pipeline can be set up similarly.
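For instance, KNIME workflows can be run headlessly via its batch application. The sketch below is illustrative only: the install path and workflow directory are made up, and the exact flags vary by version, so check KNIME’s batch execution documentation.

```r
# Illustrative only: run a KNIME workflow headlessly from R.
# Paths are placeholders; consult KNIME's batch execution docs for the
# exact options supported by your installation.
system2(
  "C:\\Program Files\\KNIME\\knime.exe",
  args = c(
    "-nosplash",
    "-application", "org.knime.product.KNIME_BATCH_APPLICATION",
    "-workflowDir=C:\\knime-workspace\\my_workflow",
    "-reset"
  )
)
```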
Passing in arguments
You can pass arguments to a VBScript through the shell (SO thread). Though, given that you are already using spreadsheets, it’s sometimes easier to write data to pre-determined locations or cells (this is often how the workbook was set up to be used anyway).
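As a sketch of the pre-determined-cells approach, assuming the {openxlsx} and {readxl} packages and made-up sheet names and cell locations (verify that whatever writer you use preserves the VBA in your .xlsm, or have the macro workbook pull from a separate plain workbook):

```r
library(openxlsx)  # for writing into the workbook
library(readxl)    # for reading results back out

path <- "present_value.xlsm"  # hypothetical workbook

# Toy predictions produced earlier in the pipeline
predictions <- data.frame(deal_id = 1:3, predicted_value = c(100, 250, 75))

# Write inputs to the agreed-upon location (sheet "inputs", starting at B2)
wb <- loadWorkbook(path)
writeData(wb, sheet = "inputs", x = predictions, startCol = 2, startRow = 2)
saveWorkbook(wb, path, overwrite = TRUE)

# ...trigger the macro via the shell, then read results from the agreed-upon range
results <- read_excel(path, sheet = "outputs", range = "B2:C5")
```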
Example
See brshallo/macro-shell-example for a walk-through involving evaluating the present values of predicted deals.
In the example I…
- pass predictions programmatically into an Excel workbook –>
- that calculates present value via macros and Excel formulas –>
- which is then read back from the spreadsheet programmatically.
These steps are orchestrated via a “run-all.R” script. With a little more effort these could be formalized via targets (or your pipeline toolkit of choice).
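For instance, a hypothetical _targets.R mirroring those steps might look like the sketch below (read_predictions(), run_present_value_macro(), and read_present_values() are stand-ins for the real helper functions, not code from the example repo):

```r
# _targets.R (sketch): the helper functions are hypothetical stand-ins
library(targets)
tar_source("R")  # load helper function definitions (hypothetical location)

list(
  tar_target(predictions, read_predictions("data/predictions.csv")),
  tar_target(workbook, run_present_value_macro(predictions)),  # shells out to the VBScript wrapper
  tar_target(present_values, read_present_values(workbook))
)
```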
Setting up Guard Rails
There are many unavoidable limitations to any spreadsheet dependent data pipeline. But here are a few things you can do to keep things sane:
- Think of your relationship with the workbook/macro/document in a somewhat similar way to how you might consider your relationship with an API or database you depend on. Get the owner of the document to think similarly – that is, as the owner of a standardized interface.
Identify which parts of the document need to stay consistent for your pipeline to keep working9 (a rough sketch of such checks follows this list), e.g.
- Location(s) where new data is to be inputted (e.g. a database, folder of .csv files, arguments passed through the shell script)
- Location(s) where you expect to retrieve data
- Names, data types, and possible values10
These are not much different from the kinds of considerations that come up when collaborating on any data pipeline11. The responsible party at each step has to adhere to certain expectations about the form in which the data comes in and the form in which it proceeds to the next step. Lines of communication should be open so that as changes occur to the tool, everyone (who needs to be) is made aware12.
- Set up notifications / methods of contact.
Particularly if the pipeline is used in the wild13, make it easy to get notified when things do break14.
- Basic optimization
If you are using spreadsheets or VBA macros in your data pipeline you probably aren’t worried too much about performance, but there may be a few things you can do to be more efficient.
For example, in the macro-shell-example the VBScript (headlessly) opens and closes the Excel document after processing each deal. Processing five deals therefore spends compute on four unnecessary open/close cycles. This wasted processing could be corrected with small changes to the VBScript.
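As a rough sketch of the interface checks and failure notifications described above (the sheet name, column names, email addresses, and SMTP details are all hypothetical):

```r
library(readxl)
library(blastula)  # for the failure email; any notification mechanism works

check_present_values <- function(path) {
  pv <- read_excel(path, sheet = "outputs")  # hypothetical sheet name

  # Guard rails: fail loudly if the interface has drifted
  stopifnot(
    all(c("deal_id", "present_value") %in% names(pv)),  # expected columns
    is.numeric(pv$present_value),                       # expected type
    all(pv$present_value >= 0, na.rm = TRUE)            # plausible values
  )
  pv
}

result <- tryCatch(
  check_present_values("present_value.xlsm"),
  error = function(e) {
    # Hypothetical addresses and SMTP settings; adjust for your setup
    smtp_send(
      compose_email(body = md(paste("Workbook pipeline failed:", conditionMessage(e)))),
      to = "data-team@example.com",
      from = "pipeline-bot@example.com",
      subject = "Present value pipeline error",
      credentials = creds_envvar(user = "pipeline-bot@example.com",
                                 host = "smtp.example.com", port = 587)
    )
    stop(e)
  }
)
```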
Closing
Data Scientists should still consider their work in a context of growth & development. I am reminded of the Alan Watts quote:
“People who are responsible for technical development [must] be well-imbued with an ecological philosophy and see the direction of things so they will not keep perpetuating anachronisms.”
For further reading on how data scientists should think about integrating their knowledge into sometimes lower-tech organizational contexts, see the excellent post by Allison Horst and Jacqueline Nolis:
"Merge conflicts: helping data science students merge their advanced skills into existing teams…
What do we do about students trained in R and Python for jobs with Excel, Google Sheets and Access?"
Appendix
Other Resources
A few other tools worth being aware of if you regularly interface with office products from R.
For integrating with Google Sheets there is googlesheets4.
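For instance (with a hypothetical sheet ID and worksheet names):

```r
library(googlesheets4)

ss <- "1AbC...xyz"  # hypothetical sheet ID or URL

# Read a worksheet into a data frame
dat <- read_sheet(ss, sheet = "inputs")

# Write results back to a pre-determined worksheet
sheet_write(dat, ss = ss, sheet = "results")
```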
For those without CS backgrounds, Excel or BI tools are typically the first analytics tools you get exposed to. Excel’s “Record Macro” feature (which lets you watch VBA commands being generated as you click through Excel’s GUI and spreadsheets) was a helpful step towards me feeling comfortable with scripting. I also believe in the saying that it doesn’t matter so much what you have as how you use it (this applies broadly to analytics technologies) and have seen many people make impressive things happen with it.↩︎
For me the bridge came from a friend in my Master’s program who inspired me to embrace R – before that, I’d had the moniker of “excel wizard” from my classmates in graduate school.↩︎
Or it may simply add another layer of complexity.↩︎
e.g. it is being run on a local machine, it is not updated on a fully automated basis, some parts have to be inputted manually, etc. EVERY DATA SCIENTIST HAS THEIR SKELETONS!↩︎
It may not be possible anyways if there is already an established user base for the existing tool… or may require a whole change management process which you can’t commit to.↩︎
You can run shell commands through R via the system() or shell() functions (SO thread) – see Related Alternative in the Appendix.↩︎
Finance might have a VB wizard that can set this up for you.↩︎
You could even say this about {dplyr} and tidyverse tools, which are largely designed with the intent of making interaction and fast iteration easy at the expense of making programming, functional programming, and automation slightly more difficult.↩︎
While ensuring that whatever understanding you have allows flexibility for the owner to adjust the tool (but in a way that keeps in mind dependencies).↩︎
Nothing is more frustrating than when a data source changes arbitrarily and without notice.↩︎
Even those handled entirely programmatically.↩︎
These guidelines/requirements are pretty similar to those you would have on any collaborative project.↩︎
I.e. being used in some way by people other than yourself or your immediate team.↩︎
For informal pipelines such as the ones you are likely looking at in regards to this post, this can be as simple as including an email address that people can reach out to if things break, or triggering an auto-generated email populated with the error message, e.g. via the blastula package in R.↩︎