Which programming language is best for economic research?
Which programming language is best for economic research: Julia, Matlab, Python or R?
While a large number of general-purpose programming languages are used in economic research, we suspect the four most common are Julia, R, Matlab, and Python. When we looked at this last time here on VoxEU (Danielsson and Fan 2018) two years ago, we concluded that R was the best in most cases. With all the developments since then, is R still in the lead?
There is, of course, no single way to answer the question — depending on the project, any of the four could be the best choice.
To narrow the question down, we have three separate criteria in mind, all drawn from our work.
First, one of us has written a book called Financial Risk Forecasting (Danielsson 2011), accompanied by practical implementations in all four languages, which provides an ideal test case for the power of the libraries available to researchers.
We have two additional criteria common in data science: importing a very large dataset and a computationally intensive subroutine.
The quality of the language
Matlab dates back almost half a century and has been a reliable workhorse for economic researchers ever since. But while it does slowly add new features, it is still held back by poor design choices.
R, in the form of its precursor SPlus, also dates back to the 1970s. It was initially conceived as a language for statistical computing and data visualisation. Like Matlab, it is hampered by poor design, but the richness of its libraries makes it perhaps the most useful of the four today.
Python, unlike the other three, started out as a general-purpose programming language used for file management and text processing. It is also really good at interacting with external libraries – the reason it is widely used in machine learning.
However, it is not a good language for general numerical programming. The main numerical and data libraries have been clumsily grafted onto it, so it is unnatural, hard to work with, and prone to errors that are hard to diagnose. For example, suppose you have two matrices, X and Y, and want to multiply them. You have to write:
while in Matlab and Julia the line is simply:
X * Y
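For the Python side of this comparison, a minimal sketch follows. It assumes X and Y are NumPy arrays; the choice of NumPy and the example values are assumptions made here for illustration. On NumPy arrays the plain `*` operator multiplies element by element, so the matrix product has to be requested explicitly.

```python
import numpy as np

# Illustrative 2x2 matrices (values chosen only for this example).
X = np.array([[1.0, 2.0],
              [3.0, 4.0]])
Y = np.array([[0.0, 1.0],
              [1.0, 0.0]])

# Matrix product: an explicit function call (or the equivalent @ operator).
product = np.matmul(X, Y)
same_product = X @ Y

# Element-wise product: what the plain * operator gives on NumPy arrays.
elementwise = X * Y
```

Confusing the two operators is one example of the hard-to-diagnose errors mentioned above, since both run without complaint on square matrices.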
Julia is the newcomer at only eight years old, and it shows. It doesn’t have any historical baggage, and as a result, the code is clean, fast and less error-prone than the others. Not surprisingly, it has been adopted in high quality projects, such as Perla et al. (2020).
One advantage is that it allows Unicode in equations, so Greek letters and other characters can be used in calculations. For example:
μ = 1 + 2 * θ + ε
When ranking the inherent suitability of the four languages for numerical computations, we consider Julia the best, followed by R, then Matlab, with Python the worst.
Speed
Julia was designed with speed in mind, taking advantage of modern compiler techniques, and is generally the fastest of the four.
Consequently, it doesn’t require the programmer to use complicated techniques for speeding up code, resulting in Julia’s code being both more readable and faster.
The slowest of the four languages is Python, but it does offer an excellent just-in-time compiler, Numba, that can significantly speed up computations where it can be applied (unfortunately, this is only in very simple calculations).
To evaluate speed, we also conduct three experiments.
The first is an experiment with the GARCH log-likelihood function. Since it is both iterative and dynamic, it captures a large class of numerical problems encountered in practice. When using Python, we use both pure Python and a version pre-compiled with Numba. For R, we tried both pure R and a C++ implementation (Rcpp). For comparison, we also do a C implementation.
Our results show that C is the fastest, with Rcpp not far behind, followed by Numba and Julia. All of these exhibit excellent speed. Matlab is quite a bit slower, followed by R, with Python by far the slowest.
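The kind of function timed in the first experiment can be sketched in pure Python as below. This is a minimal illustration of a Gaussian GARCH(1,1) log-likelihood, not the code from the web appendix: the function name, the parameter names (omega, alpha, beta), the initialisation of the variance at the sample variance, and the toy return series are all assumptions made here. The loop shows why the problem is iterative and dynamic: each conditional variance depends on the previous one.

```python
import math

def garch_loglik(returns, omega, alpha, beta):
    """Gaussian GARCH(1,1) log-likelihood for a demeaned return series."""
    n = len(returns)
    # Assumption: start the recursion at the sample variance.
    sigma2 = sum(r * r for r in returns) / n
    loglik = 0.0
    for t in range(1, n):
        # Conditional variance depends on yesterday's return and variance.
        sigma2 = omega + alpha * returns[t - 1] ** 2 + beta * sigma2
        loglik += -0.5 * (math.log(2.0 * math.pi) + math.log(sigma2)
                          + returns[t] ** 2 / sigma2)
    return loglik

# Toy series purely for illustration.
toy_returns = [0.01, -0.02, 0.015, -0.005, 0.02, -0.01]
value = garch_loglik(toy_returns, omega=1e-5, alpha=0.1, beta=0.8)
```

A Numba variant would typically wrap the same loop with Numba's `njit` decorator and pass a NumPy array instead of a list; that step is left out here to keep the sketch dependency-free.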
The second experiment measures the loading time for a very large CSV data set, CRSP, which is almost 8 GB uncompressed and over 1 GB compressed. We read in both the compressed and uncompressed files.
In this experiment, R is by far the fastest, followed by Python and Julia, with Matlab trailing.
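A loading-time measurement of this kind can be sketched as follows. The CRSP files themselves are not reproduced here, so the sketch writes a small synthetic CSV (hypothetical columns `permno`, `date`, `ret`) to a temporary directory, compresses it with gzip, and times reading both versions with the standard library. It illustrates the shape of the measurement only; it is not the benchmark code from the web appendix.

```python
import csv
import gzip
import os
import tempfile
import time

def read_rows(path):
    """Read a CSV file, transparently handling a .gz suffix."""
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rt", newline="") as handle:
        return list(csv.DictReader(handle))

def timed_read(path):
    """Return (rows, elapsed seconds) for one read of the file."""
    start = time.perf_counter()
    rows = read_rows(path)
    return rows, time.perf_counter() - start

with tempfile.TemporaryDirectory() as tmp:
    plain_path = os.path.join(tmp, "sample.csv")
    gz_path = plain_path + ".gz"

    # Synthetic stand-in data; column names are hypothetical.
    with open(plain_path, "w", newline="") as handle:
        writer = csv.writer(handle)
        writer.writerow(["permno", "date", "ret"])
        for i in range(1000):
            writer.writerow([10000 + i % 10,
                             f"2019-01-{1 + i % 28:02d}",
                             0.001 * (i % 7)])

    # Compress the same bytes so both reads see identical content.
    with open(plain_path, "rb") as src, gzip.open(gz_path, "wb") as dst:
        dst.write(src.read())

    plain_rows, plain_seconds = timed_read(plain_path)
    gz_rows, gz_seconds = timed_read(gz_path)
```

On a file this small the timings are dominated by noise; the comparison only becomes meaningful at sizes closer to those described above.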
The final experiment records the processing time for a typical calculation, where we find the annual mean and volatility of each of the stocks in the CRSP database.
In this experiment, Julia is the fastest, followed by R, with Matlab again trailing badly. These findings are in line with results in Aruoba and Fernández-Villaverde (2018).
Details and code can be found in our web appendix (Aguirre and Danielsson 2020).
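The calculation in the third experiment amounts to a group-by over stock and year. A minimal standard-library sketch follows, assuming hypothetical field names (`permno`, `date`, `ret`), ISO-formatted dates, volatility measured as the sample standard deviation of returns, and made-up observations.

```python
from collections import defaultdict
from statistics import mean, stdev

def annual_stats(rows):
    """Map (stock id, year) -> (mean return, sample standard deviation)."""
    groups = defaultdict(list)
    for row in rows:
        year = int(row["date"][:4])  # assumes ISO dates, e.g. 2019-03-29
        groups[(row["permno"], year)].append(float(row["ret"]))
    return {
        # A single observation has no sample deviation, so report NaN.
        key: (mean(values), stdev(values) if len(values) > 1 else float("nan"))
        for key, values in groups.items()
    }

# Made-up observations for illustration only.
sample = [
    {"permno": "10001", "date": "2018-01-31", "ret": "0.01"},
    {"permno": "10001", "date": "2018-02-28", "ret": "0.03"},
    {"permno": "10001", "date": "2019-01-31", "ret": "-0.02"},
    {"permno": "10002", "date": "2018-01-31", "ret": "0.00"},
    {"permno": "10002", "date": "2018-02-28", "ret": "0.02"},
]
stats = annual_stats(sample)
```

In practice this would more likely be written with a data-frame library's group-by; the explicit dictionary is used here only to make the steps visible.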
In conclusion, Julia is generally the fastest and requires the least amount of tricky coding to run fast. Any of the others could be the second best, depending on the application and the skill of the programmer.
Working with data
Researchers often have to grapple with large data read from and written to a number of different formats, including text files, CSV files, Excel, SQL databases, noSQL databases and proprietary data formats, either local or remote.
One might think that Python would excel at this, but to our surprise, it did not. While it has libraries that can handle almost every common data task, they are cumbersome and unnatural. For example, if we want read-write access to an element of a DataFrame M, we have to use:
In R or Julia, the line is simply:
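On the Python side of this comparison, a minimal sketch follows, assuming M is a pandas DataFrame; pandas, the column names and the values are assumptions made here for illustration. Element access goes through an indexer attribute rather than plain brackets on the frame.

```python
import pandas as pd

# Small illustrative frame; names and values are made up.
M = pd.DataFrame({"price": [10.0, 11.0, 12.0],
                  "volume": [100, 200, 300]})

# Read one element by integer position (row 1, column 0).
value = M.iloc[1, 0]

# Write to the same element by position.
M.iloc[1, 0] = 11.5

# Label-based access uses a different indexer.
updated = M.loc[1, "price"]
```

Positions are zero-based here, unlike in R and Julia, which is a further source of friction when moving code between the languages.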
Matlab traditionally offered only numerical matrices and did not handle strings well. While it has improved considerably in recent years, it still is much more limited than the other three.
In conclusion, of the four languages, R is the best for working with different data formats, followed by Julia, then Python, with Matlab last.
Libraries
While each of the four languages provides a basic foundation for calculations, most researchers will end up using third-party libraries.
It is with libraries that network externalities become most important: researchers developing new computational techniques prefer the most popular platforms, and those doing research gravitate to the platforms with the most libraries. This virtuous cycle has strongly favoured R.
While Matlab has a lot of built-in functionality, it is the poorest of the four in terms of external libraries because it is proprietary and library developers prefer open languages.
Python, on the other hand, has by far the best selection of libraries for dealing with file systems, text, web scraping, databases, and machine learning. However, it has few statistical libraries that would be useful in economic research.
Julia, being the newcomer, is still catching up. While it has rapidly increased its universe of libraries useful for economic research, and generally has richer libraries than both Matlab and Python, it is still bested by R.
R is by far the richest. It has a library for almost every possible statistical calculation one could imagine. The downside is that some of these are old, of low quality, or badly documented, and there are often multiple packages for the same functionality.
That said, Python, Julia and R can all call functions from each other. Thus, libraries in one can be used in all. We caution, however, against relying too much on such cross-language functionality, because it introduces the potential for instability and hard-to-diagnose errors. It is much more natural and robust to work in just one language.
Hence, in terms of libraries, R is the best, followed by Julia, and then Python and finally Matlab.
Graphics
While all four languages are able to output high-quality graphics, in our view, R is head and shoulders above the other three. Not surprisingly, both the New York Times and BBC use R for their graphics, and you can even download the BBC’s library (BBC 2020).
There is little differentiating the other three. It is quite easy to make high-quality plots in Matlab, but the options are limited.
There are more options available in both Python and Julia, but they are cumbersome and unstable.
Therefore, for graphics, R is the best, followed by Python and Julia, with Matlab again last.
Ease of use
Matlab has traditionally been the easiest language to use. It has a high-quality integrated development environment and has by far the best documentation of the four languages.
R is not far behind in terms of integrated development environments (IDEs). However, its documentation is not as good. Both Python and Julia offer IDEs, but they are not as good as those of Matlab and R.
Thus, in terms of ease of use, especially for novice users, Matlab and R are the best, followed by Julia, with Python last.
Licensing
Three of these languages (Julia, Python, and R) are open source, so they are available on almost any platform and can be installed without paying or getting permission. Matlab is commercial; for pricing, see Mathworks (2020).
Hence, in terms of licensing and cost, Matlab is the worst and the other three are equal.
Conclusion
None of the four languages is universally the best. The recommended language is still the one the researcher is most comfortable with.
However, for new projects and especially new researchers not committed to a language, the picture changes.
Matlab and R benefit from being the veterans: you can do almost anything you want with them. However, their age shows, and Matlab in particular has not been able to keep up. Consequently, we cannot recommend Matlab for new projects. R continues to be an excellent choice because of its unparalleled libraries. The language itself, however, leaves something to be desired.
Python was designed for other purposes, at which it excels, but we cannot recommend it for general-purpose numerical programming except in applications that play to its strengths, such as machine learning.
What about Julia? It is the most modern language, elegant and fast, with rapidly growing library support. The danger with a new language like Julia is that it fades away. Having committed considerable time and energy to a language, it is rather frustrating to see it lose traction.
When we made this comparison two years ago, we recommended R but said Julia was the language to look out for. At the time, Julia was developing very rapidly, so code was breaking between releases, and its long-term survivability was in doubt. Julia has now stabilised, and its long-term future is increasingly assured.
As a consequence, Julia is the language we now tend to pick for new projects and generally recommend.
Authors’ note: We thank the Economic and Social Research Council (UK) [grant number ES/K002309/1] and the Engineering and Physical Sciences Research Council (UK) [grant number EP/P031730/1] for their support.