Py + SQL + R == ☺

Why the UCLA stats department needs to introduce a Python and SQL course to supplement our R classes

Danny D. Leybzon


Update: We did it! I met with the Chair of the Statistics Department (Professor Mark Handcock), and he told me that, barring any unexpected hurdles, the eminently qualified Miles Chen will likely be teaching a new course, “Data Technologies” (STATS 131), starting in Spring 2018. This course will cover relevant new technologies, including Python, SQL, and distributed computing. Professor Handcock specifically told me that this new class is being set up in large part thanks to a push from the undergraduate student body, so good work, team! Special thanks also go to Professor Vivian Lew, who has been pushing for this forward-thinking course for quite some time and was pivotal in making it a reality. Thanks also to Ryan Rosario, who was as active an advocate for this class as I was and who was a great resource along the way.

While applying for summer data scientist and analyst internships last school year, I noticed an overwhelming trend: the most interesting and exciting internship opportunities required at least some proficiency in R, Python, and SQL. In fact, I would not have gotten my summer internship at a Bay Area data science startup if I hadn’t listed Python and SQL on my resume. When I talk to other students looking to go into similar fields, they all say the same thing: the stats department at UCLA has done a great job of preparing us with R, but in the modern world it’s important for us to also learn Python and SQL. That’s why I’m proposing that the department prioritize creating a Python and SQL course.

Rather than simply offer platitudes and cite anecdotes, however, I am going to make my case with what we stats majors came to UCLA to study: cold, hard numbers. The data, which come from several different sources, all point in a single direction: Python and SQL knowledge is invaluable for the modern statistician, analyst, and data scientist.

Take, for instance, a recent poll conducted by the popular data science hub KDnuggets, which asked its readers (practicing statisticians and data scientists) which software packages they used. The responses indicated that 49% used R, 45.8% used Python, and 25.5% used SQL. By contrast, only 5.6% used SAS. In terms of growth and decline, R was up 4.5% from the previous year, SQL was up 15%, and Python was up a staggering 51%. According to the KDnuggets poll, R remains the number one data analysis package, while Python and SQL are second and third respectively.

An important aspect of university education is preparing undergraduates for employment in industry. As the graph to the left, from a popular job search engine, indicates, employers are primarily seeking candidates skilled in SQL, R, and/or Python. Both Stata and SPSS get few mentions in job postings, and the popularity of SAS among employers is declining. By focusing on the software packages that potential employers use, the UCLA stats department can better prepare its undergraduates for future employment outside of academia.

From Robert A. Muenchen’s seminal “The Popularity of Data Analysis Software”

And it’s not just in industry that Python and R are experiencing tremendous growth. As this chart shows, references to the Python language in scholarly articles grew by 27.5%, while references to R grew by 12.5%. Given this explosion of Python usage among academics, preparing tomorrow’s researchers means teaching them Python.

It’s no wonder that Python is so popular in academia: it’s a language well suited to the modern academic. As Hoyt Koepke of the University of Washington points out in his blog post, Python is an incredible language for academia because it’s holistic, readable, balanced, interoperable, and well documented, offers a plethora of data structures, and has an incredibly active open-source community that consistently churns out great libraries. All of these traits make it a great candidate to be taught at UCLA.

Python’s utility goes beyond simply being a great language for data analysis; its support for object-oriented programming and its ease of use mean that even serious software developers use it. By teaching the basics of Python, the UCLA stats department could ignite an interest in software development and give its students the foundation to learn how to do things that simply can’t be done with R. Python has been steadily rising in IEEE’s list of top programming languages and now sits at third, behind only C and Java. Popular websites such as YouTube, Quora, Instagram, and Pinterest are all built with Python.

Last summer I had an amazing internship experience in which I felt both challenged and productive doing exactly what I love. By petitioning for a Python and SQL class, I hope to give other stats students the same opportunity to find gainful employment doing what we all love: analyzing data.

If you agree with the sentiments expressed in this post, please press the “Heart” button, share with your friends, and sign this petition to have a Python and SQL course introduced.