SQL vs dplyr vs pandas

Check out the code here: https://github.com/phibya/sql-dplyr-pandas

I have been using SQL, dplyr, and pandas for data manipulation and analysis for a while now. When it comes to data manipulation, my favorite is dplyr, with syntax that is very similar to SQL. pandas on the other hand, not so much.

I finished the notebook above in about an hour with the help of Github Copilot. I can easily write the SQL code and Github Copilot will automatically generate the dplyr and pandas code for me to review. Interestingly, I did not have to fix any of the dyplr code to make it work. The code might be verbose or not optimal, but it works. On the other hand, many times the pandas code generated by Github Copilot does not work.

I know many people love pandas, but for some one who learn SQL first, naming and syntax of dyplr are much more familiar and easy to remember for me. For example:

SQLdyplrpandas
selectselect[] operator
wherefilterquery
group bygroup_bygroupby
order byarrangesort_values
joinleft_join, right_join, inner_join, full_joinmerge
unionunionconcat
ntilentileqcut
rankrankrank
row_numberrow_number?

and many more.

For some queries, I also needed to use numpy with pandas especially when it comes to vectorized operations for conditional statements. For a python package, pandas is also not very pythonic. For example, groupby instead of group_by and isnull instead of is_null.

Recently, I started using polars which is a new data manipulation library for Python. While the syntax is still clunky compared to SQL and dplyr, it is much more consistent and easier to read than pandas. I might add polars to the notebook in the future.

Read Next

More than 40% of job postings on LinkedIn are fake

A quick analysis of job postings on LinkedIn to determine the percentage of fake job postings.

Shared remote development environment for a research lab

My recount on setting up a computational environment for our research group of 10 members. Most of the members are PhD students and postdocs. The goal was to let everyone have easily access to a powerful computational environment with simple setup so that they can focus on their research.