SQL vs dplyr vs pandas

Check out the code here: https://github.com/phibya/sql-dplyr-pandas

I have been using SQL, dplyr, and pandas for data manipulation and analysis for a while now. When it comes to data manipulation, my favorite is dplyr, with syntax that is very similar to SQL. pandas on the other hand, not so much.

I finished the notebook above in about an hour with the help of Github Copilot. I can easily write the SQL code and Github Copilot will automatically generate the dplyr and pandas code for me to review. Interestingly, I did not have to fix any of the dyplr code to make it work. The code might be verbose or not optimal, but it works. On the other hand, many times the pandas code generated by Github Copilot does not work.

I know many people love pandas, but for some one who learn SQL first, naming and syntax of dyplr are much more familiar and easy to remember for me. For example:

SQL	dyplr	pandas
select	select	[] operator
where	filter	query
group by	group_by	groupby
order by	arrange	sort_values
join	left_join, right_join, inner_join, full_join	merge
union	union	concat
ntile	ntile	qcut
rank	rank	rank
row_number	row_number	?

and many more.

For some queries, I also needed to use numpy with pandas especially when it comes to vectorized operations for conditional statements. For a python package, pandas is also not very pythonic. For example, groupby instead of group_by and isnull instead of is_null.

Recently, I started using polars which is a new data manipulation library for Python. While the syntax is still clunky compared to SQL and dplyr, it is much more consistent and easier to read than pandas. I might add polars to the notebook in the future.

SQL vs dplyr vs pandas

Read Next

More than 40% of job postings on LinkedIn are fake

Shared remote development environment for a research lab