SQL vs dplyr vs pandas
Check out the code here: https://github.com/phibya/sql-dplyr-pandas
I have been using SQL, dplyr, and pandas for data manipulation and analysis for a while now.
When it comes to data manipulation, my favorite is dplyr
, with syntax that is very similar to SQL.
pandas
on the other hand, not so much.
I finished the notebook above in about an hour with the help of Github Copilot. I can easily write the SQL code and Github Copilot will automatically generate the dplyr and pandas code for me to review. Interestingly, I did not have to fix any of the dyplr code to make it work. The code might be verbose or not optimal, but it works. On the other hand, many times the pandas code generated by Github Copilot does not work.
I know many people love pandas, but for some one who learn SQL first, naming and syntax of dyplr are much more familiar and easy to remember for me. For example:
SQL | dyplr | pandas |
---|---|---|
select | select | [] operator |
where | filter | query |
group by | group_by | groupby |
order by | arrange | sort_values |
join | left_join, right_join, inner_join, full_join | merge |
union | union | concat |
ntile | ntile | qcut |
rank | rank | rank |
row_number | row_number | ? |
and many more.
For some queries, I also needed to use numpy
with pandas especially when it comes to vectorized operations for
conditional statements.
For a python package, pandas
is also not very pythonic. For example, groupby
instead of group_by
and isnull
instead of is_null
.
Recently, I started using polars
which is a new data manipulation library for Python. While the syntax is still clunky compared to SQL and dplyr, it is much more consistent and easier to read than pandas. I might add polars to the notebook in the future.