pandas python

I exorcise my pandas demons by complaining on the internet about API inconsistencies and bad naming.



>>> import pandas as pd

I have some poor unsuspecting raw data that's about to be panda'd. It looks like this:

common_name,scientific_name,rating
bogue,Boops bops,7.5
Tibetan blackbird,Turdus maximus,1234567891
large flying fox,Pteropus vampyrus,131719
western lowland gorilla,Gorilla gorilla gorilla,2
tiny sky-tyrant,,

>>> df = pd.read_csv('poor_unsuspecting_raw_data.csv')
>>> df.dropna(subset=['scientific_name'])
               common_name          scientific_name        rating
0                    bogue               Boops bops  7.500000e+00
1        Tibetan blackbird           Turdus maximus  1.234568e+09
2         large flying fox        Pteropus vampyrus  1.317190e+05
3  western lowland gorilla  Gorilla gorilla gorilla  2.000000e+00

Okay, seems reasonable enough. And yet, disappointingly, if I want to isolate the specific rows that are being dropped due to NaN values in that column:

>>> df.isna(subset=['scientific_name'])
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: isna() got an unexpected keyword argument 'subset'

I end up having to do this instead.

>>> df[df.scientific_name.isna()]
       common_name  scientific_name  rating
4  tiny sky-tyrant              NaN     NaN

I was somewhat relieved that at least isna() and notna() seem to have the same signature, but I'm disappointed that they don't implement (almost) everything that dropna() does, because it seems like most of it would be useful. Clearly I can accomplish what I want with the different syntax, but there's so much value in consistency. Why make me type a whole new command from scratch when I could just change drop to is and rerun my code? Sigh.
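
For what it's worth, the "change drop to is" workflow can be faked with a tiny wrapper. This is only a minimal sketch: isna_rows is a hypothetical helper I made up (it is not part of pandas), and it mimics just the subset and how arguments of dropna(), nothing else.

```python
import pandas as pd


def isna_rows(df, subset=None, how="any"):
    """Return the rows that df.dropna(subset=subset, how=how) would drop.

    Hypothetical helper, not part of pandas; only `subset` and `how`
    are mimicked here.
    """
    cols = df.columns if subset is None else subset
    missing = df[cols].isna()
    # how="any": a row counts if any checked column is missing;
    # anything else is treated as "all"
    mask = missing.any(axis=1) if how == "any" else missing.all(axis=1)
    return df[mask]


# Cut-down stand-in for the CSV above
df = pd.DataFrame(
    {
        "common_name": ["bogue", "tiny sky-tyrant"],
        "scientific_name": ["Boops bops", None],
        "rating": [7.5, None],
    }
)
dropped = isna_rows(df, subset=["scientific_name"])
print(dropped)
```

The rows it returns and the rows dropna() keeps should add up to the original frame, which is a handy sanity check.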

Also while adding the links above, I realized there isn't an obvious way to link to a specific version of the docs. I'm running pandas v1.2.1, so my apologies if you're a reader from the Future™ and everything I'm yelling about has been fixed. From here and now: sigh.

It bothers me that pandas is about the only tool I use where I have to constantly go back to the docs to remind myself of the subtle differences between methods. What's even more frustrating about this is that I'm not even using methods that are rare or poorly documented. All of these are very common and basic use cases for pandas, and pretty much all of them appear in beginner guides. Here's another one. This time my data looks slightly different.

crane,270
crow,93
coot,65
curlew,76
crossbill,28

>>> df = pd.read_csv('cnames.csv', header=None, names=['bird', 'wingspann'])
>>> df
        bird  wingspann
0      crane        270
1       crow         93
2       coot         65
3     curlew         76
4  crossbill         28

Right, okay, so far so good, except for the fact that my dastardly butterfly keyboard has added a superfluous n to my second column. Never mind, I can rename that with the handy pandas rename method.

>>> df.rename(names={'wingspann': 'wingspan'})
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/vagrant/venv/lib/python3.7/site-packages/pandas/util/_decorators.py", line 312, in wrapper
    return func(*args, **kwargs)
TypeError: rename() got an unexpected keyword argument 'names'

Sigh.

In this case what works is actually columns, not names.

>>> df.rename(columns={'wingspann': 'wingspan'})
        bird  wingspan
0      crane       270
1       crow        93
2       coot        65
3     curlew        76
4  crossbill        28
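
One more thing to trip over here: by default rename() hands back a new DataFrame and leaves the original alone, so the fix only sticks if you keep the result. A minimal sketch with a cut-down copy of the bird data (the variable name renamed is mine):

```python
import pandas as pd

df = pd.DataFrame({"bird": ["crane", "crow"], "wingspann": [270, 93]})

# rename() returns a new DataFrame by default; df itself is untouched
renamed = df.rename(columns={"wingspann": "wingspan"})

print(list(df.columns))       # ['bird', 'wingspann']
print(list(renamed.columns))  # ['bird', 'wingspan']
```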

You see my point about how this is a pain in the ass I shouldn't have to deal with as a developer? I can't even imagine dealing with this if I were someone with less computational experience who was just trying to wrangle their gosh darn data. The inconsistency in the API practically guarantees that the special cases of these commands are never going to become muscle memory.

This was my trouble with find for a long time, actually. grep and sed both require you to specify a path after the pattern; find starts to froth at the mouth if you try that. Eventually I just used find often enough that it got its own little corner of my brain and I could remember the syntax without trying. What helps in this case is that it's an entirely separate binary and project that does something different (or, well, different enough) from the other utilities I mentioned. You know what doesn't have that excuse? pandas.

Okay one last one and I'm calling it a day. I'm just cooking up my own, unoriginal data for this example (birds? what birds?)

>>> df = pd.DataFrame([[1, 2, '3'], [4, None, '6'], [7, 8, '9']], columns=['a', 'b', 'c'])
>>> df
   a    b  c
0  1  2.0  3
1  4  NaN  6
2  7  8.0  9

Since you saw both the raw data and this DataFrame-ified representation of it, we can pretty much assume that the first column's going to have an integer dtype, the third is (hopefully) a string (or rather an "object"), and the second one looks like it was parsed as floats even though I didn't intend for it to be. This is almost certainly because of the None in the second row (index 1): an int64 column has no way to hold a missing value, so pandas converts the whole column to float64 and stores NaN instead.
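
To see the None-makes-it-float effect in isolation, here is a minimal sketch on a bare Series. The second line uses the nullable "Int64" dtype (capital I), which is one way to keep integers alongside a missing value; the variable names are mine.

```python
import pandas as pd

# None becomes NaN, and NaN is a float, so the whole Series is float64
as_float = pd.Series([2, None, 8])

# The nullable integer dtype keeps the ints and stores the gap as <NA>
as_nullable = pd.Series([2, None, 8], dtype="Int64")

print(as_float.dtype)     # float64
print(as_nullable.dtype)  # Int64
```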

Aside: I should probably sit down someday and puzzle out the differences in behaviour between Python's None, numpy's nan, and pandas' NA, but you know what? Today is not that day. I'm sighing just thinking about it.
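
Not the full puzzling-out, but a minimal sketch of the part I'm sure about: pd.isna() treats all three as missing, while == treats them three different ways. The variable names are just for illustration.

```python
import numpy as np
import pandas as pd

# All three count as "missing" as far as pd.isna() is concerned
all_missing = [pd.isna(value) for value in (None, np.nan, pd.NA)]

# ...but they behave differently under ==
none_equal = None == None     # True
nan_equal = np.nan == np.nan  # False: NaN never equals itself
na_equal = pd.NA == pd.NA     # neither True nor False: the result is pd.NA

print(all_missing, none_equal, nan_equal, na_equal)
```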

This time I'm not even going to give you a story. I'm just going to show you some commands and sigh loudly.

>>> df.dtypes
a      int64
b    float64
c     object
dtype: object
>>> df.c.str.isnumeric() # yay! these are numeric strings
0    True
1    True
2    True
Name: c, dtype: bool

>>> df.index.dtype
dtype('int64')
>>> df.index.isnumeric()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: 'RangeIndex' object has no attribute 'isnumeric'
>>> df.index.is_numeric() # see that? it's an underscore. I know.
True

>>> # How about trying this with a different thing that's an integer dtype
>>> df.a.is_numeric() # nope
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/vagrant/venv/lib/python3.7/site-packages/pandas/core/generic.py", line 5462, in __getattr__
    return object.__getattribute__(self, name)
AttributeError: 'Series' object has no attribute 'is_numeric'
>>> df.a.isnumeric() # still nope
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/vagrant/venv/lib/python3.7/site-packages/pandas/core/generic.py", line 5462, in __getattr__
    return object.__getattribute__(self, name)
AttributeError: 'Series' object has no attribute 'isnumeric'

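Why provide convenience methods like is_numeric() and is_integer() for some but not all objects that pattern a certain way? And no, comparing dtype against int64 directly isn't equivalent because you aren't guaranteed to get int64 in every environment (as I recall, the default integer width depends on the platform: NumPy has historically followed the C long, which is 32 bits on Windows. I originally wanted to blame sys.maxint here, but that doesn't even exist in Python 3).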
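
The closest thing I know of to a consistent spelling lives in pandas.api.types, where the same functions accept a Series, an Index, or a bare dtype. A minimal sketch using the same DataFrame as above (the checks dict is just for display):

```python
import pandas as pd
from pandas.api.types import is_integer_dtype, is_numeric_dtype

df = pd.DataFrame([[1, 2, '3'], [4, None, '6'], [7, 8, '9']], columns=['a', 'b', 'c'])

# The same two functions work on columns and on the index alike,
# without hard-coding int64 anywhere
checks = {
    "a": is_integer_dtype(df["a"]),
    "b": is_numeric_dtype(df["b"]),
    "c": is_numeric_dtype(df["c"]),  # object column of numeric-looking strings
    "index": is_integer_dtype(df.index),
}
print(checks)
```

Note that column c comes back False: these functions look at the dtype, not at whether the strings happen to contain digits, so they don't replace str.isnumeric().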

>>> df.isna()
       a      b      c
0  False  False  False
1  False   True  False
2  False  False  False
>>> df.isnull()
       a      b      c
0  False  False  False
1  False   True  False
2  False  False  False
>>> df.isempty() # of course this doesn't work
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/vagrant/venv/lib/python3.7/site-packages/pandas/core/generic.py", line 5462, in __getattr__
    return object.__getattribute__(self, name)
AttributeError: 'DataFrame' object has no attribute 'isempty'
>>> df.is_empty()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/vagrant/venv/lib/python3.7/site-packages/pandas/core/generic.py", line 5462, in __getattr__
    return object.__getattribute__(self, name)
AttributeError: 'DataFrame' object has no attribute 'is_empty'
>>> df.empty() # how could I possibly expect any semblance of consistency
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: 'bool' object is not callable
>>> df.empty
False

Siiiiiiiiiiighhhhhhh.

>>> df.b.isna()
0    False
1     True
2    False
Name: b, dtype: bool
>>> df.b.isunique()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/vagrant/venv/lib/python3.7/site-packages/pandas/core/generic.py", line 5462, in __getattr__
    return object.__getattribute__(self, name)
AttributeError: 'Series' object has no attribute 'isunique'
>>> df.b.is_unique()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: 'bool' object is not callable
>>> df.b.is_unique
True
>>> # At this point, I'm all out of sighs to give

Bonus fact: I didn't show you in this last snippet but unique() ALSO exists. 🙃
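
If you'd rather ask than guess whether something needs parentheses, you can interrogate the class instead of an instance. A minimal sketch, assuming a made-up helper called kind (not a pandas function); it relies on the fact that a property object fetched from the class is not callable, while a method is.

```python
import pandas as pd


def kind(cls, name):
    """Say whether `name` on a pandas class needs parentheses."""
    return "method" if callable(getattr(cls, name)) else "attribute"


names = [
    (pd.DataFrame, "isna"),
    (pd.DataFrame, "empty"),
    (pd.Series, "isna"),
    (pd.Series, "is_unique"),
    (pd.Series, "unique"),
]
summary = {f"{cls.__name__}.{name}": kind(cls, name) for cls, name in names}
for label, what in summary.items():
    print(label, "->", what)
```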

I wrote a little script that captures my guess at how they designed their API.

>>> from random import choice
>>> prefixes = ['is', 'is_', '']
>>> suffixes = ['', '()']
>>> content = ['numeric', 'unique', 'empty']
>>> for c in content:
...     print(choice(prefixes) + c + choice(suffixes))
...
is_numeric()
unique()
empty

Postscript

Despite appearances I do actually quite like pandas and their conceptual model makes a lot of sense to me (way more than R, anyway). And I know that these are problems that sometimes plague large, open-source projects with many contributors, but I sort of hope they do something about it, yanno? I'm sure this isn't the first post ever written griping about the inconsistencies in their API and workflows that were broken because pandas developers don't follow semantic versioning (I actually didn't even talk about that today). And, yes, I also know that it's possible my time would be better spent actually contributing to the project rather than complaining loudly into the void, so you really don't need to remind me :D Anyway, this whole thing was an attempt at exorcising my pandas demons and hopefully the act of writing it down means that some of it will stick!