Slicing data in pandas
This is second in the series on indexing and selecting data in pandas. If you haven’t read it yet, see the first post that covers the basics of selecting based on index or relative numerical indexing. In this post, I’m going to review slicing, which is a core Python topic, but has a few subtle issues related to pandas. Then, using slicing, we’ll review the selection methods from the last post, but now with slicing applied.
What is a slice, and what is slicing?
In short, a slice is a portion of a sequence.
A slice is defined in the Python glossory.
An object usually containing a portion of a sequence. A slice is created using the subscript notation, [] with colons between numbers when several are given, such as in variable_name[1:3:5]. The bracket (subscript) notation uses slice objects internally.
Slicing is defined in the Python reference.
a slicing selects a range of items in a sequence object (e.g., a string, tuple or list). Slicings may be used as expressions or as targets in assignment or del statements.
A slicing has three parts: start, stop, and step. The slice
class is a Python builtin, so you can instantiate and use it with no imports. If you instantiante a slice
object explicitly, you can supply a single value (the stop), or you can supply the start and stop and optionally, the step. In terms of what is selected in the sequence, the start is inclusive, the stop is not, and the step determines how you count from start to stop. You can use slice objects or slice notation, but unless you need to store the slice object for re-use, you usually see slice notation in code.
So let’s look at few examples to clear this up. We’ll start with a string as a sequence object.
>>> w = "abcdefg"
>>> # all of these select the same slice of w, "abc"
>>> w[slice(0, 3, 1)] # This is a slice(start=0, stop=3, step=1)
'abc'
>>> w[0:3:1] # This is the same, but in slice notation
'abc'
>>> w[slice(3)] # start=0, step=1 are the defaults
'abc'
>>> w[:3] # same here
'abc'
>>> w[:] # select the entire sequence. start=0, stop=last element (inclusive), step=1
'abcdefg'
Enter fullscreen mode Exit fullscreen mode
Now slices can be used very effectively and concisely to get views of data. I’ll just cover a few examples here.
>>> w[::-1] # reverse a string (start=0, stop=end, stepping backwards)
'gfedcba'
>>> w[1:3] # a portion of a string
'bc'
Enter fullscreen mode Exit fullscreen mode
Now strings are special, you can’t modify them directly or delete their values. Same with tuples.
>>> try:
... w[2] = 'z'
... except TypeError as te:
... print(te)
'str' object does not support item assignment
>>> try:
... t = (0, 1, 2)
... t[0] = -1
... except TypeError as te:
... print(te)
'tuple' object does not support item assignment
Enter fullscreen mode Exit fullscreen mode
We’ll switch to using a list to show some more applications of slicing.
>>> l = [0, 1, 2, 3, 4, 5, 6]
>>> l[slice(0,3,1)] # just as with strings, these all work on a list, and return [0, 1, 2]
[0, 1, 2]
>>> l[0:3:1]
[0, 1, 2]
>>> l[slice(3)]
[0, 1, 2]
>>> l[:3]
[0, 1, 2]
>>> l[:] # this returns a full copy
[0, 1, 2, 3, 4, 5, 6]
>>> l[::2] # using step
[0, 2, 4, 6]
>>> l[1::2] = [-1, -1, -1] # can also assign
>>> l
[0, -1, 2, -1, 4, -1, 6]
>>> del l[:] # this empties the list, but it still exists
Enter fullscreen mode Exit fullscreen mode
On to pandas
OK, after a quick overview of slicing, let’s move onto pandas and see why we’re talking about slicing at all.
Just like the last post, I’ll grab a dataset from the Chicago Data Portal, but this time we’ll grab the list of Chicago public libraries.
>>> import pandas as pd
>>> import numpy as np
>>>
>>> # you should be able to grab this dataset as an unauthenticated user, but you can be rate limited
>>> df = pd.read_json("https://data.cityofchicago.org/resource/x8fc-8rcq.json")
>>> df.dtypes
name_ object
hours_of_operation object
address object
city object
state object
zip int64
phone object
website object
location object
:@computed_region_rpca_8um6 int64
:@computed_region_vrxf_vc4k int64
:@computed_region_6mkv_f3dw int64
:@computed_region_bdys_3d7i int64
:@computed_region_43wa_7qmu int64
dtype: object
>>> # trim down the columns
>>> df = df[['name_', 'hours_of_operation', 'address', 'city', 'state', 'zip', 'phone', 'website', 'location']]
>>> df.head(3)
name_ ... location
0 Vodak-East Side ... {'latitude': '41.70302747819179', 'longitude':...
1 Albany Park ... {'latitude': '41.975456', 'longitude': '-87.71...
2 Avalon ... {'latitude': '41.74671722160199', 'longitude':...
[3 rows x 9 columns]
Enter fullscreen mode Exit fullscreen mode
Slicing in pandas
Now in pandas (and the underlying NumPy data structures), slicing is also supported, and you can use it via the indexing operator ([]
), .loc
, and .iloc
. Just like in the last post, we’ll walk through these three methods and look at slicing for both Series
and DataFrame
objects. Also, as last time, to prevent confusion between location based indexing and label indexing, I’m going to update the index of our sample data to be string based for the initial explanations.
>>> # this will make our examples a bit more obvious
>>> df.index = "ID-" + df.index.astype(str)
>>> # Use the name_ column for a sample Series.
>>> s = df['name_']
Enter fullscreen mode Exit fullscreen mode
Slicing with []
Just like basic indexing, we’ll see that slicing with .loc
and .iloc
is preferred to using []
. Since the behavior of []
can depend on the index and arugments, it’s better to use the more explicit indexers. There’s a just a few things to keep in mind if you use it.
Series
Just like standard Python, you can use integer slicing on Series objects, with start:stop:step
.
>>> s[:3] # These all slice the first 3 elements in the Series
>>> s[0:3]
>>> s[0:3:1]
>>> s[slice(0,3,1)]
ID-0 Vodak-East Side
ID-1 Albany Park
ID-2 Avalon
Name: name_, dtype: object
Enter fullscreen mode Exit fullscreen mode
In a Series
, we also can slice on our index, which in this case is an index of objects (strings). Note that the slice start and stop are inclusive, which is different than regular Python slicing!
>>> s['ID-4':'ID-9']
ID-4 Popular Library at Water Works
ID-5 Little Italy
ID-6 Edgebrook
ID-7 Kelly
ID-8 Edgewater
ID-9 North Austin
Name: name_, dtype: object
Enter fullscreen mode Exit fullscreen mode
You can specify a step as well.
>>> s['ID-10':'ID-15':2]
ID-10 Canaryville
ID-12 Manning
ID-14 Whitney M. Young, Jr.
Name: name_, dtype: object
Enter fullscreen mode Exit fullscreen mode
Note that your labels have to be in the index or you’ll get a KeyError
.
>>> try:
... s['ID-10':'ID-999']
... except KeyError as ke:
... print(ke)
'ID-999'
Enter fullscreen mode Exit fullscreen mode
You can also (potentially) assign/update a slice. Note that you’ll get a SettingWithCopyWarning
(see below) when doing this. More on this in future posts (or read the pandas docs for details), but for now just know that when updating a subset of your data, you need to ensure you aren’t trying to update a view of it, but the actual underlying DataFrame
or Series
.
>>> s['ID-79':] = "Changed"
/Users/mcw/.pyenv/versions/3.8.6/envs/pandas/lib/python3.8/site-packages/pandas/core/indexing.py:670: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame
See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
iloc._setitem_with_indexer(indexer, value)
>>> s.tail()
ID-76 Woodson Regional Library
ID-77 Mayfair
ID-78 Chicago Bee
ID-79 Changed
ID-80 Changed
Name: name_, dtype: object
Enter fullscreen mode Exit fullscreen mode
DataFrame
In a DataFrame
, slicing with []
will slice on the rows. If you remember back to the last post, selection on a DataFrame
with[]
and a single argument selects the columns, so this is a bit confusing. Just remember, if you see a :
, it’s slicing on rows.
>>> df[1:3]
name_ ... location
ID-1 Albany Park ... {'latitude': '41.975456', 'longitude': '-87.71...
ID-2 Avalon ... {'latitude': '41.74671722160199', 'longitude':...
[2 rows x 9 columns]
Enter fullscreen mode Exit fullscreen mode
You can also use the step.
>>> df['ID-3':'ID-7':2]
name_ ... location
ID-3 Brainerd ... {'latitude': '41.732316757097045', 'longitude'...
ID-5 Little Italy ... {'latitude': '41.867166', 'longitude': '-87.66...
ID-7 Kelly ... {'latitude': '41.78217163315653', 'longitude':...
[3 rows x 9 columns]
Enter fullscreen mode Exit fullscreen mode
Slicing with .loc
.
Now, just like with basic selection, .loc
and .iloc
are meant for label and position respectively. With .loc
you cannot pass in locations unless those match the index you are using. In our case, we’ll get a TypeError
. The issue is our index doesn’t contain these values, and so we’ll be rejected.
Note that with .loc
, we will get both the start and stop values if they are in the index, just like with []
.
Series
For our example data, you can’t slice with integers (since they aren’t in the index), but we can slice by labels.
>>> try:
... s.loc[1:3]
... except TypeError as te:
... print(te)
...
cannot do slice indexing on Index with these indexers [1] of type int
>>> s.loc['ID-1':'ID-3']
ID-1 Albany Park
ID-2 Avalon
ID-3 Brainerd
Name: name_, dtype: object
Enter fullscreen mode Exit fullscreen mode
Assignment or updates (can) work with .loc
. We’ll talk about updates in more detail in the future.
>>> s.loc['ID-79':] = "Changed again"
/Users/mcw/.pyenv/versions/pandas/lib/python3.8/site-packages/pandas/core/indexing.py:670: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame
See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
iloc._setitem_with_indexer(indexer, value)
>>> s.tail()
ID-76 Woodson Regional Library
ID-77 Mayfair
ID-78 Chicago Bee
ID-79 Changed again
ID-80 Changed again
Name: name_, dtype: object
Enter fullscreen mode Exit fullscreen mode
DataFrame
And things are similar for a DataFrame
.
>>> try:
... df.loc[1:3]
... except TypeError as te:
... print(te)
...
cannot do slice indexing on Index with these indexers [1] of type int
>>> df.loc['ID-1':'ID-3']
name_ ... location
ID-1 Albany Park ... {'latitude': '41.975456', 'longitude': '-87.71...
ID-2 Avalon ... {'latitude': '41.74671722160199', 'longitude':...
ID-3 Brainerd ... {'latitude': '41.732316757097045', 'longitude'...
[3 rows x 9 columns]
Enter fullscreen mode Exit fullscreen mode
Now with a DataFrame
, you can slice both on the index and the columns, by label.
>>> df.loc['ID-10':'ID-20':2, "name_":"address"]
name_ hours_of_operation address
ID-10 Canaryville Sun., Closed; Mon. & Wed., Noon-8; Tue. & Thu.... 642 W. 43rd St.
ID-12 Manning Sun., Closed; Mon. & Wed., Noon-8; Tue. & Thu.... 6 S. Hoyne Ave.
ID-14 Whitney M. Young, Jr. Sun., Closed; Mon. & Wed., 10-6; Tue. & Thu., ... 415 East 79th St.
ID-16 Scottsdale Sun., Closed; Mon. & Wed., 10-6; Tue. & Thu., ... 4101 W. 79th St.
ID-18 West Lawn Sun., Closed; Mon. & Wed., Noon-8; Tue. & Thu.... 4020 W. 63rd St.
ID-20 Rogers Park Sun., Closed; Mon. & Wed., Noon-8; Tue. & Thu.... 6907 N. Clark St.
Enter fullscreen mode Exit fullscreen mode
I’ll also point out a common idiom in pandas code to use a full slice (:
) to select the entire object when not using slicing to limit selection.
>>> df['name_'] # this is one way to select a single column, for example
>>> df.loc[:,'name_'] # but this is how you select that column using .loc
Enter fullscreen mode Exit fullscreen mode
Slicing with .iloc
Remember, .iloc
is for strictly integer based indexing, so as you’d expect, it doesn’t work with labels. But also note that .iloc
works like standard Python slicing, i.e. the stop value is not included.
Series
>>> try:
... s.iloc["ID-1":"ID-3"] # not this way with .iloc
... except TypeError as te:
... print(te)
...
cannot do positional indexing on Index with these indexers [ID-1] of type str
>>> s.iloc[1:3] # this way
ID-1 Albany Park
ID-2 Avalon
Name: name_, dtype: object
Enter fullscreen mode Exit fullscreen mode
DataFrame
Similarly, DataFrame
selects from the index as the first argument, and the columns with the second argument.
>>> df.iloc[1:3]
name_ ... location
ID-1 Albany Park ... {'latitude': '41.975456', 'longitude': '-87.71...
ID-2 Avalon ... {'latitude': '41.74671722160199', 'longitude':...
[2 rows x 9 columns]
>>> df.iloc[1:3, 0:2]
name_ hours_of_operation
ID-1 Albany Park Sun., Closed; Mon. & Wed., 10-6; Tue. & Thu., ...
ID-2 Avalon Sun., Closed; Mon. & Wed., Noon-8; Tue. & Thu....
>>> df.iloc[1:10:3, 2:-4]
address city state
ID-1 3401 W. Foster Ave. Chicago IL
ID-4 163 E. Pearson St. Chicago IL
ID-7 6151 S. Normal Blvd. Chicago IL
Enter fullscreen mode Exit fullscreen mode
Some special cases
There’s always some special cases to consider, so I’ll go through those last.
Changing behavior for integer based indexes
You’ll remember that I changed the index on our first test DataFrame
to be a string based index instead of the standard RangeIndex
of sequential integer values. This following example shows why, but you should now understand why it works this way.
>>> df2 = pd.DataFrame({"a": [0, 1, 2, 3], "b": [4, 5, 6, 7], "c": [8, 9, 10, 11]})
>>> df2
a b c
0 0 4 8
1 1 5 9
2 2 6 10
3 3 7 11
>>> df2.loc[0:2] # Remember, .loc is label based
a b c
0 0 4 8
1 1 5 9
2 2 6 10
>>> df2.iloc[0:2] # And .iloc is position based
a b c
0 0 4 8
1 1 5 9
>>> df2[0:2] # So here [] behaves like position based, even though we may be trying to use our index
a b c
0 0 4 8
1 1 5 9
>>> df2.index = ["w", "x", "y", "z"]
>>> df2["w":"y"] # now, [] behaves like it's label based.
a b c
w 0 4 8
x 1 5 9
y 2 6 10
Enter fullscreen mode Exit fullscreen mode
Hopefully this example reinforces for you why you’ll want to use .loc
and .iloc
in production code, especially when you’re taking values to select rows or columns from unknown sources!
Label slicing and sorted indexes
We saw that using .loc
will include both the start and stop values of a slice, which is not the way standard Python works. Well, you can end up with indexes that are not sorted and will behave slightly differently than expected. Essentially, what you need to notice is that with a sorted index, all values in between the start and stop will be included (including the value for the stop label)
>>> s2 = pd.Series([1, 2, 3, 4, 5], index=[5, 2, 1, 9, 7])
>>> s3 = pd.Series([1, 2, 3, 4, 5], index=[0, 1, 2, 3, 4])
>>>
>>> s3.loc[0:3] # works just as you'd expect for .loc (not standard python slicing, includes end label)
0 1
1 2
2 3
3 4
dtype: int64
>>> s3.loc[0:9] # no exception raised, even though 9 is not in the index, and includes all values between them
0 1
1 2
2 3
3 4
4 5
dtype: int64
>>> try:
... s2.loc[0:9] # hmmm, this won't work
... except KeyError as ke:
... print(ke)
...
0
>>> s2.sort_index().loc[0:9] # this works though.
1 3
2 2
5 1
7 5
9 4
dtype: int64
Enter fullscreen mode Exit fullscreen mode
The reason for this functionality is that it is too expensive to do the sorting first, and it may not return the values you’d expect anyway since sorting may not make sense for your index. Just be aware of this behavior when you are dealing with non-sorted indexes. You’ll realize that it’s important to understand the indexes on your data and ensure they are accurately reflecting your data and what you’re trying to accomplish with indexing and slicing it.
The Ellipsis and NumPy slicing
This is a little known part of Python, but in working through these examples I did come across the Ellipsis
keyword in looking at the slicing documentation. Ellipsis
is just a singleton (like None
) and is intended to be used in extended slicing syntax by user-defined container data types. It is represented by the ellipsis literal, ...
.
It doesn’t appear to be supported much by pandas (beyond returning a full series when passed invoked like [...]
). It also isn’t going to be more useful than using a full slice, .i.e. :
for most cases, but it is more useful in NumPy with higher dimensional arrays.
The way to think of it is as a special value that will insert as many full slices as needed to extend the slice at the point of insertion in an index in all directions. I think this makes more sense with an example.
>>> ...
Ellipsis
>>> m = np.arange(27).reshape((3,3,3)) # 3 x 3 x 3 multi-dimensional array
>>> print("Full array:\n", m)
Full array:
[[[0 1 2]
[3 4 5]
[6 7 8]]
[[9 10 11]
[12 13 14]
[15 16 17]]
[[18 19 20]
[21 22 23]
[24 25 26]]]
>>> print("First element in last dimension:\n", m[:,:,0])
First element in last dimension:
[[0 3 6]
[9 12 15]
[18 21 24]]
>>> print("With ellipsis:\n", m[...,0])
With ellipsis:
[[0 3 6]
[9 12 15]
[18 21 24]]
>>> print("Last element in last dimension:\n", m[...,-1])
Last element in last dimension:
[[2 5 8]
[11 14 17]
[20 23 26]]
Enter fullscreen mode Exit fullscreen mode
This could come in handy if you end up dealing with highly dimensioned arrays and want to save some typing. Or if you just want to try to stump someone with some obscure Python knowledge.
Wrapping it up
OK, that should be enough for now. Hopefully you’ve learned something about slicing, I know I learned a few things in doing this writeup.
Next I’ll move on to boolean indexing, a very powerful way to select any bit of data in your Series
or DataFrame
using any logic you can think of.
暂无评论内容