Python’s an incredibly versatile language. In this post (probably the first of many, we’ll see) I’ll walk through one of the major workhorses of functional programming in Python: map.
The Basics
Feel free to skip this if you already know what map does and just want to get to the part where I describe common usage patterns.
map is one of a couple of builtin higher-order functions, meaning it takes a function as one of its arguments. The second argument map takes is an iterable, such as a list. All it does is apply the function to each element of that iterable in turn.
def add_one(n):
    return n + 1
x = [1, 2, 3]
y = map(add_one, x)
print(list(y))
# >>> [2, 3, 4]
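Hopefully what map’s doing is pretty obvious. What may not be obvious is that map doesn’t return a list. It returns a lazy iterator (strictly speaking a map object rather than a generator, though it behaves like one for everything in this post). I’m converting it to a list manually to print it.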
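If you want to check that for yourself, here's a minimal sketch reusing the add_one function from above. It only shows that the result is an iterator that hands out values on demand and is used up after a single pass.

```python
from collections.abc import Iterator


def add_one(n):
    return n + 1


y = map(add_one, [1, 2, 3])

# map returns a lazy iterator (a map object), not a list.
print(type(y).__name__)         # map
print(isinstance(y, Iterator))  # True

# Values are produced one at a time, on demand.
first = next(y)
rest = list(y)

# The iterator is single-pass: once consumed, it stays empty.
leftover = list(y)
print(first, rest, leftover)
```

The single-pass behavior is worth remembering: if you need the results twice, convert them to a list once and reuse the list.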
When map is not the Right Choice
It’s actually a lot simpler to write the above as a list comprehension. For example:
z = [n + 1 for n in x]
print(z)
# >>> [2, 3, 4]
So … what’s the point of map if comprehensions are simpler and take less code? The fact that map returns a generator is a clue. Generators don’t materialize their sequences in memory, meaning y is basically “free” in terms of memory. For the above example map is a bad choice because it’s operating on a list that is already sitting in memory. But what if the sequence is itself a generator?
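To make the laziness concrete, here's a small sketch. The numbers generator and the calls list are made up for illustration: the list just records every time add_one actually runs, so we can see when the work happens.

```python
calls = []


def add_one(n):
    calls.append(n)  # record each call so we can see when work happens
    return n + 1


def numbers():
    # A generator: yields values one at a time instead of building a list.
    for n in range(1_000_000):
        yield n


y = map(add_one, numbers())
calls_before = len(calls)  # creating the map has not called add_one yet

first_three = [next(y) for _ in range(3)]
calls_after = len(calls)   # only the three values we asked for were computed

print(calls_before, first_three, calls_after)
```

Nothing runs until something asks for a value, and then only as much as was asked for. The other 999,997 numbers are never generated.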
File Processing
Here’s an example where map is a good choice. It’s from a script in this repository that scrapes and processes Bigfoot Sightings from the BFRO sighting database. What the script does is take a CSV file with the processed Bigfoot sightings and load it into Elasticsearch. I like to use Elasticsearch and Kibana for checking data quality and light exploratory analysis.
Elasticsearch takes JSON, and requires a pretty specific schema to load it (at least the streaming bulk helper I’m using does). I’ll need a function that takes a dictionary (representing the CSV row) and embeds it within a dictionary that Elasticsearch’s bulk loading mechanism can understand. This is that function. It’s not too fancy:
def bfro_bulk_action(doc: dict) -> dict:
    return {
        "_op_type": "index",
        "_index": bfro_index_name,
        "_type": bfro_report_type_name,
        "_id": doc["number"],
        "_source": {
            "location": {
                "lat": float(doc["latitude"]),
                "lon": float(doc["longitude"])
            } if doc["latitude"] and doc["longitude"] else None,
            **doc  # This is the rest of the doc
        }
    }
Hopefully you can see where map can be useful here:
reports = DictReader(report_file)
# Create the report documents.
report_actions = map(bfro_bulk_action, reports)
# Note there has been zero processing thus far, and no data is in memory.
# client here is the Elasticsearch client.
for ok, resp in streaming_bulk(client, report_actions):
    # If there's a failure print what happened.
    if not ok:
        print(resp)
The streaming_bulk function takes a client (for the HTTP connection to the Elasticsearch instance) and an iterable, which could be a list or a generator or an infinite stream (more on that in a minute). In our case, it’s the generator returned by map, which is itself operating on the lazy reader created by DictReader from the Python standard library’s csv module.
The most important thing to note here is that our code only ever handles one record at a time, and streaming_bulk buffers only a bounded batch of actions per request. That wouldn’t be true if we’d used pandas read_csv with its default settings, or if we’d loaded the file into a list. In those cases we’d be constrained to operate only on files small enough to be held in main memory. In this implementation, the only significant resource constraint we have is our patience. Because the map + DictReader combo never needs the whole sequence at once, map is also very effective at operating on infinite sequences, more commonly known as streams.
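If you don't have Elasticsearch handy, here's a self-contained sketch of the same map + DictReader pattern. The CSV rows are invented sample data held in an io.StringIO, and to_action is a hypothetical, trimmed-down stand-in for bfro_bulk_action (no index or type names), so treat it as an illustration rather than the real script.

```python
import csv
import io

# Invented sample rows standing in for the real CSV file.
report_file = io.StringIO(
    "number,latitude,longitude\n"
    "1,47.5,-121.8\n"
    "2,,\n"
)


def to_action(doc: dict) -> dict:
    # Hypothetical, simplified stand-in for bfro_bulk_action.
    has_coords = doc["latitude"] and doc["longitude"]
    return {
        "_id": doc["number"],
        "_source": {
            "location": {
                "lat": float(doc["latitude"]),
                "lon": float(doc["longitude"]),
            } if has_coords else None,
            **doc,  # the rest of the row
        },
    }


reports = csv.DictReader(report_file)
actions = map(to_action, reports)

# Each next() reads one row from the file and converts it.
first = next(actions)
second = next(actions)
print(first["_source"]["location"])
print(second["_source"]["location"])
```

The second row has empty coordinates, so its location comes out as None instead of blowing up in float().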
Stream Processing
The final example I’ll walk through in this post is inspired by this script, which is part of a project I wrote to collect profane tweets about people on Twitter. More info here, though consider yourself warned: obviously the language is strong. Kinda the point.
What the script I’ve linked above does is subscribe to the Twitter Streaming API with a list of tracking targets, filter out the tweets containing profanity, then load them into Elasticsearch (can you tell I’m a fan?). Here’s how that works. Let’s assume for simplicity that the stream already has the profanity filtered. I’ll write another post later giving more detail about how I did that. This leaves one thing to do: wrap the tweet (a dictionary) in a larger dictionary to use with the streaming_bulk function. I’ve omitted a few things for simplicity, but you can see the whole script in the link I provided above.
def _tweet_to_bulk(tweet):
    return {
        "_index": "profanity-power-index",
        "_type": "tweet",
        "_id": tweet["id_str"],
        "_source": {
            "id": tweet["id_str"],
            # _extract_text is just a small helper function so we get
            # the retweeted statuses too.
            "text": _extract_text(tweet),
            "created_at": tweet["created_at"],
        },
    }
Care to guess what the script looks like?
# track is the list of targets.
# api is an authenticated Twitter API client.
tweet_stream = api.GetStreamFilter(track=track)
# From the perspective of our code, tweet_stream is an infinite
# sequence. It doesn't matter to us how it gets its contents.
bulk_action_stream = map(_tweet_to_bulk, tweet_stream)
# We are still not processing anything at this point.
for ok, resp in streaming_bulk(client, bulk_action_stream):
    if not ok:
        print(resp)
And that will continue running until you hit Ctrl-C. At no point does it accumulate memory (at least not because of the streams). Twitter stuff aside, it’s almost exactly the same as the file example, and that’s the point. An iterable is just something you can loop over; it doesn’t matter how long it is or where the data comes from. If we were trying to collect the tweets into a list, we’d need some way to keep that list from growing without bound. But because we’re using map and generators, it doesn’t matter.
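You can try the same idea without Twitter credentials. In this sketch, fake_tweets is a made-up generator that never ends (it stands in for the real stream), and to_bulk is a simplified, hypothetical version of _tweet_to_bulk. I'm using itertools.islice to take a finite window so the example actually terminates.

```python
from itertools import count, islice


def fake_tweets():
    # Hypothetical stand-in for the Twitter stream: it never ends.
    for i in count(1):
        yield {"id_str": str(i), "text": f"tweet {i}"}


def to_bulk(tweet):
    # Simplified, hypothetical version of _tweet_to_bulk.
    return {"_id": tweet["id_str"], "_source": {"text": tweet["text"]}}


# Mapping over an infinite sequence is fine because nothing runs yet.
bulk_action_stream = map(to_bulk, fake_tweets())

# Take only the first three actions; the rest are never generated.
first_three = list(islice(bulk_action_stream, 3))
ids = [action["_id"] for action in first_three]
print(ids)
```

Calling list() directly on bulk_action_stream would never return, which is exactly why the real script consumes it with a for loop instead.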
What’s Next
Operating on infinite streams of data doesn’t necessarily require map. In fact we could have implemented both of the above examples with regular for loops:
for tweet in tweet_stream:
    tweet_doc = _tweet_to_bulk(tweet)
    # ... other stuff here ...
There’s even a way to write a generator as a comprehension:
# Note the parens rather than brackets.
tweet_stream = (_tweet_to_bulk(tweet) for tweet in tweet_stream)
Personally, when it comes to streams I tend to prefer map over comprehensions, even generator comprehensions. Not only is it more succinct, but map has one very significant advantage over loop constructs and comprehensions: it’s a function, and that means it can be composed with other functions. I’ll cover that in another post later. For now, hopefully the distinction between map and comprehensions, including when to use one or the other, is a little clearer.
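As a small taste of what "composed" can mean here, this is one possible sketch, not necessarily the approach the follow-up post takes. Because map is an ordinary function, functools.partial can turn it into reusable stream transformers that stack on top of each other. The add_one and double functions are just illustrative.

```python
from functools import partial


def add_one(n):
    return n + 1


def double(n):
    return n * 2


# Each of these is a function from an iterable to a lazy iterator.
add_one_to_all = partial(map, add_one)
double_all = partial(map, double)

# Stack the two steps; still nothing has been computed.
pipeline = double_all(add_one_to_all([1, 2, 3]))

result = list(pipeline)
print(result)  # [4, 6, 8]
```

You can't do this with a comprehension, since a comprehension is syntax rather than a value you can pass around.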