Implementation Lessons

While it seems simple, setting it up for the first time had some pitfalls I didn't expect: needing to strip quotes from queue message bodies, and enabling or disabling the SQS trigger from the Lambda function's configuration itself. That toggle is mostly useful for testing, so you don't process every record while you're still developing, but it is also where you set the function's concurrency limit, and the default of 10 was too low for my needs. Another surprise: when invoked through the trigger, the Lambda function will delete the SQS message for you without an explicit call to remove it.
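
The quote issue shows up right in the handler. A minimal sketch of how I'd handle it today (the handler and process() call are illustrative placeholders, not my exact code): a body sent as a JSON-encoded string arrives wrapped in literal quotes unless you decode or strip it.

```python
import json

def lambda_handler(event, context):
    """Illustrative sketch: SQS delivers messages in event['Records']."""
    for record in event["Records"]:
        body = record["body"]
        # A body like '"https://example.com/file.csv"' keeps its literal quotes
        # unless you json.loads() it or strip them manually.
        try:
            url = json.loads(body)
        except json.JSONDecodeError:
            url = body.strip('"')
        process(url)  # placeholder for the actual download logic
```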

Loading a single queue item at a time is slow

I had written code that sent a single item to SQS at a time, and when trying to load several million items this took a very long time as individual API calls. It would be better to build lists of messages and send them in batches; SQS supports batch calls of up to 10 messages and a 256 KB total payload.
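
A rough sketch of what the batching could look like with boto3's send_message_batch (the queue URL is a placeholder, and items is whatever iterable of work you're enqueuing):

```python
import json
import boto3

sqs = boto3.client("sqs")
QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/my-queue"  # placeholder

def send_in_batches(items):
    """Send items 10 at a time instead of one API call per item."""
    for start in range(0, len(items), 10):
        chunk = items[start:start + 10]
        entries = [
            {"Id": str(i), "MessageBody": json.dumps(item)}
            for i, item in enumerate(chunk)
        ]
        response = sqs.send_message_batch(QueueUrl=QUEUE_URL, Entries=entries)
        # Batch calls can partially fail; the failures come back in the response.
        for failure in response.get("Failed", []):
            print("failed to enqueue:", failure)
```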

Building an inventory

When dealing with hundreds of thousands of files, it is helpful to have a list of the ones you have already processed in case you need to stop and restart your pulls. S3 has inventory reports you can enable, but they lag by about 24 hours; a great free option, but not useful if you have time-sensitive needs. You can also list the files in a bucket directly, but the API only returns 1,000 keys at a time, so you have to paginate.
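
For the direct-listing route, boto3's paginator hides the 1,000-key page limit. A sketch along these lines (the bucket name is a placeholder) is how I'd build the "already processed" set:

```python
import boto3

s3 = boto3.client("s3")

def list_all_keys(bucket, prefix=""):
    """List every key under a prefix; the paginator walks the 1000-key pages."""
    keys = []
    paginator = s3.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        for obj in page.get("Contents", []):
            keys.append(obj["Key"])
    return keys

# Example: build a set of already-downloaded files to trim the queue on restart.
done = set(list_all_keys("my-scrape-bucket"))  # bucket name is a placeholder
```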

Empty files

Using subprocess in Python to curl files can create empty files if curl is not told to fail on server errors with the -f flag. This initially looked like a bug to correct, but in the end it may have been the preferred behavior. When restarting a large pull, the list of files already downloaded was used as a crude way to trim the queue; once the blank files were removed, the queue would try to pull them again on restart. So keeping the empty files and going back to delete them later may have been the better approach.
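
For reference, a minimal sketch of the subprocess call with the -f flag in place (function and paths are illustrative, not my original script):

```python
import subprocess

def fetch(url, dest):
    """Download url to dest with curl; -f makes curl exit non-zero on HTTP errors
    instead of writing the error response (or an empty body) to dest."""
    result = subprocess.run(
        ["curl", "-sS", "-f", "-o", dest, url],
        capture_output=True,
        text=True,
    )
    if result.returncode != 0:
        # Exit code 22 is curl's code for an HTTP error when -f is set.
        print(f"download failed ({result.returncode}): {url}")
    return result.returncode == 0
```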

Lambda Configuration Triggers

Enabling or disabling the SQS trigger in the Lambda configuration turned out to be the preferred way to start and stop the function's pulls from the queue. This is also where you set the execution concurrency; the default of 10 was far too low for the millions of files I wanted to grab. There is a standard account limit of 1,000 concurrent executions across all of your Lambda functions. You can request an increase, but for me, setting the function to 500 was enough to move fast, avoid overwhelming the server I was pulling from, and still leave room for my other functions to operate.
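
I flipped these settings in the console, but the same toggles can be scripted. A rough boto3 sketch, with the event source mapping UUID and function name as placeholders:

```python
import boto3

lam = boto3.client("lambda")

# Pause or resume the SQS trigger (the UUID comes from list_event_source_mappings).
lam.update_event_source_mapping(
    UUID="00000000-0000-0000-0000-000000000000",  # placeholder
    Enabled=False,  # set True to start pulling from the queue again
)

# Cap how many copies of the function can run at once.
lam.put_function_concurrency(
    FunctionName="my-scraper-function",  # placeholder
    ReservedConcurrentExecutions=500,
)
```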

Auto delete SQS message with lambda

When the function is triggered by the queue, the queue item is removed on a successful exit. This was a pleasant surprise: it saves a little work because you don't need an API call to delete the message at the end of the function code.
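
In practice that means the handler can be as bare as the sketch below (download() is a placeholder): returning normally signals success and the trigger deletes the batch, while raising an exception leaves the messages in the queue to be retried.

```python
def lambda_handler(event, context):
    for record in event["Records"]:
        download(record["body"])  # placeholder for the real work
    # Returning without raising tells the SQS trigger the batch succeeded,
    # so the messages are deleted automatically -- no delete_message call needed.
    return {"processed": len(event["Records"])}
```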

S3 egress

In my case I ended up with 655 GB of data at the end of my scraping, which should cost about $15/month to store in the Standard tier. My plan was to let the function run and store the data in S3 until I knew its size, then allocate local hard drive space for it. I purchased an external drive and set up a sync to pull the data down and store it locally. However, I failed to think about the egress cost: that maneuver cost me $50, on top of the $50 for the external drive.
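
Back-of-the-envelope, using rates I'm quoting from memory rather than the current price sheet (roughly $0.023/GB-month for S3 Standard, roughly $0.09/GB for transfer out to the internet, and an assumed 100 GB/month free transfer allowance), the numbers line up with what I was billed:

```python
data_gb = 655
storage_rate = 0.023   # assumed S3 Standard $/GB-month
egress_rate = 0.09     # assumed $/GB transferred out to the internet
free_egress_gb = 100   # assumed monthly free transfer allowance

monthly_storage = data_gb * storage_rate                           # ~ $15/month
one_time_egress = max(data_gb - free_egress_gb, 0) * egress_rate   # ~ $50
print(f"storage ~${monthly_storage:.0f}/month, egress ~${one_time_egress:.0f}")
```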


Retrospective

I could have done this more cleanly with better code, and I could have built a better file-checking system to limit reintroducing items into the queue. I was mostly worried about speed: there was a day or two when the server went down and nobody was sure if it was coming back, so I wanted to grab as much data as I could, as fast as I could. Some IAM permissions are also broader than they should be, simply to avoid issues and get the functions running as quickly as possible. And real money crept into this project by misjudging the cost of S3 and some API operations.

In the future I think this could be a useful tool to have if I generalize it more: make the messages going into the queue more explicit so I can point it at any webpage instead of hard-coding it for one system, build a more robust file-monitoring solution, and be more mindful of cost and storage tiering/replication.