This morning I got sucked into a small Twitter exchange about face surveillance, sparked by this tweet asserting it’s not inevitable.
I’m skeptical that individual technologists can avert national-scale face surveillance, if states feel like building it.
If building such a thing were hard, and enough engineers refused, maybe. But I don't think it's that hard. Someone disagreed with me about the difficulty, and it sounded like a fun challenge.
Over breakfast, I sketched out a rough architecture for a large-scale face surveillance system.
The system has 4 main tables of information:
- Persons: a set of authoritative information about each person in the country, including identifying information and a good-quality face photo. Maybe you keep a few photos per person.
- Photos: a dataset of photos captured throughout the country, annotated with useful metadata, like timestamp, location, ID of the camera that captured the image, …
- Faces: the results of detecting and extracting faces from the Photos dataset, carrying along all the metadata, because it’s small relative to the imagery and in large-scale systems you want denormalized data to enable local processing.
- Appearances: the good stuff. This is the result of processing an extracted Face for its facial features and using those to match against the Persons dataset.
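The four tables above could be sketched as plain records. All field names here are my own guesses at what such a system might track, not a real schema:

```python
from dataclasses import dataclass

@dataclass
class Person:
    person_id: str      # national ID or equivalent
    name: str
    face_photo: bytes   # good-quality reference photo, ~1MB

@dataclass
class Photo:
    photo_id: str
    camera_id: str
    timestamp: float    # unix time
    latitude: float
    longitude: float
    image: bytes        # ~5MB raw capture

@dataclass
class Face:
    face_id: str
    photo_id: str
    camera_id: str      # metadata copied (denormalized) from the Photo
    timestamp: float
    latitude: float
    longitude: float
    crop: bytes         # ~10kb face crop

@dataclass
class Appearance:
    face_id: str
    person_id: str      # best match in Persons
    confidence: float   # match score
```

Note the Face record repeats the Photo metadata, per the denormalization point above: once the Photo imagery is gone, the Face still carries everything needed for local processing.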
Estimating 1MB of data for each entry in the Persons table, and 500 million people, that's 500TB of data. Easy. Amazon will charge you $11.5k/month to store that (pricing).
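The arithmetic, assuming a blob-storage rate of roughly $0.023/GB-month (the standard S3 rate at time of writing; check current pricing):

```python
people = 500_000_000
mb_per_person = 1
total_tb = people * mb_per_person / 1_000_000   # MB -> TB
price_per_gb_month = 0.023                      # rough blob-storage rate
monthly_cost = total_tb * 1000 * price_per_gb_month
print(f"{total_tb:.0f} TB, ${monthly_cost:,.0f}/month")  # 500 TB, $11,500/month
```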
For simplicity, I’m using blob storage pricing as a rough estimate, because the imagery will use most of the data.
The Photos are heftier. At 5MB/photo, we can easily do 50TB/day. That could be tricky. Here, for simplicity, I might just discard the Photos imagery immediately after processing. That’s the easiest way. But you could also cycle it into cold storage and progressively downsample it over time until eventually deleting it. Depends on how much of a packrat you are.
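To make the 50TB/day figure concrete, that works out to about 10 million photos per day at 5MB each:

```python
photo_mb = 5
daily_tb = 50
photos_per_day = daily_tb * 1_000_000 / photo_mb  # TB -> MB, divided by photo size
photos_per_second = photos_per_day / 86_400
print(f"{photos_per_day:,.0f} photos/day, ~{photos_per_second:,.0f}/sec")  # 10,000,000 photos/day, ~116/sec
```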
Like I said though, I'd just toss it. In a large-scale system like this, the more local processing you can do, the better. Batch processing is going to be a frigging nightmare, and everything will back up and fall over if you don't fix it quickly enough. I'd want each camera to have local compute power to do real-time face detection and facial feature extraction. This exists.
When I say real-time, I mean fast, but it doesn’t have to be 60fps. If you can process faces out of a video once per second, that’s probably just fine. Depends on the purpose of the system, really. In fact, if you have persons of interest you’re actively looking for, it may be possible to deploy a model based on those to every camera in the region and only have the cameras watch for hits against that model, rather than dragnet logging of every face as fast as possible.
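The persons-of-interest mode could look something like this sketch: each camera holds a small set of watchlist embeddings and only reports hits. Everything here is assumed, not a real system: embeddings as fixed-size vectors, cosine similarity as the comparison, and a made-up threshold.

```python
import numpy as np

def watchlist_hit(face_embedding, watchlist, threshold=0.8):
    """Check one extracted face embedding against a small watchlist of
    persons-of-interest embeddings. Runs locally on the camera; only
    hits get reported upstream."""
    # cosine similarity against every watchlist entry
    a = face_embedding / np.linalg.norm(face_embedding)
    w = watchlist / np.linalg.norm(watchlist, axis=1, keepdims=True)
    scores = w @ a
    best = int(np.argmax(scores))
    return (best, float(scores[best])) if scores[best] >= threshold else None

# toy example: 3 persons of interest, 128-dim embeddings
rng = np.random.default_rng(0)
watchlist = rng.normal(size=(3, 128))
# a slightly noisy re-observation of watchlist entry 1
hit = watchlist_hit(watchlist[1] + 0.01 * rng.normal(size=128), watchlist)
print(hit)  # hit against watchlist entry 1
```

The point of the design: the dragnet version ships every face upstream; the watchlist version ships almost nothing, which is why it's so much cheaper to run.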
When the camera/computer registers faces (of interest), it can slice them out of the whole photo, and directly output to the Faces dataset. This is a large simplification and cuts out the main data storage consumer (Photos).
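The slicing step itself is trivial once you have a bounding box from the detector; the frame and box here are just stand-ins:

```python
import numpy as np

frame = np.zeros((1080, 1920, 3), dtype=np.uint8)  # one camera frame
x, y, w, h = 900, 400, 52, 88  # bounding box from a hypothetical face detector
face_crop = frame[y:y + h, x:x + w]
print(face_crop.shape)  # (88, 52, 3)
```

Only `face_crop` (plus the denormalized metadata) goes into the Faces dataset; the full frame never leaves the camera.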
I estimated 10kb per Faces entry. That might actually be too generous. Here’s a 52×88 pixel image of my face, blown up. If you know me, this is recognizable, especially with the sunglasses off. The file is 2.5kb in size.
If you log 1 billion 10kb Faces entries per day, you use 10TB of storage per day. If you hold them for 1 year, that's 3.65PB of storage in active use. If I haven't bungled the math, that will cost you $77k/mo at AWS rates.
Faces pictures themselves probably aren’t that interesting. What you really want to know is who the face represents. You get there by processing those Faces images to extract features and match against the Persons database. You want to keep the Faces around for a while in case you develop improved features to match with.
I’m not up-to-date on facial features and recognition. If you’re just checking for a small set of persons of interest, it’s probably easy. Let’s assume that due to the large data set it’s going to take some work to search for a match. The raw Persons data was 500TB, but after extracting features from the photos, you’re going to shave that down by orders of magnitude and can just make a bunch of local caches in each region.
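To put a number on "orders of magnitude": if you assume something like a 128-dimensional float32 embedding per person (512 bytes; the dimensions and dtype are my assumption, not a claim about any real model), the whole Persons feature cache fits in a few hundred GB, and a brute-force nearest-neighbor scan is just a matrix multiply:

```python
import numpy as np

# assumed: 128-dim float32 embedding per person, 512 bytes each
dim = 128
population = 500_000_000
cache_gb = population * dim * 4 / 1e9
print(f"{cache_gb:.0f} GB")  # 256 GB -- vs. 500TB of raw photos

# brute-force nearest-neighbor search on a toy slice of the database
rng = np.random.default_rng(1)
persons = rng.normal(size=(10_000, dim)).astype(np.float32)
persons /= np.linalg.norm(persons, axis=1, keepdims=True)  # unit-normalize
query = persons[1234] + 0.01 * rng.normal(size=dim)        # noisy re-observation
query /= np.linalg.norm(query)
match = int(np.argmax(persons @ query))  # highest cosine similarity wins
print(match)  # 1234
```

256GB replicates cheaply into every region, which is what makes the "bunch of local caches" plan workable; at full scale you'd swap the brute-force scan for an approximate nearest-neighbor index, but the storage math is the point here.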
We’re not talking far-future tech here. This stuff is doable. And with these cost estimates, I can be off by many orders of magnitude and the system would still be affordable with a state’s resources.