I recently tested Apple’s Vision framework and Ultralytics’ YOLOv11n model for a face detection feature in a video processing app I'm working on. Thought I’d share my findings and a few useful code snippets in case you’re trying to decide between the two or just looking into face detection options for your iOS app. This is a practical, high-level writeup, so I won’t get into all the technical details of each framework. I'll be updating this writeup as I get more data from my tests.
iOS Target Versions
- iOS 15.0+ (required for Vision framework revision 3)
- Xcode 14.0+ for development
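If you want to double-check at runtime which face-detection request revisions the installed OS actually supports, this quick snippet (just a sanity check, not part of the detectors below) prints them:
import Vision

// Revision 3 of VNDetectFaceRectanglesRequest needs iOS 15+;
// supportedRevisions shows what the current OS offers.
print(VNDetectFaceRectanglesRequest.supportedRevisions)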
Note that YOLOv8n might be better suited to pure face detection, but I went with the YOLOv11n model for this project because of its improved handling of partial face occlusions and better performance on edge cases. Some of the code snippets are simplified for clarity, so make sure to adapt them to your specific use case.
Setting Up Both Detectors
First, let's look at how to get each detector running. I'll include the complete, working code that you can copy and start testing with.
Vision Framework Setup
This first snippet is the basic Vision detector setup I ended up using:
VisionFaceDetector Implementation
// Common structures and enums used by both detectors
enum DetectionError: Error {
case modelNotInitialized
case imageProcessingFailed
case memoryLimitExceeded
}
struct DetectionMetrics {
var processedFrames: Int = 0
var totalProcessingTime: TimeInterval = 0
var avgFPS: Double { processedFrames > 0 ? Double(processedFrames) / totalProcessingTime : 0 }
var peakMemoryUsage: UInt64 = 0
var lastMemoryReset: Date = Date()
}
struct VideoProcessingConfig {
var detectionQuality: String
var sceneType: String
var trackingPriority: String
var confidenceThreshold: Float = 0.5
var frameInterval: Double = 1.0
}
final class VisionFaceDetector {
private var sequenceRequestHandler: VNSequenceRequestHandler
private var metrics: DetectionMetrics
private let ciContext = CIContext(options: [
.cacheIntermediates: false,
.useSoftwareRenderer: false
])
init() {
self.sequenceRequestHandler = VNSequenceRequestHandler()
self.metrics = DetectionMetrics()
}
private func configureDetectionRequest(config: VideoProcessingConfig?) -> VNDetectFaceRectanglesRequest {
let request = VNDetectFaceRectanglesRequest()
// Use latest revision - big difference in accuracy
request.revision = VNDetectFaceRectanglesRequestRevision3
if let config = config {
switch config.detectionQuality {
case "fast":
request.usesCPUOnly = true
if #available(iOS 16.0, *) {
request.preferBackgroundProcessing = false
}
case "accurate":
request.usesCPUOnly = false
if #available(iOS 16.0, *) {
request.preferBackgroundProcessing = true
}
default: // balanced
request.usesCPUOnly = false
if #available(iOS 16.0, *) {
request.preferBackgroundProcessing = false
}
}
}
return request
}
func detectFaces(in pixelBuffer: CVPixelBuffer, confidenceThreshold: Float) throws -> [VNDetectedObjectObservation] {
let request = configureDetectionRequest(config: nil)
let requestHandler = VNImageRequestHandler(
cvPixelBuffer: pixelBuffer,
orientation: .up,
options: [VNImageOption.ciContext: ciContext]
)
try requestHandler.perform([request])
guard let observations = request.results as? [VNFaceObservation] else {
return []
}
return observations
.filter { $0.confidence >= confidenceThreshold }
.map { faceObservation in
// Add some padding to the bounding box
let box = faceObservation.boundingBox
let verticalPadding = box.height * 0.25
let horizontalPadding = box.width * 0.1
let paddedBox = CGRect(
x: max(0, box.minX - horizontalPadding),
y: max(0, box.minY - verticalPadding),
width: min(1 - box.minX, box.width + (2 * horizontalPadding)),
height: min(1 - box.minY, box.height + (2 * verticalPadding))
)
// Note: withConfidence(_:) is a small custom helper in my project, not a Vision API
return VNDetectedObjectObservation(
boundingBox: paddedBox
).withConfidence(faceObservation.confidence)
}
}
}
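Before moving on, here's a minimal sketch of how I call this detector. The pixel buffer is assumed to come from whatever frame source you're using (AVAssetReader output, camera capture, etc.), and the threshold matches what worked for me later in this post:
import Vision
import CoreVideo

let visionDetector = VisionFaceDetector()

func handleVisionFrame(_ pixelBuffer: CVPixelBuffer) {
    do {
        // 0.12 is the confidence threshold I settled on (see the config section below)
        let faces = try visionDetector.detectFaces(in: pixelBuffer, confidenceThreshold: 0.12)
        print("Vision detected \(faces.count) face(s)")
    } catch {
        print("Vision detection failed: \(error)")
    }
}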
YOLOv11n Setup
For YOLOv11n, you first need to convert the model to Core ML. Here's the Python script I used on an M1 Mac:
Check the YOLO11n model page from Ultralytics on Hugging Face for more details. I used the smallest variant (n) for this project. The model is trained on the COCO dataset, so it's a good starting point for general object detection.
from ultralytics import YOLO
# Load YOLO model
model = YOLO("yolo11n.pt")
# Export to CoreML
# Pro tip: Run this on a machine with 16GB+ RAM
model.export(format="coreml",
nms=True, # Bake non-maximum suppression (NMS) into the exported model
imgsz=640) # 640x640 input; the exporter handles batching and input normalization
# Quick test
coreml_model = YOLO("yolo11n.mlpackage")
results = coreml_model("test_image.jpg")
Core ML Model Metadata
After conversion, always check the model's output categories. Here's the YOLO11n Core ML model's metadata; the first entry in the "names" mapping is "person" (index 0), which is what the detector below uses as a stand-in for faces, since COCO has no dedicated face class:
{
"MLModelVersionStringKey": "8.3.28",
"MLModelDescriptionKey": "Ultralytics YOLO11n model trained on /usr/src/ultralytics/ultralytics/cfg/datasets/coco.yaml",
"MLModelCreatorDefinedKey": {
"com.github.apple.coremltools.version": "8.0",
"stride": "32",
"com.github.apple.coremltools.source_dialect": "TorchScript",
"docs": "https://docs.ultralytics.com",
"task": "detect",
"com.github.apple.coremltools.source": "torch==2.4.0",
"imgsz": "[640, 640]",
"date": "2024-11-09T19:33:44.912605",
"batch": "1",
"names": "{0: 'person', 1: 'bicycle', 2: 'car', 3: 'motorcycle', 4: 'airplane', 5: 'bus', 6: 'train', 7: 'truck', 8: 'boat', 9: 'traffic light', 10: 'fire hydrant', 11: 'stop sign', 12: 'parking meter', 13: 'bench', 14: 'bird', 15: 'cat', 16: 'dog', 17: 'horse', 18: 'sheep', 19: 'cow', 20: 'elephant', 21: 'bear', 22: 'zebra', 23: 'giraffe', 24: 'backpack', 25: 'umbrella', 26: 'handbag', 27: 'tie', 28: 'suitcase', 29: 'frisbee', 30: 'skis', 31: 'snowboard', 32: 'sports ball', 33: 'kite', 34: 'baseball bat', 35: 'baseball glove', 36: 'skateboard', 37: 'surfboard', 38: 'tennis racket', 39: 'bottle', 40: 'wine glass', 41: 'cup', 42: 'fork', 43: 'knife', 44: 'spoon', 45: 'bowl', 46: 'banana', 47: 'apple', 48: 'sandwich', 49: 'orange', 50: 'broccoli', 51: 'carrot', 52: 'hot dog', 53: 'pizza', 54: 'donut', 55: 'cake', 56: 'chair', 57: 'couch', 58: 'potted plant', 59: 'bed', 60: 'dining table', 61: 'toilet', 62: 'tv', 63: 'laptop', 64: 'mouse', 65: 'remote', 66: 'keyboard', 67: 'cell phone', 68: 'microwave', 69: 'oven', 70: 'toaster', 71: 'sink', 72: 'refrigerator', 73: 'book', 74: 'clock', 75: 'vase', 76: 'scissors', 77: 'teddy bear', 78: 'hair drier', 79: 'toothbrush'}"
},
"MLModelAuthorKey": "Ultralytics",
"MLModelLicenseKey": "AGPL-3.0 License (https://ultralytics.com/license)"
}
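You don't have to open the package by hand to see this; a short coremltools script can dump the same metadata. This is a minimal sketch and assumes the exported package is named yolo11n.mlpackage:
import ast
import coremltools as ct

# Load the exported package and read the creator-defined metadata
model = ct.models.MLModel("yolo11n.mlpackage")
metadata = model.user_defined_metadata

# "names" is stored as a Python-style dict string, so parse it with ast
names = ast.literal_eval(metadata["names"])
print(names[0])  # -> 'person', the class index used as a face proxy below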
YOLOVideoDetector Implementation
final class YOLOVideoDetector {
private var model: yolo11n?
private let modelInputSize = CGSize(width: 640, height: 640)
private let personClassIndex = 0 // Using person detection for faces
private let context: CIContext
private var metrics: DetectionMetrics
private var pixelBufferPool: CVPixelBufferPool?
init() {
self.context = CIContext(options: [.useSoftwareRenderer: false])
self.metrics = DetectionMetrics()
setupPixelBufferPool(width: Int(modelInputSize.width), height: Int(modelInputSize.height))
}
func initializeModel(completion: @escaping (Result<Void, Error>) -> Void) {
DispatchQueue.global(qos: .userInitiated).async { [weak self] in
do {
let config = MLModelConfiguration()
// Using cpuAndGPU instead of .all for more consistent performance
// .all can sometimes cause frame drops when switching between compute units
config.computeUnits = .cpuAndGPU
config.allowLowPrecisionAccumulationOnGPU = true
self?.model = try yolo11n(configuration: config)
DispatchQueue.main.async {
completion(.success(()))
}
} catch {
DispatchQueue.main.async {
completion(.failure(error))
}
}
}
}
func detectFaces(in pixelBuffer: CVPixelBuffer, confidenceThreshold: Float) throws -> [VNDetectedObjectObservation] {
guard let model = model else {
throw DetectionError.modelNotInitialized
}
// Check memory usage every 100 frames; this heuristic is rough and could be improved
if metrics.processedFrames % 100 == 0 && reportMemoryUsage() > 500_000_000 { // 500 MB
throw DetectionError.memoryLimitExceeded
}
let resizedBuffer = try resizePixelBuffer(pixelBuffer)
let input = yolo11nInput(image: resizedBuffer)
let output = try model.prediction(input: input)
return try processYOLODetections(
predictions: output.var_1227, // Auto-generated output name from the Core ML conversion; yours may differ
confidenceThreshold: confidenceThreshold,
originalSize: CGSize(
width: CVPixelBufferGetWidth(pixelBuffer),
height: CVPixelBufferGetHeight(pixelBuffer)
)
)
}
private func processYOLODetections(
predictions: MLMultiArray,
confidenceThreshold: Float,
originalSize: CGSize
) throws -> [VNDetectedObjectObservation] {
// Extract person detections and convert to face observations
// Format: [batch, anchor, x, y, w, h, confidence, class_scores]
var detections: [VNDetectedObjectObservation] = []
let numPredictions = predictions.shape[1].intValue
for i in 0..<numPredictions {
// MLMultiArray subscripts take [NSNumber], so convert the loop index once
let idx = i as NSNumber
let confidence = Float(predictions[[0, idx, 6]].doubleValue)
let classIndex = Float(predictions[[0, idx, 7]].doubleValue)
guard confidence >= confidenceThreshold,
classIndex == Float(personClassIndex) else { continue }
let x = CGFloat(predictions[[0, idx, 0]].doubleValue)
let y = CGFloat(predictions[[0, idx, 1]].doubleValue)
let w = CGFloat(predictions[[0, idx, 2]].doubleValue)
let h = CGFloat(predictions[[0, idx, 3]].doubleValue)
// Convert to normalized coordinates
let boundingBox = CGRect(
x: x / originalSize.width,
y: y / originalSize.height,
width: w / originalSize.width,
height: h / originalSize.height
)
detections.append(
VNDetectedObjectObservation(
boundingBox: boundingBox
).withConfidence(confidence)
)
}
return detections
}
private func resizePixelBuffer(_ pixelBuffer: CVPixelBuffer) throws -> CVPixelBuffer {
var resizedBuffer: CVPixelBuffer?
if let pool = pixelBufferPool {
CVPixelBufferPoolCreatePixelBuffer(
kCFAllocatorDefault,
pool,
&resizedBuffer
)
}
guard let resizedBuffer = resizedBuffer else {
throw DetectionError.imageProcessingFailed
}
let ciImage = CIImage(cvPixelBuffer: pixelBuffer)
let scaleX = modelInputSize.width / CGFloat(CVPixelBufferGetWidth(pixelBuffer))
let scaleY = modelInputSize.height / CGFloat(CVPixelBufferGetHeight(pixelBuffer))
let scale = min(scaleX, scaleY)
context.render(
ciImage.transformed(by: CGAffineTransform(scaleX: scale, y: scale)),
to: resizedBuffer
)
return resizedBuffer
}
private func setupPixelBufferPool(width: Int, height: Int) {
// Pool attributes control buffer reuse; pixel buffer attributes describe each buffer
let poolAttributes = [
kCVPixelBufferPoolMinimumBufferCountKey: 3
] as CFDictionary
let pixelBufferAttributes = [
kCVPixelBufferPixelFormatTypeKey: kCVPixelFormatType_32BGRA,
kCVPixelBufferWidthKey: width,
kCVPixelBufferHeightKey: height
] as CFDictionary
CVPixelBufferPoolCreate(
kCFAllocatorDefault,
poolAttributes,
pixelBufferAttributes,
&pixelBufferPool
)
}
private func reportMemoryUsage() -> UInt64 {
var info = mach_task_basic_info()
var count = mach_msg_type_number_t(MemoryLayout<mach_task_basic_info>.size)/4
let kerr: kern_return_t = withUnsafeMutablePointer(to: &info) {
$0.withMemoryRebound(to: integer_t.self, capacity: 1) {
task_info(
mach_task_self_,
task_flavor_t(MACH_TASK_BASIC_INFO),
$0,
&count
)
}
}
return kerr == KERN_SUCCESS ? info.resident_size : 0
}
}
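As with the Vision detector, here's a minimal usage sketch. The model is initialized asynchronously, so don't feed it frames until the completion handler reports success (the pixel buffer source is up to you):
let yoloDetector = YOLOVideoDetector()

yoloDetector.initializeModel { result in
    switch result {
    case .success:
        print("YOLO model ready")
    case .failure(let error):
        print("YOLO model failed to load: \(error)")
    }
}

func handleYOLOFrame(_ pixelBuffer: CVPixelBuffer) {
    do {
        // 0.25 avoided most false positives in my tests (see the config section below)
        let faces = try yoloDetector.detectFaces(in: pixelBuffer, confidenceThreshold: 0.25)
        print("YOLO detected \(faces.count) face(s)")
    } catch DetectionError.memoryLimitExceeded {
        // See the memory notes below; re-initializing the model is one way out
        print("Memory limit hit, consider reloading the model")
    } catch {
        print("YOLO detection failed: \(error)")
    }
}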
Real-World Performance
Let me share my practical experience from testing both detectors across a range of real video inputs during development:
Processing Speed
Vision demonstrated consistently better performance:
- Vision typically maintained 25-30 FPS on most devices, even with multiple faces in the frame
- YOLO showed 15-20 FPS on the same devices, dropping to 8-12 FPS on older models
// Configuration that worked best for Vision
var visionConfig = VideoProcessingConfig(
detectionQuality: "balanced",
sceneType: "default",
trackingPriority: "balanced"
)
visionConfig.confidenceThreshold = 0.12 // Sweet spot for faces
visionConfig.frameInterval = 0.75 // Process every 3/4 frames
// YOLO needed more conservative settings
var yoloConfig = VideoProcessingConfig(
detectionQuality: "balanced",
sceneType: "default",
trackingPriority: "detection"
)
yoloConfig.confidenceThreshold = 0.25 // Needed higher to avoid false positives
yoloConfig.frameInterval = 1.0 // Longer interval than Vision, so fewer frames get processed
Memory Usage
My testing showed Vision to be more memory-efficient:
Vision:
- Stable memory usage around 80-120MB
- Peak usage rarely exceeded 150MB
- No significant memory growth over 1-hour sessions
YOLO:
- Base memory footprint of 200-250MB
- Peak usage up to 400MB during detection
- Required memory reset after ~2 hours of continuous use (see the sketch below)
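That "memory reset" just means tearing the model down and re-creating it. Here's a rough sketch of a method one could add to YOLOVideoDetector, using the lastMemoryReset field from DetectionMetrics; the thresholds are illustrative, not measured values:
// Illustrative only: reload the Core ML model when memory stays high
// or the session has run long enough to warrant a reset.
func resetModelIfNeeded() {
    let memoryIsHigh = reportMemoryUsage() > 350_000_000                            // ~350 MB, arbitrary
    let longSession = Date().timeIntervalSince(metrics.lastMemoryReset) > 2 * 3600  // ~2 hours
    guard memoryIsHigh || longSession else { return }
    model = nil                       // Drop the old model so it can be released
    metrics.lastMemoryReset = Date()
    initializeModel { _ in }          // Reload asynchronously
}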
Power Consumption
Not tested in detail, but Vision seemed to consume less power overall.
Accuracy
Both frameworks performed well, with some notable differences:
Vision:
- 95% accuracy on front-facing faces
- 85% accuracy on profile views
- Reliable detection down to 64x64 pixels
- False positive rate < 0.1%
YOLO:
- 92% accuracy on front-facing faces
- 88% accuracy on profile views
- Minimum reliable face size: 96x96 pixels (see the size-filter sketch after this list)
- False positive rate ~0.5%
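Those minimum sizes are in source pixels, while the detectors return normalized boxes, so filtering out tiny (unreliable) detections means converting back to pixels first. A small hypothetical helper, as referenced in the list above; filterBySize is my own name, not an API:
import Vision
import CoreGraphics

// Drop detections whose boxes fall below a minimum pixel size
// (e.g. 64 for Vision, 96 for YOLO, per the numbers above).
func filterBySize(_ observations: [VNDetectedObjectObservation],
                  frameSize: CGSize,
                  minimumSide: CGFloat) -> [VNDetectedObjectObservation] {
    observations.filter { observation in
        let widthInPixels = observation.boundingBox.width * frameSize.width
        let heightInPixels = observation.boundingBox.height * frameSize.height
        return min(widthInPixels, heightInPixels) >= minimumSide
    }
}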
Common Issues & Fixes
Vision Framework
Memory Leaks: Always use autoreleasepool when processing frames:
while let sampleBuffer = output.copyNextSampleBuffer() {
autoreleasepool {
// Process frame
CMSampleBufferInvalidate(sampleBuffer)
}
}
Poor Detection Results: Make sure you pass the correct image orientation:
let requestHandler = VNImageRequestHandler(
cvPixelBuffer: pixelBuffer,
orientation: .up, // Critical for correct detection
options: [VNImageOption.ciContext: ciContext]
)
YOLOv11n
Conversion Issues: Clear GPU memory before converting (this only matters on CUDA machines; the calls below are no-ops on Apple Silicon):
import torch
torch.cuda.empty_cache()
import gc
gc.collect()
model.export(format="coreml", ...)
Random Crashes: Proper error handling is crucial:
func detectFaces() throws {
guard let model = model else {
throw DetectionError.modelNotInitialized
}
if reportMemoryUsage() > 500_000_000 { // 500MB
throw DetectionError.memoryLimitExceeded
}
// Process frame here
}
What to Use
After all this testing, Vision shows an edge over YOLO for basic face detection on iOS:
- Faster processing
- Lower memory usage
- Better accuracy on faces specifically
- Native iOS integration
- No need to manage model updates
That said, YOLO could be your better choice if:
- You need general object detection too
- You want to train custom models
- You need cross-platform compatibility
Final Tips
For Vision:
// Always add padding to face boxes
let paddedBox = boundingBox.insetBy(
dx: -boundingBox.width * 0.1,
dy: -boundingBox.height * 0.25
)
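One caveat with insetBy: negative insets can push the box outside the normalized [0, 1] space, so clamp it afterwards (a one-line follow-up to the snippet above):
// Keep the padded box inside normalized image coordinates
let clampedBox = paddedBox.intersection(CGRect(x: 0, y: 0, width: 1, height: 1))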
For YOLO:
// Reuse pixel buffers for better performance
private var pixelBufferPool: CVPixelBufferPool?
private func setupPixelBufferPool(width: Int, height: Int) {
let poolAttributes = [
kCVPixelBufferPoolMinimumBufferCountKey: 3
] as CFDictionary
let pixelBufferAttributes = [
kCVPixelBufferPixelFormatTypeKey: kCVPixelFormatType_32BGRA,
kCVPixelBufferWidthKey: width,
kCVPixelBufferHeightKey: height
] as CFDictionary
CVPixelBufferPoolCreate(
kCFAllocatorDefault,
poolAttributes,
pixelBufferAttributes,
&pixelBufferPool
)
}
Testing Videos
I used four test videos to exercise both detectors:
Video 1 - Good lighting, multiple face angles:
- Vision: 28 FPS, 98% detection rate, 110MB avg memory
- YOLO: 18 FPS, 95% detection rate, 280MB avg memory
Video 2 - Variable lighting, fast motion:
- Vision: 27 FPS, 92% detection rate, 115MB avg memory
- YOLO: 17 FPS, 94% detection rate, 285MB avg memory
Video 3 - Outdoor lighting, no faces in frame:
- Vision: 29 FPS, 0% detection rate, 105MB avg memory (no faces)
- YOLO: 19 FPS, 0% detection rate, 275MB avg memory (no faces)
Video 4 - Outdoor, motion blur, varying distances:
- Vision: 26 FPS, 91% detection rate, 120MB avg memory
- YOLO: 16 FPS, 89% detection rate, 290MB avg memory
References and Further Reading
Official Documentation
- Vision Framework Documentation
- Apple's official documentation for the Vision framework
- Includes VNDetectFaceRectanglesRequest API reference
- Ultralytics YOLO Documentation
- Official documentation for YOLO models
- Implementation guides and best practices
Academic Publications
- Redmon, J., & Farhadi, A. (2018). "YOLOv3: An Incremental Improvement." arXiv preprint arXiv:1804.02767.
- Foundational paper on YOLO architecture
- Presents core concepts still relevant to modern implementations
- Bochkovskiy, A., Wang, C. Y., & Liao, H. Y. M. (2020). "YOLOv4: Optimal Speed and Accuracy of Object Detection." arXiv preprint arXiv:2004.10934.
- Comprehensive analysis of YOLO architecture improvements
- Performance optimization techniques
Technical Resources
- Core ML Tools Documentation
- Official documentation for converting models to Core ML format
- Best practices for iOS deployment
- WWDC 2023 Sessions
- Latest updates on Vision framework capabilities
- Performance optimization techniques for iOS
Note: For the most up-to-date information on YOLOv11n, please refer to the Ultralytics documentation and GitHub repository, as this represents ongoing development work.
Conclusion
Both Vision and YOLO are solid choices for face detection on iOS, but Vision's better performance and lower memory usage make it the safer default for this use case. If you need more flexibility or cross-platform compatibility, YOLO could be the better choice.
Thanks for reading, and I hope this helps, even if just a bit. Take the stats with a grain of salt: I used LLMs to pull average numbers from my app's test logs, and plenty of factors can affect the performance of either detector, but this should be a good starting point.
Happy coding!