A Whiggish History of Computer Vision

For an interactive experience of this history, please grant permission to use your webcam when asked.

You're not being recorded. No data leaves your computer.

Introduction Seeing is harder than it looks

In 1966, Seymour Papert proposed the Summer Vision Project, bringing together artificial intelligence researchers for “the construction of a significant part of a visual system” over the course of just a few months. But the goal of solving many problems of computer vision proved overambitious, and instead the researchers rediscovered a fact long familiar in vision science: seeing is harder than it looks. (Histories don't write themselves. For credits and other histories of computer vision, scroll to the bottom of the page.)

Since the optical studies of Ibn al-Haytham in the 11th century, scientists have recognized a gap between the confused mess of visual information hitting the eye and visual experience, where segmented objects appear arrayed in space with clear differences between foreground and background. The assumption for the last millennium has been that there must be unconscious judgments involved in processing the visual information: segmentation, spatial configuration, object recognition, and so on.

Papert recognized this in part. His optimism stemmed from the idea that the different unconscious judgments necessary for understanding an image could be instantiated in different computer programs. Thus the labor could be divided among different teams, with one team writing a program to detect edges, corners, and other pixel-level information in an image, another forming continuous shapes out of these low-level features, a different group arranging the shapes in three-dimensional space, and so on. While the summer project failed, the general approach remained: treat vision not as a single problem, but as a number of discrete subproblems which can be stacked one on top of another in a hierarchy, from edge detection all the way up to robust object recognition.

Funding for computer vision was often generous because the military was being overwhelmed with spy plane and later satellite images (it was the Cold War, after all). The funding for this approach typically came from either the Department of Defense or the Advanced Research Projects Agency (ARPA, later DARPA), the US military research slush fund.

An early computer vision success came from Philco's tank detector in the 1960s, which could locate tanks (given their regular, box-like shape) from statistical analysis of satellite images.

In 1964 the military began investigating a facial recognition system with Woody Bledsoe, Charles Bisson, and Helen Chan, which measured a number of identifying features — how far apart your eyes are, where your hairline starts — that could be matched against a face. (Wired's Secret History of Facial Recognition delightfully recounts the Bledsoe, Bisson, and Chan story.)

Although research funding has waxed and waned over the last 70 years, militaries across the world see more effective computer vision as essential to their national security.
Harun Farocki's Eye/Machine film trilogy examines “intelligent” image processing techniques in warfare and civilian life.
Many academics and industries, more interested in basic research into human vision or self-driving vacuums, often found the pull of government money irresistible. This led to enormously innovative work, some of it anthropocentric and some designed solely for mechanical eyes, into ways of interpreting the world of light.

The original Perceptron could only provide a binary classification — either A or B. In this video they classify faces into the categories men and women, revealing the simplistic assumptions about gender that male computer scientists operated with in the 1950s.

1943 – 1969 Early Neural Nets

In 1943, even earlier than the Summer Vision Project, the neuroscientist Warren McCulloch and the mathematician Walter Pitts provided a model for how a neuron in the brain might engage in basic logical operations. (Amanda Gefter's The Man Who Tried to Redeem the World with Logic is a lively portrait of Pitts' and McCulloch's lives.) Their model suggested the massive interconnected networks of neurons in the brain were computing — performing simple mathematical functions whose outputs differ depending on their inputs.

In the 1950s, AI researchers began to build simple artificial neural networks, attempts to simulate how groups of neurons could compute complex functions. The neuroscientist Frank Rosenblatt developed the most famous model, the Perceptron. (Rosenblatt was a polymath with numerous interests and a special fascination with creating endless variations of "three blind mice" on piano. His interest in neural networks faded and he later looked for different, chemical bases for learning in biological species — specifically, mice.) The Perceptron is an algorithm which takes various numerical values as inputs, connects these inputs with a number of neurons in a single “hidden” layer, and then transmits the result to an output. The connections between inputs and outputs are all weights — numbers designating how relevant each input is to the overall output.

An illustration of Rosenblatt's Perceptron. The system was at once an algorithm and a physical machine, which is perhaps why the diagram (from the Perceptron Operator's Manual) draws from both biology and electrical engineering.
Read more about the history of neural network illustrations in Blueprints of Intelligence. Maya Indira Ganesh and Nils Gilman have written about the visual representation of AI.

It is impractical for a human to determine the various weights by hand. Instead, the weights are set randomly at first, and then slowly "trained" by slightly modifying them whenever they produce the wrong output for an input. This forces the machine to detect patterns in the input which are predictive of an outcome. This was a suggestive way of approaching computer vision, and it was how Rosenblatt originally approached the problem.
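
As a rough illustration — not Rosenblatt's original hardware implementation — here is a minimal sketch of that learning rule in Python, assuming made-up toy data: the weights start random and are nudged only when a prediction is wrong.

```python
import numpy as np

# A minimal single-layer perceptron sketch: weights start random and are
# nudged whenever the predicted output is wrong, as described above.
rng = np.random.default_rng(0)

def train_perceptron(inputs, targets, lr=0.1, epochs=50):
    weights = rng.normal(size=inputs.shape[1])   # one weight per input feature
    bias = 0.0
    for _ in range(epochs):
        for x, target in zip(inputs, targets):
            prediction = 1 if x @ weights + bias > 0 else 0
            error = target - prediction          # 0 if correct, ±1 if wrong
            weights += lr * error * x            # adjust only when wrong
            bias += lr * error
    return weights, bias

# Toy, made-up data: classify points by whether their coordinates sum above 1.
X = rng.uniform(0, 1, size=(200, 2))
y = (X.sum(axis=1) > 1.0).astype(int)
w, b = train_perceptron(X, y)
print("learned weights:", w, "bias:", b)
```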

But single-layer networks can only pick up on superficial patterns, not patterns within patterns — the kind of increasingly abstract relations expected of the hierarchy of visual processing. Moreover, training multilayer networks proved challenging. If there is a single layer, the error signal from the output corresponds directly to the single set of weights. But if there are multiple layers, the error signal only corresponds to the last set of weights — not earlier sets. For many computer vision researchers, this meant the approach was a dead end, and most simply ignored it — or mocked it to limit funding and research on the topic. (Marvin Minsky wrote his dissertation and built the first ANNs in 1950. But, in 1969, he and Seymour Papert wrote Perceptrons, effectively burying ANN research for decades. Some of their criticisms were good science but, in no small part, they were also just trying to muscle out AI funding for approaches besides their own.)

Outside of neural networks, statistical techniques for computer vision were rare until the 1980s. An important exception was texture detection. Texture detectors typically worked through statistical analyses, as in Bela Julesz (1962) and Robert Haralick (1979), finding areas where there were differences in the distribution of pixels, changes in orientation, and so on. (Julesz also invented the stereogram to evaluate stereoscopic vision. These became famous toys later in The Magic Eye series of autostereograms.) Aside from this, though, most approaches stuck with model-based methods, focusing on building up objects out of primitives — edges, corners, colors — and then comparing the result with a pre-existing template.


The Sobel Operator (1968) measures directional changes in intensity or color in an image, also called a gradient. It is particularly useful for detecting edges.

1960s – 1970s Block World

In a surprising experiment from 1959, David Hubel and Torsten Wiesel discovered specific neurons in the cat's visual system which fired when the eye was exposed to lines of a specific orientation. They called these oriented edge detectors “simple,” and argued they fed into more “complex” neurons which combined them into more complex shapes, such as corners.

This discovery provided a model for the lowest level of the vision system's hierarchy: detecting edges, lines, blobs, and other shape-identifying information. This early level of visual processing is often referred to using David Marr's term, the “primal sketch.” Developing this in computer vision involved creating small, often 3×3 pixel, filters which scanned across the image — a process called convolution. These filters would look for specific patterns in the pixels, such as those indicating an edge.

The first such filter was the Roberts Cross in 1966, a simple design which would detect the gray-scale levels of the pixels covered by the filter and look for differences between them. But while the Roberts Cross did well at detecting edges in very well-marked images with no background, it struggled as images became noisier. This is because most images do not display clear objects with well-defined boundaries marked by strong contrasts between neighboring pixels.
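
As a rough sketch of this kind of convolutional filtering — using the 3×3 Sobel kernels mentioned later in this section rather than the original Roberts Cross, and a toy synthetic image in place of a photograph — edge detection amounts to sliding small filters over the image and measuring how strongly each location responds.

```python
import numpy as np
from scipy.signal import convolve2d

# Gradient-based edge detection with small convolution filters. The 3x3 Sobel
# kernels respond to horizontal and vertical changes in intensity; their
# combined magnitude highlights edges.
sobel_x = np.array([[-1, 0, 1],
                    [-2, 0, 2],
                    [-1, 0, 1]], dtype=float)
sobel_y = sobel_x.T

def edge_magnitude(gray):
    """gray: 2D array of grayscale intensities."""
    gx = convolve2d(gray, sobel_x, mode="same", boundary="symm")
    gy = convolve2d(gray, sobel_y, mode="same", boundary="symm")
    return np.hypot(gx, gy)   # gradient magnitude at each pixel

# Toy image: a bright square on a dark background has strong edges at its border.
img = np.zeros((64, 64))
img[16:48, 16:48] = 1.0
edges = edge_magnitude(img)
print(edges.max(), edges[0, 0])   # strong response at the border, none in flat regions
```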

An example illustrating Adolfo Guzmán Arenas' “Decomposition of a visual scene into three-dimensional bodies” (1968)

At a slightly higher level, Adolfo Guzmán Arenas (1968) worked with line drawings of three-dimensional blocks, effectively assuming some other system would handle edge detection.
Horn's Shape from Shading system
His program inferred from the intersection of lines the orientation of each surface and, with it, the overall shape of the object.

Other approaches looked for regularities besides edges in images. In 1970, Berthold Horn showed that shape could be inferred from the shading, since light affects a surface of an object in a consistent way. Other approaches involved using multiple images to infer what is similar between them, such as for stereo vision (Dev 1974) and motion (Horn and Schunck 1981).

Most of these projects focused on a "block world" made up of line drawings of familiar geometric shapes like cubes or pyramids, with attempts at discerning geometric shapes and their alignments. Part of this rested on the limited resolution and general noisiness of digital images at the time. But another part rested on an overall assumption that solving block-world vision would carry over to the real world.
An outdoor photo processed with the combined corner and edge detector.
This was a bad assumption, and new and better edge detectors — such as Sobel (1968) and Canny (1986) — as well as methods for detecting corners, such as Harris (1988), were needed for natural image processing. Although progress was made, it wasn't clear how to get from the primal sketch to object recognition proper outside the block world.

1970s – 1990s Theory-driven Approaches

"Attempts to construct computer models for the interpretation of arbitrary scenes have resulted in such poor performance, limited range of abilities, and inflexibility that, were it not for the human existence proof, we might have been tempted long ago to conclude that high-performance, general-purpose vision is impossible." H.G. Barrow and J.M. TenenbaumWhile many labored on different ways to detect two-dimensional shape, others began to ponder how this information contributed to the real task—recovering the actual, three-dimensional shape and properties of an object. There needed to be some logic governing how the various lines, edges, colors and textures could be “fleshed out” into 3D models of objects, but there wasn't a lot of clarity on how this worked. The general project pushed towards what David Marr called the “2½-dimensional image,” A 2½-dimensional image after Marr. Arrows represent surface orientation. which would provide a three-dimensional reconstruction of the visual scene out of the primal sketch.

This puzzle wasn't confined to computer vision researchers, either; everybody studying vision struggled to explain the steps following Hubel and Wiesel's simple and complex cells. Researchers often appealed to theory to infer what must be happening in the visual system between simple cells and full-blooded object recognition.

Various creatures represented using generalized cylinders

An example of the theory-driven approaches was T.O. Binford's generalized cylinder in 1970. This was one of many "geometric approaches," which tried to infer the view-independent shape of an object — here assumed to be built from cylinders stacked and ordered to represent the girth, orientation, and position of the object's parts. This approach received its clearest application in Rodney Brooks' ACRONYM (1981), a program picked up by the CIA to detect aircraft from satellite images.

Other ideas focused on "global appearance approaches," which assumed the object should be treated as the sum total of its appearances from different angles. An early version of this — still principally focused on block worlds — was aspect graphs, originally proposed by Stephen Underwood and Clarence Coates (1975) and Jan Koenderink and Andrea van Doorn (1976). This approach labelled the aspects of the object in an image and tried to find correspondences with the same aspects in a different image, allowing it to infer the invariant shape and also map the transitions between aspects.

While many other geometric and global appearance approaches were proposed, all ran into the same problem: they didn't scale. Their struggles coincided with a funding crunch, as the early promises of seeing and thinking machines gave way to a broad pessimism that any of these approaches were viable. This didn't mean computer vision died off, but researchers found it easier to work within the field of cognitive science — effectively, treating neuroscience and psychology as matters of computation.

An attractive approach was proposed by H.G. Barrow and J.M. Tenenbaum (1978), who argued the visual system must recover different “intrinsic images” of a scene “by a noncognitive and nonpurposive process.”
A set of intrinsic images derived from a single monochrome intensity image
Their point is that humans do not perceive colors and edges and then infer the sizes and shapes of objects, but intuitively grasp objects as having invariants — stable sizes, shapes, and colors. These invariants remain constant despite incidental changes to their appearance, such as those caused by the perceiver moving, lighting changing, the object being turned, and so on. The assumption is that we could create many of these different intrinsic image detectors separately and that some higher-level process could "bind" them into an object.

Another, related approach came from David Marr's pioneering 1982 work, Vision (from which the terms primal sketch and 2½D image are taken). This presented a clear hierarchy of different stages, bringing together multiple approaches — edge detection, stereo vision, generalized cylinders, and so on — into a single project of reconstructing the visual world. These works were heavy on theory and light on accompanying technical demos, but they still provided a hopeful paradigm for what needed to be done in cognitive science — and, as a result, what would presumably work for computer vision.

1960s – 1980s Recognizing an Object

Recovering the three-dimensional shape from the two-dimensional image was an important step, but it still fell short of recognizing the object. Recognizing, as the word suggests, is a matter of cognizing again — with the assumption that you know what you're looking for. This meant the machines doing object recognition needed to have some template for what it was detecting. This could be in many forms, such as a description — like “triangles have three connected vertices” — or a memorized shape which could be overlaid on the object — like an outline of a car.

Philosophers instantly see the historical parallel to 17th-century debates: is our ability to recognize objects based on having a visual image we can compare in the head? Or are we detecting marks, like corners and curves, and then forming lots of logical judgments and procedures for identifying an object: “this object has three enclosed sides, three vertices, and internal angles summing to 180 degrees”? The former suggests an infinite number of possible visual images are needed for every object, to capture every possible angle, variation, slight change, and so on. The latter, however, struggles with anything beyond very simple shapes which can be exhaustively described geometrically. After all, what would be a broadly general description that all chairs have in common?

An early application of computer vision was in industrial settings, for example when a robot arm needs to detect and interact with a specific part in a factory. Since mass-produced parts are standardized, and the locations of the parts relatively fixed, the machines could basically just apply an outline of what they were looking for onto the various edges and corners in the image, turning it as necessary to properly fit. But these systems weren't building up a visual model of the world; they were hacks, simple tricks applied to solve a problem without theoretical justification. If the situation is narrow enough — as on a production line — hacks are a passable solution. Many robots during these early days, such as Charles Rosen's Shakey in the early 70s, had no choice but to rely on hacks given the computational demands of real-time processing. ("We worked for a month trying to find a good name for it, ranging from Greek names to whatnot, and then one of us said, 'Hey, it shakes like hell and moves around, let's just call it Shakey.'" — Charles Rosen)

Deformable template of a face

An elegant theoretical solution was proposed by Martin Fischler and Robert Elschlager (1972), called deformable templates or pictorial structures. The idea itself goes back to D'Arcy Wentworth Thompson, who recognized that minor variations between individuals or species could often be understood as slight deformations of a general template.

Fischler and Elschlager proposed a simple mathematical model for this concerning faces. They proposed treating the different parts of the face — eyes, nose, mouth, hair, etc. — as parts of a template, with the template originally set to a statistical average of faces. Then the deformations could be taken as springs between these parts that tend to move together: if one eye is closer to the nose, so is the other. Different faces could then be recognized by the deformations: how far apart the eyes are (i.e., how taut or loose the spring is), the distances between eyes and ears, and so on. Thus a simple model could be deformed in innumerable ways to capture the diversity of faces.
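
The scoring idea behind pictorial structures can be sketched in a few lines. The part names, rest offsets, and costs below are illustrative stand-ins, not values from Fischler and Elschlager's paper: a candidate arrangement of parts is scored by how well each part matches its local appearance plus how far each spring is stretched from its rest configuration.

```python
import numpy as np

# Pictorial-structures sketch: total cost = local appearance costs
# + quadratic "spring" penalties between connected parts.
rest_offsets = {                      # expected offset between connected parts (made up)
    ("left_eye", "right_eye"): np.array([40.0, 0.0]),
    ("left_eye", "nose"):      np.array([20.0, 30.0]),
    ("right_eye", "nose"):     np.array([-20.0, 30.0]),
    ("nose", "mouth"):         np.array([0.0, 25.0]),
}

def template_cost(locations, appearance_costs, stiffness=0.01):
    """locations: part -> (x, y) position; appearance_costs: part -> local match cost."""
    cost = sum(appearance_costs.values())              # how well each part matched locally
    for (a, b), rest in rest_offsets.items():
        stretch = (locations[b] - locations[a]) - rest
        cost += stiffness * float(stretch @ stretch)   # penalty for stretching the spring
    return cost

locations = {
    "left_eye":  np.array([100.0, 100.0]),
    "right_eye": np.array([142.0, 101.0]),
    "nose":      np.array([121.0, 131.0]),
    "mouth":     np.array([120.0, 158.0]),
}
appearance = {part: 0.5 for part in locations}   # pretend each part matched moderately well
print(template_cost(locations, appearance))      # lower cost = more face-like arrangement
```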

While a clever approach, it fundamentally depended on the system already recognizing the eyes, nose, ears, hair, and so on, and also plotting their relative locations. And detecting these features depended on the work being pursued by geometric and global appearance approaches. In other words, unless they resorted to hacks, the system needed to recognize a great deal before recognizing a face as a face — much less as a particular person's face — could even get started. As such, this was another theory which held immense potential, but which also required a number of other problems to be solved before it could be put in place.

This was the general situation for templates for much of their history: insights and potential, but a general recognition of how challenging it would be to come up with templates for every possible view or position. Ideas like deformable templates suggested more effective versions, but there were questions about the entire approach.

The soda can collecting robot “Herbert” was built at MIT's AI Lab in the late 80s

1980s – 1990s Return of Neural Nets and the Critique of Pure Vision

A resurgence of interest in neural networks — clunkily called "parallel distributed processing" — began in the 1980s. This effort was led by Geoffrey Hinton, David Rumelhart, and James McClelland, who brought together a number of cognitive scientists and artificial intelligence specialists in a massive, two-volume manifesto for the approach.

Two things were needed for this resurgence to be possible: the (re-)discovery of backpropagation and the building of bigger datasets. Backpropagation, or more properly the backward propagation of errors, is a trick discovered a few different times in the history of neural networks but one that only stuck in the 1980s. Using the chain rule of derivatives, it permits different error signals to be assigned to each layer in a neural network. (While backpropagation involves some fancy math, 3Blue1Brown's Grant Sanderson provides an excellent and intuitive explanation of what is happening under the hood of these machines.) This allowed for training "deep" networks, which permitted learning non-linear features and high-level abstractions.
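
A toy sketch of backpropagation on a two-layer network, assuming nothing beyond NumPy: the chain rule hands the output error back through the hidden layer, so each set of weights gets its own error signal to descend.

```python
import numpy as np

# Toy two-layer network trained with backpropagation on a non-linear target.
rng = np.random.default_rng(1)
X = rng.uniform(-1, 1, size=(200, 2))
y = ((X[:, 0] * X[:, 1]) > 0).astype(float).reshape(-1, 1)   # XOR-like labels

W1 = rng.normal(scale=0.5, size=(2, 8)); b1 = np.zeros(8)    # input -> hidden
W2 = rng.normal(scale=0.5, size=(8, 1)); b2 = np.zeros(1)    # hidden -> output
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

lr = 0.5
for step in range(5000):
    # Forward pass
    h = sigmoid(X @ W1 + b1)
    out = sigmoid(h @ W2 + b2)
    # Backward pass: error at the output...
    d_out = (out - y) * out * (1 - out)
    # ...passed back to the hidden layer via the chain rule
    d_h = (d_out @ W2.T) * h * (1 - h)
    # Each layer now has its own gradient to descend
    W2 -= lr * h.T @ d_out / len(X); b2 -= lr * d_out.mean(axis=0)
    W1 -= lr * X.T @ d_h / len(X);   b1 -= lr * d_h.mean(axis=0)

print("training accuracy:", ((out > 0.5) == y).mean())
```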

Equally important, though, was the rise of bigger datasets. Machine learning works by combing through numerous examples and discovering relevant features that are shared by an object. With small datasets, machines often pick out features that do not generalize well to objects outside the training set.
Missing Datasets by artist Mimi Onuoha
Datasets encode human biases inherent in data and its collection. The artist Mimi Ọnụọha collects Missing Datasets, writing “That which we ignore reveals more than what we give our attention to. It’s in these things that we find cultural and colloquial hints of what is deemed important.”
With larger datasets, the machine is forced to discover more general features that correspond to a broader set of objects. This makes the network more robust to minor variations and to samples outside the training set.

A central criticism of neural nets remained: they are only detecting statistical patterns — effectively picking out keypoints — and not actually reconstructing the visual world. The new proponents of neural nets turned the criticism into a virtue: vision is not a matter of reconstructing the visual world by creating a model of the object or scene in the head. Instead, vision is about action and thus fundamentally attuned to predictive patterns in the environment. In other words, picking up statistical patterns in the environment is how biological vision works, and so it should be the model for machine vision as well. Even better, from this perspective, is that neural nets would discover the most predictive patterns by learning them — effectively forming their own template for each item based on what is most effective for recognizing it.

This approach formed the backbone of Rodney Brooks's Herbert, an R2-D2-looking machine made up of numerous sensors and manipulators connected to neural nets. The sensors were processed separately — each doing their own task — and the overall responses were coordinated. Thus no reconstruction of the environment was needed; just lots of different sensors, each detecting their own "affordances" (using the ecological psychology term introduced by J.J. Gibson) that directly move the body before it runs into something, or automatically extend its arm to grasp something. The bot went around Brooks's lab picking up soda cans and taking them to a recycling bin — a precursor to Brooks's later robotic vacuum, the Roomba.

Critics argued perception-as-action was just a hack for robots; there was no path from Herbert to human vision. And the growth of datasets and sufficient computing power was gradual, so critics could dismiss each new neural net as just another neat trick rather than a long-term path. Still, the successes were enough to encourage researchers from the older tradition — now dubbed "Good Old-Fashioned Artificial Intelligence," or GOFAI, by John Haugeland — to embrace machine learning at the margins of their research.


The histogram of oriented gradients is a technique to represent elementary characteristics of objects within an image through edge directions. The image is divided into square cells. For each cell, the directions of edges are compiled, and their relative distribution captured as a histogram. This video is a detailed walkthrough of the process.

1990s – 2000s Finding Features

In the 1980s and 90s, the early theory-driven approaches — based around building overall 3D models first and then decomposing them into features — gave way to discovering local features first and then composing them into models. A number of approaches showed the value of focusing on local-level data, such as differences between pixels, to detect features. This was possible in part because digital images had become of such higher quality that investigating pixel-level data became practical.

This approach found a connection with theory-driven views in the search for keypoints — salient points which appear in a regular way no matter which way the object faces. For example, Cordelia Schmid and Roger Mohr (1996) found that if you mark all the Harris corners on a teapot and record their relative positions, these remain broadly invariant when the teapot is turned a little bit. In effect, if you know what you're looking for and you discover where a few key points on an object are, you can infer the geometry — the shape, orientation, and so on — based on what those few points tell you. This provided a kind of fusion of appearance and geometric approaches, where discerning the former allowed you to guess the latter.
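
A minimal sketch of this kind of corner-based keypoint detection, using OpenCV's Harris detector; the image path is a placeholder for any photograph of an object.

```python
import cv2
import numpy as np

# Mark Harris corners as candidate keypoints whose relative positions stay
# roughly stable as the object is turned slightly.
img = cv2.imread("teapot.png")                 # placeholder path
gray = np.float32(cv2.cvtColor(img, cv2.COLOR_BGR2GRAY))

# Corner response at each pixel (neighborhood size, Sobel aperture, Harris k).
response = cv2.cornerHarris(gray, blockSize=2, ksize=3, k=0.04)

# Keep only the strongest responses as keypoints.
keypoints = np.argwhere(response > 0.01 * response.max())
print(f"{len(keypoints)} corner keypoints found")
```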

This led to an interest in developing systems performing two tasks: a lower-level "descriptor" that tried to render the appearances of the scene into various edges, and then a higher-level "detector" that could find the keypoints — the salient mix of edges — in those appearances to infer the geometry of the object. An early example of a descriptor, by Robert McConnell, is the histogram of oriented gradients, or HOG (the webcam app above). Introduced in 1982 but not implemented until the mid-1990s, it establishes a grid of gradients which extracts the most salient changes in intensity at different points in an image. Out of these, some keypoints would be specified for the object to be detected, and these would be detectable even if slightly changed.
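
A sketch of a HOG descriptor using scikit-image's implementation (a modern formulation rather than McConnell's original), with a random array standing in for a grayscale image patch: the image is divided into cells, gradient orientations in each cell are binned into a histogram, and the concatenated histograms form the descriptor.

```python
import numpy as np
from skimage.feature import hog

image = np.random.rand(128, 64)     # stand-in for a grayscale image patch

features = hog(
    image,
    orientations=9,             # number of gradient-direction bins per cell
    pixels_per_cell=(8, 8),     # cell size in pixels
    cells_per_block=(2, 2),     # cells grouped together for local normalization
)
print(features.shape)           # one long vector of per-cell orientation histograms
```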

A similar trick in the late-1990s, by David Lowe, was the scale-invariant feature transform (SIFT),
Model images for 3D objects (with outlines found by background segmentation) are used to recognize 3D objects with occlusion.
which was capable of detecting keypoints like corners at many different sizes and still relating them to one another. This allowed the same objects to be detected both when turned and at different distances. A host of other descriptors, each with their own perks and uses, arose during this period, all bringing out local features.
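
A sketch of SIFT keypoint detection and matching with OpenCV, assuming placeholder image paths and a recent opencv-python build that includes SIFT:

```python
import cv2

# Detect SIFT keypoints in two views of the same object and match them.
img1 = cv2.imread("view1.jpg", cv2.IMREAD_GRAYSCALE)   # placeholder paths
img2 = cv2.imread("view2.jpg", cv2.IMREAD_GRAYSCALE)

sift = cv2.SIFT_create()
kp1, desc1 = sift.detectAndCompute(img1, None)   # keypoints + 128-dim descriptors
kp2, desc2 = sift.detectAndCompute(img2, None)

# Match descriptors and keep only clearly-best matches (Lowe's ratio test).
matcher = cv2.BFMatcher(cv2.NORM_L2)
matches = matcher.knnMatch(desc1, desc2, k=2)
good = [m for m, n in matches if m.distance < 0.75 * n.distance]
print(f"{len(good)} keypoints matched despite changes in scale and viewpoint")
```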

This strategy worked well but required that the machine already have some insight into which keypoints it was looking for. This is easier if there is merely a single object to be recognized, since the engineer can just program this in as a template, but scaling up to many objects was human-intensive.

Facial Detection in Open CV using the Viola-Jones Framework. Video by Adam Harvey. While the face on the left is unoccluded, the face on the right is using an anti-facial-recognition mask, such as those used by protestors against police using facial recognition. While ineffective against modern face detection, the work highlights the problems being faced by those confronting the increased use of computer vision by companies and governments.

2000s – 2010s Random Feature Generation and Selection

While descriptors and keypoints helped, the detection process still could only differentiate those few objects it was trained to look for.
A learnt person model based on deformable parts from Felzenszwalb et al (2010)
As datasets became larger, it became increasingly untenable to try and solve these problems by hand.

A different approach was to use HOG, SIFT, or some other low-level descriptor for detecting local features and then use machine learning to figure out which local features are predictive of which objects. An example of how low-level descriptors could discover these features was the Viola-Jones face detection system, developed by Paul Viola and Michael Jones (2001). This system was especially clever because Viola and Jones did not tell the system what features of a face were most salient; the system figured it out by training itself on images of faces. The algorithm overlays the image with descriptors that seek out distinct patterns.
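
A minimal sketch of Viola-Jones-style face detection, using the pretrained Haar cascade that ships with OpenCV and a placeholder image path:

```python
import cv2

# Face detection with OpenCV's pretrained Haar cascade (the Viola-Jones framework).
cascade = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml"
)

img = cv2.imread("photo.jpg")                       # placeholder path
gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)

# Scan the image at multiple scales; each hit is an (x, y, w, h) box around a face.
faces = cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
for (x, y, w, h) in faces:
    cv2.rectangle(img, (x, y), (x + w, y + h), (0, 255, 0), 2)
cv2.imwrite("faces.jpg", img)
```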

There was an overlooked precedent for this in the neural network world: Kunihiko Fukushima's 1979 Neocognitron. The system mixed hand-crafted local-feature descriptors with a fully-connected, machine learning neural network — though an idiosyncratic one, since it didn't use backpropagation. Computer vision researchers in the 2000s followed a similar approach, bringing together a hand-crafted descriptor and then allowing a supervised neural net to perform the work of detection.
The short film Robot Readable World (2012) catalogues the aesthetics of machine vision in the 2000s.

These approaches typically split the machine into three stages: something like a HOG or SIFT descriptor at the lowest level, followed by a number of layers engaged in unsupervised learning — a process of clustering similar features into different categories based solely on their superficial similarity. This unsupervised learning effectively discovered the various keypoints which recur in multiple images. Whenever one of these keypoints appeared, it would be marked as present in that image, with the result that each image corresponded to a "bag of features" — a tally of all the local features present in the image. The third stage could then simply be a supervised network that would associate that bag of features with a particular name. This allowed for large-scale object recognition, where a system could detect thousands of different objects because each object consisted of a different set of features.
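
A sketch of the three-stage pipeline, assuming hypothetical `train_images` (grayscale arrays) and `train_labels` variables, with SIFT as the descriptor, k-means clustering as the unsupervised "vocabulary" stage, and a linear SVM as the supervised stage:

```python
import cv2
import numpy as np
from sklearn.cluster import KMeans
from sklearn.svm import LinearSVC

sift = cv2.SIFT_create()

def descriptors(img):
    _, desc = sift.detectAndCompute(img, None)
    return desc if desc is not None else np.empty((0, 128), np.float32)

# Stage 1: low-level descriptors from every training image.
all_desc = np.vstack([descriptors(img) for img in train_images])

# Stage 2: unsupervised clustering of descriptors into a "visual vocabulary".
vocab = KMeans(n_clusters=200, n_init=4).fit(all_desc)

def bag_of_features(img):
    # Histogram of which visual words appear in the image (spatial layout is lost).
    words = vocab.predict(descriptors(img))
    return np.bincount(words, minlength=200)

X = np.array([bag_of_features(img) for img in train_images])

# Stage 3: a supervised classifier associates each bag of features with a label.
clf = LinearSVC().fit(X, train_labels)
```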

There were obvious problems with these approaches, not least that a list of features present doesn't tell you where in the image those features are — the spatial locations are all lost, rendering scene recognition near impossible. Various tweaks were proposed for this, such as the spatial pyramids of Svetlana Lazebnik, Cordelia Schmid, and Jean Ponce (2006), which detected features both in the global image and in "quadrants" of the image, discovering both what is relevant as a whole and what matters in the different parts. This allowed multiple levels of feature detection, with the most minute scaling up into the most global. In effect, multi-level approaches like Marr's were finally being realized.


This convolutional neural network identifies objects in images. It demonstrates the impressive capabilities of modern computer vision, as well as some of its limitations. For example, there may be objects that are common in your location that the system can't recognize: only 80 object categories are represented in the dataset.

1990s – Present Convolutional Neural Nets and the Computer Vision Revolution

One of the biggest innovations in neural nets came from Yann LeCun's LeNet convolutional neural network (CNN), first introduced in 1989 and achieving prominence in 1998. Like the Neocognitron, the first layers were small convolutional feature detectors which scanned over the image looking for things like oriented edges, corners, curves, and so on. A layer directly after "pooled" the results of the convolution, passing along the best matches and ignoring poorer ones.

But unlike the Neocognitron, LeNet learned “end-to-end,” without any hand-crafting of low-level descriptors or of which features should be pooled. This meant lower levels needed to learn their own descriptors, middle levels needed to cluster their own sets of features, and the highest levels needed to make the appropriate match between the features detected and the proper label — a seemingly insurmountable task. But LeNet showed that machines could pull this off, using backpropagation and a large dataset — in this case MNIST, a massive collection of hand-written digits.
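
A minimal LeNet-style sketch in PyTorch — layer sizes are illustrative and differ in detail from LeCun's original — showing convolutional feature detectors and pooling feeding fully connected layers, with the whole stack trainable end-to-end by backpropagation on 28×28 digit images such as MNIST:

```python
import torch
import torch.nn as nn

class LeNetStyle(nn.Module):
    def __init__(self, num_classes=10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 6, kernel_size=5, padding=2),   # learned low-level filters
            nn.Tanh(),
            nn.MaxPool2d(2),                             # pooling keeps the strongest local responses
            nn.Conv2d(6, 16, kernel_size=5),
            nn.Tanh(),
            nn.MaxPool2d(2),
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(16 * 5 * 5, 120), nn.Tanh(),
            nn.Linear(120, 84), nn.Tanh(),
            nn.Linear(84, num_classes),                  # one score per digit class
        )

    def forward(self, x):
        return self.classifier(self.features(x))

model = LeNetStyle()
digits = torch.randn(8, 1, 28, 28)     # stand-in batch of digit images
print(model(digits).shape)             # torch.Size([8, 10])
```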

But CNNs ran into a major problem: outside of hand-written digits, there wasn't a dataset big enough to show off their power. Thus people generally dismissed them as a neat trick rather than a real alternative to hand-crafted descriptors. The prevailing assumption was that hand-written characters could be dismissed as the neural net equivalent of "block world," an overly simple domain which could not be scaled up. Tested on natural images, the assumption went, the system would not be able to settle on common descriptors for any and all images — and thus could not group local features to be bagged up to aid detection, much less figure out which bags of features correspond to which label.

The dataset CNNs needed came only in 2010. Fei-Fei Li spent years putting together a much larger dataset, ImageNet, consisting of 14 million natural images downloaded from the internet and hand-labelled by an assortment of human contract workers. (The genealogy and politics of datasets like ImageNet have been widely publicized, e.g. in Lines of Sight by Alex Hanna, Emily Denton, and colleagues. They are joined by numerous artistic explorations, e.g. Caroline Sinders' Feminist Dataset project, Everest Pipkin's On Lacework, and Anna Ridler's Myriad (Tulips).) The ImageNet Challenge began in 2010, allowing a yearly test of the best computer vision programs at recognizing the objects in images. While the first two years went to mixed hand-crafted and machine learning systems, in 2012 Alex Krizhevsky's AlexNet end-to-end CNN handily defeated all comers.

This proved a revolution in computer vision, and many researchers simply dropped their old projects and embraced end-to-end systems. As the networks got larger, many more tweaks needed to be added to CNNs for them to learn increasingly abstract patterns. Some of these had biological inspiration, such as Kaiming He's residual networks (ResNets), which mimic the way some neural connections skip over layers. This ensures relevant low-level features aren't forgotten in downstream processing. Other approaches, such as Joseph Redmon's YOLO system demoed here, ensure spatial information is retained by breaking the image up into regions (akin to the spatial pyramid approach), so that both an object and its location are recognized together. The results are often nothing short of jaw-dropping — both in how well these systems perform and in the counterintuitive errors and mistakes they are prone to.
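
A sketch of a ResNet-style residual block in PyTorch: the skip connection adds the block's input back to its output, so low-level features are carried forward even when the intervening layers contribute little.

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)
        self.relu = nn.ReLU()

    def forward(self, x):
        out = self.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        return self.relu(out + x)       # the skip connection

block = ResidualBlock(64)
feature_map = torch.randn(1, 64, 56, 56)
print(block(feature_map).shape)         # same shape in, same shape out
```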

CNNs also proved effective in domains besides vision and quickly turned “artificial intelligence” from an abstract, fantastical term into a concrete reality — one with deep social, economic, and political implications. The initial excitement over machines detecting faces and adding cat ears to them gave way to instances of algorithmic discrimination and fears over the capacity of governments to use facial recognition to identify dissidents. The field of AI ethics, which previously consisted of little beyond philosophical discussions of Blade Runner, turned to very practical worries over the deployment and usage of this novel technology.

Spearheaded by outspoken practitioners, a growing number of researchers are working towards acknowledging, diagnosing, and mitigating biases in data, algorithms, and people. Among them is computer scientist Timnit Gebru, whose team proposed “model cards” — nutrition labels of sorts to contextualize AI models and their shortcomings. (Gebru's and her colleague Margaret Mitchell's ousting from Google in 2020 highlights the issues of a tech industry under pressure to address the flaws of its highly profitable technology.) Gebru also worked with Joy Buolamwini, who publicized that most commercially available facial recognition systems performed worse on dark-skinned faces.
Coded Bias, a documentary by Shalini Kantayya, follows Buolamwini's work and the push for legislating facial recognition.

They are joined by a growing number of people — across academia, industry, law, journalism, and art — whose research, activism, and books address the numerous failures of algorithmic decision-making, not just in computer vision but also in commerce and law enforcement. (A very much non-exhaustive list includes Cathy O'Neil's Weapons of Math Destruction, Race After Technology by Ruha Benjamin, and Artificial Unintelligence by Meredith Broussard.) Advocacy groups such as Black in AI and Queer in AI, as well as non-profits, have sprung up to address many concerns, including the lack of diversity in AI research and industry.

The upshot is that computer vision isn't just a concern for people in a lab anymore. It is ubiquitous and becoming essential to our everyday lives, integrated into the photo-obsessed social media worlds of Instagram and Snapchat. And, despite its increasing familiarity, much of the move from R&D to the everyday happened entirely within the last decade — far faster than anyone could have imagined.

Conclusion Future of Computer Vision

Computer vision, as it becomes more powerful, will find increasingly many applications. Self-driving cars are perhaps the most exciting and visible application, though also one that shows the challenges still facing the field. The problem facing these cars is not just detecting cars but also reasoning about them.

Most people have intuitions that these happen at different levels: vision just passively interprets what is in the scene, while some central processor determines what it means. But part of what neural networks reveal is that these are less separable than initially thought: what objects and events mean is necessary for interpreting the visual data, and what is important and worth interpreting in the visual data depends on what the perceiver expects to matter. The old boundary between perception and cognition has given way to a more integrated process where the stages form an inseparable whole.

This makes sense from an evolutionary perspective: rather than specifying a bunch of discrete, fully-specified modules and their interactions in the genes, just set up the right neural architectures and let the system self-tune through practice. But it does mean the next step for computer vision is to integrate reasoning into perceiving — beginning to perceive causal relationships and to predict interpersonal relationships between people.

Although it is possible to create datasets highlighting these features, it seems unlikely this can be done through supervised training. Most researchers have instead explored "self-supervised" approaches. The impressive generative adversarial networks (GANs), originally developed by Ian Goodfellow in 2014, are an example, putting two neural networks — one generating images and one recognizing objects in an image — into competition, with the generator trying to trick the discriminator into recognizing computer-generated objects as genuine. (GANs have proven a popular technique for creating "deepfakes," as is testable at https://thispersondoesnotexist.com/.)
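
A toy GAN sketch in PyTorch, using one-dimensional data rather than images purely to show the competition: the discriminator learns to separate real samples from generated ones, while the generator learns to fool it.

```python
import torch
import torch.nn as nn

G = nn.Sequential(nn.Linear(8, 32), nn.ReLU(), nn.Linear(32, 1))                  # noise -> fake sample
D = nn.Sequential(nn.Linear(1, 32), nn.ReLU(), nn.Linear(32, 1), nn.Sigmoid())    # sample -> "realness"
opt_g = torch.optim.Adam(G.parameters(), lr=1e-3)
opt_d = torch.optim.Adam(D.parameters(), lr=1e-3)
bce = nn.BCELoss()

for step in range(2000):
    real = torch.randn(64, 1) * 0.5 + 3.0          # "real" data: a Gaussian around 3
    fake = G(torch.randn(64, 8))

    # Discriminator step: label real samples 1, generated samples 0.
    d_loss = bce(D(real), torch.ones(64, 1)) + bce(D(fake.detach()), torch.zeros(64, 1))
    opt_d.zero_grad(); d_loss.backward(); opt_d.step()

    # Generator step: try to make the discriminator call fakes real.
    g_loss = bce(D(fake), torch.ones(64, 1))
    opt_g.zero_grad(); g_loss.backward(); opt_g.step()

print(G(torch.randn(1000, 8)).mean().item())       # should drift toward ~3
```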

This site provides an uncannily familiar (though usually celebrity-level pretty) face despite being fraudulent. More worryingly, it has become possible to create fake videos, such as this deepfake video from BuzzFeed and Jordan Peele of Obama warning of the dangers of deepfakes: https://youtu.be/cQ54GDm1eL0

Other approaches, such as contrastive methods, have proven better at some basic reasoning tasks, though there are also worries about scaling. Many figures in computer vision suspect older methods — such as the theory-driven models involving generalized cylinders — might combine usefully with neural networks to produce a more accurate understanding of the real world. But these approaches are all still in their infancy, and what will end up scaling — and what will end up a mere line or two in a whiggish history of computer vision — is currently unknown.