Google's AI Trick, Making Short Movies from Photos: Google's Deep Learning Machine Learns to Synthesize Real World Images

DeepStereo

According to the US technology blog Gizmodo, a paper covered this week by MIT Technology Review describes DeepStereo, a new system developed at Google that uses artificial intelligence to seamlessly combine a series of photographs into a video.

The paper's lead author, John Flynn, is a Google engineer, and his three co-authors also work at Google. In the paper, Flynn lays out how the DeepStereo system was developed.

Techniques for turning still images into animation existed well before DeepStereo. One project presented at SIGGRAPH (the ACM Special Interest Group on Computer Graphics), for example, built time-lapse animations from photos gathered online.

What sets DeepStereo apart from other still-image animation techniques is that it can guess at the missing parts of a scene, creating new imagery in the gaps that never appeared in the source photos. According to the British outlet The Register, unlike traditional animation, which relies on persistence of vision, DeepStereo can "imagine" the frames that lie between two still images.
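For contrast, the naive way to put extra frames between two stills is to cross-fade them, which merely blends existing pixels rather than imagining unseen content. The short NumPy sketch below (array shapes and names are my own, not from the paper) shows that baseline.

```python
import numpy as np

def cross_fade(frame_a, frame_b, num_inbetween):
    """Naive in-betweening: linearly blend two still frames.

    Unlike DeepStereo, this does not 'imagine' unseen content; overlapping
    objects simply ghost into each other. frame_a and frame_b are (H, W, 3)
    float arrays in [0, 1]; num_inbetween is how many frames to insert.
    """
    frames = []
    for i in range(1, num_inbetween + 1):
        alpha = i / (num_inbetween + 1)  # blend weight for frame_b
        frames.append((1.0 - alpha) * frame_a + alpha * frame_b)
    return frames
```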

In the paper, Flynn and his co-authors write: "This technique is entirely different from previous work: we attempt to synthesize new images directly using a novel deep architecture, without requiring depth, focal length, or similar information to be set up in advance as training data."

The network architecture behind the system is intricate and draws on a range of earlier work, but the authors highlight what makes their approach distinctive: the system runs two separate network towers. One predicts the depth of each pixel from the available 2D data; the other predicts its color. The two predictions, both expressed as 2D images, are combined, and the synthesized frames are assembled into the final video.
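The article only sketches this combination step, but the idea is easy to illustrate: if the depth tower scores a set of candidate depth planes for each pixel and the color tower proposes a color for each pixel at each plane, a per-pixel softmax over the planes turns the scores into weights for averaging the colors. The NumPy sketch below uses shapes and names of my own choosing and is not the paper's code.

```python
import numpy as np

def combine_towers(selection_logits, plane_colors):
    """Blend the outputs of the two towers described above.

    selection_logits : (D, H, W)    per-pixel scores for D candidate depth planes
                       (the 'depth' tower's output in this sketch).
    plane_colors     : (D, H, W, 3) a color prediction for every pixel at every
                       candidate depth (the 'color' tower's output).

    Returns an (H, W, 3) image: each pixel's color is a weighted average of the
    per-plane colors, weighted by how likely that plane is to be the right depth.
    """
    # Softmax over the depth dimension turns scores into per-pixel probabilities.
    logits = selection_logits - selection_logits.max(axis=0, keepdims=True)
    probs = np.exp(logits)
    probs /= probs.sum(axis=0, keepdims=True)

    # Weighted sum over the depth planes gives the final synthesized image.
    return (probs[..., None] * plane_colors).sum(axis=0)
```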

DeepStereo still has shortcomings: the corners of the generated video are noticeably unclear. "Regions the algorithm does not reach tend to be blurry and can be neither covered nor filled in with pixels," the team explains. The system does, however, have a neat trick for rendering objects from blurry source material: "Moving objects are very common in the training data, and our model handles them gracefully: they appear blurred at first and then gradually take on a motion-blur effect."

Although the final product does not look dramatically different from an animation stitched together from still images, the technique could add real polish to Google's Street View, and it gives the company's artificial-intelligence work a more practical showcase.

This month, Google's "dream machines" went viral online. These are the company's highly advanced artificial neural networks, built by its engineers and originally designed to find a practical way for computers to recognize what appears in an image. Google's engineers have been teaching these inscrutable artificial "brains" to recognize animals and buildings, and letting them "dream" along the way, with results that are as unsettling as they are astonishing.

Reposted from Sina Tech

Google Street View offers panoramic views of more or less any city street in much of the developed world, as well as views along countless footpaths, inside shopping malls, and around museums and art galleries. It is an extraordinary feat of modern engineering that is changing the way we think about the world around us. But while Street View can show us what distant places look like, it does not show what the process of traveling or exploring would be like. It’s easy to come up with a fix: simply play a sequence of Street View images one after the other to create a movie. But that doesn’t work as well as you might imagine. Running these images at 25 frames per second or thereabouts makes the scenery run ridiculously quickly. That may be acceptable when the scenery does not change, perhaps along freeways and motorways or through unchanging landscapes. But it is entirely unacceptable for busy street views or inside an art gallery.

So Google has come up with a solution: add additional frames between the ones recorded by the Street View cameras. But what should these frames look like? Today, John Flynn and buddies at Google reveal how they have used the company’s vast machine learning know-how to work out what these missing frames should look like, just by studying the frames on either side. The result is a computational movie machine that can turn more or less any sequence of images into a smoothly running film by interpolating the missing frames. The challenge Flynn and co set themselves is straightforward. Given a set of images of a particular place, the goal is to synthesize a new image of the same area from a different angle of view. That’s not easy. “An exact solution would require full 3-D knowledge of all visible geometry in the unseen view which is in general not available due to occluders,” say Flynn and co. Indeed, it’s a problem that computer scientists have been scratching their heads over for decades and one that is closely related to the problem of estimating the 3-D shape of a scene given two or more images of it.
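One classical building block for reasoning about what a scene looks like from an unseen viewpoint is a plane sweep: reproject a neighboring image onto a stack of candidate depth planes at the new viewpoint, and let a later stage decide which plane each pixel really belongs to. The sketch below only illustrates that general idea under assumed camera parameters (K_src, K_tgt, R, t are placeholders of mine); it is not the pipeline from Flynn and co's paper.

```python
import numpy as np
import cv2  # OpenCV, used only for the perspective warp

def plane_sweep_volume(src_img, K_src, K_tgt, R, t, depths):
    """Reproject one neighboring image onto a stack of candidate depth planes
    as seen from a hypothetical target camera.

    K_src, K_tgt : 3x3 intrinsics of the source and target cameras.
    R, t         : rotation / translation taking source-camera coordinates
                   to target-camera coordinates (X_tgt = R @ X_src + t).
    depths       : candidate plane depths along the source camera's optical
                   axis (plane: n.T @ X = d with n = [0, 0, 1]).
    """
    h, w = src_img.shape[:2]
    n = np.array([[0.0, 0.0, 1.0]])  # fronto-parallel plane normal (source frame)
    volume = []
    for d in depths:
        # Plane-induced homography from source pixels to target pixels.
        H = K_tgt @ (R + (t.reshape(3, 1) @ n) / d) @ np.linalg.inv(K_src)
        warped = cv2.warpPerspective(src_img, H, (w, h))
        volume.append(warped)
    # One warped copy of the source image per candidate depth plane.
    return np.stack(volume, axis=0)
```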

Computer scientists have developed various ways of tackling this problem, but they all suffer from similar weaknesses, particularly where information is missing because one object occludes another. This leads to “tearing,” where there is not enough information, and to the disappearance of fine detail. A particular challenge is objects that contain fine detail and are also self-occluding, such as trees.

Flynn and co’s new approach is to train a machine vision algorithm on a vast dataset of sequential images so that it can work out what the new image should look like. The task for the computer is to treat each image as a set of pixels and to determine the depth and color of each pixel given the depth and color of the corresponding pixels in the images that appear before and after it in the movie. They trained their algorithm, called DeepStereo, using “images of street scenes captured by a moving vehicle,” with 100,000 of these sequences serving as the training data set. They then tested it by removing one frame from a sequence of Street View images and asking the algorithm to reproduce it by looking only at the other images in the sequence. Finally, they compared the synthesized image with the one that was removed, giving them a kind of gold standard to measure against. The results are impressive. “Overall, our model produces plausible outputs that are difficult to immediately distinguish from the original imagery,” say Flynn and co. It successfully reproduces difficult subjects such as trees and grass. And when it does fail, such as with specular reflections, it does so gracefully rather than by “tearing.” In particular, it handles moving objects well. “They appear blurred in a manner that evokes motion blur,” they say.
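The held-out-frame test is easy to express as a leave-one-out check. In the sketch below, `synthesize` is a placeholder for whatever view-synthesis model is being evaluated, and mean absolute pixel error stands in for whichever comparison metric the authors actually used.

```python
import numpy as np

def held_out_error(frames, synthesize, index):
    """Leave-one-out evaluation in the spirit of the test described above.

    frames     : list of (H, W, 3) float arrays, an ordered image sequence.
    synthesize : any view-synthesis function taking the neighboring frames
                 and returning an (H, W, 3) prediction -- a stand-in for
                 DeepStereo or any other model.
    index      : which frame to hide and then try to reconstruct.
    """
    target = frames[index]
    neighbors = frames[:index] + frames[index + 1:]  # everything except the held-out frame
    prediction = synthesize(neighbors)
    # Mean absolute per-pixel error against the held-out 'gold standard' frame.
    return np.abs(prediction - target).mean()
```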

The method isn’t perfect, however. “Noticeable artifacts in our results include a slight loss of resolution and the disappearance of thin foreground structures,” say the Google team. And partially occluded subjects tend to be overblurred in the output. It is also computationally intensive. Flynn and co say it takes 12 minutes on a multicore workstation to produce a single newly synthesized image. So these images cannot be produced on the fly. However, the team expects to improve this in future by optimizing the image generation process.

That’s impressive work that once again shows the potential of deep learning techniques. The team shows off the results in the video posted here, a movie made from Street View data. But the technique should also have other applications in generating content for teleconferencing, virtual reality, and cinematography. It could even reduce the workload for stop-frame animators. Either way, expect to see Google Street View travel movies flood the Web in the not-too-distant future.

About dixiaocui

Autonomous-vehicle researcher and part-time rock musician (Researcher on Intelligent Vehicle; Guitarist/Singer/Rocker)