Sikuli Rethinks Programming

Scripting with screenshots

For as long as there have been computers, there has been coding. And with coding comes repetition—lots of it. That's always been the basic fact of a programmer's existence, even as computers have become ever more friendly from a user's perspective. That's where Sikuli comes in. The latest from CSAIL's User Interface Design Group, it's a programming tool that has the ability to see like a human being. Not only does it put the graphical user interface (or GUI) in the hands of programmers, but it may one day put programming in the hands of everyday computer users.

Sikuli stemmed from the research of Associate Professor Rob Miller, Ph.D. student Tsung-Hsiang (Sean) Chang, and University of Maryland post-doctoral researcher Tom Yeh. It's a software agent that uses screenshots to automate just about any task, so long as there's a GUI involved. Built on the Python programming language, Sikuli lets users program tasks through a combination of screenshots and simple commands.

In the Wixarica language of Mexico's Huichol Indians, Sikuli literally means "God's eye." It's a tribute to the tool's all-seeing approach to computer vision.
 
"It means using vision in a sort of God-like way, although frankly it's just trying to be similar to the way human beings look at their screens," explains Miller. The User Interface Design Group has always been creative in its nomenclature, with monikers for past programs like Chickenfoot, Froggy, and Potluck.
 
The secret to Sikuli's appeal is how intuitive it is, something that has rarely, if ever, been true of programming before. Scripting in Sikuli means writing what look like function calls, except with screenshots between the parentheses instead of code.
 
This type of interface allows for use by beginners and seasoned programmers alike. A simple example, outlined on the Sikuli website, involves inputting screenshots of the Mac Spotlight symbol and a few simple commands ("click," "type") in order to automate a Spotlight search for a specific phrase. It's something a complete novice could do without assistance. From there, applications can get as complex as the Sikuli user wants, depending on his or her creativity and depth of knowledge.
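
As a rough illustration of the Spotlight example, the sketch below mimics its two steps in plain Python. It is not Sikuli itself: `click` and `type_text` are hypothetical stand-ins that only record what a real script would do (Sikuli's own command is named `type`), and `spotlight.png` is an assumed file name standing in for the screenshot of the Spotlight icon.

```python
# Minimal sketch of a Sikuli-style script, runnable in plain Python.
# The two functions below are stand-ins (assumptions for illustration):
# they log each step instead of controlling the screen.

actions = []  # record of the steps the script would perform


def click(screenshot):
    # Real Sikuli would search the screen for this image and click it.
    actions.append(("click", screenshot))


def type_text(text):
    # Stand-in for Sikuli's `type` command, renamed here so it does not
    # shadow Python's built-in `type`.
    actions.append(("type", text))


def spotlight_search(phrase):
    # Step 1: click the Spotlight icon (hypothetical screenshot file name).
    click("spotlight.png")
    # Step 2: type the phrase; the newline stands for pressing Enter.
    type_text(phrase + "\n")


if __name__ == "__main__":
    spotlight_search("sikuli")
    for step in actions:
        print(step)
```

In Sikuli's own editor, the file name would appear as the screenshot itself, sitting between the parentheses of the `click` call.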
 
Sikuli was born out of Yeh's Ph.D. research at MIT, which looked at new ways to use computer vision in user interfaces. His work involved sending photographs to the World Wide Web in order to glean information about their subject matter from other users.
 
This research became the basis for Sikuli Search, the first iteration of Sikuli. Instead of using pictures from the real world, Sikuli Search used screenshots. For example, a computer user looking to learn the meaning of a particular icon in the Microsoft Word toolbar could take a screenshot of the button and send it to Sikuli Search, which would then visually scan the Web to find a verbal answer to the question.
 
Meanwhile, Chang and other members of the User Interface Design Group were working on new forms of automation in programming. They created a process that automated redundant and time-wasting tasks on the Web, such as searching for headshots on Google. "What our system allowed you to do was to write a script that would do all of those steps for you," says Miller.
 
When Chang began to work with Yeh on perfecting Sikuli Search, he looked for ways to incorporate his work on automation. The result was Sikuli Script, the software agent that may signal the next great leap in programming.
 
"One of the great things about this collaboration between Sean and Tom is that they each brought complementary skills to the table," Miller says. "There was Tom doing the back-end image processing stuff and having the deep knowledge there, and Sean putting a great user interface in front of it."
 
The group has made Sikuli available as a free download, creating a real-world test environment for the software's capabilities and applications. The release has fostered a rich exchange between users and creators. There's even a blog tracking programs written using the software. They range from the fun and frivolous (playing virtual piano, automating tasks in the game Mafia Wars) to the just plain useful (cleaning unwanted files from a system).
 
All of this innovation on the part of users only further drives Miller, Chang, and Yeh to strive for improvement. "We've gotten lots of feedback and scripts contributed by our users that have really impressed us," says Chang. "We feel we have to do more on Sikuli so we can make it better, and let users do what they really want on this automation platform."
 
The recently released Sikuli 0.10 has taken much of this feedback into account to create a faster, smarter piece of software. Among other upgrades, the new version boasts a new, more flexible Application Programming Interface and the ability to script actions that respond to visual changes on the screen.
 
At this point in its development, more involved Sikuli Script use requires some understanding of Python. It's a requirement inexperienced users can currently bypass with the help of a visual recording feature. Miller hopes to eventually build new applications on top of Sikuli's Python-and-screenshot base that would create even greater ease of use.
 
A streamlined, novice-friendly Sikuli could one day put programming into the hands of the average computer user. It would mean a sort of democratization of computing, and would have far-reaching cultural implications. "You can look at it as an augmentation of human capability," Miller observes. "Which is pretty exciting, because we're not really getting much smarter biologically. I think we need to find ways to make ourselves smarter technologically."

April 22, 2010
Jenna Scherer, CSAIL