“HeyGitHub!” — Metamodel Subgraphs and the Evolution of GitHub’s Conversational Copilot

Machine-Learning Code Generation Will Not Replace Developers; It Will Free Us To Work At the Proverbial “Next Level”

Jim Salmons
GitHub Copilot for Disabled Developers

--

Preface Note to GitHub CEO Thomas Dohmke, VP of GitHub Next Oege de Moor, and the GitHub Next Team: As a 71-year-old post-cancer #DigitalHumanities #CitizenScientist and now #DisabledDeveloper thanks to a recent spinal cord injury, I don’t have the luxury to be subtle and patient. So, I will state plainly here that my goal is to be hired as a contract researcher on the Copilot team to work with Rahul Pandita and other Copilot developers; first, to implement the study/support program described by this “GitHub Copilot for Disabled Developers” publication, and second, to pursue the research agenda described in this article.

Preface Note to General Reader: This rather long and somewhat “deep weeds” article is purely optional reading if your interest is focused on the proposed “GitHub Copilot for Disabled Developers” research study and support program. In this article I ruminate on the potential intersection between Copilot as an exemplar of code-generating assistive technology and my Digital Humanities research that predates my spinal cord injury.

November 9th Update: See the note at the end of this article for information about the announcement at the #GitHubUniverse conference about the “Hey, GitHub!” Technology Preview based on Conversational Copilot. 🤓👍

Ten years ago I survived a horrific cancer battle during which too many of my treatment buddies died while I somehow did not. To assuage my perhaps inappropriate guilt, I embarked on a Life trajectory-changing Pay-It-Forward journey to reinvent myself as an independent Digital Humanities Citizen Scientist. My fellow cancer-surviving wife, Timlynn Babitsky, and I set our sights on funding the digitization of the 48-issue run of Softalk magazine into the massive computer magazine collection at the Internet Archive. From 1980–84, Softalk uniquely chronicled the diverse story of the dawn of the microcomputer era and the Digital World we live in today.

Not content to see this publication scanned only to gather virtual dust on the digital shelves of the Archive, Timlynn and I were determined to encourage broad public and scholarly interest in this fascinating magazine's content. To this end, I began pursuing my vision of a Ground-Truth Storage format, MAGAZINEgts, to provide an integrated document structure and content depiction model for digitized serial publications. This XML-based digital storage format is built on a "self-descriptive" metamodel subgraph design pattern.

Our poster accepted for the DATeCH 2017 (Digital Access to Textual Cultural Heritage) conference was the first of our peer-reviewed publications about the #MAGAZINEgts format. This event marked our official “coming out” as independent Digital Humanities Citizen Scientists. 🤓🥳

This Ground-Truth format not only supports integration of a magazine's complex document structures and content depiction models, but also provides a means to encode the archival data and metadata discovery and curation workflows that enrich and enable text- and data-mining access to a MAGAZINEgts-compliant digital collection. My purpose here is not to delve deeper into the workflow-encoding features of the MAGAZINEgts format. Rather, I provide this information as background context for the sections that follow, which explore the exciting potential and implications of GitHub's Conversational Copilot research agenda.

An Early Peek at Conversational Copilot

As readers of this Medium publication or my Twitter feed (@Jim_Salmons) likely know, I have been writing about the need for, and my thoughts on the implementation of, a “GitHub Copilot for #DisabledDevelopers” research study and support program. In response to this thread of communication I was contacted by Rahul Pandita, an #AI/#ML engineer on the Copilot development team at the decentralized and distributed global GitHub Next research lab. The purpose of his contact was to explore potential areas of mutual interest and collaboration related to the evolution of the GitHub Copilot code-generating assistive technology.

A visit to the GitHub Next homepage is your entrance down the "rabbit hole" of experimental projects that embody GitHub's vision for its contribution to the evolution of software design and development concepts and methods.

My first chat with fellow Coloradan Rahul was a Zoom interaction where we had a chance to get to know each other a bit. In preparation for our meeting, I did a deep dive into the GitHub Next website, as I encourage you to do. On the Next team page, each research engineer's headshot and role description leads to a personal profile subpage. A Current Projects button on Rahul's subpage brings up a page that includes a very brief mention of his involvement in the development of Conversational Copilot.

In the course of our Zoom chat I had an exciting opportunity. Under our mutual assurance that I would not disclose the particulars of what I saw, Rahul gave me an interactive screen-sharing demo of Conversational Copilot.

First, let me state that my chat with Rahul did not dig into the underlying design and implementation of what he showed me regarding the current features and design of Conversational Copilot. Rahul was much more interested in gauging my interest and reaction to the potential of this project from my perspective as a dexterity- and mobility-challenged #DisabledDeveloper. While our conversation certainly provided additional nuance to my reaction, I am sure my broad grin and unblinking wide-open eyes told him as much… I had a glimpse of what will soon be the future of developers’ routine human-computer interaction within the emerging and redefined software design and development industry.

It is not so much that Copilot on its own will provide the tipping point that revolutionizes the daily practice of work as a software designer/developer. Rather, Copilot rides a tsunami of new development technologies and design practices that has been building for the last few years, due largely to cheap, powerful hardware and widely available Machine Learning models, together with vast troves of publicly available, curated training datasets. These recent advances build on the myriad tools and best practices that have incrementally moved the state of the art of the software industry ever forward since my first involvement in the industry in the early 1970s.

As one example, let me reflect that in my early days in the industry we would never have envisioned the rich, expressive, and widely available design and idea-communication medium of Jupyter Notebooks. The fluid intermingling of natural language and graphics together with cells of beautifully formatted and "live" source code was nowhere on our imagined horizon of the environments in which we could design and communicate our development ideas and implementations. Coding was largely about writing applications as tools for doing specific tasks. You went to your local bookseller to purchase, then read, books about the latest programming languages and software design and development methods. The Internet, with sites like Stack Overflow, was nowhere in the realm of our experience.

A full survey of the number and diversity of software development tools and design methods is far beyond the scope of this article. That is why I chose to focus my comments on the impact of Jupyter Notebooks as an exemplar of the advances that have been changing the nature of our activities as software designers and developers. Jupyter Notebooks provide an innovative multimodal extension to our ability to use document-style natural language and graphics as part of the creative process of designing and developing software. Based on Rahul's demo of Conversational Copilot, I could clearly see how this project brings that expressive, creative potential to the voice-enabled interactive experience of the developer designing and developing software.

Rahul fired up the Conversational Copilot VSCode-based development environment and carried on a very high-level "conversation" with the Copilot code-generation engine. His interaction was a verbal exchange very much like you would expect to hear between two members of a project development team. Rather than a dictation-like recitation of detailed directives for the code he wanted to produce, his interaction went roughly like this:

“Copilot start a new file and import NumPy, Pandas, and Pillow. Then create a Dataframe of items to hold the parameters of the width and height of a sample of ten rectangles. Add a function to compute the area of each rectangle and add that field to the Dataframe. Next add a function to iterate through the Dataframe and draw a Pillow Image of each rectangle and label each rectangle with its area centered inside the rectangle. Add a main function to bring this all together and run the program.”
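To ground that exchange, here is a minimal sketch, entirely my own assumption rather than actual Copilot output, of the kind of Python program such a session might produce:

```python
# A hypothetical sketch of the program the verbal exchange above might produce.
# All names and details are illustrative assumptions, not actual Copilot output.
import numpy as np
import pandas as pd
from PIL import Image, ImageDraw


def build_rectangles(n=10):
    """Create a DataFrame holding the width and height of n sample rectangles."""
    rng = np.random.default_rng(seed=42)
    return pd.DataFrame({
        "width": rng.integers(40, 200, size=n),
        "height": rng.integers(40, 200, size=n),
    })


def add_area(df):
    """Compute the area of each rectangle and add it as a new column."""
    df["area"] = df["width"] * df["height"]
    return df


def draw_rectangles(df):
    """Draw each rectangle as a Pillow Image, labeling its area at the center."""
    for i, row in df.iterrows():
        w, h = int(row["width"]), int(row["height"])
        img = Image.new("RGB", (w, h), "white")
        draw = ImageDraw.Draw(img)
        draw.rectangle([0, 0, w - 1, h - 1], outline="black")
        label = str(int(row["area"]))
        # Center the label using the text's bounding box.
        left, top, right, bottom = draw.textbbox((0, 0), label)
        draw.text(((w - (right - left)) / 2, (h - (bottom - top)) / 2),
                  label, fill="black")
        img.save(f"rectangle_{i}.png")


def main():
    df = add_area(build_rectangles())
    draw_rectangles(df)
    print(df)


if __name__ == "__main__":
    main()
```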

This is a trivial example, but you can see where Rahul and the Conversational Copilot researchers are heading with this project. Imagine what will be possible once this system and its Copilot model are fully developed and trained to understand the high-level idioms and design patterns of a wide range of software design and development methods.

We are in the midst of a paradigm shift in software development from “Code is King” to “Data is King.” In the code-dominant world, developers spent much of their time crafting applications and transactional pipelines to get work done. Today and even more in the future, our activities will move to data modeling, data preparation, machine learning model configuration and training, and workflow execution. This trend toward more data-centric Machine Learning model selection, configuration, and training is well-suited to the interaction style implemented by Conversational Copilot.

Just as we have seen data scientists embrace Jupyter Notebooks as a creative medium for their work, so too will they and the wider community of software developers welcome the evolution of Conversational Copilot as a means to get this data-centric New Work done as the need for application-crafting tasks diminishes.

How a Metamodel Subgraph Could Take Conversational Copilot to the Next Level

My first look at Conversational Copilot was an eye-opener for sure. But my initial reactions were at the relatively superficial level of seeing this project's capabilities in the context of its current implementation. The behaviors that Rahul demonstrated did indeed move Copilot up to a significantly higher level of expressive productivity in terms of the developer's ability to generate new code by voice. But it took me a few days of rumination to have a Eureka Moment that truly made me excited about the Conversational Copilot project.

Certainly my initial interest in collaborating with Rahul and the GitHub Next Copilot team was focused on my passion to see the "GitHub Copilot for Disabled Developers" research study and support program implemented, and I strongly maintain that interest. I also see the value of Conversational Copilot as a potent addition to the assistive technologies that will enable even more significantly disabled developers to gain new, or regain lost, abilities to create code for work or avocational commitments.

My Eureka Moment, however, came when I realized the potential intersection between Conversational Copilot and my Digital Humanities-inspired research on the development of the #MAGAZINEgts Ground-Truth Storage format. This insight allowed me to see an additional and more broadly applicable level of interest for my collaboration with the GitHub Next team consistent with the lab’s mission to explore the frontiers of tools and methods for software design and development.

A Brief Dive into MAGAZINEgts’ Metamodel Subgraph Design Pattern

In the introductory section of this article I described the MAGAZINEgts format as a "self-descriptive" data model that uses a metamodel subgraph to encapsulate meta-level information about the document structures, content depiction model, and access/manipulation workflows of a serial publication's digitized collection, such as those found at the Internet Archive.
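To make the idea of a "self-descriptive" metamodel subgraph slightly more concrete, here is a deliberately simplified sketch; the element and attribute names below are my own illustrative assumptions, not actual MAGAZINEgts markup:

```python
# A simplified, hypothetical fragment in the spirit of a MAGAZINEgts metamodel
# subgraph, plus the kind of code a curation tool could use to consult it.
import xml.etree.ElementTree as ET

METAMODEL_FRAGMENT = """
<metamodel>
  <document-structures>
    <advertisement-model>
      <issuing-rule name="AllowableSizes">
        <size label="full-page"/>
        <size label="half-page"/>
        <size label="quarter-page"/>
      </issuing-rule>
    </advertisement-model>
  </document-structures>
</metamodel>
"""


def allowable_ad_sizes(xml_text):
    """Read the allowable advertisement sizes declared in the metamodel subgraph."""
    root = ET.fromstring(xml_text)
    rule = root.find(".//issuing-rule[@name='AllowableSizes']")
    return [size.get("label") for size in rule.findall("size")]


if __name__ == "__main__":
    # A curation tool consults the metamodel rather than hard-coding these values.
    print(allowable_ad_sizes(METAMODEL_FRAGMENT))
```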

To date my activities have been largely focused on the document structure and content depiction branches of the proposed subgraph. This focus has been driven by my need to create a reference example of the proposed format based on the Softalk magazine collection at the Archive. The first order of business has been to flesh out the data-specific aspects of the MAGAZINEgts format. The encoding of data and metadata workflows within the subgraph is a more abstract dimension of this proposed model and not the focus of interest of my Digital Humanities and Cultural Preservation research community colleagues. To produce this reference implementation, I have simply hand-crafted the one-off Python applications that allowed me to develop and publish sufficient portions of MAGAZINEgts for peer-reviewed documentation of the format within the Digital Humanities digitization research community.

For example, here is a screenshot collage of the portion of the metamodel subgraph that describes the PRESSoo Issuing Rules for the document structure of the Advertisement model of Softalk magazine:

For additional information about the MAGAZINEgts format see my article "PRESSoo: Unmasking the Insidious Document Structures of Magazines".

Here is an animated GIF of the one-off Python application I developed to discover and curate the Advertising Specifications of the over 7,000 ads in Softalk:

This GIF captures the FactMiners Toolkit using the MAGAZINEgts format to discover and curate the bounding boxes of advertisements in Softalk magazine. The animation focuses on using the toolkit to generate page images and object masks of the ads for use as Ground-Truth examples to train Machine Learning models to recognize magazine ads.

The full vision for the MAGAZINEgts format is to support encoding of data and metadata discovery and curation workflows. With this capability, application-like tools, such as the "Ad Ferret" of the FactMiners Toolkit in the above GIF animation, could be dynamically generated by MAGAZINEgts-compliant frameworks. The current implementation does not include full workflow encoding. It does, however, generate the user-interface widgets and constrain the state-change interactions among them according to the size, shape, and allowable positions for advertisements in the magazine.
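As a rough illustration of what "generating and constraining widgets from the metamodel" means in practice, here is a minimal, framework-agnostic sketch; the rule structure is my own assumed stand-in, not the real FactMiners Toolkit logic:

```python
# A minimal, hypothetical sketch of metamodel-driven widget constraint logic.
# The rule table below stands in for rules read from the metamodel subgraph.

# Allowable Positions issuing rules, keyed by (size, shape).
ISSUING_RULES = {
    ("full-page", "portrait"): ["page"],
    ("half-page", "landscape"): ["top", "bottom"],
    ("half-page", "portrait"): ["left", "right"],
    ("quarter-page", "portrait"): ["upper-left", "upper-right",
                                   "lower-left", "lower-right"],
}


def allowed_shapes(size):
    """Shapes the metamodel permits for the selected size."""
    return sorted({shape for (s, shape) in ISSUING_RULES if s == size})


def allowed_positions(size, shape):
    """Positions the metamodel permits for the selected size and shape."""
    return ISSUING_RULES.get((size, shape), [])


if __name__ == "__main__":
    # A generated UI would enable only these widget states:
    print(allowed_shapes("half-page"))                 # ['landscape', 'portrait']
    print(allowed_positions("half-page", "portrait"))  # ['left', 'right']
```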

My inspiration and confidence that these workflow-driven tools can be dynamically generated comes from an insight I had when I was introduced to the CIDOC-CRM, the international-standard Conceptual Reference Model of the International Council of Museums (ICOM). The CIDOC-CRM is used by museums, archives, libraries, and their scholarly research communities to describe and maintain the artifact metadata within their collections. What is unusual, and perhaps somewhat unexpected, is that this ontology includes a branch of Temporal Entities to describe the metadata curation activities used to create and maintain these artifact collections.

Among ontologists in the CIDOC-CRM community, the focus of use is on the Persistent Items branch of the class/entity hierarchy. Temporal Entities are most often used simply to capture time-date and time-span fields in artifact-descriptive metadata.
This Class/Entity hierarchy diagram shows the clear division between the Persistent Item and Temporal Entity branches of the CIDOC-CRM Conceptual Reference Model.

When I was first exposed to the CIDOC-CRM, I was immediately reminded of my prior research and development as a customer-engagement-based Executive Consultant in the Object Technology Practice at IBM in the 1990s. There I was a thought leader and developer on a "skunkworks" team building a set of Smalltalk-based frameworks that supported an innovative vision for executable business models (EBMs), inspired by David Gelernter's remarkable 1993 book, "Mirror Worlds: or the Day Software Puts the Universe in a Shoebox…How It Will Happen and What It Will Mean."

Our EBM frameworks were designed as a collection of objects to implement a metamodel-based “construction set” used to build role-based business process models that “auto-magically” generated and dynamically updated the “applications” needed to run these processes. Change the business model and the applications would instantly reconfigure themselves consistent with the new instance of the business model.

When I saw the class/entity hierarchy of the CIDOC-CRM, I could clearly see how this ontology’s classes could be mapped to an EBM-style metamodel framework. In this UML Class model, I have identified key classes in the CIDOC-CRM that fit into such an executable process model:

When I first started my post-cancer reinvention as a Digital Humanities Citizen Scientist in 2014, Machine-Learning technologies had not yet "invaded" the digitization research community. My original vision was to create the MAGAZINEgts format to support web-based crowdsourced social games.

With this skeletal overview of the design and use of a self-descriptive metamodel subgraph design pattern, we can explore the first level of my Eureka Moment that envisions the complementary relationship between Conversational Copilot and my MAGAZINEgts research.

How a Metamodel Subgraph Can Help Conversational Copilot Become a Dialogue Rather Than a Monologue

I want to finish this article by considering how the developer’s user interaction with GitHub Copilot will evolve and how my metamodel subgraph ideas might contribute to that evolution.

At present the interaction model between developer and Copilot is essentially a series of "this or next" exchanges. The developer begins the exchange by entering a bit of code or a descriptive comment, and Copilot responds with a best-guess suggestion as to the code the developer may intend to write in the current context. If that initial suggestion is not accepted, the developer may reject it and be shown an alternate code-block implementation. With Conversational Copilot this interaction is made more productive through a higher level of natural-language interaction with the Copilot code-generation engine.
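For readers who have not used Copilot, that exchange typically looks something like this; the comment prompt and the suggested completion below are my own illustrative example, not captured Copilot output:

```python
# Developer types a comment or a function signature as the prompt...
def moving_average(values, window):
    """Return the simple moving averages of a list of numbers."""
    # ...and Copilot proposes a completion like the following, which the
    # developer can accept, or reject to be shown an alternate suggestion.
    return [
        sum(values[i:i + window]) / window
        for i in range(len(values) - window + 1)
    ]
```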

As remarkable as Copilot has already become, this incredible technology is still in its infancy. I have no doubt that its capabilities will improve by leaps and bounds in the years ahead. The clearest indication that Copilot is making enormous progress will be when its interaction model becomes a much more incremental, exploratory "call and response" exchange. I believe the current Conversational Copilot project is a significant step in this direction, to where the Copilot use-case protocol becomes much more of a dialogue than the current "Wait… let me guess" suggestion model.

To make this qualitative leap forward in the development of code-generative assistive technology, we need to begin exposing Copilot to a richer "learning environment" rather than having it simply learn basic code patterns from a massive trove of commented source code. In an early article about my vision for a metamodel subgraph-enhanced Ground-Truth storage format, I used a metaphorical allusion asserting that we Digital Humanities researchers needed to add to our toolkit digitization tools that "think" more like Sherlock Holmes than like a savant-like Rain Man.

That is, we need a way to share our contextual or prior knowledge about a domain of interest rather than rely on the most basic bits of information that can be gleaned by brute-force pattern recognition. For example, when we read a modern print-era commercial magazine, we know that the first few pages will include document structures such as the table of contents, the masthead, a letter from the editor, recurring editorial columns, and of course, some advertisements. Moving further into the magazine we find the feature articles and more ads, then finish with article continuations, less important recurring editorial content and… more ads.

Besides these document structures, we know that a magazine will have a thematic or topical editorial focus. Together, this prior knowledge of magazine document structure and content depiction means that we instantly know whether we've opened a book, a newspaper, or a magazine. Knowing this, we bring different expectations and content-consumption skills to our reading experience.

To bring this level of "worldly knowledge" to the learning experience of Copilot-like machine-learning models, we will need to create a number of human-curated reference implementations of, for example, the MAGAZINEgts Ground-Truth storage format for a range of typical magazines, such as those already in the digital collections of the Internet Archive. Once we have a training curriculum of such metamodel-enhanced learning materials, Copilot-type ML models can begin to develop a "Sherlockean" level of knowledge about an information source, built on their Rain Man-refined base understanding.

What Would a Future Conversational Copilot Dialogue Be Like?

I'll conclude this article by imagining what a Conversational Copilot conversation might sound like as we envision the future of software design and development as aspired to by the researchers of the GitHub Next innovation lab.

This hypothetical conversation is a bit long and assumes an obviously high level of mutually shared knowledge between Copilot and the Developer. While this hypothetical conversation reflects the verbal interaction between this Person-Machine Programming Pair, I do not explicitly suggest what the visual exchange will be within this multimodal conversation. Suffice it to say, however, that I do not believe line-by-line source-code display and editing will be a central focus of this collaborative interaction.

Developer: YoCo [Ed: Assume the Developer now has Copilot's attention and does not need a prefacing, attention-grabbing command.]

Copilot: Yes, you have my attention until explicitly released.

Developer: Check out the Softalk Apple collection from the Internet Archive and save the full collection locally.

Copilot: Okay…done. I have saved 48 subdirectories, each with a single PDF document, a single scan-data XML file, and from 32 to 408 JPG image files.

Developer: Good. These subdirectories contain documents I will refer to as magazine issues.

Copilot: Okay.

Developer: Attach a metamodel subgraph to this collection using the MAGAZINEgts template.

Copilot: Done.

Developer: Using the metamodel’s Map Page Numbers task from the Workflow branch, build a mapping between the JPG image file IDs and the issue’s print page numbers.

Copilot: Done.

Developer: Using the metamodel’s Document Structures branch, scan the issues and build a dictionary of the found types.

Copilot: Done.

Developer: Did you find Advertiser Index structures?

Copilot: Yes, each issue has an Advertiser Index.

Developer: Good. Using the metamodel’s Workflows, do the Build Advertising Specifications task.

Copilot: Done.

Developer: Is there a Discover and Curate Advertisements task in the metamodel’s Workflows?

Copilot: No such Workflow task was found.

Developer: Okay. Create a new Toolkit Python program called Ad Ferret using the default import libraries specification in the Workflow Toolkit template.

Copilot: Done.

Developer: Create a view and edit interface based on the items in the Advertisement Specifications branch of the Document Structures branch of the metamodel.

Copilot: Okay, to iterate over these Advertisements, I will create a Group in the Widget Panel supporting selection of the Advertisements grouped by Advertiser using a Drop-down Listbox with Previous and Next buttons to iterate through each Advertiser’s Advertisements ordered by Issue date and page number.

Developer: Okay.

Copilot: According to the Advertisement Specification document structure, I can capture the Size, Shape, and Position of each Advertisement. What interface widgets would you like to use for these parameters?

Developer: Use Radio Buttons for the Advertisement Size and Shape, and a Drop-down Combobox for the Position, then show me the Toolkit window.

Copilot: Okay. [Copilot correctly infers the widget contents for the Issue selection Combobox and the Previous and Next buttons move between Advertisements mapped by the prior indexing of this document structure. The Group of Size, Shape, and Position widgets display their default state.]

Developer: Good. Constrain the Size, Shape, and Position widgets based on the Allowable Positions PRESSoo Issuing Rules in the metamodel Document Structures branch. And add a Group to the Widget Panel with ‘Save and Next’ and ‘Save and Previous’ buttons. Then show me this tool.

Copilot: Okay. [Copilot shows the Ad Ferret window with properly constrained Size, Shape, and Position widgets. The ‘Save and Next’ and ‘Save and Previous’ buttons do nothing.]

Developer: When pressed, the ‘Save and Next’ and ‘Save and Previous’ buttons save the current Advertisement Specification to the collection’s Document Structures Data branch and iterate to the respective next or previous Advertisement in the Advertisement Specifications index. When done, show me the updated Ad Ferret window.

Copilot: Done. [Copilot shows the Toolkit window. Developer confirms the properly constrained and persistent data discovery and curation behavior of the Ad Ferret window.]

Developer: Excellent. Now, above the ‘Save and Next’ and ‘Save and Previous’ button Group, add a Group called Products Mentioned containing a Drop-down Combobox to enter and maintain a list of items mentioned in the currently displayed Advertisement. Include Add and Delete buttons in this Group to allow editing of the Products Mentioned list. Persist the Products Mentioned list in the Content Depiction Data branch of the metamodel subgraph grouped by Advertiser. Then show me the updated Ad Ferret window.

Copilot: Okay. [And the updated Ad Ferret window is displayed.]

Developer: Excellent. YoCo, version this to GitHub and end our interactive session.

Copilot: Yes, done. Zzzzzz…
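To make the outcome of this imagined session a bit more tangible, here is a purely illustrative sketch of the kind of metamodel-constrained "Ad Ferret" skeleton such a dialogue might produce. The tkinter UI and the hard-coded stand-in for the Allowable Positions issuing rules are my own assumptions, not anything Copilot has generated:

```python
# An illustrative "Ad Ferret" skeleton with metamodel-constrained widgets.
# The rule table stands in for the metamodel's Allowable Positions issuing rules.
import tkinter as tk
from tkinter import ttk

ALLOWABLE_POSITIONS = {
    "full-page": ["page"],
    "half-page": ["top", "bottom", "left", "right"],
    "quarter-page": ["upper-left", "upper-right", "lower-left", "lower-right"],
}


class AdFerret(tk.Tk):
    def __init__(self):
        super().__init__()
        self.title("Ad Ferret (illustrative sketch)")

        self.size = tk.StringVar(value="full-page")
        self.position = tk.StringVar()

        # Radio Buttons for Advertisement Size, driven by the rule table.
        for size in ALLOWABLE_POSITIONS:
            tk.Radiobutton(self, text=size, value=size, variable=self.size,
                           command=self.constrain_positions).pack(anchor="w")

        # Drop-down Combobox for Position; its choices depend on the Size.
        self.position_box = ttk.Combobox(self, textvariable=self.position,
                                         state="readonly")
        self.position_box.pack(fill="x", padx=8, pady=4)

        tk.Button(self, text="Save and Next", command=self.save_and_next).pack()
        self.constrain_positions()

    def constrain_positions(self):
        """Limit the Position choices to those the issuing rules allow."""
        allowed = ALLOWABLE_POSITIONS[self.size.get()]
        self.position_box["values"] = allowed
        self.position.set(allowed[0])

    def save_and_next(self):
        # In the real tool this would persist the specification to the
        # collection's Document Structures Data branch and advance the index.
        print("Saved:", self.size.get(), self.position.get())


if __name__ == "__main__":
    AdFerret().mainloop()
```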

While the above Conversational Copilot interaction may seem like a whimsical pipe dream today, I believe that, given the explosive growth of Transformer-based multimodal Machine Learning models such as Copilot, software design and development will be radically transformed into something similar to the conversation above.

What will be needed to get us to this Future Nirvana of Conversational Copilot?

The answer to this provocative question, I believe, will be two-pronged. Through the creative and diligent work of core GitHub Next Copilot developers, including Rahul Pandita, the interactive conversations supported by Copilot will become increasingly high-level and productive, proceeding in larger, multistep increments. But to achieve the proposed "Sherlockean" intelligence of our hypothetical conversation above, we will need to create a small but representative collection of Metamodel Subgraph Reference Ground-Truth Datasets on which to train the Copilot model.

Just as Current Copilot has achieved near-prescient levels of performance in generating source code based on exposure to a vast trove of commented Open Source code repositories, so too will Future Copilot learn from exposure to, and refinement by interacting with, these Metamodel Subgraph Reference Datasets. Extrapolating from the explosive creativity of LLM transformer models, we can fairly assume that Copilot-type code-generation models will eventually be able to self-generate and refine the metamodel subgraphs that prescribe the relationship between information structures and those structures' content depiction.

While GitHub's Open Source software repositories are sufficient for training Copilot's first-order code-generation capabilities, they are not equally useful as a source of the training materials needed to incorporate deeper contextual knowledge such as that contained in Metamodel Subgraph Ground-Truth Reference Datasets. While a metamodel-aware Future Copilot will be of tremendous value to the world of Business, the proprietary nature of that domain will constrain the availability of material for creating the needed reference datasets for model training. Fortunately, the Digital Humanities domain within the Cultural Heritage preservation community is an abundant and fertile source of such data and of the expertise to craft and validate these reference-dataset metamodels.

I welcome comments and questions in-line here and via interaction on Twitter @Jim_Salmons.

Happy Healthy Vibes from Colorado,
Jim Salmons

Update November 9th — GitHub Announces #HeyGitHub Preview at #GitHubUniverse Based on Conversational Copilot!

It was exciting to see the announcement of public access, via a Technology Preview, to "Hey, GitHub!", a dramatic advance in GitHub developer support based on the Conversational Copilot project. The announcement came during the kick-off Keynote presentation at #GitHubUniverse, the major annual two-day GitHub conference held in San Francisco and virtually online. Find out more and sign up for the waitlist here:

Jim Salmons is a seventy-one-year-old post-cancer Digital Humanities Citizen Scientist. His primary research is focused on the development of a Ground-Truth Storage format providing an integrated complex document structure and content depiction model for the study of digitized collections of print-era magazines and newspapers. A July 2020 fall at home resulted in a severe spinal cord injury that has dramatically compromised his manual dexterity and mobility.

Jim was fortunate to be provided access to the GitHub Copilot Technology Early Access Community during his initial efforts to get back to work on the Python-based tool development activities of his primary research interest. Upon experiencing the dramatic positive impact of GitHub Copilot on his own development productivity, he became passionately interested in designing a research and support program to investigate and document the use of this innovative programming assistive technology for use by disabled developers.
