PRESSoo: Unmasking the Insidious Document Structures of Magazines

Published in FactMiners’ Musings, Mar 13, 2021

Timlynn Babitsky and I are working to develop the **MAGAZINEgts ground truth storage** format for serial publications along with a reference implementation of this format for the 48 issues of Softalk magazine, which chronicled the dawn of the microcomputer and digital revolution between 1980–84.

I am an independent, untrained Citizen Scientist on a mission to unmask and wrestle into submission the insidious nature of the document structures of print era commercial magazines. I started down this circuitous road as a Pay-It-Forward commitment to atone for my inexplicable survival of a catastrophic battle with cancer nine years ago. While I have rounded the bend on this road of my rebirth as Citizen Scientist, I remain on this journey despite the pandemic and my recent debilitating spinal cord injury from a fall as the result of one of our cats’ tails being in the wrong place at the wrong time, that place being under my foot and the time being before my morning coffee.

To explain the topic that I want to explore in this article, I will deconstruct my opening sentence…

I am an independent, untrained Citizen Scientist on a mission to unmask and wrestle into submission the insidious nature of the document structures of print era commercial magazines.

To do justice to my purpose in this article, I bring your attention to four key phrases in that sentence.

A Citizen Scientist Looks at Magazines

Let’s start with the first phrase and get it behind us so we can focus on the proverbial meat of my editorial intent. My wife Timlynn Babitsky and I distinguish between the terms Citizen Science and Citizen Scientist. The former refers to the movement to involve the general public as non-principal participants in large-scale scientific research experiments or studies. The designs of these experiments are often described by the term crowdsourcing. The latter term, Citizen Scientist, refers to an individual role: the non-traditionally trained, independent researcher who aspires to join and contribute as a peer in a domain-specific scientific community of interest. To more fully understand this distinction in terms, I encourage you to read my article, A Roadmap for #CitizenScientist Participation in the Time Machine Organization (#TMO).

With that first phrase behind us, let’s jump to the last: print era commercial magazines. My core editorial mission is to explain the nature of these pre-Internet communication monsters and how we can reveal their insidious nature via PRESSoo. PRESSoo, as I will explain in more detail later in this article, is the international standard ontology used by archivists and Digital Humanities researchers to precisely describe and document the complex designs of the seemingly benign reading material many of you have peacefully lying in wait on your coffee table or the back of the commode. There they lie, waiting to draw you into their lairs of mischievous reading orders.

Soon after my miraculous cancer battle survival, my wife Timlynn had similar good fortune. Wanting to Pay-It-Forward for our opportunity to enter the Bonus Rounds of our life journeys, we initially chose a rather modest goal. We would complete my collection of the 48 issues of Softalk magazine and fund their digitization onto the virtual shelves of the Internet Archive in celebration of our having made it to our 25th wedding anniversary.

Why Softalk? Published between 1980 and 1984, Softalk magazine uniquely covered the breadth and depth of the early days of the microcomputer and digital revolutions that shape our lives today. Not a distant past of historic research fascination, this is quite literally our story, the era when Timlynn and I were young adults finding our places in a world that was rapidly changing before our very eyes.

During this time Timlynn was globetrotting, first as a Department of Defense teacher in Pacific Rim military bases and later teaching English as a second language to businesspeople in Japan. I was Stateside living a Forrest Gump-like life in the emerging microcomputer industry. My experience was intimately entwined with Softalk magazine, first as reader and advertiser having been a co-founder of one of the first Apple Computer software companies. Later, I was a writer and eventually a management executive for Softalk Publishing. In my management role, I designed and helped develop the microcomputer software that ran the back office advertising and production business processes of the magazine. That experience is the foundation on which I am building our research to document and illuminate the structure and content of print era commercial magazines.

These are some of the design documents that my buddy Dave “Bear” Fitzgerald and I created while developing the Apple software that ran the back office and production workflows of Softalk magazine. You’ll notice that the pages on the left are handwritten source code. Back in the late 1970s and early 80s, it was easier to write and view your large programs on paper laid end to end than to do this work on computer screens that showed only 40 characters by 25 lines of code. Once you felt good about your code, you transcribed it into your computer to see if it worked. :-/

Unmasking the Insidious Beast of Magazine Document Structures

Most young folks today only have a glancing familiarity with print commercial magazines. The Born Digital among us missed the arcane pleasures of subscribing to and regularly renewing print magazine subscriptions. Weaned on the instant gratification of the 24/7 news and publishing cycles, it is hard for them to imagine a time when our daily sojourn to the mailbox was one of giddy anticipation for delivery of our favorite serial publications.

Not only have the Born Digital among us missed the subscription and delivery processes of print magazines, they are increasingly unfamiliar with the devilish reading experience of devouring the latest issue of our print era news and information sources. The young are much more familiar with interpreting and navigating the structure of a website than they are experienced in the cognitive challenge of devouring a print magazine. So let’s take a closer look at the typical structure of a print magazine before diving deeper into how this structure is used to maximize a reader’s exposure to the advertisements within. Ads, after all, were the means to subsidize print magazine content creation, production, and delivery.

The Three Part Basic Magazine Structure

The colloquial saying that “Good things come in threes” gives us a hint of what to expect in most communication exchanges: a beginning, middle, and end. We see these structures in storytelling and musical forms as well as in the structure of our written communication material. Sandwiched between its covers, the basic book contains front matter, content, and back matter.

Front Matter
Content (depending on type of document)
Back Matter

The Front Matter typically includes title page, table of contents, and preface. The content of most books is then sliced into a linear structure of chapters with headings if needed. Back Matter may include an index, footnotes, colophon, and epilogue, etc. Regardless of the exact number of these various piece parts — with the exception of footnotes referenced from inline citation in the main content — the basic book is cognitively consumed in a straightforward manner.

Print era commercial magazines are intentionally designed to thwart this natural cognitive process that wants to consume the rhythm of start, elaboration, and end of a communication message. Without the ability to interject a tolerable level of reading order chaos into a commercial print magazine, the business model of advertising subsidies would collapse and readers would be required to shoulder the full cost of creating, producing, and delivering their favorite reading materials.

Piece-Parts within the Basic Three-Part Magazine Structure

Front Matter
— Cover 1 & 2
— Table of Contents
— Masthead
— Publisher/Editor Welcome-Intro
— Letters to the Editor
— Advertisements
— Recurring columns
Feature Well
— Feature articles
— Advertisements
— continuations from Front Matter
— continuations from Back Matter
Back Matter
— Recurring columns
— continuations from Front Matter
— Advertisements
— Classified Ads
— Advertiser Index
— Cover 3–4

In addition there are a number of universally available sub-elements that can optionally enhance the appearance and multi-modal communication messages of the primary structural elements:

— Images, including photographs (with optional captions)
— Figures & Tables (with optional captions)
— Typographical embellishments (block quotes, drop caps, etc.)
— Graphical embellishments (column & content separators, etc.)

Just listing the many different types of editorial and advertising content elements that may be found in print era commercial magazines suggests the challenges facing Digital Humanities researchers working on digitization technologies to reveal the structure and content of these serial publications. But element type is just the tip of this text- and data-mining challenge. A magazine’s piece-part composition is subject to the editorial and art director’s creative expression that can be reflected in non-binding common practice or guidelines that interlink these structural elements. And once these structures are revealed, context must be inferred to give the most value to the text and data mined from these serial publications.

Magazine Design Common Practice

When digitizing a magazine, for example, we know that the first two pages are the front and inside cover, commonly known as Cover1 and Cover2. And the last two pages — regardless of how many intervening pages for a particular issue — are the inside back cover and back cover commonly known as Cover3 and Cover4. While the front cover is unique and critical to the design and editorial profile of the magazine, Covers2–4 are typically sold as full-page advertising space. A digitization pipeline using a whole-issue rather than within-page perspective would take this common practice into account.
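As a minimal sketch of this common practice, here is how a whole-issue pipeline might label the four cover scans purely from their position in the issue. The function name `cover_label` and the 1-based `page_index` convention are my own assumptions for illustration; they are not part of the MAGAZINEgts format.

```python
from typing import Optional


def cover_label(page_index: int, page_count: int) -> Optional[str]:
    """Return 'Cover1'..'Cover4' for the four cover scans, else None.

    page_index is the 1-based position of the scan within the issue
    (an assumed convention for this sketch).
    """
    if page_count < 4:
        raise ValueError("an issue needs at least four cover pages")
    if not 1 <= page_index <= page_count:
        raise ValueError("page_index is outside the issue")
    if page_index <= 2:
        # The first two scans are the front and inside front cover.
        return f"Cover{page_index}"
    if page_index > page_count - 2:
        # The last two scans are the inside back and back cover.
        return f"Cover{page_index - page_count + 4}"
    return None


print(cover_label(1, 100))    # Cover1
print(cover_label(99, 100))   # Cover3
print(cover_label(50, 100))   # None
```

Note that the answer depends only on the page count the scanning technician supplies, not on anything seen within the page image itself.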

While spotting the four cover pages of a magazine is quite obvious, there are many other design and editorial related common practices that can be taken into account when digitizing an issue of a magazine. For example, we can safely assume that we will find the table of contents, or TOC, within the first few pages of an issue of a magazine. Other document structures we can anticipate finding in the Front Matter of the issue include the masthead, a welcoming or introductory message or letter from the Editor or Publisher, letters to the editor, and recurring themed columns — in magazine design practice, columns meaning relatively short editorial articles and not the number of textual columns of the page layout.

As we delve further into the pages of the magazine — an area roughly in the middle third of the issue — we enter the Feature Well where we find a small number of longer and more graphically enhanced articles that are, as their name suggests, the editorial features of this issue of the magazine. Often the most important of these feature articles will be presented in a “two-page spread” where the left and right hand pages are visually linked through dramatic typography or images.

Moving further into the roughly back third of the magazine — the Back Matter section — we find another set of shorter recurring columns or individual non-feature articles. In most commercial magazines where advertisements are crucial to the magazine’s business model but not necessarily of interest to the reader, we may find an Index of Advertisers. In many magazines, the Index of Advertisers is there as a convenience for advertisers who want to check that their ads appear as intended.

In certain industry-driven reader communities — for example branded microcomputer publications such as our beloved Softalk magazine — the Advertiser Index may be found in the Front Matter. In this case, readers are as likely interested in finding and consuming the latest ads from their vendors of choice as they are interested in the TOC’s listing of locations of editorial content.
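The “rough thirds” described in this section can be turned into a simple positional prior. The sketch below is only a first guess that a real pipeline would refine with layout evidence; the function name, the section labels, and the hard one-third cutoffs are assumptions I am making for illustration.

```python
def expected_section(page_index: int, page_count: int) -> str:
    """Guess a page's section from its relative position in the issue.

    Uses integer arithmetic so the one-third boundaries are exact.
    page_index is 1-based (an assumed convention for this sketch).
    """
    if not 1 <= page_index <= page_count:
        raise ValueError("page_index is outside the issue")
    if 3 * page_index <= page_count:
        return "FrontMatter"
    if 3 * page_index <= 2 * page_count:
        return "FeatureWell"
    return "BackMatter"


for page in (10, 45, 80):
    print(page, expected_section(page, 90))
# 10 FrontMatter
# 45 FeatureWell
# 80 BackMatter
```

A prior like this tells the pipeline what to look for first, such as a table of contents early in the issue or an advertiser index late in it, without ruling out exceptions like Softalk’s front-of-book Advertiser Index.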

The Unruly Beast, Advertisements

And this brings us to the funding enabling beast of commercial magazine document structures — advertisements. Ads are the only content within a print era commercial magazine that is truly independent content. Advertisers have to live within the constraints of the publication’s acceptable content guidelines, but within these bounds anything goes. A publisher can accept or reject an ad, but they do not exercise fine-grained editorial control over an ad’s content.

Advertisements are interesting to data-mining researchers deconstructing magazines for two very good reasons. First, there is the content and medium itself. Ads are arguably the most longstanding form of meme-creative space in our human communication history. From the welcoming and often pictographic signs hanging above pub doors to the modern ubiquity of magazine and today’s website advertisements, cogent multi-modal messaging is fertile ground for saying as much as possible as quickly and efficiently as possible. For this reason advertisements are of great interest to historians, cognitive scientists, and social psychologists.

For the data-mining Digital Humanities researcher, advertisements hold an additional interest. While the publisher and editors of a magazine cannot exercise control over ad content, they do have control over the size, shape, and within-page placement of the ads. And that control is the basis for why ads hold additional interest to data-mining researchers.

The size, shape, and placement of ads provide useful hints about the overall layout of a magazine page. Ads are like the filled-in numbers in a Sudoku puzzle. A quarter-page vertical ad will always appear on a two-column page layout grid and not a three-column layout. A half-page horizontal ad, however, lends itself to inclusion on grids of any number of columns for the other half of the page. And while layout hints are a useful feature of interest to data-miners, an additional feature is based on the set-theoretic nature of the magazine’s content. That is, once we have spotted all the ads in an issue, we know the rest of the page space is editorial and administrative content.
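The set-theoretic point can be sketched in a few lines: subtract the ad space found on each page and what remains is editorial and administrative content. The `(page_number, fraction_of_page)` input shape and the function name are assumptions for this example, not how MAGAZINEgts stores ads.

```python
from fractions import Fraction
from typing import Dict, Iterable, Tuple


def editorial_share(
    page_count: int, ads: Iterable[Tuple[int, Fraction]]
) -> Dict[int, Fraction]:
    """Return the fraction of each page left over once ads are removed."""
    remaining = {page: Fraction(1) for page in range(1, page_count + 1)}
    for page, share in ads:
        if page not in remaining:
            raise ValueError(f"page {page} is outside the issue")
        remaining[page] -= share
        if remaining[page] < 0:
            raise ValueError(f"ads on page {page} exceed a full page")
    return remaining


# Hypothetical issue: a full-page ad on page 2, two third-page ads on page 5.
ads = [(2, Fraction(1)), (5, Fraction(1, 3)), (5, Fraction(1, 3))]
shares = editorial_share(6, ads)
print(shares[2], shares[5], shares[6])  # 0 1/3 1
```

Exact fractions are used rather than floats so that three third-page ads add up to exactly one page.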

The Hide-and-Seek Game of Article Continuations

The Born-Digital Generation has rarely faced the information consumption beast of “continued on page X”, the ubiquitous phrase sprinkled throughout print era commercial magazines. Today’s digital medium readers are more often served a reading experience that looks like a miniature version of the dystopian cityscape of a Blade Runner film scene, replete with moving images and eye-catching ads that transform from one to the next alongside the editorial content of primary interest. Print era commercial magazines, however, depend on the reader-powered dynamic of physical page turning to ensure maximum exposure of readers to the advertisements that subsidize creation, production, and delivery of the issue in their hands.

While there are no hard and fast rules for the placement of page continuations, they do tend to respect some basic guidelines. Columns and articles in the Front Matter of an issue will most often be continued in the Feature Well or Back Matter. Features are most often woven into a series of consecutive pages in the Feature Well with occasional leaps to the Back Matter. Back Matter content is likewise constrained to consecutive pages interspersed in the Back Matter with occasional dips forward into the Feature Well.

The intent of all these content continuations is advertisement exposure. Given our cognitive information seeking skills, we go looking for page numbers in the upper margin of pages as we traverse these continuation breadcrumbs. This is why ads on the right hand page, especially smaller ads located in the outer margin of those pages, cost more to reserve than the same size ad on the left hand page or in the inside column of the right hand page.

While page continuations have a functional purpose relative to the magazine’s business model, they present nontrivial challenges to Digital Humanities researchers seeking to data-mine historic magazine content. To date, the state of the art in layout recognition remains focused on within-page contexts. Projects such as Ben Lee’s Newspaper Navigator, Zejiang Shen’s Python-based Layout Parser, and the monumental work of the multi-pronged OCR-D project are representative of the progress being made in this within-page layout recognition context.
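To make the continuation challenge concrete, here is a minimal sketch of following “continued on page X” breadcrumbs across an issue. It assumes, optimistically, that OCR text has already been separated per article so that each page entry holds only the article being followed. The regular expression and function names are my own illustration, not part of any of the projects named above.

```python
import re
from typing import Dict, List

# Matches phrases such as "Continued on page 87" in OCR text.
CONTINUED_ON = re.compile(r"continued\s+on\s+page\s+(\d+)", re.IGNORECASE)


def continuation_targets(page_text: str) -> List[int]:
    """Return every page number named in a 'continued on page N' phrase."""
    return [int(number) for number in CONTINUED_ON.findall(page_text)]


def reading_order(start_page: int, article_text_by_page: Dict[int, str]) -> List[int]:
    """Follow continuation phrases from start_page until none remain."""
    order = [start_page]
    current = start_page
    while True:
        targets = continuation_targets(article_text_by_page.get(current, ""))
        unvisited = [page for page in targets if page not in order]
        if not unvisited:
            return order
        current = unvisited[0]
        order.append(current)


pages = {
    12: "The story begins here. Continued on page 87",
    87: "More of the story. (continued on page 90)",
    90: "The story ends here.",
}
print(reading_order(12, pages))  # [12, 87, 90]
```

The hard part in practice is the assumption baked into `pages`: knowing which text on page 87 belongs to the article that started on page 12 is exactly what a within-page perspective cannot tell us.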

The Road Ahead: Raising the Bar to Whole-Issue Digitization

Once we take stock of the complex document structures of magazines, it becomes clear why raising the bar of layout recognition will benefit from moving our digitization pipelines from a within-page to whole-issue perspective. Clustered content types based on relative position within the issue, and common practices for ad placement and article continuations, etc. — all these telltale features of modern print era commercial magazines lend themselves to expansion of layout recognition to a whole-issue focus.

Using a whole-issue digitization pipeline for magazines, the scanning technician will indicate that the source document is a magazine of a specific number of pages. Once scanning begins, we’ll know that the first two scans are Cover1 and Cover2, and the last two are Cover3 and Cover4. We’ll be on the lookout for the Table of Contents, masthead, and typical Front Matter document structures as the initial pages of the magazine are scanned. When we spot the first layout that is typical for a major feature article, we’ll know we’re entering the Feature Well. Page continuation hints will be noted and metadata properties entered to enable content depiction modeling that will unravel the provenance and context of editorial content for data-mining purposes.

The more we run magazines through a whole-issue digitization pipeline, the more we will be building up massive datasets of the variations observable for these document structures. These increasingly rich structure-specific datasets will then be useful for training new and improved Machine Learning models used in whole-issue digitization pipelines.

PRESSoo Issuing Rules: Taming the Whole-Issue Digitization Beast

Fortunately, we already have internationally accepted standard ontologies for crafting metadata modeling standards for data-mineable magazine digital collections. To provide the integrated complex document structure and content depiction models needed to fully describe a magazine, the MAGAZINEgts ground truth storage format that Timlynn and I are developing is built on a foundation of the CIDOC-CRM. This Conceptual Reference Model is the international standard ontology used by museums, libraries, and archives to create interoperable metadata systems for documenting their physical and digital collections.

To further tailor MAGAZINEgts to magazine description, we refine the CRM with FRBRoo, the object oriented edition of the Functional Requirements for Bibliographic Records. FRBRoo is designed to facilitate metadata information exchange between libraries and museums. To drill even further into the unique character of magazines, we add PRESSoo into the ontological stack of MAGAZINEgts. PRESSoo is the international standard ontology for describing serial publications such as newspapers and magazines.

This CIDOC-CRM/FRBRoo/PRESSoo ontological stack takes care of the “magazine as whole issue” perspective of MAGAZINEgts. To drill down further we complete our ontological stack with PAGE, the Page Analysis and Ground-Truth Elements format framework from the brilliant minds of the PRImA Lab at the University of Salford. PAGE is the widely used schema for ground truth storage of detailed page layout at the within-page level.

This four-layer ontological stack is the basis of the MAGAZINEgts format. An essential feature of our ground truth storage format is use of PRESSoo’s Issuing Rule and Issuing Rule Change class entities to provide a fine-grained means to capture and refine the complex document structures used by magazine designers.

PRESSoo is the formal ontology developed by the IFLA, the International Federation of Library Associations, as a descendant of the CIDOC-CRM and FRBRoo for expressing the metadata related to the unique character of serial publications. Within this ontology the Z12 PRESSoo Issuing Rule class is defined as:

Z12 Issuing Rule. Subclass of: Design or Procedure. This class comprises plans that specify bits of the issuing policy followed at some point in time for instances of F18 Serial Work. The notion of issuing policy may include: regularity, frequency, sequencing pattern, the language of the linguistic objects contained in each issue, dimension of each issue, the font used to print each issue, the layout and editorial rules adhered to in each issue, etc.

PRESSoo Issuing Rules are most often used by librarians and archivists for administrative purposes in descriptive metadata relative to handling their physical and digital serial publication collections. In this context Issuing Rules are most useful at the coarse-grained level to specify the serial publication’s name, issue naming/numbering convention, language used in editorial content, etc.

The class Z5 Issuing Rule Change is used to describe the transformations that a serial publication may go through in the course of its publication lifecycle. Name changes, issue numbering convention changes — for example a publishing frequency change from monthly to quarterly that necessitates a title and issue number change — are frequent uses for managing serial publication collections.
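As a rough mental model only, the relationship between an Issuing Rule and an Issuing Rule Change can be sketched with two small Python classes. The field names `aspect`, `value`, `replaced`, and `replaced_with` are hypothetical simplifications of my own; they are not PRESSoo’s actual property names.

```python
from dataclasses import dataclass
from typing import Dict, Iterable


@dataclass(frozen=True)
class IssuingRule:
    """One bit of issuing policy, e.g. the publishing frequency."""
    aspect: str  # hypothetical field, e.g. "frequency"
    value: str   # hypothetical field, e.g. "monthly"


@dataclass(frozen=True)
class IssuingRuleChange:
    """A transformation that swaps one rule for another."""
    replaced: IssuingRule
    replaced_with: IssuingRule


def rules_in_effect(
    initial: Iterable[IssuingRule], changes: Iterable[IssuingRuleChange]
) -> Dict[str, IssuingRule]:
    """Apply changes in order and return the rule in effect per aspect."""
    rules = {rule.aspect: rule for rule in initial}
    for change in changes:
        aspect = change.replaced.aspect
        if change.replaced_with.aspect != aspect:
            raise ValueError("a change must replace a rule for the same aspect")
        if rules.get(aspect) != change.replaced:
            raise ValueError("the replaced rule is not the one in effect")
        rules[aspect] = change.replaced_with
    return rules


monthly = IssuingRule("frequency", "monthly")
quarterly = IssuingRule("frequency", "quarterly")
language = IssuingRule("language", "English")
rules = rules_in_effect([monthly, language], [IssuingRuleChange(monthly, quarterly)])
print(rules["frequency"].value)  # quarterly
```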

For MAGAZINEgts, however, we are using the less often used aspect of the Issuing Rule class definition that includes the fine-grained publication design features of a magazine. The cover design, font choices, page grid layouts, and typographic embellishments are features that magazine editors and art directors use to give their publications a unique character that breeds reader familiarity and trust from issue to issue.

Using a metamodel subgraph design pattern, the MAGAZINEgts format supports a self-descriptive metadata format that will be most useful in developing whole-issue digitization pipelines. The Issuing Rules branch of the Metamodel section of a MAGAZINEgts format file is essential to deep data-mining of historic serial publications.

A Closer Look at PRESSoo and Softalk Advertisements

To explore how Issuing Rules are used in the MAGAZINEgts format, we’ll take a look at their use in documenting the Advertisements in Softalk magazine.

The MAGAZINEgts format is implemented as an XML file with four main branches: Metadata, Metamodel, DocumentStructure, and ContentDepiction. The first two branches are the “meta” part of the subgraph: the administrative metadata, and the Metamodel branch that contains a model of the document-specific data in the third and fourth branches of this ground truth format file. The IssuingRules section is found in the DocumentStructure section of the Metamodel branch of this file. Within the IssuingRules section you find the AdvertisingModel section as shown in this screenshot:
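For readers who prefer code to screenshots, the nesting just described can be sketched with Python’s standard `xml.etree.ElementTree` module. This is only a skeleton of the branch names mentioned in the text, not the actual MAGAZINEgts file; in particular, the root element name `MAGAZINEgts` is my assumption for this example.

```python
import xml.etree.ElementTree as ET


def build_skeleton() -> ET.Element:
    """Build an empty tree with the four main branches and the
    Metamodel/DocumentStructure/IssuingRules/AdvertisingModel path."""
    root = ET.Element("MAGAZINEgts")  # hypothetical root element name
    for branch in ("Metadata", "Metamodel", "DocumentStructure", "ContentDepiction"):
        ET.SubElement(root, branch)

    # The Metamodel branch mirrors the document-specific branches,
    # so it has its own DocumentStructure section.
    metamodel = root.find("Metamodel")
    meta_structure = ET.SubElement(metamodel, "DocumentStructure")
    issuing_rules = ET.SubElement(meta_structure, "IssuingRules")
    ET.SubElement(issuing_rules, "AdvertisingModel")
    return root


root = build_skeleton()
print(ET.tostring(root, encoding="unicode"))
```

The point to notice is that `DocumentStructure` appears twice: once under `Metamodel`, where the rules live, and once at the top level, where the data that follows those rules lives.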

As can be seen, the Advertising Model consists of a set of parameters that constrain an ad’s size, shape, and position within the page layout grid for a page. For a given ad size and shape there are only so many possible positions as shown in this screenshot where we’ve drilled down into the possible positions for a one-third of a page, vertical ad appearing on a two column page grid:

An ad shape in this context is defined to mean that a vertical ad is taller than wide, and conversely a horizontal ad is wider than tall. The interesting part of this AdPosition entry is the PageGrid section where a mini DSL, Domain Specific Language, is used to describe the allowable positions for a specific size and shape ad based on non-numeric page proportions. In this way, the Metamodel subgraph’s Issuing Rules guide the identification of a magazine’s advertisements regardless of the absolute dimensions of the scanned page image.
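The benefit of describing positions as page proportions can be sketched as follows. The PageGrid DSL itself is not reproduced here; instead I assume, purely for illustration, that an allowable position has already been resolved to fractions of the page’s width and height. The `Region` type and `to_pixels` function are hypothetical names.

```python
from fractions import Fraction
from typing import NamedTuple, Tuple


class Region(NamedTuple):
    """A rectangle expressed as proportions of the page (assumed model)."""
    left: Fraction
    top: Fraction
    width: Fraction
    height: Fraction


def to_pixels(region: Region, page_width: int, page_height: int) -> Tuple[int, int, int, int]:
    """Scale a proportional region to an (x0, y0, x1, y1) pixel box."""
    x0 = round(region.left * page_width)
    y0 = round(region.top * page_height)
    x1 = round((region.left + region.width) * page_width)
    y1 = round((region.top + region.height) * page_height)
    return (x0, y0, x1, y1)


# Hypothetical position: a one-third page horizontal ad across the page bottom.
bottom_third = Region(Fraction(0), Fraction(2, 3), Fraction(1), Fraction(1, 3))

print(to_pixels(bottom_third, 1200, 1800))  # (0, 1200, 1200, 1800)
print(to_pixels(bottom_third, 600, 900))    # (0, 600, 600, 900)
```

The same proportional description yields a correct bounding box for both scan resolutions, which is the property that lets the rules apply regardless of the absolute dimensions of the scanned page image.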

Given this metamodel for Softalk magazine’s Advertising Model, here is an instance for a one-third page, horizontal (wider than tall) ad that ran on a two-column grid, page 91, in the April 1984 issue of Softalk magazine:

The **MAGAZINEgts AdSpec for a 1/3-page horizontal ad** that ran in the April 1984 issue of Softalk. Note an interesting pre-digital/camera-ready-art aspect of this example ad. To affordably extend the run-series of an ad, advertisers would routinely run ads at more expensive larger sizes and then photo-reduce them to smaller sizes for succeeding placements. A full-page ad would frequently be “shot down” by 50% to run as a quarter-page ad. In this example, Artemis Systems has reduced a half-page and a quarter-page ad to create a single one-third page horizontal ad. The “From Artemis Systems” headline of the 1/4 page ad was changed to “Also from Artemis” and the display type phrase, “All software unprotected”, was then pasted into the ad combination to “tie the room together”, to borrow a turn of expression from **The Big Lebowski**.

This AdSpec entry is one of 7,164 specifications found in the Advertisements branch of the DocumentStructure branch of the MAGAZINEgts file. Note, too, that the AdSpec for this ad includes cross referencing elements that link the ad to the ContentDepiction branch of the MAGAZINEgts file. The references identify the Organization playing the Advertiser role in the ad placement and a list is maintained of the Products mentioned in the ad.

With this fine-grained metamodeling of the Advertising Model and the specifications of all the ads appearing in Softalk magazine, we are able to provide ground truth specification for the bounding box dimensions of these ads. These document structure dimensions can then be used to generate the page image and corresponding image masks for Machine Learning model training. Such useful dataset information is then made available in the Metadata section of the MAGAZINEgts file as shown in this screenshot:

Importantly for ML model training dataset generation efficiency, the MAGAZINEgts file contains all the information needed to programmatically generate new datasets based on a researcher’s specific interest. The first of such useful “seed” model-training datasets is available in a FactMiners GitHub repository for the ads in Softalk magazine. This potential for the flexible use of the MAGAZINEgts format for Machine Learning model training was the subject of our DATeCH2019 poster paper.
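As a small illustration of what “programmatically generate” can mean here, this sketch turns ad bounding boxes into a binary image mask of the kind used for segmentation training. It uses plain Python lists to stay dependency-free; the function name and the `(x0, y0, x1, y1)` pixel box convention are my assumptions, not the layout of the published dataset.

```python
from typing import Iterable, List, Tuple

Box = Tuple[int, int, int, int]  # (x0, y0, x1, y1), with x1 and y1 exclusive


def ad_mask(page_width: int, page_height: int, boxes: Iterable[Box]) -> List[List[int]]:
    """Return a page-sized grid with 1 inside any ad box and 0 elsewhere."""
    mask = [[0] * page_width for _ in range(page_height)]
    for x0, y0, x1, y1 in boxes:
        # Clip each box to the page so stray coordinates cannot overflow.
        for y in range(max(0, y0), min(page_height, y1)):
            row = mask[y]
            for x in range(max(0, x0), min(page_width, x1)):
                row[x] = 1
    return mask


# A tiny 6x9 "page" with an ad across the bottom third.
mask = ad_mask(6, 9, [(0, 6, 6, 9)])
for row in mask:
    print("".join(str(cell) for cell in row))
```

In a real pipeline the grid would be written out as an image alongside the page scan, one mask per page.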

In Closing

While the road ahead is long and challenging, Timlynn and I look forward to many exciting days ahead in our quest to create the MAGAZINEgts ground truth storage format and its reference implementation for Softalk magazine.

Thank you for reading this article. We welcome your interest and look forward to exploring opportunities for collaboration.

-: Jim Salmons & Timlynn Babitsky :-
Broomfield, Colorado, USA

— (end) —