November 28, 2012

Reuse

When you have to maintain a sprawling documentation set, the ability to reuse content can be a lifesaver. It can also be a disaster. The wall staving off disaster is strategy, consistency, and discipline. Without a good strategy, reuse becomes a rat's nest. If the writers are not consistent in applying the strategy, reuse creates snarls. If the writers are not disciplined, reuse exacerbates the problems.
The first thing that needs to be done in forming a good reuse strategy is defining what reuse means. One definition of reuse is that content modules are used in multiple published documents. For example, a standard warning message is placed in a module and then that module is imported into any document that requires it. Another common definition of reuse is that content is cloned (copied) to other modules where it may be useful. For example, the standard warning is simply pasted into all of the modules that require it.
In my mind, the first definition of reuse is the one to which knowledge set maintainers should aspire. It truly reduces both the workload and the possibility for error. Writers only need to maintain a single copy of a module, and when that module is updated, all importing modules take advantage of the update. The second definition, in contrast, saves writers some amount of work up front since they do not have to originate any content, but increases the maintenance load on the back end. A change in one of the cloned sections requires the writer to not only update the original, but to hunt down all of the copies, determine if the change is appropriate for each, and then make the update. It is more work and far more error prone.
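The difference between the two definitions can be sketched in a few lines of Python. This is only an illustration, not a real content management system: the module IDs, the warning text, and the `publish` helper are all made up for the example.

```python
# Hypothetical module store: module ID -> content.
modules = {
    "warn-power": "WARNING: Disconnect power before servicing.",
    "install-intro": "This chapter describes installation.",
}

# Documents that import shared modules hold only module IDs.
install_guide = ["install-intro", "warn-power"]
service_guide = ["warn-power"]

# A document that clones content holds its own private copy of the text.
cloned_guide = [modules["warn-power"]]


def publish(doc):
    """Resolve each module ID to its current content at publish time."""
    return [modules[module_id] for module_id in doc]


# One edit to the shared module...
modules["warn-power"] = "WARNING: Disconnect power and wait 30 seconds before servicing."

# ...reaches every importing document the next time it is published,
install_text = publish(install_guide)
service_text = publish(service_guide)

# ...while the clone keeps the old wording until someone hunts it down.
clone_is_stale = cloned_guide[0] != modules["warn-power"]
print(install_text[1])
print(service_text[0])
print("clone is stale:", clone_is_stale)
```

The importing documents never store the warning itself, so there is nothing to hunt down after the edit. The clone has to be found and fixed by hand.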
The idea of cloning content is not without merit and does have a place in a solid reuse strategy, but by itself is not a solid reuse strategy. Cloning is useful when the content in a module is a close, but not exact, fit. For example, two products may use a common log-in module and have a similar log-in procedure. However, the username/password requirements may be very different or one of the products may require an additional step. In this case, it may make sense to maintain two copies of the log-in procedure.
Cloning is also routinely used to perform versioning. Each product library I work on has at least three versions. Each of these versions is a clone of the other versions. The entire collection of modules is cloned into a single version, so module A and module B share the same version within any one instance of a library. Trying to make an update to multiple versions of a library will highlight the issues with cloning as a primary reuse strategy.
So, if cloning is not a useful primary reuse strategy, what is? Reuse is complex, and any strategy will require many tactics:
* constraining writing to make all of the content stylistically uniform
* wise use of variables
* wise use of conditional text
* sensible chunking rules
* open communication channels
* sensible cloning rules
* scoping rules
* clear versioning policies for shared content collections
Using variables and conditional text makes it easier to share modules between products or in places where minor variations are required in the content. Variables are useful for places where a product name changes or when two products use different version schemes. Conditional text can allow for slight variances in procedures. Variables and conditional text have pitfalls as well. They can hinder translation and can get convoluted and hard to manage. When a module becomes too heavy with conditionals and variables, it might be a good idea to consider cloning.
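As a rough sketch of how variables and conditional text let one log-in module serve two products, consider the following Python. The product names, the `secure` condition, and the `render` function are assumptions invented for the example, and `string.Template` stands in for whatever variable mechanism your tools provide.

```python
from string import Template

# Hypothetical shared module: each step is (condition, text).
# A condition of None means the step applies to every product.
LOGIN_STEPS = [
    (None, "Open the $product console."),
    (None, "Enter your user name and password."),
    ("secure", "Enter the code from your security token."),
    (None, "Click Log In."),
]


def render(steps, variables, conditions):
    """Keep the steps whose condition is active, then fill in variables."""
    lines = []
    for condition, text in steps:
        if condition is None or condition in conditions:
            lines.append(Template(text).substitute(variables))
    return lines


basic = render(LOGIN_STEPS, {"product": "Widget Basic"}, set())
secure = render(LOGIN_STEPS, {"product": "Widget Secure"}, {"secure"})

for number, step in enumerate(secure, start=1):
    print(f"{number}. {step}")
```

One extra conditional step is easy to follow. If half the steps carried conditions, two cloned procedures would probably be easier to read and to translate.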
One of the most important parts of a reuse strategy is the size of the modules allowed. They must be fine grained enough to maximize reuse. For example, a book is not a great module size since a book's reusability is limited to one use per library. The modules also need to be coarse grained enough to maintain. For example, a phrase, or even a sentence, does not make a great content module because there would simply be too many of them to manage. I generally think that the DocBook section element is a good module delimiter. Sections are fine grained enough to be shared in multiple places in a library or set of libraries and coarse grained enough to hold a useful amount of information. On a case-by-case basis, tables, lists, admonitions, and examples also make good modules.
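To make the section-as-module idea concrete, here is a small Python sketch that treats every `section` element in a DocBook-style document as one candidate module. The sample document, its `id` values, and the `list_modules` helper are invented for illustration. It uses DocBook 4-style markup without a namespace or DTD, and it assumes sections are not nested.

```python
import xml.etree.ElementTree as ET

# A tiny, made-up DocBook-style chapter with two sections.
SAMPLE = """
<chapter id="getting-started">
  <title>Getting Started</title>
  <section id="login">
    <title>Logging In</title>
    <para>Enter your user name and password.</para>
  </section>
  <section id="logout">
    <title>Logging Out</title>
    <para>Click Log Out.</para>
  </section>
</chapter>
"""


def list_modules(xml_text):
    """Return (id, title) for each section, the candidate reuse modules."""
    root = ET.fromstring(xml_text)
    return [
        (section.get("id"), section.findtext("title"))
        for section in root.iter("section")
    ]


candidate_modules = list_modules(SAMPLE)
for module_id, title in candidate_modules:
    print(module_id, "-", title)
```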
In situations where you are only dealing with one product library, a strict versioning policy may not be critical. All of the modules will ostensibly share the same version for an entire library. However, if you are working in an environment where products share large components, it makes sense to have a strict and well understood policy in place about how component and product versions work. We currently have two products that share a number of common components, and the products can at any one time be using different versions of the components. To handle this, we version the documentation for each component independently of the products in which it is used. Each product imports the required version of the component sets, and when a release is built, tags are added to the component libraries to mark the product revision. This allows us to make ongoing updates to all of the content with reasonable assurance that we won't accidentally cross-contaminate a product library. It does, however, add administrative overhead and require some extra caution on the part of the writers.
There is not a one-size-fits-all answer for how to implement these things. Every team has slightly different requirements, slightly different content, and slightly different capabilities. If your requirements put ease of translation over ease of reuse, you will choose a different set of parameters. If your team is made up of newbies, you will choose a less strict set of parameters. Your tools will also alter the parameters you choose. The trick is to choose wisely and honestly.
Once you have chosen, stick to the plan and look for ways to improve the process. Just know that once you have chosen, changing your mind becomes harder and harder. Reuse creates tangles and dependencies that are not easily unwound.
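The shape of that policy can be sketched in Python. This is not our actual build system. The component names, version numbers, tag format, and the `build_release` function are hypothetical, chosen only to show products pinning component versions and releases tagging the component libraries.

```python
# Hypothetical component libraries: component -> version -> module IDs.
component_libraries = {
    "messaging": {
        "1.0": ["msg-overview", "msg-config"],
        "1.1": ["msg-overview", "msg-config", "msg-clustering"],
    },
    "security": {
        "2.0": ["sec-overview", "sec-login"],
    },
}

# Each product pins the component versions it imports.
product_manifests = {
    "product-a": {"messaging": "1.0", "security": "2.0"},
    "product-b": {"messaging": "1.1", "security": "2.0"},
}

# Tags recorded against each (component, version) at release time.
release_tags = {}


def build_release(product, revision):
    """Collect the pinned modules and tag each component version used."""
    modules = []
    for component, version in sorted(product_manifests[product].items()):
        modules.extend(component_libraries[component][version])
        tag = f"{product}-{revision}"
        release_tags.setdefault((component, version), []).append(tag)
    return modules


a_modules = build_release("product-a", "3.2")
b_modules = build_release("product-b", "1.4")
print(a_modules)
print(release_tags[("security", "2.0")])
```

Because each product names the version it wants, work on messaging 1.1 cannot leak into product-a, and the tags record exactly which component versions shipped with which product revision.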

November 2, 2012

Attribution and Provenance

I was recently involved in a discussion about reuse and one of the recurring issues was that writers didn't want other writers modifying their modules without being consulted.
In a distributed, multi-writer team there is always the chance that two writers will make a change to the same module. When reuse is added into the mix, there is the problem that one writer's changes may be incompatible with at least one use of the module, and the only real solution to the problem is to branch the module. It is naive to think that everyone will always do the right thing, so there is a requirement to be able to track changes and have writers' names attached to changes. This requirement makes it possible to easily roll back mistakes and to hold writers accountable so that mistakes are less likely to happen. Attaching a writer's name to a change also makes it easier to coordinate future changes because the next writer to come along can see who has been working on a module and coordinate with them to decide if updates require a branch or not.
Attaching a writer's name to a module's change log was not an issue for this group, partly because they are working in a system that really doesn't support branching or any robust change tracking mechanism, but mostly because they were more hung up on the fact that another writer can change their modules. It was an issue of ownership that is exacerbated by a system that lists every writer who ever contributed to a book as an author. Much of the discussion about how to manage the issue of modifying reused topics focused on how to manage the ownership issue and devolved into a discussion about how to keep track of the authors of a module.
This is unproductive. In order for reuse to work, in fact for modular writing to be effective at all, the concept of ownership needs to be extended to all of the modules that make up the content base. No one writer can own a content module. Technical writers in a group project, regardless of whether the group is a corporate writing team or an open source project, cannot, if they want to create good content in an efficient manner, retain the ownership of any one piece of the whole. Ownership of pieces can be destructive because it makes writers reluctant to make changes to modules they don't own, creates situations where writers are upset when a change is made to a module they own, and fosters an environment where writers focus on making their modules great instead of making the whole project great. In the end, technical writers working on a team are not authors; they are contributors. Authors are entities that publish complete works that are intended for standalone consumption.
I know writers generally don't like to hear that they are not authors. I know that I don't. I like to get credit for my work and see my byline. I worked as a reporter for several years and I write several blogs. In both cases, I am an author and own the content. In both cases, I produce complete works that are intended for standalone publication and consumption. As a reporter, I did work on articles with other reporters and how the byline, and hence ownership of the work, was determined depended largely on how much each reporter contributed. If it was a two person effort and both split the work equally, the byline was shared. In teams bigger than two, typically, at least one of the reporters was relegated to contributor.
However, I also work as a technical writer and contributor to a number of open source projects. In both cases, I write content that is published and in which I take pride. The difference is that they are large group efforts of which my contributions are only a part (sometimes a majority part, sometimes a tiny part). Publicly, I cannot claim authorship for the content produced by the efforts. There is little way to distinguish my contributions from the others and attempting to do so does not benefit the reader. Do I get credit for this work? Within the projects I do because all of the change tracking systems associate my changes with my name. I do not make contributions to a project that requires personal attribution for my contributions, nor do I make contributions that prohibit derivative works. Both feel detrimental to the purpose of the projects. How can one make updates if no derivatives are allowed on a content module? Most of the efforts do use licenses that restrict redistribution and derivative works, but these are for the entire body of work.
There is the issue of provenance in environments that accept outside contributions or produce works that are an amalgam of several projects. This is largely a CYA legal issue, but it is a big issue. Fortunately, it is a problem with several working solutions. The open source communities have all developed ways of managing provenance, as has any company that ships functionality implemented by a third party. One of the most effective ways of managing the issue of provenance is to limit the types of licenses under which your project is allowed to accept contributions.
Personally, I would restrict direct contributions to a single license that doesn't require direct attribution of the contributor and allows derivative works. Ideally, contributions should only be accepted if the contributor agrees to hand over rights to the project, which eliminates all of the issues.
For indirect contributions, the issue is a little more thorny. You want to maximize the resources available to your project while minimizing your exposure to legal troubles and unwanted viral license terms. For example, Apache doesn't allow projects to use GPL licensed code because it is too viral. However, they do allow the use of LGPL binaries since they don't infect the rest of the project. This also means knowing what is and isn't allowed by the licenses with which you want to work. For example, if your project wants to use a work that requires attribution and doesn't allow derivative works, you need to have a policy in place about how you redistribute the work, such as only distributing the PDFs generated by the author.
Tracking provenance need not be hard. For direct contributions, you just need to ensure that all contributors accept the terms required for contribution and that is that. Indirect contributions should be handled like third-party dependencies, with the license terms associated directly with the third-party work in a separate database. The terms only need to be consulted when the project is being prepped for a release to ensure that legal obligations are being met.
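A release-time check against such a database might look something like the Python below. Everything in it is an assumption made for illustration: the `ThirdPartyWork` record, the two registry entries, the placeholder license names, and the `release_notes` function. A real project would take its terms from the actual license texts and its legal counsel.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ThirdPartyWork:
    """One indirect contribution and the license terms that matter here."""
    name: str
    license_name: str
    requires_attribution: bool
    allows_derivatives: bool


# Hypothetical registry, kept separately from the content itself.
REGISTRY = [
    ThirdPartyWork("vendor-install-guide", "example-no-derivatives", True, False),
    ThirdPartyWork("shared-glossary", "example-permissive", False, True),
]


def release_notes(registry, modified_works):
    """List the obligations and problems to review before a release."""
    notes = []
    for work in registry:
        if work.requires_attribution:
            notes.append(f"{work.name}: include attribution")
        if not work.allows_derivatives and work.name in modified_works:
            notes.append(f"{work.name}: modified, but license forbids derivatives")
    return notes


clean = release_notes(REGISTRY, modified_works=set())
flagged = release_notes(REGISTRY, modified_works={"vendor-install-guide"})
for note in flagged:
    print(note)
```

The check only runs when a release is being prepared, which keeps the day-to-day writing free of license bookkeeping.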
The takeaway:
* the concept of ownership is destructive and counterproductive to large group projects
* provenance is an issue, but not a problem if properly scoped