1. Introduction
GitHub is an online service for source code hosting and software project collaboration. It provides features for coordinating work, such as the issue tracker for a repository to discuss software features and bugs and the pull-request mechanism for software developers to make contributions to other repositories [1,2]. It also includes social features, such as following other GitHub users to make connections and receiving updates of others from activity traces [1,2].
Recently, software developers began to create GitHub repositories that systematically organize and index Internet resources (Figure 1), and many of these repositories have gained notable popularity. Studies of this phenomenon have investigated the motivations of curators and user experiences with curation repositories [3,4]. However, an analysis of the GitHub features that support curation, and of the role that curation repositories play in the software developers’ community, is still missing. In this paper, we therefore add to the literature by examining how GitHub features are applied in curation repositories and what functions curation repositories serve in the software developers’ community.
In recent years, curation behavior has been investigated in the context of social media sites, such as Twitter and Pinterest, and with respect to media contents, such as videos, images, text (tweets), and hyperlinks to other online resources [5,6]. Curation on GitHub is distinct from curation on those websites in the following ways. First, hyperlinks posted in GitHub curation repositories are directed at the software developers’ community, whose members share and cultivate a professional interest in software development. Second, curation on GitHub is embedded in the context of GitHub, which is an ensemble of social coding features intended for software development and collaboration rather than for curation. The GitHub context raises interesting questions concerning how its features support such an appropriation for curation and what role curation serves for this specific community. Therefore, this paper attempts to answer the following overarching research question:
RQ: How does GitHub support curation practices?
To answer this research question, we first examine how curation repositories differ from typical software repositories. Because curation is a relatively new way to utilize GitHub, users are likely to participate in this kind of repository differently. This relates not only to the number of GitHub users who star a repository, which usually shows a user’s interest in that repository [7,8], but also to the different types of activities that take place inside a curation repository, such as pull requests [8]. Thus, we explore our initial research question through the specific sub-research questions outlined below.
RQ1: How are curation repositories different from the typical software repositories of GitHub?
Prior literature has documented software practices on GitHub well [1,2,8], and some literature has provided insights into the curator’s motivations and the users’ experiences with curation repositories [3,4]. However, a gap still exists in how GitHub features are utilized in curation repositories and how they are adopted differently, as compared to the intended software practice. This research question intends to close this gap and to provide an account of the categories and user participation that make curation repositories different.
In addition to the comparison with software repositories, currently, we have little understanding of the details of curation repositories in terms of what kind of needs they address and what role they play in the software developers’ community. Thus, the following research question will be investigated next.
RQ2: What is the emerging role of curation repositories in the GitHub community?
Specifically, this research question examines the function of curation by examining the contents, format, the owner’s characteristics, and collaboration pattern of curation repositories on GitHub. The answer to this research question can elucidate why curation repositories are useful, why they have suddenly drawn great attention, and what kind of impact they bring to the software developers’ community.
Through a statistical analysis of the activity logs to compare curation repositories with software repositories and a content analysis of the most popular curation repositories on GitHub, we find that curation repositories are more popular than software repositories, and that GitHub users participate in curation repositories in a qualitatively different way. Most curation repositories are maintained by individual software developers, who intend to collect and preserve high-quality resources about the technology industry, originating from either inside or outside of GitHub. Our findings suggest that curation repositories have become an essential way for the software developers’ community to centralize fragmented information and share knowledge. This study contributes to the understanding of curation on GitHub and sheds light on potential ways to better support the practice.
3. Method
To characterize popular curation repositories hosted on GitHub, we collected a dataset of activity logs on GitHub, identified the top curation and software repositories, compared the two groups, and coded the contents of the curation repositories. The dataset was collected from GitHub Archive (https://www.githubarchive.org/), which captures comprehensive GitHub timeline data. GitHub Archive data has been actively used for analysis in academic publications [10,14,15,16]. However, the dataset was affected by a crawler issue described in a bug report on 22 September 2013 (https://github.com/igrigorik/githubarchive.org/pull/37), which resulted in a loss of events. For consistency and data quality, we collected 109,782,635 events on 7,079,847 repositories that occurred between 1 October 2013 and 31 August 2014.
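The date restriction described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the record fields `type`, `repo`, and `created_at` and the sample events are hypothetical stand-ins, not the actual GitHub Archive schema or the authors' pipeline.

```python
from datetime import datetime, timezone

# Study window stated in the text: 1 October 2013 through 31 August 2014 (inclusive).
WINDOW_START = datetime(2013, 10, 1, tzinfo=timezone.utc)
WINDOW_END = datetime(2014, 9, 1, tzinfo=timezone.utc)  # exclusive upper bound


def parse_timestamp(value):
    """Parse an ISO 8601 UTC timestamp such as '2013-10-01T00:00:00Z'."""
    return datetime.strptime(value, "%Y-%m-%dT%H:%M:%SZ").replace(tzinfo=timezone.utc)


def filter_window(events):
    """Keep only events whose timestamp falls inside the study window."""
    return [
        e for e in events
        if WINDOW_START <= parse_timestamp(e["created_at"]) < WINDOW_END
    ]


def summarize(events):
    """Return (number of events, number of distinct repositories)."""
    return len(events), len({e["repo"] for e in events})


# Hypothetical sample records; field names are assumptions for illustration.
sample_events = [
    {"type": "WatchEvent", "repo": "alice/list", "created_at": "2013-09-22T10:00:00Z"},
    {"type": "WatchEvent", "repo": "alice/list", "created_at": "2013-10-01T00:00:00Z"},
    {"type": "PullRequestEvent", "repo": "bob/tool", "created_at": "2014-08-31T23:59:59Z"},
    {"type": "PushEvent", "repo": "bob/tool", "created_at": "2014-09-01T00:00:00Z"},
]

kept = filter_window(sample_events)
print(summarize(kept))  # (2, 2)
```

Using an exclusive upper bound at 1 September avoids having to decide what the last representable instant of 31 August is.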
In order to find the commonalities of the curation repositories that are of interest to others, we selected repositories based on indicators of popularity. Trending repositories are displayed on GitHub by day, week, or month. The trending repositories typically average about 500 stars per repository (https://github.com/trending?l=all&since=weekly). Given that a trending repository can be considered relatively popular on GitHub, we selected all repositories that received more than 500 stars within the date range of our dataset. At the same time, many software repositories with more than 500 stars have been established for years. In order to have a fair comparison of curation and software repositories, only repositories that were created after 1 January 2013 were retained, resulting in 1929 repositories.
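The selection rule can be illustrated as follows. This is a minimal sketch, assuming stars are counted from `WatchEvent` records (the event type GitHub emits when a user stars a repository) and that creation dates are available in a separate lookup; the `created_on` mapping, the record fields, and the sample data are hypothetical, and "created after" is read here as strictly later than 1 January 2013.

```python
from collections import Counter
from datetime import date

STAR_THRESHOLD = 500               # "more than 500 stars"
CREATED_AFTER = date(2013, 1, 1)   # "created after 1 January 2013"


def count_stars(events):
    """Count star events per repository (a star is reported as a WatchEvent)."""
    return Counter(e["repo"] for e in events if e["type"] == "WatchEvent")


def select_popular(events, created_on, threshold=STAR_THRESHOLD,
                   created_after=CREATED_AFTER):
    """Return repositories with more than `threshold` stars, created after the cutoff."""
    stars = count_stars(events)
    return sorted(
        repo for repo, n in stars.items()
        if n > threshold
        and repo in created_on
        and created_on[repo] > created_after
    )


# Tiny hypothetical example, using a threshold of 2 so the data stays readable.
sample_events = (
    [{"type": "WatchEvent", "repo": "new/list"}] * 3
    + [{"type": "WatchEvent", "repo": "old/tool"}] * 3
    + [{"type": "WatchEvent", "repo": "new/small"}]
    + [{"type": "PushEvent", "repo": "new/small"}] * 5
)
created_on = {
    "new/list": date(2013, 6, 1),
    "old/tool": date(2011, 3, 1),
    "new/small": date(2014, 1, 1),
}

print(select_popular(sample_events, created_on, threshold=2))  # ['new/list']
```

Here `old/tool` is popular but too old, and `new/small` is recent but has too few stars, so only `new/list` is retained.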
To identify curation repositories within this sample, we first identified 1384 software projects, whose programming languages were automatically detected by GitHub. For the remaining 545 repositories, we manually labeled each. The criterion used to determine if a repository was a curation repository was whether the primary content of the repository was a collection of Internet resources. As a result, we identified 49 curation repositories from the 545 repositories. As we could not verify the nature of the other 496 repositories, and since our immediate interest for this paper focused on popular curation repositories and their comparison with software repositories, we discarded these 496 repositories from the sample.
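The two-stage labeling above can be sketched as a simple partition. The record layout, the `manual_labels` mapping, and the repository names are hypothetical; the manual judgment itself (whether the primary content is a collection of Internet resources) is a human step that the code only records.

```python
def partition_by_language(repos):
    """Split repositories into (software, unlabeled) by whether a language was detected."""
    software = [r["name"] for r in repos if r.get("language")]
    unlabeled = [r["name"] for r in repos if not r.get("language")]
    return software, unlabeled


def apply_manual_labels(unlabeled, manual_labels):
    """Keep repositories a human labeled as curation; everything else is discarded."""
    curation = [name for name in unlabeled if manual_labels.get(name) == "curation"]
    discarded = [name for name in unlabeled if manual_labels.get(name) != "curation"]
    return curation, discarded


# Hypothetical sample; "language" mimics GitHub's automatically detected language.
sample_repos = [
    {"name": "alice/http-client", "language": "Go"},
    {"name": "bob/awesome-links", "language": None},
    {"name": "carol/dotfiles-notes", "language": None},
]
manual_labels = {"bob/awesome-links": "curation", "carol/dotfiles-notes": "other"}

software, unlabeled = partition_by_language(sample_repos)
curation, discarded = apply_manual_labels(unlabeled, manual_labels)
print(software, curation, discarded)

# The counts reported in the text are consistent with this partition.
assert 1384 + 545 == 1929
assert 49 + 496 == 545
```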
After identifying the most popular curation and software repositories, to answer RQ1, i.e., how are curation repositories different from the typical software repositories of GitHub, we aggregated the activity log data for the 49 curation repositories and the 1384 software repositories, respectively, and applied a quantitative method to compare them. Specifically, for each type of activity, we analyzed whether the number of activities for curation repositories differed from the number of activities for software repositories. To answer RQ2, i.e., what is the emerging role of curation repositories in the GitHub community, we performed content analysis on the 49 curation repositories. For each curation repository, we coded the repository name, description, curated items, pull requests, and owners’ profiles, which were retrieved from GitHub on 1 September 2014. Open coding strategies as suggested by Strauss (1987) were applied to develop a coding scheme [17]. Themes and concepts were identified, discussed, and refined iteratively among researchers [18]. The results are presented in the following section.
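The text does not name the statistical test used for the per-activity comparison. As one possible illustration, the sketch below computes a Mann-Whitney U statistic, a rank-based comparison that is a common choice for skewed count data; this choice is an assumption, not the authors' stated method, and the pull-request counts are invented for the example.

```python
def average_ranks(values):
    """Assign 1-based ranks, giving tied values the mean of the ranks they span."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with the value at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def mann_whitney_u(sample_a, sample_b):
    """Return (U for sample_a, share of (a, b) pairs where a exceeds b, ties counted half)."""
    combined = list(sample_a) + list(sample_b)
    ranks = average_ranks(combined)
    n_a, n_b = len(sample_a), len(sample_b)
    rank_sum_a = sum(ranks[:n_a])
    u_a = rank_sum_a - n_a * (n_a + 1) / 2
    return u_a, u_a / (n_a * n_b)


# Hypothetical pull-request counts per repository, for illustration only.
curation_prs = [12, 30, 45]
software_prs = [3, 5, 12, 20]

u, share = mann_whitney_u(curation_prs, software_prs)
print(u, share)  # 10.5 0.875
```

A share near 0.5 would indicate no systematic difference between the two groups; turning U into a p-value additionally requires a normal approximation or exact tables, which this sketch omits.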
5. Discussion
Our results illustrate the characteristics of curation repositories regarding: (1) their differences from software repositories on GitHub, (2) the topics and data provenance of the curated items, (3) the leadership, and (4) the collaboration patterns. The emergence of curation repositories and their high popularity have significant implications, which are discussed in this section.
5.1. Implications for the Software Developers’ Community
Our results and analysis show that most curation repositories on GitHub select, organize, and preserve different types of high-quality resources, grouping them into different categories that are useful for software developers. The wide popularity of curation repositories indicates that they are well-received in the software developers’ community and attract enormous attention. It is likely that curation repositories will become an important way for software developers to share knowledge within the community.
Software developers are increasingly active in participating in a set of different social media sites [22,23,24]. Although engaging in different sites creates vast opportunities for software developers to find information that is relevant and useful, it also introduces many burdens and challenges, such as (1) the fragmented resources spread over a set of social media sites, (2) the overhead to learn and master different social media channels, and (3) the difficulties in evaluating the quality of information in a large information space [24,25].
Curation on GitHub is likely to be a starting point to address these challenges. Curation repositories centralize fragmented resources all over the Internet. They are located in GitHub, a site many developers are already familiar with, where no other media literacy is required to master the tool, and they involve a collaborative human effort to evaluate the quality of the curated contents. In this way, curation repositories become an important way for the software developers’ community to communicate quality resources so that millions of developers do not have to follow different social media sites and filter resources themselves.
5.2. Internal Curation on GitHub
A relatively large proportion of curation repositories are internal-based, which suggests that one important function of curation on GitHub is indexing GitHub-orientated resources.
GitHub has been reported to be a successful tool for self-hosting software repositories and increasing the effectiveness of collaboration [1,2]. As a result, many more developers and organizations began to host software projects on GitHub to allow contributions from others, while also contributing to other software projects. This led to rapid growth in the number of repositories on GitHub, which reached 10 million in 2013 (https://github.com/blog/1724-10-million-repositories). However, not all repositories hosted on GitHub are of high quality or appealing enough for software developers to use in their own projects or to contribute to. Curation repositories provide valuable navigational support for software repository retrieval, with the help of GitHub features as well as human effort. The high popularity of curation repositories and the large quantity of internally curated resources suggest that such attempts are highly welcomed in the software developers’ community. They are likely to save the time and effort software developers spend in locating the desired software repositories.
Meanwhile, as each curation repository usually supports a single software development topic (or several related ones), it also raises the question of how well the curation practice scales. In particular, with the rapid progress of software engineering, new programming languages, frameworks, and libraries are emerging daily, and thus the number of curation repositories will grow as well. As a result, curation repositories as a whole will become fragmented. A meta-curation repository has already emerged (https://github.com/sindresorhus/awesome), which indexes and organizes curation repositories. However, the usage and effectiveness of such meta-lists are unknown. User evaluation of such meta-curation repositories and design efforts for organizing curation repositories can be interesting future research directions.
5.3. Implications for the Owners of Curation Repositories
In reviewing the characteristics of the owners of curation repositories, we found that most of them are individuals rather than organizations. This implies that the creation and maintenance of a curation repository do not require group effort. It also suggests that curation in the social coding environment is different from that in the enterprise context, which tends to have a small leadership team that creates and maintains curation repositories [12].
We also examined whether the owners of curation repositories were leaders within the community and found that many were little known prior to creating the repositories: they did not have many followers, and the number of followers is an indication of leadership status on GitHub [1,8]. This is an interesting distinction, considering that the GitHub community often favors the work of reputable, well-known developers [1]. The reputation of these curation repositories shows that curators do not have to be community leaders within the social coding site for their curation repositories to be well received.
This result has important implications. Given the popularity and attention that curation repositories receive, lesser-known software developers have an opportunity to create good curation repositories and make an impact in the community. In addition, the role of curator may become important in the software developers’ community, because (1) there is currently no easy way to deal with information fragmentation or to address the difficulty of evaluating information [23,24], and (2) the software industry is changing fast, with new technologies developed and old ones deprecated every day [16]. It is likely that more curation efforts will be required in the software developers’ community.
However, because our results show that most owners of curation repositories are individuals, interesting questions arise about how well a curation repository can scale. The more popular a curation repository becomes, the more contributions it will receive, and the larger it will become. As the repository expands, it will become increasingly hard for the owners to add new curated items, track existing ones, and, at the same time, evaluate the ones suggested by contributors. It will be interesting to see whether organizational efforts will be invested in a curation repository as it expands, or whether a community of core and peripheral members, like those of open source projects, will emerge around a given curation repository.
5.4. Collaborative Curation on GitHub
The appropriation of GitHub for collaborative curation is of particular interest to this study because GitHub provides a number of collaborative features, such as the issue tracker and the pull-request mechanism, which have become standard features in software practices.
Typical curation efforts include selecting, organizing, and evaluating resources from multiple sources [5]. In addition to these activities, the owner of a curation repository will also interact with other contributors to curate resources that match the description of the repository. In general, curation repositories adopt the existing practices on GitHub intended for collaborative software development, in which contributors send pull requests (or, for some curation repositories, issues) to the owner to submit a change to an existing file (specifically, to add a new resource hyperlink). The owner will then evaluate the resources recommended by the contributors and decide whether to merge the change. In this way, curation repositories are collaboratively developed by a number of GitHub users. This kind of collaborative curation follows contribution patterns very similar to those of software repositories [1,2,8]. Therefore, curation repositories not only adopt GitHub features, but also appropriate a part of the software practices on GitHub.
However, this type of appropriation of GitHub for curation differs from that in the enterprise context described by Matthew et al. (2014), in which a number of tools are combined to organize and curate resources to cope with information overload, and community leaders usually curate the bulk of the resources [12].
Further, our results show that most collaborative curation happens between only two persons, the owner of a curation repository and the contributor. This raises some doubts about whether the opinions of two persons can be representative for an artifact intended for a large community. It suggests that GitHub features are underutilized for evaluating resources and reaching community consensus. In addition, most pull requests to curation repositories add new resources rather than delete existing ones. This suggests that contributions to curation repositories rarely consider whether the existing resources are still up to date or appropriate to be included in the list. In the long run, if a curated list keeps growing, it can increase the navigational difficulties and affect the overall quality of the curated resources.
5.5. Design Implications
Our results demonstrate that curation repositories have become an important type of artifact developed in GitHub. The characteristics of such repositories have important implications.
From a design perspective, there are opportunities to design a better interface and provide a better user experience for curation repositories. Open source software projects are a major type of curated resource, and software projects are created, flourish, and perish all the time. Under the current curation paradigm, there is no effective way to tell whether a curated item in a curation repository is under active development without checking manually. As suggested by recent literature, software developers leverage a set of features to make social inferences: for example, recent activity signals the activeness of a repository [1], and the number of stars indicates the community’s interest in a project [8]. Most curation repositories are a single page with lengthy content, and, most of the time, the information contained in a curated item is brief, including only the name of the resource and a simple description. Signals such as the number of stars and activeness can therefore be appended to each curated item to help software developers evaluate curated items inside a curation repository.
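The idea of appending signals to curated items can be sketched as follows. This is an illustration only, assuming the curated list is written in Markdown and that star counts and last-push dates have already been fetched into a lookup table; the `signals` mapping, the repository `example-owner/example-repo`, and its numbers are hypothetical placeholders rather than real data.

```python
import re

# Matches a Markdown link to a GitHub repository root,
# e.g. [name](https://github.com/owner/repo)
REPO_LINK = re.compile(r"\[[^\]]+\]\(https://github\.com/([\w.-]+/[\w.-]+)/?\)")


def annotate_line(line, signals):
    """Append star count and last-push date to a line that links to a known repository."""
    match = REPO_LINK.search(line)
    if not match:
        return line  # not a GitHub repository link; leave untouched
    info = signals.get(match.group(1))
    if info is None:
        return line  # no signals available for this repository
    return f"{line} (stars: {info['stars']}, last push: {info['last_push']})"


def annotate_list(markdown, signals):
    """Annotate every line of a curated Markdown list."""
    return "\n".join(annotate_line(line, signals) for line in markdown.splitlines())


# Hypothetical curated list and pre-fetched signals (placeholder values).
curated = (
    "- [example-repo](https://github.com/example-owner/example-repo) - A sample library.\n"
    "- [Some article](https://example.com/post) - An external resource."
)
signals = {"example-owner/example-repo": {"stars": 1200, "last_push": "2014-08-30"}}

print(annotate_list(curated, signals))
```

Items that point outside GitHub are left unchanged, which reflects the distinction between internally and externally sourced resources: only the former have stars and push activity to report.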
7. Conclusions
Curation on GitHub is an innovative appropriation of an existing tool in the software developers’ community. In this paper, we studied the characteristics of curation in the software developers’ community by investigating curation repositories in the following aspects: (1) the GitHub features used in curation repositories, (2) the characteristics of contents, formats, and owners of curation repositories, and (3) the collaboration patterns in curation repositories. Our results show that curation repositories make use of existing GitHub features to collect, organize, and retain resources about the technology industry. They centralize resources that are spread both inside and outside of GitHub. The comparison of activities between curation repositories and software projects illustrates that curation repositories have a more stable structure, receive more contributions from the community, and do not have multiple owners to lead them.
The emergence of curation on GitHub and its wide popularity have important implications. They suggest that curation may become an important way for software developers to communicate knowledge, as the challenges of participating in multiple social media channels and coping with a large volume of resources are increasing [15,24]. Also, the curator role may become more important in the software developers’ community, and software developers can curate resources to make an impact. Last, there is potential for appending information signals to each curated item inside a curation repository to reduce the navigational cost.