Author name disambiguation

Author name disambiguation is a type of disambiguation and record linkage applied to the names of individual people. The process could, for example, distinguish individuals with the name "John Smith".

An editor may apply the process to scholarly documents, where the goal is to find all mentions of the same author and cluster them together. Authors of scholarly documents often share names, which makes it hard to attribute each author's work correctly. Author name disambiguation therefore aims to find all publications that belong to a given author and to distinguish them from publications by other authors who share the same name.

Methods

Considerable research has been conducted into name disambiguation.^[1]^[2]^[3]^[4] Typical approaches rely on two kinds of information to distinguish between authors. These include, but are not limited to, information about the authors, such as their name representation, affiliations and email addresses, and information about the publication, such as year of publication, co-authors and the topic of the paper. This information can be used to train a machine learning classifier that decides whether two author mentions refer to the same author.^[5] Much research treats name disambiguation as a clustering problem, i.e., partitioning documents into clusters, each of which represents one author.^[1]^[6]^[7] Other research treats it as a classification problem.^[8] Some works construct a document graph and use the graph topology to learn document similarity.^[7]^[9] More recently, several studies^[9]^[10] have aimed to learn low-dimensional document representations by employing network embedding methods.^[11]^[12]
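As an illustration of the pairwise and clustering views described above, the following minimal Python sketch scores pairs of author mentions and then merges every pair whose score reaches a threshold. It is not taken from any of the cited systems: the `Mention` fields, the hand-picked weights in `same_author_score` (a stand-in for a trained classifier), the threshold and the sample data are all assumptions made for the example.

```python
from dataclasses import dataclass
from itertools import combinations


@dataclass(frozen=True)
class Mention:
    """One author mention on one paper (all fields are illustrative)."""
    paper_id: str
    name: str
    coauthors: frozenset = frozenset()
    affiliation: str = ""
    email: str = ""


def jaccard(a, b):
    """Overlap of two sets; defined as 0.0 when both are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def same_author_score(m1, m2):
    """Hand-weighted stand-in for a trained pairwise classifier (assumed weights)."""
    score = 0.0
    if m1.email and m1.email == m2.email:
        score += 0.6
    if m1.affiliation and m1.affiliation == m2.affiliation:
        score += 0.3
    score += 0.5 * jaccard(m1.coauthors, m2.coauthors)
    return score


def cluster_mentions(mentions, threshold=0.4):
    """Single-link clustering: merge any pair scoring at or above the threshold."""
    parent = list(range(len(mentions)))

    def find(i):
        # Union-find lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i, j in combinations(range(len(mentions)), 2):
        if same_author_score(mentions[i], mentions[j]) >= threshold:
            parent[find(i)] = find(j)

    clusters = {}
    for i, m in enumerate(mentions):
        clusters.setdefault(find(i), []).append(m.paper_id)
    return sorted(sorted(c) for c in clusters.values())


# Hypothetical mentions that all share (a variant of) the same name.
mentions = [
    Mention("p1", "John Smith", frozenset({"A. Lee", "B. Chen"}), "Univ. A"),
    Mention("p2", "J. Smith", frozenset({"A. Lee"}), "Univ. A"),
    Mention("p3", "John Smith", frozenset({"C. Gray"}), "Univ. B"),
    Mention("p4", "John Smith", frozenset({"C. Gray", "D. Roe"}), "Univ. B"),
]
print(cluster_mentions(mentions))  # [['p1', 'p2'], ['p3', 'p4']]
```

Each resulting cluster stands for one presumed author. A real system would typically learn the pairwise score from labelled data and might replace the single-link merge with a more robust clustering method.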

Applications

Author names can be ambiguous for several reasons. Individuals may publish under multiple names because of differing transliterations, misspellings, a name change due to marriage, or the use of nicknames, middle names or initials.^[13]
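Some of these variations, such as initials and differences in diacritics, can be reduced by normalising names before comparison. The Python sketch below is a simplified illustration, not a method from the cited literature: the function names and the "family name plus first initial" key are assumptions, and the key only groups candidate mentions for further comparison. It assumes "Given [Middle] Family" name order and does not handle name changes, nicknames or misspellings.

```python
import unicodedata


def strip_accents(text):
    """Remove combining marks so that 'Müller' and 'Muller' compare equal."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def blocking_key(name):
    """Reduce 'Given [Middle] Family' to 'family_g' (family name + first initial)."""
    tokens = strip_accents(name).lower().replace(".", " ").split()
    if not tokens:
        return ""
    return f"{tokens[-1]}_{tokens[0][0]}"


# Hypothetical name variants; the first three share one key, the last two another.
for variant in ["John A. Smith", "J. Smith", "John Smith", "Jürgen Müller", "J. Muller"]:
    print(variant, "->", blocking_key(variant))
```

Mentions that share a key are only candidates for the same author; deciding whether they actually match still requires additional evidence such as co-authors or affiliations.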

Motivations for disambiguating individuals include identifying inventors from patents and identifying researchers across different publishers, research institutions and time periods.^[14] Name disambiguation is also a cornerstone of author-centric academic search and mining systems, such as AMiner (formerly ArnetMiner).^[15]

Similar issues

Author name disambiguation is only one of several record linkage problems in the scholarly data domain. Closely related, and potentially mutually beneficial, problems include organisation (affiliation) disambiguation^[16] and conference or publication venue disambiguation, since data publishers often use different names or aliases for these entities.

Resources

Scholia has a profile for author disambiguation (Q25052136).

Several well-known benchmarks are available for evaluating author name disambiguation, each of which provides publications with ambiguous author names together with their ground truth. One example is S2AND.^[17]


References

[1] Khabsa, Madian; Treeratpituk, Pucktada; Giles, C. Lee (2015). Proceedings of the 15th ACM/IEEE-CS Joint Conference on Digital Libraries - JCDL '15. pp. 37–46. doi:10.1145/2756406.2756915. ISBN 9781450335942. S2CID 14068285.
[2] Mann, Gideon S.; Yarowsky, David (2003). "Unsupervised personal name disambiguation". Proceedings of the seventh conference on Natural language learning at HLT-NAACL 2003. Vol. 4. pp. 33–40. doi:10.3115/1119176.1119181. S2CID 29759924.
[3] Han, Hui; Giles, Lee; Zha, Hongyuan; Li, Cheng; Tsioutsiouliklis, Kostas (2004). "Two supervised learning approaches for name disambiguation in author citations". Proceedings of the 2004 joint ACM/IEEE conference on Digital libraries - JCDL '04. p. 296. doi:10.1145/996350.996419. ISBN 1581138326. S2CID 1089260.
[4] Huang, Jian; Ertekin, Seyda; Giles, C. Lee (2006). Knowledge Discovery in Databases: PKDD 2006. Lecture Notes in Computer Science. Vol. 4213. pp. 536–544. doi:10.1007/11871637_53. ISBN 978-3-540-45374-1. ISSN 0302-9743. S2CID 14132755.
[5] Treeratpituk, Pucktada; Giles, C. Lee (2009). "Disambiguating authors in academic publications using random forests". Proceedings of the 9th ACM/IEEE-CS Joint Conference on Digital Libraries. ACM. pp. 39–48. CiteSeerX 10.1.1.147.3500. doi:10.1145/1555400.1555408.
[6] Tang, Jie; Fong, A.C.M.; Wang, Bo; Zhang, Jing (2012). "A Unified Probabilistic Framework for Name Disambiguation in Digital Library". IEEE Transactions on Knowledge and Data Engineering. 24 (6). IEEE: 975–987. doi:10.1109/TKDE.2011.13. S2CID 1032074.
[7] Wang, Xuezhi; Tang, Jie; Cheng, Hong; Yu, Philip S. (2011). "ADANA: Active Name Disambiguation". Proceedings of 2011 IEEE International Conference on Data Mining. Vancouver: IEEE. pp. 794–803. doi:10.1109/ICDM.2011.19. ISBN 978-1-4577-2075-8.
[8] Boukhers, Zeyd; Asundi, Nagaraj Bahubali (2022). "Whois? Deep Author Name Disambiguation Using Bibliographic Data". Linking Theory and Practice of Digital Libraries. Lecture Notes in Computer Science. Vol. 13541. Padua: Springer. pp. 201–215. arXiv:2207.04772. doi:10.1007/978-3-031-16802-4_16. ISBN 978-3-031-16801-7.
[9] Zhang, Yutao; Zhang, Fanjin; Yao, Peiran; Tang, Jie (2018). "Name Disambiguation in AMiner: Clustering, Maintenance, and Human in the Loop". Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. London: ACM. pp. 1002–1011.
[10] Zhang, Baichuan; Al Hasan, Mohammad (2017). "Name disambiguation in anonymized graphs using network embedding". Proceedings of the 2017 ACM on Conference on Information and Knowledge Management. Singapore: ACM. pp. 1239–1248.
[11] Perozzi, Bryan; Al-Rfou, Rami; Skiena, Steven (2014). "Deepwalk: Online learning of social representations". Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. New York: ACM. pp. 701–710.
[12] Qiu, Jiezhong; Dong, Yuxiao; Ma, Hao; Li, Jian; Wang, Kuansan; Tang, Jie (2018). "Network Embedding as Matrix Factorization: Unifying DeepWalk, LINE, PTE, and node2vec". Proceedings of the Eleventh ACM International Conference on Web Search and Data Mining. Marina Del Rey: ACM. pp. 459–467.
[13] Smalheiser, Neil R.; Torvik, Vetle I. (2009). "Author name disambiguation". Annual Review of Information Science and Technology. 43: 1–43. doi:10.1002/aris.2009.1440430113.
[14] Morrison, Greg; Riccaboni, Massimo; Pammolli, Fabio (16 May 2017). "Disambiguation of patent inventors and assignees using high-resolution geolocation data". Scientific Data. 4: 170064. Bibcode:2017NatSD...470064M. doi:10.1038/sdata.2017.64. PMC 5433392. PMID 28509897.
[15] Tang, Jie; Zhang, Jing; Yao, Limin; Li, Juanzi; Zhang, Li; Su, Zhong (2008). "ArnetMiner: extraction and mining of academic social networks". Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. New York: ACM. pp. 990–998.
[16] Zhang, Ziqi; Nuzzolese, Andrea; Gentile, Anna Lisa (2017). "Entity Deduplication on ScholarlyData". Proceedings of the Extended Semantic Web Conference. Springer-Verlag. pp. 85–100. doi:10.1007/978-3-319-58068-5_6.
[17] Subramanian, Shivashankar; King, Daniel; Downey, Doug; Feldman, Sergey (21 Mar 2021). "S2AND: A Benchmark and Evaluation System for Author Name Disambiguation". arXiv:2103.07534 [cs.DL].
