There are many tasks that involve coding, for example, putting kids into groups according to their age, labeling the webpages about their kinds, or putting students in Hogwarts into four colleges… And researchers or lawyers need to code people, according to their filled-in information, into occupations. Melissa Friesen, an investigator in Division of Cancer Epidemiology and Genetics (DCEG), National Cancer Institute (NCI), National Institutes of Health (NIH), saw the need of large-scale coding. Many researchers are dealing with big data concerning epidemiology. She led a research project, in collaboration with Office of Intramural Research (OIR), Center for Information Technology (CIT), National Institutes of Health (NIH), to develop an artificial intelligence system to cope with the problem. This leads to a publicly available tool called SOCcer, an acronym for “Standardized Occupation Coding for Computer-assisted Epidemiological Research.” (URL: http://soccer.nci.nih.gov/soccer/)
The system was initially developed in an attempt to find the correlation between the onset of cancers and other diseases and the occupation. “The application is not intended to replace expert coders, but rather to prioritize which job descriptions would benefit most from expert review,” said Friesen in an interview. She mainly works with Daniel Russ in CIT.
SOCcer takes job title, industry codes (in terms of SIC, Standard Industrial Classification), and job duties, and gives an occupational code called SOC 2010 (Standard Occupational Classification), used by U. S. federal government agencies. The data involves short text, often messy. There are 840 codes in SOC 2010 systems. Conventional natural language processing (NLP) methods may not apply. Friesen, Russ, and Kwan-Yuet (Stephen) Ho (also in OIR, CIT; a CSRA staff) use fuzzy logic, and maximum entropy (maxent) methods, with some feature engineering, to build various classifiers. These classifiers are aggregated together, as in stacked generalization (see my previous entry), using logistic regression, to give a final score.
SOCcer has a companion software, called SOCAssign, for expert coders to prioritize the codings. It was awarded with DCEG Informatics Tool Challenge 2015. SOCcer itself was awarded in 2016. And the SOCcer team was awarded for Scientific Award of Merit by CIT/OCIO in 2016 as well (see this). Their work was published in Occup. Environ. Med.
Related Reports and Articles:
- SOCcer (Standardized Occupation Coding for Computer-assisted Epidemiological Research). [link]
- “Computer-based coding increases efficiency of risk assessments in studies of occupational exposures,” News, DCEG, NCI, NIH. [link]
- Daniel E. Russ, Kwan-Yuet Ho, Joanne S. Colt, Karla R. Armenti, Dalsu Baris, Wong-Ho Chow, Faith Davis, Alison Johnson, Mark P. Purdue, Margaret R. Karagas, Kendra Schwartz, Molly Schwenn, Debra T. Silverman, Patricia A. Stewart, Calvin A. Johnson, Melissa C. Friesen, “Computer-based coding of free-text job descriptions to efficiently and reliably incorporate occupational risk factors into large-scale epidemiologic studies,” Occup. Environ. Med. 73, 417-424 (2016). [PubMed]
- Daniel E. Russ, Kwan-Yuet Ho, Calvin A. Johnson, Melissa C. Friesen, “Computer-Based Coding of Occupational Codes for Epidemiological Analyses”, 2014 IEEE 27th International Symposium on Computer-Based Medical Systems (CBMS), pp. 347-350 (2014). [IEEE]
Other Useful Links:
- SOC 2010. [link]
- Division of Cancer Epidemiology and Genetics (DCEG), National Cancer Institute (NCI), National Institutes of Health (NIH). [link]
- Office of Intramural Research (OIR), Center for Information Technology (CIT), National Institutes of Health (NIH). [link] (formerly known as Division of Computational Biosciences (DCB))
- CSRA, Inc. [link]
- DCEG Analysis Tools. [link]
- DCEG Informatics Tool Challenge 2015. [link]
- “Friesen and Colleagues Recognized with Scientific Award from NIH,” News, DCEG , NCI, NIH. [link]