diff -r 000000000000 -r 91438d32ed58 README.md
--- /dev/null Thu Jan 01 00:00:00 1970 +0000
+++ b/README.md Wed Sep 14 10:39:29 2022 -0400
@@ -0,0 +1,79 @@
+# LexMapr2
+> Updated [LexMapr](https://github.com/cidgoh/LexMapr) with added functionality to:
+> - pull ontology accession ids and definitions from [EMBL-EBI](https://www.ebi.ac.uk/ols/ontologies)
+> via the API
+> - group mapped results by parent ontologies
+> - visualize mapping results
+
+LexMapr2 will attempt to match short free-form text to terms and exact synonyms already existing in the specified ontologies without much contextualization. It is important to ensure that the chosen ontologies are relevant to the input. For example, 'ground' can match to food (ground):FOODON_00002713 or earth:EOL_0001587. 'Turkey' can match to the country (GAZ_00000558) or the bird (NCBITaxon_9103). 'Pet food for dog' will match Canis lupus familiaris:NCBITaxon_9615.
+
+
+## Table of Contents
+* [Setup](#setup)
+* [Usage](#usage)
+* [Customization](#customization)
+* [Ontology map](#ontology-map)
+* [Authors](#authors)
+
+
+## Setup
+The code will run as is when retrieved from GitHub with Python >= 3.7.
+
+The following third-party Python packages are required: inflection, matplotlib, nltk, pandas, pygraphviz, python-dateutil, requests, seaborn. The remaining imports (argparse, collections, copy, csv, datetime, itertools, json, logging, pickle, shutil, sqlite3, time, unicodedata) are part of the Python standard library. Any missing third-party packages can be installed with pip.
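+
+For example, the third-party dependencies can be installed in one step (package names as listed in the Galaxy wrapper; versions are left unpinned here, and pygraphviz additionally requires the Graphviz system libraries):
+
+```
+pip install python-dateutil inflection matplotlib pandas nltk requests seaborn pygraphviz
+```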
+
+LexMapr2 will eventually be uploaded to PyPI as a package.
+
+
+## Usage
+Input and output CSV/TSV formats are the same as in [LexMapr v 0.7](https://github.com/cidgoh/LexMapr). The current version also generates graphs and a log file. A stable Internet connection is required because ontology information is retrieved from EMBL-EBI during a run.
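+
+For reference, the kind of lookup involved can be reproduced against the public OLS search endpoint, e.g. (an illustration of the API LexMapr2 draws on, not a command the tool runs verbatim):
+
+```
+curl 'https://www.ebi.ac.uk/ols/api/search?q=chicken%20breast&ontology=foodon&exact=true'
+```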
+
+
+```
+usage: lexmapr2.py [-h] [-o] [-a] [-b] [-e] [-f] [-g] [-j] [-r] [-u] [-v] input
+
+positional arguments:
+  input               input CSV or TSV file; required
+
+optional arguments:
+  -h, --help          show this help message and exit
+  -o, --output        output TSV file path; default is stdout
+  -a, --no_ancestors  remove ancestral terms from output
+  -b, --bin           classify samples into default bins
+  -e, --embl_ontol    user-defined comma-separated ontology short names
+  -f, --full          full output format
+  -g, --graph         visualize summaries of mapping and binning
+  -j, --graph_only    only perform visualization with LexMapr2 output
+  -r, --remake_cache  remake cached resources
+  -u, --user_bin      path to JSON file with user-defined bins
+  -v, --version       show program's version number and exit
+```
+
+Flags -a, -b, -g may substantially add to the runtime.
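+
+For example, mapping the bundled test file t.csv with default binning and writing the results to a TSV (an illustrative invocation; the bundled o.tsv shows what such output looks like):
+
+```
+python lexmapr2.py t.csv -b -o o.tsv
+```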
+
+## Customization
+By default, the FOODON and NCBITaxon ontologies are used. Users can supply a comma-delimited list of [ontology short names](https://www.ebi.ac.uk/ols/ontologies) with the '-e' flag. Bins are used to group matched terms under their parent ontology terms. Users can override the default bins by supplying a JSON file with the '-u' option.
+
+Example JSON format to use a bin titled 'ncbi_taxon':
+
+```
+{
+    "ncbi_taxon":{
+        "Actinopterygii":"NCBITaxon_7898",
+        "Ecdysozoa":"NCBITaxon_1206794",
+        "Echinodermata":"NCBITaxon_7586",
+        "Fungi":"NCBITaxon_4751",
+        "Mammalia":"NCBITaxon_40674",
+        "Sauropsida":"NCBITaxon_8457",
+        "Spiralia":"NCBITaxon_2697495",
+        "Viridiplantae":"NCBITaxon_33090"
+    }
+}
+```
+
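+For example, restricting matching to the FOODON and NCBITaxon ontologies and overriding the default bins with the JSON file above (an illustrative invocation using the bundled t.json; short names are upper-cased by the argument parser):
+
+```
+python lexmapr2.py t.csv -e FOODON,NCBITAXON -u t.json -o o.tsv
+```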
+
+## Ontology map
+If the graph flag, '-g', is used, binning results are mapped as shown in the example below. Yellow nodes are parent bins; blue nodes are terms that were identified during matching. The map is not drawn if there are more than 100 nodes or more than 150 edges; in that case the program writes the list of nodes or edges to the log instead. No attempt is made at all if there are more than 1000 rows. It is recommended to curate the LexMapr2 output file down to the rows of interest and rerun with the graph_only flag, '-j', using the shortened output file as the input file.
+
+![Example ontology map](img/example_map.png)
+
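+For example, after trimming a previous LexMapr2 output down to the rows of interest, the map can be redrawn without rerunning the mapping step (curated_output.tsv is an illustrative file name):
+
+```
+python lexmapr2.py curated_output.tsv -j
+```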
+
+## Authors
+Kayla K. Pennerman,
+Maria Balkey,
+Ruth E. Timme
diff -r 000000000000 -r 91438d32ed58 cfsan_lexmapr2.xml
--- /dev/null Thu Jan 01 00:00:00 1970 +0000
+++ b/cfsan_lexmapr2.xml Wed Sep 14 10:39:29 2022 -0400
@@ -0,0 +1,129 @@
+
+ A Lexicon and Rule-Based Tool for Translating Short Biomedical Specimen Descriptions into Semantic Web Ontology Terms
+
+ python
+ python-dateutil
+ inflection
+ matplotlib
+ pandas
+ nltk
+ requests
+ seaborn
+ pygraphviz
+
+ python $__tool_directory__/lexmapr2.py --version
+
+
+ /tool/tool-data/cfsan_lexmapr2/0/nltk_data
+
+
+ @misc{github,
+ author = {Penn, Kayla},
+ year = {2022},
+ title = {LexMapr2},
+ publisher = {GitHub},
+ journal = {GitHub repository},
+ url = {https://github.com/CFSAN-Biostatistics/LexMapr2}}
+
+
+ @misc{GalaxyToolWrapper,
+ author = {Konganti, Kranti},
+ year = {2022}}
+
+
+
diff -r 000000000000 -r 91438d32ed58 img/example_map.png
Binary file img/example_map.png has changed
diff -r 000000000000 -r 91438d32ed58 lexmapr2.py
--- /dev/null Thu Jan 01 00:00:00 1970 +0000
+++ b/lexmapr2.py Wed Sep 14 10:39:29 2022 -0400
@@ -0,0 +1,86 @@
+"""Entry script"""
+
+__version__ = '1.0.0'
+import argparse, datetime, json, logging, os, pandas, sys
+import lexmapr.pipeline, lexmapr.run_summary
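+# arg_bins holds the default bin definitions; it is replaced in __main__ below
+# when the user supplies a custom JSON file via -u/--user_bin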
+from lexmapr.definitions import arg_bins
+
+
+def valid_input(file_path):
+    '''Exits if the input file is missing or lacks a CSV/TSV extension'''
+    _, file_ext = os.path.splitext(file_path)
+    if file_ext.lower() not in ('.csv', '.tsv'):
+        sys.exit('Please supply a CSV or TSV input file with the correct file extension')
+    if not os.path.exists(file_path):
+        sys.exit(f'Input file named \"{file_path}\" not found')
+    return file_path.strip()
+
+def valid_json(file_path):
+    '''Returns the parsed JSON file; exits if the file is missing or invalid'''
+    try:
+        with open(file_path, 'r') as JSON_file:
+            try:
+                return json.load(JSON_file)
+            except json.decoder.JSONDecodeError:
+                sys.exit('User-defined bins not in readable JSON format')
+    except FileNotFoundError:
+        sys.exit(f'File named \"{file_path}\" not found')
+
+def valid_list(list_str):
+    '''Returns a list of user-defined ontology short names'''
+    return [x.strip().upper() for x in list_str.split(',')]
+
+if __name__ == "__main__":
+    # Parse arguments, initiate log file and start run
+    arg_parser = argparse.ArgumentParser(formatter_class=argparse.RawTextHelpFormatter)
+    arg_parser.add_argument('input', help='input CSV or TSV file; required', type=valid_input)
+    arg_parser.add_argument('-o', '--output', metavar='\b',
+                            help=' output TSV file path; default is stdout')
+    arg_parser.add_argument('-a', '--no_ancestors', action='store_true',
+                            help='remove ancestral terms from output')
+    arg_parser.add_argument('-b', '--bin', action='store_true',
+                            help='classify samples into default bins')
+    arg_parser.add_argument('-e', '--embl_ontol', metavar='\b', type=valid_list,
+                            help=' user-defined comma-separated ontology short names')
+    arg_parser.add_argument('-f', '--full', action='store_true', help='full output format')
+    arg_parser.add_argument('-g', '--graph', action='store_true',
+                            help='visualize summaries of mapping and binning')
+    arg_parser.add_argument('-j', '--graph_only', action='store_true',
+                            help='only perform visualization with LexMapr output')
+    arg_parser.add_argument('-r', '--remake_cache', action='store_true',
+                            help='remake cached resources')
+    arg_parser.add_argument('-u', '--user_bin', metavar='\b', type=valid_json,
+                            help=' path to JSON file with user-defined bins')
+    arg_parser.add_argument('-v', '--version', action='version',
+                            version='%(prog)s '+__version__)
+
+    # TODO: encoding argument added to logging.basicConfig in Python 3.9; now defaults to open()
+    run_args = arg_parser.parse_args()
+    if run_args.user_bin is not None:
+        run_args.bin = True
+        arg_bins = run_args.user_bin
+
+    logging.basicConfig(filename='lexmapr_run.log', level=logging.DEBUG)
+
+    if run_args.graph_only:
+        try:
+            mapping_results = pandas.read_csv(run_args.input, delimiter='\t')
+        except Exception:
+            sys.exit('Input file not readable or not in expected format')
+        needed_columns = ['Matched_Components', 'Match_Status (Macro Level)'] + list(arg_bins.keys())
+        missing_columns = set(needed_columns).difference(set(mapping_results.columns))
+        if missing_columns:
+            sys.exit(f'Missing column(s) {missing_columns} from input file')
+        t0 = datetime.datetime.now()
+        logging.info(f'Run start: {t0}')
+        logging.info('Graphing only')
+        print('\nGraphing only...')
+        lexmapr.run_summary.figure_folder()
+        lexmapr.run_summary.report_results(run_args.input, list(arg_bins.keys()))
+        lexmapr.run_summary.visualize_results(run_args.input, list(arg_bins.keys()))
+        print('\t'+f'Done! {datetime.datetime.now()-t0} passed'.ljust(60)+'\n')
+    else:
+        logging.info(f'Run start: {datetime.datetime.now()}')
+        lexmapr.pipeline.run(run_args)
+
+    logging.info(f'Run end: {datetime.datetime.now()}\n')
diff -r 000000000000 -r 91438d32ed58 o.tsv
--- /dev/null Thu Jan 01 00:00:00 1970 +0000
+++ b/o.tsv Wed Sep 14 10:39:29 2022 -0400
@@ -0,0 +1,8 @@
+Sample_Id Sample_Desc Processed_Sample Annotated_Sample Matched_Components Match_Status (Macro Level) Match_Status (Micro Level) Sample_Transformations fo_product fo_quality fo_organism ncbi_taxon
+small_simple1 Chicken Breast Not annotated chicken breast:FOODON_00002703 Full Term Match [] {} poultry food product:FOODON_00001283|vertebrate animal food product:FOODON_00001092
+small_simple2 Baked Potato Not annotated potato (whole, baked):FOODON_03302196 Full Term Match ['Used Processed Sample', 'Suffix Addition'] {} plant food product:FOODON_00001015 food (baked):FOODON_00002456|food (cooked):FOODON_00001181
+small_simple3 Canned Corn Not annotated corn (canned):FOODON_03302665 Full Term Match [] {} plant food product:FOODON_00001015
+small_simple4 Frozen Yogurt Not annotated frozen yogurt:FOODON_03307445 Full Term Match [] {} dairy food product:FOODON_00001256|vertebrate animal food product:FOODON_00001092
+small_simple5 Apple Pie Not annotated apple pie:FOODON_00002475 Full Term Match [] {} bakery food product:FOODON_00001626|plant food product:FOODON_00001015|prepared food product:FOODON_00001180
+small_simple5 Apple cider apple cider Not annotated apple food product:FOODON_00001611 Component Match ['Inflection (Plural) Treatment: apple', 'Inflection (Plural) Treatment: cider', 'Suffix Addition'] {} plant food product:FOODON_00001015
+small_simple6 Chicken egg chicken egg Not annotated chicken:FOODON_03411457|egg food product:FOODON_00001274 Component Match ['Inflection (Plural) Treatment: chicken', 'Inflection (Plural) Treatment: egg', 'Suffix Addition'] {} poultry food product:FOODON_00001283|vertebrate animal food product:FOODON_00001092 Aves:NCBITaxon_8782
\ No newline at end of file
diff -r 000000000000 -r 91438d32ed58 t.csv
--- /dev/null Thu Jan 01 00:00:00 1970 +0000
+++ b/t.csv Wed Sep 14 10:39:29 2022 -0400
@@ -0,0 +1,8 @@
+SampleId,Sample
+small_simple1,Chicken Breast
+small_simple2,Baked Potato
+small_simple3,Canned Corn
+small_simple4,Frozen Yogurt
+small_simple5,Apple Pie
+small_simple5,Apple cider
+small_simple6,Chicken egg
\ No newline at end of file
diff -r 000000000000 -r 91438d32ed58 t.json
--- /dev/null Thu Jan 01 00:00:00 1970 +0000
+++ b/t.json Wed Sep 14 10:39:29 2022 -0400
@@ -0,0 +1,12 @@
+{
+    "ncbi_taxon":{
+        "Actinopterygii":"NCBITaxon_7898",
+        "Ecdysozoa":"NCBITaxon_1206794",
+        "Echinodermata":"NCBITaxon_7586",
+        "Fungi":"NCBITaxon_4751",
+        "Mammalia":"NCBITaxon_40674",
+        "Sauropsida":"NCBITaxon_8457",
+        "Spiralia":"NCBITaxon_2697495",
+        "Viridiplantae":"NCBITaxon_33090"
+    }
+}
\ No newline at end of file