# HG changeset patch
# User jpayne
# Date 1613766500 18000
# Node ID a90a883f88f976dd0df9dd394c0ed881391d4c91
# Parent  504004e783632293c3dc6c37b5b519982e8a2758
"planemo upload for repository https://toolrepo.galaxytrakr.org/"

diff -r 504004e78363 -r a90a883f88f9 subsamplr.py
--- a/subsamplr.py	Fri Feb 19 14:57:58 2021 -0500
+++ b/subsamplr.py	Fri Feb 19 15:28:20 2021 -0500
@@ -69,15 +69,20 @@
         inns = [iter(grouper(inn, 4)) for inn in ins] # stateful 4-ply iterator over lines in the input
         outs = [stack.enter_context(openn(path, 'w')) for openn, path in zip(file_openers, outs)] # opened output files
 
+        for file in ins:
+            print(file.name)
+
         # https://en.m.wikipedia.org/wiki/Reservoir_sampling
 
         reservoir = [] # this is going to be 1 or 2-tuples of 4-tuples representing the 4 lines of the fastq file
 
         # we determine its current coverage (and thus its reservoir size) to fill it, which consumes reads
         # from the open files
-        for readpair in zip(*inns):
+        reads = 0
+        for i, readpair in enumerate(zip(*inns)):
+            reads += len(readpair[0][1])
             reservoir.append(readpair)
-            if coverage(reservoir, gen_size) > cov:
+            if reads / gen_size > cov:
                 break
 
         k = len(reservoir) # this is about how big the reservoir needs to be to get cov coverage