bioproject_to_srr_2: charset_normalizer-3.3.2.dist-info/METADATA annotate

annotate charset_normalizer-3.3.2.dist-info/METADATA @ 16:dc2c003078e9 tip

planemo upload for repository https://toolrepo.galaxytrakr.org/view/jpayne/bioproject_to_srr_2/556cac4fb538

author	jpayne
date	Tue, 21 May 2024 01:09:25 -0400
parents	5eb2d5e3bf22

rev	line source
jpayne@7	1 Metadata-Version: 2.1
jpayne@7	2 Name: charset-normalizer
jpayne@7	3 Version: 3.3.2
jpayne@7	4 Summary: The Real First Universal Charset Detector. Open, modern and actively maintained alternative to Chardet.
jpayne@7	5 Home-page: https://github.com/Ousret/charset_normalizer
jpayne@7	6 Author: Ahmed TAHRI
jpayne@7	7 Author-email: ahmed.tahri@cloudnursery.dev
jpayne@7	8 License: MIT
jpayne@7	9 Project-URL: Bug Reports, https://github.com/Ousret/charset_normalizer/issues
jpayne@7	10 Project-URL: Documentation, https://charset-normalizer.readthedocs.io/en/latest
jpayne@7	11 Keywords: encoding,charset,charset-detector,detector,normalization,unicode,chardet,detect
jpayne@7	12 Classifier: Development Status :: 5 - Production/Stable
jpayne@7	13 Classifier: License :: OSI Approved :: MIT License
jpayne@7	14 Classifier: Intended Audience :: Developers
jpayne@7	15 Classifier: Topic :: Software Development :: Libraries :: Python Modules
jpayne@7	16 Classifier: Operating System :: OS Independent
jpayne@7	17 Classifier: Programming Language :: Python
jpayne@7	18 Classifier: Programming Language :: Python :: 3
jpayne@7	19 Classifier: Programming Language :: Python :: 3.7
jpayne@7	20 Classifier: Programming Language :: Python :: 3.8
jpayne@7	21 Classifier: Programming Language :: Python :: 3.9
jpayne@7	22 Classifier: Programming Language :: Python :: 3.10
jpayne@7	23 Classifier: Programming Language :: Python :: 3.11
jpayne@7	24 Classifier: Programming Language :: Python :: 3.12
jpayne@7	25 Classifier: Programming Language :: Python :: Implementation :: PyPy
jpayne@7	26 Classifier: Topic :: Text Processing :: Linguistic
jpayne@7	27 Classifier: Topic :: Utilities
jpayne@7	28 Classifier: Typing :: Typed
jpayne@7	29 Requires-Python: >=3.7.0
jpayne@7	30 Description-Content-Type: text/markdown
jpayne@7	31 License-File: LICENSE
jpayne@7	32 Provides-Extra: unicode_backport
jpayne@7	33
jpayne@7	34 <h1 align="center">Charset Detection, for Everyone 👋</h1>
jpayne@7	35
jpayne@7	36 <p align="center">
jpayne@7	37 <sup>The Real First Universal Charset Detector</sup><br>
jpayne@7	38 <a href="https://pypi.org/project/charset-normalizer">
jpayne@7	39 <img src="https://img.shields.io/pypi/pyversions/charset_normalizer.svg?orange=blue" />
jpayne@7	40 </a>
jpayne@7	41 <a href="https://pepy.tech/project/charset-normalizer/">
jpayne@7	42 <img alt="Download Count Total" src="https://static.pepy.tech/badge/charset-normalizer/month" />
jpayne@7	43 </a>
jpayne@7	44 <a href="https://bestpractices.coreinfrastructure.org/projects/7297">
jpayne@7	45 <img src="https://bestpractices.coreinfrastructure.org/projects/7297/badge">
jpayne@7	46 </a>
jpayne@7	47 </p>
jpayne@7	48 <p align="center">
jpayne@7	49 <sup><i>Featured Packages</i></sup><br>
jpayne@7	50 <a href="https://github.com/jawah/niquests">
jpayne@7	51 <img alt="Static Badge" src="https://img.shields.io/badge/Niquests-HTTP_1.1%2C%202%2C_and_3_Client-cyan">
jpayne@7	52 </a>
jpayne@7	53 <a href="https://github.com/jawah/wassima">
jpayne@7	54 <img alt="Static Badge" src="https://img.shields.io/badge/Wassima-Certifi_Killer-cyan">
jpayne@7	55 </a>
jpayne@7	56 </p>
jpayne@7	57 <p align="center">
jpayne@7	58 <sup><i>In other language (unofficial port - by the community)</i></sup><br>
jpayne@7	59 <a href="https://github.com/nickspring/charset-normalizer-rs">
jpayne@7	60 <img alt="Static Badge" src="https://img.shields.io/badge/Rust-red">
jpayne@7	61 </a>
jpayne@7	62 </p>
jpayne@7	63
jpayne@7	64 > A library that helps you read text from an unknown charset encoding.<br /> Motivated by `chardet`,
jpayne@7	65 > I'm trying to resolve the issue by taking a new approach.
jpayne@7	66 > All IANA character set names for which the Python core library provides codecs are supported.
jpayne@7	67
jpayne@7	68 <p align="center">
jpayne@7	69 >>>>> <a href="https://charsetnormalizerweb.ousret.now.sh" target="_blank">👉 Try Me Online Now, Then Adopt Me 👈 </a> <<<<<
jpayne@7	70 </p>
jpayne@7	71
jpayne@7	72 This project offers you an alternative to Universal Charset Encoding Detector, also known as Chardet.
jpayne@7	73
jpayne@7	74 | Feature | [Chardet](https://github.com/chardet/chardet) | Charset Normalizer | [cChardet](https://github.com/PyYoshi/cChardet) |
jpayne@7	75 |--------------------------------------------------|:---------------------------------------------:|:--------------------------------------------------------------------------------------------------:|:-----------------------------------------------:|
jpayne@7	76 | `Fast` | ❌ | ✅ | ✅ |
jpayne@7	77 | `Universal**` | ❌ | ✅ | ❌ |
jpayne@7	78 | `Reliable` without distinguishable standards | ❌ | ✅ | ✅ |
jpayne@7	79 | `Reliable` with distinguishable standards | ✅ | ✅ | ✅ |
jpayne@7	80 | `License` | LGPL-2.1<br>_restrictive_ | MIT | MPL-1.1<br>_restrictive_ |
jpayne@7	81 | `Native Python` | ✅ | ✅ | ❌ |
jpayne@7	82 | `Detect spoken language` | ❌ | ✅ | N/A |
jpayne@7	83 | `UnicodeDecodeError Safety` | ❌ | ✅ | ❌ |
jpayne@7	84 | `Whl Size (min)` | 193.6 kB | 42 kB | ~200 kB |
jpayne@7	85 | `Supported Encoding` | 33 | 🎉 [99](https://charset-normalizer.readthedocs.io/en/latest/user/support.html#supported-encodings) | 40 |
jpayne@7	86
jpayne@7	87 <p align="center">
jpayne@7	88 <img src="https://i.imgflip.com/373iay.gif" alt="Reading Normalized Text" width="226"/><img src="https://media.tenor.com/images/c0180f70732a18b4965448d33adba3d0/tenor.gif" alt="Cat Reading Text" width="200"/>
jpayne@7	89 </p>
jpayne@7	90
jpayne@7	91 \*\* : They are clearly using specific code for a specific encoding, even if it covers most of the commonly used ones.<br>
jpayne@7	92 Did you get here because of the logs? See [https://charset-normalizer.readthedocs.io/en/latest/user/miscellaneous.html](https://charset-normalizer.readthedocs.io/en/latest/user/miscellaneous.html)
jpayne@7	93
jpayne@7	94 ## ⚡ Performance
jpayne@7	95
jpayne@7	96 This package offers better performance than its counterpart Chardet. Here are some numbers.
jpayne@7	97
jpayne@7	98 | Package | Accuracy | Mean per file (ms) | File per sec (est) |
jpayne@7	99 |-----------------------------------------------|:--------:|:------------------:|:------------------:|
jpayne@7	100 | [chardet](https://github.com/chardet/chardet) | 86 % | 200 ms | 5 file/sec |
jpayne@7	101 | charset-normalizer | 98 % | 10 ms | 100 file/sec |
jpayne@7	102
jpayne@7	103 | Package | 99th percentile | 95th percentile | 50th percentile |
jpayne@7	104 |-----------------------------------------------|:---------------:|:---------------:|:---------------:|
jpayne@7	105 | [chardet](https://github.com/chardet/chardet) | 1200 ms | 287 ms | 23 ms |
jpayne@7	106 | charset-normalizer | 100 ms | 50 ms | 5 ms |
jpayne@7	107
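The "File per sec (est)" column is consistent with dividing one second by the mean time per file. A minimal sketch of that arithmetic, using only the figures from the first table (the variable names are illustrative):

```python
# Mean latency per file in milliseconds, copied from the table above.
mean_ms = {"chardet": 200, "charset-normalizer": 10}

# Estimated throughput: 1000 ms per second divided by the mean time per file.
files_per_sec = {name: 1000 / ms for name, ms in mean_ms.items()}

# Relative speedup implied by the two means.
speedup = mean_ms["chardet"] / mean_ms["charset-normalizer"]

print(files_per_sec)
print(f"speedup: {speedup:.0f}x")
```

These are averages; the percentile table shows that the gap widens on the slowest files.
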
jpayne@7	108 Chardet's performance on larger files (1MB+) is very poor. Expect a huge difference on large payloads.
jpayne@7	109
jpayne@7	110 > Stats are generated from 400+ files using default parameters. For more details on the files used, see the GHA workflows.
jpayne@7	111 > And yes, these results might change at any time. The dataset can be updated to include more files.
jpayne@7	112 > The actual delays heavily depend on your CPU capabilities. The factors should remain the same.
jpayne@7	113 > Keep in mind that the stats are generous and that Chardet's accuracy versus ours is measured using Chardet's initial capability
jpayne@7	114 > (e.g. Supported Encoding). Challenge them if you want.
jpayne@7	115
jpayne@7	116 ## ✨ Installation
jpayne@7	117
jpayne@7	118 Using pip:
jpayne@7	119
jpayne@7	120 ```sh
jpayne@7	121 pip install charset-normalizer -U
jpayne@7	122 ```
jpayne@7	123
jpayne@7	124 ## 🚀 Basic Usage
jpayne@7	125
jpayne@7	126 ### CLI
jpayne@7	127 This package comes with a CLI.
jpayne@7	128
jpayne@7	129 ```
jpayne@7	130 usage: normalizer [-h] [-v] [-a] [-n] [-m] [-r] [-f] [-t THRESHOLD]
jpayne@7	131 file [file ...]
jpayne@7	132
jpayne@7	133 The Real First Universal Charset Detector. Discover originating encoding used
jpayne@7	134 on text file. Normalize text to unicode.
jpayne@7	135
jpayne@7	136 positional arguments:
jpayne@7	137 files File(s) to be analysed
jpayne@7	138
jpayne@7	139 optional arguments:
jpayne@7	140 -h, --help show this help message and exit
jpayne@7	141 -v, --verbose Display complementary information about file if any.
jpayne@7	142 Stdout will contain logs about the detection process.
jpayne@7	143 -a, --with-alternative
jpayne@7	144 Output complementary possibilities if any. Top-level
jpayne@7	145 JSON WILL be a list.
jpayne@7	146 -n, --normalize Permit to normalize input file. If not set, program
jpayne@7	147 does not write anything.
jpayne@7	148 -m, --minimal Only output the charset detected to STDOUT. Disabling
jpayne@7	149 JSON output.
jpayne@7	150 -r, --replace Replace file when trying to normalize it instead of
jpayne@7	151 creating a new one.
jpayne@7	152 -f, --force Replace file without asking if you are sure, use this
jpayne@7	153 flag with caution.
jpayne@7	154 -t THRESHOLD, --threshold THRESHOLD
jpayne@7	155 Define a custom maximum amount of chaos allowed in
jpayne@7	156 decoded content. 0. <= chaos <= 1.
jpayne@7	157 --version Show version information and exit.
jpayne@7	158 ```
jpayne@7	159
jpayne@7	160 ```bash
jpayne@7	161 normalizer ./data/sample.1.fr.srt
jpayne@7	162 ```
jpayne@7	163
jpayne@7	164 or
jpayne@7	165
jpayne@7	166 ```bash
jpayne@7	167 python -m charset_normalizer ./data/sample.1.fr.srt
jpayne@7	168 ```
jpayne@7	169
jpayne@7	170 🎉 Since version 1.4.0 the CLI produces an easily usable stdout result in JSON format.
jpayne@7	171
jpayne@7	172 ```json
jpayne@7	173 {
jpayne@7	174 "path": "/home/default/projects/charset_normalizer/data/sample.1.fr.srt",
jpayne@7	175 "encoding": "cp1252",
jpayne@7	176 "encoding_aliases": [
jpayne@7	177 "1252",
jpayne@7	178 "windows_1252"
jpayne@7	179 ],
jpayne@7	180 "alternative_encodings": [
jpayne@7	181 "cp1254",
jpayne@7	182 "cp1256",
jpayne@7	183 "cp1258",
jpayne@7	184 "iso8859_14",
jpayne@7	185 "iso8859_15",
jpayne@7	186 "iso8859_16",
jpayne@7	187 "iso8859_3",
jpayne@7	188 "iso8859_9",
jpayne@7	189 "latin_1",
jpayne@7	190 "mbcs"
jpayne@7	191 ],
jpayne@7	192 "language": "French",
jpayne@7	193 "alphabets": [
jpayne@7	194 "Basic Latin",
jpayne@7	195 "Latin-1 Supplement"
jpayne@7	196 ],
jpayne@7	197 "has_sig_or_bom": false,
jpayne@7	198 "chaos": 0.149,
jpayne@7	199 "coherence": 97.152,
jpayne@7	200 "unicode_path": null,
jpayne@7	201 "is_preferred": true
jpayne@7	202 }
jpayne@7	203 ```
jpayne@7	204
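Because the report is plain JSON, a script can consume it with the standard library. The sketch below parses a trimmed copy of the sample above; `preferred_report` is a hypothetical helper, and it accepts both a single object and the list that `--with-alternative` produces according to the help text.

```python
import json

# Trimmed copy of the sample report shown above.
raw = """
{
  "path": "/home/default/projects/charset_normalizer/data/sample.1.fr.srt",
  "encoding": "cp1252",
  "language": "French",
  "has_sig_or_bom": false,
  "chaos": 0.149,
  "coherence": 97.152,
  "is_preferred": true
}
"""


def preferred_report(stdout_text):
    """Hypothetical helper: return the preferred report from the CLI's JSON output."""
    data = json.loads(stdout_text)
    # With --with-alternative the top-level value is a list of reports.
    reports = data if isinstance(data, list) else [data]
    for report in reports:
        if report.get("is_preferred"):
            return report
    # No report flagged as preferred: fall back to the first one, if any.
    return reports[0] if reports else None


report = preferred_report(raw)
print(report["encoding"], report["language"], report["chaos"])
```
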
jpayne@7	205 ### Python
jpayne@7	206 Just print out normalized text
jpayne@7	207 ```python
jpayne@7	208 from charset_normalizer import from_path
jpayne@7	209
jpayne@7	210 results = from_path('./my_subtitle.srt')
jpayne@7	211
jpayne@7	212 print(str(results.best()))
jpayne@7	213 ```
jpayne@7	214
jpayne@7	215 Upgrade your code without effort
jpayne@7	216 ```python
jpayne@7	217 from charset_normalizer import detect
jpayne@7	218 ```
jpayne@7	219
jpayne@7	220 The above code will behave the same as chardet. We ensure that we offer the best (reasonable) backward-compatible result possible.
jpayne@7	221
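A minimal sketch of calling `detect` the way chardet is called. `guess_encoding` is a hypothetical helper, and the UTF-8 fallback branch is an assumption added only so the snippet still runs where the package is not installed.

```python
from typing import Optional

try:
    from charset_normalizer import detect  # chardet-compatible entry point
except ImportError:
    detect = None  # package not installed; the helper falls back to a UTF-8 check


def guess_encoding(payload: bytes) -> Optional[str]:
    """Hypothetical helper: return an encoding name for payload, or None."""
    if detect is not None:
        # As with chardet, the result is a dict with the keys
        # "encoding", "language" and "confidence".
        return detect(payload)["encoding"]
    try:
        payload.decode("utf-8")
        return "utf-8"
    except UnicodeDecodeError:
        return None


sample = "Bonjour, ça va très bien aujourd'hui à l'école ?".encode("utf-8")
print(guess_encoding(sample))
```
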
jpayne@7	222 See the docs for advanced usage : [readthedocs.io](https://charset-normalizer.readthedocs.io/en/latest/)
jpayne@7	223
jpayne@7	224 ## 😇 Why
jpayne@7	225
jpayne@7	226 When I started using Chardet, I noticed that it was not suited to my expectations, and I wanted to propose a
jpayne@7	227 reliable alternative using a completely different method. Also! I never back down on a good challenge!
jpayne@7	228
jpayne@7	229 I don't care about the originating charset encoding, because two different tables can
jpayne@7	230 produce two identical rendered strings.
jpayne@7	231 What I want is to get readable text, the best I can.
jpayne@7	232
jpayne@7	233 In a way, I'm brute forcing text decoding. How cool is that ? 😎
jpayne@7	234
jpayne@7	235 Don't confuse the package ftfy with charset-normalizer or chardet. ftfy's goal is to repair Unicode strings, whereas charset-normalizer converts a raw file in an unknown encoding to Unicode.
jpayne@7	236
jpayne@7	237 ## 🍰 How
jpayne@7	238
jpayne@7	239 - Discard all charset encoding tables that could not fit the binary content.
jpayne@7	240 - Measure noise, or the mess once opened (by chunks) with a corresponding charset encoding.
jpayne@7	241 - Extract matches with the lowest mess detected.
jpayne@7	242 - Additionally, we measure coherence / probe for a language.
jpayne@7	243
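The first three steps can be sketched with the standard library alone. This is a deliberately simplified illustration, not the library's implementation: the candidate list, the `NOISY_CATEGORIES` rule and both function names are assumptions, and the whole text is scored at once rather than by chunks.

```python
import unicodedata

# Assumption: a tiny candidate list for illustration; the library probes far more codecs.
CANDIDATES = ["ascii", "utf_8", "cp1252", "latin_1", "cp1251"]

# Assumption: Unicode categories this toy rule treats as noise
# (controls, unassigned, private use, stray symbols and odd numerals).
NOISY_CATEGORIES = {"Cc", "Cn", "Co", "Sk", "So", "No"}


def mess_ratio(text):
    """Toy noise score: share of characters in a suspicious Unicode category."""
    if not text:
        return 0.0
    noisy = sum(
        1
        for ch in text
        if ch not in "\t\r\n" and unicodedata.category(ch) in NOISY_CATEGORIES
    )
    return noisy / len(text)


def rank_encodings(payload, candidates=CANDIDATES):
    """Return (mess, encoding) pairs, lowest mess first."""
    scored = []
    for name in candidates:
        try:
            text = payload.decode(name)  # drop tables that cannot fit the bytes
        except UnicodeDecodeError:
            continue
        scored.append((mess_ratio(text), name))  # measure the mess once decoded
    return sorted(scored)  # lowest mess comes first


payload = "Où est la bibliothèque ?".encode("utf_8")
ranking = rank_encodings(payload)
print(ranking[0][1])
```

Several tables often tie at zero mess (for instance cp1252 and latin_1 on the same bytes), which is where the coherence probe from the last step helps.
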
jpayne@7	244 Wait a minute, what is noise/mess and coherence according to YOU ?
jpayne@7	245
jpayne@7	246 Noise : I opened hundreds of text files, written by humans, with the wrong encoding table. I observed, then
jpayne@7	247 I established some ground rules about what is obvious when it seems like a mess.
jpayne@7	248 I know that my interpretation of what is noise is probably incomplete, feel free to contribute in order to
jpayne@7	249 improve or rewrite it.
jpayne@7	250
jpayne@7	251 Coherence : For each language on earth, we have computed ranked letter frequencies (as best we can). I figured
jpayne@7	252 that intel is worth something here, so I use those records against the decoded text to check whether I can detect intelligent design.
jpayne@7	253
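A toy version of the coherence idea, assuming a single reference ranking. `ENGLISH_TOP` is the classic approximate "etaoin shrdlu" ordering, not the frequency data the project ships, and `coherence` is a hypothetical function name.

```python
from collections import Counter

# Assumption: approximate most frequent English letters ("etaoin shrdlu").
ENGLISH_TOP = "etaoinshrdlu"


def coherence(text, reference=ENGLISH_TOP, top_n=8):
    """Share of the text's most frequent letters that appear in the reference ranking."""
    letters = [ch for ch in text.lower() if ch.isalpha()]
    if not letters:
        return 0.0
    observed = [ch for ch, _ in Counter(letters).most_common(top_n)]
    hits = sum(1 for ch in observed if ch in reference)
    return hits / len(observed)


plausible = "the weather is nice and the train is on time"
garbled = "ïðèâåò êàê äåëà"  # Cyrillic bytes read with a Latin table
print(coherence(plausible), coherence(garbled))
```

A decoding whose letters match the expected ranking scores high, while text read with the wrong table scores low against every language it is compared with.
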
jpayne@7	254 ## ⚡ Known limitations
jpayne@7	255
jpayne@7	256 - Language detection is unreliable when text contains two or more languages sharing identical letters. (e.g. HTML (English tags) + Turkish content (sharing Latin characters))
jpayne@7	257 - Every charset detector heavily depends on sufficient content. In common cases, do not bother running detection on very tiny content.
jpayne@7	258
jpayne@7	259 ## ⚠️ About Python EOLs
jpayne@7	260
jpayne@7	261 If you are running:
jpayne@7	262
jpayne@7	263 - Python >=2.7,<3.5: Unsupported
jpayne@7	264 - Python 3.5: charset-normalizer < 2.1
jpayne@7	265 - Python 3.6: charset-normalizer < 3.1
jpayne@7	266 - Python 3.7: charset-normalizer < 4.0
jpayne@7	267
jpayne@7	268 Upgrade your Python interpreter as soon as possible.
jpayne@7	269
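The list above can be turned into a small lookup, for example when generating requirement pins per interpreter. `charset_normalizer_constraint` is a hypothetical helper; returning an unpinned requirement for Python 3.8 and later is an assumption, since the list names no cap for those versions.

```python
import sys


def charset_normalizer_constraint(version_info=sys.version_info):
    """Hypothetical helper: map a Python version to the constraint listed above."""
    major_minor = tuple(version_info[:2])
    if major_minor < (3, 5):
        return None  # listed as unsupported
    table = {
        (3, 5): "charset-normalizer<2.1",
        (3, 6): "charset-normalizer<3.1",
        (3, 7): "charset-normalizer<4.0",
    }
    # Assumption: no upper bound is listed for newer interpreters.
    return table.get(major_minor, "charset-normalizer")


print(charset_normalizer_constraint((3, 6)))
```
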
jpayne@7	270 ## 👤 Contributing
jpayne@7	271
jpayne@7	272 Contributions, issues and feature requests are very much welcome.<br />
jpayne@7	273 Feel free to check [issues page](https://github.com/ousret/charset_normalizer/issues) if you want to contribute.
jpayne@7	274
jpayne@7	275 ## 📝 License
jpayne@7	276
jpayne@7	277 Copyright © [Ahmed TAHRI @Ousret](https://github.com/Ousret).<br />
jpayne@7	278 This project is [MIT](https://github.com/Ousret/charset_normalizer/blob/master/LICENSE) licensed.
jpayne@7	279
jpayne@7	280 Characters frequencies used in this project © 2012 [Denny Vrandečić](http://simia.net/letters/)
jpayne@7	281
jpayne@7	282 ## 💼 For Enterprise
jpayne@7	283
jpayne@7	284 Professional support for charset-normalizer is available as part of the [Tidelift
jpayne@7	285 Subscription][1]. Tidelift gives software development teams a single source for
jpayne@7	286 purchasing and maintaining their software, with professional grade assurances
jpayne@7	287 from the experts who know it best, while seamlessly integrating with existing
jpayne@7	288 tools.
jpayne@7	289
jpayne@7	290 [1]: https://tidelift.com/subscription/pkg/pypi-charset-normalizer?utm_source=pypi-charset-normalizer&utm_medium=readme
jpayne@7	291
jpayne@7	292 # Changelog
jpayne@7	293 All notable changes to charset-normalizer will be documented in this file. This project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).
jpayne@7	294 The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/).
jpayne@7	295
jpayne@7	296 ## [3.3.2](https://github.com/Ousret/charset_normalizer/compare/3.3.1...3.3.2) (2023-10-31)
jpayne@7	297
jpayne@7	298 ### Fixed
jpayne@7	299 - Unintentional memory usage regression when using large payloads that match several encodings (#376)
jpayne@7	300 - Regression on some detection case showcased in the documentation (#371)
jpayne@7	301
jpayne@7	302 ### Added
jpayne@7	303 - Noise (md) probe that identifies malformed Arabic representation due to the presence of letters in isolated form (credit to my wife)
jpayne@7	304
jpayne@7	305 ## [3.3.1](https://github.com/Ousret/charset_normalizer/compare/3.3.0...3.3.1) (2023-10-22)
jpayne@7	306
jpayne@7	307 ### Changed
jpayne@7	308 - Optional mypyc compilation upgraded to version 1.6.1 for Python >= 3.8
jpayne@7	309 - Improved the general detection reliability based on reports from the community
jpayne@7	310
jpayne@7	311 ## [3.3.0](https://github.com/Ousret/charset_normalizer/compare/3.2.0...3.3.0) (2023-09-30)
jpayne@7	312
jpayne@7	313 ### Added
jpayne@7	314 - Allow executing the CLI (e.g. normalizer) through `python -m charset_normalizer.cli` or `python -m charset_normalizer`
jpayne@7	315 - Support for 9 forgotten encodings that are supported by Python but unlisted in `encoding.aliases` as they have no alias (#323)
jpayne@7	316
jpayne@7	317 ### Removed
jpayne@7	318 - (internal) Redundant utils.is_ascii function and unused function is_private_use_only
jpayne@7	319 - (internal) charset_normalizer.assets is moved inside charset_normalizer.constant
jpayne@7	320
jpayne@7	321 ### Changed
jpayne@7	322 - (internal) Unicode code blocks in constants are updated using the latest v15.0.0 definition to improve detection
jpayne@7	323 - Optional mypyc compilation upgraded to version 1.5.1 for Python >= 3.8
jpayne@7	324
jpayne@7	325 ### Fixed
jpayne@7	326 - Unable to properly sort CharsetMatch when both chaos/noise and coherence were close due to an unreachable condition in \_\_lt\_\_ (#350)
jpayne@7	327
jpayne@7	328 ## [3.2.0](https://github.com/Ousret/charset_normalizer/compare/3.1.0...3.2.0) (2023-06-07)
jpayne@7	329
jpayne@7	330 ### Changed
jpayne@7	331 - Typehint for function `from_path` no longer enforces `PathLike` as its first argument
jpayne@7	332 - Minor improvement over the global detection reliability
jpayne@7	333
jpayne@7	334 ### Added
jpayne@7	335 - Introduce function `is_binary` that relies on main capabilities, and optimized to detect binaries
jpayne@7	336 - Propagate `enable_fallback` argument throughout `from_bytes`, `from_path`, and `from_fp`, which allows deeper control over the detection (default True)
jpayne@7	337 - Explicit support for Python 3.12
jpayne@7	338
jpayne@7	339 ### Fixed
jpayne@7	340 - Edge case detection failure where a file would contain 'very-long' camel cased word (Issue #289)
jpayne@7	341
jpayne@7	342 ## [3.1.0](https://github.com/Ousret/charset_normalizer/compare/3.0.1...3.1.0) (2023-03-06)
jpayne@7	343
jpayne@7	344 ### Added
jpayne@7	345 - Argument `should_rename_legacy` for legacy function `detect` and disregard any new arguments without errors (PR #262)
jpayne@7	346
jpayne@7	347 ### Removed
jpayne@7	348 - Support for Python 3.6 (PR #260)
jpayne@7	349
jpayne@7	350 ### Changed
jpayne@7	351 - Optional speedup provided by mypy/c 1.0.1
jpayne@7	352
jpayne@7	353 ## [3.0.1](https://github.com/Ousret/charset_normalizer/compare/3.0.0...3.0.1) (2022-11-18)
jpayne@7	354
jpayne@7	355 ### Fixed
jpayne@7	356 - Multi-bytes cutter/chunk generator did not always cut correctly (PR #233)
jpayne@7	357
jpayne@7	358 ### Changed
jpayne@7	359 - Speedup provided by mypy/c 0.990 on Python >= 3.7
jpayne@7	360
jpayne@7	361 ## [3.0.0](https://github.com/Ousret/charset_normalizer/compare/2.1.1...3.0.0) (2022-10-20)
jpayne@7	362
jpayne@7	363 ### Added
jpayne@7	364 - Extend the capability of explain=True: when cp_isolation contains at most two entries (min one), the Mess-detector results are logged in detail
jpayne@7	365 - Support for alternative language frequency set in charset_normalizer.assets.FREQUENCIES
jpayne@7	366 - Add parameter `language_threshold` in `from_bytes`, `from_path` and `from_fp` to adjust the minimum expected coherence ratio
jpayne@7	367 - `normalizer --version` now specifies whether the current version provides extra speedup (meaning mypyc compilation whl)
jpayne@7	368
jpayne@7	369 ### Changed
jpayne@7	370 - Build with static metadata using 'build' frontend
jpayne@7	371 - Make the language detection stricter
jpayne@7	372 - Optional: Module `md.py` can be compiled using Mypyc to provide an extra speedup up to 4x faster than v2.1
jpayne@7	373
jpayne@7	374 ### Fixed
jpayne@7	375 - CLI with opt --normalize fails when using full path for files
jpayne@7	376 - TooManyAccentuatedPlugin induces false positives on the mess detection when too few alpha characters have been fed to it
jpayne@7	377 - Sphinx warnings when generating the documentation
jpayne@7	378
jpayne@7	379 ### Removed
jpayne@7	380 - Coherence detector no longer returns 'Simple English'; it returns 'English' instead
jpayne@7	381 - Coherence detector no longer returns 'Classical Chinese'; it returns 'Chinese' instead
jpayne@7	382 - Breaking: Method `first()` and `best()` from CharsetMatch
jpayne@7	383 - UTF-7 will no longer appear as "detected" without a recognized SIG/mark (is unreliable/conflict with ASCII)
jpayne@7	384 - Breaking: Class aliases CharsetDetector, CharsetDoctor, CharsetNormalizerMatch and CharsetNormalizerMatches
jpayne@7	385 - Breaking: Top-level function `normalize`
jpayne@7	386 - Breaking: Properties `chaos_secondary_pass`, `coherence_non_latin` and `w_counter` from CharsetMatch
jpayne@7	387 - Support for the backport `unicodedata2`
jpayne@7	388
jpayne@7	389 ## [3.0.0rc1](https://github.com/Ousret/charset_normalizer/compare/3.0.0b2...3.0.0rc1) (2022-10-18)
jpayne@7	390
jpayne@7	391 ### Added
jpayne@7	392 - Extend the capability of explain=True: when cp_isolation contains at most two entries (min one), the Mess-detector results are logged in detail
jpayne@7	393 - Support for alternative language frequency set in charset_normalizer.assets.FREQUENCIES
jpayne@7	394 - Add parameter `language_threshold` in `from_bytes`, `from_path` and `from_fp` to adjust the minimum expected coherence ratio
jpayne@7	395
jpayne@7	396 ### Changed
jpayne@7	397 - Build with static metadata using 'build' frontend
jpayne@7	398 - Make the language detection stricter
jpayne@7	399
jpayne@7	400 ### Fixed
jpayne@7	401 - CLI with opt --normalize fails when using full path for files
jpayne@7	402 - TooManyAccentuatedPlugin induces false positives on the mess detection when too few alpha characters have been fed to it
jpayne@7	403
jpayne@7	404 ### Removed
jpayne@7	405 - Coherence detector no longer returns 'Simple English'; it returns 'English' instead
jpayne@7	406 - Coherence detector no longer returns 'Classical Chinese'; it returns 'Chinese' instead
jpayne@7	407
jpayne@7	408 ## [3.0.0b2](https://github.com/Ousret/charset_normalizer/compare/3.0.0b1...3.0.0b2) (2022-08-21)
jpayne@7	409
jpayne@7	410 ### Added
jpayne@7	411 - `normalizer --version` now specifies whether the current version provides extra speedup (meaning mypyc compilation whl)
jpayne@7	412
jpayne@7	413 ### Removed
jpayne@7	414 - Breaking: Method `first()` and `best()` from CharsetMatch
jpayne@7	415 - UTF-7 will no longer appear as "detected" without a recognized SIG/mark (is unreliable/conflict with ASCII)
jpayne@7	416
jpayne@7	417 ### Fixed
jpayne@7	418 - Sphinx warnings when generating the documentation
jpayne@7	419
jpayne@7	420 ## [3.0.0b1](https://github.com/Ousret/charset_normalizer/compare/2.1.0...3.0.0b1) (2022-08-15)
jpayne@7	421
jpayne@7	422 ### Changed
jpayne@7	423 - Optional: Module `md.py` can be compiled using Mypyc to provide an extra speedup up to 4x faster than v2.1
jpayne@7	424
jpayne@7	425 ### Removed
jpayne@7	426 - Breaking: Class aliases CharsetDetector, CharsetDoctor, CharsetNormalizerMatch and CharsetNormalizerMatches
jpayne@7	427 - Breaking: Top-level function `normalize`
jpayne@7	428 - Breaking: Properties `chaos_secondary_pass`, `coherence_non_latin` and `w_counter` from CharsetMatch
jpayne@7	429 - Support for the backport `unicodedata2`
jpayne@7	430
jpayne@7	431 ## [2.1.1](https://github.com/Ousret/charset_normalizer/compare/2.1.0...2.1.1) (2022-08-19)
jpayne@7	432
jpayne@7	433 ### Deprecated
jpayne@7	434 - Function `normalize` scheduled for removal in 3.0
jpayne@7	435
jpayne@7	436 ### Changed
jpayne@7	437 - Removed useless call to decode in fn is_unprintable (#206)
jpayne@7	438
jpayne@7	439 ### Fixed
jpayne@7	440 - Third-party library (i18n xgettext) crashing not recognizing utf_8 (PEP 263) with underscore from [@aleksandernovikov](https://github.com/aleksandernovikov) (#204)
jpayne@7	441
jpayne@7	442 ## [2.1.0](https://github.com/Ousret/charset_normalizer/compare/2.0.12...2.1.0) (2022-06-19)
jpayne@7	443
jpayne@7	444 ### Added
jpayne@7	445 - Output the Unicode table version when running the CLI with `--version` (PR #194)
jpayne@7	446
jpayne@7	447 ### Changed
jpayne@7	448 - Re-use decoded buffer for single byte character sets from [@nijel](https://github.com/nijel) (PR #175)
jpayne@7	449 - Fixing some performance bottlenecks from [@deedy5](https://github.com/deedy5) (PR #183)
jpayne@7	450
jpayne@7	451 ### Fixed
jpayne@7	452 - Workaround potential bug in cpython with Zero Width No-Break Space located in Arabic Presentation Forms-B, Unicode 1.1 not acknowledged as space (PR #175)
jpayne@7	453 - CLI default threshold aligned with the API threshold from [@oleksandr-kuzmenko](https://github.com/oleksandr-kuzmenko) (PR #181)
jpayne@7	454
jpayne@7	455 ### Removed
jpayne@7	456 - Support for Python 3.5 (PR #192)
jpayne@7	457
jpayne@7	458 ### Deprecated
jpayne@7	459 - Use of backport unicodedata from `unicodedata2` as Python is quickly catching up, scheduled for removal in 3.0 (PR #194)
jpayne@7	460
jpayne@7	461 ## [2.0.12](https://github.com/Ousret/charset_normalizer/compare/2.0.11...2.0.12) (2022-02-12)
jpayne@7	462
jpayne@7	463 ### Fixed
jpayne@7	464 - ASCII miss-detection on rare cases (PR #170)
jpayne@7	465
jpayne@7	466 ## [2.0.11](https://github.com/Ousret/charset_normalizer/compare/2.0.10...2.0.11) (2022-01-30)
jpayne@7	467
jpayne@7	468 ### Added
jpayne@7	469 - Explicit support for Python 3.11 (PR #164)
jpayne@7	470
jpayne@7	471 ### Changed
jpayne@7	472 - The logging behavior has been completely reviewed, now using only TRACE and DEBUG levels (PR #163 #165)
jpayne@7	473
jpayne@7	474 ## [2.0.10](https://github.com/Ousret/charset_normalizer/compare/2.0.9...2.0.10) (2022-01-04)
jpayne@7	475
jpayne@7	476 ### Fixed
jpayne@7	477 - Fallback match entries might lead to UnicodeDecodeError for large bytes sequence (PR #154)
jpayne@7	478
jpayne@7	479 ### Changed
jpayne@7	480 - Skipping the language-detection (CD) on ASCII (PR #155)
jpayne@7	481
jpayne@7	482 ## [2.0.9](https://github.com/Ousret/charset_normalizer/compare/2.0.8...2.0.9) (2021-12-03)
jpayne@7	483
jpayne@7	484 ### Changed
jpayne@7	485 - Moderating the logging impact (since 2.0.8) for specific environments (PR #147)
jpayne@7	486
jpayne@7	487 ### Fixed
jpayne@7	488 - Wrong logging level applied when setting kwarg `explain` to True (PR #146)
jpayne@7	489
jpayne@7	490 ## [2.0.8](https://github.com/Ousret/charset_normalizer/compare/2.0.7...2.0.8) (2021-11-24)
jpayne@7	491 ### Changed
jpayne@7	492 - Improvement over Vietnamese detection (PR #126)
jpayne@7	493 - MD improvement on trailing data and long foreign (non-pure latin) data (PR #124)
jpayne@7	494 - Efficiency improvements in cd/alphabet_languages from [@adbar](https://github.com/adbar) (PR #122)
jpayne@7	495 - call sum() without an intermediary list following PEP 289 recommendations from [@adbar](https://github.com/adbar) (PR #129)
jpayne@7	496 - Code style as refactored by Sourcery-AI (PR #131)
jpayne@7	497 - Minor adjustment on the MD around european words (PR #133)
jpayne@7	498 - Remove and replace SRTs from assets / tests (PR #139)
jpayne@7	499 - Initialize the library logger with a `NullHandler` by default from [@nmaynes](https://github.com/nmaynes) (PR #135)
jpayne@7	500 - Setting kwarg `explain` to True will add provisionally (bounded to function lifespan) a specific stream handler (PR #135)
jpayne@7	501
jpayne@7	502 ### Fixed
jpayne@7	503 - Fix large (misleading) sequence giving UnicodeDecodeError (PR #137)
jpayne@7	504 - Avoid using too insignificant chunk (PR #137)
jpayne@7	505
jpayne@7	506 ### Added
jpayne@7	507 - Add and expose function `set_logging_handler` to configure a specific StreamHandler from [@nmaynes](https://github.com/nmaynes) (PR #135)
jpayne@7	508 - Add `CHANGELOG.md` entries, format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/) (PR #141)
jpayne@7	509
jpayne@7	510 ## [2.0.7](https://github.com/Ousret/charset_normalizer/compare/2.0.6...2.0.7) (2021-10-11)
jpayne@7	511 ### Added
jpayne@7	512 - Add support for Kazakh (Cyrillic) language detection (PR #109)
jpayne@7	513
jpayne@7	514 ### Changed
jpayne@7	515 - Further, improve inferring the language from a given single-byte code page (PR #112)
jpayne@7	516 - Vainly trying to leverage PEP263 when PEP3120 is not supported (PR #116)
jpayne@7	517 - Refactoring for potential performance improvements in loops from [@adbar](https://github.com/adbar) (PR #113)
jpayne@7	518 - Various detection improvement (MD+CD) (PR #117)
jpayne@7	519
jpayne@7	520 ### Removed
jpayne@7	521 - Remove redundant logging entry about detected language(s) (PR #115)
jpayne@7	522
jpayne@7	523 ### Fixed
jpayne@7	524 - Fix a minor inconsistency between Python 3.5 and other versions regarding language detection (PR #117 #102)
jpayne@7	525
jpayne@7	526 ## [2.0.6](https://github.com/Ousret/charset_normalizer/compare/2.0.5...2.0.6) (2021-09-18)
jpayne@7	527 ### Fixed
jpayne@7	528 - Unforeseen regression with the loss of the backward-compatibility with some older minor of Python 3.5.x (PR #100)
jpayne@7	529 - Fix CLI crash when using --minimal output in certain cases (PR #103)
jpayne@7	530
jpayne@7	531 ### Changed
jpayne@7	532 - Minor improvement to the detection efficiency (less than 1%) (PR #106 #101)
jpayne@7	533
jpayne@7	534 ## [2.0.5](https://github.com/Ousret/charset_normalizer/compare/2.0.4...2.0.5) (2021-09-14)
jpayne@7	535 ### Changed
jpayne@7	536 - The project now complies with: flake8, mypy, isort and black to ensure a better overall quality (PR #81)
jpayne@7	537 - The BC-support with v1.x was improved, the old staticmethods are restored (PR #82)
jpayne@7	538 - The Unicode detection is slightly improved (PR #93)
jpayne@7	539 - Add syntax sugar \_\_bool\_\_ for results CharsetMatches list-container (PR #91)
jpayne@7	540
jpayne@7	541 ### Removed
jpayne@7	542 - The project no longer raises a warning on tiny content given for detection; it is simply logged as a warning instead (PR #92)
jpayne@7	543
jpayne@7	544 ### Fixed
jpayne@7	545 - In some rare case, the chunks extractor could cut in the middle of a multi-byte character and could mislead the mess detection (PR #95)
jpayne@7	546 - Some rare 'space' characters could trip up the UnprintablePlugin/Mess detection (PR #96)
jpayne@7	547 - The MANIFEST.in was not exhaustive (PR #78)
jpayne@7	548
jpayne@7	549 ## [2.0.4](https://github.com/Ousret/charset_normalizer/compare/2.0.3...2.0.4) (2021-07-30)
jpayne@7	550 ### Fixed
jpayne@7	551 - The CLI no longer raises an unexpected exception when no encoding has been found (PR #70)
jpayne@7	552 - Fix accessing the 'alphabets' property when the payload contains surrogate characters (PR #68)
jpayne@7	553 - The logger could mislead (explain=True) on detected languages and the impact of one MBCS match (PR #72)
jpayne@7	554 - Submatch factoring could be wrong in rare edge cases (PR #72)
jpayne@7	555 - Multiple files given to the CLI were ignored when publishing results to STDOUT. (After the first path) (PR #72)
jpayne@7	556 - Fix line endings from CRLF to LF for certain project files (PR #67)
jpayne@7	557
jpayne@7	558 ### Changed
jpayne@7	559 - Adjust the MD to lower the sensitivity, thus improving the global detection reliability (PR #69 #76)
jpayne@7	560 - Allow fallback on specified encoding if any (PR #71)
jpayne@7	561
jpayne@7	562 ## [2.0.3](https://github.com/Ousret/charset_normalizer/compare/2.0.2...2.0.3) (2021-07-16)
jpayne@7	563 ### Changed
jpayne@7	564 - Part of the detection mechanism has been improved to be less sensitive, resulting in more accurate detection results, especially for ASCII. (PR #63)
jpayne@7	565 - Following community wishes, the detection will fall back on ASCII or UTF-8 in a last-resort case. (PR #64)
jpayne@7	566
jpayne@7	567 ## [2.0.2](https://github.com/Ousret/charset_normalizer/compare/2.0.1...2.0.2) (2021-07-15)
jpayne@7	568 ### Fixed
jpayne@7	569 - Fixed misdetection of empty or too-small JSON payloads. Report from [@tseaver](https://github.com/tseaver) (PR #59)
jpayne@7	570
jpayne@7	571 ### Changed
jpayne@7	572 - Don't inject unicodedata2 into sys.modules from [@akx](https://github.com/akx) (PR #57)
jpayne@7	573
jpayne@7	574 ## [2.0.1](https://github.com/Ousret/charset_normalizer/compare/2.0.0...2.0.1) (2021-07-13)
jpayne@7	575 ### Fixed
jpayne@7	576 - Make it work where there isn't a filesystem available, dropping assets frequencies.json. Report from [@sethmlarson](https://github.com/sethmlarson). (PR #55)
jpayne@7	577 - Using explain=False permanently disabled the verbose output in the current runtime (PR #47)
jpayne@7	578 - One log entry (language target preemptive) was not shown in logs when using explain=True (PR #47)
jpayne@7	579 - Fix undesired exception (ValueError) on getitem of instance CharsetMatches (PR #52)
jpayne@7	580
jpayne@7	581 ### Changed
jpayne@7	582 - Public function normalize default args values were not aligned with from_bytes (PR #53)
jpayne@7	583
jpayne@7	584 ### Added
jpayne@7	585 - You may now use charset aliases in cp_isolation and cp_exclusion arguments (PR #47)
jpayne@7	586
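As background for the alias entry above, Python's own codec registry already maps many aliases to one canonical codec. The sketch below shows that standard-library behaviour only; how the package itself resolves aliases for `cp_isolation` and `cp_exclusion` is not shown in this changelog, and `canonical_name` is a helper invented for this example.

```python
import codecs


def canonical_name(label: str) -> str:
    """Return the canonical codec name Python registers for an alias."""
    return codecs.lookup(label).name


# Different spellings of the same charset resolve to the same codec.
same_latin = canonical_name("latin1") == canonical_name("iso-8859-1")
same_cp1252 = canonical_name("windows-1252") == canonical_name("cp1252")

# An unknown label raises LookupError rather than returning a guess.
try:
    canonical_name("not-a-real-charset")
    unknown_rejected = False
except LookupError:
    unknown_rejected = True
```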
jpayne@7	587 ## [2.0.0](https://github.com/Ousret/charset_normalizer/compare/1.4.1...2.0.0) (2021-07-02)
jpayne@7	588 ### Changed
jpayne@7	589 - 4 to 5 times faster than the previous 1.4.0 release. At least 2x faster than Chardet.
jpayne@7	590 - Emphasis has been placed on UTF-8 detection, which should be nearly instantaneous.
jpayne@7	591 - The backward compatibility with Chardet has been greatly improved. The legacy detect function returns an identical charset name whenever possible.
jpayne@7	592 - The detection mechanism has been slightly improved, now Turkish content is detected correctly (most of the time)
jpayne@7	593 - The program has been rewritten to ease readability and maintainability (plus static typing).
jpayne@7	594 - utf_7 detection has been reinstated.
jpayne@7	595
jpayne@7	596 ### Removed
jpayne@7	597 - This package no longer requires anything when used with Python 3.5 (Dropped cached_property)
jpayne@7	598 - Removed support for these languages: Catalan, Esperanto, Kazakh, Basque, Volapük, Azeri, Galician, Nynorsk, Macedonian, and Serbocroatian.
jpayne@7	599 - The exception hook on UnicodeDecodeError has been removed.
jpayne@7	600
jpayne@7	601 ### Deprecated
jpayne@7	602 - Methods coherence_non_latin, w_counter, chaos_secondary_pass of the class CharsetMatch are now deprecated and scheduled for removal in v3.0
jpayne@7	603
jpayne@7	604 ### Fixed
jpayne@7	605 - The CLI output used the relative path of the file(s). Should be absolute.
jpayne@7	606
jpayne@7	607 ## [1.4.1](https://github.com/Ousret/charset_normalizer/compare/1.4.0...1.4.1) (2021-05-28)
jpayne@7	608 ### Fixed
jpayne@7	609 - Logger configuration/usage no longer conflicts with others (PR #44)
jpayne@7	610
jpayne@7	611 ## [1.4.0](https://github.com/Ousret/charset_normalizer/compare/1.3.9...1.4.0) (2021-05-21)
jpayne@7	612 ### Removed
jpayne@7	613 - Using standard logging instead of using the package loguru.
jpayne@7	614 - Dropping nose test framework in favor of the maintained pytest.
jpayne@7	615 - Chose not to use the dragonmapper package to help with gibberish Chinese/CJK text.
jpayne@7	616 - Require cached_property only for Python 3.5 due to constraint. Dropping for every other interpreter version.
jpayne@7	617 - Stop support for UTF-7 that does not contain a SIG.
jpayne@7	618 - Dropping PrettyTable, replaced with pure JSON output in CLI.
jpayne@7	619
jpayne@7	620 ### Fixed
jpayne@7	621 - BOM marker in a CharsetNormalizerMatch instance could be False in rare cases even if obviously present. Due to the sub-match factoring process.
jpayne@7	622 - Not searching properly for the BOM when trying utf32/16 parent codec.
jpayne@7	623
jpayne@7	624 ### Changed
jpayne@7	625 - Improving the package final size by compressing frequencies.json.
jpayne@7	626 - Huge improvement on the largest payloads.
jpayne@7	627
jpayne@7	628 ### Added
jpayne@7	629 - CLI now produces JSON consumable output.
jpayne@7	630 - Return ASCII if the given sequences fit, given reasonable confidence.
jpayne@7	631
jpayne@7	632 ## [1.3.9](https://github.com/Ousret/charset_normalizer/compare/1.3.8...1.3.9) (2021-05-13)
jpayne@7	633
jpayne@7	634 ### Fixed
jpayne@7	635 - In some very rare cases, you may end up getting encode/decode errors due to a bad bytes payload (PR #40)
jpayne@7	636
jpayne@7	637 ## [1.3.8](https://github.com/Ousret/charset_normalizer/compare/1.3.7...1.3.8) (2021-05-12)
jpayne@7	638
jpayne@7	639 ### Fixed
jpayne@7	640 - Empty given payload for detection may cause an exception if trying to access the `alphabets` property. (PR #39)
jpayne@7	641
jpayne@7	642 ## [1.3.7](https://github.com/Ousret/charset_normalizer/compare/1.3.6...1.3.7) (2021-05-12)
jpayne@7	643
jpayne@7	644 ### Fixed
jpayne@7	645 - The legacy detect function should return UTF-8-SIG if sig is present in the payload. (PR #38)
jpayne@7	646
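The "sig" in the entry above refers to the UTF-8 byte order mark. The following sketch shows, using only the standard library, why a payload that starts with it is better labelled `utf-8-sig` than `utf-8`. `has_utf8_sig` is a helper invented for this example, not a function of the package.

```python
import codecs


def has_utf8_sig(payload: bytes) -> bool:
    """Report whether the payload starts with the UTF-8 byte order mark."""
    return payload.startswith(codecs.BOM_UTF8)


payload = codecs.BOM_UTF8 + "abc".encode("utf-8")

# Plain utf-8 keeps the mark as a leading U+FEFF character.
as_utf8 = payload.decode("utf-8")

# utf-8-sig strips the mark while decoding.
as_utf8_sig = payload.decode("utf-8-sig")
```

Decoding with plain `utf-8` leaves an invisible leading character in the text, which is why reporting the `-SIG` variant matters to callers that decode with the returned name.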
jpayne@7	647 ## [1.3.6](https://github.com/Ousret/charset_normalizer/compare/1.3.5...1.3.6) (2021-02-09)
jpayne@7	648
jpayne@7	649 ### Changed
jpayne@7	650 - Amend the previous release to allow prettytable 2.0 (PR #35)
jpayne@7	651
jpayne@7	652 ## [1.3.5](https://github.com/Ousret/charset_normalizer/compare/1.3.4...1.3.5) (2021-02-08)
jpayne@7	653
jpayne@7	654 ### Fixed
jpayne@7	655 - Fix error while using the package with a python pre-release interpreter (PR #33)
jpayne@7	656
jpayne@7	657 ### Changed
jpayne@7	658 - Dependencies refactoring, constraints revised.
jpayne@7	659
jpayne@7	660 ### Added
jpayne@7	661 - Add Python 3.9 and 3.10 to the supported interpreters
jpayne@7	662
jpayne@7	663 MIT License
jpayne@7	664
jpayne@7	665 Copyright (c) 2019 TAHRI Ahmed R.
jpayne@7	666
jpayne@7	667 Permission is hereby granted, free of charge, to any person obtaining a copy
jpayne@7	668 of this software and associated documentation files (the "Software"), to deal
jpayne@7	669 in the Software without restriction, including without limitation the rights
jpayne@7	670 to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
jpayne@7	671 copies of the Software, and to permit persons to whom the Software is
jpayne@7	672 furnished to do so, subject to the following conditions:
jpayne@7	673
jpayne@7	674 The above copyright notice and this permission notice shall be included in all
jpayne@7	675 copies or substantial portions of the Software.
jpayne@7	676
jpayne@7	677 THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
jpayne@7	678 IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
jpayne@7	679 FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
jpayne@7	680 AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
jpayne@7	681 LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
jpayne@7	682 OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
jpayne@7	683 SOFTWARE.
