Metadata-Version: 2.1
Name: charset-normalizer
Version: 3.3.2
Summary: The Real First Universal Charset Detector. Open, modern and actively maintained alternative to Chardet.
Home-page: https://github.com/Ousret/charset_normalizer
Author: Ahmed TAHRI
Author-email: ahmed.tahri@cloudnursery.dev
License: MIT
Project-URL: Bug Reports, https://github.com/Ousret/charset_normalizer/issues
Project-URL: Documentation, https://charset-normalizer.readthedocs.io/en/latest
Keywords: encoding,charset,charset-detector,detector,normalization,unicode,chardet,detect
Classifier: Development Status :: 5 - Production/Stable
Classifier: License :: OSI Approved :: MIT License
Classifier: Intended Audience :: Developers
Classifier: Topic :: Software Development :: Libraries :: Python Modules
Classifier: Operating System :: OS Independent
Classifier: Programming Language :: Python
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.7
Classifier: Programming Language :: Python :: 3.8
Classifier: Programming Language :: Python :: 3.9
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Programming Language :: Python :: Implementation :: PyPy
Classifier: Topic :: Text Processing :: Linguistic
Classifier: Topic :: Utilities
Classifier: Typing :: Typed
Requires-Python: >=3.7.0
Description-Content-Type: text/markdown
License-File: LICENSE
Provides-Extra: unicode_backport

# Charset Detection, for Everyone 👋


The Real First Universal Charset Detector


> A library that helps you read text from an unknown charset encoding.
> Motivated by `chardet`,
> I'm trying to resolve the issue by taking a new approach.
> All IANA character set names for which the Python core library provides codecs are supported.


This project offers you an alternative to **Universal Charset Encoding Detector**, also known as **Chardet**.

| Feature | [Chardet](https://github.com/chardet/chardet) | Charset Normalizer | [cChardet](https://github.com/PyYoshi/cChardet) |
|--------------------------------------------------|:---------------------------------------------:|:--------------------------------------------------------------------------------------------------:|:-----------------------------------------------:|
| `Fast` | ❌ | ✅ | ✅ |
| `Universal**` | ❌ | ✅ | ❌ |
| `Reliable` **without** distinguishable standards | ❌ | ✅ | ✅ |
| `Reliable` **with** distinguishable standards | ✅ | ✅ | ✅ |
| `License` | LGPL-2.1 (_restrictive_) | MIT | MPL-1.1 (_restrictive_) |
| `Native Python` | ✅ | ✅ | ❌ |
| `Detect spoken language` | ❌ | ✅ | N/A |
| `UnicodeDecodeError Safety` | ❌ | ✅ | ❌ |
| `Whl Size (min)` | 193.6 kB | 42 kB | ~200 kB |
| `Supported Encoding` | 33 | 🎉 [99](https://charset-normalizer.readthedocs.io/en/latest/user/support.html#supported-encodings) | 40 |


*\*\* : They clearly use encoding-specific code, even if it covers most of the commonly used encodings.*
Did you get here because of the logs? See [https://charset-normalizer.readthedocs.io/en/latest/user/miscellaneous.html](https://charset-normalizer.readthedocs.io/en/latest/user/miscellaneous.html)

## ⚡ Performance

This package offers better performance than its counterpart Chardet. Here are some numbers.

| Package | Accuracy | Mean per file (ms) | File per sec (est) |
|-----------------------------------------------|:--------:|:------------------:|:------------------:|
| [chardet](https://github.com/chardet/chardet) | 86 % | 200 ms | 5 file/sec |
| charset-normalizer | **98 %** | **10 ms** | 100 file/sec |

| Package | 99th percentile | 95th percentile | 50th percentile |
|-----------------------------------------------|:---------------:|:---------------:|:---------------:|
| [chardet](https://github.com/chardet/chardet) | 1200 ms | 287 ms | 23 ms |
| charset-normalizer | 100 ms | 50 ms | 5 ms |

Chardet's performance on larger files (1 MB+) is very poor. Expect a huge difference on large payloads.

> Stats are generated using 400+ files with default parameters. For more details on the files used, see the GHA workflows.
> And yes, these results might change at any time. The dataset can be updated to include more files.
> The actual delays heavily depend on your CPU capabilities. The factors should remain the same.
> Keep in mind that the stats are generous and that Chardet accuracy vs ours is measured using Chardet initial capability
> (e.g. Supported Encoding). Challenge them if you want.

## ✨ Installation

Using pip:

```sh
pip install charset-normalizer -U
```

## 🚀 Basic Usage

### CLI
This package comes with a CLI.

```
usage: normalizer [-h] [-v] [-a] [-n] [-m] [-r] [-f] [-t THRESHOLD]
                  file [file ...]

The Real First Universal Charset Detector. Discover originating encoding used
on text file. Normalize text to unicode.

positional arguments:
  files                 File(s) to be analysed

optional arguments:
  -h, --help            show this help message and exit
  -v, --verbose         Display complementary information about file if any.
                        Stdout will contain logs about the detection process.
  -a, --with-alternative
                        Output complementary possibilities if any. Top-level
                        JSON WILL be a list.
  -n, --normalize       Permit to normalize input file. If not set, program
                        does not write anything.
  -m, --minimal         Only output the charset detected to STDOUT. Disabling
                        JSON output.
  -r, --replace         Replace file when trying to normalize it instead of
                        creating a new one.
  -f, --force           Replace file without asking if you are sure, use this
                        flag with caution.
  -t THRESHOLD, --threshold THRESHOLD
                        Define a custom maximum amount of chaos allowed in
                        decoded content. 0. <= chaos <= 1.
  --version             Show version information and exit.
```

```bash
normalizer ./data/sample.1.fr.srt
```

or

```bash
python -m charset_normalizer ./data/sample.1.fr.srt
```

🎉 Since version 1.4.0, the CLI produces an easily usable JSON result on stdout.

```json
{
    "path": "/home/default/projects/charset_normalizer/data/sample.1.fr.srt",
    "encoding": "cp1252",
    "encoding_aliases": [
        "1252",
        "windows_1252"
    ],
    "alternative_encodings": [
        "cp1254",
        "cp1256",
        "cp1258",
        "iso8859_14",
        "iso8859_15",
        "iso8859_16",
        "iso8859_3",
        "iso8859_9",
        "latin_1",
        "mbcs"
    ],
    "language": "French",
    "alphabets": [
        "Basic Latin",
        "Latin-1 Supplement"
    ],
    "has_sig_or_bom": false,
    "chaos": 0.149,
    "coherence": 97.152,
    "unicode_path": null,
    "is_preferred": true
}
```

### Python
*Just print out normalized text*
```python
from charset_normalizer import from_path

results = from_path('./my_subtitle.srt')

print(str(results.best()))
```

*Upgrade your code without effort*
```python
from charset_normalizer import detect
```

The above code will behave the same as **chardet**. We ensure that we offer the best (reasonable) backward-compatible (BC) result possible.

See the docs for advanced usage: [readthedocs.io](https://charset-normalizer.readthedocs.io/en/latest/)

## 😇 Why

When I started using Chardet, I noticed that it did not meet my expectations, and I wanted to propose a
reliable alternative using a completely different method. Also! I never back down on a good challenge!

I **don't care** about the **originating charset** encoding, because **two different tables** can
produce **two identical rendered strings.**
What I want is to get readable text, the best I can.

In a way, **I'm brute forcing text decoding.** How cool is that? 😎

Don't confuse the package **ftfy** with charset-normalizer or chardet. ftfy's goal is to repair Unicode strings, whereas charset-normalizer converts raw files in an unknown encoding to Unicode.

## 🍰 How

- Discard all charset encoding tables that could not fit the binary content.
- Measure noise, or the mess once opened (by chunks) with a corresponding charset encoding.
- Extract matches with the lowest mess detected.
- Additionally, we measure coherence / probe for a language.

**Wait a minute**, what is noise/mess and coherence according to **YOU?**

*Noise:* I opened hundreds of text files, **written by humans**, with the wrong encoding table. **I observed**, then
**I established** some ground rules about **what is obvious** when **it seems like** a mess.
I know that my interpretation of what is noise is probably incomplete, feel free to contribute in order to
improve or rewrite it.

*Coherence:* For each language there is on earth, we have computed ranked letter appearance occurrences (the best we can). So I thought
that intel is worth something here. So I use those records against decoded text to check if I can detect intelligent design.

## ⚡ Known limitations

- Language detection is unreliable when text contains two or more languages sharing identical letters (e.g. HTML with English tags plus Turkish content, both using Latin characters).
- Every charset detector heavily depends on sufficient content. In common cases, do not bother running detection on very tiny content.
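
The steps listed under *How* can be illustrated with a few lines of standard-library Python. This is only a toy sketch, not the library's implementation: the candidate list `CANDIDATES`, the helper `toy_mess_ratio` and the function `guess` are hypothetical names, and the noise rule (non-ASCII characters that are not letters, marks or digits count as mess) is an assumption made up for this example.

```python
import unicodedata

# Hypothetical, deliberately short candidate list for this sketch.
CANDIDATES = ("ascii", "utf_8", "cp1252", "cp1251")


def toy_mess_ratio(text: str) -> float:
    """Return the fraction of characters that look like noise.

    Toy rule (an assumption, not the library's rule set): control
    characters other than tab/newline/carriage return, and non-ASCII
    characters that are not letters, marks or digits, count as noise.
    """
    if not text:
        return 0.0
    noisy = 0
    for ch in text:
        category = unicodedata.category(ch)
        if category == "Cc" and ch not in "\t\n\r":
            noisy += 1
        elif ord(ch) > 127 and category[0] not in "LMN":
            noisy += 1
    return noisy / len(text)


def guess(payload: bytes, candidates=CANDIDATES):
    """Return (encoding, text, mess) tuples, lowest mess first."""
    results = []
    for name in candidates:
        try:
            # Step 1: discard tables that cannot fit the binary content.
            text = payload.decode(name)
        except UnicodeDecodeError:
            continue
        # Step 2: measure the mess once decoded with this table.
        results.append((name, text, toy_mess_ratio(text)))
    # Step 3: keep the matches with the lowest mess first.
    # list.sort is stable, so ties keep the candidate order.
    results.sort(key=lambda item: item[2])
    return results


if __name__ == "__main__":
    payload = "D\u00e9j\u00e0 vu, fa\u00e7ade".encode("utf_8")
    for name, _text, mess in guess(payload):
        print(f"{name}: mess={mess:.3f}")
```

With the UTF-8 payload, only `utf_8` decodes without noise, so it ranks first. Re-encoding the same sentence as cp1252 leaves `cp1252` and `cp1251` tied at zero mess under this toy rule, which is the kind of ambiguity the coherence (language) probe is meant to break, and very short inputs make such ties more likely.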

## ⚠️ About Python EOLs

**If you are running:**

- Python >=2.7,<3.5: Unsupported
- Python 3.5: charset-normalizer < 2.1
- Python 3.6: charset-normalizer < 3.1
- Python 3.7: charset-normalizer < 4.0

Upgrade your Python interpreter as soon as possible.

## 👤 Contributing

Contributions, issues and feature requests are very much welcome.
Feel free to check the [issues page](https://github.com/ousret/charset_normalizer/issues) if you want to contribute.

## 📝 License

Copyright © [Ahmed TAHRI @Ousret](https://github.com/Ousret).
This project is [MIT](https://github.com/Ousret/charset_normalizer/blob/master/LICENSE) licensed.

Character frequencies used in this project © 2012 [Denny Vrandečić](http://simia.net/letters/)

## 💼 For Enterprise

Professional support for charset-normalizer is available as part of the [Tidelift
Subscription][1]. Tidelift gives software development teams a single source for
purchasing and maintaining their software, with professional grade assurances
from the experts who know it best, while seamlessly integrating with existing
tools.

[1]: https://tidelift.com/subscription/pkg/pypi-charset-normalizer?utm_source=pypi-charset-normalizer&utm_medium=readme

# Changelog
All notable changes to charset-normalizer will be documented in this file. This project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).
The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/).

## [3.3.2](https://github.com/Ousret/charset_normalizer/compare/3.3.1...3.3.2) (2023-10-31)

### Fixed
- Unintentional memory usage regression when using large payload that match several encoding (#376)
- Regression on some detection case showcased in the documentation (#371)

### Added
- Noise (md) probe that identify malformed arabic representation due to the presence of letters in isolated form (credit to my wife)

## [3.3.1](https://github.com/Ousret/charset_normalizer/compare/3.3.0...3.3.1) (2023-10-22)

### Changed
- Optional mypyc compilation upgraded to version 1.6.1 for Python >= 3.8
- Improved the general detection reliability based on reports from the community

## [3.3.0](https://github.com/Ousret/charset_normalizer/compare/3.2.0...3.3.0) (2023-09-30)

### Added
- Allow to execute the CLI (e.g. normalizer) through `python -m charset_normalizer.cli` or `python -m charset_normalizer`
- Support for 9 forgotten encoding that are supported by Python but unlisted in `encoding.aliases` as they have no alias (#323)

### Removed
- (internal) Redundant utils.is_ascii function and unused function is_private_use_only
- (internal) charset_normalizer.assets is moved inside charset_normalizer.constant

### Changed
- (internal) Unicode code blocks in constants are updated using the latest v15.0.0 definition to improve detection
- Optional mypyc compilation upgraded to version 1.5.1 for Python >= 3.8

### Fixed
- Unable to properly sort CharsetMatch when both chaos/noise and coherence were close due to an unreachable condition in \_\_lt\_\_ (#350)

## [3.2.0](https://github.com/Ousret/charset_normalizer/compare/3.1.0...3.2.0) (2023-06-07)

### Changed
- Typehint for function `from_path` no longer enforce `PathLike` as its first argument
- Minor improvement over the global detection reliability

### Added
- Introduce function `is_binary` that relies on main capabilities, and optimized to detect binaries
- Propagate `enable_fallback` argument throughout `from_bytes`, `from_path`, and `from_fp` that allow a deeper control over the detection (default True)
- Explicit support for Python 3.12

### Fixed
- Edge case detection failure where a file would contain 'very-long' camel cased word (Issue #289)

## [3.1.0](https://github.com/Ousret/charset_normalizer/compare/3.0.1...3.1.0) (2023-03-06)

### Added
- Argument `should_rename_legacy` for legacy function `detect` and disregard any new arguments without errors (PR #262)

### Removed
- Support for Python
3.6 (PR #260)

### Changed
- Optional speedup provided by mypy/c 1.0.1

## [3.0.1](https://github.com/Ousret/charset_normalizer/compare/3.0.0...3.0.1) (2022-11-18)

### Fixed
- Multi-bytes cutter/chunk generator did not always cut correctly (PR #233)

### Changed
- Speedup provided by mypy/c 0.990 on Python >= 3.7

## [3.0.0](https://github.com/Ousret/charset_normalizer/compare/2.1.1...3.0.0) (2022-10-20)

### Added
- Extend the capability of explain=True when cp_isolation contains at most two entries (min one), will log in details of the Mess-detector results
- Support for alternative language frequency set in charset_normalizer.assets.FREQUENCIES
- Add parameter `language_threshold` in `from_bytes`, `from_path` and `from_fp` to adjust the minimum expected coherence ratio
- `normalizer --version` now specify if current version provide extra speedup (meaning mypyc compilation whl)

### Changed
- Build with static metadata using 'build' frontend
- Make the language detection stricter
- Optional: Module `md.py` can be compiled using Mypyc to provide an extra speedup up to 4x faster than v2.1

### Fixed
- CLI with opt --normalize fail when using full path for files
- TooManyAccentuatedPlugin induce false positive on the mess detection when too few alpha character have been fed to it
- Sphinx warnings when generating the documentation

### Removed
- Coherence detector no longer return 'Simple English' instead return 'English'
- Coherence detector no longer return 'Classical Chinese' instead return 'Chinese'
- Breaking: Method `first()` and `best()` from CharsetMatch
- UTF-7 will no longer appear as "detected" without a recognized SIG/mark (is unreliable/conflict with ASCII)
- Breaking: Class aliases CharsetDetector, CharsetDoctor, CharsetNormalizerMatch and CharsetNormalizerMatches
- Breaking: Top-level function `normalize`
- Breaking: Properties `chaos_secondary_pass`, `coherence_non_latin` and `w_counter` from CharsetMatch
- Support for the backport `unicodedata2`

## [3.0.0rc1](https://github.com/Ousret/charset_normalizer/compare/3.0.0b2...3.0.0rc1) (2022-10-18)

### Added
- Extend the capability of explain=True when cp_isolation contains at most two entries (min one), will log in details of the Mess-detector results
- Support for alternative language frequency set in charset_normalizer.assets.FREQUENCIES
- Add parameter `language_threshold` in `from_bytes`, `from_path` and `from_fp` to adjust the minimum expected coherence ratio

### Changed
- Build with static metadata using 'build' frontend
- Make the language detection stricter

### Fixed
- CLI with opt --normalize fail when using full path for files
- TooManyAccentuatedPlugin induce false positive on the mess detection when too few alpha character have been fed to it

### Removed
- Coherence detector no longer return 'Simple English' instead return 'English'
- Coherence detector no longer return 'Classical Chinese' instead return 'Chinese'

## [3.0.0b2](https://github.com/Ousret/charset_normalizer/compare/3.0.0b1...3.0.0b2) (2022-08-21)

### Added
- `normalizer --version` now specify if current version provide extra speedup (meaning mypyc compilation whl)

### Removed
- Breaking: Method `first()` and `best()` from CharsetMatch
- UTF-7 will no longer appear as "detected" without a recognized SIG/mark (is unreliable/conflict with ASCII)

### Fixed
- Sphinx warnings when generating the documentation

## [3.0.0b1](https://github.com/Ousret/charset_normalizer/compare/2.1.0...3.0.0b1) (2022-08-15)

### Changed
- Optional: Module `md.py` can be compiled using Mypyc to provide an extra speedup up to 4x faster than v2.1

### Removed
- Breaking: Class aliases CharsetDetector, CharsetDoctor, CharsetNormalizerMatch and CharsetNormalizerMatches
- Breaking: Top-level function `normalize`
- Breaking: Properties `chaos_secondary_pass`, `coherence_non_latin` and `w_counter` from CharsetMatch
- Support for the backport `unicodedata2`

## [2.1.1](https://github.com/Ousret/charset_normalizer/compare/2.1.0...2.1.1) (2022-08-19)

### Deprecated
- Function `normalize` scheduled for removal in 3.0

### Changed
- Removed useless call to decode in fn is_unprintable (#206)

### Fixed
- Third-party library (i18n xgettext) crashing not recognizing utf_8 (PEP 263) with underscore from [@aleksandernovikov](https://github.com/aleksandernovikov) (#204)

## [2.1.0](https://github.com/Ousret/charset_normalizer/compare/2.0.12...2.1.0) (2022-06-19)

### Added
- Output the Unicode table version when running the CLI with `--version` (PR #194)

### Changed
- Re-use decoded buffer for single byte character sets from [@nijel](https://github.com/nijel) (PR #175)
- Fixing some performance bottlenecks from [@deedy5](https://github.com/deedy5) (PR #183)

### Fixed
- Workaround potential bug in cpython with Zero Width No-Break Space located in Arabic Presentation Forms-B, Unicode 1.1 not acknowledged as space (PR #175)
- CLI default threshold aligned with the API
threshold from [@oleksandr-kuzmenko](https://github.com/oleksandr-kuzmenko) (PR #181)

### Removed
- Support for Python 3.5 (PR #192)

### Deprecated
- Use of backport unicodedata from `unicodedata2` as Python is quickly catching up, scheduled for removal in 3.0 (PR #194)

## [2.0.12](https://github.com/Ousret/charset_normalizer/compare/2.0.11...2.0.12) (2022-02-12)

### Fixed
- ASCII miss-detection on rare cases (PR #170)

## [2.0.11](https://github.com/Ousret/charset_normalizer/compare/2.0.10...2.0.11) (2022-01-30)

### Added
- Explicit support for Python 3.11 (PR #164)

### Changed
- The logging behavior have been completely reviewed, now using only TRACE and DEBUG levels (PR #163 #165)

## [2.0.10](https://github.com/Ousret/charset_normalizer/compare/2.0.9...2.0.10) (2022-01-04)

### Fixed
- Fallback match entries might lead to UnicodeDecodeError for large bytes sequence (PR #154)

### Changed
- Skipping the language-detection (CD) on ASCII (PR #155)

## [2.0.9](https://github.com/Ousret/charset_normalizer/compare/2.0.8...2.0.9) (2021-12-03)

### Changed
- Moderating the logging impact (since 2.0.8) for specific environments (PR #147)

### Fixed
- Wrong logging level applied when setting kwarg `explain` to True (PR #146)

## [2.0.8](https://github.com/Ousret/charset_normalizer/compare/2.0.7...2.0.8) (2021-11-24)
### Changed
- Improvement over Vietnamese detection (PR #126)
- MD improvement on trailing data and long foreign (non-pure latin) data (PR #124)
- Efficiency improvements in cd/alphabet_languages from [@adbar](https://github.com/adbar) (PR #122)
- call sum() without an intermediary list following PEP 289 recommendations from [@adbar](https://github.com/adbar) (PR #129)
- Code style as refactored by Sourcery-AI (PR #131)
- Minor adjustment on the MD around european words (PR #133)
- Remove and replace SRTs from assets / tests (PR #139)
- Initialize the library logger with a `NullHandler` by default from [@nmaynes](https://github.com/nmaynes) (PR #135)
- Setting kwarg `explain` to True will add provisionally (bounded to function lifespan) a specific stream handler (PR #135)

### Fixed
- Fix large (misleading) sequence giving UnicodeDecodeError (PR #137)
- Avoid using too insignificant chunk (PR #137)

### Added
- Add and expose function `set_logging_handler` to configure a specific StreamHandler from [@nmaynes](https://github.com/nmaynes) (PR #135)
- Add `CHANGELOG.md` entries, format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/) (PR #141)

## [2.0.7](https://github.com/Ousret/charset_normalizer/compare/2.0.6...2.0.7) (2021-10-11)
### Added
- Add support for Kazakh (Cyrillic) language detection (PR #109)

### Changed
- Further, improve inferring the language from a given single-byte code page (PR #112)
- Vainly trying to leverage PEP263 when PEP3120 is not supported (PR #116)
- Refactoring for potential performance improvements in loops from [@adbar](https://github.com/adbar) (PR #113)
- Various detection improvement (MD+CD) (PR #117)

### Removed
- Remove redundant logging entry about detected language(s) (PR #115)

### Fixed
- Fix a minor inconsistency between Python 3.5 and other versions regarding language detection (PR #117 #102)

## [2.0.6](https://github.com/Ousret/charset_normalizer/compare/2.0.5...2.0.6) (2021-09-18)
### Fixed
- Unforeseen regression with the loss of the backward-compatibility with some older minor of Python 3.5.x (PR #100)
- Fix CLI crash when using --minimal output in certain cases (PR #103)

### Changed
- Minor improvement to the detection efficiency (less than 1%) (PR #106 #101)

## [2.0.5](https://github.com/Ousret/charset_normalizer/compare/2.0.4...2.0.5) (2021-09-14)
### Changed
- The project now comply with: flake8, mypy, isort and black to ensure a better overall quality (PR #81)
- The BC-support with v1.x was improved, the old staticmethods are restored (PR #82)
- The Unicode detection is slightly improved (PR #93)
- Add syntax sugar \_\_bool\_\_ for results CharsetMatches list-container (PR #91)

### Removed
- The project no longer raise warning on tiny content given for detection, will be simply logged as warning instead (PR #92)

### Fixed
- In some rare case, the chunks extractor could cut in the middle of a multi-byte character and could mislead the mess detection (PR #95)
- Some rare 'space' characters could trip up the UnprintablePlugin/Mess detection (PR #96)
- The MANIFEST.in was not exhaustive (PR #78)

## [2.0.4](https://github.com/Ousret/charset_normalizer/compare/2.0.3...2.0.4) (2021-07-30)
### Fixed
- The CLI no longer raise an unexpected exception when no encoding has been found (PR #70)
- Fix accessing the 'alphabets' property when the payload contains surrogate characters (PR #68)
- The logger could mislead (explain=True) on detected languages and the impact of one MBCS match (PR #72)
- Submatch factoring could be wrong in rare edge cases (PR #72)
- Multiple files
given to the CLI were ignored when publishing results to STDOUT. (After the first path) (PR #72)
- Fix line endings from CRLF to LF for certain project files (PR #67)

### Changed
- Adjust the MD to lower the sensitivity, thus improving the global detection reliability (PR #69 #76)
- Allow fallback on specified encoding if any (PR #71)

## [2.0.3](https://github.com/Ousret/charset_normalizer/compare/2.0.2...2.0.3) (2021-07-16)
### Changed
- Part of the detection mechanism has been improved to be less sensitive, resulting in more accurate detection results. Especially ASCII. (PR #63)
- According to the community wishes, the detection will fall back on ASCII or UTF-8 in a last-resort case. (PR #64)

## [2.0.2](https://github.com/Ousret/charset_normalizer/compare/2.0.1...2.0.2) (2021-07-15)
### Fixed
- Empty/Too small JSON payload miss-detection fixed. Report from [@tseaver](https://github.com/tseaver) (PR #59)

### Changed
- Don't inject unicodedata2 into sys.modules from [@akx](https://github.com/akx) (PR #57)

## [2.0.1](https://github.com/Ousret/charset_normalizer/compare/2.0.0...2.0.1) (2021-07-13)
### Fixed
- Make it work where there isn't a filesystem available, dropping assets frequencies.json. Report from [@sethmlarson](https://github.com/sethmlarson). (PR #55)
- Using explain=False permanently disable the verbose output in the current runtime (PR #47)
- One log entry (language target preemptive) was not show in logs when using explain=True (PR #47)
- Fix undesired exception (ValueError) on getitem of instance CharsetMatches (PR #52)

### Changed
- Public function normalize default args values were not aligned with from_bytes (PR #53)

### Added
- You may now use charset aliases in cp_isolation and cp_exclusion arguments (PR #47)

## [2.0.0](https://github.com/Ousret/charset_normalizer/compare/1.4.1...2.0.0) (2021-07-02)
### Changed
- 4x to 5 times faster than the previous 1.4.0 release. At least 2x faster than Chardet.
- Accent has been made on UTF-8 detection, should perform rather instantaneous.
- The backward compatibility with Chardet has been greatly improved. The legacy detect function returns an identical charset name whenever possible.
- The detection mechanism has been slightly improved, now Turkish content is detected correctly (most of the time)
- The program has been rewritten to ease the readability and maintainability. (+Using static typing)+
- utf_7 detection has been reinstated.

### Removed
- This package no longer require anything when used with Python 3.5 (Dropped cached_property)
- Removed support for these languages: Catalan, Esperanto, Kazakh, Baque, Volapük, Azeri, Galician, Nynorsk, Macedonian, and Serbocroatian.
- The exception hook on UnicodeDecodeError has been removed.

### Deprecated
- Methods coherence_non_latin, w_counter, chaos_secondary_pass of the class CharsetMatch are now deprecated and scheduled for removal in v3.0

### Fixed
- The CLI output used the relative path of the file(s). Should be absolute.

## [1.4.1](https://github.com/Ousret/charset_normalizer/compare/1.4.0...1.4.1) (2021-05-28)
### Fixed
- Logger configuration/usage no longer conflict with others (PR #44)

## [1.4.0](https://github.com/Ousret/charset_normalizer/compare/1.3.9...1.4.0) (2021-05-21)
### Removed
- Using standard logging instead of using the package loguru.
- Dropping nose test framework in favor of the maintained pytest.
- Choose to not use dragonmapper package to help with gibberish Chinese/CJK text.
- Require cached_property only for Python 3.5 due to constraint. Dropping for every other interpreter version.
- Stop support for UTF-7 that does not contain a SIG.
- Dropping PrettyTable, replaced with pure JSON output in CLI.

### Fixed
- BOM marker in a CharsetNormalizerMatch instance could be False in rare cases even if obviously present. Due to the sub-match factoring process.
- Not searching properly for the BOM when trying utf32/16 parent codec.

### Changed
- Improving the package final size by compressing frequencies.json.
- Huge improvement over the larges payload.

### Added
- CLI now produces JSON consumable output.
- Return ASCII if given sequences fit. Given reasonable confidence.

## [1.3.9](https://github.com/Ousret/charset_normalizer/compare/1.3.8...1.3.9) (2021-05-13)

### Fixed
- In some very rare cases, you may end up getting encode/decode errors due to a bad bytes payload (PR #40)

## [1.3.8](https://github.com/Ousret/charset_normalizer/compare/1.3.7...1.3.8) (2021-05-12)

### Fixed
- Empty given payload for detection may cause an exception if trying to access the `alphabets` property.
(PR #39)

## [1.3.7](https://github.com/Ousret/charset_normalizer/compare/1.3.6...1.3.7) (2021-05-12)

### Fixed
- The legacy detect function should return UTF-8-SIG if sig is present in the payload. (PR #38)

## [1.3.6](https://github.com/Ousret/charset_normalizer/compare/1.3.5...1.3.6) (2021-02-09)

### Changed
- Amend the previous release to allow prettytable 2.0 (PR #35)

## [1.3.5](https://github.com/Ousret/charset_normalizer/compare/1.3.4...1.3.5) (2021-02-08)

### Fixed
- Fix error while using the package with a python pre-release interpreter (PR #33)

### Changed
- Dependencies refactoring, constraints revised.

### Added
- Add python 3.9 and 3.10 to the supported interpreters

MIT License

Copyright (c) 2019 TAHRI Ahmed R.

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.