charset_normalizer-3.3.2.dist-info/METADATA @ 7:5eb2d5e3bf22 (parent: 6:b2745907b1eb)
planemo upload for repository https://toolrepo.galaxytrakr.org/view/jpayne/bioproject_to_srr_2/556cac4fb538

author: jpayne
date: Sun, 05 May 2024 23:32:17 -0400
Metadata-Version: 2.1
Name: charset-normalizer
Version: 3.3.2
Summary: The Real First Universal Charset Detector. Open, modern and actively maintained alternative to Chardet.
Home-page: https://github.com/Ousret/charset_normalizer
Author: Ahmed TAHRI
Author-email: ahmed.tahri@cloudnursery.dev
License: MIT
Project-URL: Bug Reports, https://github.com/Ousret/charset_normalizer/issues
Project-URL: Documentation, https://charset-normalizer.readthedocs.io/en/latest
Keywords: encoding,charset,charset-detector,detector,normalization,unicode,chardet,detect
Classifier: Development Status :: 5 - Production/Stable
Classifier: License :: OSI Approved :: MIT License
Classifier: Intended Audience :: Developers
Classifier: Topic :: Software Development :: Libraries :: Python Modules
Classifier: Operating System :: OS Independent
Classifier: Programming Language :: Python
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.7
Classifier: Programming Language :: Python :: 3.8
Classifier: Programming Language :: Python :: 3.9
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Programming Language :: Python :: Implementation :: PyPy
Classifier: Topic :: Text Processing :: Linguistic
Classifier: Topic :: Utilities
Classifier: Typing :: Typed
Requires-Python: >=3.7.0
Description-Content-Type: text/markdown
License-File: LICENSE
Provides-Extra: unicode_backport

<h1 align="center">Charset Detection, for Everyone 👋</h1>

<p align="center">
  <sup>The Real First Universal Charset Detector</sup><br>
  <a href="https://pypi.org/project/charset-normalizer">
    <img src="https://img.shields.io/pypi/pyversions/charset_normalizer.svg?orange=blue" />
  </a>
  <a href="https://pepy.tech/project/charset-normalizer/">
    <img alt="Download Count Total" src="https://static.pepy.tech/badge/charset-normalizer/month" />
  </a>
  <a href="https://bestpractices.coreinfrastructure.org/projects/7297">
    <img src="https://bestpractices.coreinfrastructure.org/projects/7297/badge">
  </a>
</p>
<p align="center">
  <sup><i>Featured Packages</i></sup><br>
  <a href="https://github.com/jawah/niquests">
    <img alt="Static Badge" src="https://img.shields.io/badge/Niquests-HTTP_1.1%2C%202%2C_and_3_Client-cyan">
  </a>
  <a href="https://github.com/jawah/wassima">
    <img alt="Static Badge" src="https://img.shields.io/badge/Wassima-Certifi_Killer-cyan">
  </a>
</p>
<p align="center">
  <sup><i>In other languages (unofficial ports, by the community)</i></sup><br>
  <a href="https://github.com/nickspring/charset-normalizer-rs">
    <img alt="Static Badge" src="https://img.shields.io/badge/Rust-red">
  </a>
</p>

> A library that helps you read text from an unknown charset encoding.<br /> Motivated by `chardet`,
> I'm trying to resolve the issue by taking a new approach.
> All IANA character set names for which the Python core library provides codecs are supported.

<p align="center">
  >>>>> <a href="https://charsetnormalizerweb.ousret.now.sh" target="_blank">👉 Try Me Online Now, Then Adopt Me 👈</a> <<<<<
</p>

This project offers you an alternative to **Universal Charset Encoding Detector**, also known as **Chardet**.

| Feature                                          | [Chardet](https://github.com/chardet/chardet) |                                          Charset Normalizer                                          | [cChardet](https://github.com/PyYoshi/cChardet) |
|--------------------------------------------------|:---------------------------------------------:|:--------------------------------------------------------------------------------------------------:|:-----------------------------------------------:|
| `Fast`                                           |                      ❌                       |                                                 ✅                                                 |                       ✅                        |
| `Universal**`                                    |                      ❌                       |                                                 ✅                                                 |                       ❌                        |
| `Reliable` **without** distinguishable standards |                      ❌                       |                                                 ✅                                                 |                       ✅                        |
| `Reliable` **with** distinguishable standards    |                      ✅                       |                                                 ✅                                                 |                       ✅                        |
| `License`                                        |           LGPL-2.1<br>_restrictive_           |                                                MIT                                                 |            MPL-1.1<br>_restrictive_             |
| `Native Python`                                  |                      ✅                       |                                                 ✅                                                 |                       ❌                        |
| `Detect spoken language`                         |                      ❌                       |                                                 ✅                                                 |                       N/A                       |
| `UnicodeDecodeError Safety`                      |                      ❌                       |                                                 ✅                                                 |                       ❌                        |
| `Whl Size (min)`                                 |                   193.6 kB                    |                                               42 kB                                                |                     ~200 kB                     |
| `Supported Encoding`                             |                      33                       | 🎉 [99](https://charset-normalizer.readthedocs.io/en/latest/user/support.html#supported-encodings) |                       40                        |

<p align="center">
<img src="https://i.imgflip.com/373iay.gif" alt="Reading Normalized Text" width="226"/><img src="https://media.tenor.com/images/c0180f70732a18b4965448d33adba3d0/tenor.gif" alt="Cat Reading Text" width="200"/>
</p>

*\*\*: They clearly use specific code for a specific encoding, even if it covers most of the encodings in use.*<br>
Did you get there because of the logs? See [https://charset-normalizer.readthedocs.io/en/latest/user/miscellaneous.html](https://charset-normalizer.readthedocs.io/en/latest/user/miscellaneous.html)

## ⚡ Performance

This package offers better performance than its counterpart Chardet. Here are some numbers.

| Package                                       | Accuracy | Mean per file (ms) | File per sec (est) |
|-----------------------------------------------|:--------:|:------------------:|:------------------:|
| [chardet](https://github.com/chardet/chardet) |   86 %   |       200 ms       |     5 file/sec     |
| charset-normalizer                            | **98 %** |     **10 ms**      |    100 file/sec    |

| Package                                       | 99th percentile | 95th percentile | 50th percentile |
|-----------------------------------------------|:---------------:|:---------------:|:---------------:|
| [chardet](https://github.com/chardet/chardet) |     1200 ms     |     287 ms      |      23 ms      |
| charset-normalizer                            |     100 ms      |      50 ms      |      5 ms       |

Chardet's performance on larger files (1 MB+) is very poor. Expect a huge difference on large payloads.

> Stats are generated using 400+ files with default parameters. For details on the files used, see the GHA workflows.
> And yes, these results might change at any time. The dataset can be updated to include more files.
> The actual delays depend heavily on your CPU capabilities. The factors should remain the same.
> Keep in mind that the stats are generous and that Chardet's accuracy versus ours is measured using Chardet's initial capability
> (e.g. supported encodings). Challenge them if you want.

## ✨ Installation

Using pip:

```sh
pip install charset-normalizer -U
```

## 🚀 Basic Usage

### CLI
This package comes with a CLI.

```
usage: normalizer [-h] [-v] [-a] [-n] [-m] [-r] [-f] [-t THRESHOLD]
                  file [file ...]

The Real First Universal Charset Detector. Discover originating encoding used
on text file. Normalize text to unicode.

positional arguments:
  files                 File(s) to be analysed

optional arguments:
  -h, --help            show this help message and exit
  -v, --verbose         Display complementary information about file if any.
                        Stdout will contain logs about the detection process.
  -a, --with-alternative
                        Output complementary possibilities if any. Top-level
                        JSON WILL be a list.
  -n, --normalize       Permit to normalize input file. If not set, program
                        does not write anything.
  -m, --minimal         Only output the charset detected to STDOUT. Disabling
                        JSON output.
  -r, --replace         Replace file when trying to normalize it instead of
                        creating a new one.
  -f, --force           Replace file without asking if you are sure, use this
                        flag with caution.
  -t THRESHOLD, --threshold THRESHOLD
                        Define a custom maximum amount of chaos allowed in
                        decoded content. 0. <= chaos <= 1.
  --version             Show version information and exit.
```

```bash
normalizer ./data/sample.1.fr.srt
```

or

```bash
python -m charset_normalizer ./data/sample.1.fr.srt
```

🎉 Since version 1.4.0, the CLI produces an easily usable stdout result in JSON format.

```json
{
    "path": "/home/default/projects/charset_normalizer/data/sample.1.fr.srt",
    "encoding": "cp1252",
    "encoding_aliases": [
        "1252",
        "windows_1252"
    ],
    "alternative_encodings": [
        "cp1254",
        "cp1256",
        "cp1258",
        "iso8859_14",
        "iso8859_15",
        "iso8859_16",
        "iso8859_3",
        "iso8859_9",
        "latin_1",
        "mbcs"
    ],
    "language": "French",
    "alphabets": [
        "Basic Latin",
        "Latin-1 Supplement"
    ],
    "has_sig_or_bom": false,
    "chaos": 0.149,
    "coherence": 97.152,
    "unicode_path": null,
    "is_preferred": true
}
```

### Python
*Just print out normalized text*
```python
from charset_normalizer import from_path

results = from_path('./my_subtitle.srt')

print(str(results.best()))
```
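
`from_bytes` works the same way on in-memory payloads; here is a minimal sketch (the sample payload and its cp1252 encoding are assumptions chosen purely for illustration):

```python
from charset_normalizer import from_bytes

# Illustrative payload, deliberately encoded as cp1252.
payload = "Bonjour, ceci est un essai très simple.".encode("cp1252")

best_guess = from_bytes(payload).best()  # a CharsetMatch, or None if nothing fits

if best_guess is not None:
    print(best_guess.encoding)  # name of the codec that was retained
    print(str(best_guess))      # the payload decoded with that codec
```

Note that several single-byte code pages may decode such a payload identically, so the retained codec name can legitimately vary.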

*Upgrade your code without effort*
```python
from charset_normalizer import detect
```

The above code will behave the same as **chardet**. We ensure that we offer the best (reasonable) backward-compatible result possible.
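
For instance, a chardet-style call looks like this (the sample bytes are an assumption, and exact confidence values vary between versions):

```python
from charset_normalizer import detect

# Same call shape as chardet.detect(): returns a dict with
# 'encoding', 'language' and 'confidence' keys.
result = detect("Hello, world!".encode("utf_8"))
print(result["encoding"], result["confidence"])
```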

See the docs for advanced usage: [readthedocs.io](https://charset-normalizer.readthedocs.io/en/latest/)

## 😇 Why

When I started using Chardet, I noticed that it did not meet my expectations, and I wanted to propose a
reliable alternative using a completely different method. Also! I never back down from a good challenge!

I **don't care** about the **originating charset** encoding, because **two different tables** can
produce **two identical rendered strings.**
What I want is to get readable text, the best I can.

In a way, **I'm brute forcing text decoding.** How cool is that? 😎

Don't confuse the package **ftfy** with charset-normalizer or chardet. ftfy's goal is to repair broken Unicode strings, whereas charset-normalizer converts a raw file in an unknown encoding to Unicode.

## 🍰 How

- Discard all charset encoding tables that could not fit the binary content.
- Measure the noise, or the mess, once opened (chunk by chunk) with a corresponding charset encoding.
- Extract the matches with the lowest mess detected.
- Additionally, we measure coherence / probe for a language.
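
The steps above can be sketched with the standard library alone. This deliberately naive sketch is far cruder than the library's real scoring; the candidate list and the `mess_ratio` heuristic are made up for the example:

```python
from typing import Optional, Tuple

# Step 1 candidates: codecs we will try against the raw payload.
CANDIDATES = ["utf_8", "cp1252", "latin_1", "utf_16"]

def mess_ratio(text: str) -> float:
    """Crude noise measure: share of unprintable control characters."""
    if not text:
        return 0.0
    bad = sum(1 for ch in text if ord(ch) < 32 and ch not in "\r\n\t")
    return bad / len(text)

def naive_best_decoding(payload: bytes) -> Optional[Tuple[str, str]]:
    results = []
    for codec in CANDIDATES:
        try:
            text = payload.decode(codec)  # discard tables that cannot fit
        except (UnicodeDecodeError, LookupError):
            continue
        results.append((mess_ratio(text), codec, text))  # measure the mess
    if not results:
        return None
    _, codec, text = min(results, key=lambda r: r[0])  # keep the lowest mess
    return codec, text
```

On ties, `min` keeps the first candidate tried, which is why a sensible candidate ordering matters even in this toy version.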

**Wait a minute**, what are noise/mess and coherence according to **YOU?**

*Noise:* I opened hundreds of text files, **written by humans**, with the wrong encoding table. **I observed**, then
**I established** some ground rules about **what is obvious** when **it seems like** a mess.
I know that my interpretation of what noise is, is probably incomplete. Feel free to contribute in order to
improve or rewrite it.

*Coherence:* For each language on earth, we have computed ranked letter-appearance occurrences (the best we can). So I thought
that intel is worth something here. So I use those records against decoded text to check if I can detect intelligent design.
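
That coherence idea can be illustrated with a toy probe. The English letter ranking below is a rough assumption for the example, not the project's actual frequency data:

```python
from collections import Counter

# Rough English letter ranking, most frequent first (illustrative only).
ENGLISH_RANK = "etaoinshrdlcumwfgypbvkjxqz"

def coherence_score(text: str) -> float:
    """Share of the text's ten most common letters that rank high in English."""
    letters = [ch for ch in text.lower() if ch.isalpha()]
    if not letters:
        return 0.0
    top = [ch for ch, _ in Counter(letters).most_common(10)]
    return sum(1 for ch in top if ch in ENGLISH_RANK[:13]) / len(top)
```

A real implementation keeps such ranked records for many languages and scores every surviving decoding against each of them.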

## ⚡ Known limitations

- Language detection is unreliable when the text contains two or more languages sharing identical letters. (e.g. HTML (English tags) + Turkish content (sharing Latin characters))
- Every charset detector heavily depends on sufficient content. In common cases, do not bother running detection on very small content.

## ⚠️ About Python EOLs

**If you are running:**

- Python >=2.7,<3.5: Unsupported
- Python 3.5: charset-normalizer < 2.1
- Python 3.6: charset-normalizer < 3.1
- Python 3.7: charset-normalizer < 4.0

Upgrade your Python interpreter as soon as possible.

## 👤 Contributing

Contributions, issues and feature requests are very much welcome.<br />
Feel free to check the [issues page](https://github.com/ousret/charset_normalizer/issues) if you want to contribute.

## 📝 License

Copyright © [Ahmed TAHRI @Ousret](https://github.com/Ousret).<br />
This project is [MIT](https://github.com/Ousret/charset_normalizer/blob/master/LICENSE) licensed.

Character frequencies used in this project © 2012 [Denny Vrandečić](http://simia.net/letters/)

## 💼 For Enterprise

Professional support for charset-normalizer is available as part of the [Tidelift
Subscription][1]. Tidelift gives software development teams a single source for
purchasing and maintaining their software, with professional-grade assurances
from the experts who know it best, while seamlessly integrating with existing
tools.

[1]: https://tidelift.com/subscription/pkg/pypi-charset-normalizer?utm_source=pypi-charset-normalizer&utm_medium=readme

# Changelog
All notable changes to charset-normalizer will be documented in this file. This project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).
The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/).

## [3.3.2](https://github.com/Ousret/charset_normalizer/compare/3.3.1...3.3.2) (2023-10-31)

### Fixed
- Unintentional memory-usage regression when using a large payload that matches several encodings (#376)
- Regression in some detection cases showcased in the documentation (#371)

### Added
- Noise (md) probe that identifies malformed Arabic representation due to the presence of letters in isolated form (credit to my wife)

## [3.3.1](https://github.com/Ousret/charset_normalizer/compare/3.3.0...3.3.1) (2023-10-22)

### Changed
- Optional mypyc compilation upgraded to version 1.6.1 for Python >= 3.8
- Improved the general detection reliability based on reports from the community

## [3.3.0](https://github.com/Ousret/charset_normalizer/compare/3.2.0...3.3.0) (2023-09-30)

### Added
- Allow executing the CLI (e.g. normalizer) through `python -m charset_normalizer.cli` or `python -m charset_normalizer`
- Support for 9 forgotten encodings that are supported by Python but unlisted in `encoding.aliases` as they have no alias (#323)

### Removed
- (internal) Redundant utils.is_ascii function and unused function is_private_use_only
- (internal) charset_normalizer.assets is moved inside charset_normalizer.constant

### Changed
- (internal) Unicode code blocks in constants are updated using the latest v15.0.0 definition to improve detection
- Optional mypyc compilation upgraded to version 1.5.1 for Python >= 3.8

### Fixed
- Unable to properly sort CharsetMatch when both chaos/noise and coherence were close, due to an unreachable condition in \_\_lt\_\_ (#350)

## [3.2.0](https://github.com/Ousret/charset_normalizer/compare/3.1.0...3.2.0) (2023-06-07)

### Changed
- Typehint for function `from_path` no longer enforces `PathLike` as its first argument
- Minor improvement over the global detection reliability

### Added
- Introduce function `is_binary` that relies on the main capabilities, optimized to detect binaries
- Propagate the `enable_fallback` argument throughout `from_bytes`, `from_path`, and `from_fp`, allowing deeper control over the detection (default True)
- Explicit support for Python 3.12

### Fixed
- Edge-case detection failure where a file would contain a 'very-long' camel-cased word (Issue #289)

## [3.1.0](https://github.com/Ousret/charset_normalizer/compare/3.0.1...3.1.0) (2023-03-06)

### Added
- Argument `should_rename_legacy` for the legacy function `detect`; it also disregards any new arguments without errors (PR #262)

### Removed
- Support for Python 3.6 (PR #260)

### Changed
- Optional speedup provided by mypy/c 1.0.1

## [3.0.1](https://github.com/Ousret/charset_normalizer/compare/3.0.0...3.0.1) (2022-11-18)

### Fixed
- Multi-byte cutter/chunk generator did not always cut correctly (PR #233)

### Changed
- Speedup provided by mypy/c 0.990 on Python >= 3.7

## [3.0.0](https://github.com/Ousret/charset_normalizer/compare/2.1.1...3.0.0) (2022-10-20)

### Added
- Extend the capability of explain=True when cp_isolation contains at most two entries (min. one); it will log the details of the mess-detector results
- Support for alternative language frequency sets in charset_normalizer.assets.FREQUENCIES
- Add parameter `language_threshold` in `from_bytes`, `from_path` and `from_fp` to adjust the minimum expected coherence ratio
- `normalizer --version` now specifies whether the current version provides extra speedup (meaning a mypyc-compiled whl)

### Changed
- Build with static metadata using the 'build' frontend
- Make the language detection stricter
- Optional: Module `md.py` can be compiled using mypyc to provide an extra speedup, up to 4x faster than v2.1

### Fixed
- CLI with the --normalize option failed when using a full path for files
- TooManyAccentuatedPlugin induced false positives in the mess detection when too few alpha characters had been fed to it
- Sphinx warnings when generating the documentation

### Removed
- The coherence detector no longer returns 'Simple English'; it returns 'English' instead
- The coherence detector no longer returns 'Classical Chinese'; it returns 'Chinese' instead
- Breaking: Methods `first()` and `best()` from CharsetMatch
- UTF-7 will no longer appear as "detected" without a recognized SIG/mark (it is unreliable / conflicts with ASCII)
- Breaking: Class aliases CharsetDetector, CharsetDoctor, CharsetNormalizerMatch and CharsetNormalizerMatches
- Breaking: Top-level function `normalize`
- Breaking: Properties `chaos_secondary_pass`, `coherence_non_latin` and `w_counter` from CharsetMatch
- Support for the backport `unicodedata2`

## [3.0.0rc1](https://github.com/Ousret/charset_normalizer/compare/3.0.0b2...3.0.0rc1) (2022-10-18)

### Added
- Extend the capability of explain=True when cp_isolation contains at most two entries (min. one); it will log the details of the mess-detector results
- Support for alternative language frequency sets in charset_normalizer.assets.FREQUENCIES
- Add parameter `language_threshold` in `from_bytes`, `from_path` and `from_fp` to adjust the minimum expected coherence ratio

### Changed
- Build with static metadata using the 'build' frontend
- Make the language detection stricter

### Fixed
- CLI with the --normalize option failed when using a full path for files
- TooManyAccentuatedPlugin induced false positives in the mess detection when too few alpha characters had been fed to it

### Removed
- The coherence detector no longer returns 'Simple English'; it returns 'English' instead
- The coherence detector no longer returns 'Classical Chinese'; it returns 'Chinese' instead

## [3.0.0b2](https://github.com/Ousret/charset_normalizer/compare/3.0.0b1...3.0.0b2) (2022-08-21)

### Added
- `normalizer --version` now specifies whether the current version provides extra speedup (meaning a mypyc-compiled whl)

### Removed
- Breaking: Methods `first()` and `best()` from CharsetMatch
- UTF-7 will no longer appear as "detected" without a recognized SIG/mark (it is unreliable / conflicts with ASCII)

### Fixed
- Sphinx warnings when generating the documentation

## [3.0.0b1](https://github.com/Ousret/charset_normalizer/compare/2.1.0...3.0.0b1) (2022-08-15)

### Changed
- Optional: Module `md.py` can be compiled using mypyc to provide an extra speedup, up to 4x faster than v2.1

### Removed
- Breaking: Class aliases CharsetDetector, CharsetDoctor, CharsetNormalizerMatch and CharsetNormalizerMatches
- Breaking: Top-level function `normalize`
- Breaking: Properties `chaos_secondary_pass`, `coherence_non_latin` and `w_counter` from CharsetMatch
- Support for the backport `unicodedata2`

## [2.1.1](https://github.com/Ousret/charset_normalizer/compare/2.1.0...2.1.1) (2022-08-19)

### Deprecated
- Function `normalize` scheduled for removal in 3.0

### Changed
- Removed a useless call to decode in fn is_unprintable (#206)

### Fixed
- Third-party library (i18n xgettext) crashing when not recognizing utf_8 (PEP 263) with underscore, from [@aleksandernovikov](https://github.com/aleksandernovikov) (#204)

## [2.1.0](https://github.com/Ousret/charset_normalizer/compare/2.0.12...2.1.0) (2022-06-19)

### Added
- Output the Unicode table version when running the CLI with `--version` (PR #194)

### Changed
- Re-use the decoded buffer for single-byte character sets, from [@nijel](https://github.com/nijel) (PR #175)
- Fixed some performance bottlenecks, from [@deedy5](https://github.com/deedy5) (PR #183)

### Fixed
- Workaround for a potential bug in CPython: Zero Width No-Break Space located in Arabic Presentation Forms-B, Unicode 1.1, not acknowledged as space (PR #175)
- CLI default threshold aligned with the API threshold, from [@oleksandr-kuzmenko](https://github.com/oleksandr-kuzmenko) (PR #181)

### Removed
- Support for Python 3.5 (PR #192)

### Deprecated
- Use of the backport unicodedata from `unicodedata2`, as Python is quickly catching up; scheduled for removal in 3.0 (PR #194)

## [2.0.12](https://github.com/Ousret/charset_normalizer/compare/2.0.11...2.0.12) (2022-02-12)

### Fixed
- ASCII mis-detection in rare cases (PR #170)

## [2.0.11](https://github.com/Ousret/charset_normalizer/compare/2.0.10...2.0.11) (2022-01-30)

### Added
- Explicit support for Python 3.11 (PR #164)

### Changed
- The logging behavior has been completely reviewed, now using only TRACE and DEBUG levels (PR #163 #165)

## [2.0.10](https://github.com/Ousret/charset_normalizer/compare/2.0.9...2.0.10) (2022-01-04)

### Fixed
- Fallback match entries might lead to UnicodeDecodeError for large byte sequences (PR #154)

### Changed
- Skipping the language detection (CD) on ASCII (PR #155)

## [2.0.9](https://github.com/Ousret/charset_normalizer/compare/2.0.8...2.0.9) (2021-12-03)

### Changed
- Moderating the logging impact (since 2.0.8) for specific environments (PR #147)

### Fixed
- Wrong logging level applied when setting kwarg `explain` to True (PR #146)

## [2.0.8](https://github.com/Ousret/charset_normalizer/compare/2.0.7...2.0.8) (2021-11-24)
### Changed
- Improvement over Vietnamese detection (PR #126)
- MD improvement on trailing data and long foreign (non-pure-Latin) data (PR #124)
- Efficiency improvements in cd/alphabet_languages, from [@adbar](https://github.com/adbar) (PR #122)
- Call sum() without an intermediary list, following PEP 289 recommendations, from [@adbar](https://github.com/adbar) (PR #129)
- Code style as refactored by Sourcery-AI (PR #131)
- Minor adjustment of the MD around European words (PR #133)
- Remove and replace SRTs from assets / tests (PR #139)
- Initialize the library logger with a `NullHandler` by default, from [@nmaynes](https://github.com/nmaynes) (PR #135)
- Setting kwarg `explain` to True will provisionally add (bound to the function's lifespan) a specific stream handler (PR #135)

### Fixed
- Fix large (misleading) sequences giving UnicodeDecodeError (PR #137)
- Avoid using too-insignificant chunks (PR #137)

### Added
- Add and expose function `set_logging_handler` to configure a specific StreamHandler, from [@nmaynes](https://github.com/nmaynes) (PR #135)
- Add `CHANGELOG.md` entries; the format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/) (PR #141)

## [2.0.7](https://github.com/Ousret/charset_normalizer/compare/2.0.6...2.0.7) (2021-10-11)
### Added
- Add support for Kazakh (Cyrillic) language detection (PR #109)

### Changed
- Further improved inferring the language from a given single-byte code page (PR #112)
- Vainly trying to leverage PEP 263 when PEP 3120 is not supported (PR #116)
- Refactoring for potential performance improvements in loops, from [@adbar](https://github.com/adbar) (PR #113)
- Various detection improvements (MD+CD) (PR #117)

### Removed
- Remove redundant logging entry about detected language(s) (PR #115)

### Fixed
- Fix a minor inconsistency between Python 3.5 and other versions regarding language detection (PR #117 #102)

## [2.0.6](https://github.com/Ousret/charset_normalizer/compare/2.0.5...2.0.6) (2021-09-18)
### Fixed
- Unforeseen regression with the loss of backward compatibility with some older minor versions of Python 3.5.x (PR #100)
- Fix CLI crash when using --minimal output in certain cases (PR #103)

### Changed
- Minor improvement to the detection efficiency (less than 1%) (PR #106 #101)

## [2.0.5](https://github.com/Ousret/charset_normalizer/compare/2.0.4...2.0.5) (2021-09-14)
### Changed
- The project now complies with flake8, mypy, isort and black to ensure better overall quality (PR #81)
- The BC support with v1.x was improved; the old staticmethods are restored (PR #82)
- The Unicode detection is slightly improved (PR #93)
- Add syntax sugar \_\_bool\_\_ for the results CharsetMatches list-container (PR #91)

### Removed
- The project no longer raises a warning on tiny content given for detection; it is simply logged as a warning instead (PR #92)

### Fixed
- In some rare cases, the chunk extractor could cut in the middle of a multi-byte character and mislead the mess detection (PR #95)
- Some rare 'space' characters could trip up the UnprintablePlugin/mess detection (PR #96)
- The MANIFEST.in was not exhaustive (PR #78)

## [2.0.4](https://github.com/Ousret/charset_normalizer/compare/2.0.3...2.0.4) (2021-07-30)
### Fixed
- The CLI no longer raises an unexpected exception when no encoding has been found (PR #70)
- Fix accessing the 'alphabets' property when the payload contains surrogate characters (PR #68)
- The logger could mislead (explain=True) on detected languages and the impact of one MBCS match (PR #72)
- Submatch factoring could be wrong in rare edge cases (PR #72)
- Multiple files given to the CLI were ignored when publishing results to STDOUT (after the first path) (PR #72)
- Fix line endings from CRLF to LF for certain project files (PR #67)

### Changed
- Adjust the MD to lower the sensitivity, thus improving the global detection reliability (PR #69 #76)
- Allow fallback on a specified encoding if any (PR #71)

## [2.0.3](https://github.com/Ousret/charset_normalizer/compare/2.0.2...2.0.3) (2021-07-16)
### Changed
- Part of the detection mechanism has been improved to be less sensitive, resulting in more accurate detection results. Especially ASCII. (PR #63)
- According to the community's wishes, the detection will fall back on ASCII or UTF-8 as a last resort. (PR #64)

## [2.0.2](https://github.com/Ousret/charset_normalizer/compare/2.0.1...2.0.2) (2021-07-15)
### Fixed
- Empty/too-small JSON payload mis-detection fixed. Report from [@tseaver](https://github.com/tseaver) (PR #59)

### Changed
- Don't inject unicodedata2 into sys.modules, from [@akx](https://github.com/akx) (PR #57)

## [2.0.1](https://github.com/Ousret/charset_normalizer/compare/2.0.0...2.0.1) (2021-07-13)
### Fixed
- Make it work where there isn't a filesystem available, dropping the assets frequencies.json. Report from [@sethmlarson](https://github.com/sethmlarson). (PR #55)
- Using explain=False permanently disabled the verbose output in the current runtime (PR #47)
- One log entry (language target preemptive) was not shown in logs when using explain=True (PR #47)
- Fix undesired exception (ValueError) on getitem of instance CharsetMatches (PR #52)

### Changed
- The public function normalize's default argument values were not aligned with from_bytes (PR #53)

### Added
- You may now use charset aliases in the cp_isolation and cp_exclusion arguments (PR #47)

587 ## [2.0.0](https://github.com/Ousret/charset_normalizer/compare/1.4.1...2.0.0) (2021-07-02) | |
588 ### Changed | |
589 - 4x to 5 times faster than the previous 1.4.0 release. At least 2x faster than Chardet. | |
590 - Accent has been made on UTF-8 detection, should perform rather instantaneous. | |
591 - The backward compatibility with Chardet has been greatly improved. The legacy detect function returns an identical charset name whenever possible. | |
592 - The detection mechanism has been slightly improved, now Turkish content is detected correctly (most of the time) | |
593 - The program has been rewritten to ease the readability and maintainability. (+Using static typing)+ | |
594 - utf_7 detection has been reinstated. | |
595 | |
### Removed
- This package no longer requires anything when used with Python 3.5 (dropped cached_property)
- Removed support for these languages: Catalan, Esperanto, Kazakh, Basque, Volapük, Azeri, Galician, Nynorsk, Macedonian, and Serbo-Croatian.
- The exception hook on UnicodeDecodeError has been removed.

### Deprecated
- Methods coherence_non_latin, w_counter, chaos_secondary_pass of the class CharsetMatch are now deprecated and scheduled for removal in v3.0

### Fixed
- The CLI output used the relative path of the file(s); it now uses the absolute path.

## [1.4.1](https://github.com/Ousret/charset_normalizer/compare/1.4.0...1.4.1) (2021-05-28)
### Fixed
- Logger configuration/usage no longer conflicts with others (PR #44)

## [1.4.0](https://github.com/Ousret/charset_normalizer/compare/1.3.9...1.4.0) (2021-05-21)
### Removed
- Using standard logging instead of the package loguru.
- Dropped the nose test framework in favor of the maintained pytest.
- Chose not to use the dragonmapper package to help with gibberish Chinese/CJK text.
- Require cached_property only for Python 3.5 due to constraint. Dropped for every other interpreter version.
- Stopped support for UTF-7 payloads that do not contain a SIG.
- Dropped PrettyTable, replaced with pure JSON output in the CLI.

### Fixed
- The BOM marker in a CharsetNormalizerMatch instance could be False in rare cases even if obviously present, due to the sub-match factoring process.
- Not searching properly for the BOM when trying the utf32/16 parent codec.

### Changed
- Improved the package's final size by compressing frequencies.json.
- Huge improvement on the largest payloads.

### Added
- The CLI now produces JSON-consumable output.
- Return ASCII if the given sequences fit, given reasonable confidence.

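A minimal sketch of the ASCII short-circuit noted above (assuming charset_normalizer is installed; the payload is a made-up example): a purely 7-bit payload should be reported as ascii rather than as a broader superset encoding.

```python
from charset_normalizer import from_bytes

# A payload containing only 7-bit characters should match "ascii".
best = from_bytes(b"plain ascii text, nothing fancy").best()
print(best.encoding)  # expected: "ascii"
```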
## [1.3.9](https://github.com/Ousret/charset_normalizer/compare/1.3.8...1.3.9) (2021-05-13)

### Fixed
- In some very rare cases, you could end up getting encode/decode errors due to a bad bytes payload (PR #40)

## [1.3.8](https://github.com/Ousret/charset_normalizer/compare/1.3.7...1.3.8) (2021-05-12)

### Fixed
- An empty payload given for detection could cause an exception when accessing the `alphabets` property. (PR #39)

## [1.3.7](https://github.com/Ousret/charset_normalizer/compare/1.3.6...1.3.7) (2021-05-12)

### Fixed
- The legacy detect function now returns UTF-8-SIG if a SIG is present in the payload. (PR #38)

## [1.3.6](https://github.com/Ousret/charset_normalizer/compare/1.3.5...1.3.6) (2021-02-09)

### Changed
- Amended the previous release to allow prettytable 2.0 (PR #35)

## [1.3.5](https://github.com/Ousret/charset_normalizer/compare/1.3.4...1.3.5) (2021-02-08)

### Fixed
- Fixed an error when using the package with a Python pre-release interpreter (PR #33)

### Changed
- Dependencies refactoring; constraints revised.

### Added
- Added Python 3.9 and 3.10 to the supported interpreters

MIT License

Copyright (c) 2019 TAHRI Ahmed R.

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.