Unicode

Unicode is an encoding standard that provides the basis for processing, storage, and interchange of text data in any language in all modern software and information technology protocols.

Many modern regular expression engines offer at least some support for Unicode.

Basic support for Unicode means the ability to match a literal string of Unicode characters. Some regular expression engines go further and offer character classes and other constructs that cover characters from all Unicode-supported languages.

Match a Specific Code Point

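Use \uFFFF to match a specific Unicode code point, where FFFF is the four-digit hexadecimal number of the code point to match. Java, JavaScript, .NET and Python accept this form; Perl and PCRE use \x{FFFF} instead.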
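As a minimal sketch of the code point escape, the following Java snippet looks for U+00E9 (e with acute accent) in a string. The class and method names are made up for illustration, and Java's built-in java.util.regex engine is assumed.

```java
import java.util.regex.Pattern;

class CodePointMatch {
    // Returns true when the input contains U+00E9 (e with acute accent).
    // The doubled backslash passes the four-digit hex escape through to the
    // regex engine instead of letting the Java compiler expand it first.
    static boolean containsEAcute(String input) {
        return Pattern.compile("\\u00E9").matcher(input).find();
    }

    public static void main(String[] args) {
        System.out.println(containsEAcute("caf\u00E9")); // true
        System.out.println(containsEAcute("cafe"));      // false
    }
}
```

The escape matches one exact code point, so the plain ASCII letter e is not a match.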

Unicode Properties, Scripts, Blocks

Unicode defines classes of characters that have a particular property, belong to a script, or exist within a block.

  • A property is a character's defining characteristic (e.g. being a letter or a number)
  • A script is a writing system (e.g. Latin, Hebrew, etc.)
  • A block is a contiguous range of code points in the Unicode character map

Unicode Properties

The following table defines the Standard Unicode Properties:

Property                              Definition
\p{L} or \p{Letter}                   any kind of letter from any language
\p{Ll} or \p{Lowercase_Letter}        a lowercase letter that has an uppercase variant
\p{Lu} or \p{Uppercase_Letter}        an uppercase letter that has a lowercase variant
\p{Lt} or \p{Titlecase_Letter}        a letter that appears at the start of a word when only the first letter of the word is capitalized
\p{L&} or \p{Letter&}                 a letter that exists in lowercase and uppercase variants (combination of Ll, Lu and Lt)
\p{Lm} or \p{Modifier_Letter}         a special character that is used like a letter
\p{Lo} or \p{Other_Letter}            a letter or ideograph that does not have lowercase and uppercase variants
\p{M} or \p{Mark}                     a character intended to be combined with another character (e.g. accents, umlauts, enclosing boxes, etc.)
\p{Mn} or \p{Non_Spacing_Mark}        a character intended to be combined with another character without taking up extra space (e.g. accents, umlauts, etc.)
\p{Mc} or \p{Spacing_Combining_Mark}  a character intended to be combined with another character that takes up extra space (vowel signs in many Eastern languages)
\p{Me} or \p{Enclosing_Mark}          a character that encloses the character it is combined with (circle, square, keycap, etc.)
\p{Z} or \p{Separator}                any kind of whitespace or invisible separator
\p{Zs} or \p{Space_Separator}         a whitespace character that is invisible, but does take up space
\p{Zl} or \p{Line_Separator}          line separator character U+2028
\p{Zp} or \p{Paragraph_Separator}     paragraph separator character U+2029
\p{S} or \p{Symbol}                   math symbols, currency signs, dingbats, box-drawing characters, etc.
\p{Sm} or \p{Math_Symbol}             any mathematical symbol
\p{Sc} or \p{Currency_Symbol}         any currency sign
\p{Sk} or \p{Modifier_Symbol}         a combining character (mark) as a full character on its own
\p{So} or \p{Other_Symbol}            various symbols that are not math symbols, currency signs, or combining characters
\p{N} or \p{Number}                   any kind of numeric character in any script
\p{Nd} or \p{Decimal_Digit_Number}    a digit zero through nine in any script except ideographic scripts
\p{Nl} or \p{Letter_Number}           a number that looks like a letter, such as a Roman numeral
\p{No} or \p{Other_Number}            a superscript or subscript digit, or a number that is not a digit 0..9 (excluding numbers from ideographic scripts)
\p{P} or \p{Punctuation}              any kind of punctuation character
\p{Pd} or \p{Dash_Punctuation}        any kind of hyphen or dash
\p{Ps} or \p{Open_Punctuation}        any kind of opening bracket
\p{Pe} or \p{Close_Punctuation}       any kind of closing bracket
\p{Pi} or \p{Initial_Punctuation}     any kind of opening quote
\p{Pf} or \p{Final_Punctuation}       any kind of closing quote
\p{Pc} or \p{Connector_Punctuation}   a punctuation character such as an underscore that connects words
\p{Po} or \p{Other_Punctuation}       any kind of punctuation character that is not a dash, bracket, quote or connector
\p{C} or \p{Other}                    invisible control characters and unused code points
\p{Cc} or \p{Control}                 an ASCII 0x00..0x1F, 0x7F or Latin-1 0x80..0x9F control character
\p{Cf} or \p{Format}                  invisible formatting indicator
\p{Co} or \p{Private_Use}             any code point reserved for private use
\p{Cs} or \p{Surrogate}               one half of a surrogate pair in UTF-16 encoding
\p{Cn} or \p{Unassigned}              any code point to which no character has been assigned
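To make the property classes more concrete, here is a small Java sketch that counts how many characters of a sample string fall into several of the properties listed above. The class name, the count helper and the sample text are illustrative assumptions. Only the short property names are used, because Java's engine is not assumed to accept the long forms such as \p{Letter}.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class PropertyMatch {
    // Counts the non-overlapping matches of the given regex in the input.
    static int count(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        int n = 0;
        while (m.find()) {
            n++;
        }
        return n;
    }

    public static void main(String[] args) {
        // "Zurich costs 42" with u-umlaut (U+00FC) and a euro sign (U+20AC).
        String text = "Z\u00FCrich costs \u20AC42";
        System.out.println(count("\\p{L}", text));  // 11 letters
        System.out.println(count("\\p{Lu}", text)); // 1 uppercase letter
        System.out.println(count("\\p{Ll}", text)); // 10 lowercase letters
        System.out.println(count("\\p{Sc}", text)); // 1 currency sign
        System.out.println(count("\\p{Nd}", text)); // 2 decimal digits
        System.out.println(count("\\p{Zs}", text)); // 2 space separators
    }
}
```

Note that the u-umlaut counts as a lowercase letter even though it lies outside ASCII, which is the point of matching by property rather than by a range such as [a-z].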

Unicode Scripts

Each assigned code point is placed into a script, which is a collection of symbols used to represent textual information in one or more writing systems.

There is also a special script called the Common script, which contains characters that are shared across many scripts, including punctuation, whitespace and other symbols.

Note: Unicode scripts are supported by the PCRE and Perl engines. Support in other engines varies; Java 7 and later, for example, accepts the \p{IsLatin} form.

The following table defines the Unicode Scripts:

Script
\p{Common}
\p{Arabic}
\p{Armenian}
\p{Bengali}
\p{Bopomofo}
\p{Braille}
\p{Buhid}
\p{CanadianAboriginal}
\p{Cherokee}
\p{Cyrillic}
\p{Devanagari}
\p{Ethiopic}
\p{Georgian}
\p{Greek}
\p{Gujarati}
\p{Gurmukhi}
\p{Han}
\p{Hangul}
\p{Hanunoo}
\p{Hebrew}
\p{Hiragana}
\p{Inherited}
\p{Kannada}
\p{Katakana}
\p{Khmer}
\p{Lao}
\p{Latin} or \p{IsLatin}
\p{Limbu}
\p{Malayalam}
\p{Mongolian}
\p{Myanmar}
\p{Ogham}
\p{Oriya}
\p{Runic}
\p{Sinhala}
\p{Syriac}
\p{Tagalog}
\p{Tagbanwa}
\p{TaiLe}
\p{Tamil}
\p{Telugu}
\p{Thaana}
\p{Thai}
\p{Tibetan}
\p{Yi}
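The script classes can be tried out with a short Java sketch like the one below. It assumes Java 7 or later, where script names take the Is prefix, as in \p{IsLatin}. The class and helper names are hypothetical.

```java
import java.util.regex.Pattern;

class ScriptMatch {
    // True when every character of the (non-empty) input belongs to the
    // named script. The "Is" prefix is the form Java's engine expects.
    static boolean isAllScript(String script, String input) {
        return Pattern.matches("\\p{Is" + script + "}+", input);
    }

    public static void main(String[] args) {
        // The Russian word "privet" written in Cyrillic letters.
        String privet = "\u043F\u0440\u0438\u0432\u0435\u0442";
        System.out.println(isAllScript("Latin", "hello"));    // true
        System.out.println(isAllScript("Cyrillic", privet));  // true
        System.out.println(isAllScript("Latin", privet));     // false
        // The exclamation mark belongs to the Common script, not Latin.
        System.out.println(isAllScript("Latin", "hello!"));   // false
        System.out.println(isAllScript("Common", "!?"));      // true
    }
}
```

The last two calls show why the Common script matters: punctuation that appears in Latin text is not itself part of the Latin script.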

Unicode Blocks

Unicode characters are also divided into non-overlapping ranges called blocks, many of which have the same name as one of the scripts because characters of that script are primarily encoded in that block. However, blocks and scripts differ in the following ways:

  1. Blocks are ranges and often contain code points that are unassigned
  2. Characters from the same script may be in several different blocks
  3. Characters from different scripts may be in the same block

The following table defines the Unicode Blocks:

Block Range
\p{InBasic_Latin} U+0000..U+007F
\p{InLatin-1_Supplement} U+0080..U+00FF
\p{InLatin_Extended-A} U+0100..U+017F
\p{InLatin_Extended-B} U+0180..U+024F
\p{InIPA_Extensions} U+0250..U+02AF
\p{InSpacing_Modifier_Letters} U+02B0..U+02FF
\p{InCombining_Diacritical_Marks} U+0300..U+036F
\p{InGreek_and_Coptic} U+0370..U+03FF
\p{InCyrillic} U+0400..U+04FF
\p{InCyrillic_Supplementary} U+0500..U+052F
\p{InArmenian} U+0530..U+058F
\p{InHebrew} U+0590..U+05FF
\p{InArabic} U+0600..U+06FF
\p{InSyriac} U+0700..U+074F
\p{InThaana} U+0780..U+07BF
\p{InDevanagari} U+0900..U+097F
\p{InBengali} U+0980..U+09FF
\p{InGurmukhi} U+0A00..U+0A7F
\p{InGujarati} U+0A80..U+0AFF
\p{InOriya} U+0B00..U+0B7F
\p{InTamil} U+0B80..U+0BFF
\p{InTelugu} U+0C00..U+0C7F
\p{InKannada} U+0C80..U+0CFF
\p{InMalayalam} U+0D00..U+0D7F
\p{InSinhala} U+0D80..U+0DFF
\p{InThai} U+0E00..U+0E7F
\p{InLao} U+0E80..U+0EFF
\p{InTibetan} U+0F00..U+0FFF
\p{InMyanmar} U+1000..U+109F
\p{InGeorgian} U+10A0..U+10FF
\p{InHangul_Jamo} U+1100..U+11FF
\p{InEthiopic} U+1200..U+137F
\p{InCherokee} U+13A0..U+13FF
\p{InUnified_Canadian_Aboriginal_Syllabics} U+1400..U+167F
\p{InOgham} U+1680..U+169F
\p{InRunic} U+16A0..U+16FF
\p{InTagalog} U+1700..U+171F
\p{InHanunoo} U+1720..U+173F
\p{InBuhid} U+1740..U+175F
\p{InTagbanwa} U+1760..U+177F
\p{InKhmer} U+1780..U+17FF
\p{InMongolian} U+1800..U+18AF
\p{InLimbu} U+1900..U+194F
\p{InTai_Le} U+1950..U+197F
\p{InKhmer_Symbols} U+19E0..U+19FF
\p{InPhonetic_Extensions} U+1D00..U+1D7F
\p{InLatin_Extended_Additional} U+1E00..U+1EFF
\p{InGreek_Extended} U+1F00..U+1FFF
\p{InGeneral_Punctuation} U+2000..U+206F
\p{InSuperscripts_and_Subscripts} U+2070..U+209F
\p{InCurrency_Symbols} U+20A0..U+20CF
\p{InCombining_Diacritical_Marks_for_Symbols} U+20D0..U+20FF
\p{InLetterlike_Symbols} U+2100..U+214F
\p{InNumber_Forms} U+2150..U+218F
\p{InArrows} U+2190..U+21FF
\p{InMathematical_Operators} U+2200..U+22FF
\p{InMiscellaneous_Technical} U+2300..U+23FF
\p{InControl_Pictures} U+2400..U+243F
\p{InOptical_Character_Recognition} U+2440..U+245F
\p{InEnclosed_Alphanumerics} U+2460..U+24FF
\p{InBox_Drawing} U+2500..U+257F
\p{InBlock_Elements} U+2580..U+259F
\p{InGeometric_Shapes} U+25A0..U+25FF
\p{InMiscellaneous_Symbols} U+2600..U+26FF
\p{InDingbats} U+2700..U+27BF
\p{InMiscellaneous_Mathematical_Symbols-A} U+27C0..U+27EF
\p{InSupplemental_Arrows-A} U+27F0..U+27FF
\p{InBraille_Patterns} U+2800..U+28FF
\p{InSupplemental_Arrows-B} U+2900..U+297F
\p{InMiscellaneous_Mathematical_Symbols-B} U+2980..U+29FF
\p{InSupplemental_Mathematical_Operators} U+2A00..U+2AFF
\p{InMiscellaneous_Symbols_and_Arrows} U+2B00..U+2BFF
\p{InCJK_Radicals_Supplement} U+2E80..U+2EFF
\p{InKangxi_Radicals} U+2F00..U+2FDF
\p{InIdeographic_Description_Characters} U+2FF0..U+2FFF
\p{InCJK_Symbols_and_Punctuation} U+3000..U+303F
\p{InHiragana} U+3040..U+309F
\p{InKatakana} U+30A0..U+30FF
\p{InBopomofo} U+3100..U+312F
\p{InHangul_Compatibility_Jamo} U+3130..U+318F
\p{InKanbun} U+3190..U+319F
\p{InBopomofo_Extended} U+31A0..U+31BF
\p{InKatakana_Phonetic_Extensions} U+31F0..U+31FF
\p{InEnclosed_CJK_Letters_and_Months} U+3200..U+32FF
\p{InCJK_Compatibility} U+3300..U+33FF
\p{InCJK_Unified_Ideographs_Extension_A} U+3400..U+4DBF
\p{InYijing_Hexagram_Symbols} U+4DC0..U+4DFF
\p{InCJK_Unified_Ideographs} U+4E00..U+9FFF
\p{InYi_Syllables} U+A000..U+A48F
\p{InYi_Radicals} U+A490..U+A4CF
\p{InHangul_Syllables} U+AC00..U+D7AF
\p{InHigh_Surrogates} U+D800..U+DB7F
\p{InHigh_Private_Use_Surrogates} U+DB80..U+DBFF
\p{InLow_Surrogates} U+DC00..U+DFFF
\p{InPrivate_Use_Area} U+E000..U+F8FF
\p{InCJK_Compatibility_Ideographs} U+F900..U+FAFF
\p{InAlphabetic_Presentation_Forms} U+FB00..U+FB4F
\p{InArabic_Presentation_Forms-A} U+FB50..U+FDFF
\p{InVariation_Selectors} U+FE00..U+FE0F
\p{InCombining_Half_Marks} U+FE20..U+FE2F
\p{InCJK_Compatibility_Forms} U+FE30..U+FE4F
\p{InSmall_Form_Variants} U+FE50..U+FE6F
\p{InArabic_Presentation_Forms-B} U+FE70..U+FEFF
\p{InHalfwidth_and_Fullwidth_Forms} U+FF00..U+FFEF
\p{InSpecials} U+FFF0..U+FFFF
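The difference between blocks and scripts can be illustrated with a short Java sketch. It assumes Java's java.util.regex engine, which uses the In prefix for blocks and the Is prefix for scripts. The class and helper names are made up for this example, and only block names without hyphens are used.

```java
import java.util.regex.Pattern;

class BlockMatch {
    // True when the single code point matches the given character class.
    static boolean matches(String regex, int codePoint) {
        String s = new String(Character.toChars(codePoint));
        return Pattern.matches(regex, s);
    }

    public static void main(String[] args) {
        // Characters from different scripts share one block: the digit 1
        // sits in the Basic Latin block, but its script is Common.
        System.out.println(matches("\\p{InBasic_Latin}", '1')); // true
        System.out.println(matches("\\p{IsLatin}", '1'));       // false

        // One script spans several blocks: U+1F00 (Greek small letter
        // alpha with psili) is Greek script, but lives in Greek Extended.
        System.out.println(matches("\\p{IsGreek}", 0x1F00));          // true
        System.out.println(matches("\\p{InGreek}", 0x1F00));          // false
        System.out.println(matches("\\p{InGreek_Extended}", 0x1F00)); // true
    }
}
```

In practice this means a script class is usually the better choice when the goal is "text in language X", while a block class only tests where a code point sits in the character map.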

Notes

Some regex engines use different syntax to match Unicode blocks. Use \p{InBlock} for Perl and Java. Use \p{IsBlock} for .NET.

.NET expects the block name with spaces removed and hyphens kept, as in \p{IsLatinExtended-A}. Perl and Java are more lenient: in \p{InLatin_Extended-A}, each underscore or hyphen in the block's name may be written as an underscore, a hyphen, a space, or nothing at all.

Unicode blocks are not supported in PCRE.



License.
All information of this service is derived from the free sources and is provided solely in the form of quotations. This service provides information and interfaces solely for the familiarization (not ownership) and under the "as is" condition.
Copyright 2016 © ELTASK.COM. All rights reserved.