regress/lib.rs
/*!

# regress - REGex in Rust with EcmaScript Syntax

This crate provides a regular expression engine which targets EcmaScript (aka JavaScript) regular expression syntax.

# Example: test if a string contains a match

```rust
use regress::Regex;
let re = Regex::new(r"\d{4}").unwrap();
let matched = re.find("2020-20-05").is_some();
assert!(matched);
```

# Example: iterating over matches

Here we use a backreference to find doubled characters:

```rust
use regress::Regex;
let re = Regex::new(r"(\w)\1").unwrap();
let text = "Frankly, Miss Piggy, I don't give a hoot!";
for m in re.find_iter(text) {
    println!("{}", &text[m.range()])
}
// Output: ss
// Output: gg
// Output: oo
```

# Example: using capture groups

Capture groups are available in the `Match` object produced by a successful match.
A capture group is a range of byte indexes into the original string.

```rust
use regress::Regex;
let re = Regex::new(r"(\d{4})").unwrap();
let text = "Today is 2020-20-05";
let m = re.find(text).unwrap();
let group = m.group(1).unwrap();
println!("Year: {}", &text[group]);
// Output: Year: 2020
```
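
Because a group is a byte range rather than a character index, it can slice the original `&str` directly, but its offsets differ from character positions whenever earlier text contains multi-byte characters. A minimal standard-library sketch of this, where the hypothetical `year_range` helper merely stands in for a range that `Match::group` would report:

```rust
use std::ops::Range;

/// Hypothetical helper: returns the byte range of the first run of four
/// ASCII digits, standing in for the range a capture group would report.
fn year_range(text: &str) -> Option<Range<usize>> {
    let bytes = text.as_bytes();
    (0..bytes.len().saturating_sub(3))
        .find(|&i| bytes[i..i + 4].iter().all(u8::is_ascii_digit))
        .map(|i| i..i + 4)
}
```

For the input `"Café 2020-20-05"` this returns `6..10`: the year starts at byte 6 but at character 5, because `é` occupies two bytes in UTF-8.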

# Example: using with Pattern trait (nightly only)

When the `pattern` feature is enabled and using nightly Rust, `Regex` can be used with standard string methods:

```rust,ignore
#![feature(pattern)]
use regress::Regex;
let re = Regex::new(r"\d+").unwrap();
let text = "abc123def456";

// Use with str methods
assert_eq!(text.find(&re), Some(3));
assert!(text.contains(&re));
let parts: Vec<&str> = text.split(&re).collect();
assert_eq!(parts, vec!["abc", "def", ""]);
```

# Example: escaping strings for literal matching

Use the `escape` function to escape special regex characters in a string:

```rust
use regress::{escape, Regex};
let user_input = "How much $ do you have? (in dollars)";
let escaped = escape(user_input);
let re = Regex::new(&escaped).unwrap();
assert!(re.find(user_input).is_some());
```
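
Conceptually, escaping prefixes each character that has special meaning in a pattern with a backslash so that it matches itself. The sketch below illustrates the idea with a hypothetical `escape_sketch` function over the EcmaScript syntax characters; it is an assumption for illustration and not necessarily the character set or the approach that `escape` itself uses:

```rust
/// Hypothetical sketch of regex escaping: prefixes each EcmaScript syntax
/// character with a backslash. This is not regress's actual implementation.
fn escape_sketch(input: &str) -> String {
    // Characters that carry special meaning in a pattern.
    const SYNTAX_CHARS: &str = r"^$\.*+?()[]{}|";
    let mut out = String::with_capacity(input.len());
    for c in input.chars() {
        if SYNTAX_CHARS.contains(c) {
            out.push('\\');
        }
        out.push(c);
    }
    out
}
```

Under this sketch, `a.b` becomes `a\.b`, while text without syntax characters is returned unchanged.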

# Supported Syntax

regress targets ES 2018 syntax. You can refer to the many resources about JavaScript regex syntax.

There are some features which have yet to be implemented:

- Named character classes like `[[:alpha:]]`
- Unicode property escapes like `\p{Sc}`

Note that the parser assumes the `u` (Unicode) flag, as the non-Unicode path is tied to JS's UCS-2 string encoding and the semantics cannot be usefully expressed in Rust.
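
To illustrate why: without the `u` flag, JavaScript matches individual UTF-16 code units, so a lone surrogate is a legitimate string element there, whereas a Rust `&str` must be valid UTF-8 and cannot hold one. A small standard-library sketch (the helper names are hypothetical):

```rust
/// Counts UTF-16 code units, the element JavaScript strings are indexed by.
fn utf16_len(text: &str) -> usize {
    text.encode_utf16().count()
}

/// Returns true if the code units form well-formed UTF-16 and can therefore
/// be converted to a Rust `String`.
fn is_representable(units: &[u16]) -> bool {
    String::from_utf16(units).is_ok()
}
```

U+1F600 is one `char` in Rust but two code units in UTF-16, and either half on its own (for example `0xD83D`) is not representable as a `String`.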

# Unicode remarks

regress supports Unicode case folding. For example:

```rust
use regress::Regex;
let re = Regex::with_flags("\u{00B5}", "i").unwrap();
assert!(re.find("\u{03BC}").is_some());
```

Here the U+00B5 (micro sign) was case-insensitively matched against U+03BC (small letter mu).

regress does NOT perform normalization. For example, e-with-acute-accent can be precomposed or decomposed, and these are treated as not equivalent:

```rust
use regress::Regex;
let re = Regex::new("\u{00E9}").unwrap();
assert!(re.find("\u{0065}\u{0301}").is_none());
```

This agrees with JavaScript semantics. Perform any required normalization before regex matching.
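
The two spellings are different sequences of code points, which is why they do not match. A standard-library sketch that makes the difference visible (the `scalar_values` helper is hypothetical; the standard library has no normalization routine, so normalizing would require an external crate):

```rust
/// Returns the Unicode scalar values of `text` as `u32`s.
fn scalar_values(text: &str) -> Vec<u32> {
    text.chars().map(u32::from).collect()
}
```

`scalar_values("\u{00E9}")` yields `[0xE9]`, while `scalar_values("\u{0065}\u{0301}")` yields `[0x65, 0x301]`.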

## ASCII matching

regress has an "ASCII mode" which treats each 8-bit quantity as a separate character.
This may provide improved performance if you do not need Unicode semantics, because it can avoid decoding UTF-8 and has simpler (ASCII-only) case-folding.

Example:

```rust
use regress::Regex;
let re = Regex::with_flags("BC", "i").unwrap();
assert!(re.find_ascii("abcd").is_some());
```
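
ASCII-only case folding means that only `A-Z` and `a-z` are treated as case variants of each other. The standard library draws the same distinction, which the following sketch uses for contrast; the helper names are hypothetical and lowercasing is only a rough stand-in for Unicode case folding:

```rust
/// Compares two strings using ASCII-only case folding: only `A-Z` and `a-z`
/// are folded, and every other byte must match exactly.
fn ascii_fold_eq(a: &str, b: &str) -> bool {
    a.eq_ignore_ascii_case(b)
}

/// Compares two strings after Unicode-aware lowercasing (used here only
/// for contrast with the ASCII-only comparison above).
fn unicode_lower_eq(a: &str, b: &str) -> bool {
    a.to_lowercase() == b.to_lowercase()
}
```

Both helpers treat `"BC"` and `"bc"` as equal, but only the Unicode-aware one treats U+00C9 and U+00E9 (upper- and lowercase e-with-acute-accent) as equal.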

# Comparison to regex crate

regress supports features (required by the EcmaScript spec) that regex does not, including backreferences and zero-width lookaround assertions.
However, the regex crate provides linear-time matching guarantees, while regress does not. This difference is due
to the architecture: regex uses finite automata while regress uses "classical backtracking."

# Architecture

regress has a parser, intermediate representation, optimizer which acts on the IR, bytecode emitter, and two bytecode interpreters, referred to as "backends".

The primary interpreter is the "classical backtracking" backend, which uses an explicit backtracking stack, similar to JS implementations. There is also the "PikeVM" pseudo-toy backend which is mainly used for testing and verification.
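
As a rough illustration of what backtracking with an explicit stack means, here is a toy matcher for patterns built from literals, `.`, and a postfix `*`. It is a sketch under simplifying assumptions (whole-string matching, no bytecode, hypothetical names throughout) and does not reflect regress's actual interpreter:

```rust
/// One pattern element: a literal or `.`, optionally followed by `*`.
#[derive(Clone, Copy)]
struct Atom {
    ch: char,      // '.' matches any character
    starred: bool, // true if the element is followed by '*'
}

/// Parses a toy pattern into atoms. A leading '*' is treated as a literal.
fn parse(pattern: &str) -> Vec<Atom> {
    let mut atoms = Vec::new();
    let mut chars = pattern.chars().peekable();
    while let Some(ch) = chars.next() {
        let starred = chars.next_if_eq(&'*').is_some();
        atoms.push(Atom { ch, starred });
    }
    atoms
}

/// Returns true if `pattern` matches all of `text`.
fn full_match(pattern: &str, text: &str) -> bool {
    let atoms = parse(pattern);
    let text: Vec<char> = text.chars().collect();
    // Each entry is a saved (atom index, text index) state to resume from.
    let mut stack = vec![(0usize, 0usize)];
    while let Some((pi, ti)) = stack.pop() {
        if pi == atoms.len() {
            if ti == text.len() {
                return true;
            }
            continue; // pattern exhausted but text remains: backtrack
        }
        let atom = atoms[pi];
        let hit = ti < text.len() && (atom.ch == '.' || atom.ch == text[ti]);
        if atom.starred {
            // Fallback: skip the starred atom (pushed first, so tried last).
            stack.push((pi + 1, ti));
            // Greedy: consume one character and stay on the same atom.
            if hit {
                stack.push((pi, ti + 1));
            }
        } else if hit {
            stack.push((pi + 1, ti + 1));
        }
    }
    false
}
```

With this sketch, `full_match("a*a", "aaa")` succeeds only after the greedy `a*` gives back one character, which is the backtracking step. Patterns that force many such retries are the reason a backtracking engine cannot offer linear-time guarantees.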

# Crate features

- **utf16**. When enabled, additional APIs are made available that allow matching text formatted in UTF-16 and UCS-2 (`&[u16]`) without going through a conversion to and from UTF-8 (`&str`) first. This is particularly useful when interacting with and/or (re)implementing existing systems that use those encodings, such as JavaScript, Windows, and the JVM.

- **pattern**. When enabled (nightly only), implements the `std::str::pattern::Pattern` trait for `Regex`, allowing it to be used with standard string methods like `str::find`, `str::contains`, `str::split`, etc.
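
For context on what the **utf16** feature avoids: matching UTF-16 data through the `&str` API means transcoding the input and then translating reported byte offsets back into code-unit offsets. A standard-library sketch of that extra work, with hypothetical helper names:

```rust
/// Converts UTF-16 code units to a `String`, replacing ill-formed sequences
/// (such as lone surrogates) with U+FFFD.
fn utf16_to_string_lossy(units: &[u16]) -> String {
    String::from_utf16_lossy(units)
}

/// Converts a byte offset into `text` (as reported when matching a `&str`)
/// to the equivalent offset in UTF-16 code units.
/// Panics if `byte_offset` is not on a character boundary.
fn utf8_to_utf16_offset(text: &str, byte_offset: usize) -> usize {
    text[..byte_offset].encode_utf16().count()
}
```

Note that the lossy conversion replaces lone surrogates with U+FFFD, so ill-formed UCS-2 input cannot round-trip this way.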

*/

#![cfg_attr(not(feature = "std"), no_std)]
#![cfg_attr(feature = "pattern", feature(pattern))]
#![warn(clippy::all)]
#![allow(
    clippy::upper_case_acronyms,
    clippy::match_like_matches_macro,
    clippy::uninlined_format_args,
    clippy::collapsible_if
)]
// Clippy's manual_range_contains suggestion produces worse codegen.
#![allow(clippy::manual_range_contains)]

#[cfg(not(feature = "std"))]
#[macro_use]
extern crate alloc;

pub use crate::api::*;

#[macro_use]
mod util;

mod api;
mod bytesearch;
mod charclasses;
mod classicalbacktrack;
mod codepointset;
mod cursor;
mod emit;
mod exec;
mod indexing;
mod insn;
mod ir;
mod matchers;
mod optimizer;
mod parse;
mod position;
mod scm;
mod startpredicate;
mod types;
mod unicode;
mod unicodetables;

#[cfg(feature = "backend-pikevm")]
mod pikevm;