serpextract Package
serpextract.serpextract Module
Utilities for extracting keyword information from search engine referrers.

serpextract.serpextract.get_parser(referring_url)
Utility function to find a parser for a referring URL if it is a SERP.
Parameters: referring_url (str or urlparse.ParseResult) – Suspected SERP URL.
Returns: SearchEngineParser object if one exists for URL, None otherwise.
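
The contract described above (accept either a string or a parsed URL, return a parser or None) can be sketched with the standard library alone. This is an illustration, not the library's implementation: TOY_ENGINES and toy_get_parser are hypothetical names, the engine table is made up, and urllib.parse is the Python 3 home of the urlparse module.

```python
from urllib.parse import ParseResult, urlparse

# Hypothetical stand-in for the library's engine definitions.
TOY_ENGINES = {
    "www.google.com": "Google",
    "www.bing.com": "Bing",
}


def toy_get_parser(referring_url):
    """Return a toy engine name for a known search host, or None."""
    # Accept either an already-parsed URL or a plain string.
    if isinstance(referring_url, ParseResult):
        url_parts = referring_url
    else:
        url_parts = urlparse(referring_url)
    # Unknown hosts fall through to None, mirroring the documented contract.
    return TOY_ENGINES.get(url_parts.netloc.lower())
```

A caller would typically test the result against None before going any further.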

serpextract.serpextract.is_serp(referring_url, parser=None, use_naive_method=False)
Utility function to determine if a referring URL is a SERP.
Parameters:
- referring_url (str or urlparse.ParseResult) – Suspected SERP URL.
- parser (SearchEngineParser instance or None) – A search engine parser.
- use_naive_method (True or False) – Whether or not to use a naive method of search engine detection in the event that a parser does not exist for the given referring_url. See extract() for more information.
Returns: True if SERP, False otherwise.
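
The naive method that use_naive_method enables is described under extract() as a pattern match on the netloc followed by a lookup of a few common query parameters. A minimal sketch of that idea follows. NAIVE_RE_PATTERN and NAIVE_PARAMS are assumed values chosen for illustration; the library's actual _naive_re_pattern and _naive_params may differ, and naive_keyword and naive_is_serp are hypothetical helper names.

```python
import re
from urllib.parse import parse_qs, urlparse

# Assumed values for illustration only; the real _naive_re_pattern and
# _naive_params in the library may be different.
NAIVE_RE_PATTERN = re.compile(r"\.?search\.")
NAIVE_PARAMS = ("q", "query", "k", "keyword", "term")


def naive_keyword(url):
    """Guess a keyword from a search-like URL, or return None."""
    parts = urlparse(url)
    # Step 1: the host must look like a search host.
    if not NAIVE_RE_PATTERN.search(parts.netloc):
        return None
    # Step 2: try each candidate query parameter in order.
    query = parse_qs(parts.query)
    for param in NAIVE_PARAMS:
        values = query.get(param)
        if values:
            return values[0]
    return None


def naive_is_serp(url):
    """Treat a URL as a SERP if a keyword can be guessed from it."""
    return naive_keyword(url) is not None
```

Because this heuristic only inspects the host name and a handful of parameter names, it can produce false positives, which is presumably why it is opt-in.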

serpextract.serpextract.extract(serp_url, parser=None, lower_case=True, trimmed=True, collapse_whitespace=True, use_naive_method=False)
Parse a SERP URL and return information regarding the engine name, keyword and SearchEngineParser.
Parameters:
- serp_url (str or urlparse.ParseResult) – Suspected SERP URL to extract a keyword from.
- parser (SearchEngineParser) – Optionally pass in a parser if already determined via call to get_parser.
- lower_case (True or False) – Lower case the keyword.
- trimmed (True or False) – Trim keyword leading and trailing whitespace.
- collapse_whitespace (True or False) – Collapse 2 or more \s characters into one space ' '.
- use_naive_method (True or False) – In the event that a parser doesn’t exist for the given serp_url, attempt to find an instance of _naive_re_pattern in the netloc of the serp_url. If found, try to extract a keyword using _naive_params.
Returns: an ExtractResult instance if serp_url is valid, None otherwise.
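
The three keyword clean-up flags can be illustrated in isolation. The sketch below assumes the flags are applied in the order lower-case, trim, collapse; that order and the helper name normalize_keyword are assumptions made for this example, not something the documentation above states.

```python
import re


def normalize_keyword(keyword, lower_case=True, trimmed=True,
                      collapse_whitespace=True):
    """Apply the documented clean-up flags to an extracted keyword."""
    if lower_case:
        keyword = keyword.lower()
    if trimmed:
        # Remove leading and trailing whitespace only.
        keyword = keyword.strip()
    if collapse_whitespace:
        # Replace any run of 2 or more whitespace characters with one space.
        keyword = re.sub(r"\s{2,}", " ", keyword)
    return keyword
```

Leaving all three flags at their defaults gives the most aggressive normalization, which helps when grouping keywords that differ only in case or spacing.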

serpextract.serpextract.get_all_query_params()
Return all the possible query string params for all search engines.
Returns: a list of all the unique query string parameters that are used across the search engine definitions.
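
As a rough picture of what "unique query string parameters across the search engine definitions" means, the sketch below deduplicates parameter names from a made-up table. TOY_ENGINE_PARAMS and toy_all_query_params are hypothetical, and sorting the result is a choice made here for determinism rather than documented behavior.

```python
# Made-up engine definitions: host -> query parameters that carry the keyword.
TOY_ENGINE_PARAMS = {
    "google.com": ["q"],
    "bing.com": ["q", "Q"],
    "yahoo.com": ["p", "q"],
}


def toy_all_query_params(definitions):
    """Collect every distinct parameter name used by any engine."""
    unique = set()
    for params in definitions.values():
        unique.update(params)
    # Sorted so the output is stable across runs.
    return sorted(unique)
```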

serpextract.serpextract.get_all_query_params_by_domain()
Return all the possible query string params for all search engines.
Returns: a list of all the unique query string parameters that are used across the search engine definitions.

serpextract.serpextract.add_custom_parser(match_rule, parser)
Add a custom search engine parser to the cached _engines list.
Parameters:
- match_rule (unicode) – A match rule which is used by get_parser() to look up a parser for a given domain/path.
- parser (SearchEngineParser) – A custom parser.
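
The relationship between a match rule and a later lookup can be sketched with a plain dictionary. The format of real match rules is not specified above, so this example assumes a rule is either a bare host or a host plus path; _toy_engines, toy_add_custom_parser and toy_lookup are hypothetical names, and the parsers are represented by simple strings.

```python
from urllib.parse import urlparse

# Hypothetical cache standing in for the library's _engines.
_toy_engines = {}


def toy_add_custom_parser(match_rule, parser):
    """Register a parser under a match rule (assumed: host or host+path)."""
    _toy_engines[match_rule] = parser


def toy_lookup(url):
    """Find a registered parser, preferring the more specific host+path rule."""
    parts = urlparse(url)
    for candidate in (parts.netloc + parts.path, parts.netloc):
        if candidate in _toy_engines:
            return _toy_engines[candidate]
    return None
```

Registering "search.example.com" and then looking up a URL on that host returns the registered object, while an unregistered host yields None.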

class serpextract.serpextract.SearchEngineParser(engine_name, keyword_extractor, link_macro, charsets, hidden_keyword_paths=None)
Bases: object
Handles parsing logic for a single line in Piwik’s list of search engines.
Piwik’s list for reference:
https://raw.github.com/piwik/piwik/master/core/DataFiles/SearchEngines.php
This class is not used directly since it already assumes you know the exact search engine you want to use to parse a URL. The main interface for users of this module is the extract() function.
- charsets
- engine_name
- get_serp_url(base_url, keyword)
  Get a URL to a SERP for a given keyword.
  Parameters:
  - base_url (str) – String of format '<scheme>://<netloc>'.
  - keyword (str) – Search engine keyword.
  Returns: a URL that links directly to a SERP for the given keyword.
- keyword_extractor
- link_macro
- parse(url_parts)
  Parse a SERP URL to extract the search keyword.
  Parameters: url_parts (A urlparse.ParseResult with all elements as unicode) – The SERP URL.
  Returns: An ExtractResult instance.
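
To make get_serp_url() more concrete, the sketch below joins a base URL of the form '<scheme>://<netloc>' with a link macro and a URL-encoded keyword. It assumes the macro uses a {k} placeholder for the keyword, modeled on Piwik-style definitions such as 'search?q={k}'; the helper name toy_get_serp_url and the macro format are assumptions, not the confirmed behavior of SearchEngineParser.

```python
from urllib.parse import quote


def toy_get_serp_url(base_url, link_macro, keyword):
    """Build a SERP URL from a base URL, a link macro and a keyword.

    Assumes link_macro contains a {k} placeholder for the keyword.
    """
    # Percent-encode the keyword so spaces and symbols are URL-safe.
    encoded = quote(keyword)
    return "{}/{}".format(base_url, link_macro.format(k=encoded))
```

For instance, a base URL of "http://www.google.com" with the macro "search?q={k}" and the keyword "hello world" would produce "http://www.google.com/search?q=hello%20world".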