added connector folder and HF file #313
base: main
Conversation
❌ Changes requested. Reviewed everything up to 68df99a in 53 seconds
More details
- Looked at 53 lines of code in 1 file
- Skipped 0 files when reviewing
- Skipped posting 0 drafted comments based on config settings

Workflow ID: wflow_g9aSZ6jnsjudJEhx
Want Ellipsis to fix these issues? Tag @ellipsis-dev in a comment. You can customize Ellipsis with 👍 / 👎 feedback, review rules, user-specific overrides, quiet mode, and more.
❌ Changes requested. Incremental review on 152a99e in 36 seconds
More details
- Looked at 29 lines of code in 2 files
- Skipped 0 files when reviewing
- Skipped posting 0 drafted comments based on config settings

Workflow ID: wflow_SGhnothmBoTYYbbD
❌ Changes requested. Incremental review on 2f783ca in 1 minute and 33 seconds
More details
- Looked at 54 lines of code in 3 files
- Skipped 0 files when reviewing
- Skipped posting 0 drafted comments based on config settings

Workflow ID: wflow_qrqWYnwFpCtSGGBI
connectors/huggingface_connecter.py
Outdated
    # Function to fetch data from a Hugging Face dataset
    def fetch_data_from_huggingface(dataset_identifier, dataset_split=None):
Consider adding error handling for network-related exceptions such as requests.exceptions.RequestException to ensure robustness when fetching data from external APIs.
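A minimal sketch of that suggestion might look like the following. The `loader` parameter is a hypothetical injection point standing in for `datasets.load_dataset`, used here only so the error path is easy to demonstrate:

```python
import requests

def fetch_data_from_huggingface(dataset_identifier, dataset_split=None, loader=None):
    """Fetch a dataset, returning None instead of crashing on network failure."""
    try:
        # `loader` stands in for datasets.load_dataset in this sketch
        return loader(dataset_identifier, split=dataset_split)
    except requests.exceptions.RequestException as e:
        # Network-level failures (timeouts, connection errors, HTTP errors)
        print(f"Network error while fetching {dataset_identifier!r}: {e}")
        return None
```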
connectors/huggingface_connecter.py
Outdated
    raise e  # Re-raise other ValueErrors

    # Load function to be used as a connector
    def load(dataset_identifier, dataset_split=None):
Consider also stripping the dataset_split argument to handle potential spaces:
    def load(dataset_identifier, dataset_split=None):
        data = fetch_data_from_huggingface(dataset_identifier.strip(), dataset_split.strip() if dataset_split else None)
examples/HF_example_usage.py
Outdated
    # Takes last two parts of url to get allenai/quartz
    atlas_dataset = huggingface_connecter.load('allenai/quartz')

    atlas_dataset.create_index(topic_model=True, embedding_model='NomicEmbed')
Add error handling around the create_index method to manage potential failures gracefully:
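One possible shape for that handling, sketched with a hypothetical `build_map` helper (the `create_index` arguments are taken from the snippet above; everything else is illustrative):

```python
import logging

def build_map(atlas_dataset):
    """Attempt to build a map, logging rather than crashing on failure."""
    try:
        return atlas_dataset.create_index(topic_model=True, embedding_model='NomicEmbed')
    except Exception as e:
        # Index creation can fail server-side; surface the error and continue
        logging.error("create_index failed: %s", e)
        return None
```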
RLesser
left a comment
some comments, maybe some things to change
connectors/huggingface_connecter.py
Outdated
    entry[key] = str(value)

    dataset.add_data(data=data)
are we making an index here?
Not quite; that part mainly just converts anything that isn't a string, like booleans or lists, into strings, because it will error out if I don't. I didn't really like that it is just a bunch of conditional statements, but I couldn't find a better way to resolve it.
connectors/huggingface_connecter.py
Outdated
    # Convert all booleans and lists to strings
    for entry in data:
two potential issues here:
- This seems like it would be extremely slow (and possibly crash) for large huggingface datasets.
- I'm a bit worried about assuming this is how people would want to handle these fields, but I guess they can edit it themselves if they want it done differently...
connectors/huggingface_connecter.py
Outdated
    # Processes dataset entries
    data = []
    for split in dataset.keys():
same question here about large datasets - are we sure this will not break?
Could I solve this with batch processing and by passing streaming=True through load_dataset? Aaron mentioned that, so it seems like that could prevent the issue.
The primary reason to avoid pure-random IDs or hash-based IDs is that they cause worst-case performance when used as keys in ordered data structures (such as b-tree indexes in a database). ULIDs improve on this by making the beginning of the ID a timestamp, so that IDs created around the same time have some locality to each other, but, like UUIDs, they are still kinda big.

Big (semi)random IDs like ULID are best used when you need uniqueness while also avoiding coordination, e.g. you have multiple processes inserting data into something and it would add a lot of complexity to make them cooperate to assign non-overlapping IDs - but in situations where you can use purely sequential IDs it is usually better to, as smaller IDs are cheaper to store and look up.

When using (Line 77 in 1f042be)
I believe this will not currently work when the dataset size exceeds available RAM on the machine running this - HF datasets understands slice syntax when specifying a split so you can test with portions of a very large dataset.

Making it work should be possible by working in chunks and using IterableDatasets: https://huggingface.co/docs/datasets/v2.20.0/en/about_mapstyle_vs_iterable#downloading-and-streaming

Here is a notebook where I'm uploading from an IterableDataset in chunks (note, though, that because I call load_dataset and then to_iterable_dataset this still downloads the entire dataset - you can also pass
Going off @apage43's comment, I feel strongly that we should be taking advantage of Hugging Face datasets' use of Arrow to pass data to Atlas, which also speaks fluent Arrow. We should also be taking advantage of batching or chunking for arbitrarily large datasets. Using base Python iterators means this will break for larger datasets.
…ssing and getting config without parsing through error message
connectors/huggingface_connector.py
Outdated
    try:
        # Loads dataset without specifying config
        dataset = load_dataset(dataset_identifier)
    except ValueError as e:
        # Grabs available configs and loads dataset using it
        configs = get_dataset_split_names(dataset_identifier)
        config = configs[0]
        dataset = load_dataset(dataset_identifier, config, trust_remote_code=True, streaming=True, split=config + "[:100000]")
I would, instead of this, just make the split (and max length) optional arguments to this function (and in turn, to the top level hf_atlasdataset as well) instead of silently checking for splits and grabbing the first one which may not be the one that a user intends - for example, on wikimedia/wikipedia which is split by language, that would be Abkhazian wikipedia, since ab is alphabetically first
also note that streaming=True makes load_dataset return an IterableDataset, which has slightly different behaviors than a normal Dataset - you should probably always use streaming if you're going to use it; otherwise you'd need to make sure to test both cases.
Also, the row limit should probably be an optional argument as well rather than hardcoded - someone may want to upload a dataset of more or fewer than 100k rows. While using split='splitname[:1000]' does work, it's also not the only way - the .take method on a dataset will return a new dataset with only that many rows (and on IterableDatasets will do the right thing and only fetch that many): dataset = dataset.take(1000) - this is probably more sensible for exposing the limit as an argument.
Another issue is that slicing or using .take will get the beginning of the dataset. Often, if you want to map a sample of a dataset (because you want to quickly get a picture of what's in it without spending the time/compute to map the whole thing), you want a random sample. For example, the wikipedia dataset is also in alphabetical order by title, so articles near the beginning will just be ones with titles starting with A - which probably won't give a very representative map of the whole dataset.
Datasets also have a .shuffle method which works similarly to .take and should be applied before .take. E.g. to get 1000 random rows from a dataset you want dataset = dataset.shuffle().take(1000) - it probably makes sense to use this any time a limit is specified, but it's not needed if the whole dataset will be uploaded.
examples/HF_example_usage.py
Outdated
    import logging

    if __name__ == "__main__":
        dataset_identifier = input("Enter Hugging Face dataset identifier: ").strip()
how about instead of making this an interactive script, use argparse: https://docs.python.org/3.10/library/argparse.html?highlight=argparse#module-argparse
that way it's easier to handle optional args like split and limit
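A minimal argparse version of the entry point might look like this sketch (the optional `--split` and `--limit` flags are the arguments discussed elsewhere in this thread, not the script's current interface):

```python
import argparse

def parse_args(argv=None):
    """Parse CLI arguments; argv is injectable for testing."""
    parser = argparse.ArgumentParser(
        description='Create an AtlasDataset from a Hugging Face dataset.')
    parser.add_argument('--dataset_identifier', required=True,
                        help='Hugging Face dataset identifier, e.g. allenai/quartz')
    parser.add_argument('--split', default='train',
                        help='dataset split to load')
    parser.add_argument('--limit', type=int, default=None,
                        help='optional cap on the number of rows to upload')
    return parser.parse_args(argv)
```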
connectors/huggingface_connector.py
Outdated
    # Convert the data list to an Arrow table
    table = pa.Table.from_pandas(pd.DataFrame(data))
you could probably do pa.Table.from_pylist(data) instead and avoid roundtripping through pandas here
connectors/huggingface_connector.py
Outdated
    def process_table(table):
        # Converts columns with complex types to strings
        for col in table.schema.names:
            column = table[col].to_pandas()
pyarrow.compute.cast may be able to handle some of this without having to go through pandas/pure python
https://arrow.apache.org/docs/python/generated/pyarrow.compute.cast.html
My code seems to throw an AttributeError without it. There are some cases where it works, but with some, like this one, it doesn't.
https://huggingface.co/datasets/Anthropic/hh-rlhf
can leave this as is for now then
connectors/huggingface_connector.py
Outdated
    # Adds data to the AtlasDataset
    dataset.add_data(data=processed_table.to_pandas().to_dict(orient='records'))
add_data accepts arrow tables directly, no need for this conversion (add_data will have to convert it back into an arrow table before uploading if you do this)
The current version of this has no create_index calls, so it'll only create an AtlasDataset with data in it but no map - is that intended?
connectors/huggingface_connector.py
Outdated
    # Gets data from HF dataset
    def get_hfdata(dataset_identifier, split="train", limit=100000):
        try:
Add docstring to function definition
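For instance (the wording, and the described return type, are just a suggestion based on the surrounding snippets):

```python
def get_hfdata(dataset_identifier, split="train", limit=100000):
    """Stream records from a Hugging Face dataset.

    Args:
        dataset_identifier: Hugging Face dataset id, e.g. 'allenai/quartz'.
        split: Dataset split to load (default 'train').
        limit: Maximum number of rows to fetch.

    Returns:
        A list of row dicts ready to be converted to an Arrow table.
    """
    ...
```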
connectors/huggingface_connector.py
Outdated
    # Load the dataset
    dataset = load_dataset(dataset_identifier, split=split, streaming=True)
Move load_dataset outside of try/except
connectors/huggingface_connector.py
Outdated
        # Load the dataset
        dataset = load_dataset(dataset_identifier, split=split, streaming=True)
    except ValueError as e:
We don't need to handle this error, seems like handling config through the error message is too risky
So you can take out the try/except here
connectors/huggingface_connector.py
Outdated
    # Processes dataset entries using Arrow
    id_counter = 0
    data = []
    if dataset:
if no dataset exists, function should fail
    column = table[col]
    if pa.types.is_boolean(column.type):
        table = table.set_column(table.schema.get_field_index(col), col, pc.cast(column, pa.string()))
    elif pa.types.is_list(column.type):
I think if you flatten the column as a list then cast as a string, the column will be too long and not match up with the other columns of the table. We may need to refactor this slightly by making sure column length is the same and structs are handled on a row by row basis
examples/HF_example_usage.py
Outdated
    if __name__ == "__main__":
        parser = argparse.ArgumentParser(description='Create an AtlasDataset from a Hugging Face dataset.')
        parser.add_argument('--dataset_identifier', type=str, required=True, help='The Hugging Face dataset identifier')
you may want to explicitly specify in the arg that it's a hf_dataset_identifier because atlas has its own datasets
❌ Changes requested. Incremental review on 9ae14f4 in 1 minute and 4 seconds
More details
- Looked at 140 lines of code in 2 files
- Skipped 0 files when reviewing
- Skipped posting 0 drafted comments based on config settings

Workflow ID: wflow_5mSlb7zHjU1nAKlz
    # Gets data from HF dataset
    def get_hfdata(dataset_identifier, split="train", limit=100000):
        splits = get_dataset_split_names(dataset_identifier)
        dataset = load_dataset(dataset_identifier, split=split, streaming=True)
The current implementation does not handle the case where the specified split is not available in the dataset. Previously, there was a mechanism to check available splits and use an alternative if the specified one was not found. Consider reintroducing this functionality to avoid runtime errors.
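A lightweight way to reintroduce that check might be a pure helper like this sketch; in the connector, `available` would come from the `get_dataset_split_names(dataset_identifier)` call already shown above:

```python
def resolve_split(requested, available):
    """Validate a requested split against the splits a dataset actually has."""
    if requested not in available:
        # Fail early with the options listed instead of a cryptic runtime error
        raise ValueError(
            f"Split {requested!r} not found; available splits: {available}")
    return requested
```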
HF file takes in any Huggingface identifier and then returns an AtlasDataset
Updates:
Testing:
Limitations:
Summary:
Introduced a new connector for Hugging Face datasets, processed data using Apache Arrow, and provided an example usage script.
Key points:
- connectors/huggingface_connector.py
- connectors/huggingface_connector.get_hfdata to load datasets and handle configuration issues
- connectors/huggingface_connector.hf_atlasdataset to create an AtlasDataset
- connectors/huggingface_connector.convert_to_string and connectors/huggingface_connector.process_table
- connectors/__init__.py and examples/HF_example_usage.py
- add_data accepts arrow tables directly

Generated with ❤️ by ellipsis.dev